Exploring K-Nearest Neighbors

KNN in Python

Next, we will see how to implement this model in Python using the scikit-learn library.

KNN for classification

To illustrate KNN for classification, we will use the same dataset as in the previous modules. It has already been standardized, because a distance-based model such as KNN requires features on a comparable scale.

Step 1. Reading the processed dataset

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

total_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_iris.csv")

X = total_data.drop("specie", axis = 1)
y = total_data["specie"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

X_train.head()
Out[1]:
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
22          -1.506521          1.249201          -1.567576         -1.315444
15          -0.173674          3.090775          -1.283389         -1.052180
65           1.038005          0.098217           0.364896          0.264142
11          -1.264185          0.788808          -1.226552         -1.315444
42          -1.748856          0.328414          -1.397064         -1.315444

The train set will be used to train the model, while the test set will be used to evaluate the effectiveness of the model.
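As a quick sanity check on the split, the sketch below reproduces the same 80/20 partition and prints the resulting sizes. It assumes that scikit-learn's bundled Iris data (`load_iris`) can stand in for the lesson's CSV, so no download is needed; the feature values are therefore the raw measurements, not the standardized ones shown above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris data stands in for the lesson's clean_iris.csv
X, y = load_iris(return_X_y=True)

# Same parameters as in the lesson: 20% of the rows are held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("train:", X_train.shape, "test:", X_test.shape)
```

With 150 rows in total, 120 go to training and 30 to testing. Fixing `random_state` makes the partition reproducible between runs.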

KNN classifies a point according to its distance to other points, so the features need to be on a common scale. If they are not, the features with larger magnitudes dominate the distance and distort the result of the algorithm.

For example, suppose we have two features: age (with values between 0 and 100) and annual income (with values between 0 and 100,000). Because of the difference in scale, annual income would have a disproportionate impact on the distance, and differences in age would be almost ignored.
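The effect can be checked with a few lines of arithmetic. The sketch below uses three hypothetical people (the values are made up for illustration) and compares Euclidean distances before and after dividing each feature by the range quoted above.

```python
import math

# Hypothetical people described as (age in years, annual income)
person_a = (25, 50_000)
person_b = (60, 50_500)  # very different age, almost the same income
person_c = (26, 52_000)  # almost the same age, slightly different income


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def rescale(p):
    """Divide each feature by its assumed range (age 0-100, income 0-100,000)."""
    return (p[0] / 100, p[1] / 100_000)


# On the raw values, income differences swamp age differences
raw_ab = euclidean(person_a, person_b)
raw_ac = euclidean(person_a, person_c)

# After rescaling, both features contribute on a comparable scale
scaled_ab = euclidean(rescale(person_a), rescale(person_b))
scaled_ac = euclidean(rescale(person_a), rescale(person_c))

print(f"raw:    d(a, b) = {raw_ab:.2f}   d(a, c) = {raw_ac:.2f}")
print(f"scaled: d(a, b) = {scaled_ab:.4f}   d(a, c) = {scaled_ac:.4f}")
```

On the raw values, the 35-year age gap between `person_a` and `person_b` counts for less than the 2,000 income gap to `person_c`, so `person_b` appears to be the nearer neighbor. After rescaling, the order is reversed and `person_c` is the nearer one.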

Scaling the data helps all features contribute comparably to the distance, which can improve the performance of the KNN algorithm. The two usual options are standardization (subtracting the mean and dividing by the standard deviation) and Min-Max normalization (rescaling each feature to a fixed range, typically 0 to 1). The choice between them depends on the behavior of the variables and on how each option affects model performance. As a rule of thumb, Min-Max works well when the features have known bounds and few outliers, while standardization is usually more suitable when the features contain outliers or are roughly normally distributed.
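The two scaling options can be compared side by side with scikit-learn's `StandardScaler` and `MinMaxScaler`. This is only a minimal sketch: the toy age and income matrix is hypothetical, and it is not the preprocessing used to produce the lesson's dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: columns are age and annual income
X = np.array(
    [[20, 20_000], [35, 45_000], [50, 60_000], [65, 95_000]],
    dtype=float,
)

# Standardization: each column ends up with mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)

# Min-Max normalization: each column is rescaled to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

print("standardized mean:", X_std.mean(axis=0), "std:", X_std.std(axis=0))
print("min-max min:", X_minmax.min(axis=0), "max:", X_minmax.max(axis=0))
```

In a real workflow, the scaler would normally be fitted on the training set only and then applied to the test set with `transform`, so that no information from the test data leaks into training.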

Next, we will visualize the relationship between the variables in the dataset. We have chosen three of the four features as axes, since a scatter plot cannot show more than three spatial dimensions:

In [2]:
# We add the species name for the plot

total_data["specie"] = total_data["specie"].map({0: "setosa", 1: "versicolor", 2: "virginica"})
In [3]:
import plotly.express as px

fig = px.scatter_3d(total_data, x = "petal width (cm)", y = "petal length (cm)", z = "sepal width (cm)", color = "specie", width = 1000, height = 500,
                    size = total_data["petal length (cm)"].abs(), color_discrete_sequence=["#E58139", "#39E581", "#8139E5"])
camera = dict(
    up = dict(x = 1, y = 3.5, z = 0),
    eye = dict(x = 2, y = 0, z = 0)
)

fig.update_layout(scene_camera = camera)
fig.show()