
Exploring K-Nearest Neighbors

KNN in Python

Next, we will see how to implement this model in Python. To do so, we will use the scikit-learn library.

KNN for classification

To illustrate the implementation of KNN for classification, we will use the same dataset as in the previous modules, which has already been normalized, since this type of model requires it.

Step 1. Reading the processed dataset

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

total_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_iris.csv")

X = total_data.drop("specie", axis = 1)
y = total_data["specie"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

X_train.head()
Out[1]:
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
22          -1.506521          1.249201          -1.567576         -1.315444
15          -0.173674          3.090775          -1.283389         -1.052180
65           1.038005          0.098217           0.364896          0.264142
11          -1.264185          0.788808          -1.226552         -1.315444
42          -1.748856          0.328414          -1.397064         -1.315444

The train set will be used to train the model, while the test set will be used to evaluate the effectiveness of the model.

To ensure that the model works correctly and to avoid errors, standardizing the data is necessary: if the features are not on the same scale, those with larger magnitudes can dominate the distance calculation and distort the algorithm's result.

For example, if we have two features, age (with values between 0 and 100) and annual income (with values between 0 and 100,000), the difference in scale could give annual income a disproportionate weight in the distance and effectively ignore the contribution of age.

Standardizing the data helps all features contribute equally to the distance, which can improve the performance of the KNN algorithm. The choice between standardization (z-score) and Min-Max scaling depends on the behavior of the variables and how they affect model performance: if the features have very different scales and ranges, Min-Max scaling is the better alternative; if they are already on the same or similar scales, standardization is the more appropriate choice.
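As an illustration (not part of the original notebook), this is roughly how both options could be applied with scikit-learn's StandardScaler and MinMaxScaler; the age/income data below is the hypothetical example from the previous paragraph:

import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical two-feature example with very different scales
df = pd.DataFrame({"age": [25, 40, 60], "income": [20000, 55000, 90000]})

# Z-score standardization: mean 0, standard deviation 1 per feature
standardized = pd.DataFrame(StandardScaler().fit_transform(df), columns = df.columns)

# Min-Max scaling: each feature rescaled to the [0, 1] range
min_max_scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns = df.columns)

print(standardized)
print(min_max_scaled)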

Next, we will visualize the relationship between the variables in the dataset (we have chosen three of them to build a 3D plot, since we cannot visualize more than three dimensions at once):

In [2]:
# We add the species name for the plot

total_data["specie"] = total_data["specie"].map({0: "setosa", 1: "versicolor", 2: "virginica"})
In [3]:
import plotly.express as px

fig = px.scatter_3d(total_data, x = "petal width (cm)", y = "petal length (cm)", z = "sepal width (cm)", color = "specie", width = 1000, height = 500,
                    size = total_data["petal length (cm)"].abs(), color_discrete_sequence=["#E58139", "#39E581", "#8139E5"])
camera = dict(
    up = dict(x = 1, y = 3.5, z = 0),
    eye = dict(x = 2, y = 0, z = 0)
)

fig.update_layout(scene_camera = camera)
fig.show()

The 3D plot lets us analyze how the classes separate and distribute when the three variables are combined. The classes appear clearly separated, so given a new point and its predictor values, a KNN model should be able to classify it with good accuracy.

For more detail, we can also plot the pairwise relationships between the variables as scatter plots (this would normally be done during the EDA):

In [4]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, axis = plt.subplots(2, 3, figsize = (15, 7))

palette = ["#E58139", "#39E581", "#8139E5"]
sns.scatterplot(ax = axis[0, 0], data = total_data, x = "sepal length (cm)", y = "sepal width (cm)", hue = "specie", palette = palette)
sns.scatterplot(ax = axis[0, 1], data = total_data, x = "sepal length (cm)", y = "petal length (cm)", hue = "specie", palette = palette)
sns.scatterplot(ax = axis[0, 2], data = total_data, x = "sepal length (cm)", y = "petal width (cm)", hue = "specie", palette = palette)
sns.scatterplot(ax = axis[1, 0], data = total_data, x = "sepal width (cm)", y = "petal length (cm)", hue = "specie", palette = palette)
sns.scatterplot(ax = axis[1, 1], data = total_data, x = "sepal width (cm)", y = "petal width (cm)", hue = "specie", palette = palette)
sns.scatterplot(ax = axis[1, 2], data = total_data, x = "petal length (cm)", y = "petal width (cm)", hue = "specie", palette = palette)

plt.tight_layout()

plt.show()
(Figure: pairwise scatter plots of the four features, colored by species.)

Comparing the predictors in pairs makes the separation between class values even clearer. This again confirms that a KNN model is well suited to this problem.

Step 2: Initialization and training of the model

In [5]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
model.fit(X_train, y_train)
Out[5]:
KNeighborsClassifier()

The training time of a model will depend, first of all, on the size of the dataset (instances and features), and also on the model type and its configuration.
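By default, KNeighborsClassifier uses 5 neighbors with uniform weights and the Euclidean distance. As a hedged sketch (the values below are illustrative, not tuned for this dataset), the main hyperparameters could be set explicitly like this:

from sklearn.neighbors import KNeighborsClassifier

# Illustrative configuration; these values are not tuned for this dataset
custom_model = KNeighborsClassifier(
    n_neighbors = 7,       # number of neighbors that vote on each prediction
    weights = "distance",  # closer neighbors weigh more than distant ones
    metric = "minkowski",  # with p = 2 this is the Euclidean distance
    p = 2
)
custom_model.fit(X_train, y_train)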

Step 3: Model prediction

Once the model has been trained, it can be used to predict with the test data set.

In [6]:
y_pred = model.predict(X_test)
y_pred
Out[6]:
array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 0, 0])

With the raw predictions alone it is very difficult to know whether the model is getting it right. To find out, we must compare them with the ground truth. There are many metrics for measuring a model's predictive effectiveness, including accuracy, which is the fraction of predictions the model gets right.

In [7]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)
Out[7]:
1.0

The model is perfect!
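Since accuracy alone can hide class-level errors, an optional extra check (not part of the original notebook) is the confusion matrix:

from sklearn.metrics import confusion_matrix

# Rows are true classes and columns are predicted classes;
# a perfect classifier only has non-zero values on the diagonal
print(confusion_matrix(y_test, y_pred))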

Step 4: Saving the model

Once we have the model we were looking for (presumably after hyperparameter optimization), in order to use it in the future we need to save it to disk, together with the seed (random_state) used, so that the experiment can be reproduced.

In [8]:
from pickle import dump

dump(model, open("knn_classifier_default.sav", "wb"))

Giving the model file a descriptive name is vital: if we ever lose the code that generated it, the name still tells us what configuration it has (in this case we say "default" because we have not customized any of the model's hyperparameters; we kept the ones the function uses by default).
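If we later need to reuse the stored model, it can be loaded back with pickle. A minimal sketch, assuming the file name used above:

from pickle import load

# Deserialize the saved model and use it for new predictions
loaded_model = load(open("knn_classifier_default.sav", "rb"))
print(loaded_model.predict(X_test[:5]))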

KNN for regression

To illustrate the implementation of KNN for regression, we will generate a synthetic dataset that meets our needs.

Step 1. Generating the dataset

In [9]:
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples = 1000, n_features = 4, noise = 1, random_state = 42)
X = pd.DataFrame(X, columns = ["Var1", "Var2", "Var3", "Var4"])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

X_train.head()
Out[9]:
          Var1      Var2      Var3      Var4
29   -0.518270  0.357113  1.477894 -0.219672
535   0.457687 -2.120700 -0.606865 -2.238231
695  -0.224633  0.940771 -0.982487 -0.989628
557   0.360648 -0.320298  1.643378 -2.077812
836  -0.307962 -0.144519 -0.792420 -0.675178

The train set will be used to train the model, while the test set will be used to evaluate its effectiveness. We have also separated the predictors (X) from the target (y).
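KNN for regression is also distance-based, so it is worth confirming that the generated features are on a comparable scale. As a quick optional check (not part of the original notebook): make_regression samples its features from a standard normal distribution, so each column should have a mean close to 0 and a standard deviation close to 1:

# Means near 0 and standard deviations near 1 indicate the features
# are already on a comparable scale
print(X.describe().loc[["mean", "std"]].round(3))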

In regression, it is also necessary to standardize the data; in this case, as the check above suggests, the features generated by make_regression are already on a standard scale. As we did before, we will again draw the 3D plot and the pairwise relationships between the features of the artificially generated dataset:

In [10]:
import plotly.express as px

total_data = X.copy()
total_data["target"] = y

fig = px.scatter_3d(total_data, x = "Var1", y = "Var2", z = "Var3", color = "target", width = 1000, height = 500,
                    size = total_data["Var4"].abs())
camera = dict(
    up = dict(x = 1, y = 3.5, z = 0),
    eye = dict(x = 2, y = 0, z = 0)
)

fig.update_layout(scene_camera = camera)
fig.show()
In [11]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, axis = plt.subplots(2, 3, figsize = (15, 7))

palette = sns.color_palette("gnuplot2_r", as_cmap=True)
sns.scatterplot(ax = axis[0, 0], data = total_data, x = "Var1", y = "Var2", hue = "target", palette = palette)
sns.scatterplot(ax = axis[0, 1], data = total_data, x = "Var1", y = "Var3", hue = "target", palette = palette)
sns.scatterplot(ax = axis[0, 2], data = total_data, x = "Var1", y = "Var4", hue = "target", palette = palette)
sns.scatterplot(ax = axis[1, 0], data = total_data, x = "Var2", y = "Var3", hue = "target", palette = palette)
sns.scatterplot(ax = axis[1, 1], data = total_data, x = "Var2", y = "Var4", hue = "target", palette = palette)
sns.scatterplot(ax = axis[1, 2], data = total_data, x = "Var3", y = "Var4", hue = "target", palette = palette)

plt.tight_layout()

plt.show()
(Figure: pairwise scatter plots of the four generated features, colored by target value.)

We can see that most variable pairs show a clear gradient in the target, so a regression model should be able to yield good results.

Step 2: Initialization and training of the model

In [12]:
from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor()
model.fit(X_train, y_train)
Out[12]:
KNeighborsRegressor()

Step 3: Model prediction

Once the model has been trained, it can be used to predict with the test data set.

In [13]:
y_pred = model.predict(X_test)
y_pred
Out[13]:
array([-147.41721871,   12.7128117 ,  -36.12302539,  -21.93648933,
       -144.32582093,   -4.62694737,   25.33274569,  -32.72285068,
         58.56266885,  155.66838297,  -91.57904794, -209.25552065,
       -125.0049947 ,  -40.94453209,   77.06000418,   -2.31024234,
         89.23243529,  -87.10430605,  -12.75608929, -107.40101528,
         59.5285574 ,  -73.2172528 ,   12.89565   , -152.20708656,
         68.31889013, -150.10454511,  -34.63535979,  -46.82258216,
        137.590877  ,   84.72199868, -115.13576847,  -23.02387497,
         99.53383414,  221.35135356,   26.10949006,  -78.30179293,
       -153.46745873,  155.37084184,   97.19477898,  156.55504796,
         33.52863468,  -53.06285465,   -2.71200404,  138.92351044,
        -28.8475082 ,  -54.68111873,    0.39307033,   14.64850846,
        -18.68501556,   82.45571704,  171.35059264,  -39.28831728,
         50.92959341, -140.09104896,   55.2744179 ,  -45.23494761,
        -55.32923476,   40.641162  ,   13.65687921,   79.13259557,
         24.21322023,  205.77188903,  162.12907189, -214.64288553,
        -48.12750922,   57.84638636,  -48.11409853,  132.86152958,
         78.94155338,  173.47253872,  -63.43007483,  -56.28153853,
        -20.57233102,  -41.89870694,   27.50335358,   56.21961522,
        -67.01690518,   65.86291158,   35.82856198,   93.27327307,
       -135.42551895,  -23.28890253,  100.60637024, -169.75802585,
         16.80237377, -145.92720763,    7.65154432,   70.31085359,
       -138.08858909,  -20.97488562, -202.84423981, -186.29010982,
        118.36374156,  -46.46130478,    3.04367133,   20.90597603,
         72.59324354,  160.86426614,   -2.64495043,  -69.61344901,
         55.47044742,  125.32284865,    6.14957185, -135.19505319,
         33.55728658,   87.92606067,  106.99087434,  164.64474228,
         59.48733257,  -83.81266187, -122.13094227,    0.53719761,
        236.86271787,  -67.3417169 ,   25.8694997 , -174.7047103 ,
        -80.57484935, -135.11344691,  224.10044744,  -66.70764275,
          1.56024283, -105.47176455,   52.03445355,   95.74231157,
         16.19497672,   39.93612672,  -44.40873841,  -27.81008649,
       -177.51227362,   56.16766096,  206.66743023,   20.80449655,
         52.2654225 ,  166.1518485 ,  -53.23874069,   21.70964561,
         34.02812113,    4.15932292,    7.3331871 ,  -50.75381914,
       -213.20370255,   15.77528335,  116.86174989,  196.84267769,
         26.6297435 , -125.17028769,  -79.75376986,   -2.84775957,
        -75.91484381, -183.30498253,  125.48685399,  -97.87501807,
        -50.80763193,  -54.4973921 ,   63.03342149,   23.2467635 ,
        -20.74887764,   75.23614017,   37.61678359,  -81.40887724,
       -130.38933037,  191.29075295, -124.56063375, -108.24640512,
         -0.58051144,   20.05594007,   -9.4429406 ,  -76.12710791,
         90.35499728,   85.84797897,   14.17483093,  -16.96825839,
        -14.64975853, -107.57040037,   96.11298658,  -47.20140352,
          5.71159723,  -59.80612262,  101.68961644, -151.40241857,
         39.95250414, -136.80702629, -155.86579813,  102.57850137,
        108.23524256,   21.45005089, -216.76995688,   92.26877057,
        -48.18787188,   46.04898414,   76.48356855,  -72.32331055,
       -207.22300555,  104.51122023,   62.90183088, -165.52273551,
         64.05278101,   98.47222997,  121.36946183,   13.74458948])

To measure the effectiveness of the model, we will use the mean squared error (MSE) and the coefficient of determination (R²):

In [14]:
from sklearn.metrics import mean_squared_error, r2_score

print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred)}")
print(f"Coefficient of determination: {r2_score(y_test, y_pred)}")
Mean Squared Error: 564.8367646867368
Coefficient of determination: 0.9547914514799766

With a coefficient of determination of roughly 0.95, the model explains most of the variance in the target and is very close to perfect.
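To express the error in the same units as the target, we could also report the root mean squared error (an optional addition, not part of the original notebook):

import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE is the square root of the MSE, in the same units as the target
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Root Mean Squared Error: {rmse}")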

Step 4: Saving the model

Once we have the model we were looking for (presumably after hyperparameter optimization), in order to use it in the future we need to save it to disk, together with the seed (random_state) used, so that the experiment can be reproduced.

In [15]:
dump(model, open("knn_regressor_default.sav", "wb"))

Giving the model file a descriptive name is vital: if we ever lose the code that generated it, the name still tells us what configuration it has (in this case we say "default" because we have not customized any of the model's hyperparameters; we kept the ones the function uses by default).