Next we will see how we can implement this model in Python. To do this, we will use the scikit-learn
library.
import pandas as pd
from sklearn.model_selection import train_test_split
# Load the cleaned iris dataset
total_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_iris.csv")
# Separate the predictors from the target
X = total_data.drop("specie", axis = 1)
y = total_data["specie"]
# Hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
X_train.head()
The train set will be used to train the model, while the test set will be used to evaluate the effectiveness of the model.
To ensure that the model works correctly and to avoid misleading results, standardizing the data is necessary: KNN is a distance-based algorithm, so if the features are not on the same scale, those with larger magnitudes can dominate the distance and distort the result.
For example, if we have two features, age (with values between 0 and 100) and annual income (with values between 0 and 100,000), the difference in scale between the two variables could cause annual income to have a disproportionate impact on the distance, effectively ignoring the importance of age.
Standardizing the data helps all features contribute equally to the distance, which can improve the performance of the KNN algorithm. The choice between standardization (rescaling to zero mean and unit variance) and Min-Max scaling (rescaling to a fixed range such as [0, 1]) will depend on the behavior of the variables and how they affect model performance: if the features have very different scales and ranges, Min-Max scaling is a good alternative; if they are already on the same or a similar scale, standardization is usually the most appropriate.
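As a minimal sketch (assuming the X_train/X_test split above, and not part of the original pipeline), either transformation can be applied with scikit-learn, fitting the scaler on the training set only so that no information leaks from the test set:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Standardization: zero mean and unit variance per feature
std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform(X_train)  # fit only on the training data
X_test_std = std_scaler.transform(X_test)  # reuse the training statistics
# Min-Max scaling: rescale each feature to the [0, 1] range
minmax_scaler = MinMaxScaler()
X_train_minmax = minmax_scaler.fit_transform(X_train)
X_test_minmax = minmax_scaler.transform(X_test)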
Next, we will visualize the relationship between the variables in the dataset (we have chosen three of them to build a 3D plot, since we cannot represent more than three dimensions at once):
# We add the species name for the plot
total_data["specie"] = total_data["specie"].map({0: "setosa", 1: "versicolor", 2: "virginica"})
import plotly.express as px
fig = px.scatter_3d(total_data, x = "petal width (cm)", y = "petal length (cm)", z = "sepal width (cm)", color = "specie", width = 1000, height = 500,
size = total_data["petal length (cm)"].abs(), color_discrete_sequence=["#E58139", "#39E581", "#8139E5"])
camera = dict(
up = dict(x = 1, y = 3.5, z = 0),
eye = dict(x = 2, y = 0, z = 0)
)
fig.update_layout(scene_camera = camera)
fig.show()
The 3D plot allows us to analyze the separation and distribution of the classes as a function of the combination of the three variables. The classes are clearly distinguishable, so given a new point described by these predictors, a KNN model should be able to classify it with good accuracy.
For more detail, we can also plot the pairwise relationships between the variables (this would normally be done as part of the EDA):
import matplotlib.pyplot as plt
import seaborn as sns
fig, axis = plt.subplots(2, 3, figsize = (15, 7))
palette = ["#E58139", "#39E581", "#8139E5"]
sns.scatterplot(ax = axis[0, 0], data = total_data, x = "sepal length (cm)", y = "sepal width (cm)", hue = "specie", palette = palette)
sns.scatterplot(ax = axis[0, 1], data = total_data, x = "sepal length (cm)", y = "petal length (cm)", hue = "specie", palette = palette)
sns.scatterplot(ax = axis[0, 2], data = total_data, x = "sepal length (cm)", y = "petal width (cm)", hue = "specie", palette = palette)
sns.scatterplot(ax = axis[1, 0], data = total_data, x = "sepal width (cm)", y = "petal length (cm)", hue = "specie", palette = palette)
sns.scatterplot(ax = axis[1, 1], data = total_data, x = "sepal width (cm)", y = "petal width (cm)", hue = "specie", palette = palette)
sns.scatterplot(ax = axis[1, 2], data = total_data, x = "petal length (cm)", y = "petal width (cm)", hue = "specie", palette = palette)
plt.tight_layout()
plt.show()
Comparing the predictors in pairs (which makes the patterns more graphical and explicit), the separation between the class values is even easier to see. This confirms that a KNN model is well suited to solving this problem.
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(X_train, y_train)
The training time of a model depends, first of all, on the size of the dataset (instances and features), and also on the model type and its configuration. In the particular case of KNN, fitting is almost instantaneous, since the algorithm essentially stores the training instances and defers the real work (computing distances to the stored points) to prediction time.
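As an illustrative sketch (not part of the original flow), we can time both phases to see this behavior in practice:
import time
start = time.perf_counter()
model.fit(X_train, y_train)  # "training" essentially stores the data
fit_time = time.perf_counter() - start
start = time.perf_counter()
model.predict(X_test)  # the distance computations happen here
predict_time = time.perf_counter() - start
print(f"Fit time: {fit_time:.6f} s, predict time: {predict_time:.6f} s")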
Once the model has been trained, it can be used to predict with the test data set.
y_pred = model.predict(X_test)
y_pred
With the raw predictions alone, it is very difficult to know whether the model is getting it right. To find out, we must compare them with the actual values. There are many metrics for measuring how effectively a model predicts, including accuracy, which is the fraction of predictions that the model gets right.
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
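Since accuracy is just the fraction of matching labels, the same value can be obtained directly as a quick sanity check (using the arrays above):
import numpy as np
# Fraction of test samples whose predicted label matches the true label
np.mean(y_pred == y_test.to_numpy())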
The model is perfect!
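Even though the default configuration already performs well here, in a real project the number of neighbors and the other hyperparameters would normally be tuned. A minimal sketch using scikit-learn's GridSearchCV (the parameter grid is chosen purely for illustration):
from sklearn.model_selection import GridSearchCV
# Illustrative grid: number of neighbors and weighting scheme
param_grid = {
    "n_neighbors": [3, 5, 7, 9, 11],
    "weights": ["uniform", "distance"]
}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv = 5, scoring = "accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)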
Once we have the model we were looking for (presumably after hyperparameter optimization), to be able to use it in the future, it is necessary to store it in our directory, together with the seed.
from pickle import dump
dump(model, open("knn_classifier_default.sav", "wb"))
Giving the model an explanatory name is vital, since if we lose the code that generated it, we will still know what configuration it has from the filename (in this case we say default because we have not customized any of the model's hyperparameters; we have kept the ones the function uses by default).
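To reuse the stored model later (for example, in another script or session), it only needs to be loaded back from disk. A minimal sketch with pickle, assuming the file saved above:
from pickle import load
# Load the serialized model and use it exactly like the original object
loaded_model = load(open("knn_classifier_default.sav", "rb"))
loaded_model.predict(X_test)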
To exemplify the implementation of a KNN regression model, we will generate a synthetic dataset that fits our needs.
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
# Generate a synthetic regression problem with 1000 samples and 4 features
X, y = make_regression(n_samples = 1000, n_features = 4, noise = 1, random_state = 42)
X = pd.DataFrame(X, columns = ["Var1", "Var2", "Var3", "Var4"])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
X_train.head()
The train set will be used to train the model, while the test set will be used to evaluate its effectiveness. The predictors have also been kept separate from the target variable.
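Since KNN regression is also distance-based, it is worth taking a quick look at the scale of the generated features (make_regression draws them from a standard normal distribution, so their mean and standard deviation should be close to 0 and 1):
# Per-feature mean and standard deviation of the training predictors
X_train.describe().loc[["mean", "std"]]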
In regression, it is also necessary to standardize the data; in this case, as the statistics above show, the features are already (approximately) standardized, so no additional scaling is required. As we did before, we will draw the 3D graph and the pairwise relationships between the features of the artificially generated dataset:
import plotly.express as px
total_data = X.copy()
total_data["target"] = y
fig = px.scatter_3d(total_data, x = "Var1", y = "Var2", z = "Var3", color = "target", width = 1000, height = 500,
size = total_data["Var4"].abs())
camera = dict(
up = dict(x = 1, y = 3.5, z = 0),
eye = dict(x = 2, y = 0, z = 0)
)
fig.update_layout(scene_camera = camera)
fig.show()
import matplotlib.pyplot as plt
import seaborn as sns
fig, axis = plt.subplots(2, 3, figsize = (15, 7))
palette = sns.color_palette("gnuplot2_r", as_cmap=True)
sns.scatterplot(ax = axis[0, 0], data = total_data, x = "Var1", y = "Var2", hue = "target", palette = palette)
sns.scatterplot(ax = axis[0, 1], data = total_data, x = "Var1", y = "Var3", hue = "target", palette = palette)
sns.scatterplot(ax = axis[0, 2], data = total_data, x = "Var1", y = "Var4", hue = "target", palette = palette)
sns.scatterplot(ax = axis[1, 0], data = total_data, x = "Var2", y = "Var3", hue = "target", palette = palette)
sns.scatterplot(ax = axis[1, 1], data = total_data, x = "Var2", y = "Var4", hue = "target", palette = palette)
sns.scatterplot(ax = axis[1, 2], data = total_data, x = "Var3", y = "Var4", hue = "target", palette = palette)
plt.tight_layout()
plt.show()
We can see that for most of the variables a certain differentiating pattern emerges, so the regression can be expected to yield good results.
from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor()
model.fit(X_train, y_train)
Once the model has been trained, it can be used to predict with the test data set.
y_pred = model.predict(X_test)
y_pred
To evaluate the effectiveness of the model, we will use the mean squared error (MSE) and the coefficient of determination (R²):
from sklearn.metrics import mean_squared_error, r2_score
print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred)}")
print(f"Coefficient of determination: {r2_score(y_test, y_pred)}")
The model is very close to perfect.
Once we have the model we were looking for (presumably after hyperparameter optimization), to be able to use it in the future, it is necessary to store it in our directory, together with the seed.
from pickle import dump
dump(model, open("knn_regressor_default.sav", "wb"))
Giving the model an explanatory name is vital, since if we lose the code that generated it, we will still know what configuration it has from the filename (in this case we say default because we have not customized any of the model's hyperparameters; we have kept the ones the function uses by default).