Next, we will implement this model in Python using the scikit-learn library.
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the cleaned Iris dataset
total_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_iris.csv")

# Separate the predictor variables (X) from the target (y)
X = total_data.drop("specie", axis = 1)
y = total_data["specie"]

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
X_train.head()
The training set will be used to fit the model, while the test set will be used to evaluate how well the model performs on data it has not seen.
For this model to work correctly, the data must be standardized. KNN relies on distances between points, so if the features are not on the same scale, those with larger magnitudes dominate the distance and distort the result of the algorithm.
For example, suppose we have two features: age (with values between 0 and 100) and annual income (with values between 0 and 100,000). Because of the difference in scale, annual income would have a disproportionate impact on the distance, and the contribution of age would be practically ignored.
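The effect of mismatched scales can be checked with a few lines of plain Python. The sketch below is only an illustration: the three people and their values are made up, and the `euclidean` helper is a hypothetical name, not part of scikit-learn.

```python
import math

# Hypothetical people described as (age in years, annual income)
person_a = (25, 50_000)
person_b = (60, 50_500)  # very different age, similar income
person_c = (26, 52_000)  # almost the same age, somewhat higher income

def euclidean(p, q):
    """Euclidean distance between two points of equal length."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

dist_ab = euclidean(person_a, person_b)
dist_ac = euclidean(person_a, person_c)

print(f"distance A-B: {dist_ab:.1f}")
print(f"distance A-C: {dist_ac:.1f}")
```

On the raw values, A ends up closer to B than to C, even though A and B are 35 years apart and A and C are only one year apart. The income difference alone decides which point counts as the nearest neighbor.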
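Standardizing the data helps all features contribute equally to the distance, which can improve the performance of the KNN algorithm. The choice between standardization (rescaling each feature to zero mean and unit variance) and Min-Max normalization (rescaling each feature to a fixed range, typically between 0 and 1) will depend on the behavior of the variables and how they affect model performance. Min-Max normalization is a reasonable choice when the features have known bounds and few outliers, because a single extreme value compresses the rest of the range. Standardization tends to be the safer option when the features contain outliers or have no natural bounds. In practice, it is worth trying both and keeping the one that gives the better model.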
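The two rescaling options can be written out by hand to see what each one does. The following is a minimal sketch using only the standard library; the income values and the helper names (`fit_standard`, `standardize`, `fit_min_max`, `min_max_scale`) are assumptions made for illustration. In scikit-learn, the `StandardScaler` and `MinMaxScaler` classes from `sklearn.preprocessing` apply the same formulas. Note that the parameters are learned from the training values only and then reused on the test values.

```python
from statistics import mean, pstdev

def fit_standard(values):
    """Learn the mean and population standard deviation from training values."""
    return mean(values), pstdev(values)

def standardize(values, mu, sigma):
    """Z-score: subtract the mean, then divide by the standard deviation."""
    return [(v - mu) / sigma for v in values]

def fit_min_max(values):
    """Learn the minimum and maximum from training values."""
    return min(values), max(values)

def min_max_scale(values, lo, hi):
    """Map values so that lo becomes 0 and hi becomes 1."""
    return [(v - lo) / (hi - lo) for v in values]

# Made-up annual incomes, for illustration only
train_income = [20_000, 35_000, 50_000, 65_000, 80_000]
test_income = [42_500]

# Fit on the training data only
mu, sigma = fit_standard(train_income)
lo, hi = fit_min_max(train_income)

# Apply the learned parameters to both sets
train_std = standardize(train_income, mu, sigma)
test_std = standardize(test_income, mu, sigma)
train_mm = min_max_scale(train_income, lo, hi)
test_mm = min_max_scale(test_income, lo, hi)

print("standardized train:", [round(v, 3) for v in train_std])
print("min-max train:     ", train_mm)
print("min-max test:      ", test_mm)
```

After either transformation the income column lives on a scale comparable to an equally rescaled age column, so neither feature dominates the distance by magnitude alone.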
Next we will visualize the relationship between the variables in the dataset. We have chosen three of the features to draw a 3D scatter plot, since a plot cannot show more than three spatial axes:
# We add the species name for the plot
total_data["specie"] = total_data["specie"].map({0: "setosa", 1: "versicolor", 2: "virginica"})
import plotly.express as px

# 3D scatter plot of three features, colored by species; marker size follows petal length
fig = px.scatter_3d(total_data, x = "petal width (cm)", y = "petal length (cm)", z = "sepal width (cm)", color = "specie", width = 1000, height = 500,
    size = total_data["petal length (cm)"].abs(), color_discrete_sequence = ["#E58139", "#39E581", "#8139E5"])

# Set the initial orientation and position of the camera
camera = dict(
    up = dict(x = 1, y = 3.5, z = 0),
    eye = dict(x = 2, y = 0, z = 0)
)
fig.update_layout(scene_camera = camera)
fig.show()