Exploring K-Nearest Neighbors

KNN in Python

Next, we will see how to implement this model in Python using the scikit-learn library.

KNN for classification

To illustrate a KNN for classification, we will use the dataset from the previous modules, which has already been scaled because this type of model requires it.

Step 1. Reading the processed dataset

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

total_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_iris.csv")

X = total_data.drop("specie", axis = 1)
y = total_data["specie"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

X_train.head()
Out[1]:
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
22          -1.506521          1.249201          -1.567576         -1.315444
15          -0.173674          3.090775          -1.283389         -1.052180
65           1.038005          0.098217           0.364896          0.264142
11          -1.264185          0.788808          -1.226552         -1.315444
42          -1.748856          0.328414          -1.397064         -1.315444

The train set will be used to train the model, while the test set will be used to evaluate the effectiveness of the model.

KNN classifies each observation according to its distance to other observations, so the data must be scaled beforehand: if the features are not on the same scale, those with larger magnitudes dominate the distance and distort the result of the algorithm.

For example, suppose we have two features: age (with values between 0 and 100) and annual income (with values between 0 and 100,000). The difference in scale means that annual income has a disproportionate impact on the distance, so the contribution of age is practically ignored.
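
The age and income example above can be checked numerically. The sketch below is only an illustration: the three people and their values are hypothetical, and dividing by the upper bound of each range (100 and 100,000, taken from the text) is just one simple way to put both features on a comparable scale.

```python
import numpy as np


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(u - v))


# Hypothetical people described as (age in years, annual income)
person_a = np.array([25.0, 50_000.0])
person_b = np.array([60.0, 50_500.0])  # very different age, similar income
person_c = np.array([26.0, 52_000.0])  # similar age, somewhat different income

# Distances on the raw features: income differences dominate
raw_ab = euclidean(person_a, person_b)
raw_ac = euclidean(person_a, person_c)

# Assumed rescaling: divide each feature by the upper bound of its range
feature_range = np.array([100.0, 100_000.0])
scaled_ab = euclidean(person_a / feature_range, person_b / feature_range)
scaled_ac = euclidean(person_a / feature_range, person_c / feature_range)

print(f"raw:    d(a, b) = {raw_ab:.1f}, d(a, c) = {raw_ac:.1f}")
print(f"scaled: d(a, b) = {scaled_ab:.3f}, d(a, c) = {scaled_ac:.3f}")
```

On the raw values, person b looks closer to person a than person c does, even though b is 35 years older. After rescaling, the ordering flips and person c becomes the nearer neighbor.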

Scaling the data helps all features contribute equally to the distance, which can improve the performance of the KNN algorithm. The choice between standardization (rescaling each feature to zero mean and unit variance) and Min-Max scaling (rescaling each feature to a fixed range such as [0, 1]) depends on the behavior of the variables and how they affect model performance. As a rule of thumb, if the features have different scales and ranges, Min-Max scaling is a good alternative. If, on the other hand, they have the same or a similar scale, standardization is usually the most appropriate.
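
As a minimal sketch of the two rescaling approaches mentioned above, the following uses scikit-learn's `StandardScaler` (zero mean, unit variance per feature) and `MinMaxScaler` (values in [0, 1] per feature). The small `raw` array of ages and incomes is made up for illustration and is not part of the lesson's dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: one row per person, columns are (age, annual income)
raw = np.array([
    [18.0, 12_000.0],
    [35.0, 40_000.0],
    [52.0, 65_000.0],
    [70.0, 98_000.0],
])

# Standardization: each column ends up with mean 0 and standard deviation 1
standardized = StandardScaler().fit_transform(raw)

# Min-Max scaling: each column is mapped linearly onto the range [0, 1]
min_max = MinMaxScaler().fit_transform(raw)

print("standardized:\n", standardized.round(3))
print("min-max:\n", min_max.round(3))
```

In a real workflow the scaler would normally be fitted on the training set only and then applied to the test set with `transform`, so that no information from the test data leaks into training.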

Next, we will visualize the relationship between the variables in the dataset. We have chosen three of the four features to build a 3D graph, since a plot cannot show more than three spatial axes:

In [2]:
# We add the species name for the plot

total_data["specie"] = total_data["specie"].map({0: "setosa", 1: "versicolor", 2: "virginica"})
In [3]:
import plotly.express as px

fig = px.scatter_3d(total_data, x = "petal width (cm)", y = "petal length (cm)", z = "sepal width (cm)", color = "specie", width = 1000, height = 500,
                    size = total_data["petal length (cm)"].abs(), color_discrete_sequence=["#E58139", "#39E581", "#8139E5"])
camera = dict(
    up = dict(x = 1, y = 3.5, z = 0),
    eye = dict(x = 2, y = 0, z = 0)
)

fig.update_layout(scene_camera = camera)
fig.show()