Exploring Naive Bayes

Naive Bayes in Python

Next, we will see how we can implement this model in Python. To do so, we will use the scikit-learn library.

To exemplify the implementation of a Naive Bayes model for classification, we will use the same dataset as in the cases of decision trees, random forests, and boosting.

Step 1: Reading the processed dataset

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset as pandas objects (X: features, y: target)
X, y = load_iris(return_X_y = True, as_frame = True)

# Hold out 20% of the data for testing, with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

X_train.head()
Out[1]:
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
22                4.6               3.6                1.0               0.2
15                5.7               4.4                1.5               0.4
65                6.7               3.1                4.4               1.4
11                4.8               3.4                1.6               0.2
42                4.4               3.2                1.3               0.2

The train set will be used to train the model, while the test set will be used to evaluate its effectiveness. Moreover, it is not necessary to normalize the predictor variables: since these models are based on Bayes' theorem and make specific assumptions about the distribution of the data, they are not directly affected by the scale of the features.
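As a quick check of this claim (an illustrative sketch, not part of the original lesson, using the Gaussian variant introduced in the next step), we can verify that standardizing the features leaves the predictions essentially unchanged:

from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Standardize the features (zero mean, unit variance per column)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit one model on raw features and one on standardized features
raw_preds = GaussianNB().fit(X_train, y_train).predict(X_test)
scaled_preds = GaussianNB().fit(X_train_scaled, y_train).predict(X_test_scaled)

# The per-feature Gaussians absorb the rescaling, so the predictions
# should match (up to the tiny var_smoothing regularization term)
print((raw_preds == scaled_preds).all())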

Step 2: Initialization and training of the model

In [3]:
from sklearn.naive_bayes import GaussianNB

# Gaussian Naive Bayes: assumes each feature follows a normal
# distribution within each class
model = GaussianNB()
model.fit(X_train, y_train)
Out[3]:
GaussianNB()

The training time of a model depends, first of all, on the size of the dataset (instances and features), and also on the model type and its configuration.
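To see what training actually estimated, we can inspect the fitted parameters (a short illustrative snippet; these attribute names are the ones recent scikit-learn versions expose, with var_ replacing the older sigma_):

# GaussianNB fits one Gaussian (mean, variance) per class and feature,
# plus a prior probability per class
print(model.classes_)       # class labels
print(model.class_prior_)   # P(class), estimated from y_train
print(model.theta_)         # per-class feature means, shape (n_classes, n_features)
print(model.var_)           # per-class feature variances, same shape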

Step 3: Model prediction

Once the model has been trained, it can be used to make predictions on the test dataset.

In [4]:
# Predict the class of each observation in the test set
y_pred = model.predict(X_test)
y_pred
Out[4]:
array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 0, 0])
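Besides hard class labels, the model can also return the posterior probability of each class, which is often more informative than the label alone (an illustrative snippet, not part of the original lesson):

# Posterior probability P(class | x) for the first few test observations;
# each row sums to 1
model.predict_proba(X_test.head())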

From the raw predictions alone, it is very difficult to know whether the model is getting it right. To find out, we must compare the predictions with the actual values. There are many metrics for measuring how effectively a model predicts, including accuracy, which is the fraction of predictions the model gets right.

In [5]:
from sklearn.metrics import accuracy_score

# Fraction of test observations classified correctly
accuracy_score(y_test, y_pred)
Out[5]:
1.0

The model is perfect!
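Accuracy can hide per-class behavior, so a confusion matrix and classification report are also worth checking (an illustrative sketch, reusing y_test and y_pred from above):

from sklearn.metrics import classification_report, confusion_matrix

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))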

Step 4: Saving the model

Once we have the model we were looking for (presumably after hyperparameter optimization), in order to use it in the future, we need to store it in our directory.
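For GaussianNB there is little to tune; as an illustrative sketch (hypothetical search values, not part of the original lesson), a small search over its main hyperparameter, var_smoothing, could look like this:

from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Illustrative 5-fold search over var_smoothing, the portion of the
# largest feature variance added to all variances for numerical stability
grid = GridSearchCV(GaussianNB(), {"var_smoothing": [1e-9, 1e-8, 1e-7]}, cv = 5)
grid.fit(X_train, y_train)
print(grid.best_params_)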

In [6]:
from pickle import dump

# Persist the trained model to disk so it can be reused later;
# the with-block ensures the file is properly closed
with open("naive_bayes_default.sav", "wb") as f:
    dump(model, f)

Giving the model file an explanatory name is vital because, if we ever lose the code that generated it, we can still infer its configuration from the name (in this case, we use default because we haven't customized any of the model's hyperparameters; we've kept the function's defaults).
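To use the stored model later, we load it back with pickle's load counterpart (a minimal sketch, assuming the file saved above is in the working directory):

from pickle import load

# Restore the trained model and use it for new predictions
with open("naive_bayes_default.sav", "rb") as f:
    loaded_model = load(f)

print(loaded_model.predict(X_test.head()))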