Next we will see how we can implement this model in Python. To do so, we will use the scikit-learn library.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y = True, as_frame = True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
X_train.head()
The train set will be used to train the model, while the test set will be used to evaluate the effectiveness of the model. In addition, the predictor variables do not need to be normalized since decision trees are not affected by the scale of the data because of the way they work: they make decisions based on certain feature thresholds, regardless of their scale.
One way to visualize the relationship of variables to the target is by using a new type of graph, the parallel_coordinates. It is a data visualization technique used to plot multivariate numerical variables:
import pandas as pd
total_data = X
total_data["Name"] = y
pd.plotting.parallel_coordinates(total_data, "Name", color = ("#E58139", "#39E581", "#8139E5"))
Class 0 being an iris setosa, 1 an iris versicolor and 2 an iris virginica, we can appreciate how the values of the predictors are distributed in the different flower classes. This can be a good graph to include in our EDA analysis from now on, as it helps us to identify patterns and correlations between variables.
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state = 42)
model.fit(X_train, y_train)
Once the model has been trained correctly, we can visualize the tree with the same library. This visualization represents all the steps that the model has followed until the construction of the tree. Moreover, it is done in levels, from left to right:
import matplotlib.pyplot as plt
from sklearn import tree
fig = plt.figure(figsize=(15,15))
tree.plot_tree(model, feature_names = list(X_train.columns), class_names = ["0", "1", "2"], filled = True)
plt.show()
The training time of a model will depend, first of all, on the size of the dataset (instances and features), and also on the model type and its configuration.
Once the model has been trained, it can be used to predict with the test data set.
y_pred = model.predict(X_test)
y_pred
With raw data, it is very difficult to know whether the model is getting it right or not. To do this, we must compare it with reality. There are many metrics to measure the effectiveness of a model in predicting, including accuracy, which is the fraction of predictions that the model makes correctly.
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
The model is perfect!
Once we have the model we were looking for (presumably after hyperparameter optimization), to be able to use it in the future, it is necessary to store it in our directory, together with the seed.
from pickle import dump
dump(model, open("decision_tree_classifier_default_42.sav", "wb"))
Adding an explanatory name to the model is vital because, in the event of losing the code that generated it, we will know several important details. Firstly, we will understand its configuration (in this case, we use default because we haven't customized any of the model's hyperparameters; we've kept the defaults of the function). Secondly, we will have the seed necessary to replicate the random components of the model, indicated by adding a number to the filename, such as 42.
To exemplify the implementation of a regression tree, we will use a data set with a few instances that has been previously treated with a full EDA.
train_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_petrol_consumption_train.csv")
test_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_petrol_consumption_test.csv")
train_data.head()
The train set will be used to train the model, while the test set will be used to evaluate the effectiveness of the model. We will also split the predictors of the features.
X_train = train_data.drop(["Petrol_Consumption"], axis = 1)
y_train = train_data["Petrol_Consumption"]
X_test = test_data.drop(["Petrol_Consumption"], axis = 1)
y_test = test_data["Petrol_Consumption"]
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(random_state = 42)
model.fit(X_train, y_train)
Once the model has been trained, it can be used to predict with the test data set.
y_pred = model.predict(X_test)
y_pred
To calculate the effectiveness of the model, we will use the mean squared error (MSE):
from sklearn.metrics import mean_squared_error
print(f"Mean squared error: {mean_squared_error(y_test, y_pred)}")
Once we have the model we were looking for (presumably after hyperparameter optimization), to be able to use it in the future, it is necessary to store it in our directory, together with the seed.
dump(model, open("decision_tree_regressor_default_42.sav", "wb"))
Adding an explanatory name to the model is vital because, in the event of losing the code that generated it, we will know several important details. Firstly, we will understand its configuration (in this case, we use default because we haven't customized any of the model's hyperparameters; we've kept the defaults of the function). Secondly, we will have the seed necessary to replicate the random components of the model, indicated by adding a number to the filename, such as 42.