Next we will see how we can implement this model in Python. To do so, we will use the scikit-learn
library.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y = True, as_frame = True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
X_train.head()
The train set will be used to train the model, while the test set will be used to evaluate how well it performs. Furthermore, the predictor variables do not need to be normalized: decision trees, and therefore random forests, are not affected by the scale of the data because of the way they work. Each split compares a single feature against a threshold, so rescaling a feature only rescales the threshold and leaves the decision unchanged.
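The scale-invariance claim can be checked empirically. The following is a minimal sketch, not part of the original lesson: the scaling factor of 1024 and all variable names are assumptions chosen for illustration. It trains two forests with the same seed, one on the raw features and one on rescaled features, and measures how often their predictions coincide.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Hypothetical rescaling: multiply every feature by a constant,
# as would happen with a change of measurement units.
scale = 1024.0

# Same seed for both forests, so the only difference is the feature scale
forest_raw = RandomForestClassifier(random_state=42).fit(Xtr, ytr)
forest_scaled = RandomForestClassifier(random_state=42).fit(Xtr * scale, ytr)

pred_raw = forest_raw.predict(Xte)
pred_scaled = forest_scaled.predict(Xte * scale)

# Fraction of test instances on which both forests agree
agreement = np.mean(pred_raw == pred_scaled)
print(f"Fraction of identical predictions: {agreement:.2f}")
```

If the thresholds simply rescale with the features, as argued above, the two forests should agree on essentially every test instance.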
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state = 42)
model.fit(X_train, y_train)
Once the model has been trained, we can visualize the individual trees of the random forest with the same library. The following code plots the first four trees in full:
import matplotlib.pyplot as plt
from sklearn import tree
fig, axis = plt.subplots(2, 2, figsize = (15, 15))
# We show the first 4 trees out of the 100 generated (default)
for ax, estimator in zip(axis.ravel(), model.estimators_[:4]):
    tree.plot_tree(estimator, ax = ax, feature_names = list(X_train.columns), class_names = ["0", "1", "2"], filled = True)
plt.show()
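When a plot is hard to read, the same trees can also be inspected as text. The sketch below is an illustrative alternative, assuming a small forest (`n_estimators=10`, `max_depth=3`) chosen only to keep the output short; it uses `export_text` from `sklearn.tree` to print the rules of the first tree.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

X_demo, y_demo = load_iris(return_X_y=True, as_frame=True)

# Small, shallow forest (assumed settings) so the printed rules stay short
forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=42)
forest.fit(X_demo, y_demo)

# Each element of estimators_ is a fitted decision tree
first_tree = forest.estimators_[0]
rules = export_text(first_tree, feature_names=list(X_demo.columns))
print(rules)
```

Each indented line is one threshold test on a feature, and each leaf reports the predicted class.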
The training time of a model will depend, first of all, on the size of the dataset (instances and features), and also on the number of trees we want our random forest to have.
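To get a feel for how the number of trees affects training time, the fit can be timed for several values of `n_estimators`. This is a rough sketch under assumed settings (the tree counts and variable names are illustrative); the absolute timings depend on the machine and are not meant as a benchmark.

```python
import time

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_iris(return_X_y=True)

fit_times = {}
for n_trees in (10, 50, 200):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    start = time.perf_counter()
    forest.fit(X_demo, y_demo)
    fit_times[n_trees] = time.perf_counter() - start
    # The fitted forest holds one tree per requested estimator
    print(f"{n_trees} trees -> {len(forest.estimators_)} estimators, "
          f"{fit_times[n_trees]:.3f} s")
```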
Once the model has been trained, it can be used to predict with the test data set.
y_pred = model.predict(X_test)
y_pred
From the raw predictions alone it is very difficult to know whether the model is getting it right or not. To find out, we must compare them with the true labels. There are many metrics for measuring how well a model predicts, including accuracy, which is the fraction of predictions that the model made correctly.
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
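The model is perfect on this test set: an accuracy of 1.0 means that every test instance was classified correctly. Keep in mind that the test set contains only 30 instances (20% of 150), so a perfect score here does not guarantee perfect performance on new data.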
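To make the definition of accuracy concrete, it can be computed by hand and compared with `accuracy_score`. The labels below are hypothetical values invented for this illustration, not output from the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions, for illustration only
y_true_demo = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred_demo = np.array([0, 0, 1, 2, 2, 2, 1, 1])

# Accuracy = number of correct predictions / total number of predictions
manual_accuracy = np.mean(y_true_demo == y_pred_demo)
library_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(manual_accuracy, library_accuracy)  # 6 of 8 correct -> 0.75 0.75
```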
Once we have the model we were looking for (presumably after hyperparameter optimization), we need to save it to disk, together with the seed, so that it can be used in the future.
from pickle import dump
# The with block closes the file even if dump fails
with open("random_forest_classifier_default_42.sav", "wb") as model_file:
    dump(model, model_file)
Giving the model file an explanatory name is vital: if we lose the code that generated it, the name still tells us, on the one hand, what configuration the model has (here "default", because we have not customized any hyperparameters and have kept the values the function uses by default) and, on the other, the seed needed to replicate the random components of the model (here recorded by appending the number 42 to the file name).
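A saved model is only useful if it can be loaded back. The following sketch shows the round trip with `pickle.load`; the temporary directory is an assumption made to keep the example self-contained, and in practice the file would live in the project directory.

```python
import os
import tempfile
from pickle import dump, load

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_iris(return_X_y=True)
forest = RandomForestClassifier(random_state=42).fit(X_demo, y_demo)

# Temporary directory used only so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "random_forest_classifier_default_42.sav")
    with open(model_path, "wb") as model_file:
        dump(forest, model_file)
    with open(model_path, "rb") as model_file:
        restored = load(model_file)

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(forest.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.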
To illustrate a random forest for regression, we will use a data set with few instances that has already been through a full EDA. It is the same data set used in the decision tree example.
import pandas as pd
train_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_petrol_consumption_train.csv")
test_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_petrol_consumption_test.csv")
train_data.head()
X_train = train_data.drop(["Petrol_Consumption"], axis = 1)
y_train = train_data["Petrol_Consumption"]
X_test = test_data.drop(["Petrol_Consumption"], axis = 1)
y_test = test_data["Petrol_Consumption"]
As in the classification example, the train set will be used to train the model, while the test set will be used to evaluate how well it performs. The predictor variables again do not need to be normalized: decision trees, and therefore random forests, split on feature thresholds, so they are not affected by the scale of the data.
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(random_state = 42)
model.fit(X_train, y_train)
Once the model has been trained, it can be used to predict with the test sample of the dataset.
y_pred = model.predict(X_test)
y_pred
To calculate the effectiveness of the model we will use the mean squared error (MSE):
from sklearn.metrics import mean_squared_error
print(f"Mean squared error: {mean_squared_error(y_test, y_pred)}")
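The MSE is the average of the squared differences between the true and predicted values. The sketch below computes it by hand and compares it with `mean_squared_error`; it uses synthetic data from `make_regression` (an assumption, so the example runs without downloading the petrol data set) and also derives the root mean squared error, which is expressed in the same units as the target.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data, used here only for illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

reg = RandomForestRegressor(random_state=42).fit(Xtr, ytr)
pred = reg.predict(Xte)

# MSE = mean of squared residuals
mse_manual = np.mean((yte - pred) ** 2)
mse_sklearn = mean_squared_error(yte, pred)

# RMSE is in the same units as the target variable
rmse = np.sqrt(mse_sklearn)

print(f"Manual MSE: {mse_manual:.3f}")
print(f"sklearn MSE: {mse_sklearn:.3f}")
print(f"RMSE: {rmse:.3f}")
```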
Once we have the model we were looking for (presumably after hyperparameter optimization), we need to save it to disk, together with the seed, so that it can be used in the future.
with open("random_forest_regressor_default_42.sav", "wb") as model_file:
    dump(model, model_file)
Giving the model file an explanatory name is vital: if we lose the code that generated it, the name still tells us, on the one hand, what configuration the model has (here "default", because we have not customized any hyperparameters and have kept the values the function uses by default) and, on the other, the seed needed to replicate the random components of the model (here recorded by appending the number 42 to the file name).
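The reason for recording the seed can be shown with a short experiment. This is a sketch on synthetic data (the data set, the alternative seed 7, and the variable names are assumptions for illustration): two forests trained with the same `random_state` on the same data produce the same predictions, while a different seed will generally produce a slightly different model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used here only for illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

forest_a = RandomForestRegressor(random_state=42).fit(X_demo, y_demo)
forest_b = RandomForestRegressor(random_state=42).fit(X_demo, y_demo)
forest_c = RandomForestRegressor(random_state=7).fit(X_demo, y_demo)

pred_a = forest_a.predict(X_demo)

# Same seed and same data: the random components are replicated
same_seed_match = np.array_equal(pred_a, forest_b.predict(X_demo))
# Different seed: bootstrap samples and feature choices change
other_seed_match = np.array_equal(pred_a, forest_c.predict(X_demo))

print(f"Same seed, identical predictions: {same_seed_match}")
print(f"Different seed, identical predictions: {other_seed_match}")
```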