โ† Back to Lessons
Open in Colab

Exploring Boosting Algorithms

Boosting in Python

Next, we will see how to implement this model in Python using the xgboost library.

Boosting for classification

To exemplify the implementation of a boosting algorithm for classification, we will use the same data set as in the case of decision trees and random forests.

Step 1. Reading the processed dataset

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y = True, as_frame = True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

X_train.head()
Out[1]:
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
22                4.6               3.6                1.0               0.2
15                5.7               4.4                1.5               0.4
65                6.7               3.1                4.4               1.4
11                4.8               3.4                1.6               0.2
42                4.4               3.2                1.3               0.2

The train set will be used to train the model, while the test set will be used to evaluate its effectiveness. In addition, the predictor variables do not need to be normalized: the decision trees that make up an XGBoost model are unaffected by the scale of the data, since they split on feature thresholds rather than on magnitudes.

However, if base learners other than decision trees are combined through boosting, data standardization is necessary.
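
As an illustration (not needed for the tree-based models used in this lesson), a minimal sketch of that standardization step could look like this, fitting the scaler on the training data only and reusing it on the test data:

from sklearn.preprocessing import StandardScaler

# Learn the scaling parameters from the training data only,
# then apply the same transformation to the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)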

Step 2: Initialization and training of the model

In [2]:
from xgboost import XGBClassifier

model = XGBClassifier(random_state = 42)
model.fit(X_train, y_train)
Out[2]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              objective='multi:softprob', predictor=None, ...)

The training time of a model depends primarily on the size of the dataset (number of instances and features), and also on the model type and its configuration.
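
If training becomes slow, a simple way to gauge it is to time the fit for a given configuration. The sketch below uses standard XGBoost hyperparameters (n_estimators, max_depth, learning_rate), but the values are arbitrary examples rather than a recommendation:

import time

# A hypothetical configuration, only to illustrate how the configuration affects training time
timed_model = XGBClassifier(n_estimators = 200, max_depth = 3, learning_rate = 0.1, random_state = 42)

start = time.time()
timed_model.fit(X_train, y_train)
print(f"Training took {time.time() - start:.2f} seconds")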

Step 3: Model prediction

Once the model has been trained, it can be used to make predictions on the test portion of the dataset.

In [3]:
y_pred = model.predict(X_test)
y_pred
Out[3]:
array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 0, 0])

From the raw predictions alone it is difficult to tell whether the model is right or wrong; to find out, we must compare them against the actual labels. There are many metrics for measuring a model's predictive performance, including accuracy, which is the fraction of predictions the model gets right.

In [4]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)
Out[4]:
1.0

The model classifies every sample in the test set correctly!
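
Since accuracy is simply the fraction of correct predictions, the same figure can be verified by hand:

# Fraction of test samples whose prediction matches the true label
(y_test == y_pred).mean()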

Step 4: Saving the model

Once we have the model we were looking for (presumably after hyperparameter optimization), we need to save it to disk so it can be reused in the future, recording the seed as part of the filename.

In [5]:
model.save_model("xgb_classifier_default_42.json")

Giving the model file an explanatory name is vital because, if the code that generated it is ever lost, we still know several important details. First, its configuration: here we use default because we haven't customized any of the model's hyperparameters and have kept the function's defaults. Second, the seed needed to replicate the model's random components, indicated by the number appended to the filename, such as 42.
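
When we need the model again, it can be restored into a fresh estimator with load_model. A minimal sketch:

from xgboost import XGBClassifier

# Create an empty estimator and load the stored booster into it
loaded_model = XGBClassifier()
loaded_model.load_model("xgb_classifier_default_42.json")

# The restored model predicts just like the original one
loaded_model.predict(X_test)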

Boosting for regression

To exemplify the implementation of a boosting algorithm for regression, we will use the same data set as in the case of decision trees and random forests.

Step 1. Reading the processed dataset

In [6]:
import pandas as pd

train_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_petrol_consumption_train.csv")
test_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_petrol_consumption_test.csv")

train_data.head()
Out[6]:
   Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
0         8.0            4447            8577                         0.529                 464
1         7.5            4870            2351                         0.529                 414
2         8.0            5319           11868                         0.451                 344
3         7.0            4345            3905                         0.672                 968
4         7.5            3357            4121                         0.547                 628
In [7]:
X_train = train_data.drop(["Petrol_Consumption"], axis = 1)
y_train = train_data["Petrol_Consumption"]
X_test = test_data.drop(["Petrol_Consumption"], axis = 1)
y_test = test_data["Petrol_Consumption"]

As in the classification case, the train set will be used to train the model and the test set to evaluate it. The predictor variables again do not need to be normalized, since the decision trees underlying XGBoost split on feature thresholds and are insensitive to scale; standardization would only be required with base learners other than trees.

Step 2: Initialization and training of the model

In [8]:
from xgboost import XGBRegressor

model = XGBRegressor(random_state = 42)
model.fit(X_train, y_train)
Out[8]:
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=None, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             n_estimators=100, n_jobs=None, num_parallel_tree=None,
             predictor=None, random_state=42, ...)

Step 3: Model prediction

Once the model has been trained, it can be used to make predictions on the test portion of the dataset.

In [9]:
y_pred = model.predict(X_test)
y_pred
Out[9]:
array([577.82245, 551.888  , 602.6116 , 616.01685, 490.87717, 613.6035 ,
       560.9523 , 932.38385, 552.1893 , 647.02783], dtype=float32)

To measure the effectiveness of the model, we will use the mean squared error (MSE):

In [10]:
from sklearn.metrics import mean_squared_error

print(f"Mean squared error: {mean_squared_error(y_test, y_pred)}")
Mean squared error: 12803.160311029944
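
Since the MSE is expressed in squared units of the target, its square root (the RMSE) is often easier to interpret, as it is in the same units as the consumption values:

import numpy as np

# RMSE: square root of the MSE, in the same units as the target variable
print(f"Root mean squared error: {np.sqrt(mean_squared_error(y_test, y_pred))}")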

Step 4: Saving the model

Once we have the model we were looking for (presumably after hyperparameter optimization), we need to save it to disk so it can be reused in the future, recording the seed as part of the filename.

In [11]:
model.save_model("xgb_regressor_default_42.json")

As with the classifier, the explanatory filename records the configuration (default, since we kept the function's default hyperparameters) and the seed (42) needed to replicate the model's random components, so these details survive even if the code that generated the model is lost.
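
The saved file can later be restored into a fresh regressor with load_model. A minimal sketch:

from xgboost import XGBRegressor

# Create an empty estimator and load the stored booster into it
loaded_model = XGBRegressor()
loaded_model.load_model("xgb_regressor_default_42.json")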