Hyperparameter optimization (HPO) is the process of searching for the hyperparameter configuration that yields the best-performing version of a model. These hyperparameters, unlike model parameters, are set by the engineer before training.
A hyperparameter is a configuration variable external to the model that is used to train it. Depending on the model, we can find a multitude of them. To understand the difference, it helps to contrast hyperparameters with model parameters:

A parameter of a model is one of the values that are optimized during training and that shape its learning. These values are not set directly by us as developers; in a linear regression, for example, the parameters are the slope and the intercept.

With the training dataset and a learning algorithm (such as the gradient descent we saw earlier), the algorithm adjusts these values so that the model learns how to classify or predict cases.

A hyperparameter, in contrast, is established before the training phase and allows the developer to set the context and prepare the model.
| Parameter | Hyperparameter |
|---|---|
| Indispensable for predictions. | Indispensable for initializing the model parameters, which will be optimized later. |
| They are estimated by learning algorithms (gradient descent, Adam, Adagrad...). | They are estimated by the hyperparameter optimization method. |
| They are not set manually. | They are set manually. |
| The final value is obtained after the learning phase and will determine the accuracy of the model and how it will predict new data. | The choice of these values will determine how efficient the training will be. It also has a big impact on the parameter optimization process. |
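To make the distinction concrete, here is a minimal sketch with a toy dataset (made up just for illustration): the hyperparameter `C` is fixed by hand before training, while the parameters `coef_` and `intercept_` are estimated by the learning algorithm during training.

```py
from sklearn.linear_model import LogisticRegression

# Toy data, for illustration only
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

# Hyperparameter: chosen by us before training
model_example = LogisticRegression(C = 0.5)

# Parameters: estimated by the learning algorithm during training
model_example.fit(X, y)
print(model_example.coef_, model_example.intercept_)
```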
Normally, we do not know in advance which hyperparameter values will produce the best results, so hyperparameter optimization is a vital step in any Machine Learning model-building process.
There are several strategies to carry it out. First, we train a base model:
```py
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

train_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_titanic_train.csv")
test_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_titanic_test.csv")

X_train = train_data.drop(["Survived"], axis = 1)
y_train = train_data["Survived"]
X_test = test_data.drop(["Survived"], axis = 1)
y_test = test_data["Survived"]

# Train a logistic regression with its default hyperparameters
model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

base_accuracy = accuracy_score(y_test, y_pred)
base_accuracy
```
As we can see, the "base" accuracy, using the default configuration of the model, is 84.7%. Let's see if we can improve these results using the different techniques.
Grid search is a method that performs an exhaustive search over a manually specified subset of hyperparameter values: it tries every possible combination and keeps the one that produces the best model.
```py
from sklearn.model_selection import GridSearchCV

# We define the hyperparameters that we want to adjust by hand
hyperparams = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    "penalty": ["l1", "l2", "elasticnet", None],
    "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]
}

# We initialize the grid search with 5-fold cross-validation
grid = GridSearchCV(model, hyperparams, scoring = "accuracy", cv = 5)
grid
```
```py
# Silence the warnings raised by invalid hyperparameter combinations during the search
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn

grid.fit(X_train, y_train)

print(f"Best hyperparameters: {grid.best_params_}")
```
As we can see, the hyperparameters optimized with this technique are:

- `C`: 10
- `penalty`: l1
- `solver`: liblinear

In addition, we must always use the training dataset to fit the search. Now we just have to repeat the training, setting these hyperparameters in the model:
```py
model_grid = LogisticRegression(penalty = "l1", C = 10, solver = "liblinear")
model_grid.fit(X_train, y_train)

y_pred = model_grid.predict(X_test)

grid_accuracy = accuracy_score(y_test, y_pred)
grid_accuracy
```
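As a side note, since `GridSearchCV` refits the best combination on the full training set by default (`refit = True`), we could also skip this manual retraining and reuse the fitted search object directly. A minimal sketch of that shortcut:

```py
# The fitted search already holds the refitted best model
best_model = grid.best_estimator_

y_pred = best_model.predict(X_test)
accuracy_score(y_test, y_pred)

# Mean cross-validated accuracy of the best combination on the training set
grid.best_score_
```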
We observe an improvement of just under 1%, which on a real-world dataset is a huge win!
In addition, we have used three of the many hyperparameters that this model accepts. We could build a much more complex grid (and one that would take longer to run) to improve the results.
As points in favor of grid search, we can find:

- It is exhaustive: every combination in the grid is evaluated, so the best combination within the defined grid is guaranteed to be found.
- It is simple to configure and easy to parallelize, since each combination is evaluated independently.

However, the following negative points should be noted:

- It is computationally expensive: the number of combinations grows multiplicatively with every hyperparameter and value we add, and each one is evaluated with cross-validation.
- It only explores the values we list explicitly, so good values that fall outside the grid are never tried.
Random search is similar to the previous method but, instead of testing all possible combinations of the previously established hyperparameter values, it randomly samples combinations of hyperparameters to test.
```py
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

# We define the hyperparameters we want to adjust
hyperparams = {
    "C": np.logspace(-4, 4, 20),
    "penalty": ["l1", "l2", "elasticnet", None],
    "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]
}

# We initialize the random search
random_search = RandomizedSearchCV(model, hyperparams, n_iter = 100, scoring = "accuracy", cv = 5, random_state = 42)
random_search
```
```py
random_search.fit(X_train, y_train)

print(f"Best hyperparameters: {random_search.best_params_}")
```
As we can see, the hyperparameters optimized with this technique are:

- `C`: 29.7635
- `penalty`: l2
- `solver`: lbfgs

In addition, we can see in the logs that some combinations failed due to incompatibilities between hyperparameters (values of one hyperparameter that are incompatible with values of another, such as a solver that does not support a given penalty). The search handles this itself, and we should not worry, since it will always return the best combination among those that ran without errors.
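How these failures are treated is controlled by the `error_score` argument of the search: by default, failed fits are scored as `NaN` and only reported as warnings. If we ever wanted to debug an incompatible combination instead of skipping it, a sketch (reusing the `model` and `hyperparams` defined above) would be:

```py
# error_score = "raise" makes the search stop and surface the underlying error
# instead of silently scoring the failed combination as NaN (the default behavior)
random_search_debug = RandomizedSearchCV(
    model, hyperparams, n_iter = 100, scoring = "accuracy", cv = 5,
    random_state = 42, error_score = "raise"
)

# random_search_debug.fit(X_train, y_train)  # would stop at the first invalid combination
```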
With these new hyperparameters, we retrain the model:
```py
model_random_search = LogisticRegression(penalty = "l2", C = 29.7635, solver = "lbfgs")
model_random_search.fit(X_train, y_train)

y_pred = model_random_search.predict(X_test)

random_search_accuracy = accuracy_score(y_test, y_pred)
random_search_accuracy
```
As we can see, it yields the same accuracy as the previous strategy. This means that, with the hyperparameters we have tried to optimize, we have reached a local maximum: we would have to repeat the optimization strategy including other hyperparameters to improve the results, since by tuning only `penalty`, `C` and `solver` we are not going to improve the model any further.
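As a sketch of what that wider search could look like, we could add more `LogisticRegression` hyperparameters to the space; the extra entries and values below (`class_weight`, `max_iter`) are just an illustrative choice, not a tuned recommendation:

```py
hyperparams_extended = {
    "C": np.logspace(-4, 4, 20),
    "penalty": ["l1", "l2", "elasticnet", None],
    "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
    # Additional LogisticRegression hyperparameters (illustrative values)
    "class_weight": [None, "balanced"],
    "max_iter": [100, 500, 1000]
}

random_search_extended = RandomizedSearchCV(model, hyperparams_extended, n_iter = 200, scoring = "accuracy", cv = 5, random_state = 42)
random_search_extended.fit(X_train, y_train)

print(f"Best hyperparameters: {random_search_extended.best_params_}")
```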
As points in favor of random search, we can find:

- It is much cheaper than an exhaustive search when the hyperparameter space is large, since only `n_iter` combinations are evaluated.
- It can sample values from continuous ranges or distributions, so it is not limited to a hand-picked list of values.
As unfavorable points, we can find:

- It does not guarantee finding the best possible combination, since not all of them are tested.
- The results may change between runs unless the seed (`random_state`) is fixed.

Both are hyperparameter search techniques and can be useful in different situations. Grid search is more appropriate when we have a small, well-defined set of hyperparameters, and random search is more useful when there is a large hyperparameter space and/or we do not have a clear idea of what the best values to optimize might be.
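As a final illustration of the large-search-space case, `RandomizedSearchCV` also accepts continuous distributions instead of fixed lists of values. The sketch below uses `scipy.stats.loguniform` for `C` (an addition of ours, not something used above), restricted to solvers and penalties that are mutually compatible:

```py
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV

# Sample C continuously on a log scale instead of from a fixed list
hyperparams_dist = {
    "C": loguniform(1e-4, 1e4),
    "penalty": ["l1", "l2"],
    "solver": ["liblinear", "saga"]  # both solvers support l1 and l2
}

random_search_dist = RandomizedSearchCV(model, hyperparams_dist, n_iter = 100, scoring = "accuracy", cv = 5, random_state = 42)
random_search_dist.fit(X_train, y_train)

print(f"Best hyperparameters: {random_search_dist.best_params_}")
```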