Model Hyperparameter Optimization

Hyperparameter optimization

Hyperparameter optimization (HPO) is the process of searching for the hyperparameter values that produce the best-performing, most effective version of a model. Unlike model parameters, hyperparameters are set by the engineer before training.

What is a hyperparameter?

A hyperparameter is a configuration variable that is external to the model and is used to control its training. Depending on the model, we can find a multitude of hyperparameters, for example:

  • Learning rate in gradient descent.
  • Number of iterations in gradient descent.
  • Number of layers in a Neural Network.
  • Number of neurons per layer in a Neural Network.
  • Number of neighbors (k) in a k-NN model.
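
To make this concrete, here is a minimal sketch (the estimators and the specific values are illustrative choices, not part of this lesson's exercise) showing that hyperparameters are fixed when the model object is created, before any training happens:

from sklearn.linear_model import SGDClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

# Hyperparameters are passed to the constructor, before calling .fit()
sgd = SGDClassifier(learning_rate = "constant", eta0 = 0.01, max_iter = 1000)  # learning rate and number of iterations of gradient descent
mlp = MLPClassifier(hidden_layer_sizes = (16, 16))  # two hidden layers with 16 neurons each
knn = KNeighborsClassifier(n_neighbors = 5)  # number of neighbors (k)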

Difference between parameter and hyperparameter

The parameters of a model are the values that are optimized during training and that shape its learning; we do not set them by hand as developers. For example, in the case of a linear regression, these parameters are the slope and the intercept.

Using the training dataset and a learning algorithm (such as gradient descent), the model adjusts these values and learns how to classify or predict new cases.

A hyperparameter, in contrast, is established before the training phase and allows the developer to set up the training context and prepare the model.

| Parameter | Hyperparameter |
|---|---|
| Indispensable for making predictions. | Indispensable for initializing the model parameters, which will be optimized later. |
| Estimated by learning algorithms (gradient descent, Adam, Adagrad...). | Estimated by the hyperparameter optimization method. |
| Not set manually. | Set manually by the developer. |
| The final value is obtained after the learning phase and determines the accuracy of the model and how it will predict new data. | The choice of these values determines how efficient the training will be and has a big impact on the parameter optimization process. |
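
To make the contrast concrete, here is a minimal sketch with a toy dataset (the numbers are made up just for illustration): the hyperparameters are what we pass to the constructor, and the parameters are what the model learns when we call fit:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Hyperparameters: chosen by the developer before training
toy_model = LogisticRegression(C = 1.0, max_iter = 100)

# Parameters: learned from the data during training
toy_model.fit(X, y)
print(toy_model.coef_, toy_model.intercept_)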

Hyperparameter optimization process

Normally, we do not know in advance the optimal hyperparameter values that would generate the best model results, so including this search is a vital step in any Machine Learning model-building process.

There are several strategies to carry it out. First, we train a base model:

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the pre-cleaned Titanic train and test splits
train_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_titanic_train.csv")
test_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_titanic_test.csv")

# Separate the predictors from the target
X_train = train_data.drop(["Survived"], axis = 1)
y_train = train_data["Survived"]
X_test = test_data.drop(["Survived"], axis = 1)
y_test = test_data["Survived"]

# Train a logistic regression with its default hyperparameters
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Accuracy of this baseline model
base_accuracy = accuracy_score(y_test, y_pred)
base_accuracy
Out[1]:
0.8473282442748091

As we can see, the "base" accuracy, using the default configuration of the model, is 84.7%. Let's see if we can improve these results using the different techniques.

Grid search

Grid search is a method that performs an exhaustive search over a specific (manually defined) grid of hyperparameter values, trying every possible combination and keeping the one that produces the best model.

In [2]:
from sklearn.model_selection import GridSearchCV

# We define the hyperparameters that we want to tune, by hand
hyperparams = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    "penalty": ["l1", "l2", "elasticnet", None],
    "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]
}

# We initialize the grid
grid = GridSearchCV(model, hyperparams, scoring = "accuracy", cv = 5)
grid
Out[2]:
GridSearchCV(cv=5, estimator=LogisticRegression(),
             param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                         'penalty': ['l1', 'l2', 'elasticnet', None],
                         'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag',
                                    'saga']},
             scoring='accuracy')
In [3]:
# Silence the warnings raised by incompatible hyperparameter combinations during the search
import warnings
warnings.filterwarnings("ignore")

grid.fit(X_train, y_train)

print(f"Best hyperparameters: {grid.best_params_}")
Best hyperparameters: {'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}

As we can see, the best hyperparameters found with this technique are:

  • C: 10
  • penalty: l1
  • solver: liblinear

Also, remember that the search must always be fitted on the training dataset. Now we just have to repeat the training, setting these hyperparameters in the model:

In [4]:
model_grid = LogisticRegression(penalty = "l1", C = 10, solver = "liblinear")
model_grid.fit(X_train, y_train)
y_pred = model_grid.predict(X_test)

grid_accuracy = accuracy_score(y_test, y_pred)
grid_accuracy
Out[4]:
0.851145038167939

We observe an improvement of roughly 0.4 percentage points over the baseline, which on a real-world dataset is already a big win!

In addition, we have used three of the many hyperparameters that this model accepts. We could build a much more complex grid (and one that would take longer to run) to improve the results.
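
As a side note (a shortcut rather than part of the exercise): because GridSearchCV refits the best combination on the whole training set by default (refit=True), the fitted search object already exposes that model, so we could also write something like:

# The search object keeps the refitted best model and its cross-validated score
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
accuracy_score(y_test, y_pred)

grid.best_score_  # mean cross-validated accuracy of the best combination on the training data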

Pros and cons of this strategy

Points in favor:

  • Exhaustiveness: It tests all possible combinations of hyperparameters within the provided grid, so if the optimal combination is within it, this methodology will find it.
  • Reproducibility: Due to its deterministic (non-random) nature, the same result will always be obtained with the same parameters and input.

However, the following negative points should be noted:

  • Efficiency: It is very computationally expensive. It can be time-consuming and resource-intensive, especially if the number of hyperparameters is large and/or the range of values is wide (see the quick calculation after this list).
  • It does not guarantee the best results, since it depends on the hyperparameters and the values set by the developer.
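
To illustrate the efficiency point with the grid we actually defined above, this is roughly how many model fits the search performs:

# 7 values of C x 4 penalties x 5 solvers = 140 combinations
n_combinations = 7 * 4 * 5

# Each combination is evaluated with 5-fold cross-validation (cv = 5)
n_fits = n_combinations * 5

print(n_combinations, n_fits)  # 140 combinations, 700 model fits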

Random search

Random search is similar to the previous strategy but, instead of testing all possible combinations of predefined hyperparameter values, it randomly samples a fixed number of combinations to test.

In [5]:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

# We define the hyperparameters we want to tune and the values to sample from
hyperparams = {
    "C": np.logspace(-4, 4, 20),
    "penalty": ["l1", "l2", "elasticnet", None],
    "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]
}

# We initialize the random search
random_search = RandomizedSearchCV(model, hyperparams, n_iter = 100, scoring = "accuracy", cv = 5, random_state = 42)
random_search
Out[5]:
RandomizedSearchCV(cv=5, estimator=LogisticRegression(), n_iter=100,
                   param_distributions={'C': array([1.00000000e-04, 2.63665090e-04, 6.95192796e-04, 1.83298071e-03,
       4.83293024e-03, 1.27427499e-02, 3.35981829e-02, 8.85866790e-02,
       2.33572147e-01, 6.15848211e-01, 1.62377674e+00, 4.28133240e+00,
       1.12883789e+01, 2.97635144e+01, 7.84759970e+01, 2.06913808e+02,
       5.45559478e+02, 1.43844989e+03, 3.79269019e+03, 1.00000000e+04]),
                                        'penalty': ['l1', 'l2', 'elasticnet',
                                                    None],
                                        'solver': ['newton-cg', 'lbfgs',
                                                   'liblinear', 'sag',
                                                   'saga']},
                   random_state=42, scoring='accuracy')
In [6]:
random_search.fit(X_train, y_train)

print(f"Best hyperparameters: {random_search.best_params_}")
Best hyperparameters: {'solver': 'lbfgs', 'penalty': 'l2', 'C': 29.763514416313132}

As we can see, the best hyperparameters found with this technique are:

  • C: 29.7635
  • penalty: l2
  • solver: lbfgs

In addition, we can see in the logs that some combinations failed due to incompatibilities between hyperparameters (for example, some solvers do not support some penalty values). This is handled by the search itself, and we should not worry, since it will always return the best combination among those that ran without errors.
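
If we wanted to avoid those failed combinations altogether, one option (just a sketch, not needed for this exercise) is to pass a list of parameter grids so that each solver is only paired with penalties it supports; both GridSearchCV and RandomizedSearchCV accept this format:

# Restrict the search space to compatible solver/penalty pairs
hyperparams_compatible = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": np.logspace(-4, 4, 20)},
    {"solver": ["newton-cg", "lbfgs", "sag"], "penalty": ["l2", None], "C": np.logspace(-4, 4, 20)},
    {"solver": ["saga"], "penalty": ["l1", "l2", None], "C": np.logspace(-4, 4, 20)}
]

random_search_compatible = RandomizedSearchCV(model, hyperparams_compatible, n_iter = 100,
                                              scoring = "accuracy", cv = 5, random_state = 42)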

With the new hyperparameters found by the random search, we retrain the model:

In [7]:
model_random_search = LogisticRegression(penalty = "l2", C = 29.7635, solver = "lbfgs")
model_random_search.fit(X_train, y_train)
y_pred = model_random_search.predict(X_test)

random_search_accuracy = accuracy_score(y_test, y_pred)
random_search_accuracy
Out[7]:
0.851145038167939

As we can see, it yields the same accuracy as the previous strategy. This means that, with the hyperparameters we have tried to optimize, we have reached a local maximum: to improve the model's results, we would have to repeat the optimization including other hyperparameters, since tuning only penalty, C and solver will not take the model any further.

Pros and cons of this strategy

Points in favor:

  • Efficiency: it is generally faster than grid search, since it does not try all possible combinations, but randomly selects a specific number of them.
  • It can get closer to the global optimum, since the values it samples are not restricted to a fixed grid (we can even sample from continuous distributions, as sketched after these lists).

Points against:

  • Randomness: It does not guarantee the same solution in each run, unless a seed (random_state) is fixed.
  • It is not exhaustive: It may not try the best combination of hyperparameters if we are unlucky with the random selection.
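
Regarding the point about not being tied to a fixed grid: instead of the np.logspace values we listed above, RandomizedSearchCV can also sample C from a continuous distribution. A minimal sketch (it needs scipy, which is not imported elsewhere in this lesson):

from scipy.stats import loguniform

hyperparams_continuous = {
    "C": loguniform(1e-4, 1e4),  # C is sampled continuously on a logarithmic scale
    "penalty": ["l1", "l2", "elasticnet", None],
    "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]
}

random_search_continuous = RandomizedSearchCV(model, hyperparams_continuous, n_iter = 100,
                                              scoring = "accuracy", cv = 5, random_state = 42)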

When to use each strategy?

Both are hyperparameter search techniques and can be useful in different situations. Grid search is more appropriate when we have a small, well-defined set of hyperparameters, and random search is more useful when the hyperparameter space is large and/or we do not have a clear idea of which values might work best.