
Model Hyperparameter Optimization

Hyperparameter optimization (HPO) is the process of searching for the hyperparameter values that yield the best-performing version of a model. Unlike model parameters, hyperparameters are set by the engineer before training.

What is a hyperparameter?

A hyperparameter is a configuration variable external to the model that is used to train it. Depending on the model, we can find a multitude of hyperparameters:

  • Learning rate in gradient descent.
  • Number of iterations in gradient descent.
  • Number of layers in a Neural Network.
  • Number of neurons per layer in a Neural Network.
  • Number of clusters (k) in a k-means model.

Difference between parameter and hyperparameter

The parameters of a model are the values that are optimized during training and that shape its learning. We do not set these values directly as developers. In the case of a linear regression, for example, the parameters are the slope and the intercept.

With the training dataset and a learning algorithm (such as gradient descent, which we saw earlier), the training process adjusts these values so that the model learns to classify or predict new cases.
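The distinction can be seen directly in code. The following sketch (using a toy dataset chosen here for illustration) shows that the slope and intercept of a linear regression are learned parameters exposed only after training, while any configuration passed to the constructor is a hyperparameter set beforehand:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data that follows y = 2x + 1 exactly
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# fit_intercept is a hyperparameter: set before training
model = LinearRegression(fit_intercept=True)
model.fit(X, y)

# coef_ and intercept_ are parameters: estimated by the learning algorithm
print(model.coef_)       # slope, close to 2
print(model.intercept_)  # intercept, close to 1
```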

A hyperparameter, in contrast, is set before the training phase and allows the developer to define the context and prepare the model.

| Parameters | Hyperparameters |
| --- | --- |
| Indispensable for predictions | Indispensable for initializing the model parameters, which will be optimized later |
| Estimated by learning algorithms (gradient descent, Adam, Adagrad...) | Estimated by the optimization method |
| Not set manually | Set manually |
| The final value is obtained after the learning phase and decides the accuracy of the model and how it will predict new data | The choice of these values decides how efficient the training will be, and has a big impact on the parameter optimization process |

Hyperparameter optimization process

Normally, we do not know the optimal hyperparameter values that would produce the best model results. Therefore, hyperparameter optimization is a vital step in building any Machine Learning model.

There are several strategies to carry it out. First, we train a base model:

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

train_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_titanic_train.csv")
test_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_titanic_test.csv")

X_train = train_data.drop(["Survived"], axis = 1)
y_train = train_data["Survived"]
X_test = test_data.drop(["Survived"], axis = 1)
y_test = test_data["Survived"]

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

base_accuracy = accuracy_score(y_test, y_pred)
print(f"Base accuracy: {base_accuracy}")

As we can see, the "base" accuracy, using the default configuration of the model, is 84.7%. Let's see if we can improve these results using different techniques.

Grid search

The grid search is a method that performs an exhaustive search through a specific, manually defined subset of hyperparameter values, trying all possible combinations until the best model is found.

In [2]:
from sklearn.model_selection import GridSearchCV

# We define the parameters by hand that we want to adjust
hyperparams = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    "penalty": ["l1", "l2", "elasticnet", None],
    "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]
}

# We initialize the grid
grid = GridSearchCV(model, hyperparams, scoring = "accuracy", cv = 5)
In [3]:
import warnings

# Silence the warnings raised by invalid hyperparameter combinations
def warn(*args, **kwargs):
    pass
warnings.warn = warn

grid.fit(X_train, y_train)

print(f"Best hyperparameters: {grid.best_params_}")
Best hyperparameters: {'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}

As we can see, the best hyperparameters found with this technique are:

  • C: 10
  • penalty: l1
  • solver: liblinear

Note that the search must always be fitted on the training dataset. Now we just have to retrain the model setting these hyperparameters:

In [4]:
model_grid = LogisticRegression(penalty = "l1", C = 10, solver = "liblinear")
model_grid.fit(X_train, y_train)
y_pred = model_grid.predict(X_test)

grid_accuracy = accuracy_score(y_test, y_pred)
print(f"Grid search accuracy: {grid_accuracy}")

We observe an improvement of just under 1%, which on a real-world dataset is a significant win!

In addition, we have used three of the many hyperparameters that this model accepts. We could build a much more complex grid (and one that would take longer to run) to improve the results.
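As a sketch of what a richer grid might look like, the example below adds two more hyperparameters (`class_weight` and `max_iter`, chosen here as illustrative extras) and uses a small synthetic dataset so it runs on its own. It also shows that the fitted search object already exposes `best_estimator_`, the model refitted with the winning combination, so manual retraining is optional:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Titanic data, so the sketch is self-contained
X, y = make_classification(n_samples=300, n_features=6, random_state=42)

# A richer (hypothetical) grid: more hyperparameters means more combinations
hyperparams = {
    "C": [0.1, 1, 10],
    "solver": ["liblinear"],
    "class_weight": [None, "balanced"],
    "max_iter": [100, 500],
}

grid = GridSearchCV(LogisticRegression(), hyperparams, scoring="accuracy", cv=5)
grid.fit(X, y)

# best_estimator_ is the model already refitted with the best combination
best_model = grid.best_estimator_
print(grid.best_params_)
print(best_model.score(X, y))
```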

Pros and cons of this strategy

As points in favor we can find:

  • Exhaustiveness: It tests all possible combinations of hyperparameters within the provided grid, so if the optimal combination is within it, this methodology will find it.
  • Reproducibility: Due to its deterministic (non-random) nature, the same result will always be obtained with the same parameters and input.

However, the following negative points should be noted:

  • Efficiency: It is very computationally expensive. It can be time consuming and resource intensive, especially if the number of hyperparameters is large and/or the range of values is wide.
  • It does not guarantee the best results, since it depends on the hyperparameters and the values set by the developer.
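To make the efficiency cost concrete, we can count the model fits implied by the grid used earlier. Each added hyperparameter multiplies the total, and cross-validation multiplies it again:

```python
from itertools import product

# The grid from the example above: 7 C values x 4 penalties x 5 solvers
C_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
penalties = ["l1", "l2", "elasticnet", None]
solvers = ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]

n_combinations = len(list(product(C_values, penalties, solvers)))
cv_folds = 5
print(n_combinations)             # 140 combinations
print(n_combinations * cv_folds)  # 700 model fits with 5-fold CV
```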

Random search

The random search is similar to grid search but, instead of testing all possible combinations of pre-established hyperparameter values, it randomly selects a fixed number of combinations to test.

In [5]:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

# We define the parameters we want to adjust
hyperparams = {
    "C": np.logspace(-4, 4, 20),
    "penalty": ["l1", "l2", "elasticnet", None],
    "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]
}

# We initialize the random search
random_search = RandomizedSearchCV(model, hyperparams, n_iter = 100, scoring = "accuracy", cv = 5, random_state = 42)
In [6]:
random_search.fit(X_train, y_train)

print(f"Best hyperparameters: {random_search.best_params_}")
Best hyperparameters: {'solver': 'lbfgs', 'penalty': 'l2', 'C': 29.763514416313132}

As we can see, the best hyperparameters found with this technique are:

  • C: 29.7635
  • penalty: l2
  • solver: lbfgs

In addition, we may see some errors in the logs due to incompatible hyperparameter combinations (values of one attribute that are not supported alongside values of another). The search function handles these failed fits itself, so we should not worry: it will always return the best combination that trained without errors.
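A small sketch of one such incompatibility: the `lbfgs` solver does not support the `l1` penalty, so trying to fit that combination directly raises an error, which is exactly what the search silently discards during its exploration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset just to trigger the fit
X, y = make_classification(n_samples=100, n_features=4, random_state=42)

# 'lbfgs' only supports the 'l2' penalty (or None), so 'l1' is invalid
try:
    LogisticRegression(penalty="l1", solver="lbfgs").fit(X, y)
except ValueError as e:
    print("Incompatible combination:", e)
```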

With this new hyperparameterization, we retrain the model:

In [7]:
model_random_search = LogisticRegression(penalty = "l2", C = 29.7635, solver = "lbfgs")
model_random_search.fit(X_train, y_train)
y_pred = model_random_search.predict(X_test)

random_search_accuracy = accuracy_score(y_test, y_pred)
print(f"Random search accuracy: {random_search_accuracy}")

As we can see, it yields the same accuracy as the previous strategy. This suggests that, with the hyperparameters we have tried to optimize, we have reached a local maximum: tuning only penalty, C, and solver will not improve the model any further, so we would have to repeat the optimization strategy including other hyperparameters to improve the results.

Pros and cons of this strategy

As points in favor we can find:

  • Efficiency: it is generally faster than grid search, since it does not try all possible combinations, but randomly selects a specific number of them.
  • It can get closer to the global optimum, since sampling random values is not constrained to a fixed grid.

As unfavorable points we can find:

  • Randomness: it does not guarantee the same solution in each run, unless a seed (random_state) is fixed.
  • It is not exhaustive: it may not try the best combination of hyperparameters if the random selection is unlucky.
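Both downsides can be mitigated. The sketch below (on a synthetic dataset, for self-containment) samples `C` from a continuous log-uniform distribution via `scipy.stats.loguniform` instead of a fixed list, which lets random search explore values a grid would miss, and fixes `random_state` so the run is reproducible:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# A continuous distribution for C instead of a fixed grid of values
param_distributions = {"C": loguniform(1e-4, 1e4)}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions,
    n_iter=20,
    cv=3,
    scoring="accuracy",
    random_state=42,  # fixing the seed makes the random search reproducible
)
search.fit(X, y)
print(search.best_params_)
```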

When to use each strategy?

Both are hyperparameter search techniques and can be useful in different situations. Grid search is more appropriate when we have a small, well-defined set of hyperparameters, and random search is more useful when the hyperparameter space is large and/or we do not have a clear idea of which values might be best to optimize.