Hyperparameter optimization (*HPO*) is the process of searching for the hyperparameter values that give a model its best performance. Hyperparameters, unlike model parameters, are set by the engineer prior to training.

A **hyperparameter** is a configuration variable external to the model that is used to train it. Depending on the model, we can find a multitude of hyperparameters:

- Learning rate in gradient descent.
- Number of iterations in gradient descent.
- Number of layers in a Neural Network.
- Number of neurons per layer in a Neural Network.
- Number of neighbors (k) in a k-NN model.

A **parameter** of a model is a value that is optimized during training and that shapes what the model learns. We as developers do not set these values directly. In a linear regression, for example, the parameters are the slope and the intercept.

With the training dataset and a learning algorithm (such as *gradient descent*, mentioned above), these values are adjusted so that the model learns how to classify or predict new cases.

A **hyperparameter**, in contrast, is established before the training phase and lets the developer define the context in which the model is trained.
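The difference can be made concrete with a minimal sketch of gradient descent on a linear regression. The toy data, the learning rate and the iteration count are illustrative assumptions rather than values from this lesson: the learning rate and the number of iterations are hyperparameters that we fix up front, while the slope and the intercept are parameters that the loop learns from the data.

```python
import numpy as np

# Toy data generated from a known line, y = 2x + 1 (illustrative values)
x = np.linspace(-1, 1, 50)
y = 2 * x + 1

# Hyperparameters: chosen by us before training starts
learning_rate = 0.1
n_iterations = 500

# Parameters: start at arbitrary values and are learned from the data
slope, intercept = 0.0, 0.0

for _ in range(n_iterations):
    error = (slope * x + intercept) - y
    # Gradients of the mean squared error with respect to each parameter
    grad_slope = 2 * np.mean(error * x)
    grad_intercept = 2 * np.mean(error)
    slope -= learning_rate * grad_slope
    intercept -= learning_rate * grad_intercept

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

Changing `learning_rate` or `n_iterations` changes how quickly (or whether) the loop reaches the true slope and intercept, which is exactly why these values are worth tuning.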

| Parameter | Hyperparameter |
|---|---|
| Indispensable for predictions. | Indispensable for initializing the model parameters, which will be optimized later. |
| They are estimated by learning algorithms (gradient descent, Adam, Adagrad...). | They are estimated by the optimization method. |
| They are not set manually. | They are set manually. |
| The final value is obtained after the learning phase and will determine the accuracy of the model and how it will predict new data. | The choice of these values will determine how efficient the training will be. It also has a big impact on the parameter optimization process. |

Normally, we do not know in advance which hyperparameter values will produce the best model results. Hyperparameter optimization is therefore an important step in any Machine Learning model building process.

There are several strategies to carry it out. First, we train a base model:

```
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the already cleaned Titanic train and test sets
train_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_titanic_train.csv")
test_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/machine-learning-content/master/assets/clean_titanic_test.csv")

# Separate the predictors from the target ("Survived")
X_train = train_data.drop(["Survived"], axis = 1)
y_train = train_data["Survived"]
X_test = test_data.drop(["Survived"], axis = 1)
y_test = test_data["Survived"]

# Train a model with the default hyperparameters and evaluate it on the test set
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
base_accuracy = accuracy_score(y_test, y_pred)
base_accuracy
```

As we can see, the "base" accuracy, using the default configuration of the model, is 84.7%. Let's see if we can improve these results using the different techniques.

**Grid search** is a method that performs an exhaustive search over a manually specified set of hyperparameter values: it tries every possible combination of them and keeps the one that produces the best model.

```
from sklearn.model_selection import GridSearchCV
# We define the parameters that we want to adjust by hand
hyperparams = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    "penalty": ["l1", "l2", "elasticnet", None],
    "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]
}
# We initialize the grid
grid = GridSearchCV(model, hyperparams, scoring = "accuracy", cv = 5)
grid
```

```
import warnings

def warn(*args, **kwargs):
    # No-op replacement that silences every warning raised during the search
    pass

warnings.warn = warn

grid.fit(X_train, y_train)
print(f"Best hyperparameters: {grid.best_params_}")
```

As we can see, the best hyperparameters found with this technique are:

- `C`: 10
- `penalty`: l1
- `solver`: liblinear

Note that the search must always be fitted on the training dataset only. Now we just have to repeat the training, setting these hyperparameters in the model:

```
model_grid = LogisticRegression(penalty = "l1", C = 10, solver = "liblinear")
model_grid.fit(X_train, y_train)
y_pred = model_grid.predict(X_test)
grid_accuracy = accuracy_score(y_test, y_pred)
grid_accuracy
```

We observe an improvement of just under 1%, which on a real-world dataset is a significant gain!

In addition, we have used three of the many hyperparameters that this model accepts. We could build a much more complex grid (and one that would take longer to run) to improve the results.
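Retraining by hand with the best values is one way to use the result of a search. As a hedged alternative sketch: `GridSearchCV` refits the best configuration on the whole training set by default (`refit=True`), so the tuned model is also available directly as `best_estimator_`. The synthetic dataset and the small grid over `C` below are assumptions made to keep the example self-contained; they are not the Titanic data used in this lesson.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for a real dataset (an assumption made for this sketch)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# A deliberately small grid: only the regularization strength C is tuned
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1, 10]},
    scoring="accuracy",
    cv=5,
)
search.fit(X_tr, y_tr)  # the search only ever sees the training split

# With refit=True (the default), the best configuration is already retrained
tuned_model = search.best_estimator_
test_accuracy = tuned_model.score(X_te, y_te)

print("Best hyperparameters:", search.best_params_)
print("Mean cross-validated accuracy:", round(search.best_score_, 3))
print("Test accuracy:", round(test_accuracy, 3))
```

Here `best_score_` is the mean cross-validated score on the training data, so it should not be confused with the accuracy measured on the held-out test set.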

As points in favor, we can find:

- Exhaustiveness: It tests all possible combinations of hyperparameters within the provided grid, so if the optimal combination is within it, this methodology will find it.
- Reproducibility: Due to its deterministic (non-random) nature, the same result will always be obtained with the same parameters and input.

However, the following negative points should be noted:

- Efficiency: It is very computationally expensive. It can be time consuming and resource intensive, especially if the number of hyperparameters is large and/or the range of values is wide.
- It does not guarantee the best results, since it depends on the hyperparameters and the values set by the developer.
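The cost of exhaustiveness can be estimated before running anything. The sketch below counts the combinations in the grid defined earlier using scikit-learn's `ParameterGrid`, then multiplies by the number of cross-validation folds; the variable names `n_combinations`, `cv_folds` and `n_fits` are just illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the grid search example above
hyperparams = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    "penalty": ["l1", "l2", "elasticnet", None],
    "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
}

n_combinations = len(ParameterGrid(hyperparams))  # 7 * 4 * 5
cv_folds = 5
n_fits = n_combinations * cv_folds  # one model trained per combination and fold

print(n_combinations, n_fits)  # 140 700
```

Adding one more hyperparameter with, say, four candidate values would multiply both numbers by four, which is why grids grow expensive so quickly.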

**Random search** is similar to grid search but, instead of testing every possible combination of the established hyperparameter values, it randomly selects a fixed number of combinations to test.

```
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
# We define the parameters we want to adjust
hyperparams = {
    "C": np.logspace(-4, 4, 20),
    "penalty": ["l1", "l2", "elasticnet", None],
    "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]
}
# We initialize the random search
random_search = RandomizedSearchCV(model, hyperparams, n_iter = 100, scoring = "accuracy", cv = 5, random_state = 42)
random_search
```

```
random_search.fit(X_train, y_train)
print(f"Best hyperparameters: {random_search.best_params_}")
```

As we can see, the best hyperparameters found with this technique are:

- `C`: 29.7635
- `penalty`: l2
- `solver`: lbfgs

In addition, we can see in the logs that some fits failed because of incompatibilities between hyperparameters (for example, a `penalty` value that the chosen `solver` does not support). The search handles this itself: the failed combinations are discarded, and it returns the best combination among those that trained successfully.

With this new *hyperparameterization*, we retrain the model:

```
model_random_search = LogisticRegression(penalty = "l2", C = 29.7635, solver = "lbfgs")
model_random_search.fit(X_train, y_train)
y_pred = model_random_search.predict(X_test)
random_search_accuracy = accuracy_score(y_test, y_pred)
random_search_accuracy
```

As we can see, it yields the same level of accuracy as the previous strategy. This suggests that, with the hyperparameters we have tried to optimize, we have reached a **local maximum**: to improve the results further we would have to repeat the optimization including other hyperparameters, since tuning only `penalty`, `C` and `solver` will not improve the model beyond its current level.

As points in favor, we can find:

- Efficiency: it is generally faster than grid search, since it does not try all possible combinations, but randomly selects a specific number of them.
- It can get closer to the global optimum, since the values are drawn at random rather than taken from a fixed grid.

As unfavorable points, we can find:

- Randomness: It does not guarantee the same solution in each run, unless a seed (`random_state`) is fixed.
- It is not exhaustive: You may not try the best combination of hyperparameters if you are unlucky with random selection.
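Both the "no fixed grid" advantage and the role of the seed can be illustrated with scikit-learn's `ParameterSampler`, which produces the kind of random draws that a randomized search evaluates. This is a sketch under assumptions: the continuous `loguniform` distribution for `C` (from SciPy) and the short list of solvers are chosen for illustration and are not the search space used earlier.

```python
from scipy.stats import loguniform
from sklearn.model_selection import ParameterSampler

# C is drawn from a continuous distribution instead of a fixed list of values
space = {
    "C": loguniform(1e-4, 1e4),
    "solver": ["lbfgs", "liblinear", "saga"],
}

# Two samplers with the same seed produce the same candidates
draws_a = list(ParameterSampler(space, n_iter=5, random_state=42))
draws_b = list(ParameterSampler(space, n_iter=5, random_state=42))

for candidate in draws_a:
    print(candidate)

print("Same draws with the same seed:", draws_a == draws_b)
```

Passing a distribution such as this one for `C` to `RandomizedSearchCV` works the same way, and fixing `random_state` makes the whole search repeatable.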

Both are hyperparameter search techniques and can be useful in different situations. Grid search is more appropriate when we have a small, well-defined set of hyperparameters, and random search is more useful when there is a large hyperparameter space and/or we do not have a clear idea of what might be the best values to optimize.