
Understanding And Doing Hyper-Parameter Tuning For Machine Learning: A Beginner’s Guide


Hyper-parameter tuning is an integral part of the machine learning model development process.

By comparing the predictive capacity of logistic regression models with various hyperparameter values, I demonstrate the significance of hyperparameter tuning.

First things first.
What are hyperparameters, and how do you use them? (The what)

Hyperparameter vs. parameter

  • Parameters are estimated from the dataset. They are part of the model. A logistic regression model is represented by the equation h_θ(x) = 1 / (1 + e^(-θᵀx)), where theta (θ) is a vector that contains the model’s parameters.

  • Hyperparameters are set manually to help estimate the model parameters. They are not part of the final model.
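
To make the distinction concrete, here is a minimal sketch in plain Python. The function name `predict_proba`, the theta values, and the two hyperparameter values are made up for illustration; only theta is needed to make a prediction.

```python
import math

def predict_proba(theta, x):
    """Logistic regression model: sigmoid of the dot product of theta and x."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

# Parameters: part of the model, normally estimated from data.
# These values are invented for illustration.
theta = [0.5, -1.2, 2.0]

# Hyperparameters: chosen by hand before training, not stored in the model.
learning_rate = 0.01   # only used while fitting theta
regularisation = 0.1   # only used while fitting theta

x = [1.0, 0.3, 0.8]    # first entry is the bias input
p = predict_proba(theta, x)
print(round(p, 4))
```

Note that `learning_rate` and `regularisation` never appear inside `predict_proba`: they shape how theta is found, not how the fitted model predicts.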

Examples of hyperparameters in logistic regression

  1. Learning rate (α). Gradient descent is one method of training a logistic regression model. The learning rate (α) is a key part of the gradient descent algorithm. It determines how much the parameter theta changes with each iteration.

[Figure: gradient descent update for parameter theta, θ_j := θ_j - α · ∂J(θ)/∂θ_j]

Need a refresher on gradient descent? Read this article on linear regression and gradient descent.
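
The role of the learning rate can be sketched in plain Python. This is a toy batch gradient descent for logistic regression under my own assumptions: the helper names (`gradient_step`, `log_loss`), the four data points, and the value of `alpha` are all invented, and this is not the code used for the results in this article.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(theta, X, y, alpha):
    """One batch update: theta_j <- theta_j - alpha * dJ/dtheta_j."""
    m = len(X)
    preds = [sigmoid(sum(t * xi for t, xi in zip(theta, row))) for row in X]
    new_theta = []
    for j in range(len(theta)):
        grad_j = sum((preds[i] - y[i]) * X[i][j] for i in range(m)) / m
        new_theta.append(theta[j] - alpha * grad_j)
    return new_theta

def log_loss(theta, X, y):
    """Average binary cross-entropy of the model on (X, y)."""
    total = 0.0
    for row, label in zip(X, y):
        p = sigmoid(sum(t * xi for t, xi in zip(theta, row)))
        total += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return total / len(X)

# Toy data: first column is the bias input, second is a single feature.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
alpha = 0.1  # the learning rate (a hyperparameter)

theta = [0.0, 0.0]
loss_before = log_loss(theta, X, y)
for _ in range(200):
    theta = gradient_step(theta, X, y, alpha)
loss_after = log_loss(theta, X, y)
print(loss_before, loss_after)
```

A smaller `alpha` makes each step more cautious, and a much larger one can make the updates overshoot the minimum.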

  2. Regularisation parameter (λ). The regularisation parameter (λ) is a constant in the “penalty” term of the cost function. Adding this penalty to the cost function is called regularisation. There are two types of regularisation, L1 and L2. They differ in the penalty equation.

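[Figure: penalty terms for L1 regularisation (sum of absolute weights) and L2 regularisation (sum of squared weights), each scaled by λ]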

In linear regression, the cost function is simply the sum of squared errors. Adding an L2 regularisation term gives:

[Figure: cost function for linear regression with an L2 regularisation term]

In logistic regression, the cost function is the binary cross-entropy, or log loss, function. Adding an L2 regularisation term gives:

[Figure: cost function for logistic regression with an L2 regularisation term]
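
For reference, the two regularised cost functions can be written out as follows. This is one common textbook form and not a reproduction of the original figures: scaling constants such as 1/m or 1/(2m) vary between sources, and the indexing (m examples, n weights) is my assumption.

```latex
% Linear regression: sum of squared errors plus an L2 penalty
J(\theta) = \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
            + \lambda \sum_{j=1}^{n} \theta_j^2

% Logistic regression: binary cross-entropy (log loss) plus an L2 penalty
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)})
            + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]
            + \lambda \sum_{j=1}^{n} \theta_j^2
```

For L1 regularisation, the penalty sum uses |θ_j| in place of θ_j².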

What is the purpose of regularisation?
During training, a model has to find a weight for each feature. Each weight is a component of the theta vector. Because having a weight for a feature now incurs a cost, the model is incentivized to push the weights of some features closer to 0. As a result, regularisation reduces the complexity of a model, which helps prevent overfitting.
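
The shrinking effect is easiest to see in a case with a closed-form answer. The sketch below uses linear regression with a single feature and no intercept (a simplification, not the logistic model from this article), where minimising sum((theta*x - y)^2) + lambda*theta^2 gives theta = sum(x*y) / (sum(x^2) + lambda). The data and the name `ridge_weight` are made up.

```python
def ridge_weight(xs, ys, lam):
    """Minimiser of sum((theta*x - y)^2) + lam * theta^2 for one feature."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # invented data, roughly y = 2x

# A larger penalty constant pulls the fitted weight towards 0.
weights = {lam: ridge_weight(xs, ys, lam) for lam in [0, 1, 10, 100]}
for lam, w in weights.items():
    print(f"lambda={lam:>3}  theta={w:.3f}")
```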

What’s the best way to tune hyperparameters? — The method

Let’s talk about the tuning process now that we know WHAT to tune.
Tuning hyperparameters can be done in a variety of ways. Grid Search and Random Search are two of them.

Grid Search

In a grid search, we provide a list of values for each hyperparameter. The model is then evaluated for every combination of the values in these lists.
The pseudocode looks like this:

import itertools

penalty = ['none', 'l1', 'l2']
lambda_ = [0.001, 0.1, 1, 5, 10]  # "lambda" is a reserved word in Python
alpha = [0.001, 0.01, 0.1]
hyperparameters = [penalty, lambda_, alpha]

# grid_values is a list of all possible combinations of penalty, lambda, and alpha
grid_values = list(itertools.product(*hyperparameters))

scores = []
for combination in grid_values:
    # create a logistic regression classifier
    classifier = MyLogisticRegression(penalty=combination[0], ...)
    # train the model with training data
    classifier.fit(X_train, y_train)
    # score the model with test data
    score = classifier.score(X_test, y_test)
    scores.append([combination, score])

# Use scores to determine which combination had the best score
print(scores)


(In practice, we would look at several different types of “scores,” such as accuracy, F1 score, and so on. I’ll go over these in more detail in a later section.)

Random Search
In a random search, we don’t provide a list of values for each hyperparameter. Instead, we give the searcher a distribution for each hyperparameter. The search algorithm then tries random combinations of values to find the best one. Random search is much more efficient for large sets of hyperparameters.
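
A random search can be sketched in plain Python in the same spirit as the grid search pseudocode. Everything here is an assumption for illustration: the sampling ranges, the number of trials, and especially `evaluate`, which is a hypothetical stand-in for "train a model and return its test score".

```python
import math
import random

def sample_hyperparameters(rng):
    """Draw one random combination (ranges are illustrative assumptions)."""
    return {
        "penalty": rng.choice(["none", "l1", "l2"]),
        "lambda": 10 ** rng.uniform(-3, 1),   # log-uniform between 0.001 and 10
        "alpha": 10 ** rng.uniform(-3, -1),   # log-uniform between 0.001 and 0.1
    }

def evaluate(params):
    """Hypothetical stand-in for fitting a model and scoring it on test data."""
    return -abs(math.log10(params["lambda"])) - abs(math.log10(params["alpha"]) + 2)

rng = random.Random(0)  # fixed seed so the trials are reproducible
trials = [sample_hyperparameters(rng) for _ in range(20)]
best = max(trials, key=evaluate)
print(best)
```

Unlike a grid, the number of trials is chosen directly, so adding another hyperparameter does not multiply the number of models to train.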

We must first process the data before we can train the model. In essence, I had to:

  1. Remove columns that aren’t necessary, such as “Name.”
  2. Remove the two rows with a missing value for “Embark.”
  3. Fill in the missing age values (there are 177 in total) with a guess. The guess, in this case, is based on the value of “Parch,” the number of parents and children on board.
  4. Use one-hot encoding to transform categorical variables.

You will find the code for these data processing steps, along with a more detailed description, in my Jupyter Notebook.
This is my final DataFrame after all of the data processing.

[Figure: final DataFrame after data processing]
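
The four processing steps listed above could look roughly like this in pandas. This is a sketch on a tiny invented DataFrame, not the notebook code: the column names follow the text, the values are made up, and filling ages with the median of each "Parch" group is only one possible way to base the guess on "Parch".

```python
import numpy as np
import pandas as pd

# Tiny made-up frame; column names follow the steps in the text.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Sex": ["male", "female", "female", "male", "female"],
    "Age": [22.0, np.nan, 30.0, np.nan, 4.0],
    "Parch": [0, 0, 0, 2, 2],
    "Embark": ["S", "C", None, "S", "Q"],
})

# 1. Drop columns that are not needed.
df = df.drop(columns=["Name"])
# 2. Drop rows with a missing "Embark" value.
df = df.dropna(subset=["Embark"])
# 3. Fill missing ages with the median age of rows sharing the same "Parch"
#    (an assumed rule; the article's exact guess is not shown here).
df["Age"] = df["Age"].fillna(df.groupby("Parch")["Age"].transform("median"))
# 4. One-hot encode the categorical columns.
df = pd.get_dummies(df, columns=["Sex", "Embark"])
print(df)
```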

For model training and evaluation, I split the data into a training set and a test set.
Now let’s look at how the learning rate (alpha) and regularisation affect model performance.
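
A train/test split like the one mentioned above is commonly done with Scikit-Learn's train_test_split. The sketch below uses random stand-in data, and the 25% test size, the seed, and the stratification are my assumptions rather than the settings used for this article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-ins for the processed features X and labels y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out 25% of the rows for testing and keep the class balance similar
# in both parts (test_size, random_state and stratify are assumptions).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape)
```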

Why is it necessary to tune hyperparameters? — The rationale
Hyperparameter tuning affects a model’s accuracy and F1 score, as you’ll see shortly. Not sure what these metrics mean? See my previous Titanic article for definitions.

Effect of regularization
To fit and test my models, I used Scikit-Learn’s LogisticRegression classifier. There are several solvers to choose from, each with its own convergence algorithm. I’m going to use the “saga” solver as an example. It’s the only solver that can handle L1, L2, and no regularisation at all.
Note: instead of the regularisation parameter λ, Scikit-Learn’s LogisticRegression classifier takes a “C,” which is the inverse of the regularisation strength. Think of it as 1/λ.
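
The inverse relationship between C and the regularisation strength can be checked with a short sketch. The data is synthetic (make_classification), the lambda values are arbitrary, and treating C as exactly 1/λ ignores constant scaling differences between textbook cost functions and Scikit-Learn's objective.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to illustrate the effect of C.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

norms = {}
for lam in [0.1, 1.0, 10.0]:
    # Stronger regularisation (larger lambda) means a smaller C.
    clf = LogisticRegression(penalty="l2", C=1.0 / lam, max_iter=1000)
    clf.fit(X, y)
    norms[lam] = float(np.linalg.norm(clf.coef_))
print(norms)
```

The weight vector should get shorter as lambda grows, which is the shrinking effect of the penalty term.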

I used Scikit-Learn’s GridSearchCV to get the model’s score for each combination of penalty = ["none", "l1", "l2"] and C = [0.05, 0.1, 0.5, 1, 5].

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

clf = LogisticRegression(solver='saga', max_iter=5000, random_state=0)
# Note: scikit-learn 1.2 and later spell the 'none' option as penalty=None
param_grid = {'penalty': ['none', 'l1', 'l2'], 'C': [0.05, 0.1, 0.5, 1, 5]}
grid_search = GridSearchCV(clf, param_grid=param_grid)
grid_search.fit(X, y)
result = grid_search.cv_results_

[Figure: grid search scores for each combination of penalty and C]

The best results were obtained using L2 regularisation with a C of 0.1!
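
One way to read off the best combination from a grid search is through best_params_, best_score_, and cv_results_. The sketch below runs on synthetic data, so its numbers have nothing to do with the results reported here, and it leaves out the "no regularisation" option to stay independent of the Scikit-Learn version.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the data from this article).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

clf = LogisticRegression(solver="saga", max_iter=5000, random_state=0)
param_grid = {"penalty": ["l1", "l2"], "C": [0.05, 0.1, 0.5, 1, 5]}
# By default GridSearchCV uses 5-fold cross-validation and, for a classifier
# like this one, accuracy as the score.
grid_search = GridSearchCV(clf, param_grid=param_grid)
grid_search.fit(X, y)

# One row per combination, best first.
table = pd.DataFrame(grid_search.cv_results_)[
    ["param_penalty", "param_C", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(table.head())
print(grid_search.best_params_, grid_search.best_score_)
```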
Side note #1: I also used Scikit-Learn’s RandomizedSearchCV to implement a random search. If you’re interested, the example can be found in my Jupyter Notebook.
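
For readers without the notebook at hand, a random search with RandomizedSearchCV might look like the sketch below. It is not the notebook code: the data is synthetic, and the log-uniform range for C and the number of iterations are assumptions.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

clf = LogisticRegression(solver="saga", max_iter=5000, random_state=0)
param_distributions = {
    "penalty": ["l1", "l2"],        # lists are sampled uniformly
    "C": loguniform(1e-2, 1e1),     # continuous distribution (assumed range)
}
search = RandomizedSearchCV(
    clf, param_distributions=param_distributions, n_iter=10, random_state=0
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```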

Side note #2: You may have noticed that no regularisation outperformed L1, and that in many cases the difference between no regularisation and L2 was insignificant. My best guess is that Scikit-Learn’s LogisticRegression already works well without regularisation. Nonetheless, regularisation brought some improvement.
Regularization is essential in the SGDClassifier, as we’ll see later.
I then compared several performance metrics without regularisation and with L2 regularisation side by side.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
tuned = LogisticRegression(solver='saga', penalty='l2', C=0.1, max_iter=5000, random_state=2)
not_tuned = LogisticRegression(solver='saga', penalty='none', max_iter=5000, random_state=2)
tuned.fit(X_train, y_train)
not_tuned.fit(X_train, y_train)
y_pred_tuned = tuned.predict(X_test)
y_pred_not_tuned = not_tuned.predict(X_test)
data = {
    'accuracy': [accuracy_score(y_test, y_pred_tuned), accuracy_score(y_test, y_pred_not_tuned)],
    'precision': [precision_score(y_test, y_pred_tuned), precision_score(y_test, y_pred_not_tuned)],
    'recall': [recall_score(y_test, y_pred_tuned), recall_score(y_test, y_pred_not_tuned)],
    'f1 score': [f1_score(y_test, y_pred_tuned), f1_score(y_test, y_pred_not_tuned)]
}
pd.DataFrame.from_dict(data, orient='index', columns=['tuned', 'not tuned'])



The tuned model outperformed the untuned one in every metric except recall. If you need a refresher on what these metrics mean, read this blog post.


Effect of learning rate (and regularization)
To see how different learning rates affect model performance, I used Scikit-Learn’s SGDClassifier (stochastic gradient descent classifier). The LogisticRegression classifier does not allow me to adjust the learning rate.

SGDClassifier has three parameters that we can change: alpha, learning_rate, and eta0. The names are a little confusing, so please bear with me.

  • learning_rate sets the type of learning rate (“optimal” vs. “constant”).
  • eta0 is the algorithm’s learning rate when learning_rate is “constant.” Normally, I refer to eta0 as alpha.
  • alpha is the constant that multiplies the regularisation term. It is also used to compute the learning rate when learning_rate is “optimal.” alpha serves the purpose of what is commonly referred to as lambda.

As a result, there are several ways to set the learning rate in SGDClassifier. For a constant learning rate, set learning_rate='constant' and eta0 to the learning rate you want. For a dynamic learning rate (one that changes depending on the step you’re on), set learning_rate='optimal'. In the “optimal” case, eta0 is not used, and alpha serves as both the regularisation strength and a constant in computing the dynamic learning rate at each step.
The grid search for the best hyperparameters with a constant learning rate is shown below. I set the maximum number of iterations to 50,000.
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
# loss="log" is named "log_loss" in scikit-learn 1.1 and later
sgd = SGDClassifier(loss="log", penalty="l2", max_iter=50000, random_state=100)
param_grid = {
    'eta0': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1],
    'learning_rate': ['constant'],
    'alpha': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1]
}
grid_search = GridSearchCV(sgd, param_grid=param_grid)
grid_search.fit(X, y)
result = grid_search.cv_results_

The search picked alpha (here, the regularisation strength) of 0.1 and eta0 (the learning rate) of 0.0001 as the best parameters, with a score of 0.7176.
I plotted accuracy vs. learning rate (eta0) for a few different values of regularisation strength (alpha). As you can see, both the learning rate and the regularisation strength have a huge impact on a model’s performance.


For a learning rate of 0.00001, the accuracy is very poor. This is most likely because gradient descent converges too slowly; we’re nowhere near the minimum after 50,000 iterations. The accuracy is also poor for high learning rates (0.1 and 1), most likely because of overshooting. A zoomed-in plot of all the alphas is shown below.

Regularization strength (alpha) is also essential for accuracy. There is a wide range of accuracy for any given learning rate (eta0), depending on the alpha value.
Learning rate and regularisation are just two of the hyperparameters found in machine learning models. Each machine learning algorithm has its own set of hyperparameters.

