
Hyperparameter Optimization with Bayesian Optimization — Intro and Step-by-Step Implementation from Scratch | by Farzad Mahmoodinobar



A step-by-step tutorial to build Bayesian optimization from the ground up


Hyperparameter optimization has become a necessary step in most machine learning pipelines, and probably the best-known “learning” approach to it is Bayesian Optimization. Hyperparameter optimization is the task of choosing a set of optimal parameters for the cost (or objective) function of a learning algorithm. These parameters can be data-driven (e.g. various training data combinations) or model-driven (e.g. number of layers in a neural network, learning rate, optimizer, batch size, etc.). In state-of-the-art machine learning models with deep architectures, the number of combinations these parameters can take, together with the interplay among them, makes hyperparameter optimization a non-trivial computational task.

In this post, we are going to talk about Bayesian Optimization as a hyperparameter optimization approach that has a memory and learns from each iteration of parameter tuning. Then we will build a Bayesian optimizer from scratch, without using any dedicated Bayesian optimization libraries.

Let’s get started!

Conventional hyperparameter optimization methodologies, such as grid and random search, require calculating a given model’s cost function multiple times to find the most optimized combination of hyperparameters. Since many modern machine learning architectures include a large number of hyperparameters (e.g. deep neural networks), calculating the cost function becomes computationally expensive, decreasing the appeal of conventional methodologies such as grid search. In such scenarios, Bayesian optimization has become one of the most common hyperparameter optimization methodologies, since it can find an optimized solution with a significantly lower number of iterations than traditional approaches such as Grid and Random Search, thanks to learning from each iteration.

This post will mainly focus on Bayesian optimization. If you are interested in learning more about other hyperparameter optimization strategies, such as Grid Search and Random Search, see my separate post on those approaches.

Bayesian optimization can seem conceptually complicated but once implemented, it will feel much simpler. In this section, I am going to provide a conceptual overview of how Bayesian Optimization works and then we will implement this for better understanding.

Bayesian optimization leverages the Bayesian technique of setting a prior over the objective function and then combining it with new information to obtain a posterior function. A prior represents what we know before the new information is available, and a posterior represents what we know about the objective function given the new information. More specifically, a sample of the search space (i.e. a set of hyperparameters in this context) is collected, and then the objective function is calculated for the given sample (i.e. the model is trained and evaluated). Since the objective function is not readily accessible, a “surrogate function” is used as a Bayesian approximation of it. The surrogate function is then updated with the information from the previous sample, taking us from the prior to the posterior. The posterior, representing our best knowledge of the objective function at that point in time, is then used to guide an “acquisition function”. The acquisition function (e.g. Expected Improvement) uses the posterior to score locations in the search space and pick new samples that are more likely to optimize the original cost function.

To continue with the Expected Improvement example, the acquisition function computes the Expected Improvement for each candidate point in the hyperparameter space and returns the one with the largest value. The newly collected sample is then run through the cost function, the posterior is updated, and the process repeats until an acceptably optimized point of the objective function is reached, a good-enough result is produced, or resources are exhausted.
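To make the Expected Improvement idea concrete, here is its standard closed form for a minimization problem under a Gaussian surrogate (this is the same expression the implementation in Step 4 computes), where f_best is the best objective value observed so far and mu(x), sigma(x) are the surrogate’s predictive mean and standard deviation at a candidate point x:

EI(x) = (f_best - mu(x)) * Φ(Z) + sigma(x) * φ(Z),  with Z = (f_best - mu(x)) / sigma(x), and EI(x) = 0 when sigma(x) = 0

Here Φ and φ denote the cumulative distribution function and probability density function of the standard normal distribution.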

This section will focus on a step-by-step implementation of Bayesian Optimization through seven steps. I will list the steps here first and then provide detailed explanation for each, along with the implementation code blocks.

  1. Import libraries
  2. Define the objective (or cost) function
  3. Define parameter bounds
  4. Define the acquisition function
  5. Initialize samples and the surrogate function
  6. Run the Bayesian optimization loop
  7. Return results

Let’s dive in!

Step 1: Import Libraries

We start with importing some of the necessary libraries, as follows:

  • numpy for numerical computing, which is one of the usual suspects in Data Science
  • scipy.stats is a library for statistical functions
  • load_iris is a function from scikit-learn that loads the Iris dataset
  • GaussianProcessRegressor is a class from scikit-learn that implements a Gaussian process regression model
  • Matern is a class from scikit-learn that implements the Matern kernel for the Gaussian process
import numpy as np
import scipy.stats as sps
from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

With our libraries imported, let’s go ahead and define the objective function.

Step 2: Define Objective Function

The objective function takes a set of hyperparameters, C and gamma, as input and returns the negative accuracy of a support vector classifier with an RBF kernel on the Iris dataset. C is the regularization parameter and gamma is the kernel coefficient for the rbf, poly and sigmoid kernels. Details of the kernel coefficient are not critical for our process and can be found in the scikit-learn documentation. We load the Iris dataset using load_iris and split the data into train and test sets. Once the data is ready, a support vector classifier is trained and the negative accuracy on the test set is returned.

def objective(params):
    # Unpack the candidate hyperparameters
    C, gamma = params
    # Load the Iris dataset and create a fixed train/test split
    X, y = load_iris(return_X_y=True)
    np.random.seed(0)
    indices = np.random.permutation(len(X))
    X_train = X[indices[:100]]
    y_train = y[indices[:100]]
    X_test = X[indices[100:]]
    y_test = y[indices[100:]]
    # Train a support vector classifier (RBF kernel by default) with the given C and gamma
    from sklearn.svm import SVC
    clf = SVC(C=C, gamma=gamma)
    clf.fit(X_train, y_train)
    # Return the negative test accuracy (we minimize the objective)
    return -clf.score(X_test, y_test)
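As a quick sanity check (this call is not part of the original tutorial flow, and the hyperparameter values are just placeholders), you can evaluate the objective once to make sure it runs end to end; the returned value is the negative test accuracy, so values closer to -1.0 are better:

# Hypothetical single evaluation of the objective
print(objective([1.0, 0.01]))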

Step 3: Define Parameter Bounds

At this step, we define the boundaries of the search space for the hyperparameters. We create a NumPy array bounds of shape (2, 2), where each row corresponds to a hyperparameter and the two columns hold the lower and upper bounds of that hyperparameter. In our case, the first hyperparameter is C and the second one is gamma, both of which are used in training the support vector classifier.

The purpose of setting bounds is to restrict the hyperparameter search space, avoiding values that are unlikely to be optimal and focusing the optimization on the most promising regions. We chose these boundaries somewhat arbitrarily for this exercise, but in tasks where the plausible range of hyperparameters is known, this becomes an important factor.

bounds = np.array([[1e-3, 1e3], [1e-5, 1e-1]])
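One practical note: both C and gamma span several orders of magnitude, so uniform sampling over these bounds will place almost all candidates near the top of each range. A common variant, sketched below (it is not used in the code that follows), is to sample in log space instead:

# Sketch: log-uniform sampling over the same bounds defined above
log_bounds = np.log10(bounds)
sample = 10 ** np.random.uniform(log_bounds[:, 0], log_bounds[:, 1], size=bounds.shape[0])
# sample[0] is a candidate C, sample[1] a candidate gamma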

Step 4: Define Acquisition Function

This step defines the acquisition function we discussed earlier, which determines the next point to be evaluated in the search space. In this specific example, the acquisition function is Expected Improvement (EI). It measures the expected improvement in the objective function over the current best observation, given the current surrogate model (the Gaussian process). The acquisition function works as follows:

  1. The Gaussian process predicts the mean and standard deviation at the point x using gp.predict()
  2. The function finds the best objective function value observed so far (f_best)
  3. Calculates the improvement over f_best as improvement = f_best - mu
  4. Calculates the standard score Z = improvement/sigma if sigma is positive, or sets Z to 0 if sigma is 0
  5. Uses the cumulative distribution function of the standard normal distribution (sps.norm.cdf) and the probability density function (sps.norm.pdf) to calculate the expected improvement (ei) at point x
  6. Returns the expected improvement
def acquisition(x):
    # Predict the surrogate's mean and standard deviation at point x
    mu, sigma = gp.predict(x.reshape(1, -1), return_std=True)
    # Best (lowest) objective value observed so far
    f_best = np.min(y_samples)
    improvement = f_best - mu
    with np.errstate(divide='warn'):
        Z = improvement / sigma if sigma > 0 else 0.0
        ei = improvement * sps.norm.cdf(Z) + sigma * sps.norm.pdf(Z)
        ei[sigma == 0.0] = 0.0  # no expected improvement where the model is certain
    return ei
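The loop in Step 6 calls acquisition() once per candidate point, which means thousands of separate gp.predict() calls per iteration. A vectorized variant, shown below as a sketch (it reuses the same gp and y_samples globals and is not part of the original code), scores all candidates in a single prediction and is considerably faster:

def acquisition_batch(X):
    # X has shape (n_points, n_params); predict mean and std for all points at once
    mu, sigma = gp.predict(X, return_std=True)
    f_best = np.min(y_samples)
    improvement = f_best - mu
    with np.errstate(divide='ignore', invalid='ignore'):
        Z = np.where(sigma > 0, improvement / sigma, 0.0)
    ei = improvement * sps.norm.cdf(Z) + sigma * sps.norm.pdf(Z)
    ei[sigma == 0.0] = 0.0
    return ei  # shape (n_points,)

Inside the loop, acq_values = acquisition_batch(x_random_points) would then replace the per-point list comprehension.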

Step 5: Initialize Samples and Surrogate Function

Before starting the Bayesian optimization loop, we need to initialize the Gaussian process surrogate model with some initial samples. As discussed previously, the surrogate function is used to approximate the unknown objective function in order to optimize it efficiently. The Gaussian process is a probabilistic model that defines a prior over functions. It allows for Bayesian inference to be used to update the model as new data is acquired. Specifically, x_samples are initial points randomly sampled from the search space defined by the bounds array. The y_samples are the corresponding objective function evaluations of these initial points. These samples are used to train the Gaussian process and improve its surrogate modeling.

The steps we take here are:

  1. We set the number of initial samples (n_init_samples) to 5
  2. Generate n_init_samples random samples from the parameter space defined by bounds using np.random.uniform (bounds[:, 0] and bounds[:, 1] extract the lower and upper bounds for each parameter, respectively). The size parameter specifies the shape of the output array, which is (n_init_samples, bounds.shape[0]) in this case. This generates an array of shape (n_init_samples, 2) where each row represents a set of parameters
  3. Evaluate the objective function objective(x) for each set of parameters in x_samples using a list comprehension and store the results in the array y_samples
  4. Initialize a Gaussian process regressor with a Matern kernel with parameter nu=2.5, which is a smoothness parameter that controls the flexibility of the model. Matern kernel is a popular choice for Bayesian optimization due to its ability to handle noisy and discontinuous functions
# Number of initial samples used to seed the surrogate model
n_init_samples = 5
# Draw random initial points from the parameter space defined by bounds
x_samples = np.random.uniform(bounds[:, 0], bounds[:, 1], size=(n_init_samples, bounds.shape[0]))
# Evaluate the objective at each initial point
y_samples = np.array([objective(x) for x in x_samples])
# Gaussian process surrogate with a Matern kernel (nu controls smoothness)
kernel = Matern(nu=2.5)
gp = GaussianProcessRegressor(kernel=kernel)
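Two GaussianProcessRegressor options are worth knowing about here, although the tutorial leaves them at their defaults (so treat this purely as an optional variant): normalize_y rescales the targets before fitting, and n_restarts_optimizer refits the kernel hyperparameters from several random starting points, which often yields a better-behaved surrogate.

# Optional surrogate configuration, not used in the code above
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True, n_restarts_optimizer=5)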

Step 6: Run Bayesian Optimization Loop

We have finally arrived at the Bayesian optimization loop. In this step, the Bayesian optimization loop is run for a specified number of iterations (n_iter). In each iteration, the Gaussian process model is updated with the existing samples (i.e. x_samples and y_samples), using the gp.fit() method. Then, the next sample to be evaluated by the objective function is chosen by optimizing the acquisition function using a large number of random points (i.e. x_random_points), generated from the parameter space. The acquisition function is evaluated at each of these points and the point with the maximum acquisition function value is chosen as the next sample (i.e. x_next). The acquisition function value at this point is recorded as the best_acq_value. Finally, the objective function is evaluated at the chosen point and the resulting value is added to the existing samples by updating x_samples and y_samples. The process is repeated for the specified number of iterations (i.e. n_iter) and the results for each iteration are printed.

# Run Bayesian optimization loop for n_iter iterations
n_iter = 10
for i in range(n_iter):
    # Update Gaussian process with existing samples
    gp.fit(x_samples, y_samples)

    # Find the next sample by optimizing the acquisition function
    x_next = None
    best_acq_value = -np.inf

    # Sample a large number of random points from the parameter space
    n_random_points = 10000
    x_random_points = np.random.uniform(bounds[:, 0], bounds[:, 1], size=(n_random_points, bounds.shape[0]))

    # Evaluate acquisition function at each point and find the maximum
    acq_values = np.array([acquisition(x) for x in x_random_points])
    max_acq_index = np.argmax(acq_values)
    max_acq_value = acq_values[max_acq_index]

    if max_acq_value > best_acq_value:
        best_acq_value = max_acq_value
        x_next = x_random_points[max_acq_index]

    print(f"Iteration {i+1}: next sample is {x_next}, acquisition value is {best_acq_value}")

    # Evaluate objective function at next sample and add it to existing samples
    y_next = objective(x_next)
    x_samples = np.vstack((x_samples, x_next))
    y_samples = np.append(y_samples, y_next)
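If you want to track convergence rather than only the proposed samples, one small addition (hypothetical, not part of the original loop) is to also print the best objective value seen so far at the end of each iteration:

# Add at the bottom of the loop body, after y_samples is updated
print(f"Best objective so far: {np.min(y_samples):.4f}")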

Step 7: Print Results

Lastly, we print the best parameters and the best accuracy found during the Bayesian optimization loop. The best parameters are the ones that correspond to the minimum value of the objective function, which is why np.argmin is used to find the index of the sample with the minimum value of y_samples.

# Print final results
best_index = np.argmin(y_samples)
best_x = x_samples[best_index]
best_y = y_samples[best_index]

print(f"Best parameters: C={best_x[0]}, gamma={best_x[1]}")
print(f"Best accuracy: {-best_y}")

Final results of running this process are as follows:

[Image: Bayesian Optimization Results]

