Deep Learning Hyper-parameter Tuning Using GPy/GPyOpt
Introduction
I remember when I started studying machine learning and saw these countless parameters, the first question that jumped into my head was: how can I find the exact values I need? At first I thought it was a secret recipe known only to experienced deep learning practitioners. But as I moved on in my learning journey at Holberton School, I discovered a beautiful and clever technique that I would like to share with you.
Abstract
Finding the right number of layers, the number of nodes in each layer, or any of the many other hyper-parameters can be a headache. Until now in my learning journey I used to think of finding the right fit as a stroke of luck. Today we will build a deep neural network (not a very deep one) to solve a classic computer vision problem: recognizing handwritten digits from the MNIST dataset. But first we need to understand the math behind the whole concept of optimizing hyper-parameters using Gaussian Processes and Bayesian Optimization.
Gaussian Process
Let's start with the Gaussian distribution, a normal distribution defined by two parameters: the mean μ, which expresses our expectation, and the standard deviation σ, which shows how certain we are about that expectation.

For the multivariate Gaussian we extend μ and σ to a higher-dimensional space: the mean becomes a vector and the variance becomes a covariance matrix Σ.
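For reference, these are the standard densities behind that description (a textbook formulation, not the article's original figures):

```latex
% Univariate Gaussian with mean \mu and standard deviation \sigma
\mathcal{N}(x \mid \mu, \sigma^2)
  = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

% Multivariate Gaussian in k dimensions with mean vector \boldsymbol{\mu}
% and covariance matrix \Sigma
\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma)
  = \frac{1}{(2\pi)^{k/2}\,|\Sigma|^{1/2}}
    \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}
    \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)
```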

A Gaussian Process is a Gaussian distribution over functions. The function maps a domain X to another domain Y, with a mean vector containing the expected outputs of that function and a covariance matrix Σ describing how close we are to the truth: the closer our expectation is to the true Y, the closer σ is to zero.

This is based on the assumption that values close to each other in the domain X are mapped to values close to each other in the co-domain Y. So building a new covariance matrix Σ lets us construct a function in which neighboring values have a larger correlation than values separated by a larger distance.
To build this covariance matrix we need a kernel, which is responsible for assigning a high correlation based on the distance between values. There are many kernels: the Radial Basis Function, the Squared Exponential kernel, …
The Radial Basis Function is my pick for today, and here is the function:
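The commonly used form of the RBF kernel, with length-scale ℓ and variance σ² (a standard formulation, not necessarily the exact notation of the original figure), is:

```latex
k(x_i, x_j) = \sigma^2 \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{2\ell^2}\right)
```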

Gaussian Processes are used to place a prior distribution over the functions that could describe our data and, conditioned on the observed samples, to obtain a posterior over functions that fit the data. Let's say we have some outputs of a function f and we want to find the unknown outputs f*.

Given the prior Gaussian, which we take as the joint distribution, we can compute the conditional distribution of f* once we have observed f.
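In standard GP notation (a textbook formulation assuming a zero-mean prior and noise-free observations, not the article's exact figure), the joint prior and the resulting conditional are:

```latex
% Joint prior over observed outputs f and unknown outputs f_*
\begin{bmatrix} \mathbf{f} \\ \mathbf{f}_* \end{bmatrix}
\sim \mathcal{N}\!\left(\mathbf{0},
\begin{bmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right)

% Conditional (posterior) over f_* given the observed f
\mathbf{f}_* \mid \mathbf{f}
\sim \mathcal{N}\!\Big(
  K(X_*, X)\,K(X, X)^{-1}\mathbf{f},\;
  K(X_*, X_*) - K(X_*, X)\,K(X, X)^{-1}K(X, X_*)
\Big)
```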


Bayesian optimization
When we don't have an explicit expression for a function, sampling it at chosen points is the only way to evaluate it. When sampling is very expensive, we need a clever optimization technique such as Bayesian optimization, which tries to reach the global optimum in a few steps. A surrogate model such as a Gaussian Process is useful because it is cheap to evaluate.
Bayesian optimization is an iterative process that plans its next move based on the improvement achieved so far. These suggestions are made by an acquisition function, and there are several of them, such as:
- Maximum Probability of Improvement (MPI)

- Expected Improvement (EI) …
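As one example, the Expected Improvement at a candidate point x is usually written in closed form using the GP posterior mean μ(x), posterior standard deviation σ(x), and the best observation so far f(x⁺) (a standard maximization-form expression, not the article's own notation):

```latex
\mathrm{EI}(x) = \big(\mu(x) - f(x^{+})\big)\,\Phi(Z) + \sigma(x)\,\phi(Z),
\qquad Z = \frac{\mu(x) - f(x^{+})}{\sigma(x)}
```

where Φ and φ are the standard normal CDF and PDF; for minimization the signs are flipped.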

Digit recognition model
One of the best-known problems in the computer vision community is recognizing handwritten digits, and we will take advantage of the MNIST dataset. As our first step we need to import our tools.
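A minimal set of imports along these lines (assuming TensorFlow's bundled Keras; the article's original snippet may differ slightly):

```python
import numpy as np
import GPy
import GPyOpt

# Keras pieces for the MNIST model
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
```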
Keras ships MNIST as a pre-partitioned dataset, so we need to assign and normalize the features and encode the labels:
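A sketch of that preparation step, flattening the 28x28 images, scaling pixels to [0, 1], and one-hot encoding the 10 classes:

```python
# MNIST already comes split into train and test partitions
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Flatten 28x28 images into 784-dimensional vectors and normalize to [0, 1]
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
x_test = x_test.reshape(-1, 784).astype('float32') / 255.0

# One-hot encode the labels into 10 categories
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)
```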
Our neural network is a simple model with 3 hidden layers, each containing n units and using the ReLU activation function. The output layer is a Softmax layer with 10 categories to predict. The model uses the Adam optimizer with a learning rate λ and two exponential decay rates β1 and β2.
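A sketch of such a model builder; the helper name `build_model` and the per-layer node arguments are illustrative, not the article's exact code:

```python
def build_model(lr, beta_1, beta_2, nodes_1, nodes_2, nodes_3):
    """Build a 3-hidden-layer ReLU network with a 10-way Softmax output."""
    model = Sequential([
        Dense(nodes_1, activation='relu', input_shape=(784,)),
        Dense(nodes_2, activation='relu'),
        Dense(nodes_3, activation='relu'),
        Dense(10, activation='softmax'),
    ])
    # Adam with tunable learning rate and exponential decay rates
    model.compile(optimizer=Adam(learning_rate=lr, beta_1=beta_1, beta_2=beta_2),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```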
Building black-box function
The black-box function takes the tuned hyper-parameters and returns a meaningful metric so the acquisition function can be evaluated, and what could be better than accuracy to evaluate a model? Our hyper-parameters are listed below (a sketch of the black-box function follows the list):
- learning rate: an important parameter with a significant impact on the performance of our model.
“The learning rate is perhaps the most important hyper-parameter. If you have time to tune only one hyper-parameter, tune the learning rate.” — Deep Learning
- beta_1: the exponential decay rate for the first-moment estimates.
- beta_2: the exponential decay rate for the second-moment estimates.
- nodes: the number of nodes in each hidden layer; the number of nodes reflects the capacity of the model.
We can control whether a model is more likely to overfit or underfit by altering its capacity.
- batch_size: our way to control the stability of training.
- epochs: as the number of epochs increases, the weights are updated more times and the model moves from underfitting to optimal to overfitting.
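Here is a sketch of the black-box function under those choices. GPyOpt passes the candidate hyper-parameters as one row of a 2D array, and we return the negative validation accuracy so that minimizing it maximizes accuracy (names such as `black_box` and `build_model` are illustrative):

```python
def black_box(x):
    """Evaluate one hyper-parameter candidate proposed by GPyOpt."""
    # GPyOpt passes a 2D array with one candidate per row
    lr, beta_1, beta_2, batch_size, epochs, n1, n2, n3 = x[0]

    model = build_model(lr, beta_1, beta_2, int(n1), int(n2), int(n3))
    model.fit(x_train, y_train,
              batch_size=int(batch_size),
              epochs=int(epochs),
              validation_data=(x_test, y_test),
              verbose=0)

    _, accuracy = model.evaluate(x_test, y_test, verbose=0)
    # GPyOpt minimizes the objective, so return the negative accuracy
    return -accuracy
```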
Optimization
For our optimization to succeed we need to set our bounds space, which is how we fix the possible range in which each hyper-parameter can reside.
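In GPyOpt the bounds are given as a list of dictionaries, one per hyper-parameter; the ranges below are illustrative placeholders, not necessarily the ones used in the article:

```python
bounds = [
    {'name': 'lr',         'type': 'continuous', 'domain': (1e-4, 1e-2)},
    {'name': 'beta_1',     'type': 'continuous', 'domain': (0.8, 0.999)},
    {'name': 'beta_2',     'type': 'continuous', 'domain': (0.9, 0.9999)},
    {'name': 'batch_size', 'type': 'discrete',   'domain': (50, 100, 200, 400)},
    {'name': 'epochs',     'type': 'discrete',   'domain': (4, 8, 16)},
    {'name': 'nodes_1',    'type': 'discrete',   'domain': (64, 128, 256, 512)},
    {'name': 'nodes_2',    'type': 'discrete',   'domain': (64, 128, 256, 512)},
    {'name': 'nodes_3',    'type': 'discrete',   'domain': (64, 128, 256, 512)},
]
```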
Now we get help from GPy and GPyOpt to optimize our model, using a Gaussian Process as the surrogate model and the Expected Improvement acquisition function.
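A minimal sketch of that call with GPyOpt's BayesianOptimization, a GP surrogate, and the EI acquisition (the iteration budget here is an assumption):

```python
optimizer = GPyOpt.methods.BayesianOptimization(
    f=black_box,            # the black-box objective defined above
    domain=bounds,          # the search space
    model_type='GP',        # Gaussian Process surrogate
    acquisition_type='EI')  # Expected Improvement

# Assumed budget: 30 iterations on top of the initial design
optimizer.run_optimization(max_iter=30)
```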
Results
The result is quite impressive: our model reaches about 98% accuracy recognizing handwritten digits from the MNIST dataset. In the following graph we can see the convergence plot and the GPyOpt report.
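If you want to reproduce such a plot and inspect the best point found, GPyOpt exposes helpers along these lines:

```python
optimizer.plot_convergence()              # distance between consecutive samples and best value so far
optimizer.save_report('gpyopt_report.txt')  # textual summary of the run

print('best hyper-parameters:', optimizer.x_opt)
print('best objective value :', optimizer.fx_opt)
```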

As you can see, our optimization leads us to the best hyper-parameter combination, reaching 97.81% accuracy. For our model the best minimum is located at:
- lr = 0.0024
- beta_1 = 0.9
- beta_2 = 0.9999
- batch_size = 200
- epochs = 8
- first layer nodes = 256
- second layer nodes = 512
- third layer nodes = 128
Final thoughts
I find this a useful way to discover the best hyper-parameter combination, built on a clever mathematical foundation and probabilistic assumptions. In this experiment we saw how we can mine the best hyper-parameters and reach a decent validation accuracy.