Deep convolutional neural network based on Krizhevsky et al.'s 2012 paper

Issam Sebri
6 min read · Feb 4, 2021

Introduction

The ImageNet Large Scale Visual Recognition Challenge (LSVRC) is one of the best-known challenges in the machine learning community, where university teams, researchers, and laboratories compete for the best machine learning model capable of recognizing a set of high-resolution images. The contest draws on the ImageNet collection of roughly 15 million labeled images, and the 2012 edition asked models to sort images into 1000 output categories.

This article walks through the winning entry, the Krizhevsky network or ‘AlexNet’, and the top-1 and top-5 error rates it recorded in the ImageNet LSVRC contest. It was primarily built by Alex Krizhevsky and published in collaboration with Ilya Sutskever and Krizhevsky’s doctoral advisor, Geoffrey Hinton.

Procedures

Krizhevsky’s network, or ‘AlexNet’, is recognized for the remarkable depth of its convolutional neural network, a necessary trade-off between efficiency and complexity made possible by an intelligent use of two GPUs working in tandem. ‘AlexNet’ uses a collection of techniques to reduce the error rate, speed up the training process, and avoid overfitting.

Datasets

ImageNet (1) provides over 15 million high-resolution labeled images belonging to roughly 22,000 categories. The LSVRC-2010 contest uses only a subset of the dataset: about 1.2 million training images, plus validation and test sets totalling 200,000 images, drawn from 1000 categories (2).

The dataset consists of variable-resolution images, since they were collected from the web and from individual photographers, which raises the difficulty: the network needs every input at a fixed size of 256×256×3 as an RGB image. So the preprocessing rescales each image so that its shorter side (height or width) becomes 256, then crops the central 256×256 patch out of the longer side.

For example, an original image of shape (250, 400, 3) becomes roughly (256, 410, 3) after rescaling the shorter side, and the central crop brings it down to (256, 256, 3).
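As a rough illustration, here is a minimal preprocessing sketch using Pillow and NumPy; the function name and the example file name are mine, and the paper does not specify a resampling filter, so the library default is used.

```python
import numpy as np
from PIL import Image

def rescale_and_center_crop(path, size=256):
    """Rescale the shorter side to `size`, then crop the central size x size patch."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = size / min(w, h)
    # Rescale so the shorter side becomes `size` (aspect ratio preserved).
    img = img.resize((round(w * scale), round(h * scale)))
    w, h = img.size
    # Center-crop the longer side down to `size`.
    left, top = (w - size) // 2, (h - size) // 2
    img = img.crop((left, top, left + size, top + size))
    return np.asarray(img)  # shape: (256, 256, 3)

# rescale_and_center_crop("robot.jpg").shape  ->  (256, 256, 3)
```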

The architecture

  • Architecture:

‘AlexNet’ consists of five convolutional layers followed by three fully connected layers, running on two GPUs. The kernels of the second, fourth and fifth convolutional layers are private to each GPU: they connect only to the kernel maps of the previous layer that sit on the same GPU, while the first layer feeds both GPU halves and the third layer reads from both. The input is a (224, 224, 3) image, and the first layer uses kernels of shape (11, 11, 3).
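For readers who prefer code, here is a hedged single-GPU sketch of the layer stack in PyTorch. The channel counts are the totals across the paper's two GPUs, the cross-GPU wiring and weight initialization are omitted, and PyTorch's LocalResponseNorm averages the squared activations over the window, a slight difference from the paper's formula.

```python
import torch.nn as nn

# Single-GPU sketch of the AlexNet layer stack; channel counts are the totals
# that the paper splits across two GPUs (e.g. 96 = 48 + 48 in conv1).
alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(),   # conv1
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.MaxPool2d(kernel_size=3, stride=2),                # overlapping pooling
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),            # conv2
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),           # conv3
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),           # conv4
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),           # conv5
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),           # fc6
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),                  # fc7
    nn.Linear(4096, 1000),                                # fc8: 1000 categories
)
```

The small padding on the first convolution is what lets a 224×224 input actually produce the paper's 55×55 feature maps, a well-known quirk of the published dimensions.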

  • ReLU activation:

The network uses the ReLU activation function, which the paper reports converging several times faster than the saturating tanh function. The choice is quite understandable: to update weights with backpropagation you compute the gradient of the loss and apply the chain rule through the hidden layers, so you need the derivative of each activation function. ReLU is a ramp function with a flat part where the derivative is 0 and a linear part where the derivative is 1, which keeps the math very easy. With the hyperbolic tangent you can run into the vanishing gradient problem: once x is smaller than about -2 or bigger than about 2, the derivative becomes tiny, so the network may converge slowly or some neurons may saturate and effectively stop learning.
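A tiny NumPy illustration of that derivative argument (the sample points are arbitrary):

```python
import numpy as np

x = np.array([-4.0, -2.0, 0.5, 2.0, 4.0])

# ReLU derivative: 0 on the flat part, 1 on the ramp.
relu_grad = (x > 0).astype(float)

# tanh derivative 1 - tanh(x)^2: already tiny once |x| > 2.
tanh_grad = 1.0 - np.tanh(x) ** 2

print(relu_grad)              # [0. 0. 1. 1. 1.]
print(tanh_grad.round(4))     # [0.0013 0.0707 0.7864 0.0707 0.0013]
```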

  • Normalization:

For normalization, ‘AlexNet’ uses an older scheme than today's Batch Normalization called Local Response Normalization: the response of kernel map i at a given position is divided by (k + α · Σ_j (a_j)²)^β, where the sum runs over n adjacent kernel maps at the same position. The paper uses k=2, n=5, α=10⁻⁴ and β=0.75, and credits the scheme with reducing the top-1 and top-5 error rates by 1.4% and 1.2% respectively.

If you struggle to parse this formula and are not used to the older notation, try setting k=0, α=1 and β=1/2 with the sum taken over all the kernel maps: you get back plain L2 normalization.

If you are a TensorFlow fan, you will not find a built-in Keras layer for this kind of normalization, though the low-level op exists as tf.nn.local_response_normalization; Caffe and Lasagne ship their own implementations as well.
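Below is a minimal NumPy sketch of that formula, written channels-last for readability; it is only an illustration, not the paper's GPU implementation.

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Local Response Normalization over the channel axis of an (H, W, C) array.

    Each activation is divided by (k + alpha * sum of squares over its n
    neighbouring channels) ** beta, with the paper's default constants."""
    H, W, C = a.shape
    b = np.empty_like(a, dtype=float)
    for i in range(C):
        lo, hi = max(0, i - n // 2), min(C - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[:, :, lo:hi + 1] ** 2, axis=2)) ** beta
        b[:, :, i] = a[:, :, i] / denom
    return b

# With k=0, alpha=1, beta=0.5 and n covering every channel, this reduces to
# plain L2 normalization across the channels.
```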

  • Pooling

Standard pooling uses side-by-side patches: if the kernel (filter) size and the stride are both 2, the pooled windows tile the input without overlapping, as illustrated below.

Kernel = 2, stride = 2

The pooling used in ‘AlexNet’, however, produces overlapping patches: the paper notes that when the stride is smaller than the kernel size, the pooled windows overlap (‘AlexNet’ itself uses a kernel of size 3 with a stride of 2). A small sketch after the illustrations below makes the difference concrete.

Kernel = 2, stride = 1
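The 1-D NumPy toy below shows the two settings from the illustrations above; the input values are arbitrary.

```python
import numpy as np

def max_pool_1d(x, kernel, stride):
    """1-D max pooling: slide a window of size `kernel` with step `stride`."""
    return np.array([x[s:s + kernel].max()
                     for s in range(0, len(x) - kernel + 1, stride)])

x = np.array([1, 3, 2, 5, 4, 6], dtype=float)
print(max_pool_1d(x, kernel=2, stride=2))   # [3. 5. 6.]        side-by-side patches
print(max_pool_1d(x, kernel=2, stride=1))   # [3. 3. 5. 5. 6.]  overlapping patches
```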

Reducing Overfitting

Learning a model with this many parameters, even from a large-scale dataset, leaves it vulnerable to overfitting and to memorizing quirks of the training data, but ‘AlexNet’ uses two techniques to reduce overfitting.

  • Data augmentation:

Data augmentation is a way to derive more training examples from the existing dataset so that the model does not latch onto particular noise patterns. The first form of augmentation enlarges the dataset with image translations (random 224×224 crops of the 256×256 images) and horizontal reflections, as sketched below.
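A minimal NumPy sketch of this first augmentation, assuming the 256×256 inputs produced earlier; the function name is mine.

```python
import numpy as np

def random_crop_and_flip(img, crop=224, rng=None):
    """Take a random crop from a larger image and flip it horizontally
    half of the time (the paper's translation + reflection augmentation)."""
    rng = rng or np.random.default_rng()
    h, w, _ = img.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]   # horizontal reflection
    return patch
```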

The second technique alters the intensity and colour of the images using PCA (Principal Component Analysis) applied to the RGB channels. PCA finds the principal directions along which the colour values vary, so shifting each pixel along those directions, scaled by the corresponding eigenvalues and a small random factor, changes the lighting while staying within the colour distribution already present in the data. To see it in action, here is a small example of the kind of change applied to our ‘Robot’ image.
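The hedged NumPy sketch below shows the idea. One simplification: the paper computes the PCA once over the RGB values of the entire training set, while this sketch computes it on a single image so the example stays self-contained.

```python
import numpy as np

def pca_color_augment(img, sigma=0.1, rng=None):
    """'Fancy PCA' colour augmentation: shift every pixel along the principal
    components of the RGB values, scaled by the eigenvalues and random draws."""
    rng = rng or np.random.default_rng()
    pixels = img.reshape(-1, 3).astype(float) / 255.0
    # Covariance of the three colour channels and its eigen-decomposition.
    cov = np.cov(pixels, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # alpha_i ~ N(0, sigma); the per-pixel shift is eigvecs @ (alpha * eigvals).
    alpha = rng.normal(0.0, sigma, size=3)
    shift = eigvecs @ (alpha * eigvals)
    out = np.clip(pixels + shift, 0.0, 1.0)
    return (out.reshape(img.shape) * 255).astype(np.uint8)

# augmented = pca_color_augment(robot_image)   # robot_image: (H, W, 3) uint8 array
```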

We should mention that the team generates these augmented images on the CPU while the GPUs are busy training on the previous batch, which makes the augmentation essentially free in terms of training time.

  • Dropout

At the time of the contest, dropout was a newly introduced technique for avoiding overfitting. During training, each neuron in the first two fully connected layers is ‘dropped out’ with probability 0.5: its output is set to zero, so it takes part in neither the forward nor the backward pass for that example. This keeps neurons from co-adapting too much and forces the network to learn more robust features; it is cheap to apply, although the paper notes it roughly doubles the number of iterations needed to converge.
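A minimal NumPy sketch of the mechanism, using the paper's test-time scaling by 0.5 rather than the ‘inverted dropout’ most modern frameworks apply:

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True, rng=None):
    """Dropout as used in AlexNet: during training each neuron's output is
    zeroed with probability p_drop; at test time every neuron is kept but
    its output is scaled by (1 - p_drop) to compensate."""
    if not training:
        return activations * (1.0 - p_drop)
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= p_drop   # True = keep the neuron
    return activations * mask

h = np.array([0.7, 1.2, 0.1, 2.3])
print(dropout(h, training=True))    # roughly half the values zeroed at random
print(dropout(h, training=False))   # [0.35 0.6  0.05 1.15]
```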

Results

On the LSVRC-2010 test set, ‘AlexNet’ achieves top-1 and top-5 error rates of 37.5% and 17.0%. In the ILSVRC-2012 edition, averaging the predictions of several such CNNs brings the top-5 error rate down to 16.4%. The results were quite impressive: the network won the contest by a wide margin and became the state-of-the-art convolutional network.

Conclusion

AlexNet gave its best and won the LSVRC contest with record top-1 and top-5 error rates, and this achievement highlighted the importance of network depth for getting better results. Using a CNN also means far fewer parameters than a standard fully connected neural network, which helps speed up the training process; it is still computationally expensive, but simple enough to achieve the goal. All the techniques used to speed up training, normalize the responses, and reduce overfitting show the perseverance of the researchers to get the best result, and they contributed a lot to this impressive outcome.

Personal Notes

The deep convolutional neural network described in this article is a kind of magic key that helped create a whole generation of modern models and networks, and its error rate looks modest next to the 2017 edition of the competition (where top-5 accuracy passed 95%). But ‘AlexNet’ shows what it means to compete for a goal while fighting hardware limitations, pairing two GPUs to squeeze out their maximum compute, fighting against time and a lot of waiting (a single training run took around five to six days), all woven together with an embroidery of math and science. Some of the techniques now look dated and better alternatives have since appeared, but ten years is enough for any model to age, and it makes you wonder what lies ahead in the next hundred.
