Paper review Yixin Lin

Evolution of ConvNets - Part I

Over the next few posts, I’ll be reviewing some of the papers which advanced the state-of-the-art on image recognition. Most of these papers explain models which won the ImageNet LSVRC contest over the years.

In this post, I’ll talk about AlexNet, the original model which demonstrated the effectiveness of deep learning by winning ILSVRC 2012, and ZFNet, which won ILSVRC 2013. The latter is essentially a tweak of the former; however, both of these could be seen as a scaling up of LeNet1, the original convnet published back in 1988 by Yann LeCun (now director of Facebook AI Research).

In future posts, I’ll first talk about the VGGNet, the GoogLeNet/Inception architecture, and ResNet, which provided interesting architectural innovations beyond that of AlexNet. Then, I’ll explain the more recent modern tweaks, including batch normalization, combinations of Inception and ResNet, and Xception– all of which squeeze even more power out of these image recognition models and propel it to superhuman levels.


This is an extremely famous paper, titled “ImageNet Classification with Deep Convolutional Neural Networks” and authored by Krizhevsky, Sutskever, and Hinton at the University of Toronto. It launched the deep learning renaissance in 2012 by blowing the competition out of the water on ILSVRC with a model named AlexNet. It sparked the modern excitement around neural networks by clearly demonstrating the effectiveness of deep learning.

It explains some of the motivations behind convnets, primarily as a prior over the data distribution which biases it towards “stationarity of statistics and locality of pixel dependencies”– in other words, it is designed for images through the use of convolutions. For a more in-depth explanation of the motivation behind convnets, see section 9.2 of the Deep Learning book by Goodfellow et al2.



There are five convolutional layers, followed by three fully connected layers which outputs to a softmax over the 1000 ImageNet categories. The first two convolutional layers are followed by response-normalization layers and max-pooling layers; the fifth convolutional layer also has a max pooling layer after it.

There are a few practical notes which helped make this architecture successful:

  • Multiple GPU acceleration: They had several clever optimizations which enabled the feasibility of this large network.3
  • Using only ReLU nonlinearities: The ReLU function, $f(x) = \max(0,x)$, doesn’t saturate, as opposed to the previous standards of tanh or softmax. This increases training speed.
  • Overlapping pooling layers: Pooling layers reduce dimensionality of a layer by summarizing it (often with a max function, called max-pooling). They use overlapping layers, as opposed to just a grid.
  • Local response normalization: this normalizes the activations across kernels nearby. Intuitively, they have competition among local kernels in providing a large activation. Formally, if $a_{x,y}^i$ is output of the $i$-th kernel at position $x,y$ and for some hyperparameters $k,n,\alpha,\beta$, then $b_{x,y}^i$, the normalized output, is

In order to reduce overfitting (which occurs even on the massive ImageNet dataset!), they also use the two techniques:

  • Data augmentation: they generate image translations and reflections of the original dataset. They also add multiples of principle components of the RGB pixel values, the goal being to expose the network to the same images at different intensities.
  • Dropout: they apply the dropout technique, which has since become standard for reducing overfitting. During training, the output of neurons are set randomly to 0 in order to reduce dependency between specific neurons. One could also interpret this as training a huge ensemble of neural networks which have shared parameters. For more information on dropout, see section 7.12 of Goodfellow et al.


The results are absurdly good: top-5 test set error rates of 16.4%, from the closest competitor’s 26.2%, on ILSVRC 2012.


ZFNet, the model proposed in the paper “Visualizing and Understanding Convolutional Networks”4, improved on AlexNet by visualizing and debugging problems with the original architecture, leading to a model which won ILSVRC 2013. This paper appeared in ECCV 2014 and is authored by Zeiler and Fergus of the Courant Institute at NYU, but they soon left to found the computer vision company Clarifai based off this technology.


A significant portion of this paper is spent on new techniques for visualization of convnets; in particular, the authors reference their previous paper’s5 idea of the deconvnet, which can be thought of as an inverse mapping from feature activations to the input pixel space.

A deconvnet is a model that applies filtering and pooling in reverse. Specifically, a series of unpooling6, ReLU, and convolutional filtering layers is applied to each layer (in other words, they attach a deconvnet to each layer). When they want to examine a specific convnet activation, they zero out all other outputs of the current layer and see what the deconvnet outputs.

The resulting visualizations are quite beautiful and instructive; in particular, their Figure 2 demonstrates the effectiveness of the deconvnet technique well.

Architecture tweaks

Besides dispensing with the complicated sparse connections which AlexNet used for multi-GPU support7 (not super pivotal), they tweaked the first two convolution layers. As they say,

The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies. Additionally, the 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions. To remedy these problems, we (i) reduced the 1st layer filter size from 11x11 to 7x7 and (ii) made the stride of the convolution 2, rather than 4. This new architecture retains much more information in the 1st and 2nd layer features, as shown in Fig. 6(c) & (e). More importantly, it also improves the classification performance… [emphasize mine]

(In fact, in VGGNet8, the trend towards smaller-sized convolutions is taken much further.)


The primary result is that an ensemble of these ZFNets (essentially just AlexNet with a small tweak) attains 14.8% on ILSVRC 2012 and wins ILSVRC 2013, soon after leading them to spawn the Clarifai startup.

They then performed several experiments to probe the architecture.

First, they play with the size of the layers and number of layers:

Removing the fully connected layers (6,7) only gives a slight increase in error. This is surprising, given that they contain the majority of model parameters… removing [many layers] yields a model … whose performance is dramantically worse. This would suggest that the overall depth of the model is important for obtaining good performance.

With the benefit of hindsight, this heavily forshadows the VGGNet approach, which proposes to use a large number of smaller convolutional layers.

They also show that the feature extractor performs quite well on different datasets, leading to the now-classic transfer learning technique of “use image feature extractor trained on ImageNet, exchange the softmax for your own”.

Finally, they demonstrate that depth matters by showing the consistently increasing performance as you ascend through the layers of the model.


AlexNet was the beginning of the deep learning excitement, and certainly provided a striking example of deep learning and convnet success. One could say this work validated Yann LeCun’s original faith in his own work (see this for his letter explaining why he withdrew his paper from ICML). However, future work greatly improved the performance and provided interesting architectural advances.

Similar work over the evolution of convnets over the last few years, which might be great further reading on the topic:

Footnotes and References

  1. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, november 1998. 

  2. Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016. 

  3. I don’t think the details of this are particularly important, as there has been an enormous amount of engineering work on parallelization of neural network training on GPUs (and eventually shifting to ASICs like Google’s Tensor Processing Units). 

  4. Zeiler, Matthew D., and Rob Fergus. “Visualizing and understanding convolutional networks.” European conference on computer vision. Springer International Publishing, 2014. 

  5. Zeiler, Matthew D., Graham W. Taylor, and Rob Fergus. “Adaptive deconvolutional networks for mid and high level feature learning.” Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011. 

  6. Because pooling is not an invertible operator, they apply the hack of recording the locations of maxima for each pooling region in the original convnet and use it in the deconvnet. 

  7. Zeiler et al. trained using a single GTX 580, as opposed to the two used by Krizhevksy et al. I’m confused why this is possible, since they originally split it onto two GPUs in order to deal with GTX 580 memory constraints. 

  8. Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).