**This paper is the outcome of Microsoft finally releasing the beast! The ResNet "slayed" everything, and won not one, not two, but five competitions: ILSVRC 2015 image classification, detection, and localization, and COCO 2015 detection and segmentation.**

### Problems the Paper Addressed

The paper analyzed what was causing the accuracy of deeper networks to drop compared to their shallower counterparts and provided a possible solution. Using this solution, a very deep network *(152 layers)* was trained successfully.

Before we dive into the architecture, let’s look at the problem a little more closely.

### The Problem

The problem is the degradation problem: as we increase the depth of the network, instead of the accuracy increasing, it drops. This drop occurs not just on the validation data but also on the training data. Overfitting can therefore be ruled out as the cause, because if the model were overfitting, the training accuracy would keep improving, which is contrary to the observation.

In the paper it is mentioned that the notorious vanishing gradient problem is taken care of by normalized initialization and intermediate normalization layers, and that the degradation is instead because *“current solvers are unable to find a proper solution in feasible time”*. In a roundabout way, it is still the vanishing gradient problem! The measures taken to address it were simply not enough. As the gradient shrinks while flowing from the output back to the initial layers, training becomes very slow and the network fails to train.

Now someone might think it could easily be a different problem from this. It could just be that better solutions for these deep nets don’t exist! Well, this possibility is also ruled out. Consider a shallower architecture and its deeper counterpart. There exists a solution in which all the layers from the shallow net are copied into the deep net and the extra layers are just identity mappings. This shows that the solution space of a shallow net is a subset of that of a deeper net, and hence a deeper model should produce no higher training error than its shallower counterpart.

Now that we have understood the problem, let’s look at innovations done in the architecture to fix it.

### The Architecture

Instead of learning a desired mapping from the input space to the output space directly, ResNets learn a residual mapping; this is called residual learning.

##### Residual Learning

Residual learning is based on the hypothesis *(without proof)* that multiple non-linear layers can asymptotically approximate complicated functions; however, not all functions are equally easy to learn — some are harder and some are easier.

Denoting the underlying mapping as *H(x)*, the stacked non-linear layers fit another mapping, called the residual mapping, **F(x) := H(x) − x**. It is hypothesized that it is easier to optimize this residual mapping **F(x)** than the original mapping **H(x) = F(x) + x**. As an example, suppose the optimal mapping *H(x)* is the identity function, **H(x) = x**; then it is easier to push the residual to zero, **F(x) = 0**, than to fit an identity mapping by a stack of non-linear layers.

This is achieved by using skip connections!
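To make the hypothesis concrete, here is a minimal numpy sketch (a hypothetical fully-connected stand-in for the paper’s convolutional layers) showing why the identity becomes an “easy” target once a skip connection is present:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)

# Plain layer: to realize the identity H(x) = x, the weights must
# approximate the identity matrix -- a non-trivial target to learn.
W_plain = np.eye(4)
assert np.allclose(W_plain @ x, x)

# Residual layer: y = F(x) + x. To realize the identity, the residual
# branch only has to output zero, i.e. drive its weights to zero.
W_res = np.zeros((4, 4))      # the "easy" solution F(x) = 0
y = W_res @ x + x             # skip connection adds x back
assert np.allclose(y, x)
```

Driving weights toward zero is exactly what weight decay already encourages, which is one intuition for why residual blocks optimize well.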

##### ResNet Architecture

The figure shows the smallest building block of a ResNet. It is basically a few stacked layers *(minimum two)* with a skip connection. Skip connections are identity mappings and hence contribute no additional parameters. Residual learning is applied to the stacked layers. The block can be represented as:

**y = F(x, {Wᵢ}) + x**

*F* is the residual function learned by the stacked layers, and *+* is element-wise addition. The dimensions of *F* and *x* must be equal to perform the addition. If this is not the case (e.g., when changing the input/output channels), one of the following can be done:

- The shortcut still performs identity mapping, with extra zero entries padded for the increased dimensions. This adds no extra parameters.
- Perform a linear projection **Wₛ** in the shortcut connection, using **1 × 1** convolutions to match the dimensions: **y = F(x, {Wᵢ}) + Wₛx**. This introduces extra parameters and computation.

The final non-linearity is applied only after the element-wise addition.
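Putting these pieces together, a minimal sketch of the block in numpy (a hypothetical fully-connected stand-in for the paper’s convolutional block; `residual_block` and the weight names are illustrative, not from the paper):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2, Ws=None):
    """Two stacked layers with a skip connection."""
    out = relu(W1 @ x)                        # first layer + non-linearity
    out = W2 @ out                            # second layer, no activation yet
    shortcut = x if Ws is None else Ws @ x    # projection W_s if dims change
    return relu(out + shortcut)               # final ReLU AFTER the addition

rng = np.random.default_rng(0)
x = rng.standard_normal(8)

# Same input/output dims: identity shortcut, no extra parameters.
W1 = rng.standard_normal((8, 8)) * 0.1
W2 = rng.standard_normal((8, 8)) * 0.1
y = residual_block(x, W1, W2)

# Dims change (8 -> 16): projection shortcut matches the dimensions.
W1b = rng.standard_normal((16, 8)) * 0.1
W2b = rng.standard_normal((16, 16)) * 0.1
Ws = rng.standard_normal((16, 8)) * 0.1
y2 = residual_block(x, W1b, W2b, Ws)

assert y.shape == (8,) and y2.shape == (16,)
```

Note how the shortcut bypasses both weight layers and the final ReLU is applied only after the addition, matching the block equation above.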

One point to note is that if the number of stacked layers were only one, there would be no advantage, as the block would just behave like a normal linear layer:

**y = F(x, {W₁}) + x**

*⇒* **y = (W₁x + b) + x**

*⇒* **y = (W₁ + I)x + b**

*⇒* **y = W₂x + b** …………… where **W₂ = W₁ + I** and **I** is the identity matrix.
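This collapse can be checked numerically with a small numpy sketch (the weights here are random illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 4))
b = rng.standard_normal(4)
x = rng.standard_normal(4)

y_skip = (W1 @ x + b) + x     # single linear layer + skip connection
W2 = W1 + np.eye(4)           # fold the skip into the weight matrix
y_plain = W2 @ x + b          # ...which is just an ordinary linear layer

assert np.allclose(y_skip, y_plain)
```

With two stacked layers the non-linearity between them prevents this folding, which is why the block needs at least two layers to be useful.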

The actual ResNet model is basically just the residual blocks repeated multiple times. Batch Normalization is used after each convolution but before applying the activation function.

### Training

224 × 224 crops are randomly sampled from an image resized such that its shorter side is randomly chosen from [256, 480], with the per-pixel mean subtracted. Stochastic gradient descent is used with a mini-batch size of 256, momentum of 0.9, and a weight decay of 0.0001. The learning rate is initialized at 0.1 and is divided by 10 when the error plateaus. The models are trained for 60 × 10⁴ iterations. Dropout is not used.
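The scale-augmentation step can be sketched as follows (an illustrative numpy sketch; `sample_crop` is a hypothetical helper that works on image shapes rather than real pixels, and real code would also handle horizontal flips and mean subtraction):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_crop(image_hw):
    """Resize so the shorter side is a random value in [256, 480],
    then pick a random 224x224 crop window."""
    h, w = image_hw
    s = int(rng.integers(256, 481))          # target shorter side (inclusive)
    scale = s / min(h, w)
    h2, w2 = round(h * scale), round(w * scale)
    top = int(rng.integers(0, h2 - 224 + 1))  # random crop corner
    left = int(rng.integers(0, w2 - 224 + 1))
    return (h2, w2), (top, left)

size, corner = sample_crop((375, 500))
assert 256 <= min(size) <= 480               # shorter side in range
assert corner[0] + 224 <= size[0] and corner[1] + 224 <= size[1]
```

Because the shorter side is at least 256, a 224 × 224 crop always fits, and the random scale gives the network multi-scale training data for free.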

### Conclusion

From the 22-layer GoogLeNet to the monstrous 152-layer ResNet was a huge leap in just one year! ResNets achieved a 3.57% top-5 error on the ImageNet competition.

*Original Paper: **Deep Residual Learning for Image Recognition***