Inception Networks: Role of Auxiliary Loss in Training and Inference

  • Deep networks suffer from the vanishing-gradient problem: gradients are largest near the loss layer, and as they propagate backward through earlier layers their magnitude diminishes.

Training

  • Inception tackles this problem with auxiliary losses: it branches the convolution outputs of blocks 4a and 4d (see fig.) into their own mini classification heads, each ending in a standard softmax that predicts the same classes as the main task.
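  • As a concrete sketch, such an auxiliary head in PyTorch might look like the following (the pooling size, 128-channel 1x1 convolution, 1024-unit FC layer, and 0.7 dropout follow the GoogLeNet paper's description of the auxiliary classifiers; the class name and exact wiring here are illustrative):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AuxHead(nn.Module):
        """Mini classification head branched off an intermediate block (e.g. 4a or 4d)."""
        def __init__(self, in_channels, num_classes):
            super().__init__()
            self.conv = nn.Conv2d(in_channels, 128, kernel_size=1)  # 1x1 bottleneck
            self.fc1 = nn.Linear(128 * 4 * 4, 1024)
            self.fc2 = nn.Linear(1024, num_classes)

        def forward(self, x):
            x = F.adaptive_avg_pool2d(x, (4, 4))  # shrink the intermediate feature map to 4x4
            x = F.relu(self.conv(x))
            x = torch.flatten(x, 1)
            x = F.relu(self.fc1(x))
            x = F.dropout(x, p=0.7, training=self.training)  # heavy dropout regularizes the head
            return self.fc2(x)  # logits; softmax is applied inside the loss function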

  • Hence, during training there is a strong gradient flow from loss:0 that adjusts layer 4a (and the preceding layers) more aggressively. The same holds for layers 4b, 4c, and 4d, which receive strong gradients from loss:1, and for layers 4e onwards, which sit closest to loss:2.

  • This way, during training all the convolutional layers actively learn from losses computed at nearby layers. (Note that loss:2 still influences all the preceding layers; it is just that its gradient contribution diminishes with depth.)

  • All three losses are combined in a weighted fashion before calling Loss.backward():

Loss = loss:2 + (0.3 * loss:1) + (0.3 * loss:0)
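  • A sketch of how this combination might look in PyTorch (variable and function names are illustrative; the three logits come from the two auxiliary heads and the final head):

    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()  # applies softmax internally

    def combined_loss(main_logits, aux0_logits, aux1_logits, targets):
        loss2 = criterion(main_logits, targets)  # final head
        loss1 = criterion(aux1_logits, targets)  # head branched off 4d
        loss0 = criterion(aux0_logits, targets)  # head branched off 4a
        return loss2 + 0.3 * loss1 + 0.3 * loss0

    # loss = combined_loss(...); loss.backward() then propagates
    # gradients from all three heads in a single backward pass.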

  • Such an exercise also encourages the initial layers to learn more discriminative features, since gradients arrive from multiple sources.

  • Because it resembles ensemble learning (as in random forests), this practice also acts as added regularization.

Testing

  • During prediction, one has the choice of using the softmaxes from all three heads (loss:0, loss:1, and loss:2) to ensemble the predictions and take a weighted vote. GoogLeNet opted to use only the last softmax layer.
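  • As a sketch, such a test-time ensemble could be a weighted average of the three softmax outputs (the function and its weights are illustrative; here the 0.3/0.3/1.0 training weights are reused):

    import torch.nn.functional as F

    def ensembled_predict(main_logits, aux0_logits, aux1_logits, w=(0.3, 0.3, 1.0)):
        probs = (w[0] * F.softmax(aux0_logits, dim=1)
                 + w[1] * F.softmax(aux1_logits, dim=1)
                 + w[2] * F.softmax(main_logits, dim=1))
        return probs.argmax(dim=1)  # class with the highest ensembled probability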

References:

  • https://stats.stackexchange.com/a/274623
  • https://pdfs.semanticscholar.org/0b99/d677883883584d9a328f6f2d54738363997a.pdf (Slide 33)
  • https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202
  • Implementation code snippet: https://discuss.pytorch.org/t/how-to-optimize-inception-model-with-auxiliary-classifiers/7958/9
