- Deep networks suffer from the vanishing gradient problem: gradients are large near the loss layer, but as they propagate backward through earlier layers their magnitude diminishes.
- Inception (GoogLeNet) tackles this by using auxiliary losses (see fig.): the outputs of inception modules 4a and 4d are branched into small classification heads, each ending in a standard softmax that predicts the same classes as the main task.
- Hence during training, loss:0 provides a strong gradient signal that adjusts layer 4a (and the preceding layers) more aggressively. Similarly, layers 4b, 4c, and 4d receive strong gradients from loss:1, while layers from 4e onward sit close to the main classifier, loss:2. A sketch of how the three losses combine follows this list.
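Below is a minimal PyTorch-style sketch of this idea. The `AuxHead` module and the `backbone` returning intermediate feature maps are assumptions for illustration, not GoogLeNet's exact implementation; the 0.3 weighting on the auxiliary losses follows the original paper, and the auxiliary heads are discarded at inference time.

```python
import torch
import torch.nn as nn

class AuxHead(nn.Module):
    """Small classification head attached to an intermediate feature map
    (e.g. the outputs of inception modules 4a and 4d)."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(4)          # pool spatial dims to 4x4
        self.conv = nn.Conv2d(in_channels, 128, 1)   # 1x1 conv to reduce channels
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)
        self.fc2 = nn.Linear(1024, num_classes)      # same classes as the main head
        self.dropout = nn.Dropout(0.7)

    def forward(self, x):
        x = self.conv(self.pool(x))
        x = torch.flatten(x, 1)
        x = self.dropout(torch.relu(self.fc1(x)))
        return self.fc2(x)

criterion = nn.CrossEntropyLoss()

def training_loss(backbone, aux0, aux1, images, labels):
    # `backbone` is a hypothetical model returning
    # (features after 4a, features after 4d, main logits).
    feat_4a, feat_4d, main_logits = backbone(images)
    loss2 = criterion(main_logits, labels)       # loss:2, main classifier
    loss0 = criterion(aux0(feat_4a), labels)     # loss:0, head on 4a output
    loss1 = criterion(aux1(feat_4d), labels)     # loss:1, head on 4d output
    # Auxiliary losses are down-weighted (0.3 in the GoogLeNet paper), so the
    # earlier layers still get a strong, direct gradient signal during training.
    return loss2 + 0.3 * (loss0 + loss1)
```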