- The paper explores the challenges involved in training deep networks, studies the effect of unsupervised pre-training on the training process, and visualizes the error landscape for deep architectures.
- Link to the paper
- Datasets used - Shapeset and MNIST.
- Train deep architectures with a variable number of layers, with and without greedy layer-wise unsupervised pre-training (a minimal sketch of this setup is included after this list).
- Weights are initialized by sampling uniformly from [-1/sqrt(k), 1/sqrt(k)], where k is the fan-in of the unit.
- Increasing depth without pre-training causes the error rate to climb faster than it does with pre-training.
- Pre-training also makes the network more robust to the choice of random initialization.
- At the same level of training cost, pre-trained models systematically yield a lower test cost than randomly initialized ones.
- Pre-training seems to be most advantageous for smaller training sets.
- Pre-training appears to act as a regularizer: it restricts the set of final parameter configurations that can be reached, thereby decreasing variance and introducing a bias.
- Pre-training helps for larger layers (more units per layer) and for deeper networks, but for small networks it can hurt performance.
- Since small networks have low capacity, and extra regularization hurts when capacity is already limited, this supports the hypothesis that pre-training acts as a kind of regularizer.
- Pre-training seems to provide better marginal conditioning of the weights, but conditioning is not its only benefit: it also captures more intricate dependencies among the weights.
- Pre-training the lower layers has a larger impact than pre-training the layers closer to the output.
- The error landscape appears flatter for deeper architectures and for pre-trained models.
- Learning trajectories of pre-trained and non-pre-trained models start and stay in different regions of function space. Within each group, trajectories initially move together but later diverge (a sketch of the function-space view follows below).
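
Below is a minimal PyTorch sketch of the setup referenced above: deep networks trained with and without greedy layer-wise unsupervised pre-training, using the fan-in-scaled uniform initialization. This is not the paper's code (the paper pre-trains with RBMs and denoising autoencoders); a plain autoencoder stands in here, and the layer sizes, learning rates, epoch counts, and toy data are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def init_fan_in(linear: nn.Linear) -> None:
    """Sample weights uniformly from [-1/sqrt(k), 1/sqrt(k)], where k is the fan-in."""
    bound = 1.0 / math.sqrt(linear.in_features)
    nn.init.uniform_(linear.weight, -bound, bound)
    nn.init.zeros_(linear.bias)


def build_layers(sizes):
    """Create a stack of linear layers, e.g. sizes = [784, 500, 500, 10] (illustrative)."""
    layers = [nn.Linear(n_in, n_out) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    for layer in layers:
        init_fan_in(layer)
    return layers


def pretrain_layer(layer, data, epochs=5, lr=0.01):
    """Greedy unsupervised pre-training of one layer as a plain autoencoder
    (a stand-in for the paper's RBM / denoising-autoencoder pre-training)."""
    decoder = nn.Linear(layer.out_features, layer.in_features)
    init_fan_in(decoder)
    opt = torch.optim.SGD(list(layer.parameters()) + list(decoder.parameters()), lr=lr)
    for _ in range(epochs):
        hidden = torch.sigmoid(layer(data))
        recon = torch.sigmoid(decoder(hidden))
        loss = F.mse_loss(recon, data)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Return the representation fed to the next layer's pre-training.
    with torch.no_grad():
        return torch.sigmoid(layer(data))


def finetune(layers, data, labels, epochs=10, lr=0.01):
    """Supervised fine-tuning of the whole stack (sigmoid hidden units, softmax output)."""
    modules = []
    for layer in layers[:-1]:
        modules += [layer, nn.Sigmoid()]
    model = nn.Sequential(*modules, layers[-1])
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        loss = F.cross_entropy(model(data), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model


if __name__ == "__main__":
    # Toy stand-in data; the paper uses Shapeset and MNIST.
    X = torch.rand(256, 784)
    y = torch.randint(0, 10, (256,))

    # With pre-training: greedily pre-train each hidden layer, then fine-tune.
    layers = build_layers([784, 500, 500, 10])
    h = X
    for layer in layers[:-1]:
        h = pretrain_layer(layer, h)
    pretrained_model = finetune(layers, X, y)

    # Without pre-training: fine-tune freshly initialized layers directly.
    baseline_model = finetune(build_layers([784, 500, 500, 10]), X, y)
```

Repeating the "with pre-training" run over several random seeds and comparing against the baseline run is the kind of comparison behind the depth, robustness, and training-set-size observations above.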
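
As a hypothetical illustration of what a trajectory "in function space" means operationally: each checkpoint of a model is summarized by its outputs on a fixed probe set, and the resulting sequence of high-dimensional points is projected to 2-D for plotting. Plain PCA via torch.pca_lowrank is used here purely as a stand-in projection; the probe set and per-epoch checkpointing are assumptions, not the paper's exact procedure.

```python
import torch


def function_space_point(model, probe_inputs):
    """Summarize a model checkpoint by its concatenated outputs on a fixed probe set."""
    with torch.no_grad():
        return model(probe_inputs).flatten()


def project_to_2d(points):
    """Project a (num_checkpoints, d) matrix of function-space points to 2-D via PCA."""
    centered = points - points.mean(dim=0)
    _, _, v = torch.pca_lowrank(centered, q=2)
    return centered @ v  # (num_checkpoints, 2), one row per checkpoint


# Usage idea: record function_space_point(model, probe_inputs) after every epoch for
# both pre-trained and randomly initialized runs, stack the records with torch.stack,
# and plot project_to_2d(...) to compare the two families of trajectories.
```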