Is it okay to initialize all the weights to the same value as long as that value is selected randomly using He initialization?
No. All weights should be initialized to different random values. If the weights are symmetrical, meaning they all share the same value, backpropagation has no way to break that symmetry, making it almost impossible to converge to a good solution.
Think of it this way: if all the weights are the same, it's like having just one neuron per layer, but much slower.
The technique we use to break this symmetry is to sample weights randomly.
Yes, this is fine. It does not end up making much of a difference. It's been my practice to initialize the bias term to 1.
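As a minimal sketch of these two answers (assuming the tf.keras API), a layer's weights get different random He-initialized values, while all the bias terms can safely share one constant initial value:

```python
import tensorflow as tf

# He initialization draws each weight from a different random value, which
# breaks symmetry; the bias terms can all start from the same constant.
layer = tf.keras.layers.Dense(
    100,
    activation="elu",
    kernel_initializer="he_normal",                         # random He-initialized weights
    bias_initializer=tf.keras.initializers.Constant(1.0),   # a shared constant bias is fine
)
```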
- It can take on negative values, so the average output of the neurons in any given layer is typically closer to 0 than when using the ReLU function. This helps alleviate the vanishing gradients problem: gradients often get smaller and smaller as the algorithm progresses down to the lower layers, so the gradient descent update leaves the lower-layer connection weights virtually unchanged and training never converges to a good solution.
- It always has a non-zero derivative, which avoids the dying units issue that can affect ReLU units. "Dying ReLUs" is the problem where units stop outputting anything other than 0. In some cases, you may find that half of your network's neurons are dead, especially if you use a large learning rate.
- It is smooth everywhere, which helps gradient descent converge faster. ReLU's slope abruptly jumps from 0 to 1 at z = 0. Such an abrupt change can slow down gradient descent because it will bounce around z = 0.
(For vanishing gradients see p. 275) (For ELU see p. 281)
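As a rough illustration of these points, here is a small NumPy sketch of the ELU function and its derivative (using the usual alpha hyperparameter); it is not tied to any particular framework:

```python
import numpy as np

def elu(z, alpha=1.0):
    # ELU: z for z > 0, and alpha * (exp(z) - 1) for z <= 0.
    # The negative outputs pull each layer's mean activation closer to 0.
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def elu_derivative(z, alpha=1.0):
    # The derivative is never exactly 0 (it only approaches 0 as z -> -inf),
    # so ELU units do not "die" the way ReLU units can.
    return np.where(z > 0, 1.0, alpha * np.exp(z))

z = np.linspace(-3, 3, 7)
print(elu(z))
print(elu_derivative(z))
```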
In which cases would you want to use each of the following activation functions: ELU, leaky ReLU, ReLU, tanh, logistic, and softmax?
The ELU function is a good default.
If you need the network to be as fast as possible, you can use one of the leaky ReLU variants.
The basic ReLU function is also a reasonable choice due to its simplicity, despite the fact that it is generally outperformed by ELU and leaky ReLU. If you have a situation where it's preferable for neurons to output exactly 0, then ReLU is a good choice.
tanh can be useful in the output layer if you need your outputs to be between -1 and 1. It's not used too much in hidden layers.
The logistic function is useful in the output layer when you need to estimate a probability, as in binary classification or multi-label classification (one logistic output per label); for mutually exclusive classes, softmax is used instead. Like tanh, the logistic function is rarely used in hidden layers.
The softmax function is useful in the output layer when you need to output probabilities for mutually exclusive classes. Again, it's not used in hidden layers.
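The following sketch (assuming the tf.keras API; the layer sizes are placeholders) shows typical output-layer choices for the cases above:

```python
import tensorflow as tf

# Hypothetical output layers; the number of units depends on the task.
binary_output = tf.keras.layers.Dense(1, activation="sigmoid")       # logistic: P(positive class)
multilabel_output = tf.keras.layers.Dense(5, activation="sigmoid")   # one independent probability per label
multiclass_output = tf.keras.layers.Dense(10, activation="softmax")  # mutually exclusive classes
bounded_output = tf.keras.layers.Dense(1, activation="tanh")         # output between -1 and 1
```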
What may happen if you set the momentum hyperparameter too close to 1 (0.999999999999) when using the MomentumOptimizer?
The purpose of the momentum hyperparameter β is to simulate friction and prevent the momentum from growing too large. Setting it to 0 gives high friction, and setting it to 1 gives no friction.
As the hyperparameter gets closer to 1, there will be less friction and the momentum optimization will "roll faster down the hill." This means that the optimizer will overshoot, then come back, overshoot again, and oscillate like this many times before stabilizing at the minimum. This is one of the reasons why it is good to have a bit of friction in the system: it gets rid of these oscillations and thus speeds up convergence.
(See p. 295)
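Here is a minimal NumPy sketch of a simplified form of the momentum update described above (the learning rate, beta, and the toy quadratic objective are illustrative values, not a reference implementation):

```python
import numpy as np

def momentum_step(theta, m, grad, lr=0.01, beta=0.9):
    # Momentum optimization: m accumulates past gradients, and (1 - beta)
    # plays the role of friction. With beta very close to 1 there is almost
    # no friction, so the parameters overshoot and oscillate around the
    # minimum before settling.
    m = beta * m - lr * grad
    theta = theta + m
    return theta, m

# Toy 1-D objective f(theta) = theta**2, whose gradient is 2 * theta.
theta, m = 10.0, 0.0
for step in range(200):
    theta, m = momentum_step(theta, m, grad=2 * theta, lr=0.01, beta=0.999)
print(theta)  # with beta = 0.999, theta is still oscillating far from the minimum
```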
A sparse model is one in which most weights are equal to 0. There are a couple of ways to achieve this effect.
You can train the model normally, then zero out the tiny weights.
For more sparsity, you can apply l1 regularization during training, which pushes the optimizer towards sparsity.
Finally, you can combine l1 regularization with dual averaging using TensorFlow's FTRLOptimizer class.
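As a rough sketch of these options (assuming the tf.keras API; the regularization strength and pruning threshold are arbitrary illustration values, and tf.keras.optimizers.Ftrl is used here as the Keras counterpart of the FTRLOptimizer class mentioned above):

```python
import numpy as np
import tensorflow as tf

# Option 1: push weights toward 0 with l1 regularization during training.
sparse_layer = tf.keras.layers.Dense(
    100, activation="elu",
    kernel_regularizer=tf.keras.regularizers.l1(0.001))

# Option 2: after normal training, zero out the tiny weights (threshold is arbitrary).
def prune_tiny_weights(weights, threshold=1e-3):
    return np.where(np.abs(weights) < threshold, 0.0, weights)

# Option 3: the Ftrl optimizer supports l1 regularization (FTRL / dual averaging).
ftrl = tf.keras.optimizers.Ftrl(learning_rate=0.01, l1_regularization_strength=0.001)
```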
Dropout is a popular regularization technique for deep neural networks.
The algorithm is: at each training step, every neuron (including the input neurons but excluding the output neurons) has a probability p of being temporarily "dropped out", meaning it will be entirely ignored during this training step. However, it may be active during the next step.
The hyperparameter p is called the dropout rate, and it is typically set to 0.5.
After training, neurons don't get dropped anymore. That's the gist of the algorithm.
Dropout does tend to significantly slow down convergence, but it usually results in a much better model when tuned properly. Remember, because dropout is only active during training, it does not slow down inference.
(See p. 303-305)
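Here is a small NumPy sketch of the training-time mechanics described above, using the common "inverted dropout" variant in which surviving activations are scaled by 1 / (1 - rate) during training so that nothing needs to change at inference time:

```python
import numpy as np

def dropout_forward(activations, rate=0.5, training=True):
    # During training, each neuron is dropped with probability `rate`;
    # the survivors are scaled up so the expected output stays the same.
    if not training:
        return activations          # after training, neurons are never dropped
    keep_prob = 1.0 - rate
    mask = np.random.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob

h = np.random.randn(4, 10)          # a batch of hidden-layer activations
print(dropout_forward(h, rate=0.5, training=True))
```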
- Build a DNN with five hidden layers of 100 neurons each, using He initialization for the weights and the ELU activation function. (A minimal sketch of this setup, together with the transfer-learning steps, follows this list.)
- Using Adam optimization and early stopping, try training it on MNIST, but only on digits 0 to 4. (We will use transfer learning for digits 5 to 9 in the next exercise.) You will need a softmax output layer with five neurons, and as always make sure to save checkpoints at regular intervals and save the final model so you can reuse it later.
- Tune the hyperparameters using cross-validation and see what precision you can achieve.
- Now try adding batch normalization and compare the learning curves: is it converging faster than before? Does it produce a better model?
- Is the model overfitting the training set? Try adding dropout to every layer and try again. Does it help?
- Create a new DNN that reuses all the pretrained hidden layers of the previous model, freezes them, and replaces the softmax output layer with a fresh new one.
- Train this new DNN on digits 5 to 9, using only 100 images per digit, and time how long it takes. Despite this small number of examples, can you achieve high precision?
- Try caching the frozen layers, and train the model again: how much faster is it now?
- Try again reusing just four hidden layers instead of five. Can you achieve a higher precision?
- Now unfreeze the top two hidden layers and continue training: can you get the model to perform even better?
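Below is a minimal sketch of the overall shape of these exercises, assuming the tf.keras API. Data loading, label remapping, hyperparameter tuning, batch normalization, and dropout are left out, and the variable names (x_train_04, y_train_59, the checkpoint path, etc.) are placeholders:

```python
import tensorflow as tf

# Hypothetical data: x_train_04 / y_train_04 hold only digits 0-4, and
# x_train_59 / y_train_59 hold 100 images per digit for 5-9, with labels
# remapped to 0-4 so they match a 5-unit softmax output.

# 1) Five hidden layers of 100 neurons, He initialization, ELU activation.
model = tf.keras.Sequential([tf.keras.layers.Flatten(input_shape=(28, 28))])
for _ in range(5):
    model.add(tf.keras.layers.Dense(100, activation="elu",
                                    kernel_initializer="he_normal"))
model.add(tf.keras.layers.Dense(5, activation="softmax"))  # digits 0-4

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.Adam(),
              metrics=["accuracy"])

callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint("mnist_0_to_4.keras", save_best_only=True),
]
# model.fit(x_train_04, y_train_04, epochs=100,
#           validation_split=0.1, callbacks=callbacks)

# 2) Transfer learning: reuse and freeze the hidden layers, replace the output layer.
new_model = tf.keras.Sequential(model.layers[:-1])   # drop the old softmax layer
for layer in new_model.layers:
    layer.trainable = False                          # freeze the reused layers
new_model.add(tf.keras.layers.Dense(5, activation="softmax"))  # digits 5-9

new_model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=["accuracy"])
# new_model.fit(x_train_59, y_train_59, epochs=100, validation_split=0.1)
```

To unfreeze the top hidden layers later, set `layer.trainable = True` on those layers and compile the model again before continuing training.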