Coursera: Machine Learning; Exercise 4 Tutorial
This document was adapted from the section titled Tutorial for Ex.4 Forward and Backpropagation (Spring 2014 session) of the Coursera Programming Exercise 4 document on 2018-05-27.
I've reformatted the document, and made a few minor edits for legibility and consistency.
I do not know who the original author was; if you know please email [email protected], or just leave a comment in this GitHub gist. Thanks! 🙇
Expand the y output values into a matrix of single values (see ex4.pdf Page 5).
This is most easily done using an eye() matrix of size num_labels, with vectorized indexing by y, as in eye(num_labels)(y,:). A typical variable name would be y_matrix.
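A minimal sketch of this step, assuming `y` is an `m x 1` vector of labels in the range 1 to `num_labels` (note that the chained indexing is Octave syntax):

```octave
% Each row of y_matrix is all zeros except for a 1 in the column
% given by the corresponding label in y.
y_matrix = eye(num_labels)(y, :);   % size: m x num_labels
```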
- `a1` equals the `X` input matrix with a column of 1's added (bias units)
- `z2` equals the product of `a1` and `Θ1`
- `a2` is the result of passing `z2` through `g()`
- `a2` then has a column of 1's added (bias units)
- `z3` equals the product of `a2` and `Θ2`
- `a3` is the result of passing `z3` through `g()`
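Here is a sketch of those steps, assuming `X` is the `m x n` input matrix (without bias) and `sigmoid()` is your implementation of `g()`; the sizes shown in the comments are for the full character recognition data set:

```octave
a1 = [ones(m, 1) X];     % add the bias column: 5000 x 401
z2 = a1 * Theta1';       % 5000 x 25
a2 = sigmoid(z2);
a2 = [ones(m, 1) a2];    % add the bias column: 5000 x 26
z3 = a2 * Theta2';       % 5000 x 10
a3 = sigmoid(z3);
```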
Compute the unregularized cost according to ex4.pdf (top of Page 5).
ℹ️ Note: I had a hard time understanding this equation, mainly because I had the misconception that `y_k^(i)` is a vector; in fact it is simply one number.
- Use `a3`, `y_matrix`, and `m` (the number of training examples).
- Cost should be a scalar value. If you get a vector of cost values, you can sum that vector to get the cost.
- Remember to use element-wise multiplication with the `log()` function (see the sketch after this list).
- Now you can run `ex4.m` to check that the unregularized cost is correct; then you can submit Part 1 to the grader.
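A sketch of the unregularized cost, assuming `a3` and `y_matrix` are both `m x num_labels` as computed above:

```octave
% Element-wise products, then a double sum over all examples and labels.
J = (1 / m) * sum(sum(-y_matrix .* log(a3) - (1 - y_matrix) .* log(1 - a3)));
```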
Compute the regularized component of the cost according to ex4.pdf Page 6, using Θ1 and Θ2 (ignoring the columns of bias units), along with λ and m.
The easiest method to do this is to compute the regularization terms separately, then add them to the unregularized cost from Step 3.
You can run ex4.m to check the regularized cost, then you can submit Part 2 to the grader.
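A sketch of the regularization term, assuming `lambda` holds λ and the first column of each Θ matrix contains the bias weights:

```octave
% Square every non-bias weight, sum them all, and scale by lambda/(2*m).
reg = (lambda / (2 * m)) * (sum(sum(Theta1(:, 2:end) .^ 2)) + ...
                            sum(sum(Theta2(:, 2:end) .^ 2)));
J = J + reg;   % add to the unregularized cost from Step 3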
You'll need to prepare the sigmoid gradient function g′(), as shown in ex4.pdf Page 7.
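A sketch of the sigmoid gradient, assuming you already have a working `sigmoid()` from the earlier steps:

```octave
function g = sigmoidGradient(z)
  % g'(z) = g(z) .* (1 - g(z)); computed element-wise, so z may be
  % a scalar, a vector, or a matrix.
  g = sigmoid(z) .* (1 - sigmoid(z));
end
```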
You can submit Part 3 to the grader.
Implement the random initialization function as instructed on ex4.pdf, top of Page 8.
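A sketch following the approach in ex4.pdf; the 0.12 value for `epsilon_init` is the one suggested there:

```octave
function W = randInitializeWeights(L_in, L_out)
  % Small random values in [-epsilon_init, epsilon_init] break the
  % symmetry between the hidden units during training.
  epsilon_init = 0.12;
  W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;
end
```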
You do not submit this function to the grader.
Now we work from the output layer back to the hidden layer, calculating how bad the errors are.
See ex4.pdf Page 9 for reference.
- `δ3` equals the difference between `a3` and the `y_matrix`.
- `δ2` equals the product of `δ3` and `Θ2` (ignoring the `Θ2` bias units), then multiplied element-wise by the `g′()` of `z2` (computed back in Step 2).
Note that at this point, the instructions in `ex4.pdf` are specific to looping implementations, so the notation there is different.
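A vectorized sketch of those two steps, using the `sigmoidGradient()` function from Step 5; note how the bias column of `Θ2` is dropped so the sizes line up with `z2`:

```octave
d3 = a3 - y_matrix;                                   % 5000 x 10
d2 = (d3 * Theta2(:, 2:end)) .* sigmoidGradient(z2);  % 5000 x 25
```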
- `Δ2` equals the product of `δ3` and `a2`. This step calculates the product and sum of the errors.
- `Δ1` equals the product of `δ2` and `a1`. This step calculates the product and sum of the errors.
Now we calculate the non-regularized theta gradients, using the sums of the errors we just computed (See ex4.pdf bottom of Page 11).
- `Θ1` gradient equals `Δ1` scaled by `1/m`.
- `Θ2` gradient equals `Δ2` scaled by `1/m`.
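A sketch of the accumulators and gradients; each single matrix product performs both the multiplication and the sum over all `m` examples:

```octave
Delta1 = d2' * a1;            % 25 x 401
Delta2 = d3' * a2;            % 10 x 26
Theta1_grad = Delta1 / m;
Theta2_grad = Delta2 / m;
```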
The `ex4.m` script will also perform gradient checking for you, using a smaller test case than the full character classification example. So if you're debugging your `nnCostFunction()` using the `keyboard` command during this, you'll suddenly be seeing some much smaller sizes of `X` and the `Θ` values. Do not be alarmed.
If the feedback provided to you by ex4.m for gradient checking seems OK, you can now submit Part 4 to the grader.
For reference, see ex4.pdf, top of Page 12, for the right-most terms of the equation for j ≥ 1.
Now we calculate the regularization terms for the theta gradients.
The goal is that regularization of the gradient should not change the theta gradient(:,1) values (for the bias units) calculated in Step 8.
There are several ways to implement this (in Steps 9a and 9b).
Method 1
- a. Calculate the regularization for indexes `(:,2:end)`.
- b. Add ☝️ them to theta gradients `(:,2:end)`.
Method 2
- a. Calculate the regularization for the entire theta gradient, then overwrite the `(:,1)` value with 0.
- b. Add ☝️ to the entire matrix.
Pick a method, and calculate the regularization terms as follows:
- `(λ/m) * Θ1` (using either Method 1 or Method 2), and...
- `(λ/m) * Θ2` (using either Method 1 or Method 2)
Add these regularization terms to the appropriate Θ1 gradient and Θ2 gradient terms from Step 8 (using either Method 1 or Method 2).
Avoid modifying the bias unit of the theta gradients.
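A sketch of Method 2, assuming `Theta1_grad` and `Theta2_grad` hold the unregularized gradients from Step 8:

```octave
% Scale the thetas, zero out the bias column so it is not regularized,
% then add to the unregularized gradients.
Theta1_reg = (lambda / m) * Theta1;
Theta1_reg(:, 1) = 0;
Theta2_reg = (lambda / m) * Theta2;
Theta2_reg(:, 1) = 0;
Theta1_grad = Theta1_grad + Theta1_reg;
Theta2_grad = Theta2_grad + Theta2_reg;
```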
⚠️ Note: there is an erratum in the lecture video and slides (some missing parentheses) for this calculation. The `ex4.pdf` file is correct.
The `ex4.m` script will provide you feedback regarding the acceptable relative difference. If all seems well, you can submit Part 5 to the grader. Now pat yourself on the back.
Here are the sizes for the character recognition example, using the method described in this tutorial:
- `a1`: 5000x401
- `z2`: 5000x25
- `a2`: 5000x26
- `a3`: 5000x10
- `d3`: 5000x10
- `d2`: 5000x25
- `Theta1`, `Delta1`, and `Theta1_grad`: 25x401
- `Theta2`, `Delta2`, and `Theta2_grad`: 10x26
ℹ️ Note: The `ex4.m` script uses several test cases of different sizes, and the submit grader uses yet another different test case.