1. What Linear Regression training algorithm can you use if you have a training set with millions of features?
You could use batch gradient descent, stochastic gradient descent (SGD), or mini-batch gradient descent (MBGD). SGD and MBGD would work best because neither needs to load the entire dataset into memory to take one gradient descent step (see the sketch below). Batch gradient descent would also be fine, with the caveat that you have enough memory to load all the data.
The normal equations method would not be a good choice because it is computationally inefficient with that many features. The main cost comes from inverting an (n x n) matrix, where n is the number of features.
That inversion has a computational complexity of roughly O(n^2.4) to O(n^3), depending on the implementation.
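As a minimal sketch (the data shapes and hyperparameters here are arbitrary assumptions, just to show the API), scikit-learn's SGDRegressor trains one instance at a time and never forms or inverts an (n x n) matrix:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Hypothetical training set: many features relative to instances.
X = np.random.randn(1_000, 5_000)
y = X @ np.random.randn(5_000) + 0.1 * np.random.randn(1_000)

# Each SGD step uses a single instance, so per-step cost and memory
# stay manageable even when the number of features is huge.
sgd_reg = SGDRegressor(max_iter=100, tol=1e-3)
sgd_reg.fit(X, y)
```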
2. Suppose the features in your training set have very different scales: what algorithms might suffer from this, and how? What can you do about it?
The normal equations method does not require feature scaling, so it is unaffected by features with very different scales.
The gradient descent algorithms, on the other hand, do suffer: with unscaled features the cost function becomes an elongated bowl, so gradient descent zigzags and takes much longer to converge. Scaling the features first makes the cost contours more circular, letting gradient descent head more directly toward the minimum.
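For example, a pipeline that standardizes the features before running SGD might look like this (the data values are made up for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

# Two features on wildly different scales.
X = np.array([[1.0, 2000.0], [2.0, 3500.0], [3.0, 1200.0]])
y = np.array([1.0, 2.0, 3.0])

# StandardScaler gives each feature zero mean and unit variance
# before SGD sees the data, which speeds up convergence.
model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000))
model.fit(X, y)
```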
3. Can Gradient Descent get stuck in a local minimum when training a Logistic Regression model?
No. The Logistic Regression cost function (log loss) is convex, so it has a single global minimum and no local minima. Therefore gradient descent cannot get stuck in a local minimum.
4. Do all Gradient Descent algorithms lead to the same model, provided you let them run long enough?
No. The issue is that stochastic gradient descent and mini-batch gradient descent have randomness built into them: they can find their way to the vicinity of the global optimum, but they generally keep bouncing around it rather than truly converging. One way to help them converge is to gradually reduce the learning rate hyperparameter.
5. Suppose you use Batch Gradient Descent and you plot the validation error at every epoch: if you notice that the validation error consistently goes up, what is likely going on? How can you fix this?
If the validation error consistently goes up after every epoch, one possibility is that the learning rate is too high and the algorithm is diverging. If the training error also goes up, that is clearly the problem, and you should reduce the learning rate. If the training error is not going up, however, your model is overfitting the training set: you should stop training and apply the usual remedies for overfitting (regularization, more training data, fixing errors in the data, removing outliers, or reducing the number of features).
6. Is it a good idea to stop Mini-batch Gradient Descent immediately when the validation error goes up?
Not necessarily. Neither mini-batch nor stochastic gradient descent is guaranteed to reduce the cost function at every step, because both have a degree of randomness built in: mini-batch randomly chooses a small batch of training examples for each step, while stochastic uses a single random example. A better option is to save the model at regular intervals; when it has not improved for a long time, revert to the best saved model (a rough sketch of this follows).
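Here is a rough sketch of that save-the-best-model strategy, assuming scikit-learn's SGDRegressor with warm_start so each fit call continues training (the data, split, epoch count, and learning rate are illustrative assumptions):

```python
import copy
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X = np.random.randn(500, 5)
y = X @ np.array([1.0, -2.0, 0.5, 3.0, -1.0]) + np.random.randn(500)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

# warm_start=True + max_iter=1: each fit() call runs one more epoch.
sgd = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                   learning_rate="constant", eta0=0.001)
best_error, best_model = float("inf"), None

for epoch in range(500):
    sgd.fit(X_train, y_train)
    val_error = mean_squared_error(y_val, sgd.predict(X_val))
    if val_error < best_error:
        # Snapshot the best model so far instead of stopping immediately.
        best_error = val_error
        best_model = copy.deepcopy(sgd)

# best_model is the one to keep, even if later epochs got worse.
```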
7. Which Gradient Descent algorithm (among those we discussed) will reach the vicinity of the optimal solution the fastest. Which will actually converge? How can you make the others converge as well?
Stochastic gradient descent will reach the vicinity of the optimal solution the fastest, because it considers only one training example at a time (mini-batch is a close second). However, only batch gradient descent will actually converge, given a small enough learning rate and enough patience. Since SGD and MBGD are both random, one strategy to help them converge is a learning schedule: gradually reducing the learning rate so they take smaller and smaller steps as they approach the global minimum (sketched below).
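A from-scratch sketch of SGD with such a learning schedule; the schedule constants t0 and t1, the data, and the epoch count are arbitrary choices, not recommendations:

```python
import numpy as np

# Synthetic 1-D regression problem: y = 4 + 3x + noise.
X = np.random.randn(100, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(100)
X_b = np.c_[np.ones((100, 1)), X]  # add the bias feature

t0, t1 = 5, 50
def learning_schedule(t):
    return t0 / (t + t1)  # step size shrinks as training progresses

theta = np.random.randn(2)
n_epochs, m = 50, len(X_b)
for epoch in range(n_epochs):
    for i in range(m):
        idx = np.random.randint(m)        # pick one random instance
        xi, yi = X_b[idx], y[idx]
        gradient = 2 * xi * (xi @ theta - yi)
        theta -= learning_schedule(epoch * m + i) * gradient
```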
8. Suppose you are using Polynomial regression, you plot the learning curves and you notice that there is a large gap between the training error and the validation error. What is happening? What are three ways to solve this?
A large gap between the training error and the validation error in the learning curves is characteristic of an overfitting model: the gap exists because the training error is markedly lower than the validation error. One way to improve an overfitting model is to provide more training data. Another is to reduce the complexity of the model, for example by lowering the polynomial degree or reducing the number of features in your data. A third is to add regularization to your model; either L2 (ridge regression) or L1 (lasso) is a good choice. A sketch for plotting learning curves follows.
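A sketch of plotting learning curves (the helper name plot_learning_curves, the 80/20 split, and the synthetic data are assumptions for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(2, len(X_train) + 1):  # train on ever-larger subsets
        model.fit(X_train[:m], y_train[:m])
        train_errors.append(mean_squared_error(y_train[:m], model.predict(X_train[:m])))
        val_errors.append(mean_squared_error(y_val, model.predict(X_val)))
    # A persistent gap between the two curves signals overfitting.
    plt.plot(np.sqrt(train_errors), "r-+", label="train")
    plt.plot(np.sqrt(val_errors), "b-", label="validation")
    plt.xlabel("training set size")
    plt.ylabel("RMSE")
    plt.legend()

X = np.random.randn(100, 1)
y = 2 + 3 * X[:, 0] + np.random.randn(100)
plot_learning_curves(LinearRegression(), X, y)
plt.show()
```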
9. Suppose you are using Ridge regression and you notice that the training error and the validation error are almost equal and fairly high: would you say that the model suffers from high bias or high variance? Should you increase the regularization hyperparameter α or reduce it?
When the training error and the validation error are close to each other and both fairly high, the model is underfitting the training set, i.e. it suffers from high bias. You should reduce the regularization hyperparameter α.
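As a small illustrative sketch (the alpha grid and synthetic data are assumptions), you could shrink alpha step by step and check whether the cross-validated error improves:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X = np.random.randn(200, 10)
y = X @ np.random.randn(10) + np.random.randn(200)

# Smaller alpha => weaker penalty => lower bias (but higher variance).
for alpha in (10.0, 1.0, 0.1):
    ridge = Ridge(alpha=alpha)
    score = cross_val_score(ridge, X, y, scoring="neg_mean_squared_error").mean()
    print(alpha, -score)  # see whether the error drops as alpha shrinks
```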
10. Why would you want to use:
In general you should prefer at least some level of regularization, so it is usually a good idea to avoid plain linear regression. Ridge is a good default, but if you suspect that only a few of the features are actually useful, you should go with lasso or elastic net regularization.
• Ridge regression instead of Linear Regression?
Use Ridge regression when your model is overfitting the training set (or simply by default): its l2 penalty constrains the weights, which reduces variance.
• Lasso instead of Ridge regression?
Lasso performs automatic feature selection by eliminating the weights of the least important features. Concretely, Lasso regression uses an l1 penalty, which tends to push weights down to exactly zero. This leads to sparse models, where all weights are zero except those of the most important features. This automatic feature selection is good if you suspect that only a few features actually matter. When you are not sure, you should prefer Ridge regression.
• Elastic Net instead of Lasso?
Elastic Net is generally preferred over Lasso because Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated. Elastic Net mixes the l1 and l2 penalties, with the balance controlled by its l1_ratio hyperparameter (a comparison sketch of the three variants follows).
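For reference, a minimal side-by-side sketch of the three variants in scikit-learn (the alpha and l1_ratio values and the synthetic data are illustrative, not tuned):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

X = np.random.randn(100, 20)
y = 3 * X[:, 0] - 2 * X[:, 1] + np.random.randn(100)  # only 2 features matter

ridge = Ridge(alpha=1.0)                       # l2: shrinks weights smoothly
lasso = Lasso(alpha=0.1)                       # l1: zeroes out weak features
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)  # mix of both penalties

ridge.fit(X, y); lasso.fit(X, y); elastic.fit(X, y)
print((lasso.coef_ == 0).sum(), "coefficients zeroed by Lasso")
```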
11. Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime, should you implement two Logistic Regression classifiers or one Softmax Regression classifier?
These are not mutually exclusive classes: a picture can be both outdoor and nighttime, for example. Softmax regression predicts exactly one class out of a set of mutually exclusive classes, so it cannot output two independent labels (i.e. [outdoor, daytime]) at once. You'll need to use two logistic regression classifiers, one for outdoor/indoor and one for daytime/nighttime, as sketched below.
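A minimal sketch with two independent binary classifiers (the features and label rules are placeholders, not a real image pipeline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.randn(300, 8)            # hypothetical image features
y_outdoor = (X[:, 0] > 0).astype(int)  # 1 = outdoor, 0 = indoor
y_daytime = (X[:, 1] > 0).astype(int)  # 1 = daytime, 0 = nighttime

# One classifier per independent binary question.
outdoor_clf = LogisticRegression().fit(X, y_outdoor)
daytime_clf = LogisticRegression().fit(X, y_daytime)

# A picture can be outdoor AND nighttime; Softmax could only pick one class.
print(outdoor_clf.predict(X[:1]), daytime_clf.predict(X[:1]))
```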