Question: Why do we use gradient descent in linear regression?
From https://stackoverflow.com/questions/26804656/why-do-we-use-gradient-descent-in-linear-regression
In some machine learning classes I took recently, we covered gradient descent to find the best-fit line for linear regression.
In some statistics classes, I have learnt that we can compute this line using statistical analysis, using the mean and standard deviation - this page covers the approach in detail: http://onlinestatbook.com/2/regression/intro.html. Why is this seemingly simpler technique not used in machine learning?
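(For concreteness, the statistics-class formula boils down to slope = r * (sd_y / sd_x) and intercept = mean_y - slope * mean_x. A minimal sketch of that calculation, assuming NumPy and made-up data:)

```python
import numpy as np

# One-dimensional closed-form fit from summary statistics:
# slope = r * (sd_y / sd_x), intercept = mean_y - slope * mean_x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]              # Pearson correlation between x and y
slope = r * y.std() / x.std()            # best-fit slope
intercept = y.mean() - slope * x.mean()  # best-fit intercept
print(slope, intercept)
```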
My question is, is gradient descent the preferred method for fitting linear models? If so, why? Or did the professor simply use gradient descent in a simpler setting to introduce the class to the technique?
Answer: Answer 1: The example you gave is one-dimensional, which is not usually the case in machine learning, where you have multiple input features. In that case, you need to invert a matrix to use that simple approach, and the matrix can be expensive to invert or ill-conditioned.
Usually the problem is formulated as a least-squares problem, which is slightly easier. There are standard least-squares solvers which could be used instead of gradient descent (and often are). If the number of data points is very high, using a standard least-squares solver might be too expensive, and (stochastic) gradient descent might give you a solution that is as good in terms of test-set error as a more precise solution, with a run-time that is orders of magnitude smaller (see this great chapter by Leon Bottou: http://leon.bottou.org/publications/pdf/mloptbook-2011.pdf).
If your problem is small enough that it can be efficiently solved by an off-the-shelf least-squares solver, you should probably not use gradient descent; see the sketch below.
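(A rough sketch of what "just use an off-the-shelf least-squares solver" looks like with multiple features, assuming NumPy and synthetic data:)

```python
import numpy as np

# Multi-feature least squares via a standard solver (no gradient descent needed
# when the problem is small enough). X has one column per feature plus a bias column.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                    # 1000 samples, 5 features
true_w = np.array([1.5, -2.0, 0.5, 3.0, -1.0])
y = X @ true_w + 4.0 + rng.normal(scale=0.1, size=1000)

X_b = np.hstack([X, np.ones((len(X), 1))])        # append bias column
w, residuals, rank, sv = np.linalg.lstsq(X_b, y, rcond=None)
print(w)   # approximately [1.5, -2.0, 0.5, 3.0, -1.0, 4.0]
```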
Answer 2: Basically, 'gradient descent' is a general optimization technique that can be used to minimize ANY cost function. It is often used when the optimum cannot be computed in closed form.
So let's say we want to minimize a cost function. What ends up happening in gradient descent is that we start from some random initial point and we try to move in the 'gradient direction' in order to decrease the cost function. We move step by step until there is no decrease in the cost function. At this time we are at the minimum point. To make it easier to understand, imagine a bowl and a ball. If we drop the ball from some initial point on the bowl it will move until it is settled at the bottom of the bowl.
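(A minimal sketch of that idea on a one-variable "bowl" cost; the function, starting point, and step size are made up for illustration:)

```python
# Plain gradient descent on a bowl-shaped cost f(w) = (w - 3)^2.
def cost(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)    # derivative of the cost

w = 10.0          # some initial point on the bowl
lr = 0.1          # step size (learning rate)
for _ in range(100):
    step = lr * grad(w)
    if abs(step) < 1e-8:      # stop once the cost no longer decreases noticeably
        break
    w -= step                 # move in the direction that decreases the cost
print(w, cost(w))             # w ends up near 3, the bottom of the bowl
```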
As gradient descent is a general algorithm, one can apply it to any problem that requires optimizing a cost function. In the regression problem, the cost function that is often used is the mean square error (MSE). Finding a closed-form solution requires inverting a matrix that is often ill-conditioned (its determinant is very close to zero, so the inverse is not numerically robust). To circumvent this problem, people often take the gradient descent approach to find a solution, which does not suffer from this ill-conditioning problem.
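(A minimal sketch of fitting linear regression by gradient descent on the MSE cost, with no matrix inversion; synthetic data and no bias term, for brevity:)

```python
import numpy as np

# Gradient descent on the MSE cost for linear regression.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

w = np.zeros(3)            # initial guess for the weights
lr = 0.05                  # learning rate
for _ in range(2000):
    pred = X @ w
    grad = 2.0 / len(y) * X.T @ (pred - y)   # gradient of MSE with respect to w
    w -= lr * grad                           # step against the gradient
print(w)   # approaches the least-squares solution [2.0, -1.0, 0.5]
```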
StatPlus getting started - http://www.analystsoft.com/en/products/statplus/content/help/statplus_getting_started.html
An Introduction to Gradient Descent and Linear Regression - https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/
Gradient Descent Simplified - http://ucanalytics.com/blogs/intuitive-machine-learning-gradient-descent-simplified/
Good Example: https://github.com/Apress/mastering-ml-w-python-in-six-steps/blob/master/Chapter_3_Code/Code/Regression.ipynb