Arthur Samuel (1959): Field of study that gives computers the ability to learn without being explicitly programmed.
- Checkers playing computer
Well-posed learning problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
The following types of ML algos will be covered in this class:
- Supervised Learning
- Unsupervised Learning
- Reinforcement learning
- Recommender Systems
It's also important to learn how to use ML algos
Let's take an example first.
Housing price prediction:
Price
  |                     x
  |                x  x
  |           x
  |       x  x
  |   x
  +---------------------------
                 Size in Feet
A supervised algorithm is used in a situation where the "right answer" is given.
One type of supervised learning algorithm is a regression algorithm, which is useful for continuous-valued output, like housing prices in this case.
Let's take another example:
This graph shows the size of a tumor and whether it is malignant.
Malignant?
 1 |              x  x  x   x
   |
 0 |  x   x   x  x
   +----------------------------
                      Tumor Size
Supervised learning can also be used to classify, which is useful when you have discrete-valued output: in this case 0 or 1, for benign and malignant.
Classification can be extended to use many more features, and some algorithms, like the support vector machine, can even deal with an infinite number of features.
Key Ideas:
- Supervised = there is a right answer
- Regression = continuous data
- Classification = discrete data
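To make the regression/classification split concrete, here is a minimal sketch with made-up numbers: the only difference is whether the "right answer" attached to each training example is continuous or discrete.

```python
# Supervised learning: every training example comes with the "right answer" (a label).

# Regression: the label is continuous (price in $1000s for a house of a given size).
housing_examples = [(2104, 460.0), (1416, 232.0), (852, 178.0)]  # (size in feet^2, price)

# Classification: the label is discrete (0 = benign, 1 = malignant) for a tumor of a given size.
tumor_examples = [(0.5, 0), (1.1, 0), (2.7, 1), (3.8, 1)]  # (tumor size, malignant?)
```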
In supervised learning, we're explicitly told what the "right" answer is.
In unsupervised learning, we're not given these labels, but rather, our task is to find "structure" in the given data.
For example, Google News analyzes tons of news stories each day and groups them by content, so that similar stories end up in the same group. This clustering is done using an unsupervised algorithm.
Clustering algorithms are used in:
- Organizing computing clusters
- Social network analysis
- Market segmentation
- Astronomical data analysis
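The notes don't name a specific clustering algorithm; k-means is one common choice. A minimal scikit-learn sketch (the data and cluster count are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: each row is one example with two features (no "right answer" given).
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # one natural group
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])  # another natural group

# k-means finds the structure itself: it assigns each example to one of 2 clusters.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1] -- the algorithm discovered the two groups
```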
Here is another example of an unsupervised learning algorithm: the cocktail party problem. Two people are talking at the same time. Can we separate the two voices? Turns out we can! The algorithm notices that each recording is the sum of two different sounds, and splits them apart.
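The usual tool for this un-mixing is independent component analysis (ICA). The notes don't name an implementation, so this scikit-learn FastICA sketch (with synthetic signals standing in for the speakers) is just one way to illustrate the idea:

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two original "voices" (synthetic signals standing in for the two speakers).
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]

# Each microphone hears a different weighted sum of the two voices.
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])  # mixing matrix
X = S @ A.T                 # the two recorded mixtures

# ICA recovers the independent sources from the mixtures, with no labels given.
recovered = FastICA(n_components=2, random_state=0).fit_transform(X)
```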
Key Ideas:
- Unsupervised = no labels / "right answers" given
- Clustering = finding structure
Linear regression: fit a straight line to the data, and use it to give a decent answer.
For example, in the earlier case of housing prices, we could figure out how to find a line of best fit, and use that to predict the prices of houses.
Some basic notation:
`m` = number of training examples
`x` = "input" variables / features
`y` = "output" variable / "target" variable
`(x, y)` -> one training example
`(x^(i), y^(i))` -> the i-th training example
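A tiny made-up training set to pin down the notation (note the course counts examples from 1, while Python lists are 0-indexed):

```python
# Training set: each pair is one training example (x, y) = (size in feet^2, price in $1000s).
training_set = [(2104, 460), (1416, 232), (1534, 315), (852, 178)]

m = len(training_set)       # m = 4 training examples
x_1, y_1 = training_set[0]  # (x^(1), y^(1)): the 1st training example
```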
Process of learning:
Training Set ==> Learning Algo
                      |
                      v
Size of house ==> h ==> Estimated Price
Where `h` is a hypothesis: a function which maps from `x`s to `y`s.
How do we represent `h`?
h_theta(x) = theta_0 + theta_1 * x
This assumes a linear "best fit". This is linear regression with one variable, or univariate linear regression.
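As code, the hypothesis is just a line; a minimal sketch (the parameter values below are arbitrary placeholders):

```python
def h(theta_0: float, theta_1: float, x: float) -> float:
    """Univariate linear regression hypothesis: h_theta(x) = theta_0 + theta_1 * x."""
    return theta_0 + theta_1 * x

# With theta_0 = 50 and theta_1 = 0.1, a 2104 ft^2 house is estimated at 260.4 ($1000s).
print(h(50.0, 0.1, 2104))  # 260.4
```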
Linear regression is used to figure out the best fit to the data, i.e. the best `h`.
The thetas of `h` are called the parameters of `h`.
Using different parameters, `h` changes the way it maps `x`s to `y`s.
In linear regression, we want to get values of theta such that the line we generate from `h(x)` is a good fit to the data.
Idea: choose `theta_0` and `theta_1` so that `h_theta(x)` is close to `y` for our training examples `(x, y)`.
So we want to minimize the squared difference between the output of `h` (which is based on our thetas) and the `y` values of our training data.
The function described above is called the cost function `J`; in this case, it is the squared error cost function.
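The notes don't write it out; the squared error cost function for univariate linear regression is J(theta_0, theta_1) = (1 / (2m)) * ∑ (h_theta(x^(i)) - y^(i))^2, summed over the m training examples. A minimal Python sketch:

```python
def cost(theta_0: float, theta_1: float, xs: list[float], ys: list[float]) -> float:
    """Squared error cost: J = (1 / (2m)) * sum((h_theta(x^(i)) - y^(i))^2)."""
    m = len(xs)
    return sum((theta_0 + theta_1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Sanity check: a perfect fit (y = 2x exactly) gives zero cost.
print(cost(0.0, 2.0, [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 0.0
```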
This process of minimization can be thought of as the actual "learning" going on over here.
The gradient descent algorithm is used to minimize the cost function `J(theta_0, theta_1)`.
Given a function `J(theta_0, theta_1)`, where `J` is a cost function, we want to minimize `J`.
The idea of gradient descent is to keep changing the theta values to try to minimize `J`.
We begin at some starting point on `J`, with arbitrary (e.g. random) initial values of `theta_0` and `theta_1`. Then we take the derivative of `J` with respect to each theta, and subtract it (scaled by a learning rate) from the current value of that theta.
Basically, by taking the derivative, we're going "down" the slope of the function. By continuing to step down along the derivative, or the slope at the current theta, we will eventually end up at a minimum. Note that depending on which values of theta you begin with, you may end up at different local optima.
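In symbols, each gradient descent step applies the same update rule to every parameter, where `alpha` is the learning rate and both thetas are updated simultaneously:

theta_j := theta_j - alpha * (d/d theta_j) J(theta_0, theta_1)    for j = 0 and j = 1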
Advanced calculus skills required:
Basically, you end up with the following update rules for `theta_0` and `theta_1` (the sums run over the m training examples):
theta_0 := theta_0 - alpha * (1/m) * ∑ (h_theta(x^(i)) - y^(i))
theta_1 := theta_1 - alpha * (1/m) * ∑ (h_theta(x^(i)) - y^(i)) * x^(i)
Go prove these.
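A minimal sketch of batch gradient descent for univariate linear regression, implementing the two update rules above (the learning rate and iteration count below are arbitrary placeholders):

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=5000):
    """Fit h_theta(x) = theta_0 + theta_1 * x by batch gradient descent."""
    m = len(xs)
    theta_0, theta_1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [theta_0 + theta_1 * x - y for x, y in zip(xs, ys)]
        # Compute both gradients first, so the two thetas are updated simultaneously.
        grad_0 = sum(errors) / m
        grad_1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta_0 -= alpha * grad_0
        theta_1 -= alpha * grad_1
    return theta_0, theta_1

# Toy data lying exactly on y = 2x + 1; the fit should approach theta_0 = 1, theta_1 = 2.
print(gradient_descent([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]))
```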
Now you can use linear regression to solve all of your problems! :D