Arthur Samuel (1959): Field of study that gives computers the ability to learn without being explicitly programmed.
- Checkers playing computer
Well-posed learning problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
The following types of ML algos will be covered in this class:
- Supervised Learning
- Unsupervised Learning
- Reinforcement learning
- Recommender Systems
It's also important to learn how to use ML algos
Let's take an example first.
Housing price prediction:
Price
  |                     x
  |                x  x
  |           x
  |       x  x
  |   x
  +---------------------------
                 Size in Feet
A supervised algorithm is used in a situation where the "right answer" is given.
One type of supervised learning algorithm is a regression algorithm, which is useful for continuous-valued output, like housing prices in this case.
Let's take another example:
This graph shows the size of a tumor and whether it is malignant.
Malignant?
 1 |              x  x  x   x
   |
 0 |  x   x   x  x
   +----------------------------
                      Tumor Size
Supervised learning can also be used to classify, which is useful when you have discrete-valued output: in this case 0 or 1, for benign and malignant.
Classification can be extended to use many more features, and some algorithms, like the support vector machine, can even deal with an infinite number of features.
Key Ideas:
- Supervised = there is a right answer
- Regression = continuous data
- Classification = discrete data
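To make the regression/classification split concrete, here is a minimal sketch with made-up numbers: the only difference is whether the "right answer" attached to each training example is continuous or discrete.

```python
# Supervised learning: every training example comes with the "right answer" (a label).

# Regression: the label is continuous (price in $1000s for a house of a given size).
housing_examples = [(2104, 460.0), (1416, 232.0), (852, 178.0)]  # (size in feet^2, price)

# Classification: the label is discrete (0 = benign, 1 = malignant) for a tumor of a given size.
tumor_examples = [(0.5, 0), (1.1, 0), (2.7, 1), (3.8, 1)]  # (tumor size, malignant?)
```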
In supervised learning, we're explicitly told what the "right" answer is.
In unsupervised learning, we're not given these labels, but rather, our task is to find "structure" in the given data.
For example, Google News analyzes tons of news stories each day and groups them by content, so that similar stories end up in the same group. This clustering is done using an unsupervised algorithm.
Clustering algorithms are used in:
- Organizing computing clusters
- Social network analysis
- Market segmentation
- Astronomical data analysis
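The notes don't name a specific clustering algorithm; k-means is one common choice. A minimal scikit-learn sketch (the data and cluster count are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: each row is one example with two features (no "right answer" given).
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # one natural group
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])  # another natural group

# k-means finds the structure itself: it assigns each example to one of 2 clusters.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1] -- the algorithm discovered the two groups
```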
Here is another example of an unsupervised learning algorithm: the cocktail party problem. Two people are talking at the same time. Can we separate the two voices? Turns out we can! The algorithm notices that each recording is the sum of two different sounds, and splits them apart.
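The usual tool for this un-mixing is independent component analysis (ICA). The notes don't name an implementation, so this scikit-learn FastICA sketch (with synthetic signals standing in for the speakers) is just one way to illustrate the idea:

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two original "voices" (synthetic signals standing in for the two speakers).
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]

# Each microphone hears a different weighted sum of the two voices.
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])  # mixing matrix
X = S @ A.T                 # the two recorded mixtures

# ICA recovers the independent sources from the mixtures, with no labels given.
recovered = FastICA(n_components=2, random_state=0).fit_transform(X)
```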
Key Ideas:
- Unsupervised = no labels / "right answers" given
- Clustering = finding structure
Linear regression: fit a straight line to the data, and use it to give a decent answer.
For example, in the earlier case of housing prices, we could figure out how to find a line of best fit, and use that to predict the prices of houses.
Some basic notation:
`m` = number of training examples
`x` = "input" variables / features
`y` = "output" variable / "target" variable
`(x, y)` -> one training example
`(x^(i), y^(i))` -> the i-th training example
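A tiny made-up training set to pin down the notation (note the course counts examples from 1, while Python lists are 0-indexed):

```python
# Training set: each pair is one training example (x, y) = (size in feet^2, price in $1000s).
training_set = [(2104, 460), (1416, 232), (1534, 315), (852, 178)]

m = len(training_set)       # m = 4 training examples
x_1, y_1 = training_set[0]  # (x^(1), y^(1)): the 1st training example
```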
Process of learning:
Training Set ==> Learning Algo
                      |
                      v
Size of house ==> h ==> Estimated Price
Where `h` is a hypothesis: a function which maps from `x`s to `y`s.
How do we represent `h`?
h_theta(x) = theta_0 + theta_1 * x
This assumes a linear "best fit". This is linear regression with one variable, or univariate linear regression.
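As code, the hypothesis is just a line; a minimal sketch (the parameter values below are arbitrary placeholders):

```python
def h(theta_0: float, theta_1: float, x: float) -> float:
    """Univariate linear regression hypothesis: h_theta(x) = theta_0 + theta_1 * x."""
    return theta_0 + theta_1 * x

# With theta_0 = 50 and theta_1 = 0.1, a 2104 ft^2 house is estimated at 260.4 ($1000s).
print(h(50.0, 0.1, 2104))  # 260.4
```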
Linear regression is used to figure out the best fit to the data, i.e. the best `h`.
The thetas of `h` are called the parameters of `h`.
Using different parameters, `h` changes the way it maps `x`s to `y`s.
In linear regression, we want to get values of theta such that the line we generate from `h(x)` is a good fit to the data.
Idea: choose `theta_0` and `theta_1` so that `h_theta(x)` is close to `y` for our training examples `(x, y)`.
So we want to minimize the squared difference between the output of `h` (which is based on our thetas) and the `y` values of our training data.
The function described above is called the cost function `J`; in this case, it is the squared error cost function.
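The notes don't write it out; the squared error cost function for univariate linear regression is J(theta_0, theta_1) = (1 / (2m)) * ∑ (h_theta(x^(i)) - y^(i))^2, summed over the m training examples. A minimal Python sketch:

```python
def cost(theta_0: float, theta_1: float, xs: list[float], ys: list[float]) -> float:
    """Squared error cost: J = (1 / (2m)) * sum((h_theta(x^(i)) - y^(i))^2)."""
    m = len(xs)
    return sum((theta_0 + theta_1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Sanity check: a perfect fit (y = 2x exactly) gives zero cost.
print(cost(0.0, 2.0, [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 0.0
```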
This process of minimization can be thought of as the actual "learning" going on over here.
The gradient descent algorithm is used to minimize the cost function `J(theta_0, theta_1)`.
Given a function `J(theta_0, theta_1)`, where `J` is a cost function, we want to minimize `J`.
The idea of gradient descent is to keep changing the theta values to try to minimize `J`.
We begin at some starting point on `J`, with arbitrary (e.g. random) initial values of `theta_0` and `theta_1`. Then we take the derivative of `J` with respect to each theta, and subtract it (scaled by a learning rate) from the current value of that theta.
Basically, by taking the derivative, we're going "down" the slope of the function. By continuing to step down along the derivative, or the slope at the current theta, we will eventually end up at a minimum. Note that depending on which values of theta you begin with, you may end up at different local optima.
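In symbols, each gradient descent step applies the same update rule to every parameter, where `alpha` is the learning rate and both thetas are updated simultaneously:

theta_j := theta_j - alpha * (d/d theta_j) J(theta_0, theta_1)    for j = 0 and j = 1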
Advanced calculus skills required:
Basically, you end up with the following update rules for `theta_0` and `theta_1` (the sums run over the m training examples):
theta_0 := theta_0 - alpha * (1/m) * ∑ (h_theta(x^(i)) - y^(i))
theta_1 := theta_1 - alpha * (1/m) * ∑ (h_theta(x^(i)) - y^(i)) * x^(i)
Go prove these.
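A minimal sketch of batch gradient descent for univariate linear regression, implementing the two update rules above (the learning rate and iteration count below are arbitrary placeholders):

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=5000):
    """Fit h_theta(x) = theta_0 + theta_1 * x by batch gradient descent."""
    m = len(xs)
    theta_0, theta_1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [theta_0 + theta_1 * x - y for x, y in zip(xs, ys)]
        # Compute both gradients first, so the two thetas are updated simultaneously.
        grad_0 = sum(errors) / m
        grad_1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta_0 -= alpha * grad_0
        theta_1 -= alpha * grad_1
    return theta_0, theta_1

# Toy data lying exactly on y = 2x + 1; the fit should approach theta_0 = 1, theta_1 = 2.
print(gradient_descent([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]))
```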
Now you can use linear regression to solve all of your problems! :D