Probability notes (http://www.autonlab.org/tutorials/prob18.pdf):
========================
- "P(A)" means the probability that the event A will happen/A will be true.
- The core axioms
- 0 <= P(A) <= 1
- P(true) = 1
- P(false) = 0
- P(A or B) = P(A) + P(B) - P(A and B) Note: this is the inclusive or; subtracting P(A and B) just corrects for counting the overlap twice.
- From the core axioms, we can derive:
- P(not A) = P(~A) = 1 - P(A)
- P(A) = P(A ^ B) + P(A ^ ~B)
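- Example: a minimal Python sketch (mine, not from the tutorial; the two-coin sample space and the event names A/B are made up) checking the axioms and the derived rules by counting worlds:

      import itertools

      worlds = list(itertools.product(["H", "T"], repeat=2))  # 4 equally likely worlds

      def P(event):
          # probability of an event = fraction of worlds in which the event holds
          return sum(1 for w in worlds if event(w)) / len(worlds)

      A = lambda w: w[0] == "H"   # "first flip is heads"
      B = lambda w: w[1] == "H"   # "second flip is heads"

      assert 0 <= P(A) <= 1
      assert P(lambda w: True) == 1 and P(lambda w: False) == 0
      # inclusive or: P(A or B) = P(A) + P(B) - P(A and B)
      assert P(lambda w: A(w) or B(w)) == P(A) + P(B) - P(lambda w: A(w) and B(w))
      # derived rules
      assert P(lambda w: not A(w)) == 1 - P(A)
      assert P(A) == P(lambda w: A(w) and B(w)) + P(lambda w: A(w) and not B(w))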
- Conditional Probability:
- P(A|B) = fraction of worlds in which B is true that also have A true. In other words: given that B is true, what is the probability that A is also true? Or geometrically: what fraction of the B region overlaps the A region?
- P(A|B) = (Area of A and B regions) / Area of B = P(A ^ B) / P(B)
- Corollary of conditional probability: the chain rule: P(A ^ B) = P(A | B) * P(B)
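- Example: a minimal Python sketch (mine, not from the tutorial; the joint probabilities over booleans A and B are made up) computing P(A|B) and checking the chain rule:

      probs = {  # P(A=a ^ B=b) for every combination; the four entries sum to 1
          (True, True): 0.12, (True, False): 0.18,
          (False, True): 0.28, (False, False): 0.42,
      }
      P_B = sum(p for (a, b), p in probs.items() if b)   # P(B) = 0.40
      P_A_and_B = probs[(True, True)]                    # P(A ^ B) = 0.12
      P_A_given_B = P_A_and_B / P_B                      # P(A|B) = P(A ^ B) / P(B) = 0.30
      # chain rule: P(A ^ B) = P(A|B) * P(B)
      assert abs(P_A_and_B - P_A_given_B * P_B) < 1e-12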
- Bayes rule:
P(B|A) = P(A ^ B) / P(A) = (P(A|B) * P(B)) / P(A)
- Notice how this relates P(A|B) to P(B|A). Tip: thinking about this geometrically (i.e. as areas) helps a lot.
- These rules let you perform Bayesian inference.
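- Example: a minimal Python sketch (mine, not from the tutorial; the test/condition numbers are made up) of Bayes rule flipping P(A|B) into P(B|A). Let A = "test is positive" and B = "has the condition":

      P_B = 0.01             # prior P(B)
      P_A_given_B = 0.95     # P(A|B), true positive rate
      P_A_given_notB = 0.05  # P(A|~B), false positive rate

      # P(A) = P(A ^ B) + P(A ^ ~B) = P(A|B)*P(B) + P(A|~B)*P(~B)
      P_A = P_A_given_B * P_B + P_A_given_notB * (1 - P_B)
      # Bayes rule: P(B|A) = P(A|B) * P(B) / P(A)
      P_B_given_A = P_A_given_B * P_B / P_A
      print(P_B_given_A)  # ~0.16: a positive test still leaves P(B|A) small when the prior P(B) is tiny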
- Joint Distributions assign a probability to every combination of values of several variables. Think of it as a truth table with an extra column giving the probability of each row; the probabilities across all rows sum to 1.
- Terminology: "Density estimator" learns a mapping from a set of attributes to a probability.
- Joint Distributions are a type of density estimator
- Joint Distributions are a bad density estimator because with many attributes the table has far more rows than records, so most rows are estimated from few or zero examples: they overfit and don't generalize.
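- Example: a minimal Python sketch (mine, not from the tutorial; the records are made up) of learning a joint-distribution density estimator from boolean records and querying it:

      from collections import Counter

      # each record is an assignment (x1, x2)
      records = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (1, 1), (0, 0), (1, 0)]

      counts = Counter(records)
      joint = {row: c / len(records) for row, c in counts.items()}  # the "probability column" of the truth table

      print(joint.get((1, 1), 0.0))  # P(x1=1 ^ x2=1) = 3/8
      # rows never seen in the data get probability 0 -- one way a raw joint
      # distribution overfits when there are many attributes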
- Naive Density Estimation
- Assumes that each attribute is distributed independently of any of the other attributes.
- i.e. P(x[i] = v | x[1] = u1, x[2] = u2, x[3] = u3, ...) = P(x[i] = v)
- i.e. the probability of x[i] being v does not depend on the probabilities of other events.
- Assume A and B are boolean random variables. They are independent iff P(A|B) = P(A)
- Under independence, the following examples are true: P(A ^ B) = P(A) * P(B); P(~A|B) = P(~A); P(A|~B) = P(A)
- To learn a naive density estimator: P(x[i] = u) = (# records in which x[i] = u) / (total # of records)
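- Example: a minimal Python sketch (mine, not from the tutorial; the records and helper names p_attr/p_row are made up) of learning a naive density estimator and scoring a full row via the independence assumption:

      # each record is an assignment (x1, x2, x3)
      records = [(1, 1, 0), (1, 0, 0), (0, 1, 1), (1, 1, 1), (0, 0, 0), (1, 1, 0)]

      # P(x[i] = u) = (# records in which x[i] = u) / (total # of records)
      def p_attr(i, u):
          return sum(1 for r in records if r[i] == u) / len(records)

      # under the naive assumption, P(x1=u1 ^ ... ^ xm=um) = product of the marginals
      def p_row(row):
          p = 1.0
          for i, u in enumerate(row):
              p *= p_attr(i, u)
          return p

      print(p_row((1, 1, 0)))  # estimate of P(x1=1 ^ x2=1 ^ x3=0)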
- Bayes Classifiers
- Used to predict categorical outputs
- Predicts an output Y with arity n: v1, v2, ..., vn
- There are m input attributes X1, X2, ..., Xm
- Break the dataset into n smaller datasets DS1, DS2, ..., DSn
- Dataset i (aka DSi) equals the records in which Y = vi
- Then for each dataset DSi, learn a density estimator Mi to model the input distribution among the Y = vi records.
- Mi estimates P(X1, X2, ..., Xm | Y = vi)
- To classify a new input x, apply Bayes rule: predict the vi that maximizes P(Y = vi | X = x), which is proportional to P(X = x | Y = vi) * P(Y = vi), so only the per-class density estimators Mi and the class priors P(Y = vi) are needed.
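- Example: a minimal Python sketch (mine, not from the tutorial; the records and the names by_class/p_attr/predict are made up) of a Bayes classifier that uses naive density estimators per class:

      from collections import defaultdict

      # each record is (x1, x2, y) with boolean attributes and class label y
      data = [(1, 1, "a"), (1, 0, "a"), (1, 1, "a"), (0, 0, "b"), (0, 1, "b"), (1, 0, "b")]

      by_class = defaultdict(list)        # split into DS_a, DS_b, ...
      for *x, y in data:
          by_class[y].append(tuple(x))

      def p_attr(rows, i, u):
          # P(x[i] = u | Y = v), estimated from the rows where Y = v
          return sum(1 for r in rows if r[i] == u) / len(rows)

      def predict(x):
          best, best_score = None, -1.0
          for v, rows in by_class.items():
              prior = len(rows) / len(data)          # P(Y = v)
              likelihood = 1.0
              for i, u in enumerate(x):              # naive: attributes independent given Y
                  likelihood *= p_attr(rows, i, u)
              score = prior * likelihood             # proportional to P(Y = v | x)
              if score > best_score:
                  best, best_score = v, score
          return best

      print(predict((1, 1)))  # "a"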