Probability notes (http://www.autonlab.org/tutorials/prob18.pdf):
========================
- "P(A)" means the probability that the event A will happen/A will be true.
- The core axioms
- 0 <= P(A) <= 1
- P(true) = 1
- P(false) = 0
- P(A or B) = P(A) + P(B) - P(A and B) Note: this is the inclusive or; subtracting P(A and B) just corrects for counting the overlap twice.
- From the core axioms, we can derive:
- P(not A) = P(~A) = 1 - P(A)
- P(A) = P(A ^ B) + P(A ^ ~B)
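- Example: a minimal Python sketch (mine, not from the tutorial; the two-coin sample space and the event names A/B are made up) checking the axioms and the derived rules by counting worlds:

      import itertools

      worlds = list(itertools.product(["H", "T"], repeat=2))  # 4 equally likely worlds

      def P(event):
          # probability of an event = fraction of worlds in which the event holds
          return sum(1 for w in worlds if event(w)) / len(worlds)

      A = lambda w: w[0] == "H"   # "first flip is heads"
      B = lambda w: w[1] == "H"   # "second flip is heads"

      assert 0 <= P(A) <= 1
      assert P(lambda w: True) == 1 and P(lambda w: False) == 0
      # inclusive or: P(A or B) = P(A) + P(B) - P(A and B)
      assert P(lambda w: A(w) or B(w)) == P(A) + P(B) - P(lambda w: A(w) and B(w))
      # derived rules
      assert P(lambda w: not A(w)) == 1 - P(A)
      assert P(A) == P(lambda w: A(w) and B(w)) + P(lambda w: A(w) and not B(w))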
- Conditional Probability:
- P(A|B) = fraction of worlds in which B is true that also have A true. In other words: given that B is true, what is the probability that A is also true? Or geometrically: what fraction of the B region overlaps the A region?
- P(A|B) = (Area of A and B regions) / Area of B = P(A ^ B) / P(B)
- Corollary of conditional probability: the chain rule: P(A ^ B) = P(A | B) * P(B)
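- Example: a minimal Python sketch (mine, not from the tutorial; the joint probabilities over booleans A and B are made up) computing P(A|B) and checking the chain rule:

      probs = {  # P(A=a ^ B=b) for every combination; the four entries sum to 1
          (True, True): 0.12, (True, False): 0.18,
          (False, True): 0.28, (False, False): 0.42,
      }
      P_B = sum(p for (a, b), p in probs.items() if b)   # P(B) = 0.40
      P_A_and_B = probs[(True, True)]                    # P(A ^ B) = 0.12
      P_A_given_B = P_A_and_B / P_B                      # P(A|B) = P(A ^ B) / P(B) = 0.30
      # chain rule: P(A ^ B) = P(A|B) * P(B)
      assert abs(P_A_and_B - P_A_given_B * P_B) < 1e-12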
- Bayes rule:
P(B|A) = P(A ^ B) / P(A) = (P(A|B) * P(B)) / P(A)
- Notice how this relates P(A|B) to P(B|A). Tip: thinking about this geometrically (i.e. as areas) helps a lot.
- These rules let you perform Bayesian inference.
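- Example: a minimal Python sketch (mine, not from the tutorial; the test/condition numbers are made up) of Bayes rule flipping P(A|B) into P(B|A). Let A = "test is positive" and B = "has the condition":

      P_B = 0.01             # prior P(B)
      P_A_given_B = 0.95     # P(A|B), true positive rate
      P_A_given_notB = 0.05  # P(A|~B), false positive rate

      # P(A) = P(A ^ B) + P(A ^ ~B) = P(A|B)*P(B) + P(A|~B)*P(~B)
      P_A = P_A_given_B * P_B + P_A_given_notB * (1 - P_B)
      # Bayes rule: P(B|A) = P(A|B) * P(B) / P(A)
      P_B_given_A = P_A_given_B * P_B / P_A
      print(P_B_given_A)  # ~0.16: a positive test still leaves P(B|A) small when the prior P(B) is tiny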
- Joint Distributions assign a probability to every combination of values of several variables. Think of it as a truth table with an extra column giving the probability of each row; the probabilities across all rows sum to 1.
- Terminology: "Density estimator" learns a mapping from a set of attributes to a probability.
- Joint Distributions are a type of density estimator
- Joint Distributions are a bad density estimator because with many attributes the table has far more rows than records, so most rows are estimated from few or zero examples: they overfit and don't generalize.
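- Example: a minimal Python sketch (mine, not from the tutorial; the records are made up) of learning a joint-distribution density estimator from boolean records and querying it:

      from collections import Counter

      # each record is an assignment (x1, x2)
      records = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (1, 1), (0, 0), (1, 0)]

      counts = Counter(records)
      joint = {row: c / len(records) for row, c in counts.items()}  # the "probability column" of the truth table

      print(joint.get((1, 1), 0.0))  # P(x1=1 ^ x2=1) = 3/8
      # rows never seen in the data get probability 0 -- one way a raw joint
      # distribution overfits when there are many attributes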
- Naive Density Estimation
- Assumes that each attribute is distributed independently of any of the other attributes.
- i.e. P(x[i] = v | x[1] = u1, x[2] = u2, x[3] = u3, ...) = P(x[i] = v)
- i.e. the probability of x[i] being v does not depend on the probabilities of other events.
- Assume A and B are boolean random variables. They are independent iff P(A|B) = P(A)
- Under independence, the following examples are true: P(A ^ B) = P(A) * P(B); P(~A|B) = P(~A); P(A|~B) = P(A)
- To learn a naive density estimator: P(x[i] = u) = (# records in which x[i] = u) / (total # of records)
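- Example: a minimal Python sketch (mine, not from the tutorial; the records and helper names p_attr/p_row are made up) of learning a naive density estimator and scoring a full row via the independence assumption:

      # each record is an assignment (x1, x2, x3)
      records = [(1, 1, 0), (1, 0, 0), (0, 1, 1), (1, 1, 1), (0, 0, 0), (1, 1, 0)]

      # P(x[i] = u) = (# records in which x[i] = u) / (total # of records)
      def p_attr(i, u):
          return sum(1 for r in records if r[i] == u) / len(records)

      # under the naive assumption, P(x1=u1 ^ ... ^ xm=um) = product of the marginals
      def p_row(row):
          p = 1.0
          for i, u in enumerate(row):
              p *= p_attr(i, u)
          return p

      print(p_row((1, 1, 0)))  # estimate of P(x1=1 ^ x2=1 ^ x3=0)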
- Bayes Classifiers
- Used to predict categorical outputs
- Predicts an output Y with arity n: v1, v2, ..., vn
- There are m input attributes X1, X2, ..., Xm
- Break the dataset into n smaller datasets DS1, DS2, ..., DSn
- Dataset i (aka DSi) equals the records in which Y = vi
- Then for each dataset DSi, learn a density estimator Mi to model the input distribution among the Y = vi records.
- Mi estimates P(X1, X2, ..., Xm | Y = vi)
- To classify a new input x, apply Bayes rule: predict the vi that maximizes P(Y = vi | X = x), which is proportional to P(X = x | Y = vi) * P(Y = vi), so only the per-class density estimators Mi and the class priors P(Y = vi) are needed.
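- Example: a minimal Python sketch (mine, not from the tutorial; the records and the names by_class/p_attr/predict are made up) of a Bayes classifier that uses naive density estimators per class:

      from collections import defaultdict

      # each record is (x1, x2, y) with boolean attributes and class label y
      data = [(1, 1, "a"), (1, 0, "a"), (1, 1, "a"), (0, 0, "b"), (0, 1, "b"), (1, 0, "b")]

      by_class = defaultdict(list)        # split into DS_a, DS_b, ...
      for *x, y in data:
          by_class[y].append(tuple(x))

      def p_attr(rows, i, u):
          # P(x[i] = u | Y = v), estimated from the rows where Y = v
          return sum(1 for r in rows if r[i] == u) / len(rows)

      def predict(x):
          best, best_score = None, -1.0
          for v, rows in by_class.items():
              prior = len(rows) / len(data)          # P(Y = v)
              likelihood = 1.0
              for i, u in enumerate(x):              # naive: attributes independent given Y
                  likelihood *= p_attr(rows, i, u)
              score = prior * likelihood             # proportional to P(Y = v | x)
              if score > best_score:
                  best, best_score = v, score
          return best

      print(predict((1, 1)))  # "a"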