Notes for "Probability for Data Miners"

========================

  • "P(A)" means the probability that the event A will happen/A will be true.
  • The core axioms
    1. 0 <= P(A) <= 1
    2. P(true) = 1
    3. P(false) = 0
    4. P(A or B) = P(A) + P(B) - P(A and B). Note: this "or" is the inclusive or; subtracting P(A and B) just removes the double-counted overlap (checked numerically in the sketch after the derived rules below).
  • From the core axioms, we can derive:
    1. P(not A) = P(~A) = 1 - P(A)
    2. P(A) = P(A ^ B) + P(A ^ ~B)
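A quick numeric check of axiom 4 and the two derived rules. This is a minimal sketch over a made-up sample space (a fair six-sided die); the names omega, P, A, and B are all assumptions chosen for illustration, not from the lecture.

```python
from fractions import Fraction

# Toy sample space: one roll of a fair six-sided die (illustrative only).
omega = {1, 2, 3, 4, 5, 6}

def P(event):
    """Probability of an event, represented as a set of outcomes."""
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}   # "roll is even"
B = {4, 5, 6}   # "roll is at least 4"

# Axiom 4 (inclusive or): P(A or B) = P(A) + P(B) - P(A and B)
assert P(A | B) == P(A) + P(B) - P(A & B)

# Derived rule 1: P(not A) = 1 - P(A)
assert P(omega - A) == 1 - P(A)

# Derived rule 2: P(A) = P(A ^ B) + P(A ^ ~B)
assert P(A) == P(A & B) + P(A & (omega - B))
```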
  • Conditional Probability:
    • P(A|B) = the fraction of worlds in which B is true that also have A true. In other words: given that B is true, what is the probability that A is also true? Stated geometrically: of the region where B holds, what fraction overlaps with the region where A holds?
    • P(A|B) = (Area of A and B regions) / Area of B = P(A ^ B) / P(B)
    • Corollary of conditional probability: the chain rule: P(A ^ B) = P(A | B) * P(B)
    • Bayes rule: P(B|A) = P(A ^ B) / P(A) = (P(A|B) * P(B)) / P(A)
      • Notice how this relates P(A|B) to P(B|A). Tip: thinking about this geometrically (i.e. as areas) helps a lot.
      • These rules let you perform Bayesian inference.
  • Joint Distributions give the probability of every combination of values for several variables. Think of it as a truth table with an extra column assigning a probability to each row (the probabilities across all rows sum to 1). TODO: better description of joint distributions
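To make the conditional-probability, chain-rule, and Bayes-rule formulas concrete, here is a minimal sketch over a hypothetical joint distribution of two boolean variables. The table values and the helper P are made up for illustration; the only requirement is that the four probabilities sum to 1.

```python
# Hypothetical joint distribution over two boolean variables A and B.
joint = {
    (True,  True):  0.30,   # P(A ^ B)
    (True,  False): 0.20,   # P(A ^ ~B)
    (False, True):  0.10,   # P(~A ^ B)
    (False, False): 0.40,   # P(~A ^ ~B)
}

def P(pred):
    """Sum the probabilities of all (a, b) worlds where pred holds."""
    return sum(p for (a, b), p in joint.items() if pred(a, b))

P_A       = P(lambda a, b: a)
P_B       = P(lambda a, b: b)
P_A_and_B = P(lambda a, b: a and b)

# Conditional probability: P(A|B) = P(A ^ B) / P(B)
P_A_given_B = P_A_and_B / P_B

# Chain rule: P(A ^ B) = P(A|B) * P(B)
assert abs(P_A_and_B - P_A_given_B * P_B) < 1e-12

# Bayes rule: P(B|A) = P(A|B) * P(B) / P(A)
P_B_given_A = P_A_given_B * P_B / P_A
assert abs(P_B_given_A - P_A_and_B / P_A) < 1e-12
```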
  • Terminology: "Density estimator" learns a mapping from a set of attributes to a probability.
    • Joint Distributions are a type of density estimator
    • Joint Distributions are a bad density estimator: with many attributes, most combinations of values appear in few or zero records, so the estimator overfits and doesn't generalize to unseen combinations.
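A sketch of how a joint distribution is "learned" as a density estimator, and why it overfits: it just counts how often each exact row appears, so any row not seen in training gets probability zero. The records and the helper name joint_estimate are made up for illustration.

```python
from collections import Counter

# Hypothetical dataset of boolean attribute rows (made up for illustration).
records = [
    (True,  True,  False),
    (True,  False, False),
    (False, True,  True),
    (True,  True,  False),
]

counts = Counter(records)
total = len(records)

def joint_estimate(row):
    """Joint-distribution density estimator: fraction of records equal to `row`."""
    return counts[row] / total

print(joint_estimate((True, True, False)))    # 0.5 -- seen in 2 of 4 records
print(joint_estimate((False, False, False)))  # 0.0 -- never seen, so it gets
                                              # probability zero (the overfitting problem)
```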
  • Naive Density Estimation
    • Assumes that each attribute is distributed independently of all the other attributes.
      • i.e. P(x[i] = v | x[j] = u[j] for all j != i) = P(x[i] = v)
        • i.e. the probability that x[i] = v does not change when you learn the values of the other attributes.
      • Assume A and B are boolean random variables. They are independent iff P(A|B) = P(A)
      • Under independence, the following examples are true: P(A ^ B) = P(A) * P(B), P(~A | B) = P(~A), P(A | ~B) = P(A)
    • To learn a naive density estimator: P(x[i] = u) = (# records in which x[i] = u) / (total # of records)
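A matching sketch of the naive density estimator over the same kind of hypothetical records: learn one marginal per attribute by counting, then multiply the marginals (that product is the independence assumption). Unlike the joint-distribution estimator above, it gives unseen combinations a nonzero probability. The records and the helper name naive_estimate are assumptions for illustration.

```python
from collections import Counter

# Hypothetical boolean records (illustrative only).
records = [
    (True,  True,  False),
    (True,  False, False),
    (False, True,  True),
    (True,  True,  False),
]
total = len(records)

# Learn one marginal per attribute: P(x[i] = u) = (# records with x[i] = u) / total
marginals = [Counter(row[i] for row in records) for i in range(len(records[0]))]

def naive_estimate(row):
    """Naive density estimator: product of the per-attribute marginals."""
    p = 1.0
    for i, value in enumerate(row):
        p *= marginals[i][value] / total
    return p

print(naive_estimate((False, False, False)))  # nonzero even though this exact row was never seen
```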
  • Bayes Classifiers
    • Used to predict categorical outputs
    • Predicts an output Y with arity n: v1, v2, ..., vn
    • There are m input attributes X1, X2, ..., Xm
    • Break the dataset into n smaller datasets DS1, DS2, ..., DSn
      • Dataset i (aka DSi) equals the records in which Y = vi
    • Then for each dataset DSi, learn a density estimator Mi to model the input distribution among the Y = vi records.
      • Mi estimates P(X1, X2, ..., Xm | Y = vi)
    • To classify a new input (x1, ..., xm), use Bayes rule: pick the value vi that maximizes P(Y = vi) * P(X1 = x1, ..., Xm = xm | Y = vi). The denominator P(X1 = x1, ..., Xm = xm) is the same for every vi, so it can be dropped. The sketch below combines this with a naive density estimator per class.
    • TODO: don't yet have a good grasp of Bayes classifiers
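Putting the pieces together, a minimal naive Bayes classifier sketch. The dataset, the class labels "spam"/"ham", and the helper name predict are all made up for illustration: split the records by class, learn a naive density estimator Mi plus a class prior per class, then predict the class with the highest prior * likelihood score.

```python
from collections import Counter

# Hypothetical labelled dataset: each record is (attributes, class label).
data = [
    ((True,  True,  False), "spam"),
    ((True,  False, False), "spam"),
    ((False, True,  True),  "ham"),
    ((False, False, True),  "ham"),
    ((True,  True,  True),  "spam"),
]

# Break the dataset into one smaller dataset per class value (DSi: records with Y = vi).
by_class = {}
for x, y in data:
    by_class.setdefault(y, []).append(x)

# For each class, learn a naive density estimator Mi for P(X1..Xm | Y = vi)
# plus the class prior P(Y = vi).
models = {}
for y, rows in by_class.items():
    n = len(rows)
    marginals = [Counter(r[i] for r in rows) for i in range(len(rows[0]))]
    prior = n / len(data)
    models[y] = (prior, marginals, n)

def predict(x):
    """Pick the class vi maximizing P(Y = vi) * P(X = x | Y = vi)
    (Bayes rule, dropping the shared denominator P(X = x))."""
    best_class, best_score = None, -1.0
    for y, (prior, marginals, n) in models.items():
        likelihood = 1.0
        for i, value in enumerate(x):
            likelihood *= marginals[i][value] / n
        score = prior * likelihood
        if score > best_score:
            best_class, best_score = y, score
    return best_class

print(predict((True, True, False)))   # -> "spam" on this toy data
print(predict((False, False, True)))  # -> "ham"
```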