@ctufts
Last active July 22, 2016 20:38
General notes about statistics (distributions, tests, etc.)
  • Tests for normality:
    • Shapiro-Wilk: The null hypothesis is that the data are normally distributed. If the p-value is below alpha (0.05 or whatever significance level you choose), the null hypothesis is rejected (evidence that the data are non-normal).
    • With large samples the test is sensitive to sample size: even trivial departures from normality become statistically significant, so accompany the test with a Q-Q plot.
    • Anderson-Darling
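A Q-Q plot pairs the sorted sample against the quantiles of a fitted normal distribution; points near a straight line suggest normality. Below is a minimal sketch using only the Python standard library. The helper names `qq_points` and `qq_correlation`, the plotting positions `(i + 0.5) / n`, and the use of a correlation coefficient as a rough "straightness" summary are assumptions made for illustration, not part of the original notes.

```python
import random
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    observed = sorted(sample)
    n = len(observed)
    fitted = NormalDist(mean(observed), stdev(observed))
    # Plotting positions (i + 0.5) / n avoid the infinite quantiles at 0 and 1.
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))

def qq_correlation(sample):
    """Pearson correlation of the Q-Q points (hypothetical straightness summary)."""
    pts = qq_points(sample)
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Illustrative data: one roughly normal sample, one strongly skewed sample.
rng = random.Random(0)
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(2000)]
skewed_sample = [rng.expovariate(1.0) for _ in range(2000)]

print("normal sample Q-Q correlation:", qq_correlation(normal_sample))
print("skewed sample Q-Q correlation:", qq_correlation(skewed_sample))
```

The pairs returned by `qq_points` can be passed to any plotting library; the skewed sample should bend visibly away from the diagonal.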
  • Comparison of distributions (no assumption of normality)
    • Kolmogorov-Smirnov test
      • Compares the empirical CDFs of two sample sets. D is the largest vertical distance between them: a value close to 1 indicates the distributions are very different, a value close to 0 indicates they are close to one another
    • Wilcoxon’s signed-rank test
      • Compares medians from two paired sample sets (tests whether the median of the paired differences is zero)
    • Mann-Whitney U Test: Similar to Wilcoxon, but samples don't have to be paired
      • Null Hypothesis: Both groups have the same distribution
      • U ranges from 0 to n1*n2, where n1 and n2 are the number of points in datasets 1 and 2. If U is close to 0 (or, equivalently, close to n1*n2), the groups are well separated and the medians are very different; if the medians are similar, U will be close to n1*n2/2
    • Permutation Tests: can be used with a number of different comparison metrics. Repeatedly shuffles the group labels and recomputes the metric, producing an empirical null distribution to compare against the metric computed on the real labeled set.
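The statistics above can be sketched in plain Python to make their ranges concrete. This is an illustrative sketch rather than a reference implementation: the function names, the `median_gap` metric, the seed, and the add-one correction in the p-value are all assumptions chosen for the example, and real analyses would normally rely on a tested statistics library.

```python
import random
from bisect import bisect_right
from statistics import median

def ks_statistic(a, b):
    """Two-sample KS D: largest vertical gap between the two empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    return max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in sa + sb
    )

def mann_whitney_u(a, b):
    """U for sample a: number of pairs (x in a, y in b) with x > y, ties count 1/2."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

def median_gap(a, b):
    """Hypothetical comparison metric: absolute difference of medians."""
    return abs(median(a) - median(b))

def permutation_p_value(a, b, metric, n_resamples=2000, seed=0):
    """Share of label shuffles whose metric is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = metric(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # shuffling the pooled data reassigns the group labels
        if metric(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_resamples + 1)

# Toy data (hypothetical): two completely separated groups.
group_1 = [1, 2, 3, 4]
group_2 = [5, 6, 7, 8]
print("D:", ks_statistic(group_1, group_2))
print("U:", mann_whitney_u(group_1, group_2))
print("p:", permutation_p_value(group_1, group_2, median_gap))
```

Any function of two samples can be passed as `metric`, including `ks_statistic`, which is what makes the permutation approach flexible.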
  • Zipf's Law/Zipfian Distribution
    • The most frequently used word occurs about 2x as often as the 2nd most frequent, 3x as often as the 3rd most frequent, etc. (frequency is roughly inversely proportional to rank)
    • Describes the distribution of words in a corpus
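The rank-frequency relationship can be checked by counting words and comparing each count with the top count divided by the rank. The sketch below uses a made-up toy corpus constructed to follow the law exactly; real corpora only follow it approximately. The names `rank_frequency` and `zipf_expected` are hypothetical helpers.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, count) tuples, most frequent word first."""
    counts = Counter(tokens)
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]

def zipf_expected(top_count, rank):
    """Count predicted by Zipf's law for a given rank: top_count / rank."""
    return top_count / rank

# Toy corpus (hypothetical), built so the counts are 60, 30, 20, 15.
tokens = ["the"] * 60 + ["of"] * 30 + ["and"] * 20 + ["to"] * 15

table = rank_frequency(tokens)
top_count = table[0][2]
for rank, word, count in table:
    print(rank, word, count, zipf_expected(top_count, rank))
```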
  • Distance between 2 probability distributions
    • Jensen-Shannon divergence
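The Jensen-Shannon divergence averages the Kullback-Leibler divergence of each distribution from their midpoint mixture, which makes it symmetric and always finite. A minimal sketch for discrete distributions given as equal-length lists of probabilities follows; using base-2 logarithms (so the result lies between 0 and 1) is a choice made here for illustration.

```python
from math import log2

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions, in bits."""
    # The mixture m is positive wherever p or q is, so the KL terms never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

print(js_divergence([0.5, 0.5], [0.5, 0.5]))  # identical distributions
print(js_divergence([1.0, 0.0], [0.0, 1.0]))  # disjoint distributions
print(js_divergence([0.7, 0.3], [0.4, 0.6]))
```

The square root of this quantity is a true distance metric, which is often what is wanted when a "distance" between distributions is needed.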
  • Difference between statistical and probabilistic inference:
    • Statistical Inference: Estimate parameters/underlying distribution based on analysis of data
    • Probabilistic Inference: Computing joint or marginal distributions based on a known distribution
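One way to see the contrast is to run both directions on toy numbers. In the sketch below, the coin flips and the weather/pavement joint table are invented purely for illustration: the first half estimates an unknown parameter from data, the second half starts from a fully known joint distribution and derives a marginal and a conditional from it.

```python
from collections import defaultdict

# Statistical inference: estimate an unknown parameter from observed data.
flips = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical coin flips, 1 = heads
p_heads_estimate = sum(flips) / len(flips)  # maximum-likelihood estimate of P(heads)

# Probabilistic inference: the joint distribution is known; derive other quantities.
joint = {  # hypothetical joint distribution P(weather, pavement)
    ("rain", "wet"): 0.27,
    ("rain", "dry"): 0.03,
    ("clear", "wet"): 0.07,
    ("clear", "dry"): 0.63,
}

def marginal(joint, axis):
    """Sum the joint distribution over every variable except the one at `axis`."""
    totals = defaultdict(float)
    for outcome, prob in joint.items():
        totals[outcome[axis]] += prob
    return dict(totals)

p_weather = marginal(joint, 0)
p_pavement = marginal(joint, 1)
p_rain_given_wet = joint[("rain", "wet")] / p_pavement["wet"]

print("estimated P(heads):", p_heads_estimate)
print("P(weather):", p_weather)
print("P(rain | wet):", p_rain_given_wet)
```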