General notes about statistics (distributions, tests, etc.)

Test for normality:
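- Shapiro-Wilk: Null hypothesis is that the data is normally distributed. If the p-value is below alpha (0.05 or whatever significance level you are using), the null hypothesis is rejected (data is non-normal). A p-value above alpha does not prove normality, it only means no departure was detected
- With large samples the test becomes very sensitive: even trivial departures from normality come out statistically significant, so accompany the test with a Q-Q plot
- Anderson-Darling: alternative normality test that gives more weight to the tails of the distribution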
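
The Q-Q check mentioned above can be sketched with the standard library alone. This is a minimal illustration, not a replacement for a proper normality test: the helper names (`qq_points`, `pearson_r`), the `(i - 0.5) / n` plotting positions, and the simulated samples are all assumptions made for the example. A correlation near 1 between theoretical and observed quantiles means the Q-Q points fall close to a straight line.

```python
import random
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Return (theoretical, observed) quantiles for a normal Q-Q plot."""
    n = len(sample)
    observed = sorted(sample)
    # Normal distribution fitted to the sample mean and standard deviation.
    fitted = NormalDist(mean(sample), stdev(sample))
    # Plotting positions (i - 0.5) / n stay strictly inside (0, 1).
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, observed

def pearson_r(xs, ys):
    """Pearson correlation, used here as a rough straightness score."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is reproducible
normal_sample = [rng.gauss(10, 2) for _ in range(500)]
skewed_sample = [rng.expovariate(1.0) for _ in range(500)]

r_normal = pearson_r(*qq_points(normal_sample))
r_skewed = pearson_r(*qq_points(skewed_sample))
print(f"Q-Q correlation, normal sample: {r_normal:.3f}")
print(f"Q-Q correlation, skewed sample: {r_skewed:.3f}")
```

The same `(theoretical, observed)` pairs can be passed to any plotting library to draw the actual Q-Q plot.
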
Comparison of distributions (no assumption of normality)
- Kolmogorov-Smirnov test
  - Compares the empirical CDFs of two samples. The D statistic is the largest vertical gap between them: D close to 1 indicates the distributions are very different, D close to 0 indicates they are similar
- Wilcoxon’s signed-rank test
  - For paired samples: tests whether the median of the paired differences is zero
- Mann-Whitney U Test: Similar to Wilcoxon, but samples don't have to be paired
  - Null hypothesis: Both groups have the same distribution
  - If the groups are similar, U will be close to n1*n2/2, where n1 and n2 are the number of points in dataset 1 and 2. A U close to 0 or close to n1*n2 (depending on which group U is computed for) means one group's values are almost all larger than the other's
- Permutation tests: can be used with a number of different comparison metrics. Repeatedly shuffle the group labels and recompute the metric to build an empirical null distribution, then compare it with the metric computed on the real labeled data.
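
A rough standard-library sketch of three of the comparisons above: the K-S D statistic, the Mann-Whitney U statistic, and a permutation test. The function names, the mean-difference metric, the 2000 resamples, and the simulated groups are assumptions chosen for illustration. Only the statistics are computed here; p-values for K-S and Mann-Whitney would normally come from a statistics library.

```python
import random
from bisect import bisect_right

def ks_statistic(a, b):
    """Largest vertical gap between the two empirical CDFs (the D value)."""
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    for x in sa + sb:  # the gap can only change at observed data points
        fa = bisect_right(sa, x) / len(sa)
        fb = bisect_right(sb, x) / len(sb)
        d = max(d, abs(fa - fb))
    return d

def mann_whitney_u(a, b):
    """U for sample a: count of pairs (x, y) with x > y, ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

def abs_mean_diff(a, b):
    """Example comparison metric for the permutation test."""
    return abs(sum(a) / len(a) - sum(b) / len(b))

def permutation_p_value(a, b, metric, n_resamples=2000, seed=0):
    """Share of label shuffles whose metric is at least the observed one."""
    rng = random.Random(seed)
    observed = metric(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # reassign group labels at random
        if metric(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # The +1 terms count the observed labeling as one of the permutations.
    return (hits + 1) / (n_resamples + 1)

rng = random.Random(1)
group_a = [rng.gauss(0.0, 1.0) for _ in range(100)]
group_b = [rng.gauss(1.0, 1.0) for _ in range(100)]  # shifted by one SD

d = ks_statistic(group_a, group_b)
u = mann_whitney_u(group_a, group_b)
p = permutation_p_value(group_a, group_b, abs_mean_diff)
print(f"K-S D = {d:.3f}")
print(f"Mann-Whitney U = {u:.0f} (n1*n2/2 = {len(group_a) * len(group_b) / 2:.0f})")
print(f"permutation p-value = {p:.4f}")
```

Any other function of two samples, including `ks_statistic` itself, can be passed as `metric`.
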
Zipf's Law/Zipfian Distribution
- The most frequently used word occurs about 2x as often as the 2nd most frequent, 3x as often as the 3rd most frequent, etc. (frequency is roughly proportional to 1/rank)
- Describes the distribution of words in a corpus
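
A small sketch of the rank/frequency relationship. The toy text is an assumption constructed to follow the law exactly, so the observed and expected columns match; a real corpus would only follow it approximately. The helper names are hypothetical.

```python
from collections import Counter

def rank_frequency(text):
    """Return (word, count) pairs sorted from most to least frequent."""
    counts = Counter(text.lower().split())
    return counts.most_common()

def zipf_expected(top_count, n_ranks):
    """Counts predicted by Zipf's law: count at rank r = top_count / r."""
    return [top_count / rank for rank in range(1, n_ranks + 1)]

# Toy text built so the counts are 12, 6, 4, 3 (an idealized case).
toy_text = "the " * 12 + "of " * 6 + "and " * 4 + "to " * 3

ranked = rank_frequency(toy_text)
expected = zipf_expected(ranked[0][1], len(ranked))

for rank, ((word, count), exp) in enumerate(zip(ranked, expected), start=1):
    print(f"rank {rank}: {word!r} observed={count} expected={exp:.1f}")
```
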
Distance between 2 probability distributions
- Jensen-Shannon divergence
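
For discrete distributions, the Jensen-Shannon divergence is the average KL divergence of each distribution from their midpoint mixture. The sketch below assumes both inputs are probability vectors over the same outcomes and uses log base 2, so the result lies between 0 (identical) and 1 (no overlap). The function names and example vectors are illustrative.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by 1 with log base 2."""
    # Midpoint mixture; m_i > 0 wherever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.5, 0.3, 0.2]
q = [0.1, 0.4, 0.5]

print(f"JS(p, q) = {js_divergence(p, q):.4f}")
print(f"JS(q, p) = {js_divergence(q, p):.4f}")  # same value: symmetric
print(f"JS(p, p) = {js_divergence(p, p):.4f}")
```

Unlike plain KL divergence, this stays finite when one distribution assigns zero probability to an outcome the other does not.
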
Difference between statistical and probabilistic inference:
- Statistical Inference: Estimate parameters/underlying distribution based on analysis of data
- Probabilistic Inference: Computing joint or marginal distributions based on a known distribution
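
The contrast can be sketched with Python's `statistics.NormalDist`. The numbers (mean 100, standard deviation 15, threshold 130, 1000 simulated points) are assumptions for illustration only.

```python
import random
from statistics import NormalDist

# Probabilistic inference: the distribution is known, so we compute
# a probability directly from it.
known = NormalDist(mu=100, sigma=15)
p_above_130 = 1 - known.cdf(130)
print(f"P(X > 130) under the known distribution: {p_above_130:.4f}")

# Statistical inference: only data is observed, so we estimate the
# parameters of the underlying distribution from it.
rng = random.Random(42)  # simulated data standing in for real observations
data = [rng.gauss(100, 15) for _ in range(1000)]
estimated = NormalDist.from_samples(data)
print(f"estimated mean = {estimated.mean:.2f}, stdev = {estimated.stdev:.2f}")
```
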

ctufts/Stat_notes.md