Chris Tufts ctufts

Test for normality:
- Shapiro-Wilk: the null hypothesis is that the data are normally distributed. If the p-value is below alpha (0.05 or whatever significance level you are using), the null hypothesis is rejected (the data are non-normal)
- With large samples the test is sensitive to sample size: even trivial departures from normality become statistically significant, so accompany the test with a Q-Q plot
- Anderson-Darling
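
The Q-Q check mentioned above can be sketched without plotting: pair the sorted sample with the quantiles of a fitted normal and see how close the points are to a straight line. This is a minimal standard-library sketch, not a replacement for the tests themselves; the helper names `qq_points` and `qq_correlation` and the simulated samples are made up for illustration. For actual p-values, `scipy.stats.shapiro` and `scipy.stats.anderson` are the usual choices.

```python
import random
from statistics import NormalDist, mean, stdev


def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    n = len(sample)
    observed = sorted(sample)
    fitted = NormalDist(mean(sample), stdev(sample))
    # Plotting positions (i - 0.5) / n keep probabilities strictly inside (0, 1).
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))


def qq_correlation(points):
    """Pearson correlation of the Q-Q points; near 1 means a nearly straight line."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: one roughly normal sample, one clearly skewed sample.
rng = random.Random(0)
normal_sample = [rng.gauss(10, 2) for _ in range(500)]
skewed_sample = [rng.expovariate(1.0) for _ in range(500)]

r_normal = qq_correlation(qq_points(normal_sample))
r_skewed = qq_correlation(qq_points(skewed_sample))
print(f"normal: {r_normal:.4f}  skewed: {r_skewed:.4f}")
```

The pairs from `qq_points` can be passed straight to a scatter plot; the correlation is only a rough numeric summary of how straight that plot would look.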
Comparison of distributions (no assumption of normality):
- Kolmogorov-Smirnov test
  - Compares the CDFs of two samples: a D value close to 1 indicates the distributions differ, while a D value close to 0 indicates they are similar
- Wilcoxon’s signed-rank test
  - Compares two paired samples by testing whether the median of the paired differences is zero
- Mann-Whitney U test: similar to Wilcoxon, but the samples don't have to be paired
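
To make the statistics above concrete, here is a minimal pure-Python sketch of the two-sample KS statistic D and the Mann-Whitney U statistic. It computes the statistics only, not p-values; the function names and toy samples are hypothetical. In practice `scipy.stats.ks_2samp`, `scipy.stats.wilcoxon`, and `scipy.stats.mannwhitneyu` return both the statistic and a p-value.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic D: the largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # The largest gap is always reached at one of the observed values.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def mann_whitney_u(a, b):
    """U statistic for sample a: number of (x, y) pairs with x > y, ties count 0.5."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Toy samples (made up): fully separated, then partially overlapping.
low = [1, 2, 3, 4]
high = [5, 6, 7, 8]
shifted = [3, 4, 5, 6]

print(ks_statistic(low, high))       # no overlap at all
print(ks_statistic(low, low))        # identical samples
print(ks_statistic(low, shifted))    # partial overlap
print(mann_whitney_u(low, shifted), mann_whitney_u(shifted, low))
```

The two U values always sum to `len(a) * len(b)`, which is a quick sanity check on any implementation.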

save_as_text: don't use this unless you only want to read the text in the file. Otherwise it will cause issues if you want to go back later and revise or filter the dictionary
If you choose to import a dictionary then alter it, the corpus must also be updated as outlined here - Q8
You have to limit the number of features in large datasets, otherwise memory consumption is huge
This holds regardless of whether the corpus is loaded in RAM or serialized
Iterations argument - refers to the number of iterations in the EM step
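
The two points about limiting features and updating the corpus after changing the dictionary can be illustrated with a plain-Python sketch. This is not the library's API, just the idea under stated assumptions: `prune_vocabulary`, `to_bow`, their parameters, and the toy documents are all hypothetical. Note that the bag-of-words corpus is built only after pruning, because the token ids change whenever the vocabulary does.

```python
from collections import Counter


def prune_vocabulary(docs, min_docs=2, max_doc_fraction=0.5, keep_n=1000):
    """Keep tokens seen in at least min_docs documents and in at most
    max_doc_fraction of all documents, then cap the vocabulary at keep_n."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    limit = max_doc_fraction * len(docs)
    kept = [(tok, df) for tok, df in doc_freq.items() if min_docs <= df <= limit]
    # Most frequent first; ties broken alphabetically for a stable result.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return {tok: idx for idx, (tok, _) in enumerate(kept[:keep_n])}


def to_bow(doc, vocab):
    """Bag-of-words as sorted (token_id, count) pairs, dropping pruned tokens."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())


# Toy documents (made up for illustration).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]

vocab = prune_vocabulary(docs, min_docs=2, max_doc_fraction=0.5, keep_n=10)
# Rebuild the corpus from the pruned vocabulary; old ids would no longer match.
corpus = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(corpus)
```

Here "the" is dropped for appearing in every document and the one-off tokens are dropped for being too rare, which is the kind of filtering that keeps the feature count, and therefore memory use, bounded.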

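	# Rank group2 within each group1 by a summary value
	# (some_function and value_col are placeholders)
	ds %>% group_by(group1, group2) %>%
	  summarise(
	    summary_value = some_function(value_col)
	  ) %>% arrange(desc(summary_value)) %>% group_by(group1) %>%
	  mutate(rank = row_number())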

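	# One ggplot object per value of feature, stored in a list column
	library(dplyr)
	library(ggplot2)
	plot_df <- df %>% group_by(feature) %>%
	  do(
	    plots = ggplot(data = .) + aes(x = xcol, y = ycol) +
	      geom_point() + ggtitle(unique(.$feature))
	  )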

	# show plots
	plot_df$plots

	# ds has columns A, B, C: group by A, then use B and C as inputs to the
	# MSE calculation
	from sklearn import metrics

	grouped = ds.groupby('A')
	mse = grouped.apply(lambda x: metrics.mean_squared_error(x['B'], x['C']))

	------------------------------------------------------------------
	-- rename a column (CHANGE also requires restating the column type)
	ALTER TABLE `xyz` CHANGE `manufacurerid` `manufacturerid` INT;

	------------------------------------------------------------------
	-- export a table from a database
	------------------------------------------------------------------
	mysqldump db table > filename.out
	------------------------------------------------------------------
	-- import database