@ctufts
ctufts / group_arrange_assign_ranking.R
Last active June 24, 2016 14:37
Group by, summarise, sort on the summary data, and append a rank from the sort order (dplyr)
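library(dplyr)
ds %>%
  group_by(group1, group2) %>%
  # some_function is a placeholder for a summary expression, e.g. mean(some_column)
  summarise(summary_value = some_function) %>%
  arrange(desc(summary_value)) %>%
  group_by(group1) %>%
  # rank within group1, following the sorted order
  mutate(rank = row_number())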
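For comparison, the same group, summarise, sort, rank pattern can be sketched in pandas. This is a minimal sketch, not a translation of the author's code: the sample data and the `value` column are made up for illustration, and `sum` stands in for whatever summary function is needed.

```python
import pandas as pd

# Hypothetical data: two grouping columns and one value column
ds = pd.DataFrame({
    "group1": ["a", "a", "a", "b", "b"],
    "group2": ["p", "q", "q", "p", "q"],
    "value": [1, 5, 7, 4, 2],
})

# Summarise per (group1, group2), then sort by the summary, largest first
summary = (
    ds.groupby(["group1", "group2"], as_index=False)
      .agg(summary_value=("value", "sum"))
      .sort_values("summary_value", ascending=False)
)

# Rank within group1 in the sorted order (1 = largest summary_value)
summary["rank"] = summary.groupby("group1").cumcount() + 1
print(summary)
```

`cumcount()` numbers rows in their current order within each group, which plays the role of `row_number()` after `arrange()`.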
@ctufts
ctufts / .block
Last active October 15, 2024 06:34
Clustered Force Layout Bubble Chart
license: gpl-3.0
height: 500
border: yes
@ctufts
ctufts / python_reference.md
Created July 11, 2016 17:35
Pandas/Python functions/reference
  • df.dtypes : lists the dtype of each column in the DataFrame (it is an attribute, so no parentheses)
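A small illustration of the note above, using a made-up DataFrame (the column names are hypothetical):

```python
import pandas as pd

# Hypothetical frame with an integer, a float, and a string column
df = pd.DataFrame({"n": [1, 2], "x": [0.5, 1.5], "s": ["a", "b"]})

# dtypes is an attribute that returns a Series indexed by column name
types = df.dtypes
print(types)
```

Calling it as `df.dtypes()` raises a `TypeError`, because the returned Series is not callable.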
@ctufts
ctufts / group_by_and_ggplot.R
Created July 11, 2016 19:56
dplyr group_by and ggplot example
library(dplyr)
library(ggplot2)

# build one scatter plot per level of `feature`
plot_df <- df %>% group_by(feature) %>%
  do(
    plots = ggplot(data = .) + aes(x = xcol, y = ycol) +
      geom_point() + ggtitle(as.character(.$feature[1]))
  )
# show plots
plot_df$plots
@ctufts
ctufts / Stat_notes.md
Last active July 22, 2016 20:38
General notes about statistics (distributions, tests, etc.)
  • Test for normality:
    • Shapiro-Wilk: the null hypothesis is that the data are normally distributed. If the p-value is below alpha (0.05 or whatever significance level you are using), the null hypothesis is rejected (the data are non-normal)
    • With large samples the test is sensitive to sample size (even small deviations from normality become statistically significant), so accompany the test with a Q-Q plot
    • Anderson-Darling
  • Comparison of distributions (no assumption of normality)
    • Mann-Whitney U Test: similar to the Wilcoxon signed-rank test, but the samples don't have to be paired
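To make the "samples don't have to be paired" point concrete, here is a minimal pure-Python sketch of the Mann-Whitney U statistic for two samples of different sizes. The function name and the sample values are made up for illustration; it computes only the statistic, not a p-value, and in practice a library routine such as `scipy.stats.mannwhitneyu` would normally be used.

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two independent samples; ties count as 0.5."""
    u_x = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u_x += 1.0
            elif xi == yj:
                u_x += 0.5
    # The two U statistics always sum to len(x) * len(y)
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


# Hypothetical samples of unequal length (no pairing required)
sample_a = [1.1, 2.3, 3.8, 4.0]
sample_b = [2.0, 3.1, 5.2, 6.4, 7.7]

u_a, u_b = mann_whitney_u(sample_a, sample_b)
print(u_a, u_b)
```

`U_x` counts the pairs in which a value from the first sample exceeds a value from the second, so a `U_x` far from half of `len(x) * len(y)` suggests one distribution tends to produce larger values.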
@ctufts
ctufts / groupby_apply_multiple_inputs.py
Created July 29, 2016 18:06
group by and apply a function with multiple input arguments (PANDAS)
from sklearn import metrics

# ds has columns A, B, C: group by A, then use B and C as inputs to the
# MSE calculation
grouped = ds.groupby('A')
mse = grouped.apply(lambda x: metrics.mean_squared_error(x['B'], x['C']))
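A self-contained variant of the same idea, as a sketch: the data is made up, and the MSE is computed by hand with pandas (a swapped-in calculation so the example does not depend on scikit-learn).

```python
import pandas as pd

# Hypothetical data: A is the grouping key, B and C are compared per group
ds = pd.DataFrame({
    "A": ["x", "x", "y", "y"],
    "B": [1.0, 2.0, 3.0, 5.0],
    "C": [1.0, 4.0, 3.0, 1.0],
})


def mse(group):
    """Mean squared error between columns B and C of one group."""
    return ((group["B"] - group["C"]) ** 2).mean()


# Select only B and C so each group passed to mse() holds just the inputs
mse_by_group = ds.groupby("A")[["B", "C"]].apply(mse)
print(mse_by_group)
```

The result is a Series indexed by the values of `A`, with one MSE per group.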
@ctufts
ctufts / gensim_notes.md
Last active August 8, 2016 18:08
General notes from using gensim on 20 million messages
  • save_as_text : don't use this unless you just want to read the text in the file. Otherwise it will cause issues if you want to go back later and revise/filter the dictionary
  • If you choose to import a dictionary then alter it, the corpus must also be updated as outlined here - Q8
  • You have to limit the number of features in large datasets; otherwise the memory consumption is huge
  • This is regardless of whether the corpus is loaded in RAM or serialized
  • Iterations argument - refers to the number of iterations in the EM step
@ctufts
ctufts / common_operations.sql
Last active April 26, 2017 19:23
MySQL examples/common operations
------------------------------------------------------------------
-- alter column name
ALTER TABLE `xyz` CHANGE `manufacurerid` `manufacturerid` INT;
------------------------------------------------------------------
-- export database (run from the shell, not the MySQL prompt)
------------------------------------------------------------------
mysqldump db table > filename.out
------------------------------------------------------------------
-- import database
@ctufts
ctufts / .block
Last active June 28, 2017 00:58
Interactive Scatterplot with Regression Line
license: gpl-3.0
height: 500
scrolling: no
border: no
@ctufts
ctufts / .block
Last active December 13, 2016 20:24
Path Transitions - Izhikevich Neuron Model
license: gpl-3.0
height: 500
scrolling: no
border: no