- feature engineering (most important by far)!!!!!
- simple models
- beware of overfitting the public leaderboard
- ensembling
- predict the right thing!
- build pipeline and put something on the leaderboard
- allocate time to play with data, explore
- make heavy use of forums
- understand subtleties of algos, know what tool to use when
https://www.kaggle.com/c/15-071x-the-analytics-edge-competition-spring-2015/forums/t/13492/useful-tips?limit=all
- skewed data (hist -> log)
- scaling (mean 0, stddev 1), centering, normalizing
- factors where sensible
- ideally work with model.matrix (but didn’t get it to work)
- use correct outcome metric (AUC, etc.)
- look at differences, patterns between train/test
- feature selection (“importance”, “varImp”)
- research comps have less competition than the ones for money
- good team!
- very, very good
- use model.matrix
- eliminate collinear predictors, scale, center
- imputation of missing data
- transformation (Box-Cox, PCA/ICA)
- "To get honest estimates of performance, all data transformations should be included within the cross–validation loop."
https://github.com/sux13/DataScienceSpCourseNotes/raw/master/8_PREDMACHLEARN/Practical_Machine_Learning_Course_Notes.pdf
- caret tuneLength
- author likes repeated cv
- try “C5.0” method
- author prefers “kernlab” for svm
- run each algorithm on the same resamples by setting the seed before each train() call
- caret “resamples” for comparing models (sketch after this list)
- visualize the test/train data and the parameter distributions
- bagging, boosting, stacking, ensembling
- complexity of model should reflect complexity of data
- test for normality visually (skew, kurtosis, outliers, z-scores)
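A sketch of the seed-before-train / resamples() comparison, assuming a hypothetical `train_df` whose outcome `y` is a factor (C5.0 is classification-only):

```r
library(caret)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# The same seed before each call gives every model identical resampling indices,
# so their resampled performance is directly comparable.
set.seed(42)
fit_gbm <- train(y ~ ., data = train_df, method = "gbm", trControl = ctrl, verbose = FALSE)
set.seed(42)
fit_c50 <- train(y ~ ., data = train_df, method = "C5.0", trControl = ctrl)
set.seed(42)
fit_svm <- train(y ~ ., data = train_df, method = "svmRadial", trControl = ctrl)  # kernlab backend

res <- resamples(list(GBM = fit_gbm, C50 = fit_c50, SVM = fit_svm))
summary(res)
bwplot(res)  # lattice box plots of the resampled metrics
```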
- keep a record of the CV score for every submission, tag the git commit used for each one, and give submissions unique IDs
- train a classifier and look at feature weights, importance. visualize a tree
- tau n-grams
- compute differences/ratios of features
- discard features that are “too good”
- understand data distribution, collinearity, peculiarities cont/discrete, etc.
- understand differences between training and test data
- think more, try less
- when in doubt, use gbm
- target 1000 trees, tune learning rate
- don’t be afraid to use 10+ interaction depth (see the gbm sketch below)
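A sketch of those gbm defaults via caret; the grid values are illustrative and the data names (`train_df`, `y`) hypothetical.

```r
library(caret)

gbm_grid <- expand.grid(n.trees = 1000,                  # target ~1000 trees
                        shrinkage = c(0.1, 0.03, 0.01),  # tune the learning rate
                        interaction.depth = c(6, 10, 14),
                        n.minobsinnode = 10)

set.seed(42)
fit_gbm <- train(y ~ ., data = train_df, method = "gbm",
                 tuneGrid = gbm_grid,
                 trControl = trainControl(method = "cv", number = 5),
                 verbose = FALSE)
fit_gbm$bestTune
```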
- convert high-cardinality categoricals into numerical features (out-of-fold target average; sketch after this list)
- glmnet - “opposite of gbm” (needs much more work)
- tau for text mining (n-grams, num chars, num words)
- many “text-mining” comps. are dominated by structured fields
- when in doubt, use average blender
- don’t use validation set until the very end
- convert ordinal vars to numeric
- convert categorical variables to indicator (one-hot) vectors
- kernel pca
- look out for imbalanced training/test set
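A sketch of out-of-fold target encoding for the high-cardinality tip above; the column names and fold count are hypothetical, and the target is assumed numeric (e.g. 0/1).

```r
# Encode a high-cardinality categorical as the mean of the target,
# computed only on out-of-fold rows to avoid leakage.
oof_mean_encode <- function(df, cat_col, target_col, k = 5, seed = 42) {
  set.seed(seed)
  folds <- sample(rep(seq_len(k), length.out = nrow(df)))
  enc <- numeric(nrow(df))
  global_mean <- mean(df[[target_col]])
  for (f in seq_len(k)) {
    in_fold <- folds == f
    means <- tapply(df[[target_col]][!in_fold], df[[cat_col]][!in_fold], mean)
    enc[in_fold] <- means[as.character(df[[cat_col]][in_fold])]
  }
  enc[is.na(enc)] <- global_mean  # levels unseen outside the fold fall back to the global mean
  enc
}

# train_df$city_enc <- oof_mean_encode(train_df, "city", "y")   # usage (hypothetical columns)
```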
- excellent resource all around
- Kaggle best practices
- standardize before regularizing (see the glmnet sketch below)
- multi-collinearity
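A small glmnet sketch for the two points above, assuming a hypothetical `train_df` with outcome `y`: the L1/L2 penalty is scale-dependent, glmnet standardizes predictors internally by default, and the regularization also tames multi-collinearity.

```r
library(glmnet)

x <- model.matrix(y ~ . - 1, data = train_df)   # numeric design matrix, factors expanded
fit <- cv.glmnet(x, train_df$y, alpha = 1,      # lasso
                 standardize = TRUE)            # predictors standardized before penalization
coef(fit, s = "lambda.min")
```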
http://blog.kaggle.com/2011/01/14/how-i-did-it-will-cukierski-on-finishing-second-in-the-ijcnn-social-network-challenge/#more-728
- anecdotal
- awesome
- very good feature selection
- good
- start a package via package.skeleton()
- RUnit, testthat
- rbenchmark, microbenchmark, pdbPROF, etc.
- never use .RData, code must run in batch mode
- prefer “attach()” over “load()”
- floating point is not exact
- use log1p(x) rather than log(1 + x) for x << 1 (examples below)
- use x[ind, drop=FALSE] rather than x[ind,]
- is.na()
- always go column-by-column, not row-by-row
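Tiny illustrations of the floating-point, log1p, and drop = FALSE points above:

```r
x <- 1e-12
log(1 + x)   # ~1.000089e-12: loses precision because 1 + x is rounded first
log1p(x)     # 1e-12: accurate for x << 1

0.1 + 0.2 == 0.3                    # FALSE: floating point is not exact
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: compare with a tolerance instead

m <- matrix(1:6, nrow = 3)
m[1, ]                # silently drops to a vector
m[1, , drop = FALSE]  # stays a 1-row matrix, safer inside generic code
```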
- plot training error vs. test error (learning-curve sketch below)
- good overview
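A self-contained sketch of that train-vs-test-error plot on synthetic data, using polynomial degree as the complexity axis:

```r
set.seed(1)
n <- 200
x <- runif(n, -2, 2)
y <- sin(x) + rnorm(n, sd = 0.3)
idx <- sample(n, n / 2)
train <- data.frame(x = x[idx],  y = y[idx])
test  <- data.frame(x = x[-idx], y = y[-idx])

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
degrees <- 1:12
err <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(train = rmse(train$y, predict(fit, train)),
    test  = rmse(test$y,  predict(fit, test)))
}))

# Training error keeps falling with complexity; test error turns back up once we overfit.
matplot(degrees, err, type = "b", pch = 1, lty = 1,
        xlab = "polynomial degree", ylab = "RMSE")
legend("topright", legend = colnames(err), col = 1:2, lty = 1)
```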
https://www.kaggle.com/c/stumbleupon/forums/t/5761/methods-for-getting-a-first-overview-over-the-data
- how to visualize data
- visualization tab in weka
- cluster the data and look at the clusters (sketch below)
- great post by “Martin”
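A sketch of the cluster-and-look idea, assuming a hypothetical `train_df` with complete, non-constant numeric columns:

```r
# k-means on scaled numeric features, inspected on the first two principal components.
num <- scale(Filter(is.numeric, train_df))

set.seed(42)
km <- kmeans(num, centers = 4, nstart = 25)

pc <- prcomp(num)
plot(pc$x[, 1:2], col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "k-means clusters on first two PCs")
```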
Other Sources:
- https://medium.com/@nomadic_mind/new-to-machine-learning-avoid-these-three-mistakes-73258b3848a4 - pretty good
- https://www.kaggle.com/c/pakdd-cup-2014/forums/t/7573/what-did-you-do-to-get-to-the-top-of-the-board
- http://danielnee.com/archives/155
- http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
- https://medium.com/cs-math/why-becoming-a-data-scientist-is-not-actually-easier-than-you-think-5b65b548069b
- http://blog.kaggle.com/2012/07/06/the-dangers-of-overfitting-psychopathy-post-mortem/
- http://www.autonlab.org/tutorials/overfit10.pdf
- http://machinelearningmastery.com/hands-on-big-data-by-peter-norvig/
- parameter tuning (visualize parameters)
- talk about issues with factor vars, as well as different factor levels between train and test (helper sketch after this list)
- in many cases the actual run shouldn’t run too long. if it does, make sure you haven’t overcomplicated things
- find best way of visualizing data for exploratory purposes
- start with the simplest approach that could possibly work, and refine/iterate from there
- read relevant literature, but don’t get carried away, and keep things simple
- be careful about your intuitions
- understand the target function and eval function (AUC, etc.)
- choose the right tool for the right job (postgres, excel, R, etc.)
- doing cross validation on test set is a serious methodological error (sic!)
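A small helper sketch for the train/test factor-level issue mentioned above; the data frame names are hypothetical, and unseen test levels become NA, which still needs explicit handling.

```r
# Force every factor column in test to use the training levels, so
# model.matrix()/predict() do not fail on mismatched level sets.
align_levels <- function(train_df, test_df) {
  for (col in names(train_df)) {
    if (is.factor(train_df[[col]]) && col %in% names(test_df)) {
      test_df[[col]] <- factor(test_df[[col]], levels = levels(train_df[[col]]))
    }
  }
  test_df
}

# test_df <- align_levels(train_df, test_df)   # usage
```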
- consider using postgres for data analysis
- easy import from csv
- very powerful queries (arithmetic, select, complex queries, etc.)
- http://www.postgresql.org/docs/9.2/static/functions-aggregate.html
- http://www.postgresql.org/docs/9.4/static/functions-math.html
- http://madlib.net/product/ (machine learning in sql)
- http://postgis.net/
- dplyr (R integration with SQL databases; example at the end)
- http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html (alternative to sql queries?)
- http://zevross.com/blog/2014/03/26/four-reasons-why-you-should-check-out-the-r-package-dplyr-3/
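A minimal dplyr example of the SQL-style workflow from the links above; the same verbs also run against a Postgres connection through dplyr's database backends.

```r
library(dplyr)

mtcars %>%
  group_by(cyl) %>%                              # GROUP BY cyl
  summarise(n = n(), mean_mpg = mean(mpg)) %>%   # aggregate per group
  arrange(desc(mean_mpg))                        # ORDER BY mean_mpg DESC
```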