- feature engineering (most important by far)!!!!!
- simple models
- beware of overfitting the public leaderboard
- ensembling
- predict the right thing!
- build pipeline and put something on the leaderboard
- allocate time to play with data, explore
- make heavy use of forums
- understand subtleties of algos, know what tool to use when
https://www.kaggle.com/c/15-071x-the-analytics-edge-competition-spring-2015/forums/t/13492/useful-tips?limit=all
- skewed data (hist -> log)
- scaling (mean 0, stddev 1), centering, normalizing
- factors where sensible
- ideally work with model.matrix (but didn’t get it to work)
- use correct outcome metric (AUC, etc.)
- look at differences, patterns between train/test
- feature selection (“importance”, “varImp”)
- research comps have less competition than the ones for money
- good team!
- very, very good
- use model.matrix
- eliminate collinear predictors, scale, center
- imputation of missing data
- transformation (Box-Cox, PCA/ICA)
- "To get honest estimates of performance, all data transformations should be included within the cross–validation loop."
https://github.com/sux13/DataScienceSpCourseNotes/raw/master/8_PREDMACHLEARN/Practical_Machine_Learning_Course_Notes.pdf
- caret tuneLength
- author likes repeated cv
- try “C5.0” method
- author prefers “kernlab” for svm
- run each algorithm on the same resamples by setting the seed before each train() call
- caret “resamples” for comparing models (sketch after this list)
- visualize the test/train data and the parameter distributions
- bagging, boosting, stacking, ensembling
- complexity of model should reflect complexity of data
- test for normality visually (skew, kurtosis, outliers, z-scores)
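A sketch of the seed-before-train / resamples() comparison, assuming a hypothetical `train_df` whose outcome `y` is a factor (C5.0 is classification-only):

```r
library(caret)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# The same seed before each call gives every model identical resampling indices,
# so their resampled performance is directly comparable.
set.seed(42)
fit_gbm <- train(y ~ ., data = train_df, method = "gbm", trControl = ctrl, verbose = FALSE)
set.seed(42)
fit_c50 <- train(y ~ ., data = train_df, method = "C5.0", trControl = ctrl)
set.seed(42)
fit_svm <- train(y ~ ., data = train_df, method = "svmRadial", trControl = ctrl)  # kernlab backend

res <- resamples(list(GBM = fit_gbm, C50 = fit_c50, SVM = fit_svm))
summary(res)
bwplot(res)  # lattice box plots of the resampled metrics
```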
- keep a record of the CV score for every submission, tag the git commit used for each one, and give submissions unique IDs
- train a classifier and look at feature weights, importance. visualize a tree
- tau n-grams
- compute differences/ratios of features
- discard features that are “too good”
- understand data distribution, collinearity, peculiarities cont/discrete, etc.
- understand differences between training and test data
- think more, try less
- when in doubt, use gbm
- target 1000 trees, tune learning rate
- don’t be afraid to use 10+ interaction depth (see the gbm sketch below)
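A sketch of those gbm defaults via caret; the grid values are illustrative and the data names (`train_df`, `y`) hypothetical.

```r
library(caret)

gbm_grid <- expand.grid(n.trees = 1000,                  # target ~1000 trees
                        shrinkage = c(0.1, 0.03, 0.01),  # tune the learning rate
                        interaction.depth = c(6, 10, 14),
                        n.minobsinnode = 10)

set.seed(42)
fit_gbm <- train(y ~ ., data = train_df, method = "gbm",
                 tuneGrid = gbm_grid,
                 trControl = trainControl(method = "cv", number = 5),
                 verbose = FALSE)
fit_gbm$bestTune
```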
- convert high-cardinality categoricals into numerical features (out-of-fold target average; sketch after this list)
- glmnet - “opposite of gbm” (needs much more work)
- tau for text mining (n-grams, num chars, num words)
- many “text-mining” comps. are dominated by structured fields
- when in doubt, use average blender
- don’t use validation set until the very end
- convert ordinal vars to numeric
- convert categorical variables to indicator (one-hot) vectors
- kernel pca
- look out for imbalanced training/test set
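A sketch of out-of-fold target encoding for the high-cardinality tip above; the column names and fold count are hypothetical, and the target is assumed numeric (e.g. 0/1).

```r
# Encode a high-cardinality categorical as the mean of the target,
# computed only on out-of-fold rows to avoid leakage.
oof_mean_encode <- function(df, cat_col, target_col, k = 5, seed = 42) {
  set.seed(seed)
  folds <- sample(rep(seq_len(k), length.out = nrow(df)))
  enc <- numeric(nrow(df))
  global_mean <- mean(df[[target_col]])
  for (f in seq_len(k)) {
    in_fold <- folds == f
    means <- tapply(df[[target_col]][!in_fold], df[[cat_col]][!in_fold], mean)
    enc[in_fold] <- means[as.character(df[[cat_col]][in_fold])]
  }
  enc[is.na(enc)] <- global_mean  # levels unseen outside the fold fall back to the global mean
  enc
}

# train_df$city_enc <- oof_mean_encode(train_df, "city", "y")   # usage (hypothetical columns)
```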
- excellent resource all around
- Kaggle best practices
- standardize before regularizing (see the glmnet sketch below)
- multi-collinearity
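A small glmnet sketch for the two points above, assuming a hypothetical `train_df` with outcome `y`: the L1/L2 penalty is scale-dependent, glmnet standardizes predictors internally by default, and the regularization also tames multi-collinearity.

```r
library(glmnet)

x <- model.matrix(y ~ . - 1, data = train_df)   # numeric design matrix, factors expanded
fit <- cv.glmnet(x, train_df$y, alpha = 1,      # lasso
                 standardize = TRUE)            # predictors standardized before penalization
coef(fit, s = "lambda.min")
```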
http://blog.kaggle.com/2011/01/14/how-i-did-it-will-cukierski-on-finishing-second-in-the-ijcnn-social-network-challenge/#more-728
- anecdotal
- awesome
- very good feature selection
- good
- start a package via package.skeleton()
- RUnit, testthat
- rbenchmark, microbenchmark, pdbPROF, etc.
- never use .RData, code must run in batch mode
- prefer “attach()” over “load()”
- floating point is not exact
- use log1p(x) rather than log(1 + x) for x << 1 (examples below)
- use x[ind, drop=FALSE] rather than x[ind,]
- is.na()
- always go column-by-column, not row-by-row
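Tiny illustrations of the floating-point, log1p, and drop = FALSE points above:

```r
x <- 1e-12
log(1 + x)   # ~1.000089e-12: loses precision because 1 + x is rounded first
log1p(x)     # 1e-12: accurate for x << 1

0.1 + 0.2 == 0.3                    # FALSE: floating point is not exact
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: compare with a tolerance instead

m <- matrix(1:6, nrow = 3)
m[1, ]                # silently drops to a vector
m[1, , drop = FALSE]  # stays a 1-row matrix, safer inside generic code
```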
- plot training error vs. test error (learning-curve sketch below)
- good overview
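A self-contained sketch of that train-vs-test-error plot on synthetic data, using polynomial degree as the complexity axis:

```r
set.seed(1)
n <- 200
x <- runif(n, -2, 2)
y <- sin(x) + rnorm(n, sd = 0.3)
idx <- sample(n, n / 2)
train <- data.frame(x = x[idx],  y = y[idx])
test  <- data.frame(x = x[-idx], y = y[-idx])

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
degrees <- 1:12
err <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(train = rmse(train$y, predict(fit, train)),
    test  = rmse(test$y,  predict(fit, test)))
}))

# Training error keeps falling with complexity; test error turns back up once we overfit.
matplot(degrees, err, type = "b", pch = 1, lty = 1,
        xlab = "polynomial degree", ylab = "RMSE")
legend("topright", legend = colnames(err), col = 1:2, lty = 1)
```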
https://www.kaggle.com/c/stumbleupon/forums/t/5761/methods-for-getting-a-first-overview-over-the-data
- how to visualize data
- visualization tab in weka
- cluster the data and look at the clusters (sketch below)
- great post by “Martin”
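A sketch of the cluster-and-look idea, assuming a hypothetical `train_df` with complete, non-constant numeric columns:

```r
# k-means on scaled numeric features, inspected on the first two principal components.
num <- scale(Filter(is.numeric, train_df))

set.seed(42)
km <- kmeans(num, centers = 4, nstart = 25)

pc <- prcomp(num)
plot(pc$x[, 1:2], col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "k-means clusters on first two PCs")
```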
Other Sources:
- https://medium.com/@nomadic_mind/new-to-machine-learning-avoid-these-three-mistakes-73258b3848a4 - pretty good
- https://www.kaggle.com/c/pakdd-cup-2014/forums/t/7573/what-did-you-do-to-get-to-the-top-of-the-board
- http://danielnee.com/archives/155
- http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
- https://medium.com/cs-math/why-becoming-a-data-scientist-is-not-actually-easier-than-you-think-5b65b548069b
- http://blog.kaggle.com/2012/07/06/the-dangers-of-overfitting-psychopathy-post-mortem/
- http://www.autonlab.org/tutorials/overfit10.pdf
- http://machinelearningmastery.com/hands-on-big-data-by-peter-norvig/
- parameter tuning (visualize parameters)
- talk about issues with factor vars, as well as different factor levels between train and test (helper sketch after this list)
- in many cases the actual run shouldn’t run too long. if it does, make sure you haven’t overcomplicated things
- find best way of visualizing data for exploratory purposes
- start with the simplest approach that could possibly work, and refine/iterate from there
- read relevant literature, but don’t get carried away, and keep things simple
- be careful about your intuitions
- understand the target function and eval function (AUC, etc.)
- choose the right tool for the right job (postgres, excel, R, etc.)
- doing cross validation on test set is a serious methodological error (sic!)
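A small helper sketch for the train/test factor-level issue mentioned above; the data frame names are hypothetical, and unseen test levels become NA, which still needs explicit handling.

```r
# Force every factor column in test to use the training levels, so
# model.matrix()/predict() do not fail on mismatched level sets.
align_levels <- function(train_df, test_df) {
  for (col in names(train_df)) {
    if (is.factor(train_df[[col]]) && col %in% names(test_df)) {
      test_df[[col]] <- factor(test_df[[col]], levels = levels(train_df[[col]]))
    }
  }
  test_df
}

# test_df <- align_levels(train_df, test_df)   # usage
```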
- consider using postgres for data analysis
- easy import from csv
- very powerful queries (arithmetic, select, complex queries, etc.)
- http://www.postgresql.org/docs/9.2/static/functions-aggregate.html
- http://www.postgresql.org/docs/9.4/static/functions-math.html
- http://madlib.net/product/ (machine learning in sql)
- http://postgis.net/
- dplyr (R integration with SQL databases; example at the end)
- http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html (alternative to sql queries?)
- http://zevross.com/blog/2014/03/26/four-reasons-why-you-should-check-out-the-r-package-dplyr-3/
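A minimal dplyr example of the SQL-style workflow from the links above; the same verbs also run against a Postgres connection through dplyr's database backends.

```r
library(dplyr)

mtcars %>%
  group_by(cyl) %>%                              # GROUP BY cyl
  summarise(n = n(), mean_mpg = mean(mpg)) %>%   # aggregate per group
  arrange(desc(mean_mpg))                        # ORDER BY mean_mpg DESC
```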