@krmaxwell
Forked from hadley/ds-training.md
Last active August 29, 2015 14:17

If you were to give recommendations to your "little brother/sister" on things that they need to do to become a data scientist, what would those things be?

I think the "Data Science Venn Diagram" (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram) is a great place to start. You need three things to be a good data scientist:

  • Statistical knowledge
  • Programming/hacking skills
  • Domain expertise

Statistical knowledge

You need to be able to think "statistically": you need to be able to turn sample data into inferences about the underlying population. I'm not sure how you develop statistical thinking - I did it through a master's and then a PhD in statistics, but that's obviously a big time investment!
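As one small illustration of turning a sample into a statement about the population, here is a minimal sketch of a percentile bootstrap confidence interval for a mean. It uses only the Python standard library; the function name `bootstrap_mean_ci`, the simulated data, and the choice of the bootstrap itself are assumptions made for the example, not something the text above prescribes.

```python
import random
import statistics


def bootstrap_mean_ci(sample, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the population mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Trim the same number of resample means from each tail.
    cut = int((1 - level) / 2 * n_resamples)
    return means[cut], means[n_resamples - 1 - cut]


# Hypothetical data: 50 draws from a population with mean 10 and sd 2.
data_rng = random.Random(42)
sample = [data_rng.gauss(10, 2) for _ in range(50)]

low, high = bootstrap_mean_ci(sample)
print(f"sample mean = {statistics.fmean(sample):.2f}")
print(f"95% interval for the population mean: ({low:.2f}, {high:.2f})")
```

The point is the shape of the reasoning rather than the specific method: the sample mean is a single number, while the interval says how far the population mean could plausibly be from it.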

I think you need some knowledge of specific statistical/machine learning techniques, but a deep theoretical understanding is not that important. What matters is understanding the strengths and weaknesses of each technique. The vast majority of data science problems can be solved by a creative assembly of off-the-shelf techniques, and don't require new theory.

I'd recommend developing a familiarity with linear models and their variations (esp. generalised linear models, splines and the lasso). Yes, they are linear, but a linear approximation is a good place to start for many problems. For problems that focus more on prediction than understanding, make sure you're familiar with the most popular ML techniques, e.g. random forests and support vector machines.
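To make the "linear approximation is a good place to start" point concrete, here is a minimal sketch that fits a straight line by ordinary least squares to mildly curved data and checks how much of the variation the line captures. The helper names `fit_line` and `r_squared` and the data are hypothetical, and in practice you would reach for a modelling library rather than hand-rolled formulas.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data: a relationship that is not quite linear.
xs = list(range(11))
ys = [x + 0.05 * x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
r2 = r_squared(xs, ys, intercept, slope)
print(f"y = {intercept:.2f} + {slope:.2f} * x, R^2 = {r2:.3f}")
```

Even though the true relationship has a quadratic term, the straight line explains almost all of the variation over this range, which is often enough for a first pass.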

Programming skills

You need to be fluent in either R or Python. There are other options, but none of them has a community on the scale of R's or Python's, so with anything else you'll spend a lot of time reinventing tools that already exist elsewhere. Obviously, I prefer R: contrary to what some people claim, it is a well-founded programming language that is well tailored to its domain.

If you use R you want to be conversant with a set of packages that allows you to solve the following practical problems:

  • Ingest: how do you get your data into R?
  • Manipulation: how do you filter, summarise, mutate, etc.?
  • Visualisation: how do you explore your data visually?
  • Modelling: once you have a precise question, how do you answer it with a model?
  • Reporting: once you've figured out the solution, how do you communicate it to others?
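The same sequence of steps applies whichever language you pick. As a rough sketch, here is a stripped-down version of ingest, manipulation and reporting using only the Python standard library; visualisation and modelling are left out to keep it short. The data and the function names (`ingest`, `manipulate`, `summarise`, `report`) are hypothetical.

```python
import csv
import io
from collections import defaultdict
from statistics import fmean

# Hypothetical data, inlined so the sketch is self-contained.
RAW = """city,year,temp_f
Austin,2014,68.5
Austin,2015,70.1
Boston,2014,51.2
Boston,2015,
Boston,2015,52.0
"""


def ingest(text):
    """Ingest: parse CSV text into a list of dict rows (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def manipulate(rows):
    """Manipulation: drop rows with a missing temperature, add a Celsius column."""
    cleaned = []
    for row in rows:
        if row["temp_f"] == "":
            continue  # filter out missing measurements
        temp_f = float(row["temp_f"])
        cleaned.append(
            {
                "city": row["city"],
                "year": int(row["year"]),
                "temp_c": (temp_f - 32) * 5 / 9,  # derived ("mutated") column
            }
        )
    return cleaned


def summarise(rows):
    """Summarise: mean temperature in Celsius per city."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["city"]].append(row["temp_c"])
    return {city: fmean(temps) for city, temps in groups.items()}


def report(summary):
    """Reporting: one formatted line per city, sorted by name."""
    return "\n".join(
        f"{city}: {mean_c:.1f} C" for city, mean_c in sorted(summary.items())
    )


summary = summarise(manipulate(ingest(RAW)))
print(report(summary))
```

Keeping each step as its own function makes it easy to swap one stage (say, reading from a database instead of a CSV file) without touching the others.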

My recommendations for starting places are:

  • Ingest: readr (flat files), DBI (databases), tidyr (data tidying)
  • Manipulation: dplyr
  • Visualisation: ggplot2 (and ggvis in a year or two)
  • Modelling: caret
  • Reporting: rmarkdown and shiny

You should also invest some time in learning how to be a productive R programmer (e.g. http://adv-r.had.co.nz) and learning how to write packages (http://r-pkgs.had.co.nz). Start by learning the basics of functional programming - this will have the biggest payoff for your productivity in R.
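The basics of functional programming carry across languages, so here is a minimal sketch in Python of the three ideas that tend to matter most: passing functions to other functions, building functions with closures, and composing small steps into a pipeline. Base R offers comparable tools such as `lapply`, `Map`, `Filter` and `Reduce`. The helper names `compose` and `scale_by` are made up for this example.

```python
from functools import reduce


def compose(*funcs):
    """Return a function that applies funcs one after another, left to right."""
    return lambda value: reduce(lambda acc, f: f(acc), funcs, value)


def scale_by(factor):
    """Closure: build a function that remembers `factor`."""
    return lambda x: x * factor


values = [3, -1, 4, -1, 5]

# Higher-order functions: filter, map and reduce each take a function argument.
positives = list(filter(lambda x: x > 0, values))
doubled = list(map(scale_by(2), positives))
total = reduce(lambda acc, x: acc + x, doubled, 0)

# The same computation expressed as one reusable pipeline of small steps.
sum_of_doubled_positives = compose(
    lambda xs: [x for x in xs if x > 0],
    lambda xs: [x * 2 for x in xs],
    sum,
)

print(total, sum_of_doubled_positives(values))
```

Both routes give the same answer; the payoff of the second is that the pipeline is a value you can name, reuse and test on its own.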

Domain knowledge

This obviously depends on the domain, but as a data scientist you should be able to contribute meaningfully to any project, even if you're not intimately familiar with the specifics. I think this means you should be generally well read (e.g. at the level of New Scientist for the sciences) and an able communicator. A good data scientist will help the real domain experts refine and frame their questions in a helpful way. Unfortunately I don't know of any good resources for learning how to ask questions.
