@ryanorsinger
Created September 27, 2019 15:18
Data Science Workflow and Overview

our trade/craft

gaining insights from data for actionable decisions

data science -> an applied, interdisciplinary science of using data to make decisions

practicing data science is a function

  1. input
  2. process
  3. output

data science workflow

  • plan
  • acquire
  • prepare
  • explore
  • model
  • present
  • maintain your data products

inputs:

  • csv

  • sql

  • raw text

  • images

  • audio or video

  • various types of data in various types of formats

  • we have 2 main kinds of data

    • labeled data:
      • we have our targets
      • we have our labels (some human interaction)
        • example: hitting the spam button adds a label to the data inside that email
          • sender, subject, recipient, attachments, language, wording
    • unlabeled data:
      • raw text
      • images where a human has not added any input
  • sometimes our data is structured and sometimes not.

    • either way, it requires work to prepare or integrate

acquire

  • sometimes, it's as easy as pd.read_csv()
  • sometimes, we've got to go collect the data ourselves
  • how do we sample a population so that the sample is representative?
  • this is where you need to be good w/ sql
  • may have to talk to different data sources
  • overlap w/ other software dev skills
  • you may get data from a data engineer
  • most of the time, this is on you.
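
A minimal sketch of the two easy cases above: reading a csv and querying sql. The csv text, the customers table, and every column name are made up for illustration, and an in-memory SQLite database stands in for whatever database you actually talk to.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical csv contents standing in for a file on disk.
csv_text = "customer_id,units,price\n1,3,9.99\n2,1,24.50\n3,5,4.00\n"
orders = pd.read_csv(io.StringIO(csv_text))

# Hypothetical sql source: an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "north"), (2, "south"), (3, "north")],
)
customers = pd.read_sql_query(
    "SELECT customer_id, region FROM customers WHERE region = 'north'", conn
)
conn.close()

# Combine the two sources on their shared key.
acquired = orders.merge(customers, on="customer_id", how="inner")
print(acquired)
```

With a real file you would pass a path to pd.read_csv() instead of the StringIO wrapper.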

prepare

  • roughly 80% of our time goes to preparing the data (a rule of thumb, not a measured figure)
  • pandas, pandas, pandas
  • it's pandamonium.
  • derived values like gross revenue if you have units and price
  • feature engineering
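
A small sketch of the derived-value idea above, assuming a hypothetical table with units and price columns. The missing value and the text-typed prices are invented to show the kind of cleanup that usually comes first.

```python
import pandas as pd

# Hypothetical raw data with typical problems: a missing value and numbers stored as text.
raw = pd.DataFrame(
    {
        "units": [3, 1, None, 5],
        "price": ["9.99", "24.50", "4.00", "4.00"],
    }
)

# Drop rows we cannot use, then fix the column types.
prepared = raw.dropna(subset=["units"]).copy()
prepared["units"] = prepared["units"].astype(int)
prepared["price"] = prepared["price"].astype(float)

# Derived value: gross revenue from units and price.
prepared["gross_revenue"] = prepared["units"] * prepared["price"]
print(prepared)
```

Dropping the incomplete row is only one option; filling it in may suit your data better.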

explore

  • statistics
  • probability
  • visualization
  • we explore the data to figure out what model is a good candidate
  • set aside 30% of our data for testing our model
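
One possible way to set aside the 30%, sketched with pandas alone. The ten-row dataset and the seed value are assumptions for illustration. sklearn's train_test_split is a common alternative.

```python
import pandas as pd

# Hypothetical dataset of 10 rows.
df = pd.DataFrame({"x": range(10), "y": [v * 2 for v in range(10)]})

# Hold out 30% of the rows at random; the seed makes the split repeatable.
test = df.sample(frac=0.3, random_state=42)
train = df.drop(test.index)

print(len(train), len(test))
```

The model never sees the test rows until it is time to measure it.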

model

  • run a model or two (maybe three)
  • measure the effectiveness
    • true positives, true negatives, false positives, false negatives
    • accuracy
    • precision
    • recall
  • hyperparameter tuning (we're adjusting the settings of the ML algorithm that we choose ourselves, as opposed to the inputs)
    • weights on weighted averages
    • number of groups in k-means
    • think of these like tweaking a performance automobile
    • experiment with your fuel-air mixture
  • we need to test our model
  • ideally, we've got new data coming in
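
The effectiveness measures listed above can be sketched in plain python. The labels and predictions here are made up, with 1 meaning positive (say, spam) and 0 meaning negative.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    # Share of all predictions that were correct.
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    # Of everything we flagged as positive, how much really was positive?
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    # Of everything that really was positive, how much did we catch?
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical labels and model output.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(accuracy(tp, tn, fp, fn), precision(tp, fp), recall(tp, fn))
```

sklearn.metrics has ready-made versions of these; writing them out once makes the definitions easier to remember.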

present

  • building data products to share
  • write a paper or whitepaper
  • produce a talk or a handout for talk
  • bokeh
  • tableau
  • handouts

maintain

  • you may retrain your model
  • get new data
  • engineer your model to run on streaming inputs vs. a one-time dump of data
  • maintain your tool
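
To make the streaming vs. one-time dump distinction concrete, here is a toy sketch: the same mean computed once over a full batch and then kept up to date one value at a time. The RunningMean class and the numbers are hypothetical, and a real model update would be more involved.

```python
class RunningMean:
    """Keeps a mean up to date one observation at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental update: no need to store or revisit earlier values.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


# One-time dump: all the data is available at once.
batch = [4.0, 8.0, 6.0, 2.0]
batch_mean = sum(batch) / len(batch)

# Streaming: values arrive one by one.
stream = RunningMean()
for value in batch:
    stream.update(value)

print(batch_mean, stream.mean)
```

Both approaches agree on the answer; the streaming version just never needs the whole dataset in memory.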

Technologies and their place in the Data Science Workflow

Reminder: the workflow is plan, acquire, prepare, explore, model, present, maintain

Plan -> people think, discuss, and collaborate; project management tools like Trello

Acquire

  • Technologies: python, pandas, numpy, SQL, python libraries (web scraping, etc...)
  • Skills: programming, troubleshooting, keeping the big picture in mind, keeping the next step in mind
  • Tasks: getting raw data

Prepare

  • Technologies: serious python. lots and lots of pandas.
  • Skills: attention to detail, debug other people's data, debug our own code
  • Tasks: cleaning the data, this is the "data wrangling" or "data munging" part, feature engineering

Explore

  • Tasks: get a sense of the relationship of data's variables, visualization, statistical analysis
  • Skills: statistics, visualization
  • Technologies: matplotlib, seaborn, pandas, python, numpy, scipy, statistical packages
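
A quick sketch of getting a sense of how variables relate, using pandas summary statistics and correlations. The columns and values are invented for illustration.

```python
import pandas as pd

# Hypothetical prepared data.
df = pd.DataFrame(
    {
        "units": [1, 2, 3, 4, 5],
        "gross_revenue": [10.0, 19.5, 31.0, 39.0, 50.5],
        "discount": [5, 4, 3, 2, 1],
    }
)

# Summary statistics for every numeric column.
summary = df.describe()

# Pairwise Pearson correlations between the columns.
corr = df.corr()

print(summary)
print(corr)
```

From here, a seaborn pairplot or a matplotlib scatter plot of the strongest pairs is a natural next step.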

Model

  • Technologies: scipy, sklearn (ML algorithms), keras or tensorflow
  • Tasks: Evaluating the efficacy of a model, hyperparameter tuning
  • Skills: determining overfitting vs. underfitting
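
A toy sketch of hyperparameter tuning as a grid search, here over the weight in a weighted average of two models' predictions. The targets, predictions, and candidate weights are all hypothetical. The point is only the loop: try each setting, score it on held-out data, keep the best.

```python
def blend(weight, preds_a, preds_b):
    """Weighted average of two models' predictions."""
    return [weight * a + (1 - weight) * b for a, b in zip(preds_a, preds_b)]


def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical held-out targets and two models' predictions for them.
actual = [10.0, 20.0, 30.0, 40.0]
preds_a = [12.0, 22.0, 32.0, 42.0]  # model A overshoots by 2
preds_b = [8.0, 18.0, 28.0, 38.0]   # model B undershoots by 2

# The hyperparameter: how much weight model A gets in the blend.
candidate_weights = [0.0, 0.25, 0.5, 0.75, 1.0]
scores = {
    w: mean_squared_error(actual, blend(w, preds_a, preds_b))
    for w in candidate_weights
}
best_weight = min(scores, key=scores.get)

print(best_weight, scores[best_weight])
```

Scoring on data the models were not trained on matters here: a setting that only looks good on the training data is a sign of overfitting.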

Present

  • Tasks: get your point across quickly, be prepared to support your thesis, but start w/ the bottom line (your point)
  • Skills: public speaking, visual design and a sense for how people process information
  • Technologies: Tableau, Microsoft Word to make a handout