our trade/craft

gaining insights from data for actionable decisions
data science -> an applied, interdisciplinary science of using data to make decisions

practicing data science is a function

input
process
output

data science workflow

plan
acquire
prepare
explore
model
present
maintain your data products

inputs:

csv
sql
raw text
images
audio or video
various types of data in various types of formats
we have 2 main kinds of data
- labeled data:
  - we have our targets
  - we have our labels (some human interaction)
    - example: hitting the spam button adds a label to the data inside that email
      - sender, subject, recipient, attachments, language, wording
- unlabeled data:
  - raw text
  - images where a human has not added any input
sometimes our data is structured and sometimes not.
- either way, it requires work to prepare or integrate

acquire

sometimes, it's as easy as pd.read_csv()
sometimes, we've got to go collect the data ourselves
how do we sample a population to get a representative sample?
this is where you need to be good w/ sql
may have to talk to different data sources
overlap w/ other software dev skills
you may get data from a data engineer
most of the time, this is on you.
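
a minimal sketch of the two easy cases above (a csv and a sql source). everything here is made up for illustration: the csv text, the `sales` table, and the column names are assumptions, and `io.StringIO` plus an in-memory sqlite database stand in for a real file and a real database server.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV contents; in practice this would be a file path or URL.
csv_text = "item,units,price\nwidget,3,2.50\ngadget,1,10.00\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# Hypothetical in-memory database standing in for a real SQL source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, units INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("widget", 3, 2.50), ("gadget", 1, 10.00)],
)
conn.commit()

# The SQL query decides which rows and columns ever reach pandas.
query = "SELECT item, units, price FROM sales WHERE units > 1"
from_sql = pd.read_sql_query(query, conn)
conn.close()

print(from_csv)
print(from_sql)
```

filtering in the query rather than in pandas keeps the amount of data pulled across the connection small, which is one reason sql skill matters at this step.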

prepare

80% of the time, we're preparing the data
pandas, pandas, pandas
it's pandamonium.
derived values like gross revenue if you have units and price
feature engineering
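
a small sketch of the derived-value idea above, assuming a made-up table with `units` and `price` columns (the column names and numbers are illustrative only):

```python
import pandas as pd

# Hypothetical sales rows; units sold and price per unit.
df = pd.DataFrame({"units": [3, 1, 4], "price": [2.50, 10.00, 1.25]})

# Derived value: gross revenue is units multiplied by price, row by row.
df["gross_revenue"] = df["units"] * df["price"]

print(df)
```

the new column is a simple engineered feature: it carries information that neither `units` nor `price` carries alone.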

explore

statistics
probability
visualization
we explore the data to figure out what model is a good candidate
set aside 30% of our data for testing our model
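
one possible way to hold out 30% of the rows with plain pandas. the dataframe, its column names, and the `random_state` value are assumptions for illustration; `sklearn.model_selection.train_test_split` is a common alternative.

```python
import pandas as pd

# Hypothetical dataset with ten rows.
df = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})

# Randomly pick 30% of the rows for testing; a fixed seed keeps it repeatable.
test = df.sample(frac=0.3, random_state=42)

# Everything not selected for testing becomes the training set.
train = df.drop(test.index)

print(len(train), len(test))
```

the test rows stay untouched until the model is ready to be evaluated, so the score reflects data the model has not seen.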

model

run a model or two (maybe three)
measure the effectiveness
- true positives, true negatives, false positives, false negatives
- accuracy
- precision
- recall
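
a sketch of how the measures above fall out of the four counts, using made-up labels where 1 means positive (for example spam) and 0 means negative. the `actual` and `predicted` lists are illustrative, not real results.

```python
# Hypothetical true labels and model predictions (1 = positive, 0 = negative).
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(actual, predicted))

# Count each of the four outcomes.
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)  # share of all predictions that were right
precision = tp / (tp + fp)         # of predicted positives, how many were real
recall = tp / (tp + fn)            # of real positives, how many we caught

print(tp, tn, fp, fn)
print(accuracy, precision, recall)
```

accuracy alone can mislead when one class is rare, which is why precision and recall are listed alongside it.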
hyperparameter tuning (we're adjusting the settings of the ML algorithm that we choose ourselves, as opposed to the input data)
- weights on weighted averages
- number of groups in k-means
- think of these like tweaking a performance automobile
- experiment with your fuel-air mixture
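
a toy sketch of tuning one hyperparameter: the weight in a weighted average of two predictors. the targets, the two sets of predictions, and the grid of candidate weights are all made up for illustration; in real work the candidates would be scored on held-out validation data, not on the data used to fit the models.

```python
# Hypothetical targets and predictions from two models.
targets = [10.0, 20.0, 30.0]
preds_a = [12.0, 22.0, 32.0]  # model A overshoots by 2
preds_b = [6.0, 16.0, 26.0]   # model B undershoots by 4


def mean_squared_error(weight):
    """Score a blend that gives `weight` to model A and the rest to model B."""
    blended = [weight * a + (1 - weight) * b for a, b in zip(preds_a, preds_b)]
    return sum((t - y) ** 2 for t, y in zip(targets, blended)) / len(targets)


# Try weights 0.0, 0.1, ..., 1.0 and keep the one with the lowest error.
candidates = [i / 10 for i in range(11)]
best_weight = min(candidates, key=mean_squared_error)

print(best_weight, mean_squared_error(best_weight))
```

the same try-score-keep loop applies to other hyperparameters, such as the number of groups in k-means, with a different scoring function.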
we need to test our model
ideally, we've got new data coming in

present

building data products to share
write a paper or whitepaper
produce a talk or a handout for talk
bokeh
tableau
handouts

Maintain

you may retrain your model
get new data
engineer your model to run on streaming inputs vs. a one time dump of data
maintain your tool

Technologies in their place of the Data Science Workflow

Reminder: Workflow is plan, acquire, prepare, explore, model, present, maintain

Plan -> people think, discuss, and collaborate; project management tools like Trello

Acquire

Technologies: python, pandas, numpy, SQL, python libraries (web scraping, etc...)
Skills: programming, troubleshooting, keeping the big picture in mind, keeping the next step in mind
Tasks: getting raw data

Prepare

Technologies: serious python. lots and lots of pandas.
Skills: attention to detail, debug other people's data, debug our own code
Tasks: cleaning the data, this is the "data wrangling" or "data munging" part, feature engineering

Explore

Tasks: get a sense of the relationship of data's variables, visualization, statistical analysis
Skills: statistics, visualization
Technologies: matplotlib, seaborn, pandas, python, numpy, scipy, statistical packages

Model

Technologies: scipy, sklearn (ML algorithms), keras or tensorflow
Tasks: Evaluating the efficacy of a model, hyperparameter tuning
Skills: determining overfitting vs. underfitting

Present

Tasks: get your point across quickly, be prepared to support your thesis, but start w/ the bottom line (your point)
Skills: public speaking, visual design and a sense for how people process information
Technologies: Tableau, Microsoft Word to make a handout
