data science -> an applied, interdisciplinary science of gaining insights from data to make actionable decisions
- input
- process
- output
- plan
- acquire
- prepare
- explore
- model
- present
- maintain your data products
- csv
- sql
- raw text
- images
- audio or video
- various types of data in various types of formats
- we have 2 main kinds of data
- labeled data:
- we have our targets
- we have our labels (some human interaction)
- example: hitting the spam button adds a label to the data inside that email
- sender, subject, recipient, attachments, language, wording
- unlabeled data:
- raw text
- images where a human has not added any input
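A minimal sketch of the labeled vs. unlabeled split, using the spam example. The emails and the `is_spam` field name are made up for illustration; `None` stands for "no human has clicked anything yet".

```python
# Hypothetical emails; field names and values are illustrative only.
emails = [
    {"sender": "deals@example.com", "subject": "WIN A PRIZE NOW", "is_spam": True},
    {"sender": "boss@example.com", "subject": "Meeting at 3pm", "is_spam": False},
    {"sender": "unknown@example.com", "subject": "Hello", "is_spam": None},
]

# Labeled rows carry a target; unlabeled rows do not.
labeled = [e for e in emails if e["is_spam"] is not None]
unlabeled = [e for e in emails if e["is_spam"] is None]

# Separate the inputs (features) from the label (target) for the labeled rows.
features = [{k: v for k, v in e.items() if k != "is_spam"} for e in labeled]
targets = [e["is_spam"] for e in labeled]
```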
- sometimes our data is structured and sometimes not.
- either way, it requires work to prepare or integrate
- sometimes, it's as easy as pd.read_csv()
- sometimes, we've got to go collect the data ourselves
- how do we sample a population to have a representative sample
- this is where you need to be good w/ sql
- may have to talk to different data sources
- overlap w/ other software dev skills
- you may get data from a data engineer
- most of the time, this is on you.
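A rough sketch of the two acquisition cases above: the easy `pd.read_csv()` case and the SQL case. The table names, column names, and values are invented for illustration. `io.StringIO` and an in-memory SQLite database stand in for a real file and a real data source.

```python
import io
import sqlite3

import pandas as pd

# Easy case: the data already lives in a CSV.
# io.StringIO stands in for a real file path here.
csv_text = "order_id,units,price\n1,3,2.50\n2,1,10.00\n"
orders = pd.read_csv(io.StringIO(csv_text))

# SQL case: pull only the rows we need with a query.
# An in-memory SQLite database stands in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "north"), (2, "south"), (3, "north")],
)
conn.commit()
north = pd.read_sql_query(
    "SELECT customer_id, region FROM customers "
    "WHERE region = 'north' ORDER BY customer_id",
    conn,
)
conn.close()
```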
- 80% of the time, we're preparing the data
- pandas, pandas, pandas
- it's pandamonium.
- derived values like gross revenue if you have units and price
- feature engineering
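A small sketch of what preparing can look like in pandas, including the gross revenue example. The `raw` data and its column names are hypothetical; real cleanup usually takes many more steps than this.

```python
import pandas as pd

# Hypothetical raw sales data with typical problems:
# a missing value, and numbers stored as text.
raw = pd.DataFrame(
    {
        "product": ["widget", "gadget", "gizmo"],
        "units": ["3", "5", None],
        "price": [2.50, 10.00, 4.00],
    }
)

# Drop rows we cannot use, then fix the column type.
clean = raw.dropna(subset=["units"]).copy()
clean["units"] = clean["units"].astype(int)

# Derived value (feature engineering): gross revenue from units and price.
clean["gross_revenue"] = clean["units"] * clean["price"]
```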
- statistics
- probability
- visualization
- we explore the data to figure out what model is a good candidate
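A quick sketch of a first exploration pass: summary statistics and correlations. The `ad_spend` and `sales` columns are made-up numbers. A scatter plot (matplotlib or seaborn) would be the usual next step and is left out here.

```python
import pandas as pd

# Hypothetical data: does ad spend move with sales?
df = pd.DataFrame(
    {
        "ad_spend": [1, 2, 3, 4, 5],
        "sales": [12, 15, 21, 24, 30],
    }
)

# Count, mean, std, min, quartiles, and max for each numeric column.
summary = df.describe()

# Pairwise Pearson correlations between the numeric columns.
correlations = df.corr()
```

A strong linear relationship like this one hints that a simple linear model is a reasonable first candidate.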
- set aside 30% of our data for testing our model
- run a model or two (maybe three)
- measure the effectiveness
- true positives, true negatives, false positives, false negatives
- accuracy
- precision
- recall
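A plain-Python sketch of the hold-out split and the metrics above. The rows, labels, and predictions are invented, and 1 means "spam". In practice sklearn has helpers for both steps; this only shows the arithmetic.

```python
import random

# Hypothetical labeled rows: (feature, label). Illustrative only.
rows = [(i, int(i % 3 == 0)) for i in range(20)]

# Hold out 30% of the rows for testing; a fixed seed keeps the split repeatable.
rng = random.Random(42)
shuffled = rows[:]
rng.shuffle(shuffled)
n_test = round(len(shuffled) * 0.30)
test, train = shuffled[:n_test], shuffled[n_test:]

# Hypothetical true labels and model predictions for ten test emails.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Count the four outcomes of a binary classifier.
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)  # share of all predictions that were right
precision = tp / (tp + fp)         # of the predicted positives, how many were real
recall = tp / (tp + fn)            # of the real positives, how many we caught
```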
- hyperparameter tuning (we're adjusting the settings of the ML algorithm itself, not just its inputs)
- weights on weighted averages
- number of groups in k-means
- think of these like tweaking a performance automobile
- experiment with your fuel-air mixture
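A toy sketch of tuning one hyperparameter: the weight in a weighted average of two models' predictions. All numbers are made up. The weight is something we choose by trying candidates and measuring error; it is not one of the inputs. In practice this is usually done on a validation set rather than the final test set.

```python
# Hypothetical predictions from two models, plus the true values.
pred_a = [10.0, 20.0, 30.0, 40.0]
pred_b = [14.0, 24.0, 34.0, 44.0]
actual = [11.0, 21.0, 31.0, 41.0]


def mean_squared_error(w):
    """Error of the blend w * pred_a + (1 - w) * pred_b against actual."""
    blended = [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]
    return sum((p - y) ** 2 for p, y in zip(blended, actual)) / len(actual)


# Simple grid search: try each candidate weight and keep the best one.
candidate_weights = [i / 4 for i in range(5)]  # 0.0, 0.25, 0.5, 0.75, 1.0
best_weight = min(candidate_weights, key=mean_squared_error)
```

The same loop works for the number of groups in k-means: try each candidate value, score it, keep the best.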
- we need to test our model
- ideally, we've got new data coming in
- building data products to share
- write a paper or whitepaper
- produce a talk or a handout for talk
- bokeh
- tableau
- handouts
- you may retrain your model
- get new data
- engineer your model to run on streaming inputs vs. a one time dump of data
- maintain your tool
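A tiny sketch of the streaming vs. one-time-dump idea. A running mean stands in for a "model" here, which is an assumption made only to keep the example short. The point is that the streaming version updates one observation at a time and never needs the full dataset in memory.

```python
class RunningMean:
    """Keeps a mean up to date as values arrive one at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental update: nudge the mean toward the new value.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


# One-time dump: all the data is available at once.
batch = [4.0, 8.0, 6.0]
batch_mean = sum(batch) / len(batch)

# Streaming: the same values arrive one by one.
stream = RunningMean()
history = [stream.update(v) for v in batch]
```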
Reminder: Workflow is plan, acquire, prepare, explore, model, present
Plan -> people think, discuss, and collaborate; project management tools like Trello
Acquire ->
- Technologies: python, pandas, numpy, SQL, python libraries (web scraping, etc.)
- Skills: programming, troubleshooting, keeping the big picture in mind, keeping the next step in mind
- Tasks: getting raw data
Prepare ->
- Technologies: serious python. lots and lots of pandas.
- Skills: attention to detail, debug other people's data, debug our own code
- Tasks: cleaning the data, this is the "data wrangling" or "data munging" part, feature engineering
Explore ->
- Tasks: get a sense of the relationships among the data's variables, visualization, statistical analysis
- Skills: statistics, visualization
- Technologies: matplotlib, seaborn, pandas, python, numpy, scipy, statistical packages
Model ->
- Technologies: scipy, sklearn (ML algorithms), keras or tensorflow
- Tasks: evaluating the efficacy of a model, hyperparameter tuning
- Skills: determining overfitting vs. underfitting
Present ->
- Tasks: get your point across quickly, be prepared to support your thesis, but start w/ the bottom line (your point)
- Skills: public speaking, visual design and a sense for how people process information
- Technologies: Tableau, Microsoft Word to make a handout