Brian Spiering's Data Science Principles

- All the data, all the time.
- Data is its own best model.
- End-to-end solution first. Then iterate.
- Build a simple working system, then add complexity (if needed).
- Research best practices before inventing.
- Choose for features; stay for community.
- Point estimates are always wrong.
- Base rates matter. Absolute values matter.
- Data Science is an applied field. Research and instructional solutions have only secondary value.
- You have assumptions. It is better if they are explicit.
- Random is a good baseline. Sometimes random is shockingly hard to beat.
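A minimal sketch of that point, assuming binary labels (the `random_baseline_accuracy` helper is hypothetical): with a 90/10 class imbalance, uniform guessing scores about 0.50 and always guessing the majority class scores 0.90, so a model has to clear those bars before it earns any complexity.

```python
import random

def random_baseline_accuracy(labels, trials=10_000, seed=0):
    """Accuracy of uniform random guessing against the observed labels."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    hits = sum(rng.choice(classes) == rng.choice(labels) for _ in range(trials))
    return hits / trials

labels = [0] * 90 + [1] * 10                                # 90/10 class imbalance
print(random_baseline_accuracy(labels))                     # ~0.50: uniform random guessing
print(max(labels.count(c) for c in {0, 1}) / len(labels))   # 0.90: majority-class baseline
```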
- Assume data is streaming. There will always be more data tomorrow.
- Make only a single pass over the data.
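The previous two bullets combine naturally in single-pass (online) algorithms. A minimal sketch using Welford's algorithm for a running mean and variance; the stream is consumed exactly once and never stored:

```python
def running_mean_variance(stream):
    """Single-pass mean and sample variance via Welford's algorithm."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else 0.0
    return mean, variance

print(running_mean_variance(iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])))
# (5.0, 4.571...) — computed without ever holding the full stream in memory
```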
- Real-time means different time scales to different people. The business probably doesn't need (and thus data science shouldn't support) "real-time" analytics.
- An approximate answer right now is often better than an exact answer in the distant future (or never).
- Keep the data as raw as possible. Create views for specific use cases.
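A sketch of that principle using SQLite from Python's standard library (the table, view, and column names are made up for illustration): the raw events stay untouched, and each use case reads through its own view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Keep the raw events exactly as they arrived.
conn.execute("CREATE TABLE events_raw (user_id INTEGER, event TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO events_raw VALUES (?, ?, ?)",
    [(1, "click", "2019-06-01"), (1, "purchase", "2019-06-02"), (2, "click", "2019-06-02")],
)
# Each use case gets its own view over the raw data.
conn.execute("""
    CREATE VIEW daily_purchases AS
    SELECT ts, COUNT(*) AS n_purchases
    FROM events_raw
    WHERE event = 'purchase'
    GROUP BY ts
""")
print(conn.execute("SELECT * FROM daily_purchases").fetchall())  # [('2019-06-02', 1)]
```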
- Avoid lossy data compression (for example, frequentist linear regression).
- Metadata is as important as data.
- Pareto efficiency for data science techniques:
    - Hash maps and friends (sets, Counters, Bloom filters, …)
    - Bayes' theorem
    - A relational database
    - Epsilon-greedy algorithm (sketched below)
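A minimal sketch of that last item, epsilon-greedy on a toy three-armed bandit (the payout probabilities are invented for illustration): with probability epsilon, pull a random arm to explore; otherwise exploit the arm with the best running estimate.

```python
import random

def epsilon_greedy(arm_probs, epsilon=0.1, pulls=10_000, seed=0):
    """Epsilon-greedy bandit: explore with probability epsilon, else exploit."""
    rng = random.Random(seed)
    counts = [0] * len(arm_probs)
    values = [0.0] * len(arm_probs)   # running mean reward per arm
    for _ in range(pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(len(arm_probs))   # explore a random arm
        else:
            arm = values.index(max(values))       # exploit the best arm so far
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean
    return counts, values

counts, values = epsilon_greedy([0.2, 0.5, 0.8])
print(counts)   # most pulls should concentrate on the 0.8 arm
print(values)   # estimated payout per arm
```

Note that the reward estimate uses the same single-pass incremental-mean trick as the streaming sketch above.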
- Use Python's built-in data types (and their methods) and functions as much as possible.
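For instance, word counting needs nothing beyond built-in types and functions:

```python
# Word counts with nothing but built-in types and functions.
words = "the quick brown fox jumps over the lazy dog the end".split()

counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(top[:2])            # [('the', 3), ('quick', 1)]
print(len(set(words)))    # 9 distinct words
```

When this pattern recurs, `collections.Counter(words)` from the standard library (one of the "hash maps and friends" above) does it in one line.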
- Every function should have tests.
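A minimal sketch of that habit (both the helper and the tests are hypothetical; run with `pytest`):

```python
def base_rate(labels, positive=1):
    """Fraction of labels equal to the positive class."""
    return labels.count(positive) / len(labels)

def test_base_rate_balanced():
    assert base_rate([0, 1, 0, 1]) == 0.5

def test_base_rate_all_negative():
    assert base_rate([0, 0, 0]) == 0.0
```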
- Benchmarking and profiling trump complexity analysis.
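A minimal sketch with `timeit` from the standard library: complexity analysis says set membership beats list membership, but a measurement tells you whether it matters at your data size.

```python
import timeit

setup = "data_list = list(range(10_000)); data_set = set(data_list)"

# Same 10,000 items; only the data structure differs.
list_time = timeit.timeit("9_999 in data_list", setup=setup, number=1_000)
set_time = timeit.timeit("9_999 in data_set", setup=setup, number=1_000)

print(f"list membership: {list_time:.4f}s")
print(f"set membership:  {set_time:.4f}s")   # hash lookup: orders of magnitude faster
```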