Brian Spiering's Data Science Principles

- All the data, all the time.
- Data is its own best model.
- End-to-end solution first. Then iterate.
- Build a simple working system, then add complexity (if needed).
- Research best practices before inventing.
- Choose for features; stay for community.
- Point estimates are always wrong.
- Base rates matter. Absolute values matter.
- Data Science is an applied field. Research and instructional solutions have only secondary value.
- You have assumptions. It is better if they are explicit.
- Random is a good baseline. Sometimes random is shockingly hard to beat.
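A minimal sketch of that point, assuming binary labels (the `random_baseline_accuracy` helper is hypothetical): with a 90/10 class imbalance, uniform guessing scores about 0.50 and always guessing the majority class scores 0.90, so a model has to clear those bars before it earns any complexity.

```python
import random

def random_baseline_accuracy(labels, trials=10_000, seed=0):
    """Accuracy of uniform random guessing against the observed labels."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    hits = sum(rng.choice(classes) == rng.choice(labels) for _ in range(trials))
    return hits / trials

labels = [0] * 90 + [1] * 10                                # 90/10 class imbalance
print(random_baseline_accuracy(labels))                     # ~0.50: uniform random guessing
print(max(labels.count(c) for c in {0, 1}) / len(labels))   # 0.90: majority-class baseline
```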
- Assume data is streaming. There will always be more data tomorrow.
- Make only a single pass over the data.
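The previous two bullets combine naturally in single-pass (online) algorithms. A minimal sketch using Welford's algorithm for a running mean and variance; the stream is consumed exactly once and never stored:

```python
def running_mean_variance(stream):
    """Single-pass mean and sample variance via Welford's algorithm."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else 0.0
    return mean, variance

print(running_mean_variance(iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])))
# (5.0, 4.571...) — computed without ever holding the full stream in memory
```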
- Real-time means different time scales to different people. The business probably doesn't need (and thus data science shouldn't support) "real-time" analytics.
- An approximate answer right now is often better than an exact answer in the distant future (or never).
- Keep the data as raw as possible. Create views for specific use cases.
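A sketch of that principle using SQLite from Python's standard library (the table, view, and column names are made up for illustration): the raw events stay untouched, and each use case reads through its own view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Keep the raw events exactly as they arrived.
conn.execute("CREATE TABLE events_raw (user_id INTEGER, event TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO events_raw VALUES (?, ?, ?)",
    [(1, "click", "2019-06-01"), (1, "purchase", "2019-06-02"), (2, "click", "2019-06-02")],
)
# Each use case gets its own view over the raw data.
conn.execute("""
    CREATE VIEW daily_purchases AS
    SELECT ts, COUNT(*) AS n_purchases
    FROM events_raw
    WHERE event = 'purchase'
    GROUP BY ts
""")
print(conn.execute("SELECT * FROM daily_purchases").fetchall())  # [('2019-06-02', 1)]
```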
- Avoid lossy data compression (for example, frequentist linear regression).
- Metadata is as important as data.
- Pareto efficiency for data science techniques:
    - Hash maps and friends (sets, Counters, Bloom filters, …)
    - Bayes' theorem
    - A relational database
    - Epsilon-greedy algorithm (sketched below)
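A minimal sketch of that last item, epsilon-greedy on a toy three-armed bandit (the payout probabilities are invented for illustration): with probability epsilon, pull a random arm to explore; otherwise exploit the arm with the best running estimate.

```python
import random

def epsilon_greedy(arm_probs, epsilon=0.1, pulls=10_000, seed=0):
    """Epsilon-greedy bandit: explore with probability epsilon, else exploit."""
    rng = random.Random(seed)
    counts = [0] * len(arm_probs)
    values = [0.0] * len(arm_probs)   # running mean reward per arm
    for _ in range(pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(len(arm_probs))   # explore a random arm
        else:
            arm = values.index(max(values))       # exploit the best arm so far
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean
    return counts, values

counts, values = epsilon_greedy([0.2, 0.5, 0.8])
print(counts)   # most pulls should concentrate on the 0.8 arm
print(values)   # estimated payout per arm
```

Note that the reward estimate uses the same single-pass incremental-mean trick as the streaming sketch above.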
- Use Python's built-in data types (and their methods) and functions as much as possible.
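For instance, word counting needs nothing beyond built-in types and functions:

```python
# Word counts with nothing but built-in types and functions.
words = "the quick brown fox jumps over the lazy dog the end".split()

counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(top[:2])            # [('the', 3), ('quick', 1)]
print(len(set(words)))    # 9 distinct words
```

When this pattern recurs, `collections.Counter(words)` from the standard library (one of the "hash maps and friends" above) does it in one line.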
- Every function should have tests.
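A minimal sketch of that habit (both the helper and the tests are hypothetical; run with `pytest`):

```python
def base_rate(labels, positive=1):
    """Fraction of labels equal to the positive class."""
    return labels.count(positive) / len(labels)

def test_base_rate_balanced():
    assert base_rate([0, 1, 0, 1]) == 0.5

def test_base_rate_all_negative():
    assert base_rate([0, 0, 0]) == 0.0
```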
- Benchmarking and profiling trump complexity analysis.
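A minimal sketch with `timeit` from the standard library: complexity analysis says set membership beats list membership, but a measurement tells you whether it matters at your data size.

```python
import timeit

setup = "data_list = list(range(10_000)); data_set = set(data_list)"

# Same 10,000 items; only the data structure differs.
list_time = timeit.timeit("9_999 in data_list", setup=setup, number=1_000)
set_time = timeit.timeit("9_999 in data_set", setup=setup, number=1_000)

print(f"list membership: {list_time:.4f}s")
print(f"set membership:  {set_time:.4f}s")   # hash lookup: orders of magnitude faster
```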