Notes on Software Engineering skills for Data Scientists

SWE for DS

Tests
- Basic how to with pytest
  - Directory and naming structure
  - Run from CLI
  - How to run specific tests
  - Test coverage
  - Gotchas around package structure and pycache
- Some intro test principles
  - Test one thing
  - Descriptive test names
  - Test what not how
  - Unit/integration/regression/smoke
- Fixtures
- Mocking
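
A minimal sketch of the naming, "test one thing", and mocking points above. `apply_discount` and `rate_source` are hypothetical names invented for illustration; `Mock` comes from the standard library's `unittest.mock`, and pytest's default discovery picks up files named `test_*.py` and functions named `test_*`.

```python
# test_pricing.py: pytest collects files named test_*.py and functions named test_*
from unittest.mock import Mock


def apply_discount(price, rate_source):
    """Return the price after applying the rate supplied by rate_source."""
    rate = rate_source.get_rate()
    return round(price * (1 - rate), 2)


def test_apply_discount_reduces_price_by_rate():
    # The Mock stands in for a slow or external dependency (e.g. a database lookup).
    rate_source = Mock()
    rate_source.get_rate.return_value = 0.25

    # Assert on the result (the "what"), not on how it was computed.
    assert apply_discount(100.0, rate_source) == 75.0


def test_apply_discount_with_zero_rate_leaves_price_unchanged():
    rate_source = Mock()
    rate_source.get_rate.return_value = 0.0

    assert apply_discount(80.0, rate_source) == 80.0
```

From the CLI, `pytest test_pricing.py` runs the whole file, `pytest test_pricing.py::test_apply_discount_reduces_price_by_rate` runs one test, and `pytest -k zero_rate` selects tests by name substring.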
Packages
- PYTHONPATH
- Imports
- Modules, submodules
- Entry Points
Notebook -> Functions in NB -> Script -> Package
- Copying cells? Make a function
- Re-running notebook to just produce an output? Make a script
- Copying notebooks to produce different reports? Make a package
- Use cookiecutter to quickly scaffold a package.
Configuration, CLI, and global vars
- Don’t edit the code and rerun it just to change a parameter.
- Switch to CLI / config early.
- Use config to fix things in stone
- Probably avoid global vars, except for environments
- CLI works too. Show typer.
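
A rough sketch of moving parameters out of the code and onto the command line. It uses the standard library's `argparse` as a stand-in for typer, which covers the same ground with type hints. The flag names, defaults, and the "churn report" are assumptions made up for this example.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Run the (hypothetical) churn report.")
    parser.add_argument(
        "--end-date", default="2024-01-01", help="Upper bound for all queries."
    )
    parser.add_argument(
        "--min-orders", type=int, default=5, help="Minimum orders per customer."
    )
    return parser


def main(argv=None):
    # argv=None makes argparse read sys.argv; passing a list makes main() easy to test.
    args = build_parser().parse_args(argv)
    # A real script would call the analysis here; this sketch only echoes the settings.
    settings = {"end_date": args.end_date, "min_orders": args.min_orders}
    print(settings)
    return settings
```

In a script this would typically end with a call to `main()` under an `if __name__ == "__main__":` guard, so that changing a parameter means changing a flag rather than editing the file.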
Entry Points (put this in packages?)
Environments
- Make envs look as close to each other as possible.
- One way to do this: set up the env at the very start of the script.
- Use env var as a simple way to control this.
- Preprod for ML?
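
One possible shape for the env-var approach above. The variable name `APP_ENV`, the settings keys, and the URLs are all assumptions for illustration; the point is that the environment is chosen in one place and the rest of the code stays identical.

```python
import os

# Hypothetical per-environment settings; only these values differ between envs.
SETTINGS = {
    "dev": {"db_url": "sqlite:///dev.db", "log_level": "DEBUG"},
    "prod": {"db_url": "sqlite:///prod.db", "log_level": "INFO"},
}


def load_settings(environ=os.environ):
    """Pick settings from the APP_ENV environment variable (defaults to dev)."""
    env = environ.get("APP_ENV", "dev")
    if env not in SETTINGS:
        raise ValueError(f"Unknown APP_ENV {env!r}; expected one of {sorted(SETTINGS)}")
    return SETTINGS[env]
```

Running `APP_ENV=prod python script.py` would then switch environments without touching the code.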
Logging
- Just do it instead of print.
- Logging needs to be configured at the beginning, so we need a concept of entry points.
- How to: handler, formatter, loggers. Use __name__.
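
A small sketch of the handler / formatter / logger pieces using the standard `logging` module. `configure_logging` and `load_rows` are hypothetical names; the format string is just one reasonable choice.

```python
import logging
import sys

# Module-level logger named after the module, so log lines show where they came from.
logger = logging.getLogger(__name__)


def configure_logging(level=logging.INFO, stream=sys.stderr):
    """Call once from the entry point, before any real work happens."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    root = logging.getLogger()
    root.setLevel(level)
    root.addHandler(handler)


def load_rows():
    rows = [1, 2, 3]  # placeholder for a real query
    logger.info("Loaded %d rows", len(rows))
    return rows
```

Library modules only call `logging.getLogger(__name__)`; the handler and formatter are attached once, at the entry point, and records from every module propagate up to the root logger.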
Freezing analyses
- All queries have an end date.
- Gets tricky sometimes for mutable tables.
- Have a global var at beginning to control this.
- Discuss cache?
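
A sketch of the "every query has an end date" idea, using an in-memory SQLite table so it runs anywhere. The `orders` table, its columns, and the dates are made up for the example.

```python
import sqlite3

# Single place that freezes the analysis; every query below uses it.
END_DATE = "2024-01-01"


def fetch_orders(conn, end_date=END_DATE):
    """Return orders created strictly before end_date."""
    query = "SELECT id, created_at FROM orders WHERE created_at < ? ORDER BY id"
    return conn.execute(query, (end_date,)).fetchall()


def make_demo_db():
    """In-memory table with hypothetical data, standing in for a real warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, created_at TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, "2023-12-30"), (2, "2023-12-31"), (3, "2024-01-02")],
    )
    return conn
```

This only freezes append-only data. If rows created before the end date can still be updated later, the filter alone will not reproduce the same result.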
Reproducible analyses (because they’re never over).
- NB runs end to end.
- Do queries in the notebook. Don’t use results from a query in a separate place.
- If necessary, cache query results (or use bionic!).
- Freeze analyses, have end dates.
- Aggressive refactoring. NB->functions->package. Just fucking do it.
- Use poetry or Docker for dependency management.
- Use a virtual env.
- Just fucking do it.
Serialization?
- Why can’t I unpickle this thing?
Package versioning?
Train / Predict parity
- Use the same functions for everything. Look to scikit-learn. The more sklearn-y, the better.
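
To illustrate the parity point without depending on scikit-learn itself, here is a toy transformer that only mimics its `fit` / `transform` convention. `MeanImputer` is a hypothetical class written for this sketch, not scikit-learn's imputer.

```python
class MeanImputer:
    """Tiny scikit-learn-style transformer: learn in fit, reuse in transform."""

    def fit(self, values):
        observed = [v for v in values if v is not None]
        self.mean_ = sum(observed) / len(observed)
        return self

    def transform(self, values):
        # The same code path fills missing values at train and predict time.
        return [self.mean_ if v is None else v for v in values]


imputer = MeanImputer().fit([1.0, None, 3.0])        # training time
train_features = imputer.transform([1.0, None, 3.0])
predict_features = imputer.transform([None, 10.0])   # prediction time, same object
```

Because the learned state lives on the fitted object, prediction cannot drift away from whatever preprocessing was applied during training.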
Bubbling Exceptions, Failing Gracefully
CI/CD
Automate non-logic (e.g. use black for formatting, CI/CD for running tests)
Use dataclasses/classes instead of dictionaries
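
A quick sketch of why a dataclass can beat a dictionary for passing settings around. `ModelConfig` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    learning_rate: float
    n_estimators: int = 100


config = ModelConfig(learning_rate=0.1)

# A misspelled field fails immediately with a TypeError,
# whereas a dict would silently accept the wrong key:
#   ModelConfig(learnin_rate=0.1)
```

The fields are documented in one place, defaults are explicit, and `frozen=True` prevents the config from being mutated halfway through a run.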
When to use classes vs. functions
Also a quick intro to classes. They’re scary for beginners!
Everything is a pipeline. Layers to the pipeline ("pipeline"->"transform"->"transformations"). Look to scikit-learn for all this stuff.
Flask / APIs
- How to load things into memory
- Background tasks (e.g. to reload a model)
- Local server -> WSGI -> nginx. Deployment in general.
- Serializing numpy to JSON.
Debugging. I ought to learn!
I should add side-comments whenever something is used for the first time (e.g. list comprehensions, decorators).
How to read other people’s code. How to read code that produces data in the database. How data gets into the database!

EthanRosenthal/swe4ds.md