SWE for DS
- Tests
- Basic how to with pytest
- Directory and naming structure
- Run from CLI
- How to run specific tests
- Test coverage
- Gotchas around package structure and pycache
- Some intro test principles
- Test one thing
- Descriptive test names
- Test what, not how
- Unit/integration/regression/smoke
- Fixtures
- Mocking
- Basic how to with pytest
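
A minimal sketch of the pytest basics listed above. pytest collects files named `test_*.py` and functions named `test_*`, and a plain `assert` is the check. The function `normalize` and the file name `test_normalize.py` are hypothetical, chosen only for illustration.

```python
# test_normalize.py  (hypothetical file name; pytest collects files matching test_*.py)


def normalize(values):
    """Scale a list of non-negative numbers so they sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize values that sum to zero")
    return [v / total for v in values]


# Each test checks one behaviour, and its name says which one.
def test_normalize_output_sums_to_one():
    assert abs(sum(normalize([1, 2, 3])) - 1.0) < 1e-9


def test_normalize_preserves_proportions():
    assert normalize([1, 3]) == [0.25, 0.75]


def test_normalize_rejects_all_zero_input():
    # Written without importing pytest so the file runs anywhere;
    # with pytest available, `with pytest.raises(ValueError):` is the usual form.
    try:
        normalize([0, 0])
    except ValueError:
        pass  # expected
    else:
        raise AssertionError("expected ValueError")
```

Running `pytest` from the project root runs everything it collects. `pytest test_normalize.py::test_normalize_preserves_proportions` runs one test, and `pytest -k proportions` runs every test whose name matches the keyword.
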
- Packages
- PYTHONPATH
- Imports
- Modules, submodules
- Entry Points
- Notebook -> Functions in NB -> Script -> Package
- Copying cells? Make a function
- Re-running notebook to just produce an output? Make a script
- Copying notebooks to produce different reports? Make a package
- Use cookiecutter to quickly scaffold a package.
- Configuration, CLI, and global vars
- Don’t modify the code then rerun it.
- Switch to CLI / config early.
- Use config to fix things in stone
- Probably avoid global vars, except for environments
- CLI works too. Show typer.
- Entry Points (put this in packages?)
- Environments
- Make envs look as close to each other as possible.
- One way to do this: set up the env at the very start of the script.
- Use env var as a simple way to control this.
- Preprod for ML?
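
One possible shape for the env-var idea above, as a sketch only. The variable name `APP_ENV`, the three environment names, and every setting shown are assumptions made for illustration. The point is that each environment has the same keys and differs only in values, and that the choice is resolved once at startup.

```python
import os

# APP_ENV is a hypothetical variable name; "dev" is an assumed default.
VALID_ENVS = ("dev", "preprod", "prod")

# Hypothetical per-environment settings: same keys everywhere, only values differ.
SETTINGS = {
    "dev": {"db_schema": "analytics_dev", "write_outputs": False},
    "preprod": {"db_schema": "analytics_preprod", "write_outputs": True},
    "prod": {"db_schema": "analytics", "write_outputs": True},
}


def load_settings(environ=os.environ):
    """Resolve the environment once, at the start of the script."""
    env = environ.get("APP_ENV", "dev")
    if env not in VALID_ENVS:
        raise ValueError(f"APP_ENV must be one of {VALID_ENVS}, got {env!r}")
    return {"env": env, **SETTINGS[env]}
```

Passing `environ` as an argument keeps the function easy to test without touching the real process environment.
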
- Logging
- Just do it instead of print.
- Logger needs to be instantiated at the beginning, so we need a concept of entry points.
- How to: handler, formatter, loggers. Use __name__.
- Freezing analyses
- All queries have an end date.
- Gets tricky sometimes for mutable tables.
- Have a global var at beginning to control this.
- Discuss cache?
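
A sketch of the "global var at the beginning" idea for freezing an analysis. The date, the table `orders`, and its columns are made up for illustration. The `%(end_date)s` placeholder follows the pyformat style some database drivers use, and other drivers expect a different placeholder syntax.

```python
from datetime import date

# Set once at the top of the analysis; every query reads it.
# The date itself and the table/column names below are made up for illustration.
ANALYSIS_END_DATE = date(2024, 1, 1)


def orders_query(end_date=ANALYSIS_END_DATE):
    """Return SQL plus parameters with an explicit, exclusive upper bound on time."""
    sql = (
        "SELECT order_id, amount, created_at "
        "FROM orders "
        "WHERE created_at < %(end_date)s"
    )
    return sql, {"end_date": end_date.isoformat()}
```

An end date on a creation timestamp does not freeze rows that are updated in place afterwards, which is the mutable-table case flagged above.
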
- Reproducible analyses (because they’re never over).
- NB runs end to end.
- Do queries in the notebook. Don’t use results from a query in a separate place.
- If necessary, cache query results (or use bionic!).
- Freeze analyses, have end dates.
- Aggressive refactoring. NB->functions->package. Just fucking do it.
- Use poetry or Docker for dependency management.
- Use a virtual env.
- Just fucking do it.
- Serialization?
- Why can’t I unpickle this thing?
- Package versioning?
- Train / Predict parity
- Use the same functions for everything. Look to scikit-learn. The more sklearn-y, the better.
- Bubbling Exceptions, Failing Gracefully
- CI/CD
- Automate non-logic (e.g. use black for formatting, CI/CD for running tests)
- Use dataclasses/classes instead of dictionaries
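
A small illustration of what a dataclass buys over a dictionary. The class `TrainConfig` and its fields are hypothetical. The fields are declared in one place, defaults are explicit, and a misspelled field name raises an error instead of silently creating a new key.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainConfig:
    # Hypothetical fields, for illustration only.
    learning_rate: float
    n_estimators: int = 100
    target_column: str = "label"


config = TrainConfig(learning_rate=0.1)

# A typo in a field name fails immediately instead of silently creating a new key.
try:
    TrainConfig(learning_rate=0.1, n_estimator=50)
except TypeError as exc:
    print(f"rejected: {exc}")

# Still easy to turn back into a dict for logging or serialization.
print(asdict(config))
```

`frozen=True` also blocks attribute assignment after construction, so a config cannot be changed halfway through a run.
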
- When to use classes vs. functions
- Also a quick intro to classes. They’re scary for beginners!
- Everything is a pipeline. Layers to the pipeline ("pipeline"->"transform"->"transformations"). Look to scikit-learn for all this stuff.
- Flask / APIs
- How to load things into memory
- Background tasks (e.g. to reload a model)
- Local server -> WSGI -> nginx. Deployment in general.
- Serializing numpy to JSON.
- Debugging. I ought to learn!
- I should add lots of side-comments each time a new thing that wasn’t seen before is used (e.g. list comprehensions, decorators, etc…).
- How to read other people’s code. How to read code that produces data in the database. How data gets into the database!