@EthanRosenthal
Created April 19, 2021 16:43
Notes on Software Engineering skills for Data Scientists

SWE for DS

  • Tests
    • Basic how-to with pytest
      • Directory and naming structure
      • Run from CLI
      • How to run specific tests
      • Test coverage
      • Gotchas around package structure and __pycache__
    • Some intro test principles
      • Test one thing
      • Descriptive test names
      • Test what not how
      • Unit/integration/regression/smoke
    • Fixtures
    • Mocking
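A minimal sketch of the test points above, assuming a hypothetical `fetch_daily_revenue` function and a hypothetical database client with a `query` method (neither comes from these notes). It uses `unittest.mock.Mock` from the standard library; pytest collects functions named `test_*` from files named `test_*.py`, and a single test can be selected with `pytest path/to/test_file.py::test_name` or `pytest -k "substring"`.

```python
from unittest.mock import Mock


def fetch_daily_revenue(client, day):
    """Sum order amounts for one day using an injected database client."""
    rows = client.query("SELECT amount FROM orders WHERE day = ?", (day,))
    return sum(row["amount"] for row in rows)


def test_fetch_daily_revenue_sums_order_amounts():
    # Mock the client so the test never touches a real database.
    client = Mock()
    client.query.return_value = [{"amount": 10.0}, {"amount": 2.5}]

    # Test *what* comes back, not *how* the query is built.
    assert fetch_daily_revenue(client, "2021-04-19") == 12.5


def test_fetch_daily_revenue_returns_zero_when_no_orders():
    client = Mock()
    client.query.return_value = []

    assert fetch_daily_revenue(client, "2021-04-19") == 0
```

Each test checks one behavior and its name says which one. In a real pytest suite the repeated `Mock()` setup would be a natural candidate for a `@pytest.fixture`.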
  • Packages
    • PYTHONPATH
    • Imports
    • Modules, submodules
    • Entry Points
  • Notebook -> Functions in NB -> Script -> Package
    • Copying cells? Make a function
    • Re-running notebook to just produce an output? Make a script
    • Copying notebooks to produce different reports? Make a package
    • Use cookiecutter to just quickly make a package.
  • Configuration, CLI, and global vars
    • Don’t modify the code then rerun it.
    • Switch to CLI / config early.
    • Use config to fix things in stone
    • Probably avoid global vars, except for environments
    • CLI works too. Show typer.
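One way the config and CLI points above could look, as a sketch only. The notes suggest typer; this example swaps in the standard library's `argparse` so it runs without extra dependencies. The option names, the `APP_ENV` environment variable, and the `Config` fields are all hypothetical.

```python
import argparse
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Run settings, frozen so nothing can change them mid-run."""
    env: str
    end_date: str
    n_estimators: int


def parse_config(argv=None):
    """Build a Config from CLI arguments instead of editing the code."""
    parser = argparse.ArgumentParser(description="Train a model.")
    # The environment falls back to an env var (APP_ENV is an assumed name).
    parser.add_argument("--env", default=os.environ.get("APP_ENV", "dev"))
    parser.add_argument("--end-date", required=True)
    parser.add_argument("--n-estimators", type=int, default=100)
    args = parser.parse_args(argv)
    return Config(
        env=args.env,
        end_date=args.end_date,
        n_estimators=args.n_estimators,
    )


if __name__ == "__main__":
    print(parse_config())
```

Changing a run then means changing the command line, for example `python train.py --end-date 2021-04-19 --n-estimators 200`, rather than modifying the code and rerunning it.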
  • Entry Points (put this in packages?)
  • Environments
    • Make envs look as close to each other as possible.
    • One way to do this: set up the env at the very start of the script run.
    • Use env var as a simple way to control this.
    • Preprod for ML?
  • Logging
    • Just do it instead of print.
    • Logger needs to be instantiated at the beginning, so we need a concept of entry points.
    • How to: handler, formatter, loggers. Use __name__.
  • Freezing analyses
    • All queries have an end date.
    • Gets tricky sometimes for mutable tables.
    • Have a global var at beginning to control this.
    • Discuss cache?
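The end-date idea above might be sketched like this. The table name, column names, and the `ANALYSIS_END_DATE` constant are hypothetical, and the `:end_date` placeholder assumes a database driver that accepts named parameters (the standard library's `sqlite3` does).

```python
from datetime import date

# Single cutoff for the whole analysis, set once at the top (hypothetical value).
ANALYSIS_END_DATE = date(2021, 4, 19)


def build_orders_query(end_date=ANALYSIS_END_DATE):
    """Return SQL and parameters that never read rows on or after the cutoff."""
    sql = (
        "SELECT user_id, amount, created_at "
        "FROM orders "
        "WHERE created_at < :end_date"
    )
    params = {"end_date": end_date.isoformat()}
    return sql, params
```

Rerunning the analysis next month then returns the same rows, as long as the table only appends. For mutable tables the cutoff alone is not enough, which is where caching or snapshotting the query result comes in.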
  • Reproducible analyses (because they’re never over).
    • NB runs end to end.
    • Do queries in the notebook. Don’t use results from a query in a separate place.
    • If necessary, cache query results (or use bionic!).
    • Freeze analyses, have end dates.
    • Aggressive refactoring. NB->functions->package. Just fucking do it.
    • Use poetry or Docker for dependency management.
    • Use a virtual env.
    • Just fucking do it.
  • Serialization?
    • Why can’t I unpickle this thing?
  • Package versioning?
  • Train / Predict parity
    • Use the same functions for everything. Look to scikit-learn. The more sklearn-y, the better.
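To illustrate train/predict parity, here is a small hand-rolled transformer that copies scikit-learn's `fit`/`transform` convention. It is a stand-in written for this example, not scikit-learn itself, and `StandardScaler1D`, `train`, and `predict_features` are hypothetical names.

```python
from statistics import mean, pstdev


class StandardScaler1D:
    """Hand-rolled stand-in for a scikit-learn-style transformer."""

    def fit(self, values):
        # Learn the statistics from training data only.
        self.mean_ = mean(values)
        self.scale_ = pstdev(values) or 1.0  # guard against zero variance
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.scale_ for v in values]


def train(raw_values):
    """Fit the scaler and return it with the training features."""
    scaler = StandardScaler1D().fit(raw_values)
    return scaler, scaler.transform(raw_values)


def predict_features(scaler, raw_values):
    """Prediction reuses the exact same transform; nothing is re-implemented."""
    return scaler.transform(raw_values)
```

Because both paths go through `transform`, a change to the feature logic cannot land in training and be forgotten in prediction.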
  • Bubbling Exceptions, Failing Gracefully
  • CI/CD
  • Automate non-logic (e.g. use black for formatting, CI/CD for running tests)
  • Use dataclasses/classes instead of dictionaries
  • When to use classes vs. functions
  • Also a quick intro to classes. They’re scary for beginners!
  • Everything is a pipeline. Layers to the pipeline ("pipeline"->"transform"->"transformations"). Look to scikit-learn for all this stuff.
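As a small illustration of the earlier point about dataclasses instead of dictionaries, here is a sketch with a hypothetical `ModelParams` class. The field names are made up for the example.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelParams:
    """Typed, named fields instead of a loose dict of settings."""
    learning_rate: float
    max_depth: int = 3


params = ModelParams(learning_rate=0.1)

# Attribute access: a typo in the attribute name raises AttributeError here,
# instead of a typo'd dict key surfacing as a KeyError somewhere downstream.
depth = params.max_depth

# Still easy to turn back into a dict for logging or serialization.
as_dict = asdict(params)
```

With `frozen=True`, assigning to a field after construction raises an error, so the parameters cannot drift partway through a run.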
  • Flask / APIs
    • How to load things into memory
    • Background tasks (e.g. to reload a model)
    • Local server -> WSGI -> nginx. Deployment in general.
    • Serializing numpy to JSON.
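For the numpy-to-JSON point, one possible approach is a `default=` hook for `json.dumps`. This is a sketch: `to_jsonable` and `demo` are hypothetical names, and the hook relies on NumPy arrays and NumPy scalars both exposing a `.tolist()` method.

```python
import json


def to_jsonable(obj):
    """`default=` hook for json.dumps: convert array-likes to plain Python."""
    # NumPy arrays and NumPy scalars both expose .tolist().
    if hasattr(obj, "tolist"):
        return obj.tolist()
    raise TypeError(
        f"Object of type {type(obj).__name__} is not JSON serializable"
    )


def demo():
    # Imported lazily so the helper itself has no hard NumPy dependency.
    import numpy as np

    payload = {"predictions": np.array([0.1, 0.9]), "n": np.int64(2)}
    return json.dumps(payload, default=to_jsonable)
```

`json.dumps` only calls the hook for objects it cannot serialize on its own, so ordinary dicts, lists, and floats pass through untouched.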
  • Debugging. I ought to learn!
  • I should add lots of side-comments each time a new thing that wasn’t seen before is used (e.g. list comprehensions, decorators, etc…).
  • How to read other people’s code. How to read code that produces data in the database. How data gets into the database!