@matthewfeickert
Last active July 15, 2020 03:08
Python dependency thoughts

This got long, so TLDR: Micky's reactions are what I would call "correct". If people want reproducible environments, that's great, but pip freeze is not going to get you there.

The answer is (like most things in life) it depends. What are you trying to do? I tend to split along this line into two idea groups: "libraries" and "applications".

At this point I'd suggest that you read the following to make sure that we're on the same page about these words.

Hopefully you read them, but the TLDR on those is:

  • Libraries are supposed to work in an ecosystem of things and so should impose only the minimum restrictions on dependencies that will still guarantee a stable API. Here you probably just want to specify the core dependencies with minimum version numbers or, to ensure stability within a major release (aka, the TensorFlow v1 API vs. the v2 API), use the compatible release syntax of PEP 440 (https://www.python.org/dev/peps/pep-0440/#compatible-release) --- tensorflow ~= 2.1 is equivalent to >= 2.1, == 2.*, which means you'll always stay within the major release 2.X while guaranteeing a version of at least v2.1.

  • Applications are things that you build with libraries, and you care about deploying them into a specific environment. With applications you want to provide very rigid dependencies that are frozen (hint hint: lock files are coming). Applications live "in production". However, that's a strict sense of the word, and maybe your "application" is just the summary plots that you're generating for your analysis. You want a specific set of requirements, but you also don't care too much if those requirements get updated, as long as the API and behaviour stay basically stable.
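The compatible-release expansion described above can be spelled out mechanically: `~= X.Y` drops the final version component and pins everything before it. A minimal sketch (the function name is mine, not from any library; real tools like pip use the packaging library instead):

```python
def expand_compatible_release(spec):
    """Expand a PEP 440 compatible-release specifier like '~=2.1'
    into its equivalent pair of clauses, per the PEP's definition."""
    version = spec.replace("~=", "").strip()
    parts = version.split(".")
    # '~= X.Y' means '>= X.Y, == X.*': the final component may float.
    prefix = ".".join(parts[:-1])
    return ">= {}, == {}.*".format(version, prefix)

print(expand_compatible_release("~=2.1"))    # >= 2.1, == 2.*
print(expand_compatible_release("~=1.4.2"))  # >= 1.4.2, == 1.4.*
```

Note that the expansion depends on how many components you write: ~=2.1 floats the minor release, while ~=2.1.0 would only float the patch release.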

Okay, so now to actually get to the question: "Should the whole of pip freeze from a virtual env be dumped in there". Almost never is this what you really want.

If you just want to make your plots and want to make sure the API doesn't break:

  • First ask if you have any heavy external compiled dependencies. If the answer is no, don't use Conda unless you really like wasting hours of your life watching a slow solver get you non-reproducible builds. If you want optimized builds then maybe you do want Conda, though, as those MKL libraries are :chef kiss:
  • If you do use Conda, just replace the below with environment.yml, though note that Conda's environment syntax is not as mature as pip's and so you'll be restricted (again, sometimes you really want Conda and that's fine)
  • Set up a clean virtual environment
  • Create a blank requirements.txt and then, using the '~=' and '==' syntax, lay out all the core dependencies you're going to use (i.e., what you actually import). Once you've done that, just 'pip install -r requirements.txt' and let those core constraints that matter sort out the rest of the dependencies.
  • Yay!
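As a concrete sketch of that requirements.txt step (the package names and version numbers here are illustrative, not recommendations):

```
# requirements.txt -- only the core, directly-imported dependencies
numpy~=1.18        # compatible release: >= 1.18, == 1.*
matplotlib~=3.2    # stay within the 3.x API
scipy==1.4.1       # exact pin where you want zero surprises
```

Then 'pip install -r requirements.txt' in the fresh virtual environment and let the resolver work out all the transitive dependencies for you.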

If you really have an application that needs to be deployed into an environment and be the same every time, then what you really want are lock files. Lock files ensure that you get the exact same environment by storing not only the versions of your dependencies but also their hashes (aka, digests, SHAs). They do this in a clearly ordered manner: they figure out what your actual requirements are, resolve the full dependency graph those requirements imply, and record a hash for everything in it. This is a lock file, and they are awesome. Many other languages have these as a default, but Python has a weird history.

You can actually generate lock files in a requirements.txt with pip-tools (https://twitter.com/di_codes/status/1252233796109381632?s=20), but it is much easier to use a tool that explicitly deals with these things. The two most robust and well known are Poetry (https://python-poetry.org/) and pipenv (https://github.com/pypa/pipenv). I've gone on long enough with my self-indulgent nerd rant, so I'll just say that these both work, but I have a strong bias towards Poetry, and I'm happy to explain why if anyone cares.
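For a feel of what the pip-tools route produces, here's a sketch of a hash-pinned requirements.txt as generated by pip-compile --generate-hashes (the version and digest values are placeholders, not real hashes):

```
# generated by: pip-compile --generate-hashes requirements.in
numpy==1.18.4 \
    --hash=sha256:<digest-of-one-wheel> \
    --hash=sha256:<digest-of-another-wheel>
```

Installing with 'pip install --require-hashes -r requirements.txt' then puts pip in hash-checking mode, so anything whose digest doesn't match gets refused.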

So: if you are making a library, have minimal dependencies. If you're making a "for you" application, use a requirements.txt with compatible release syntax on your core dependencies. If you're making a "deploying to production" application, use the right tool for the job and use lock files to get fully reproducible dependency graphs.
