
Conda proposal: variants

Motivation

There are many situations where we are inclined to produce multiple variants of the same package, with each variant depending on a different set of low-level dependencies. For instance:

  • A numerical package might rely on the use of the Basic Linear Algebra Subprograms (BLAS). There are a variety of implementations of the BLAS we might wish to support, including MKL, OpenBLAS, ACML, Accelerate, and ATLAS.
  • We might wish to compile Python with different compilers that are not link-compatible with each other; when the compiler changes, every package compiled against the CPython API must be recompiled.

The existence of these multiple variants can potentially pose a problem for users: how do they make sure that all of the packages in their environment are compatible with each other? That is: how do we ensure that all packages that rely on BLAS use the same BLAS variant? How do we ensure that all packages with a CPython dependency use the same ABI?

If there are only two variants, then conda's features/track_features facility provides a solution. For instance, if a user installs the nomkl metapackage, it turns on the nomkl feature, which causes all packages that link to BLAS to select an OpenBLAS variant instead of an MKL variant. Unfortunately, features (or perhaps our deployment of them) have proven to be a bit fragile, and they are necessarily limited to two variants.

To address this problem, we propose to formalize an approach for relying on conda's natural dependency resolution facilities. As you might have guessed, we are calling this approach variants.

Variant metapackages

To construct a set of variants, we begin by collecting the following information:

  • A name for the variant class; e.g., blas
  • Names for the variant instances; e.g., mkl, openblas, accelerate, atlas.

These names must be compatible with Windows and Unix filename conventions, and cannot contain dash - characters (underscores are fine). Armed with this information, we proceed to build a set of packages, one for each variant instance, as follows:

  • Package name: the variant class; e.g., blas
  • Build string: the variant instance; e.g., mkl
  • Version number: 1 for the preferred instance; 0 for all others
  • Build number: 0, identical across all instances
  • Dependencies: none

The specific choices of 0 and 1 are not necessarily important for the version and build numbers. However, selecting exactly one variant instance to have version 1, and using identical values in all other cases, is important for communicating the preference information to conda. In theory, you could provide a preference hierarchy using version numbers 2, 3, etc. as well.
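For concreteness, the recipe for the preferred MKL instance might look roughly like the sketch below. This is only a sketch following the conventions above; the exact recipe layout (a plain conda-build meta.yaml with no sources and no requirements) is an assumption, not part of the proposal itself.

    # meta.yaml for the preferred instance of the blas variant class (sketch)
    package:
      name: blas          # variant class
      version: "1"        # 1 marks the preferred instance; all others use "0"

    build:
      string: mkl         # variant instance
      number: 0           # identical across all instances

    # Intentionally no requirements: the metapackage only carries variant information.

The openblas, accelerate, and atlas instances would be identical except for their build strings and a version of "0".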

As a result of this build process, we will obtain a set of files with names of the form name-1-instance.tar.bz2 or name-0-instance.tar.bz2, assuming that the standard naming convention is employed. For instance, for BLAS, we might have the following filenames:

     blas-1-mkl.tar.bz2
     blas-0-openblas.tar.bz2
     blas-0-accelerate.tar.bz2
     blas-0-atlas.tar.bz2

Using the variants when package building

Once the variants have been built, we can now build packages that rely on them. To do so, we simply include the appropriate variant metapackage as a dependency. For instance, the MKL variant of a package might have this in its dependency list:

    depends:
       - mkl
       - blas * mkl

Note the use of the wildcard for version number. This gives you the ability to build these packages without knowing which variant is preferred. In fact, you can even change the preferences after the fact without having to rebuild these packages.
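For comparison, the OpenBLAS build of the same package would follow exactly the same pattern with a different instance name (a sketch mirroring the dependency list above):

    depends:
       - openblas
       - blas * openblas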

One might be tempted to simplify this process by including mkl as a dependency of blas-1-mkl, openblas as a dependency of blas-0-openblas, and so forth. In some cases, this should work just fine, but I would recommend this approach only if that dependency can be made completely version free. In other words, don't make blas-1-mkl depend on mkl 12.1.*; just make it depend on mkl. It will be very important to avoid the need to update these metapackages as new versions of their underlying dependencies are released. If a particular package does require a specific version of MKL, it can still be specified alongside the variant metapackage; e.g.,

    depends:
       - mkl >=12.1,<13
       - blas * mkl

Having said this, in some cases a variant will naturally be tied to particular versions. For instance, suppose we used a variant approach to differentiate between incompatible C++ ABIs. In this case, the individual variant instances might be drawn from a matrix of different C++ compilers and versions; e.g., cppabi-*-gcc5, cppabi-*-icc4, etc. (These are simply examples; I have no specific knowledge of C++ ABI issues.) In this case, it would be desirable for the variant metapackages to include version specifications in their dependencies.
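As a sketch of that last point (the package and version names here are purely hypothetical), the gcc5 instance of such a cppabi variant class might pin its runtime like this:

    # Hypothetical cppabi-1-gcc5 variant metapackage
    depends:
       - libgcc 5.*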

Specifying a variant on the command line

Now that the variants have been put in place, a user can begin taking advantage of them without even knowing they are present. Suppose, for instance, that NumPy and SciPy have been built against multiple BLAS variants. Then performing

    conda create -n newenv python=2.7 numpy scipy

will automatically install blas-1-mkl.tar.bz2 and ensure that the MKL variants of both NumPy and SciPy are selected.

If the user wishes to specify a particular variant, they can do this:

    conda create -n newenv python=2.7 numpy scipy blas=*=openblas

Note the use of the wildcard to specify the version number. This will create the same environment as before, but with the openblas variant. To change variants, the user can simply install a new variant package; for instance,

    conda install blas=*=atlas

will force NumPy and SciPy to be updated to their ATLAS variants.

Challenge: conda update --all

Using the version number to specify the "preferred" or "default" variant introduces a problem with conda update --all. When this command is run, conda will select the highest version number of the variant class. It will switch the user to this preferred variant instance, whether or not they asked for it.

Unfortunately, giving all of the variant metapackages the same version number eliminates our ability to specify one as the default, and it still runs into problems with conda update --all. Under this scenario, conda will see a tie across all of the variants, and it will break that tie in an undefined manner. There will be no predictability in the initial installation of the variant unless it is explicitly specified.

So it is clear that we will need to come up with an improvement to the Conda solver that will allow us to achieve the full behavior we seek. I propose a simple modification: when conda update --all is specified, we do not include variant metapackages in the list of packages to be updated. This will require some formal way to communicate to the solver that a package is not to be included in conda update --all.

If someone wishes to use variant packages effectively with an older version of conda, then they could pin the particular variant metapackage.
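Conda's pinning mechanism is a file named pinned in the environment's conda-meta directory, containing one match spec per line. For example (the environment path here is only an illustration):

    # Keep the openblas variant fixed so updates will not switch it
    echo "blas * openblas" >> ~/miniconda3/envs/newenv/conda-meta/pinned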

Minor challenge: orphan packages

Consider again the following sequence of commands:

    conda create -n newenv python=2.7 numpy scipy blas=*=openblas
    conda install blas=*=mkl

The first command will install the OpenBLAS variants of NumPy and SciPy, which will require the installation of the openblas conda package. The second command will replace NumPy and SciPy with their MKL variants and install the mkl conda package. The second command, however, does not remove OpenBLAS from the conda environment, even though it is no longer being used.

This is a natural consequence of the way conda works, and is not necessarily a problem if mkl and openblas are properly designed. In fact, we might want both packages to be installed alongside each other. For instance, there might be an application outside of the Python ecosystem that depends on a different BLAS variant than the one we have specified for Python.

Nevertheless, it points to a potential improvement in conda: the ability to detect these "orphan" packages and, upon request (say, with a conda clean command), remove them from an environment. This can be accomplished by examining the install and remove history for a given environment and differentiating between packages that were explicitly installed, those required because of dependencies, and orphans.
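Conda already records this history in each environment's conda-meta/history file, where explicitly requested operations appear as # cmd: lines. As a rough sketch (the environment path is just an example), one can already see what a user explicitly asked for with:

    # Show the explicit conda commands that were run against this environment
    grep "# cmd:" ~/miniconda3/envs/newenv/conda-meta/history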

@mwiebe commented Jul 17, 2016

I see what you mean now: the variant instances just affect the build string, and in my example the packages which have the orphan problem you describe would actually be the runtime libraries for given compiler configurations. E.g., multiple msvc runtime libraries could remain installed when switching the compiler toolchain, but the package specifying which one to build with is mutually exclusive because it always has the same package name. All sounds good!

@mcg1969 commented Jul 24, 2016

@mwiebe I think this is a good way to look at it indeed. Any packages that can be safely installed alongside each other should be allowed to do so, which requires that they have a different package name (or at least a different package-namespace combination, once we get that going). That's why, for instance, we've moved to giving the MSVC runtimes different package names. Variants are for those situations where we have a set of options and we want to ensure exactly one of them is engaged.

@msarahan commented:

@mcg1969 this looks very good. How should we go about proving it in an example somewhere? Should this be an integration test for conda? conda-build?

I am revisiting this discussion after talking with @csoja today about how we might support a gpu variant of tensorflow, given that we currently do not have redistribution rights to CUDA/CUDNN. I think we'll try to have the default variant be CPU only - perhaps this is more of a job for a feature, since it is a binary choice? This way, the CPU default would work out of the box, but the GPU one would require users to go off and install CUDA/CUDNN themselves before it would work.

@johanneskoester commented:

Very nice indeed. One could add a new key priority as an alternative to version. This way, it would (a) be obvious to the solver which package is the default, and (b) the YAML of a variant metapackage would be more readable to somebody who does not know the internals. Also, you could omit the * when specifying the package as a dependency or on the command line. Of course, the downside would be a bit more complexity in the conda code.
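To illustrate the suggestion only, here is a sketch of how such a priority key might look in a variant metapackage recipe. This key does not exist in conda or conda-build; it is purely the hypothetical extension proposed above.

    package:
      name: blas
      version: "1"        # same version across all instances

    build:
      string: openblas
      priority: 2         # hypothetical new key; higher value = preferred variant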

@jakirkham commented:

There is a typo in the orphan section. It says ATLAS, but it appears you are referring to MKL.

@jakirkham commented:

I would add that if preference is to be held in this variant concept, there should be two version numbers. The reason is that the ordering may need to be adjusted or other package metadata may need to be updated. In this case, it would be ideal to be able to change the first version number for this content so everyone gets the latest working one, and preference can be handled in the second one.

As far as preference goes generally, the 0/1 situation with preference is a little too flat. That said, after a year of the blas package, I think conveying preference was a bad idea and would rather have none at all. Things like channel priority have rectified things significantly. Ultimately it is up to users to make these choices. In this case, there would only need to be one version number for the package.

Have you tested cases where a package exists for only one type of BLAS, but not others? This case has become very common as there are many packages in conda-forge built with openblas, which simply don't exist for other variants (or in other channels). I don't expect any of the ideas presented to cause problems for this use case, but it would be good to verify that is true. Happy to provide example packages if needed.

@jakirkham commented Dec 12, 2017

Another point that does not appear to be fleshed out here is how packages that depend on the variant would look. Namely, if there are multiple builds of them with different variants, we need to distinguish them somehow. One could encode this in the build string, but one would then have to engineer all other relevant content to fit in the build string as appropriate (e.g. build number, python version, numpy version (previously), etc.). For this reason alone, it might be better to have something like features. Actually, I was not aware of PKG_BUILD_STRING in conda-build 3; it might be better to append to this in the build string.
