Conda hackery: namespaces

Motivation

We would like to position Conda as a language-agnostic package manager, but at present it maintains a distinct bias towards Python. This is both inevitable and, frankly, reasonable. Nevertheless, as we begin to use it to subsume other packaging ecosystems, such as CRAN, NPM, Ruby Gems, etc., we are often going to run into naming conflicts.

Our first attempt to incorporate a separate ecosystem has involved R. Our solution to the naming conflict issue has been to prepend an r- prefix to all of the R packages. In my view this is an aesthetically displeasing solution, likely to be made worse if we continue this practice with other ecosystems; e.g., node-, ruby-, etc. We have long discussed the need to implement namespaces to address this properly, but it has necessarily been a lower priority.

Conda Forge began to see this issue as well. They made a preliminary decision to solve this problem by prepending an ecosystem prefix to every package that merits one, including Python. Unfortunately, they began prepending this prefix even to packages that were already in defaults, a move certain to cause genuine confusion for users. They've pulled back from this approach for now, but ultimately it shows that a namespace solution of some sort is needed.

Core principles

What I'd like to do here is to outline some core principles that ought to govern the design of the namespace solution, and some proposed solutions to address those principles. For the sake of discussion, we are going to pretend that the R packages are not prefixed with r-.

Principle 1: clear syntax for explicit namespace specification

While much of our design here is oriented towards minimizing the need for users to worry about namespaces, there will clearly be occasions where explicitly specifying them is necessary. We need a syntax to do so that is easy to read and consistent across the conda command line, meta.yaml files for conda build, and environment.yml files for conda env, etc.

What is more, with conda 4.1 we introduced the notion of channel prioritization, and we need to enhance this by allowing people to explicitly specify a channel for a given package. Therefore we should decide on these syntaxes together, so that they can easily work together.

Based on the discussions with Sean and Ilan, I propose we do this:

channel/namespace:package

The channel and namespace entries are optional, so we could have these combinations as well:

namespace:package
channel/package

*For the purposes of this document, we are going to utilize this syntax, but it is not necessarily final.*
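To make the proposed syntax concrete, here is a minimal sketch of how such a spec string might be split into its parts. The names Spec and parse_spec are hypothetical and not part of conda. The sketch also assumes that channel names contain neither "/" nor ":", which would not hold for channel URLs.

```python
from typing import NamedTuple, Optional


class Spec(NamedTuple):
    """Hypothetical container for a parsed channel/namespace:package spec."""
    channel: Optional[str]    # None when no channel was given
    namespace: Optional[str]  # None when no namespace was given
    package: str


def parse_spec(text: str) -> Spec:
    """Split 'channel/namespace:package' into parts; both prefixes are optional."""
    channel = None
    namespace = None
    rest = text
    # Assumption: a channel is a simple name, so the first "/" ends it.
    if "/" in rest:
        channel, rest = rest.split("/", 1)
    # An explicitly empty namespace ("" before the colon) is kept distinct
    # from an omitted one (None).
    if ":" in rest:
        namespace, rest = rest.split(":", 1)
    if not rest:
        raise ValueError(f"missing package name in spec: {text!r}")
    return Spec(channel, namespace, rest)


print(parse_spec("conda-forge/python:digest"))
print(parse_spec("r:digest"))
print(parse_spec("conda-forge/digest"))
```

Keeping an omitted namespace (None) separate from an empty one lets later resolution logic tell "the user did not say" apart from "the user explicitly asked for no namespace".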

Principle 2: namespace names should be obvious

For example, consider the package name digest. Both a Python package and an R package go by this name. The Python package would be uniquely named python:digest, and the R package would be named r:digest.

Whenever possible, the name of the namespace should be identical to its "anchor" package like this. For instance, a NodeJS conda namespace should be simply node, and not npm. We propose to go further and require that every namespace be given a name identical to its anchor package. This makes it quite simple to identify which namespace most packages live in, by examining their dependency tree for anchor packages. (There are some wrinkles here that you may already be thinking of; more on this later.)

What if a namespace simply doesn't have a logical anchor package? We suspect that this situation is rare. If it does arise, we propose that an artificial anchor package be created for that purpose, and that all packages in that namespace have this anchor package as a dependency (if not directly, then as a leaf in their dependency tree).

There will also be packages that, arguably, do not live in a larger ecosystem. This would include things like standalone executables, C libraries, and other packages that are readily used across multiple ecosystems. For such packages, we propose that the namespace be empty, and that a namespace-explicit reference to such a package would simply involve a bare colon. For instance, :graphviz refers to the standalone GraphViz package, whereas python:graphviz refers to the Python package with the same name. Note that this global namespace does not have an anchor package; it is the only exception to that rule.

Principle 3: determining the active namespaces in an environment should be straightforward

We seek to ensure that in most cases, it will not be necessary to supply explicit namespaces every time a package is installed. For this to hold, we need to readily determine which namespaces are "active" in a given environment.

The basic rule is this: the list of active namespaces in a given environment is determined by the set of anchor packages installed, or requested to be installed, in that environment. So if python is installed, the python: namespace is active; if R is installed, the r: namespace is active.

When determining this namespace list during a conda install command, any anchor packages included in the specs should be included. So for instance, if an environment has Python installed and the user does conda install r, the namespace list should include both Python and R for the purposes of namespace resolution.

For the purposes of this rule, we propose to treat the global namespace specially: it is never included in the namespace list _unless_ there are no other active namespaces. Given the principles outlined below for resolving namespaces, we suspect that this will very rarely be an issue.
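The rule above can be sketched in a few lines of Python. This is an illustration only: the function name, the use of an empty string for the global namespace, and the hardcoded KNOWN_ANCHORS set are all assumptions, and how the real set of anchors would be obtained is not settled here.

```python
# Assumption: a fixed set of anchor package names, for illustration only.
KNOWN_ANCHORS = frozenset({"python", "r", "node"})

# Assumption: the global namespace is represented by the empty string.
GLOBAL_NAMESPACE = ""


def active_namespaces(installed, requested, anchors=KNOWN_ANCHORS):
    """Return the set of active namespaces for an environment.

    installed -- names of packages already in the environment
    requested -- names of packages in the current install request
    """
    # Anchors count whether they are already installed or merely requested.
    active = (set(installed) | set(requested)) & set(anchors)
    # The global namespace is active only when no other namespace is.
    return active if active else {GLOBAL_NAMESPACE}


print(active_namespaces({"python", "numpy"}, set()))
print(active_namespaces({"python"}, {"r"}))
print(active_namespaces({"zlib"}, {"graphviz"}))
```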

OPEN QUESTION: how do we specify the set of namespaces to consider? Do we hardcode a few, like Python, R, and Lua? How do we ensure that these namespaces are properly treated if the user has a non-standard set of conda channels?

Principle 4: when there is no name conflict, an explicit namespace should never be needed

If a package has a unique name across all namespaces, it should never be necessary to explicitly attach a namespace to the package to retrieve it. So for instance, installing pyomo should never require a python: prefix, even if Python is not yet installed.

It may seem that this should go without saying, but it is not actually true today. For instance, Continuum now prepends the r- prefix to all R packages. For the purposes of this document, such prefixing is simply a weak version of the namespacing we are proposing to do here, and it fails this principle as a result.

Principle 5: when there is no name conflict among active namespaces, an explicit namespace should never be needed

If a user is working entirely with Python packages, that user should not be forced to specify a prefix just because the package happens to have the same name as an R package, a Perl package, or a Node package.

To illustrate this core principle, consider the package name digest, which exists in both the Python and R namespaces. We argue that the following behavior should occur:

conda create -n newenv python digest, or conda install digest applied to an environment containing Python and not R, should install only python:digest.
conda create -n newenv r digest, or conda install digest applied to an environment containing R and not Python, should install only r:digest.

Principle 6: when ambiguity cannot be resolved, favor convenience

What do we do if both Python and R are in the environment? We propose that conda install both packages:

conda create -n newenv r python digest, or conda install digest applied to an environment containing both R and Python, should install both python:digest and r:digest.
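Taken together, Principles 4 through 6 amount to a small resolution rule for a bare package name. Here is a minimal sketch in Python, assuming a hypothetical index that maps each package name to the set of namespaces providing it, plus a precomputed set of active namespaces. Neither is an existing conda data structure.

```python
def resolve(name, index, active):
    """Return the sorted 'namespace:package' names a bare name resolves to.

    index  -- hypothetical mapping: package name -> set of namespaces providing it
    active -- set of namespaces active in the target environment
    """
    providers = index.get(name, set())
    if len(providers) == 1:
        # Principle 4: a globally unique name never needs a namespace,
        # even if its namespace is not active yet.
        chosen = providers
    else:
        # Principle 5: only conflicts among active namespaces matter.
        # Principle 6: if several active namespaces match, take them all.
        chosen = providers & active
    return sorted(f"{ns}:{name}" for ns in chosen)


index = {"digest": {"python", "r"}, "pyomo": {"python"}}

print(resolve("pyomo", index, {""}))
print(resolve("digest", index, {"python"}))
print(resolve("digest", index, {"r"}))
print(resolve("digest", index, {"python", "r"}))
```

One case is not covered by the principles as written: a name that exists in several namespaces, none of which is active. The sketch returns an empty list there, leaving it to the caller to ask for an explicit namespace.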

Is it possible that the user did not intend this? Absolutely. But this may reveal useful information to the user; for instance, they may have been unaware that an R version existed. And of course, they can always reject the installation and redo it with a namespace prefix if they so desire.

Here is a more complex scenario. Graphviz is a popular tool for creating visualizations of graphs. It is a standalone tool, not a Python package. But there are a variety of Python packages that use it, including a PyPI package with the name graphviz. This unfortunate naming choice results in confusion for Python users. For instance, conda install graphviz gives you the global Graphviz package, but not the Python module. On the other hand, pip install graphviz gives you the Python module, but not the global package, which it needs to operate properly.

For the purposes of this example, let's assume there is an R package with the name graphviz as well, with the same concomitant ambiguity concerns. Thus we now have three packages with the same name: python:graphviz, r:graphviz, and :graphviz for the global case. We propose the following behaviors:

conda create -n newenv python graphviz, or conda install graphviz applied to an environment containing Python and not R, would install python:graphviz. This would then install :graphviz by dependency, so that the Python package would be fully functioning.
conda create -n newenv r graphviz, or conda install graphviz applied to an environment containing R and not Python, would install r:graphviz, and the :graphviz package by dependency.
conda create -n newenv r python graphviz, or conda install graphviz applied to an environment containing both R and Python, would install both python:graphviz and r:graphviz; and again, by dependency, :graphviz as well.
conda create -n newenv graphviz should create an environment containing nothing but :graphviz.

Suppose for the sake of argument, however, that python:graphviz and r:graphviz did not depend on the global package. For instance, perhaps both packages vendor the original Graphviz. As a result, neither one of these packages would have a :graphviz dependency. In this case, the only instance where :graphviz would be installed is the last one, due to the principle that the global namespace is considered active _only_ if no others are. If the user wishes to override this behavior, they can use explicit namespacing; i.e., conda install graphviz :graphviz.
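The graphviz scenarios can be walked through with a small sketch that combines namespace resolution with a dependency closure. Everything here is hypothetical: plan_install, the index and deps mappings, and the empty string standing for the global namespace are illustrative assumptions, not conda internals. Dependencies on the anchor packages themselves are left out to keep the output short.

```python
def plan_install(name, index, deps, active):
    """Return the sorted 'namespace:package' names that would be installed.

    index  -- hypothetical mapping: package name -> namespaces providing it
    deps   -- hypothetical mapping: 'namespace:package' -> list of dependencies
    active -- set of active namespaces ("" stands for the global namespace)
    """
    providers = index.get(name, set())
    # A unique name resolves directly; otherwise only active namespaces count.
    chosen = providers if len(providers) == 1 else providers & active
    todo = [f"{ns}:{name}" for ns in chosen]
    plan = set()
    # Follow dependencies until nothing new is added.
    while todo:
        pkg = todo.pop()
        if pkg not in plan:
            plan.add(pkg)
            todo.extend(deps.get(pkg, ()))
    return sorted(plan)


index = {"graphviz": {"python", "r", ""}}
depends_on_global = {
    "python:graphviz": [":graphviz"],
    "r:graphviz": [":graphviz"],
}
vendored = {}  # neither package depends on :graphviz

print(plan_install("graphviz", index, depends_on_global, {"python"}))
print(plan_install("graphviz", index, depends_on_global, {"python", "r"}))
print(plan_install("graphviz", index, depends_on_global, {""}))
print(plan_install("graphviz", index, vendored, {"python"}))
```

In the vendored case the global package appears only when the global namespace is the sole active one, which matches the behavior described above.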
