We would like to position Conda as a language-agnostic package manager, but at present it maintains a distinct bias towards Python. Given its origins this was expected and, frankly, reasonable. Nevertheless, as we begin to use it to subsume other packaging ecosystems, such as CRAN, NPM, Ruby Gems, etc., we are going to want to overcome this history; and one key challenge is to address naming conflicts across platforms.
Our first attempt to incorporate a separate ecosystem involved R. Our solution to the naming conflict issue was to prepend an `r-` prefix to all of the R packages. In my view this is an aesthetically displeasing solution, likely to be made worse if we continue this practice with other ecosystems; e.g., `node-`, `ruby-`, etc. We have long discussed the need to implement namespaces to address this properly, but it has necessarily been a lower priority.
Conda Forge began to see this issue as well. They made a preliminary decision to solve this problem by prepending an ecosystem prefix to every package that merits one, including Python. Unfortunately, they began adding this prefix even to packages that were already in defaults, a move certain to cause genuine confusion for users. They've pulled back from this approach for now, but ultimately it shows that a namespace solution of some sort is needed.
What I'd like to do here is to outline some core principles that ought to govern the design of the namespace solution, and some proposed solutions to address those principles. For the sake of discussion, we are going to pretend that the R packages are not prefixed with `r-`.
While much of our design here is oriented towards minimizing the need for users to worry about namespaces, there will clearly be occasions where explicitly specifying them is necessary. We need a syntax to do so that is easy to read and consistent across the `conda` command line, `meta.yaml` files for `conda build`, and `environment.yml` files for `conda env`, etc.
What is more, with conda 4.1 we introduced the notion of channel prioritization, and we need to enhance this by allowing people to explicitly specify a channel for a given package. Therefore we should decide on these syntaxes together, so that they can easily work together.
Based on the discussions with Sean and Ilan, I propose we do this:
- `channel/namespace:package`
- `namespace:package`
- `channel/package`
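
As an illustration of how the three forms above could be told apart, here is a minimal parsing sketch. It is not conda's actual parser: the `Spec` type and `parse_spec` function are hypothetical names, version constraints are ignored, and the sketch assumes that the last `/` separates the channel and the first `:` after it separates the namespace.

```python
from typing import NamedTuple, Optional


class Spec(NamedTuple):
    channel: Optional[str]    # None when no channel was given
    namespace: Optional[str]  # None when omitted; "" for a bare leading colon
    name: str


def parse_spec(text: str) -> Spec:
    """Split 'channel/namespace:package' into its parts (hypothetical helper)."""
    channel, rest = None, text
    if "/" in text:
        # Split on the last slash so channels containing slashes stay intact.
        channel, rest = text.rsplit("/", 1)
    namespace, name = None, rest
    if ":" in rest:
        namespace, name = rest.split(":", 1)
    if not name:
        raise ValueError(f"missing package name in {text!r}")
    return Spec(channel, namespace, name)
```

Keeping "no namespace given" (`None`) distinct from "explicitly empty namespace" (`""`) matters later in this proposal, where a bare colon such as `:graphviz` names the global namespace.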
For example, consider a package named `digest`. There are Python and R packages with this package name. The Python package would be uniquely named `python:digest`, and the R package would be named `r:digest`.
Whenever possible, the name of the namespace should be identical to its "anchor" package like this. For instance, a NodeJS conda namespace should be simply `node`, and not `npm`. We propose to go further and require that every namespace be given a name identical to its anchor package. What if a namespace simply doesn't have a logical anchor package? We suspect that this situation is rare. And yet, if it arises, we propose that an artificial anchor package be created for that purpose.
A package will be considered a member of a namespace if it includes any version of that anchor package as a dependency~~, at any level of its dependency tree~~ (see EDIT 1 below).
One interesting consequence of this definition is that it will be possible for the same package to live in more than one namespace. For instance, Continuum's build of `rpy2` includes both R and Python as dependencies, so it will resolve as both `r:rpy2` and `python:rpy2`.
There will also be packages that, arguably, do not live in a larger ecosystem. This would include things like standalone executables, C libraries, and other packages that are readily used across multiple ecosystems. For such packages, we propose that the namespace be empty, and that a namespace-explicit reference to such a package would simply involve a bare colon. For instance, `:graphviz` refers to the standalone GraphViz package, whereas `python:graphviz` refers to the Python package with the same name. Note that this global namespace does not have an anchor package; it is the only exception to that rule.
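
The membership rule can be sketched in a few lines. This is only an illustration under stated assumptions: `namespaces_of` is a hypothetical helper, dependencies are assumed to be strings of the form `name [version]`, only direct dependencies are examined (per EDIT 1 below), and the set of anchor packages is passed in by the caller.

```python
GLOBAL = ""  # the empty, anchor-less namespace, written as a bare colon


def namespaces_of(depends, anchors):
    """Return the namespaces a package belongs to, judged by its direct deps."""
    # Drop any version constraint: "python 2.7*" -> "python"
    dep_names = {d.split()[0] for d in depends if d.strip()}
    found = dep_names & set(anchors)
    # A package with no anchor among its dependencies is global.
    return found or {GLOBAL}
```

With anchors `{"python", "r"}`, a package depending on both `python` and `r` (as described above for `rpy2`) yields both namespaces, while a C library with no anchor dependency yields only the global namespace.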
OPEN ISSUE: are there instances where this dependency-based approach to namespace resolution fails? Should we offer a facility to override this determination? Are there consequences if a package maintainer accidentally drops an anchor dependency, or does so maliciously?
We seek to ensure that in most cases, it will not be necessary to supply explicit namespaces every time a package is installed. For this to hold, we need to readily determine which namespaces are "active" in a given environment.
The basic rule is this: the list of active namespaces in a given environment is determined by the set of anchor packages installed, or requested to be installed, in that environment. So if `python` is installed, the `python:` namespace is active; if R is installed, the `r:` namespace is active.
When determining this namespace list during a `conda install` command, any anchor packages included in the specs should be included. So for instance, if an environment has Python installed and the user does `conda install r ...`, the namespace list should include both Python and R for the purposes of namespace resolution.
For the purposes of this rule, we propose to treat the global namespace specially: it is never included in the namespace list _unless_ there are no other active namespaces. Given the principles outlined below for resolving namespaces, we suspect that this will very rarely be an issue.
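
A minimal sketch of this rule, assuming the set of anchor packages is already known (how to obtain it is the open question that follows). The function name `active_namespaces` and its arguments are hypothetical, not part of conda.

```python
GLOBAL = ""  # the empty global namespace


def active_namespaces(installed, requested, anchors):
    """Namespaces active for an environment plus the specs being installed."""
    # Anchor packages count whether already installed or merely requested.
    present = set(installed) | set(requested)
    active = present & set(anchors)
    # The global namespace is active only when no other namespace is.
    return active or {GLOBAL}
```

For an environment that already has `python` and a request that adds `r`, this returns both namespaces; for an environment with no anchor packages at all, it returns only the global namespace.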
OPEN QUESTION: how do we specify the set of namespaces to consider? Do we hardcode a few, like Python, R, and Lua? How do we ensure that these namespaces are properly treated if the user has a non-standard set of conda channels?
If a package has a unique name across all namespaces, it should never be necessary to explicitly attach a namespace to the package to retrieve it. So for instance, installing `pyomo` should never require a `python:` prefix, even if Python is not yet installed.
It may seem that this should go without saying, but indeed it isn't true now. For instance, Continuum now prepends the `r-` prefix to all R packages. For the purposes of this document, such prefixing is simply a weak version of the namespacing we are proposing to do here, and it fails this principle as a result.
Principle 5: when there is no name conflict among active namespaces, an explicit namespace should never be needed
If a user is working entirely with Python packages, that user should not be forced to specify a prefix just because the package happens to have the same name as an R package, a Perl package, or a Node package.
To illustrate this core principle, consider the package name `digest`, which exists in both the Python and R namespaces. We argue that the following behavior should occur:
- `conda create -n newenv python digest`, or `conda install digest` applied to an environment containing Python and not R, should install only `python:digest`.
- `conda create -n newenv r digest`, or `conda install digest` applied to an environment containing R and not Python, should install only `r:digest`.
What do we do if both Python and R are in the environment? We propose that conda install both packages:
- `conda create -n newenv r python digest`, or `conda install digest` applied to an environment containing both R and Python, should install both `python:digest` and `r:digest`.
Is it possible that the user did not intend this? Absolutely. But this may reveal useful information to the user; for instance, they may have been unaware that an R version existed. And of course, they can always reject the installation and re-do it with a namespace prefix if they so desire.
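
The behaviors described above can be sketched as a small resolution function. This is an illustration only: `resolve` and the `index` mapping (bare name to the set of namespaces providing it) are hypothetical, and what should happen when a conflicting name matches no active namespace is not specified in this proposal, so the sketch simply returns an empty list in that case.

```python
def resolve(name, active, index):
    """Return the namespace-qualified packages that a bare name selects."""
    providers = index.get(name, set())
    if len(providers) == 1:
        # Unique across all namespaces: no prefix needed, even if the
        # namespace is not active yet.
        (namespace,) = providers
        return [f"{namespace}:{name}"]
    # Conflicting name: keep every candidate in an active namespace.
    return sorted(f"{ns}:{name}" for ns in providers & set(active))
```

Given an index where `digest` exists in both `python` and `r`, a Python-only environment selects `python:digest`, an R-only environment selects `r:digest`, and an environment with both selects both.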
Here is a more complex scenario. GraphViz is a popular tool for creating visualizations of graphs. It is a standalone tool, not a Python package. But there are a variety of Python packages which utilize it, including a PyPI package with the name `graphviz`. This unfortunate naming choice results in confusion for Python users. For instance, `conda install graphviz` gives you the global Graphviz package, but not the Python module. On the other hand, `pip install graphviz` gives you the Python module, but not the global package, which it needs to operate properly.
For the purposes of this example, let's assume there is an R package with the name `graphviz` as well, with the same concomitant ambiguity concerns. Thus we now have three packages with the same name: `python:graphviz`, `r:graphviz`, and `:graphviz` for the global case. We propose the following behaviors:
- `conda create -n newenv python graphviz`, or `conda install graphviz` applied to an environment containing Python and not R, would install `python:graphviz`. This would then install `:graphviz` by dependency, so that the Python package would be fully functioning.
- `conda create -n newenv r graphviz`, or `conda install graphviz` applied to an environment containing R and not Python, would install `r:graphviz`, and the `:graphviz` package by dependency.
- `conda create -n newenv r python graphviz`, or `conda install graphviz` applied to an environment containing both R and Python, would install both `python:graphviz` and `r:graphviz`; and again, by dependency, `:graphviz` as well.
- `conda create -n newenv graphviz` should create an environment containing nothing but `:graphviz`.
Suppose for the sake of argument, however, that `python:graphviz` and `r:graphviz` did not depend on the global package. For instance, perhaps both packages vendor the original Graphviz. As a result, neither one of these packages would have a `:graphviz` dependency. In this case, the only scenario where `:graphviz` would be installed is the last one, due to the principle that the global namespace is considered active only if no others are. If the user wishes to override this behavior, then they can use explicit namespacing; i.e., `conda install graphviz :graphviz`.
We have already discussed a scenario where a package can have a dependency with the same name from another namespace; e.g., `python:graphviz` depends on `:graphviz`. The corresponding `conda-build` recipe for `python:graphviz` might include these `run` requirements:
- run:
  - python
  - :graphviz
The desire to make things convenient for the `conda` user should not necessarily extend to the `conda-build` user. How strict must we be when parsing the run requirements?
Ideally, we would like to define the logic here such that most packages will continue to build properly, unmodified. Obviously, when there are no naming conflicts across namespaces, this will not be an issue. If the anchor packages present in the run requirements are enough to eliminate ambiguity, that should be sufficient as well. But what we want to avoid is the selection of multiple packages due to namespace ambiguity. In effect, Principles 4 and 5 should still hold for package building, but not Principle 6.
I believe it may be sufficient to assume that every dependency without an explicit namespace is a member of the same namespace as the package itself.
Obviously, we can no longer rely on the simple principle that two packages with the same name shall not be installed in the same environment. This must be modified: two packages with the same name and namespace shall not be installed in the same environment. Thus `python:graphviz`, `r:graphviz`, and `:graphviz` may be installed alongside each other.
The `groups` member variable in `conda.Resolve` is a dictionary with package name for keys and package keys (channel, filename) for values. This dictionary would now need to be modified so that its keys are (namespace, package name) pairs; or perhaps a dictionary of dictionaries, with the outer key being the namespace, the inner being the package name.
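
To make the first option concrete, here is a sketch of a grouping keyed by (namespace, package name) pairs. It does not reproduce conda's actual `Resolve` code: `build_groups`, the record layout, and the filenames in the usage note are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_groups(records):
    """Group (channel, filename) package keys by (namespace, name).

    records: iterable of (namespace, name, channel, filename) tuples,
    a made-up layout used only for this sketch.
    """
    groups = defaultdict(list)
    for namespace, name, channel, filename in records:
        groups[(namespace, name)].append((channel, filename))
    return dict(groups)
```

With this layout, `("python", "graphviz")`, `("r", "graphviz")`, and `("", "graphviz")` are three distinct keys, so the rule that no two installed packages share both a name and a namespace becomes a uniqueness check on the key. The dictionary-of-dictionaries alternative would hold the same data as `groups[namespace][name]`.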
Originally, we considered that a package's namespace would be determined by the presence of an anchor package "at any level of its dependency tree." However, it seems clear that we should limit our view to top-level dependencies for several reasons.
- It ensures that the namespace for a package is entirely in control of the package maintainer. Otherwise, the namespace could possibly change if an anchor package were added or removed as a downstream dependency.
- The process of determining a package's namespace can now be accomplished entirely locally, without a tree search algorithm.
- We want to support situations where a package exists in one namespace but depends incidentally on packages from another.
- Conda Build is similarly restrictive when constructing its build strings. If you want `py27`, `py35`, etc. added to the build string (or `py` in the case of a noarch package), then `python` has to be in the `run` dependencies. So package builders are likely adhering to this convention anyway.
Let us expand on point 3 here. Suppose there exists a package that behaves as a
standalone executable, and hence should live in the global namespace, but depends
on Python as its execution engine, and therefore requires Python as a dependency.
In my view, this points to the fact that Python itself must sit in the global namespace. Packages in the `python` namespace will simply include `python` as a dependency, while packages that rely on Python as a dependency but are not in the namespace should include `:python` instead.
Consider the Python package `node`. Presumably this is going to come into conflict with the logical name of the `node` namespace.
I think that for the sake of clarity, we need to at least attempt to disallow such
naming conflicts whenever possible.
However, we may be able to tolerate naming conflicts such as these if we adopt the condition that anchor packages live in the global namespace. Thus `:node` is the anchor package for the `node` namespace, while the Python package is `python:node`. If a user runs `conda install node` in an existing Python environment, it will select `python:node` alone; if they meant to install `:node`, they will have to use the colon notation. On the other hand, when creating a new environment, `conda create -n newenv python node` would include both of the anchor packages.
This is a corner case that we may need to study with a first implementation.
@mcg1969 is there already a meta-issue out there on GitHub I can track?