We need to PEPify a static format for writing down bootstrap information in Python source trees. The initial target is a list of PEP 508 package requirement strings. It's possible that in the future we might want to add more features like a build system backend specification (as in PEPs 516, 517), or an extension namespace feature to allow third-party developer tools (flit, pytest, coverage, flake8, etc.) to consolidate their configuration in this file in a systematic way without bumping into each other.
This file will be a central part of the Python package developer user experience, and since its role is to provide bootstrap information it will be rather difficult to change our minds about its format later. (The goal is that you can change what build system you use by editing the bootstrap file... but you can't change what bootstrap file you use by editing the bootstrap file.)
There are a number of perfectly workable options. But given its central role in developer experience, and that I want the Python packager developer experience to be one of pleasure and joy (no really), it seems worth lining up the contenders so that we at least know exactly what the trade-offs are.
In this document I review the four main options that have been suggested:
- JSON
- YAML
- ConfigParser (as used in setup.cfg)
- TOML
YAML is a widely-used data structure format. I'll use it to introduce my running example.
This example is meant to give a general flavor of what our eventual bootstrap files might look like -- I'm not actually proposing anything here as an actual standard, but hopefully it's enough to get a sense of how the different formats would feel in actual usage. Each document includes a schema-version as a hedge for future extensibility, a list of PEP 508 bootstrap requirement strings, and an extension entry to see what it would look like if we allow tools like flit to add their own namespaced configuration. In YAML, it looks like this:
    schema-version: 1  # optional
    bootstrap-requirements:
      # Temporarily commented out 2016-01-10
      # - magic-build-helper
      - setuptools >= 27
      - numpy >= 1.10  # for the new frobnicate feature
      # Pinned until we get a fix for
      # https://github.com/cyberdyne/the-versionator/issues/123
      - the-versionator == 0.13

    # The owner of pypi name "flit" decides what goes under the
    #    extension: flit:
    # key
    extension:
      flit:
        whatever: true
Running this through PyYAML produces:
    # Python
    {
        "schema-version": 1,
        "bootstrap-requirements": [
            "setuptools >= 27",
            "numpy >= 1.10",
            "the-versionator == 0.13",
        ],
        "extension": {
            "flit": {
                "whatever": True,
            },
        },
    }
In my experience, YAML is full of subtle and hidden gotchas -- sometimes you need quotes for mysterious reasons, small errors tend to produce YAML that's valid but meaningless, and so forth. I don't actually understand how the parsing works, and no-one I know understands how the parsing works either. The specification is 80 pages of dense text (and that's the 1.1 spec, which most implementations seem to have settled on -- the 1.2 spec is different and subtly incompatible). This is a fuzzy metric, but a real issue -- dealing with YAML doesn't help me be a better happier person. My main experience of YAML is cursing at the screen because why did it just do that wtf, and this feeling seems to be wide-spread. On the other hand, the reason it's wide-spread is that YAML itself is wide-spread, and lots of people are at least familiar with it.
JSON needs no introduction. In JSON our example looks like:
    {
        "schema-version": 1,
        "bootstrap-requirements": [
            "setuptools >= 27",
            "numpy >= 1.10",
            "the-versionator == 0.13"
        ],
        "extension": {
            "flit": {
                "some-flag": true
            }
        }
    }
And here's the Python parse:
    # Python
    {
        "schema-version": 1,
        "bootstrap-requirements": [
            "setuptools >= 27",
            "numpy >= 1.10",
            "the-versionator == 0.13",
        ],
        "extension": {
            "flit": {
                "some-flag": True,
            },
        },
    }
The nice thing is that the Python and JSON versions are almost identical. The not-so-nice thing is that we had to strip out all the comments. Plus there are finicky annoyances like the lack of support for trailing commas, which trips up human editors and makes diffs harder to read.
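To make those two complaints concrete, here is a minimal sketch using the standard-library json module. The helper name parses and the cut-down inputs are made up for illustration.

```python
import json


def parses(text):
    """Return True if `text` is accepted by the stdlib JSON parser."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Strict JSON: accepted.
ok = '{"bootstrap-requirements": ["setuptools >= 27", "numpy >= 1.10"]}'

# Trailing comma after the last list item: rejected.
trailing_comma = '{"bootstrap-requirements": ["setuptools >= 27", "numpy >= 1.10",]}'

# A "#" comment: rejected, because JSON has no comment syntax.
with_comment = '{"schema-version": 1  # optional\n}'

for label, text in [
    ("plain", ok),
    ("trailing comma", trailing_comma),
    ("comment", with_comment),
]:
    print(label, "->", "accepted" if parses(text) else "rejected")
```

Only the first input is accepted; the other two raise json.JSONDecodeError.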
ConfigParser is an INI-like format built into the stdlib. It has many configuration options that affect the file format; the configuration traditionally used by setup.cfg is `RawConfigParser with its default settings <https://github.com/pypa/setuptools/blob/04d10ff025e1cbef7ec93a2008c930e856045c8a/setuptools/command/setopt.py#L43>`, and these defaults are listed here.
This format has the following attributes:
- The key namespace is hierarchical with exactly 2 levels: it maps (<section>, <keyname>) tuples to values.
- All values are strings. (But multi-line strings are supported by indenting continuation lines.)
- Both = and : are allowed as assignment characters.
- Comments are allowed using either # or ;.
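These attributes can be checked with a short sketch. It assumes Python 3 (where the module is spelled configparser) and RawConfigParser with its default settings; the section and key names are invented for illustration.

```python
import configparser

SAMPLE = """\
# full-line comments may start with '#'
; ... or with ';'
[bootstrap]
schema-version = 1
backend: flit
requirements =
    setuptools >= 27
    numpy >= 1.10
"""

cp = configparser.RawConfigParser()
cp.read_string(SAMPLE)

# Two levels only: (section, key) -> value.
version = cp.get("bootstrap", "schema-version")
backend = cp.get("bootstrap", "backend")  # ':' works like '='
requirements = cp.get("bootstrap", "requirements")

print(repr(version))       # a string, not an int
print(repr(backend))
print(repr(requirements))  # one string with embedded newlines
```

Anything richer than a string (an integer, a list of requirements) has to be recovered by the caller, for instance by splitting the multi-line value on newlines.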
Example:
    ; (this is innocuous-looking but broken, see below)
    [schema]
    version = 1 ; optional

    [bootstrap]
    requirements =
        setuptools >= 27
        ; Temporarily commented out 2016-01-10
        ; magic-build-helper
        numpy >= 1.10 ; for the frobnicate feature
        ; Pinned until we get a fix for:
        ; https://github.com/cyberdyne/the-versionator/issues/123
        the-versionator == 0.13

    ; The owner of pypi name "flit" decides what goes under the
    ;    extension.flit
    ; section
    [extension.flit]
    whatever = True
Or well... no, actually, the above file is broken on both Python 2 and Python 3, but in different ways. On Python 2, the version line is parsed correctly (because ; comments are allowed to begin in the middle of a line -- though # comments are not), but comments are not recognized inside multiline values, so the requirements entry gets all the comments mixed in:
    # Python 2
    {
        "schema": {
            "version": "1",
        },
        "bootstrap": {
            "requirements": "\nsetuptools >= 27\n; Temporarily commented out 2016-01-10\n; magic-build-helper\nnumpy >= 1.10 ; for the frobnicate feature\n; Pinned until we get a fix for:\n; https://github.com/cyberdyne/the-versionator/issues/123\nthe-versionator == 0.13"
        },
        "extension.flit": {
            "whatever": "True",
        },
    }
On Python 3, comments are recognized inside multi-line values, but are never allowed to begin in the middle of a line, so we instead get:
    # Python 3
    {
        "schema": {
            "version": "1 ; optional",
        },
        "bootstrap": {
            "requirements": "\nsetuptools >= 27\nnumpy >= 1.10 ; for the frobnicate feature\nthe-versionator == 0.13",
        },
        "extension.flit": {
            "whatever": "True",
        },
    }
So compared to the Python 2 parse, some (but not all) of the comments under the "requirements" key have disappeared -- but under the "version" key, a new comment has snuck in.
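The Python 3 half of this is easy to reproduce. The sketch below feeds a shortened version of the broken example to configparser.RawConfigParser with default settings; it assumes Python 3 and does not try to reproduce the Python 2 behavior.

```python
import configparser

BROKEN = """\
[schema]
version = 1 ; optional

[bootstrap]
requirements =
    setuptools >= 27
    ; Temporarily commented out 2016-01-10
    ; magic-build-helper
    numpy >= 1.10 ; for the frobnicate feature
    the-versionator == 0.13
"""

cp = configparser.RawConfigParser()
cp.read_string(BROKEN)

# The inline comment is kept as part of the value.
version = cp.get("schema", "version")

# Full-line comments inside the multi-line value are dropped,
# but the inline one after "numpy" survives.
requirements = cp.get("bootstrap", "requirements").split("\n")

print(repr(version))
print(requirements)
```

Passing inline_comment_prefixes to the parser changes this behavior, but the setup.cfg configuration described above leaves it at the default.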
The obvious workaround here is to teach everyone to stick to the common subset of Python 2 ConfigParser and Python 3 ConfigParser, so that comments appear only at the beginning of lines and never in the middle of multi-line values.
    ; ConfigParser, corrected example
    [schema]
    ; version is optional
    version = 1

    [bootstrap]
    ; numpy 1.10 needed for the frobnicator feature
    ; the-versionator is pinned to 0.13 until we get a fix for:
    ;    https://github.com/cyberdyne/the-versionator/issues/123
    requirements =
        setuptools >= 27
        numpy >= 1.10
        the-versionator == 0.13
    ; Temporarily commented out 2016-01-10
    ;    magic-build-tool
The trade-off is that we've had to rearrange and rewrite the comments in awkward ways, since we can no longer place the comments next to the things being commented on.
Also, as far as I can tell from testing and web searches, in Python 2 ConfigParser has no support at all for unicode:
    -- test.cfg --
    [metadata]
    author = Stéfan van der Walt

    >>> sys.version
    '2.7.11+ (default, Apr 17 2016, 14:00:29) \n[GCC 5.3.1 20160409]'
    >>> import ConfigParser
    >>> cp = ConfigParser.RawConfigParser()
    >>> cp.read("test.cfg")
    >>> cp.items("metadata")
    [('author', 'St\xc3\xa9fan van der Walt')]
Fortunately, this does not cause immediate problems for the bootstrap requirements use case, because PyPI mandates that all distribution names be ascii-only. But it does mean that if in the future we ever want to add new build metadata that is genuinely textual, then we'll either need to add a new file in a better-defined format, or else define an extended file format -- something like [ConfigParser + a mandatory post-processing step of calling .decode("utf-8") on all values].
Potentially a setup.cfg PEP could fix up the comment handling in a similar manner, by defining and mandating a post-processing step that strips out comments from values according to some PEP-defined grammar.
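Such a post-processing step might look roughly like the sketch below. The grammar is purely an assumption made up for illustration (skip lines starting with ; or #, and cut anything after whitespace followed by ; or #); no PEP defines it, and the function name strip_comments is hypothetical. Note that PEP 508 itself uses ; to introduce environment markers, so a real grammar would have to be more careful than this.

```python
import re

# Hypothetical rule: a comment starts at ';' or '#' when that character is
# at the start of a line or is preceded by whitespace.
_INLINE_COMMENT = re.compile(r"\s+[;#].*$")


def strip_comments(value):
    """Split a raw multi-line value into entries with comments removed."""
    entries = []
    for line in value.splitlines():
        line = line.strip()
        if not line or line[0] in ";#":
            continue  # blank line or full-line comment
        entries.append(_INLINE_COMMENT.sub("", line))
    return entries


# A shortened version of the value Python 2 produced for the broken example.
raw = (
    "\nsetuptools >= 27"
    "\n; Temporarily commented out 2016-01-10"
    "\n; magic-build-helper"
    "\nnumpy >= 1.10 ; for the frobnicate feature"
    "\nthe-versionator == 0.13"
)

print(strip_comments(raw))
```

With a step like this applied to every value, the Python 2 and Python 3 parses of the broken example would yield the same three requirements.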
OTOH, advantages of ConfigParser include that (a) it's in the stdlib, (b) setup.cfg is a thing that has some precedent.
TOML is a relatively new contender in the config format races; possibly its most prominent deployment so far is that it's been used for some years as the standard format for Rust package metadata.
TOML is basically the good parts of INI/ConfigParser (human friendliness) crossed with the good parts of JSON (consistent and unambiguous grammar supported across lots of languages + a simple yet rich data model -- TOML keeps JSON's string-keyed-dicts, lists, bools, floats, and strings; drops null; and adds real integers and datetimes). The specification is short and contains many examples. The Rust Cargo docs contain many more examples of using TOML to configure a build system.
Our running example:
    schema-version = 1

    bootstrap-requirements = [
        "setuptools >= 27",
        # Temporarily comment this out 2016-01-10
        # "magic-build-tool",
        "numpy >= 1.10",  # for the new frobnicate feature
        # Pinned until issue #123 is fixed:
        "the-versionator == 0.13",  # <- trailing comma ok, unlike JSON
    ]

    # The owner of pypi name "flit" decides what goes under the
    #     extension.flit
    # key
    [extension.flit]
    whatever = true
(Note the Python-like list syntax and mandatory string quoting.) This parses into a Python data structure like:
    {
        "schema-version": 1,
        "bootstrap-requirements": [
            "setuptools >= 27",
            "numpy >= 1.10",
            "the-versionator == 0.13",
        ],
        "extension": {
            "flit": {
                "whatever": True,
            },
        },
    }
Unicode is fully supported -- TOML's string type is unicode, compliant TOML files are required to be encoded in UTF-8, and pytoml handles this correctly on all Python versions:
    -- test.toml --
    author = "Stéfan van der Walt"

    >>> sys.version
    '2.7.11+ (default, Apr 17 2016, 14:00:29) \n[GCC 5.3.1 20160409]'
    >>> import pytoml
    >>> pytoml.load(open("/tmp/test.toml"))
    {u'author': u'St\xe9fan van der Walt'}
(NB though that the toml package doesn't seem to handle unicode correctly on py2, so stay away from that one.)
So as far as all this goes, TOML seems like the no-brainer best option. But the potential downsides for TOML aren't about the technical features of the language -- they're about its relative immaturity compared to the other options above. So I spent a bit of time today trying to dial in exactly what its status is.
The specification: The latest version of the TOML specification is v0.4.0, released Feb. 2015. It has a scary warning at the top: "Be warned, this spec is still changing a lot. Until it's marked as 1.0, you should assume that it is unstable and act accordingly."
This doesn't seem to be a wholly accurate reflection of their actual behavior. There are implementations for many languages and a slightly out-of-date compatibility test suite. I went back and looked at what they changed from 0.3.1 to 0.4.0, and not only were the changes small, but they actually worked with the Rust developers to check that every existing Cargo.toml file remained valid both before and after the changes. One of the two main developers wrote recently that "I'd personally be against most or all breaking changes at this point---too much has become de facto stable.".
There are almost certainly some edge cases and incompatibilities remaining to be discovered and clarified in the spec and implementations; none of these seem likely to affect our core use cases of basic strings and lists and so forth, and it's much better specified than ConfigParser. Presumably any really dire issues that might affect us have already been uncovered by Rust, given their similar use case.
I think that what all this means for us is that if we were to go with TOML, we'd just specify that our bootstrap file format is TOML v0.4.0 -- which is a stable document, by definition :-) -- and then once they finally release a v1.0.0, we can look at the changes and decide whether we want to update. Most likely, it will be tiny compatibility-preserving improvements, in which case all is fine; or if not, then we (and Rust, and others) will stick with the old version, which is exactly the same situation as happened with YAML. ("YAML" to most people means "YAML 1.1"; supposedly YAML 1.2 is the latest version, but ~nobody supports it.)
TOML implementations: As mentioned above, the best TOML parser for Python currently appears to be pytoml. It's TOML v0.4.0 compliant, passes the TOML test suite (which appears to give pytoml >90% statement coverage), and the complete parser is 300 lines of code (plus another 100 lines for the TOML writing support). (Compare to PyYAML, which is >4200 lines of code.) Nominally, pytoml only supports Python 2.7 and 3.4+, while pip also supports 2.6 and 3.3. It turns out that this is trivially fixable, though: it took me about 15 minutes to add 2.6 and 3.3 support.
This would be an extra library that the pip maintainers would have to vendor. My impression is that this is a relatively low cost endeavour compared to the other libraries that pip vendors, given that it's a small library without external dependencies, and that it performs a fixed task processing trusted input, so it's unlikely to see much churn. However, I don't know what the pip maintainers think of this.
I don't know if pytoml's maintainers have any opinion on the prospect of suddenly finding themselves upstream for pip.
Personally, I would sum up the above as:
|                             | YAML | JSON | CP  | TOML |
|-----------------------------+------+------+-----+------|
| Well-defined                | yes  | yes  |     | yes  |
| Real data types             | yes  | yes  |     | yes  |
| Sensible commenting support | yes  |      |     | yes  |
| Consistent unicode support  | yes  | yes  |     | yes  |
| Makes humans happy          |      |      | yes | yes  |
I personally started this hoping that writing all this down would reconcile me to the momentum behind setup.cfg, but unfortunately it did the opposite... Given all of the above, I tend to think the trade-offs fall in favor of TOML. I'd be willing to contact the pytoml maintainers to get their perspective, and having taken a look at the code I'd be willing to take on the responsibility of maintaining pytoml if worst came to worst and it turned out we needed to fork it (because upstream didn't want to deal with suddenly having so many users / because the TOML specification authors decide to switch to an XML-based format / because ...). I think that'd be a reasonable price for making Python packaging more fun and enjoyable.
Or if we end up going with something else, then oh well, hopefully this document is still useful to make sure we know and can write down whatever trade-offs we end up making.
I like the TOML format, but I think a format that has built-in Python support (so either ConfigParser or JSON) should have been picked. Unless there is a plan for the standard library to have TOML support?