Summary of thoughts on extensions to, implementations of, ideas about, and protocols surrounding Racket package catalogs

Exposition

A package catalog maps package names to package metadata, which must include a package source. This decouples the name of a package from where its source is located, allowing package authors to declare dependencies on names instead of sources and to relocate the source of their package without breaking the dependency specifications of client packages.

Package metadata must also include a checksum, but other than the source and checksum no keys are required for a properly operating package catalog. However, many tools (particularly raco pkg) use certain metadata fields to their advantage - for instance, a 'dependencies list for installing package dependencies before retrieving the package itself.
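As a rough illustration, a catalog can be pictured as a dictionary from names to metadata. The sketch below uses Python purely for illustration; the package name, source, and checksum are invented, and the helper names are hypothetical rather than part of any existing tool.

```python
# Hypothetical in-memory catalog: package name -> metadata.
# The name, source, and checksum below are invented for illustration.
CATALOG = {
    "foo": {
        "source": "git://github.com/joe/foo",
        "checksum": "0123abcd",
        "dependencies": ["base"],  # optional key used by tools such as raco pkg
    },
}

# The only keys this document treats as mandatory.
REQUIRED_KEYS = {"source", "checksum"}


def missing_required_keys(metadata):
    """Return the required keys that a metadata entry lacks, sorted."""
    return sorted(REQUIRED_KEYS - set(metadata))


def source_of(catalog, name):
    """Look up where a package lives, given only its name."""
    return catalog[name]["source"]
```

Because clients only ever ask for `foo`, the author can change the `source` value without any dependent package needing to change.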

A package server is an HTTP/S webserver that provides a package catalog through specific routes, and additionally provides ways to create new packages and update existing packages. There are other ways to host package catalogs (e.g. through directories and databases), but they will not be covered in this document.

Server Routes

A package server must minimally provide an HTTP/S REST API with the following routes:

GET /pkgs - Returns a list of package names.
GET /pkg/:name - Returns the package metadata for the package with the given name if it exists, and a 404 response otherwise.
GET /pkgs-all - Returns a mapping of package names to package metadata for all packages in the catalog.
PUT /pkg/:name - Either creates a new package or modifies an existing one.
DELETE /pkg/:name - Removes a package from the catalog.
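The routes above can be sketched as a single dispatch function over an in-memory dictionary. This is a minimal sketch, not a real server: the function name `handle`, the `(status, body)` return shape, and the specific 201 and 204 status codes are assumptions that the text above does not specify.

```python
def handle(catalog, method, path, body=None):
    """Dispatch one request against an in-memory catalog dict.

    Returns a (status, body) pair. The status codes other than 404 are
    assumptions made for this sketch.
    """
    if method == "GET" and path == "/pkgs":
        return 200, sorted(catalog)
    if method == "GET" and path == "/pkgs-all":
        return 200, dict(catalog)
    if path.startswith("/pkg/"):
        name = path[len("/pkg/"):]
        if method == "GET":
            if name in catalog:
                return 200, catalog[name]
            return 404, None
        if method == "PUT":
            created = name not in catalog
            catalog[name] = body  # create or overwrite
            return (201 if created else 200), body
        if method == "DELETE":
            if name not in catalog:
                return 404, None
            del catalog[name]
            return 204, None
    # Unknown route or unsupported method; a real server might answer 405.
    return 404, None
```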

Content Negotiation

Responses should take into account the HTTP Accept header for determining what format of data to respond with. The following types are good candidates to support:

application/json - JavaScript Object Notation.
application/racket-data - Text that is read-able by Racket (not sure what name to use for this)
text/html - An HTML web page for a browser to render
application/xml - XML data

The default content type for responses should be application/racket-data, as raco pkg install does not use the Content-Type header.
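One way to pick a response format is sketched below. It is a simplified reading of the Accept header: it honours `q` values and `*/*`, ignores partial wildcards such as `application/*`, and falls back to the default instead of answering 406. The function name and the fallback policy are assumptions made for illustration.

```python
SUPPORTED = [
    "application/racket-data",
    "application/json",
    "text/html",
    "application/xml",
]
DEFAULT = "application/racket-data"


def choose_content_type(accept_header):
    """Pick a supported media type for the given Accept header value."""
    if not accept_header:
        return DEFAULT  # e.g. a client that sends no Accept header
    candidates = []
    for position, part in enumerate(accept_header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        quality = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0  # treat an unparsable q value as "not acceptable"
        # Sort by descending quality, then by order of appearance.
        candidates.append((-quality, position, media_type))
    for neg_quality, _, media_type in sorted(candidates):
        if neg_quality == 0:
            continue  # q=0 means the client refuses this type
        if media_type == "*/*":
            return DEFAULT
        if media_type in SUPPORTED:
            return media_type
    return DEFAULT
```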

Access Control

Package servers wishing to provide access control for changing and deleting a package should do so by verifying an Authorization header and comparing the authenticated user against the package's authors. Some package servers may include a notion of package curators, special users of the catalog who have access to all packages and act as maintainers and administrators of the catalog. Such servers should provide the following routes:

GET /curators - Returns a list of user IDs representing the catalog's curators.
POST /curators - Adds a new user to the list of curators. Only a curator should be able to do this.
DELETE /curators/:id - Removes a curator from the list of curators. Only a curator should be able to do this.
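The permission rules above amount to two small checks, sketched here. The sketch assumes authentication has already produced a user ID and that package metadata carries an `authors` list; both the key name and the function names are hypothetical.

```python
def may_modify(user_id, package_metadata, curators):
    """A user may change or delete a package if they are an author or a curator."""
    return user_id in package_metadata.get("authors", []) or user_id in curators


def add_curator(requesting_user, new_curator, curators):
    """Return an updated curator set; only an existing curator may add one."""
    if requesting_user not in curators:
        raise PermissionError("only curators may add curators")
    return curators | {new_curator}
```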

Dependencies

Package servers that provide the 'dependencies key in package metadata should get this information from the deps specification in the info.rkt file at the package's source. This information may be inaccurate as the package author may not properly state their dependencies. Various tools may analyze a package's source to determine dependency problems. Two categories of dependency problems are:

Known undeclared dependencies, when the package analyzer can prove the package has a dependency on package A but does not declare it in its info.rkt file. This is very common.
Known unused dependencies, when the package analyzer can prove the package has no dependency on package A but declares it in its info.rkt file. This is relatively uncommon.

A 'known-undeclared-dependencies key in package metadata may list the packages in the first category, and a 'known-unused-dependencies key may list those in the second.
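Given an analyzer's output, both keys reduce to set operations. The sketch below assumes the analyzer reports two sets, the packages it can prove are used and the packages it can prove are unused; anything it cannot decide appears in neither. The function name and input shape are hypothetical.

```python
def dependency_problems(declared, proven_used, proven_unused):
    """Compute the two metadata keys from declared deps and analyzer results."""
    declared = set(declared)
    return {
        # Proven to be needed, but missing from info.rkt.
        "known-undeclared-dependencies": sorted(set(proven_used) - declared),
        # Listed in info.rkt, but proven not to be needed.
        "known-unused-dependencies": sorted(declared & set(proven_unused)),
    }
```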

Rings

Package servers may divide packages into rings. A ring is a natural number paired with a subset of packages in the catalog. Every package in a package server with rings must be in exactly one ring. A package server with rings must number the rings starting at zero and increasing by one.

The motivation for rings is to divide the packages based on how reliable, official, stable, etc. they are. The lower the ring, the more reliable, official, stable, etc. it is. In the official Racket package server, ring 0 is reserved for packages in the main distribution and a handful of other packages created by Racket maintainers.

Rings should state whether they are safe to install - that is, whether the ring is for packages whose installation through raco pkg install should not fail. If a ring n is safe to install, then the ring n-1 should also be safe to install.

Safe rings should be completely installable and not depend on the existence of higher rings. That is, if ring 1 is safe, then it should be possible to install all packages in ring 1 safely and it should be possible to do so regardless of whether or not any packages in ring 2 are available, or whether there's a ring 2 at all.

As a result, packages in safe rings should never depend on packages in higher rings. If rings 0 and 1 are safe, then ring 0 packages should only depend on ring 0 packages, ring 1 packages should only depend on ring 0 or ring 1 packages, and ring 2 packages may depend on anything (including ring 42 packages) since the ring system makes no promises that installing them should succeed.
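The dependency rule can be checked mechanically. The following is a minimal sketch that assumes every dependency is itself in the catalog and that the safe rings are exactly rings 0 through `max_safe_ring`; the function name and data shapes are hypothetical.

```python
def ring_violations(package_rings, dependencies, max_safe_ring):
    """List (package, dependency) pairs where a safe-ring package
    depends on a package in a higher ring.

    package_rings: package name -> ring number
    dependencies:  package name -> list of package names it depends on
    """
    violations = []
    for package, ring in sorted(package_rings.items()):
        if ring > max_safe_ring:
            continue  # packages in unsafe rings may depend on anything
        for dep in dependencies.get(package, []):
            if package_rings[dep] > ring:
                violations.append((package, dep))
    return violations
```

With rings 0 and 1 safe, a ring 1 package that depends on a ring 2 package is reported, while a ring 2 package depending on any ring is not.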

The advantage to a ring system is that tools may use the information to decide whether or not to install certain packages automatically. For instance, a service may construct a Docker image with everything in the package server up to a certain ring installed on it and update that image on a schedule, so that apps may depend on that image and not need to install dependencies when building (such an image also makes for a very useful rapid prototyping playground, for instance as a backing service to a web REPL). As another example, DrRacket may be configured to automatically install packages in certain rings when it detects you're using them but haven't installed them. Tools can use the 'modules key in package metadata to detect whether a module that code fails to require is provided by some package.

A package server with rings should provide the following routes:

GET /rings - Returns the number of rings in the catalog.
GET /rings/max-safe - Redirect to /rings/n where n is the maximum safe ring.
GET /rings/:ring - Provide metadata about a given ring, for instance its description and whether it's safe to install packages in this ring.
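Because safety must hold for every ring below a safe ring, the target of the max-safe redirect is simply the last ring before the first unsafe one. The sketch below assumes ring safety is stored as a list of booleans indexed by ring number; the function names are hypothetical.

```python
def max_safe_ring(ring_safety):
    """ring_safety[n] is True when ring n is safe. Return the highest
    safe ring, or None when ring 0 itself is unsafe."""
    highest = None
    for ring, safe in enumerate(ring_safety):
        if not safe:
            break
        highest = ring
    return highest


def safety_is_consistent(ring_safety):
    """Check that no safe ring appears above an unsafe one."""
    seen_unsafe = False
    for safe in ring_safety:
        if safe and seen_unsafe:
            return False
        if not safe:
            seen_unsafe = True
    return True


def max_safe_redirect(ring_safety):
    """Location for GET /rings/max-safe, or None if no ring is safe."""
    ring = max_safe_ring(ring_safety)
    return None if ring is None else f"/rings/{ring}"
```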

Additionally, the package metadata for each package should state which ring it's in. If the package server has curators, then only curators may decide what ring a package is in, and packages that authors upload are initially placed in the highest ring of the catalog.

An example/proposed ring setup for the official package catalog is as follows:

Ring 0 - Main distribution. Only those libraries and packages provided by installing the full Racket language should go here.
Ring 1 - Curated list of user packages. Hand picked by Racket maintainers based on whether the package has documentation and tests, respects backwards compatibility, and is well structured with regard to standard Racket idioms, and on whether its authors are responsive to user feedback. Exceptions to any of the above may be allowed as decided by the curators.
Ring 2 - Packages that install correctly with no issues, document all exports, have tests, and have no conflicts with ring 0-2 packages.
Ring 3 - Packages that install correctly and have no conflicts. Documentation and tests optional.
Ring 4 - Packages that install successfully without conflicts, but have issues such as missing dependency declarations on packages in rings 0-3.
Ring 5 - Packages that fail to install or have conflicts.
Ring 6 - New packages that are unknown.

Given this setup, DrRacket could be configured by default to auto-install packages in rings 0 and 1. This allows a user to develop with only the minimal Racket installation on their machine and lazily install packages as they require modules the packages provide, while still knowing everything on their machine has been inspected by the core language maintainers. A more daring user could configure DrRacket to auto-install up to ring 3, installing any package whose installation doesn't crash.
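The auto-install decision could look roughly like the sketch below. It assumes each package's metadata has a `ring` number and a `modules` list of module paths, and it prefers the lowest ring when several packages provide the same module; the key shapes, the tie-breaking rule, and the function name are all assumptions.

```python
def package_to_auto_install(missing_module, catalog, max_auto_ring):
    """Return the name of a package that provides `missing_module` and sits
    in a ring no higher than `max_auto_ring`, or None if there is none."""
    candidates = [
        (metadata["ring"], name)
        for name, metadata in catalog.items()
        if missing_module in metadata.get("modules", [])
        and metadata["ring"] <= max_auto_ring
    ]
    # Lowest ring wins; ties are broken alphabetically by package name.
    return min(candidates)[1] if candidates else None
```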

Additionally, this helps promote good practices and well designed packages that make their way to ring 1. This helps Racket developers learn new techniques from the codebases of these packages, and gives particularly useful packages greater publicity and discoverability.

A package server may also choose to pre-optimize package installations up to a certain ring, by building bytecode and/or binaries ahead of time.

Versioning

A package developer should be able to define specific versions of their package and communicate this to a package server. A package client should be able to request a specific defined version of a package. This could be handled by having a list of package sources for a name, rather than a single source. The first source is treated as HEAD: it is the constantly-updating live version of the package. All other sources are static and map to a specific, defined version. For instance, for a package foo on GitHub, a developer joe could specify three sources: git://github.com/joe/foo (a git repo), git://github.com/joe/foo#v1.0 (a branch named v1.0 in the repo), and git://github.com/joe/foo#v2.0 (a tagged commit with tag v2.0). The package server would interpret this as follows:

Look up the info.rkt file of each of the non-HEAD package sources and record the versions declared.
Compute some sort of checksum on the sources of the non-HEAD package sources and record it
Require that the non-HEAD package sources do not change their contents
By default, have all requests for the package foo use the latest versioned source
Have requests specifically for #:version 'head use the HEAD source. Serving versioned sources by default helps make package installation secure against attacks that modify the source of the package, as the package server can check integrity before responding with the package source.
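The steps above can be sketched as a lookup over a per-package entry. The entry layout, the checksums, the dotted-number version ordering, and the fallback to HEAD for packages with no versions are assumptions of this sketch, as are the function names.

```python
def version_key(version):
    """Order dotted versions numerically, so that 10.0 sorts after 2.0."""
    return tuple(int(part) for part in version.split("."))


def source_for_request(entry, version=None):
    """Pick the source to serve for a request.

    entry = {"head": <source>,
             "versions": {<version>: {"source": ..., "checksum": ...}}}
    """
    if version == "head":
        return entry["head"]
    versions = entry["versions"]
    if version is None:
        if not versions:
            return entry["head"]  # unversioned package: only HEAD exists
        version = max(versions, key=version_key)  # default to latest version
    return versions[version]["source"]


def is_unchanged(entry, version, current_checksum):
    """True when a versioned source still matches its recorded checksum."""
    return entry["versions"][version]["checksum"] == current_checksum
```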

This requires developers to "opt-in" to versioning, which means the majority won't. Perhaps security warnings for installing unversioned packages would be appropriate, as by nature it's not easy to verify the contents of a package source that makes no guarantees its contents won't change. This could work with the ring system as well - packages below a certain ring must be versioned.

Additionally, specialized separate tools could watch GitHub repos for new tagged commits of the form v<version> and update the package source with new versions in response. This would be decoupled from the package server, acting as an automated tool using the package author's credentials.
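Such a tool needs to recognise version tags and turn them into package sources. The sketch below assumes versions are dotted numbers and that a tag is addressed by appending `#<tag>` to the repository source, as in the example sources above; fetching the tag list from a repository is left out, and the function names are hypothetical.

```python
import re

# Matches tags such as v1.0 or v2; dotted-number versions are an assumption.
TAG_PATTERN = re.compile(r"^v(\d+(?:\.\d+)*)$")


def versions_from_tags(tags):
    """Map each version number to the tag that declares it."""
    versions = {}
    for tag in tags:
        match = TAG_PATTERN.match(tag)
        if match:
            versions[match.group(1)] = tag
    return versions


def versioned_sources(repo_source, tags):
    """Build a version -> package source mapping for a repository."""
    return {
        version: f"{repo_source}#{tag}"
        for version, tag in versions_from_tags(tags).items()
    }
```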

Events

A package server needs some sort of push notification system. This is necessary for building decoupled automated tools around the package server. Applications include:

Getting a package's info.rkt file and using it to auto-populate more information about the package.
Triggering a build server in response to updated package details like changing the source.
Notifying a user on failed builds
Notifying a user on malformed package information, like package sources
Auto-assigning rings to packages based on build output
Notifying curators when a package author opens a request for admittance to a curated ring
Notifying completely external integration services, like chatrooms and logging
Notifying more sophisticated automated package maintenance tools, like a tool that automatically fixes a package that fails to declare all its dependencies.
Notifying a user on download / installation milestones ("Your package has been installed one thousand times!")
Notifying external services whenever packages change rings, for instance to allow an email service to send subscribed users emails about when packages get newly accepted into a curated ring.

Additionally, it needs to be possible for clients to register to receive events, both all of them and filtered by subcategories such as events for specific packages or events relating to specific services.
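Filtered registration can be sketched with an in-process publish/subscribe object. A real package server would deliver events over the network (for example through webhooks), which this sketch does not attempt; the class, the event dictionary shape, and the `kind` and `package` field names are assumptions.

```python
class EventBus:
    """In-process stand-in for a package server's notification system."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback, package=None, kind=None):
        """Register a callback; None for a filter means "match everything"."""
        self._subscribers.append((package, kind, callback))

    def publish(self, event):
        """Deliver an event to every matching subscriber; return the count."""
        delivered = 0
        for package, kind, callback in self._subscribers:
            if package is not None and event.get("package") != package:
                continue
            if kind is not None and event.get("kind") != kind:
                continue
            callback(event)
            delivered += 1
        return delivered
```

A build service might subscribe with `kind="source-changed"` only, while an author's notification client subscribes with `package="maria"` to see everything about one package.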

Package Grouping

It's very common for one "logical" library to be split up into many packages. Possible reasons include:

Separating the package into foo-lib, foo-test, foo-doc, and foo packages so that clients can opt-out of installing the test and doc code.
Providing foo2, foo3, etc. packages when a major backwards-incompatible release occurs.
Putting experimental features in a foo-unstable package
Providing GUI-related functionality in a foo-gui package to avoid GUI dependencies in clients that don't need them

A good package client would group all these packages together into one entry that contains details about how the library is divided into subpackages. To support this, tags may be used (all packages in the group are tagged "foo") or some other more-specific feature may be required on the package server side.
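The tag-based approach could be as simple as the following sketch, which assumes package metadata has a `tags` list and that every package of a library carries the library's name as a tag. The key name and function name are hypothetical.

```python
def library_group(catalog, library):
    """Return the names of all packages tagged with the library's name."""
    return sorted(
        name
        for name, metadata in catalog.items()
        if library in metadata.get("tags", [])
    )
```

A client could then render one entry for `foo` and list `foo-lib`, `foo-doc`, `foo-test`, and so on beneath it.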

Example workflow for hypothetical package server with lots of bells and whistles:

User [email protected] specifies to the package server she would like to upload a package named maria to the server with source location git://github.com/maria/maria-racket-pkg and herself as the sole author.
Client side checking validates the given package source is properly formatted as a package source, but does not ensure a package actually exists at the location specified.
The package server records this information in persistent storage, and the package is now available at GET /pkg/maria.
The package server sends a package creation event specifying that the package maria was created.
A package source validation service listening for package creation events sees this event and schedules a job to check the validity of the package source. When run, that job (in an isolated environment) verifies the location specified by the package source contains an info.rkt file in the manner appropriate for the given kind of package source. When complete, the job sends a package source validation event specifying whether the package source is valid. Additionally, a PATCH /pkg/maria request is made specifying whether the package source is valid in a 'source-validity key containing a dict with a last-checked? key containing the current date and a valid? key containing a boolean indicating whether the package source is valid.
In the event of package source validation failure, an email notification service listening for such events sends an email notification to [email protected] notifying of the malformed package.
In the event of package source validation success, a package info service retrieves everything defined in the package's info.rkt file and sends a PATCH /pkg/maria request specifying that a dictionary containing all definitions in the info.rkt file will be placed into the 'info key of the package's metadata. After success, a package info updated event is fired.
8, 9, 10, 11, ... Builds scheduled, checksum calculated, dependencies fixed, documentation built and updated on a hosted site, rings adjusted, etc. etc. (will flesh out later)
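The body of the source validation PATCH described in the workflow might be built as follows. PATCH semantics are still listed as a TODO, so the shallow merge shown here is only one possible interpretation; the function names and the ISO date format are assumptions.

```python
from datetime import date


def source_validity_patch(valid, checked_on=None):
    """Build the metadata fragment a validation job would send."""
    checked_on = checked_on or date.today()
    return {
        "source-validity": {
            "last-checked?": checked_on.isoformat(),
            "valid?": bool(valid),
        }
    }


def apply_patch(metadata, patch):
    """Shallow-merge a patch into package metadata, returning a new dict."""
    updated = dict(metadata)
    updated.update(patch)
    return updated
```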

TODO

PATCH requests
Specification of automated builds (may not be necessary if events and user accounts allow for easily pluggable services)
Details about more automated services, such as dependency fixers and documentation hosting
Private catalogs
Authors routes and details about how user account management works
Docker implementation architecture ideas
Automated analysis of stability, for instance whether the package uses anything in the unstable collection or depends on packages whose semver indicates they're unstable.
Dependency analysis, for instance to identify packages that many other packages depend on and be more paranoid about things like stability analysis.

jackfirth/packages.md