Module Management for HOL4

WIP: this gist is being updated as it is being written.

This describes the design for a system to ease building/downloading and installing HOL4 and working with HOL theories. The hoped for advantages are:

Less building locally: for modules that are just being used, they can be downloaded pre-built and immediately used.
Easy to get up-and-running with HOL4 as a new user. No complex or esoteric build process.
Versioning and explicit dependence on versions. May avoid unnecessary rebuilding (although this requires care: only one version of a theory can exist in any well-formed theory graph).
Explicit tracking of which versions of which modules are installed: no timestamp-based out-of-date dependencies at runtime.
Better handling of examples: currently they either "make it" into the HOL GitHub examples directory or just live in some unknown repository in the ether. With modules we could have a centralised explorable repository at hol-theorem-prover.org. For example, CakeML releases could be actual module releases and their dependence on other HOL modules would be fine-grained and explicit (not just a corresponding commit of the entire HOL repo).
Better support for working on multiple HOL-based developments at once, including at different git versions: no more messing around with HOLDIR, CAKEMLDIR, etc. environment variables to set the environment up just right. Everything can be much more explicit and live together without conflict.

Design principles:

Follow existing practice whenever it is not too costly, to minimise friction in adopting the new system.
Keep it simple - get it done fast, while making sure it is maintainable.
Soundness is obviously paramount, but centralised trust is fine. There should always be an option to reverify everything locally if you want.

What is a module?

The point of a module is to create, when imported, an interactive environment in which some SML values are available, including especially values of type thm. Thus, importing modules should be understood as in the context of a dynamic SML environment called "the session".

Modules depend on each other explicitly. Applying updates to the session in the right order, respecting dependency, in order to import a module, is the job of the module management system.

A module consists (conceptually) of the following:

name: unique, together with the type, in the single global namespace of HOL4 modules.
type: either Theory or Library.
version: a natural number
exports: bindings (names for SML values) made available on import.
state: runtime state for the module in the session, including e.g. theorem-set data and other databases.
- There is only one instance of a module's state in the session.
- Code in modules is guaranteed to be run in any dependency-respecting order, and only once per module.
- Theory modules have no state.
effects: (optionally excludable?) updates to the session (including primarily the state of other modules) that can be applied on import.
- Library modules have no effects but their exports may include procedures that affect their (and/or other libraries') state.
- Theory modules would typically have effects on the ancestry graph, parser grammar, theory database, inductive type database, etc.
code: HOL4-flavoured SML code (i.e. to be preprocessed by the quote parser) to run when a module is imported that:
- (a) creates the module's state,
- (b) applies the module's effects,
- (c) makes the exports available to the session.
- The code of a module includes explicit imports of other modules, thereby specifying the dependency graph on modules.
- The code for a Theory module is the Script file from pre-modules HOL4. The module manager will also save the exports and effects in a format (e.g. the .dat files in pre-modules HOL4) for more quickly importing a theory without rerunning its code.
- Some special modules might have additional files they need, e.g. maybe they use code in another language. We can figure out how to deal with this as the need arises.

This is meant to basically recover the current practice of fooTheory and fooLib SML module usage in pre-modules HOL4, but with versioning and more opportunities for using pre-built contents.

Should it be permissionless to upload modules? I think probably not - closer to status quo is to have some maintainers of the official module repository. This avoids having to protect against abuse of the namespace in the design: we can leave that to the maintainers' judgement. (Otherwise maybe names should also include an author and maybe also their signature, for extra disambiguation and protection against taking over a module by upgrading it.)

The respository should build modules to ensure that they build successfully and that their version dependencies are indeed sufficient. When a new module is submitted for upload, it may be rejected if a module that depends on it breaks -- it is up to the maintainers to resolve this (maybe the depending module has a too loose import version range, or maybe the updated module is breaking semver expectations with its change).

Using the module manager

The module manager is a command line executable for which it is easy to download a prebuilt binary (from hol-theorem-prover.org) for whatever platforms we want to support. Let us call it hmm for now.

All modules in the official repository are "available" for import. The job of hmm is to maintain a local cache of the ones the user wants for faster importing. Modules (at specific versions) that are locally cached are considered "installed".

There should be just one cache location. By default $XDG_CACHE_HOME/hol, but configurable via updating a reference in .hol-config.sml (which should perhaps also be allowed and encouraged to exist in $XDG_CONFIG_HOME/hol).

hmm should be able to list (and search by name for) all available modules in the repository (indicating which are installed), and also list just the modules (and versions) that are actually installed.

hmm should be able to install a module given a name (and optionally a desired version). It will install the highest available (or requested) version and a compatible set of versions of all its dependencies.

hmm should be able to uninstall (delete from cache) modules by name, to clean up unneeded modules (according to dependencies of installed modules), and to upgrade modules to newer versions that are still compatible with their dependents.

hmm should be able to "rebuild" modules for verification purposes:

run a Theory module's code to create its imports (possibly in an interactive session) assuming its dependencies are fine
run a Theory module and all its dependencies from scratch, to create its imports in a session without any DISK_THM oracles The result here doesn't have to be an interactive session, it could just be some command line output indicating that reverification was successful.

hmm should be able to launch hol at the session of a particular module. Usually this would be a module that is under development. Under-development modules might need some special version treatment when in that state: probably they would be "locally" installed but without a corresponding published entry in the official repository.

Writing modules

Modules are defined by the files that contain their code. The name and type of a module is determined by its filename: fooScript.sml is used for a Theory module whose name is foo, and fooLib.sml is used for a Library module whose name is foo.

The version of the module is specified implicitly by the module management system: it is either a published version from the repository, or, if the code is being developed and has been changed, the next highest unpublished version.

Dependencies are specified using Import syntax, as is being discussed in this issue. Modules to be imported are specified by name and type. The version is implicit: it is determined as the version for the dependency in an optimal (cannot increase any without breaking the build) set of versions of all dependencies for the target module. There may be need to be support for explicitly requesting a version for an Import, especially for modules under development - but I think these should not be published.

The exports of Theory modules are defined implicitly as currently (i.e. in the way that pre-modules HOL4 automatically produces a Theory.sig SML signature for a script file).

The exports of Library modules should be defined explicitly either within the fooLib.sml file or (better matching common practice) in an accompanying fooLib.sig file, and should be defined as an SML signature named fooLib. The code of a Library module should be define as an SML module conforming to the signature with the same name (again matching current practice).

WIP

interactive development, how to create the session for interactive development, how to change dependencies on the fly, how to reset the session state, etc.
Uploading (publishing) modules to the repository of modules, making releases
Practicalities of developing Library modules
Text editor integration

Soundness of importing pre-built (not-reproved) theorems

For each module, the session it creates when loaded should be as-if all its dependencies were loaded (once each) in some dependency-respecting order, modulo the use of the DISK_THM oracle to shortcut reproving many theorems. The resulting logical environment is sound as long as DISK_THM is only used in a safe way, i.e. skipped proofs would have worked if they were run instead of skipped and would have produced the same logical content.

The module repository is curated so that there is a set of versions of each published (and accepted for inclusion) module that can be downloaded, and the set includes at least the latest version.

Our aim for published installed modules is that they can be loaded. To achieve this, we impose the following invariant on the module repository: for each module X with a version V available from the repository, there is a known set of versions D for all its dependencies such that X (at V) can be loaded successfully, and all versions in D are also available.

Ideally, we also strive to ensure that no later set of versions D' (i.e. for each module the version in D' is no less than the version in D) that can also be used to load X (at V). This is not enforced as an invariant, though, because we want to allow for progress to be made on some modules without waiting for other dependent modules to be updated that might be being maintained at a slower pace.

When downloading and installing X (at V) from the repository, we simply download all the dependent modules at the version specified in D, as well as X at V itself. The content of what is downloaded should be sufficient for hmm to create a session for X at V.

To maintain the invariant when a new version of a module (or a new module altogether) is added to the repository, we require the uploader to provide a working set of versions for its dependencies and, as repository managers, check that the full build is possible at these versions before publishing.

The repository managers also attempt to update the set dependency versions for modules that are dependent on the updated module. If the build succeeds with a version updated, that updated version replaces the old one in the set. If not, the dependent package is marked as needing to be updated to support the newer version of its dependencies.

How the work of updating is done is up to the repository curators: it could mean assigning the task to a maintainer (or some other volunteer), renaming the module's old version so it can be depended on without expecting any more updates, dropping some module from the repository, or some other solution.

Implementation thoughts

WIP

How to get to here from pre-modules HOL4 and Holmake
Role of Holmake code in the module manager
How and when to use Poly/ML heaps to improve loading speed