Testing Mars

Software development theorist Michael Feathers, in his classic Working Effectively with Legacy Code, defined “legacy code” as “simply code without tests.” More specifically, Feathers referred to specific and auditable instantiations of tests as code. By this definition, much of Urbit's codebase is de facto legacy code. Our objective is to make Urbit demonstrably robust, capable of providing guarantees, and quantifiably amenable to reasoning. A robust testing framework and comprehensive test coverage are critical to providing such reliability and enabling enterprise-grade applications to be built on top of Urbit.

Current Urbit Testing

Unit testing is available on Urbit today for all parts of the system. No specialized kernel mode is necessary to audit gate behavior (altho Arvo and vane state information are not accessible from userspace). Urbit currently supplies the following tools:

`-test` Thread

The /lib/test library and the /ted/test thread work together to enable developers to produce basic unit tests. Most of the time, envased results are compared using ++expect-eq. This facilitates the production of input–output pairs, taking into account structure, type, and value. Other types of tests, including failure, failure messages, and success, can also be performed. The -test thread runs all code through the ++mule gate so that failures can be reported and so that subsequent tests continue to run.
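
As a minimal sketch of the pattern just described, a test file might look like the following. The file path and the arm names are illustrative assumptions, not taken from an existing test suite; ++expect-eq and ++expect-fail come from /lib/test, and the thread picks up arms whose names begin with `test-`.

```hoon
::  Hypothetical file: /tests/lib/example.hoon
::
/+  *test
|%
::  Compare an expected vase against a computed one.
::
++  test-add
  %+  expect-eq
    !>  4
    !>  (add 2 2)
::  Pass only if the trap crashes (here, subtracting below zero).
::
++  test-sub-underflow
  %-  expect-fail
  |.  (sub 1 2)
--
```

Each arm returns a list of error messages, empty on success, so a single arm can weld together several checks.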

Unit tests can be created to test any code object. Most commonly, libraries and structure files are tested first. Generators and agents require a direct code build using ++ford. Agents in particular require the mock-up of a state noun and a fake bowl which together mimic the operation of a real ship.

While there are some ergonomic difficulties with writing tests in Hoon using the test thread pattern, it is a fundamentally sound expedient. Test threads can be triggered from outside the runtime, and are part of the standard CI/CD pipeline in place for the primary Urbit repo, urbit/urbit.

`/app/test` Agent

/app/test

`%quiz`

Property-based testing, and particularly randomized property-based testing, can be carried out using the %quiz framework.

%quiz

`%aqua` Virtualized Ship Management

The %aqua system was originally conceived as a sort of ship-in-a-bottle: the ability to run fully virtualized ships inside of a host. (In principle, even multiple live ships could be operated in tandem using %aqua.) Altho still documented as if it were a functional part of the system, in reality %aqua has not been functional for many kelvins. This “aquarium” permits a developer to instrument multiple fake ships from inside of one, and includes the /lib/ph integration testing library. Such tests consist of batch-like sequences of commands to the ship's Arvo and Dojo to produce particular values, often coarsely expressed as string matches.

Uqbar made a sometimes-working attempt to modernize %aqua as %pyro, but their final product was insufficient to satisfy the needs of a testing harness. It has been discussed whether Shrubbery plus an Arvo-like event handler could make %aqua simple to re-implement. In any case, it is not currently clear how much work would be necessary to update %aqua or (more likely) to replace it from scratch, drawing on prior art.

Advocacy

While good practices for testing have been advocated on Urbit for years, the insufficiency of developer tooling has tended to depress their active use. (We cite as well the failure of %base to ship with /tests in the standard pill.) Grants have only occasionally been submitted with comprehensive tests, and most released software on Urbit is spot-tested by developers rather than rigorously exercised by automated unit tests.

That said, there has been remarkable robustness in Urbit's practical approach of spot-testing software in private, then rolling it out to core developer ships, and only thence to the broader network. This approach is less practical for single-application developers.

Desiderata

If we could wave the magic wand, what would testing on Urbit look like? Let's suppose that we have a functioning %aqua, and branch out from that point.

Batch files or playbooks.
Fuzz testing of Arvo and the runtime.
Structured property-based testing harness/framework.
Auditable crashes.
CI/CD integration.
Azimuth testnet. Tlon has done this before. In general, the idea is that an Azimuth testnet detached from the actual network and carrying some traffic (perhaps AI-driven) would permit network weather, update propagation, communications reliability, and the like to be studied. It should cover multiple host OSs and various configurations. (“Cardinality” is of course the root of “Azimuth”.)
Debugging with breakpoints. While not automated, breakpoint-driven debugging (both fixed and conditional) is a key part of contemporary software development QA. Some of this can be done with a next-level ++mock interpreter, but some sort of gdb-style symbol logging would be nice for connecting legible Hoon to executable Nock.
Cross-kelvin compatibility. We should start building parts of the system to be robust against cross-kelvin communications. Ames directed messaging as a backwards-compatible upgrade should aid in this, but certain parts of userspace should also be hardened.
Runtime testing on both Vere and Sword/Ares. Jets, internal functions, etc. Solid code coverage here against a specification.

Beyond that lies the horizon: we can there imagine fully AI-generated traffic and fuzzing attacks, as well as co-development of jet code and Hoon code (verifiability) and other research directions.

Best Practices

Certain best practices have been derived for the current configuration of the Urbit ecosystem. For instance, documentation and testing should be part of core milestones on a grant, rather than a standalone milestone which may be ignored.

A common problem with unit tests is that the bunt of many values is valid in the code. In particular, since the bunt of a ship name is ~zod, one does not notice any bunted ship names when testing on a fake ship ~zod. To mitigate this, ++en-beam should be preferred to /=== in almost all cases.

Error messages should provide context (a criticism which currently holds for the kernel, alas). ++expect-fail-message can facilitate testing for expected error messages. Prettyprinter improvements can also assist in crash assessment.

Roadmap

How can we use a testing framework strategy to convert Urbit from legacy code into a verifiably diamond kernel? Together with ~mopfel-winrux's forthcoming work on standardizing Urbit into a specification, we can employ unit testing and integration testing to establish baseline behavior, then expand out into desirable features.

Concretely, we proceed along the following path:

Essential unit testing. Start at /sys/zuse and work backwards towards /sys/hoon. Coordinate with ~mopfel-winrux on the Urbit Specification.
Simple regression testing. All bug fixes should include a test which would have revealed the initial bug, if possible.
Integration testing. System components should be tested against each other at the runtime level, kernel level, and userspace level.
Coverage measurement. Coverage targets should be set once they are measurable; this may require modification of existing code coverage tools or (better) reimplementation of such functionality in Hoon.
Breakpoint debugging. Altho not trivially automatic, breakpoint-based debugging should be developed alongside proper testing and test coverage arcs.
Batch operation support. This could end up looking like the “Valence Shell” proposal, which was a thread-based shell to manipulate nouns more powerfully than Dojo; or simply additional functionality introduced to Dojo.

sigilante/testing-on-mars.md