In abstract, the job of ghc --make
is very simple: compute
a dependency graph between modules, and rebuild the ones that
have changed. Unfortunately, the code in GhcMake
is quite
a bit more complicated than that, and some of this complexity
is essential to the design of GHC:
Interaction with GHCi involves mutating process-global state.
GhcMake
today handles some of this mutation: in particular, it handles updating theHomePackageTable
as we typecheck modules (expedient because we need this up-to-date to typecheck later modules), theInteractiveContext
(which records all of the things defined on the REPL), and the global linker state (for object code we load in-process.)These updates should not be handled by
GhcMake
, and as much of this state is represented with mutable pointers to immutable objects, they would be easy to extricate. But some state, like the linker, is best updated incrementally. It would be a big waste to unload all loaded objects; we only want to unload things that will be invalidated by the new compilation (this is tied to the notion of stability in the current implementation). Unfortunately, traditional build interfaces don't report enough information to handle this. To complicate things further, this state must be updated accurately even when the build fails.On-disk stored interface files are deserialized into an immutable, cyclic graph which is consulted during typechecking. In the absence of mutual recursion, this graph grows monotonically and all is well; but when we compile using an
hs-boot
file, we compile against an incomplete version of this graph, which is must be updated to complete form when we compile the finalhs
file.GhcMake
handles this by re-typechecking module loops when they are completed: this is not something a traditional build system knows how to do!
The biggest source of incidental complexity in GhcMake
is support
for parallel builds, which means that the logic for upsweep is
implemented twice.
So, what would a rewrite of GhcMake
look like? Here are some things
I know need to happen:
- The lesson of Shake is that dependencies should not be specified before a build starts: they should be specified as you are building. Adopting a design like this would allow us to collapse the upsweep and downsweep passes into straight-line build code, for a massive increase in clarity and abstraction (for example, choice of parallelization can be done implicilty). ghc-shake is an existence proof for such a design.
- For backwards compatibility,
hi
ando
files should be the only form of cached state. This sets the build abstraction on a different evolutionary path than that described by Shake. Referring to Section 2.3, we should not introduce aResult
type; rather, theTime
andValue
should be assumed to be derivable from theKey
(by reading the relevant output file from disk.) Similarly, dependencies (depends
) must be extractable from theValue
(e.g., this is true forModIface
). - Although a
ModIface
is the value associated with an interface file, it is not what is fed into a subsequent compilation: the typechecked form, aModDetails
is what you need. TheModDetails
should be cached, but only in memory, not in disk. This cache needs to be maintained independently from the build system, so that GHCi can decide when to clear the cache. - In most build systems, the dependency graph is represented
implicitly and cannot be accessed. Cases like determining what
portions of the graph to unload and retypechecking module loops make
it clear that we must compute this graph in order to perform certain
operations. But like
ModDetails
, this information should be cached. - To build a module, one needs a complete
HomePackageTable
for all home modules transitively below it. This means that the dependencies of a module are not its direct imports, as one might imagine; it is the transitive home dependencies. This indicates that we need a fast way of computing the transitive dependencies from direct information: this information can be gotten from the dependency graph, or even just a cached "transitive deps" per module.