Skip to content

Instantly share code, notes, and snippets.

@ezyang
Created August 22, 2016 08:49
Show Gist options
  • Save ezyang/50004e0f842eec5848acad5cf1c22240 to your computer and use it in GitHub Desktop.
Save ezyang/50004e0f842eec5848acad5cf1c22240 to your computer and use it in GitHub Desktop.
Notes on GHC make

How ghc --make works

In abstract, the job of ghc --make is very simple: compute a dependency graph between modules, and rebuild the ones that have changed. Unfortunately, the code in GhcMake is quite a bit more complicated than that, and some of this complexity is essential to the design of GHC:

  • Interaction with GHCi involves mutating process-global state. GhcMake today handles some of this mutation: in particular, it handles updating the HomePackageTable as we typecheck modules (expedient because we need this up-to-date to typecheck later modules), the InteractiveContext (which records all of the things defined on the REPL), and the global linker state (for object code we load in-process.)

    These updates should not be handled by GhcMake, and as much of this state is represented with mutable pointers to immutable objects, they would be easy to extricate. But some state, like the linker, is best updated incrementally. It would be a big waste to unload all loaded objects; we only want to unload things that will be invalidated by the new compilation (this is tied to the notion of stability in the current implementation). Unfortunately, traditional build interfaces don't report enough information to handle this. To complicate things further, this state must be updated accurately even when the build fails.

  • On-disk stored interface files are deserialized into an immutable, cyclic graph which is consulted during typechecking. In the absence of mutual recursion, this graph grows monotonically and all is well; but when we compile using an hs-boot file, we compile against an incomplete version of this graph, which is must be updated to complete form when we compile the final hs file. GhcMake handles this by re-typechecking module loops when they are completed: this is not something a traditional build system knows how to do!

The biggest source of incidental complexity in GhcMake is support for parallel builds, which means that the logic for upsweep is implemented twice.

So, what would a rewrite of GhcMake look like? Here are some things I know need to happen:

  • The lesson of Shake is that dependencies should not be specified before a build starts: they should be specified as you are building. Adopting a design like this would allow us to collapse the upsweep and downsweep passes into straight-line build code, for a massive increase in clarity and abstraction (for example, choice of parallelization can be done implicilty). ghc-shake is an existence proof for such a design.
  • For backwards compatibility, hi and o files should be the only form of cached state. This sets the build abstraction on a different evolutionary path than that described by Shake. Referring to Section 2.3, we should not introduce a Result type; rather, the Time and Value should be assumed to be derivable from the Key (by reading the relevant output file from disk.) Similarly, dependencies (depends) must be extractable from the Value (e.g., this is true for ModIface).
  • Although a ModIface is the value associated with an interface file, it is not what is fed into a subsequent compilation: the typechecked form, a ModDetails is what you need. The ModDetails should be cached, but only in memory, not in disk. This cache needs to be maintained independently from the build system, so that GHCi can decide when to clear the cache.
  • In most build systems, the dependency graph is represented implicitly and cannot be accessed. Cases like determining what portions of the graph to unload and retypechecking module loops make it clear that we must compute this graph in order to perform certain operations. But like ModDetails, this information should be cached.
  • To build a module, one needs a complete HomePackageTable for all home modules transitively below it. This means that the dependencies of a module are not its direct imports, as one might imagine; it is the transitive home dependencies. This indicates that we need a fast way of computing the transitive dependencies from direct information: this information can be gotten from the dependency graph, or even just a cached "transitive deps" per module.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment