
@asterite
Last active August 29, 2015 14:09

Crystal's type inference algorithm

The type inference starts at type_inference.cr. Here, the program's AST is visited with a TypeVisitor. This visitor does many things:

  1. It declares classes and defs (when the visitor visits a ClassDef node it declares a class, and so on). Note that the visitor doesn't visit a def's contents.

  2. It types literals. For example, the number 1, represented by a NumberLiteral node, is assigned the type Int32. Every ASTNode has a type, which is initially nil.

  3. It keeps track of variables defined in a method (or at the top level) and their types. For example, when the visitor finds an assignment like a = 1 it first visits 1 and assigns the type Int32 to it. Then it binds the node a to the node 1, making 1's type propagate to a. The key here is that a is bound to 1. In this case it doesn't make much difference, but for a case like a = b, a's type will be bound to b's type, so whenever b's type changes, a's type will change with it. This dependency tracking is in semantic/ast.cr, specifically the bind methods.

  4. When a call is found, for example foo.bar(baz), these things happen:

    1. The receiver (foo), if any, is visited.
    2. The arguments (baz), if any, are visited.
    3. The call attaches listeners in the previous nodes (called input_observers in the code).

    Then the call is recalculated, meaning the compiler will look up methods named bar in foo's type that match the arguments' types.

    Once lookup matches are found (regular defs), they are instantiated: the def's arguments are bound to the call's argument types, a new TypeVisitor is created and the newly instantiated def is traversed. The call is then bound to those def instantiations (the types of their bodies).

    The instantiations are cached. The key to the cache is the arguments' types. So, when a method is found and the compiler has already typed an instantiation for that set of argument types, there's nothing to do.
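The instantiation cache described above can be sketched in a few lines of Python. This is a simplified model, not the actual Crystal implementation: the names (instantiations, instantiate) are hypothetical, and a string stands in for a typed def AST.

```python
# Toy model of the def-instantiation cache: the key is the method name
# plus the tuple of argument types seen at a call site.
instantiations = {}  # (name, arg_types) -> typed instantiation

def instantiate(name, arg_types):
    key = (name, tuple(arg_types))
    if key in instantiations:
        # Already typed an instantiation for these argument types:
        # nothing to do, reuse it.
        return instantiations[key]
    # Stand-in for "create a TypeVisitor and traverse the def's body".
    typed_def = f"{name}({', '.join(arg_types)})"
    instantiations[key] = typed_def
    return typed_def

first = instantiate("bar", ["Int32"])
again = instantiate("bar", ["Int32"])   # cache hit: same instantiation
other = instantiate("bar", ["String"])  # new argument types: new instantiation
```

A second call with the same argument types returns the cached instantiation, while a call with different argument types triggers a fresh traversal, which is why defs behave like C++ templates here.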

For classes and types containing instance variables, the instance variables are bound to the nodes assigned to them. So, for example, in a class Foo that has @a = 1, @a's type will be bound to Int32. If a different type is later assigned, say @a = "hi", @a will be bound to both nodes, resulting in a new type, Int32 | String.

The key point here is that nodes are bound to other nodes, and as their types change (as with the previous instance variable), the new types are propagated. If a call's argument types change, matches are looked up again for the new types.
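The binding and propagation mechanism can be sketched as follows. This is a minimal Python model of the idea behind semantic/ast.cr's bind, not the real code: types are plain strings, and a node's type is recomputed as the union of its dependencies' types whenever one of them changes.

```python
class Node:
    """Toy AST node: a type plus the nodes that observe it."""
    def __init__(self, type=None):
        self.type = type          # e.g. "Int32", "Int32 | String", or None
        self.dependencies = []    # nodes this node's type is computed from
        self.observers = []       # nodes whose type depends on this one

    def bind_to(self, other):
        """Make self's type track other's type (cf. Crystal's bind)."""
        self.dependencies.append(other)
        other.observers.append(self)
        other.propagate()

    def propagate(self):
        """Recompute each observer's type; ripple changes further up."""
        for obs in self.observers:
            types = {d.type for d in obs.dependencies if d.type}
            new_type = " | ".join(sorted(types))
            if new_type != obs.type:
                obs.type = new_type
                obs.propagate()

# @a = 1 binds @a to an Int32 literal...
int_lit = Node("Int32")
ivar = Node()
ivar.bind_to(int_lit)          # ivar.type is now "Int32"
# ...and a later @a = "hi" binds it to a String literal as well,
# so @a's type becomes the union Int32 | String.
str_lit = Node("String")
ivar.bind_to(str_lit)
```

Because propagate only re-runs when a type actually changes, re-binding a node to another node of the same type is cheap, which matches the observation that unions are rare in practice.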

This might seem inefficient, but the algorithm works quite well. Most of the time our code deals with a fixed set of types, so type unions are not that frequent. Compiling the whole compiler takes between 5.5 and 7.6 seconds on my machine (we do some incremental compilation in the code generation phase). The compiler consists of about 30k lines of code (what's underneath src/compiler).

The algorithm is actually a bit more complicated than the above. In point 3, where it was mentioned that the compiler keeps track of variables' types, the compiler tries to be smarter. For example, if you write a = 1; a = "hello"; a.length, the compiler knows that in a.length, a's type is String, even though you first assigned an int to it. You can read more about this here, and also here (read the whole section), here and here.
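This flow-sensitive tracking can be sketched as a visitor that carries a per-variable "current type" as it walks statements in order, so a read sees the type of the most recent assignment. Again a hypothetical Python model, much simpler than the real compiler (which also has to merge types at branches, handle unions, and so on).

```python
# Simplified flow-sensitive variable typing: statements are processed in
# order, and each assignment overwrites the variable's current type.
def infer(statements):
    var_types = {}   # variable name -> type at this point in the flow
    result = None
    for stmt in statements:
        if stmt[0] == "assign":      # ("assign", name, literal_type)
            _, name, lit_type = stmt
            var_types[name] = lit_type
        elif stmt[0] == "read":      # ("read", name)
            result = var_types[stmt[1]]
    return result

# a = 1; a = "hello"; a.length  -- at the read, a is a String
program = [("assign", "a", "Int32"),
           ("assign", "a", "String"),
           ("read", "a")]
```

Here infer(program) yields "String": the earlier Int32 assignment is shadowed, so the call lookup for length only has to consider String.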

Then there are blocks and many more features and bits of syntax sugar, but they are all additions to the above model. It's all about type propagation through AST nodes, def instantiations (think of all defs as C++ templates) and caching to speed things up. More complications involve tracking closured variables, for example. More or less everything is done in that TypeVisitor. That makes the code a bit harder to follow, but the code is traversed fewer times, making the compiler fast. I believe fast compilers are better than slow but better-modeled compilers (the programmer wants speed; the programmer doesn't even care about the compiler's code).
