Type inference starts at type_inference.cr. Here, the program's AST is visited with a TypeVisitor. This visitor does several things:
- It declares classes and defs (when the visitor visits a ClassDef node it declares a class, and so on). Note that the visitor doesn't visit a def's contents.
- It types literals. For example, the number `1`, represented by a `NumberLiteral` node, is assigned the type `Int32`. Every `ASTNode` has a type, which is initially `nil`.
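The literal-typing step can be sketched like this (a Python analogue; the class names mirror the text, but this is not the compiler's actual API):

```python
class NumberLiteral:
    def __init__(self, value):
        self.value = value
        self.type = None            # every node starts untyped ("nil" above)

class TypeVisitor:
    def visit(self, node):
        # typing a literal: an integer literal gets Int32
        if isinstance(node, NumberLiteral) and isinstance(node.value, int):
            node.type = "Int32"

node = NumberLiteral(1)
TypeVisitor().visit(node)
print(node.type)   # Int32
```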
- It keeps track of variables defined in a method (or at the top level) and their types. For example, when the visitor finds an assignment like `a = 1`, it first visits `1` and assigns the type `Int32` to it. Then it binds the node `a` to the node `1`, making `1`'s type be propagated to `a`. The key here is that `a` is bound to `1`. In this case it doesn't make much sense, but for a case like `a = b`, `a`'s type will be bound to `b`'s type, so whenever `b`'s type changes, `a`'s type will change to that type. This dependency tracking is in semantic/ast.cr, specifically the `bind` methods.
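The bind/propagate mechanism can be sketched in a few lines (a Python analogue, assuming a simplified model where types are strings; the method names are illustrative, not the real `bind` API):

```python
class Node:
    def __init__(self, type=None):
        self.type = type            # starts as None (nil)
        self.dependencies = []      # nodes this node is bound to
        self.observers = []         # nodes bound to this node

    def bind_to(self, other):
        self.dependencies.append(other)
        other.observers.append(self)
        self.update()

    def set_type(self, type):
        if self.type != type:
            self.type = type
            for obs in self.observers:
                obs.update()        # propagate the change to dependents

    def update(self):
        # merge dependency types; multiple distinct types form a union
        types = {dep.type for dep in self.dependencies if dep.type}
        if len(types) == 1:
            self.set_type(types.pop())
        elif len(types) > 1:
            self.set_type(" | ".join(sorted(types)))

# a = b: `a` is bound to `b`, so retyping `b` automatically retypes `a`
b = Node()
a = Node()
a.bind_to(b)
b.set_type("Int32")
print(a.type)   # Int32
b.set_type("Char")
print(a.type)   # Char
```

Because propagation goes through `set_type`, a change anywhere in a chain of bindings ripples to every dependent node.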
- When a call is found, for example `foo.bar(baz)`, these things happen:
  - The receiver (`foo`), if any, is visited.
  - The arguments (`baz`), if any, are visited.
  - The call attaches listeners to the previous nodes (called `input_observers` in the code). Then the call is recalculated, meaning the compiler will look up methods named `bar` in `foo`'s type that match the arguments' types. Once lookup matches are found (regular defs), they are instantiated: the def's arguments are bound to the call arguments' types, a new TypeVisitor is created, and the newly instantiated def is traversed. The call is then bound to those def instantiations (that is, to the types of their bodies).
  - The instantiations are cached. The key to the cache is the argument types. So, when a method is found and the compiler has already typed a particular instantiation for that set of argument types, there's nothing to do.
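The instantiation cache can be sketched like this (a Python analogue; `Def`, `instantiate`, and `typed_count` are illustrative names, not the compiler's):

```python
class Def:
    def __init__(self, name, body_type_fn):
        self.name = name
        self.body_type_fn = body_type_fn   # computes the body's type
        self.instantiations = {}           # argument types -> body type
        self.typed_count = 0               # how many bodies we actually typed

    def instantiate(self, arg_types):
        key = tuple(arg_types)
        if key not in self.instantiations:  # cache miss: traverse the body
            self.typed_count += 1
            self.instantiations[key] = self.body_type_fn(arg_types)
        return self.instantiations[key]

# def bar(x); x; end  -- the body's type is simply the argument's type
bar = Def("bar", lambda arg_types: arg_types[0])

print(bar.instantiate(["Int32"]))   # Int32
print(bar.instantiate(["Int32"]))   # cached: the body is not re-typed
print(bar.instantiate(["String"]))  # String -- a new instantiation
print(bar.typed_count)              # 2
```

This is what makes the "defs as templates" model affordable: each def is typed once per distinct set of argument types, not once per call site.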
- For classes and types containing instance variables, the instance variables are bound to the values assigned to them. So, for example, in a class Foo that has `@a = 1`, `@a`'s type will be bound to `Int32`. If later a different type is assigned, say `@a = "hi"`, `@a` will be bound to both nodes, resulting in a new type, `Int32 | String`.
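The union-forming step can be sketched on its own (a Python analogue with types as strings; `union_of` is a hypothetical helper, not the compiler's):

```python
def union_of(types):
    # a single distinct type stays as-is; several distinct types form a union
    distinct = sorted(set(types))
    return distinct[0] if len(distinct) == 1 else " | ".join(distinct)

bound_nodes = ["Int32"]            # after @a = 1
print(union_of(bound_nodes))       # Int32
bound_nodes.append("String")       # after @a = "hi"
print(union_of(bound_nodes))       # Int32 | String
```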
The key point here is that nodes are bound to other nodes, and as their types change (like the previous instance variable), types are propagated. If a call's argument types change, matches are looked up again for the new types.
This might seem inefficient, but the algorithm works quite well. Most of the time our code deals with a fixed set of types, so type unions are not that frequent. Compiling the whole compiler takes between 5.5 and 7.6 seconds on my machine (we do some incremental compilation in the code generation phase). The compiler consists of about 30k lines of code (everything underneath src/compiler).
The algorithm is actually a bit more complicated than the above. In the third point, where it was mentioned that the compiler keeps track of variables' types, the compiler tries to be smarter. For example, if you do `a = 1; a = "hello"; a.length`, the compiler knows that in `a.length`, `a`'s type is `String`, even though you assigned an int to it earlier. You can read more about this here, and also here (read the whole section), here and here.
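The idea behind this smarter tracking can be sketched minimally (a Python analogue; the real compiler's flow analysis is considerably more involved, this only shows the "latest assignment wins" intuition):

```python
current_type = {}                   # variable name -> type at this program point

def assign(name, literal_type):
    # a new assignment overrides the variable's type from this point on,
    # instead of widening it to a union everywhere
    current_type[name] = literal_type

assign("a", "Int32")                # a = 1
assign("a", "String")               # a = "hello"
print(current_type["a"])            # String: so a.length type-checks
```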
Then there are blocks, and many more features and bits of syntax sugar, but they are all additions to the above model. It's all about type propagation through AST nodes, def instantiations (considering all defs like C++ templates) and caching to speed things up. More complications involve tracking closured variables, for example. More or less everything is done in that TypeVisitor. That makes the code a bit harder to follow, but the AST is traversed fewer times, making the compiler fast. I believe fast compilers are better than slow but better-modeled compilers (the programmer wants speed; the programmer doesn't care about the compiler's code).