Referral statements.md

The goal

We need a syntax for declaring "Mojo references", i.e. identifiers that refer to pre-existing variables.

I will refer to a Mojo reference's identifier as an alias. This is a widely used term; it means "a secondary name for something". Henceforth, every time I talk about an "alias", I am referring to this concept. (I am not talking about Mojo's alias keyword for declaring compile-time variables. I suspect this keyword will eventually be renamed, because it's misleading.)

As discussed in this thread, we want the ability to declare aliases in the following locations:

- Argument lists and parameter lists (already implemented)
- Assignment targets
  - i.e. =, :=, and for-loop variables
- Patterns
- Struct fields

I've spent the last week searching through various syntaxes and semantics that can handle all of these cases well. I believe I've finally found a design that is clean, expressive, and backwards-compatible with Python.

Note: This design is very different to the design I presented a week ago. It's the outcome of about 10x the time investment. You can completely ignore my earlier designs; this document is self-contained.

One of the key aspects of this design is that references/aliases remain a second-class language construct, i.e. it would not be possible to declare a List[ref[origin] String]. This is a good thing, because the behaviour of first-class auto-dereffing references becomes difficult to define and to comprehend when you mix them with generics. If a programmer wants to store references in a generic collection, they can use Mojo's first-class Pointer type, or they can declare their own type where one of the fields is a reference/alias.

With that in mind, here is my proposed design.

We need a "referral statement"

We need a syntax for declaring function-scoped aliases. I propose the following syntax:

numbers = Tuple(3, 4, 5)
x -> numbers[2]
x += 1
print(numbers)    # prints (3, 4, 6)

The second line is a new kind of statement that I will describe as a referral statement. The referral statement x -> numbers[2] binds the identifier x to the variable that the expression numbers[2] denotes. In other words, it makes x an alias for the variable numbers[2]. We can read x -> numbers[2] as "x refers to numbers[2]", or "x points to numbers[2]".

For now, let's not stress about the particular symbol -> that I've used. We can always change the symbol if we find something better.

This notion of a "referral statement" is notably different to the keyword-based syntaxes (e.g. ref x = numbers[2]) that we have been considering so far. There are a number of reasons why a new kind of statement works better than a keyword. For one thing, it gives us a very clean syntax for destructuring:

numbers = Tuple(3, 4, 5)
x, y, z -> numbers
y = 0
print(numbers)   # prints (3, 0, 5)

In comparison, with a keyword-based syntax you'd probably need to repeat the keyword for each alias being declared:

ref x, ref y, ref z = numbers

Or maybe you could parenthesize the left-hand side:

ref (x, y, z) = numbers

Either way, the syntax is noisier than ->.

Perhaps the most important reason to introduce a new kind of statement is that we need the ability to distinguish between rebinding an alias and assigning to the variable that the alias refers to. Having both referral statements and assignment statements makes this easy:

numbers = Tuple(3, 4, 5)
x -> numbers[0]    # Bind 'x' to the first element
x = 0              # Update the first element
x -> numbers[1]    # Bind 'x' to the second element
x = 0              # Update the second element
print(numbers)     # prints (0, 0, 5)
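
Python has no aliases in this sense, so the distinction can only be emulated there. The following is a rough sketch, not part of the proposal: the Alias class and its bind method and value property are hypothetical names, standing in for the referral statement and for assignment through an alias respectively.

```python
class Alias:
    """Hypothetical stand-in for a Mojo alias: a (container, index) pair."""

    def __init__(self, container, index):
        self.bind(container, index)

    def bind(self, container, index):
        # Emulates the referral statement `x -> container[index]`.
        self._container = container
        self._index = index

    @property
    def value(self):
        return self._container[self._index]

    @value.setter
    def value(self, new_value):
        # Emulates the assignment `x = new_value`, which writes through the alias.
        self._container[self._index] = new_value


numbers = [3, 4, 5]      # a list, because Python tuples are immutable
x = Alias(numbers, 0)    # x -> numbers[0]
x.value = 0              # x = 0
x.bind(numbers, 1)       # x -> numbers[1]
x.value = 0              # x = 0
print(numbers)           # prints [0, 0, 5]
```

The emulation needs two different spellings (bind and value) for the two operations, which is the same split that -> and = provide in the proposed syntax.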

The value of a dedicated referral statement is perhaps clearer when you need to rebind aliases stored in structs:

thing.x = 0
thing.x -> numbers[2]

For a keyword-based syntax, it's not clear where you'd put the keyword:

ref thing.x = numbers[2]     # weird

thing.x = ref numbers[2]

Putting the keyword on the right-hand side seems more sensible, but this suggests that ref should be an operator. For various reasons, that doesn't make sense. For example, the operator would have no purpose except when it appears immediately after the = symbol. So loosely speaking, the "actual" operator would be = ref. But that's just a clunky equivalent of the -> symbol.

Another case we need to consider is patterns. In Python, the syntax is:

match node:
    case Node(position=(x, y)):
        x = 0  # This does nothing useful, because x is a copy of node.position[0]

This program looks up the position attribute of node and destructures it, as if we had written:

x, y = node.position

For Mojo, we ideally want the use of = in patterns to mean the same as it does in Python: copy the value into the variable(s).

Now that we have referral statements, and the corresponding -> symbol, we have a good way to specify that a pattern should bind aliases to parts of the matched variable:

match node:
    case Node(position<-(x, y)):
        x = 0     # This mutates 'node.position.x'

Note that I've flipped the arrow around. This makes sense when you consider that the above program is equivalent to writing:

x, y -> node.position

In both cases, the arrow points toward the data that the aliases are being bound to.

Side note: This is one of the reasons why I've chosen the syntax -> rather than =>. You can't flip => backwards, because <= means "less than or equal to".

Python also allows positional destructuring, for example:

match node:
    case Node(size, position):

In this case, the variables size and position are bound to the attributes that are defined by the special __match_args__ dunder attribute:

@dataclass
class Node:
    position: tuple[int, int]
    size: int
    __match_args__ = ('size', 'position')

Mojo will need to either copy this (strange) syntax, or invent a new syntax for specifying the order of fields when destructuring. Whatever approach we take, I suggest that we allow the user to simultaneously choose whether destructuring should bind aliases, or bind new variables. For example, we could offer an alternative to __match_args__:

struct Node:
    var position: Tuple[Int, Int]
    var size: Int
    __match_aliases__ = ('size', 'position')

By having the type author specify whether positional destructuring should bind aliases or copies, we avoid the need for programmers to make this decision at the call site. If the type author chooses to bind aliases, that's what the user gets:

match node:
    case Node(size, position):
        size = 1     # Mutates the field, because 'size' is an alias.

Whether or not an identifier is an alias can be communicated through syntax highlighting and/or tooltips. (This is similar to how Mojo users discover which arguments of a function are inout.)

Loop variables

Let's consider the semantics of for-loops. Here's a basic Python program:

numbers: list[int] = [3, 4, 5]
for num in numbers:
    num = 0
print(numbers)    # prints [3, 4, 5]

We need to ensure this code behaves the same when it is compiled as a Mojo program. This means that num must be an independent variable, rather than an alias of an item in the list.

HOWEVER, just because Python's list type returns copies, this doesn't mean Mojo's List type (with a capital L) needs to behave the same! The following behaviour would be perfectly reasonable, because List's behaviour already diverges from list in other ways. There's no need for iteration to behave 100% identically:

numbers = List(3, 4, 5)
for num in numbers:
    num = 0
print(numbers)    # prints [0, 0, 0]

Binding aliases is strictly more expressive than returning copies, and Mojo actually needs to offer this capability, because not all types in Mojo are copyable. The only question is whether we need a dedicated syntax for letting users choose between aliasing and copying, e.g. for ref num in numbers.

I claim that no, we don't need a dedicated syntax for this. Instead, we can let the type's author decide how the default iterator should behave, and we can let users explicitly select a different iterator when necessary. In the next few paragraphs, I explain how this would work.

Whether an iterator binds aliases or copies would be determined by how the __next__ method is defined on the iterator. An iterator that binds copies would look like:

fn __next__(self) -> T:

Whereas an iterator that binds aliases would look like:

fn __next__(self) -> ref[<origin>] T:

This suggests we might need two different iterator traits: one for collections that iterate over their elements (binding aliases), and one for streams and generators (etc.) that copy, move, or emplace values into the caller-provided result slot. Python's list type belongs to the latter category.

If we decide that List needs to offer both kinds of iterator, we can make them both available. Non-default iterators can be obtained through explicit method calls, e.g.:

for num in numbers.copies():

This would give the caller an iterator that produces copies, just like Python's list type.
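
To make the "default iterator plus explicit alternative" idea concrete, here is a rough Python emulation. It is only a sketch under loud assumptions: Slot and AliasingList are hypothetical classes, and because Python cannot write through a loop variable, the aliasing loop has to spell the write as num.value = 0 instead of num = 0.

```python
class Slot:
    """Hypothetical write-through handle to one element of a list."""

    def __init__(self, items, index):
        self._items = items
        self._index = index

    @property
    def value(self):
        return self._items[self._index]

    @value.setter
    def value(self, new_value):
        self._items[self._index] = new_value


class AliasingList:
    """Hypothetical collection whose default iterator yields handles."""

    def __init__(self, *items):
        self._items = list(items)

    def __iter__(self):
        # Default iterator: "iterate over variables".
        return (Slot(self._items, i) for i in range(len(self._items)))

    def copies(self):
        # Non-default iterator: "iterate over values", like Python's list.
        return iter(list(self._items))

    def __repr__(self):
        return repr(self._items)


numbers = AliasingList(3, 4, 5)

for num in numbers.copies():
    num = 0                # rebinds a local name; the collection is untouched
after_copies = repr(numbers)

for num in numbers:
    num.value = 0          # writes through to the element
after_aliases = repr(numbers)

print(after_copies)    # prints [3, 4, 5]
print(after_aliases)   # prints [0, 0, 0]
```

The type author picks what __iter__ returns, and the caller opts into the other behaviour by name, so no extra loop syntax is involved.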

Annotating an alias with its origin

This is a digression, but it's worth briefly clarifying how you'd write the origin of an alias.

So far, I've let the origin and the type of an alias be inferred:

x = 0
y = 0
selected_num -> x

Given that referral statements obviate the need for a ref keyword, one might wonder how to write an alias's origin.

I propose the following syntax:

x: Int = 0
y: Int = 0
selected_num{x, y}: Int -> x

The curly braces fulfil the same role as today's ref[_] syntax. I've chosen curly braces to avoid confusion with the subscript operator (and other things), but that's not the important detail. The important detail is that we're listing a set of targets/origins after the variable name, and before the type.

The motivation for this syntax is:

- It's concise.
- It clarifies that the origin of an alias is not part of its type. In other words, it clarifies that aliases/references are not a first-class type. An alias is just a secondary name for an existing variable. If you want a first-class type, we have Pointer for that.
- By having the type annotation be plain old Int, we are clarifying that selected_num behaves exactly like an Int. No matter what you do (except rebinding the alias), it behaves like an Int.

If desired, you can omit an alias's type, and only declare its origin:

selected_num{x, y} -> x

We can also use these curly braces in function signatures:

fn foo[origin: MutableOrigin](my_argument{origin}: String):
    ...
fn longest(x: String, y: String) -> {x, y} String:
    ...

With this syntax, we can understand longest as a function that returns variables. The syntax -> {x, y} String can be read as "returns a string with origin x or y". In today's Mojo, you'd write this as:

fn longest(x: String, y: String) -> ref[__lifetime_of(x, y)] String:

Side note: This syntax would work especially well if we can come up with a design for origins where we describe exactly the set of variables that an alias can refer to, as opposed to today's semantics, wherein an origin is more like a memory region that the alias's target resides within. I have been working on such a model for several months now. Eventually, I will publish something about it. In such a model, we can reduce the signature of longest to:

fn longest(x: String, y: String) -> {x, y}:

And we can read -> {x, y} as simply "returns x or y".

About the word "reference"

Notice that I've just written a very long post about this proposed language feature, and yet I have almost entirely avoided the word "reference" when describing it.

Those who have seen my earlier work know that I am opposed to using the word "reference" for this Mojo feature, because in Python, a "reference" is something totally different. It's a first-class value that can be stored in collections, and so on. Furthermore, assigning to a Python reference overwrites the reference, whereas assigning to a Mojo alias overwrites the target of the alias. By using the term "alias", we avoid all of that confusion. We can continue to use the term "reference" to describe instances of str etc., and save Python programmers (and Rust programmers, and JavaScript programmers, and Java programmers...) a lot of confusion.
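
For readers less familiar with the Python sense of "reference", a short illustration (the variable names are arbitrary):

```python
a = [1, 2, 3]
b = a              # 'b' holds a reference to the same list object as 'a'
b = [0]            # assignment overwrites the reference held by 'b'
print(a)           # prints [1, 2, 3]

refs = [a, a]      # references are first-class: they can be stored in a collection
refs[0].append(4)  # mutating through a stored reference affects the shared object
print(a)           # prints [1, 2, 3, 4]
```

Assigning to b never touches the list that a names, which is the opposite of what assigning to a Mojo alias would do.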

Using the term "alias" also avoids conflation with "pass by reference" in C++, which has no relationship to Mojo aliases. In Mojo, every function argument is an alias (a secondary name for a variable), and the value of the variable may be passed by value (in registers), or by address. The argument convention that the compiler (or the type author) chooses is orthogonal to the fact that the variable is being accessed through an alias, so this suggests that referring to arguments as "references" is misleading for C++ programmers as well.

Another detail worth mentioning: Functions with return type -> {_} T can be described as returning variables, as opposed to functions with return type -> T, which return values. Similarly, we can describe for-loops that bind aliases of List elements as iterating over variables, whereas for-loops that produce copies of Python list elements can be described as iterating over values.

In my opinion, all of this terminology works really well. It also connects nicely to the concept of aliasing, which is of paramount importance in Mojo, given that Mojo has a "borrow checker" that imposes restrictions on aliasing.

All that said, this is not a hill I'm going to die on. If there are strong reasons why "reference" makes more sense than "alias", we can go with the former. Perhaps we could use a mixture of both terms: an "alias" could be considered a named reference. We could then describe an alias as storing a reference.

Conclusion

In my opinion, "referral statements" are an elegant solution for creating and updating variable aliases. The referral symbol -> can be used in many of the same places where the assignment symbol = is used in Python, and it behaves in an analogous manner, except it binds aliases instead of assigning values. At the risk of being overly subjective, I'd say it "feels like" Python, in the sense that it allows us to bind names to runtime values without writing any keywords. Given that the -> symbol is also used in function types, we need to double-check that this new usage won't confuse humans or parsers. If necessary, we can consider other symbols.

Obviously, the semantics are important too, and I think I've presented a design that covers all of the use-cases that we need to cover. That said, the design hasn't been battle-tested yet, and there are likely some flaws that I haven't noticed. I would love to hear what people think.

The biggest TODO is definitely the syntax and semantics of origins associated with aliases. Origins have been evolving constantly over the past year, and I expect they'll continue to evolve rapidly over the next few months. But exploring origins in detail is beyond the scope of this proposal, so we should probably avoid focusing on that aspect.

The main questions we should be asking are:

- Does this design allow us to declare/bind/rebind aliases in all of the places that we need to? Is it sufficiently expressive?
- Does this design integrate well with Python, or will it lead to surprising behaviours/footguns that an alternative design might prevent?