@Nostracodus
Last active August 22, 2024 20:07
Slightly Simplified Earley Parsing

Cody Miller

This is a quick note on Scott and Johnstone's "Recognition is not parsing - SPPF-style parsing from cubic recognisers" paper. It aims to explain one part of the paper's Earley parser implementation and how it can be simplified by trading one problem for another. This is stitched together from my unpublished notes. Think of this less as a tutorial and more as a brain dump.

An interactive tool to parse and display the shared packed parse forests (SPPFs) described in the paper can be accessed here. The implementation there does not currently use the optimized implementation described here (and sketched in the original paper) nor the simplified form given below. It's probably useful to play around with that tool to get a feel for the structure of the SPPF nodes, as they aren't described much here.

I'm not going to do a big picture overview of the algorithm here (for now).

Implementing the make_node function

The paper details the complete Earley parsing algorithm. The node creation portion is encapsulated in the make_node function, rewritten below in pseudo-code.

def make_node(item, terminus, left_node, right_node):
  # Optimize away intermediate nodes that would have a single child
  if item.dot == 1 and not item.is_complete():
    return right_node

  if item.is_complete():
    label = item.subject
  else:
    label = item.slot

  if left_node is None:
    family = [right_node]
  else:
    family = [left_node, right_node]

  node = get_or_create_node(label, item.origin, terminus)

  if len(node.children) == 0:
    node.children = family
  elif not node.has_family(family):
    # On the first ambiguity, wrap the existing family in a PackNode,
    # then add the new family as its own PackNode.
    if not isinstance(node.children[0], PackNode):
      node.children = [PackNode(node.children)]
    node.children.append(PackNode(family))

  return node

Our presentation differs from the paper's. The following table maps our notation back to the original paper:

| Paper's Notation | Our Notation | Description |
| --- | --- | --- |
| $S ::= \alpha \cdot \beta, j, i, u$ | item | An Earley item in set $i$ with node $u$ |
| $S ::= \alpha \cdot \beta$ | item.slot | Represents progress of parsing some production |
| $S ::= \alpha \beta \cdot$ | a complete item | A fully recognized item |
| $\cdot$ | item.dot | The 0-based index of the $\cdot$ in the slot |
| $S$ in $S ::= \alpha \beta$ | item.subject | The nonterminal to the left of the ::= |
| $j$ | item.origin | The index in the input string where an item began |
| $i$ | terminus | The index in the input string where an item ended |
| $u$ | item.node | The node attached to an item (may be null) |
| $w$ | left_node | The node representing an item's parse up to this point |
| $v$ | right_node | A node corresponding to a term that was just parsed |

The function is responsible for creating new parse nodes that correspond to an item that has just been created. It is called after we have scanned a terminal or completed a nonterminal. Its job is to create a new node that represents the parsed portions of left_node and right_node together, deduplicating nodes that may be created multiple times due to ambiguities.

Important to note is that all SPPF nodes have a label, an origin, and a terminus. The label may be a slot, a terminal, or a nonterminal. The origin and terminus are indices into the input string, just as with items.
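As a concrete (and purely illustrative) sketch, the nodes and packed nodes might be modeled like this; the class and field names here are assumptions, not taken from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: object   # a slot, a terminal, or a nonterminal
    origin: int     # index into the input where the node's span begins
    terminus: int   # index into the input where the node's span ends
    # Either one family's members directly, or a list of PackNodes
    # once the node becomes ambiguous.
    children: list = field(default_factory=list)

@dataclass
class PackNode:
    children: list  # one family: [right_node] or [left_node, right_node]
```

A node with a single derivation holds its family's members in children directly; only when a second family appears does each family get wrapped in its own PackNode.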

The has_family check

This part of the algorithm is tricky to implement. The naive approach is to iterate through all of the node's children and compare them to the current family. This can be slow for particularly ambiguous grammars; so slow, in fact, that the algorithm no longer runs in $O(n^3)$ worst-case time like the Earley recognizer!

def has_family_naive(node, family):
  if not isinstance(node.children[0], PackNode):
    # A single-family node stores the family's members directly.
    return node.children == family

  for pack in node.children:
    if pack.children == family:
      return True

  return False

The has_family function can actually be executed in $O(1)$ time as the paper explains. I find the reasoning for how to achieve this to be subtle.

> We implemented unit time lookup for SPPF nodes and families by exploiting the fact that a family of $(s, j, i)$ is determined by an item $t = (X ::= \alpha x \cdot \beta)$ and the left index, $k$, of its right child, $(x, k, i)$. (So $s$ will be either $t$ or $X$.) We create a two dimensional array, indexed by $j$ and $s$, whose entries are triples $(L, N, K)$. When an SPPF node $w = (s, j, i)$ node is created, the corresponding cell's $L$-field is set to $i$, and the $N$-field is set to point to $w$. When a second family is added to $w$, a vector of possible $k$ values is created, and pointed to by the cell's $K$-field. As each subsequent family is created, the vector's $k$-th element is set to $i$.

I think the logic of why this works is made simpler by thinking about which parts of the SPPF structure are "fixed" and which parts may vary.

Let's assume a call to make_node is made with the following values for its arguments:

make_node(
  item = Item(label₁, i₁, j₁),
  terminus = j₁,
  left_node = Node(label₂, i₂, j₂),
  right_node = Node(label₃, i₃, j₃))

This call has 9 variables (three labels, three origins, and three termini). Our goal is to reduce these to the minimum number of independent variables.

Let's start with labels.

Assume an item Item(A ::= α β • δ, i, j) as an example. Its associated node will always have the label A ::= α β • δ because the label of a node is determined entirely by the slot: the implementation of make_node above chooses the label based only on label₁.

The left_node argument will be the node attached to the item that was advanced through Scanning or Completion to produce our current item. The predecessor item of our assumed item must have had slot A ::= α • β δ. The node for such an item will be optimized away by the first branch in make_node, yielding the node for α.

The right_node argument is the node that represents the terminal used to execute the Scan step or the nonterminal used to execute the Complete step that produced our current item. The right_node of our assumed item must have a label of β because that is the only term our assumed item could be advanced over.

This line of reasoning applies to all possible items and does a lot of lifting for us. Using only label₁, we are able to know what label₂ and label₃ must be. This makes them dependent variables (or "fixed" values) that can be ignored when seeking out families. In other words, we know that all families of children under a node will have identical labels, so we never need to check them to verify.

Let's focus on node indices now.

In the case where left_node is not null, make_node effectively pastes these two nodes together to produce a new node that spans the contiguous input of left_node and right_node. In the case where left_node is null, we produce a node that spans the input of right_node.

Knowing that the pasted span must be contiguous, we know that left_node and right_node share an index. The terminus of the left_node is identical to the origin of the right_node. Equationally, we can write j₂ = i₃.

The nodes attached to an item span the same input span as the item itself. This means that the left_node (which represents all of the already parsed input matched by this item) must begin where the item itself begins, so i₂ = i₁.

By symmetric reasoning, we know that our right_node has to end at the same place that the item does, so j₁ = j₃.
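These three constraints can be stated as a small validity check. The following sketch models items and nodes as bare (label, origin, terminus) tuples purely for illustration:

```python
# Items and nodes modeled as (label, origin, terminus) tuples for brevity.
def spans_are_consistent(item, left_node, right_node):
    _, i1, j1 = item
    _, i2, j2 = left_node
    _, i3, j3 = right_node
    # j2 == i3: the spans meet at the midpoint;
    # i2 == i1 and j3 == j1: the combined span aligns with the item's span.
    return j2 == i3 and i2 == i1 and j3 == j1
```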

Using substitution in our call to make_node (using _ as a "don't care, can ignore" wildcard) produces:

make_node(
  item = Item(label₁, i₁, j₁),
  terminus = j₁,
  left_node = Node(_, i₁, i₃),
  right_node = Node(_, i₃, j₁))

Overall, this means we have four pieces of independent information to consider:

  • label₁ corresponding to the item and node label
  • i₁ corresponding to the item and node origin
  • j₁ corresponding to the item and node terminus
  • i₃ corresponding to where the two nodes meet (or, equivalently, where the predecessor item ended and the current item resumed), also called the "midpoint."

We can reduce this a bit more.

The Earley parser (and usually the recognizer) completely processes one item set before advancing to the next through scanning, so the terminus j₁ is constant throughout the entirety of an item set. If we clear the cached node set (the set that implicitly backs the get_or_create_node function) before advancing to the next item set, we can consider j₁ to be fixed and ignorable.
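A sketch of what get_or_create_node might look like with such a per-item-set cache; the Node class and function names here are illustrative assumptions:

```python
from dataclasses import dataclass, field

# Illustrative stand-in for an SPPF node.
@dataclass
class Node:
    label: object
    origin: int
    terminus: int
    children: list = field(default_factory=list)

# Cache keyed only on (label, origin); the terminus is constant within
# one item set, so it can be left out of the key.
node_cache = {}

def get_or_create_node(label, origin, terminus):
    key = (label, origin)
    node = node_cache.get(key)
    if node is None:
        node = Node(label, origin, terminus)
        node_cache[key] = node
    return node

def advance_to_next_item_set():
    # The terminus changes, so every cached key is now stale.
    node_cache.clear()
```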

This leaves us with three pieces of data that can uniquely identify a node and its families: label₁, i₁, and i₃; or, equivalently, item.label, item.origin, and right_node.origin.

  • The item.label is either a slot or a nonterminal, both of which are constant according to the grammar. These can be given unique numeric indexes, allowing them to be used as integer keys into a fixed-size array.
  • The item.origin spans the entirety of the input and can be represented as an array.
  • The right_node.origin similarly spans the entirety of the input.

This suggests the three-dimensional array the original paper recommended. In my own experience implementing this structure, I found that their suggestion to lazily create the third dimension corresponding to the right_node.origin was crucial to getting acceptable performance.

def has_family_optimized(node, origin, midpoint):
  first = nodes[node.label]
  if first is None:
    return False
  second = first[origin]
  if second is None:
    return False
  # Assuming a dynamic language, the third dimension is created lazily:
  # the first family stores its midpoint directly, and an array of
  # midpoint values is only allocated when a second family is added.
  if isinstance(second, list):
    return second[midpoint] is not None
  return second == midpoint

A simpler mechanism

Even though we're able to implement has_family in constant time, the three-dimensional array structure enabling this is unsatisfyingly complicated.

If we rethink the problem and alter the Earley algorithm, we can avoid the has_family check by doing a different check entirely. I think this different check is easy to understand and easy to implement.

The revision to the algorithm is in the Complete step:

  • Complete: Given Item(B ::= ... •, i, j), add Item(A ::= ... B • ..., h, j) to the chart for every Item(A ::= ... • B ..., h, i) in the chart if (B, i) has not been completed in set j.

The Complete step is identical to the Earley algorithm except that it first checks to see if the nonterminal being completed for the same item origin has already been done; if it has, it doesn't bother doing the Complete step again.
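The modified Complete step might look like the following sketch, where Item, its fields, and the chart representation are all illustrative assumptions rather than the paper's definitions:

```python
from collections import namedtuple

# An Earley item: lhs ::= rhs with a dot position and an origin set index.
Item = namedtuple("Item", ["lhs", "rhs", "dot", "origin"])

def next_symbol(item):
    # The symbol just after the dot, or None if the item is complete.
    return item.rhs[item.dot] if item.dot < len(item.rhs) else None

def complete(chart, item, j, already_completed):
    """Complete `item` (assumed finished) into set j, at most once per (lhs, origin)."""
    key = (item.lhs, item.origin)
    if key in already_completed:
        return  # this (nonterminal, origin) was already completed in set j
    already_completed.add(key)
    for waiting in list(chart[item.origin]):
        if next_symbol(waiting) == item.lhs:
            chart[j].add(waiting._replace(dot=waiting.dot + 1))
```

The already_completed set would be emptied when moving on to the next item set, since the check is scoped to the set being processed.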

To understand why this check obviates the has_family check, it's important to understand what situations lead to the has_family check in the first place. The simplest grammar I know of to demonstrate this is the following:

Productions
$S ::= A$
$A ::= B | D$
$B ::= b$
$D ::= b$

The following item sets show an in-progress parse of the above grammar. We've scanned a b token and performed completions to produce the second item set. The explicit nodes in the items aren't shown where they aren't of particular interest.

| Set₀ | Set₁ |
| --- | --- |
| $S ::= \cdot A, 0$ | $B ::= b \cdot, 0$ |
| $A ::= \cdot B, 0$ | $D ::= b \cdot, 0$ |
| $A ::= \cdot D, 0$ | $A ::= B \cdot, 0$, Node(A, 0, 1) |
| $B ::= \cdot b, 0$ | $A ::= D \cdot, 0$, Node(A, 0, 1) |
| $D ::= \cdot b, 0$ | $S ::= A \cdot, 0$ |

After we've completed $B$ and $D$, we produce two complete items with $A$ as their subject. These two items share the same node instance Node(A, 0, 1). We have two completions of $(A, 0)$ to do. Both will attempt to complete $S$ and add their nodes as children of Node(S, 0, 1). Without the has_family check, we would add the same Node(A, 0, 1) family to Node(S, 0, 1) twice, producing the following incorrect SPPF. Node(A, 0, 1) should be a child of Node(S, 0, 1) only once.

flowchart TD
    S[S, 0, 1]
    A[A, 0, 1]
    B[B, 0, 1]
    D[D, 0, 1]
    b[b, 0, 1]
    Amb1((.))
    Amb2((.))
    Amb3((.))
    Amb4((.))
    Amb1 --> A
    Amb2 --> A
    S --> Amb1
    S --> Amb2
    Amb3 --> B
    Amb4 --> D
    A --> Amb3
    A --> Amb4
    B --> b
    D --> b

This situation occurs when (1) we have a nonterminal B as the right-most child of an item; and (2) B has multiple productions matching the same input span. By refusing to perform multiple completions of the same (nonterminal, origin, terminus) triple, we prevent derivations beyond the first from being added to another node. A nice benefit is that for certain ambiguous grammars, we do less work by skipping repeated, wasteful completions.

Implementing this duplicate completion check is simpler than the has_family check. The check can be done using a two-dimensional array with one dimension over indices in the input and the other over the (constant-sized) nonterminals in the grammar.
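A sketch of that two-dimensional table, assuming each nonterminal has been assigned a small integer id; as with the cached node set, the table would be reset when advancing to the next item set. The function names are illustrative:

```python
# completed[origin][nonterminal_id] is True once (nonterminal, origin) has
# been completed in the current item set.
def make_completed_table(num_nonterminals, input_length):
    return [[False] * num_nonterminals for _ in range(input_length + 1)]

def try_mark_completed(completed, nonterminal_id, origin):
    """Return True the first time this (nonterminal, origin) completes; False after."""
    if completed[origin][nonterminal_id]:
        return False
    completed[origin][nonterminal_id] = True
    return True
```

The Complete step would then run only when try_mark_completed returns True, replacing the has_family check entirely.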

In my experiments with the C++ grammar and inputs, this had large improvements over the three-dimensional array approach despite being simpler to implement. When parsing 442,590 tokens of Eigen's preprocessed source code on an Apple MacBook Pro M1, the parser using has_family ran in 10.4s. The parser using the already_completed check ran in 3.2s.
