Cody Miller
This is a quick note on Scott and Johnstone's "Recognition is not parsing - SPPF-style parsing from cubic recognisers" paper. It aims to explain one part of the Earley parser implementation of this algorithm and how it can be simplified by trading one problem for another. This is stitched together from my unpublished notes. Think of this less as a tutorial and more as a brain dump.
An interactive tool to parse and display the shared packed parse forests (SPPFs) described in the paper can be accessed here. The implementation there does not currently use the optimized implementation described here (and sketched in the original paper) nor the simplified form given below. It's probably useful to play around with that tool to get a feel for the structure of the SPPF nodes, as they aren't described much here.
I'm not going to do a big picture overview of the algorithm here (for now).
The paper details the complete Earley parsing algorithm. The node creation portion is encapsulated in the `make_node` function, rewritten below in pseudo-code.
```
def make_node(item, terminus, left_node, right_node):
    # Optimize away intermediate nodes that would have a single child
    if item.dot == 1 and not item.is_complete():
        return right_node
    if item.is_complete():
        label = item.subject
    else:
        label = item.slot
    if left_node is None:
        family = [right_node]
    else:
        family = [left_node, right_node]
    node = get_or_create_node(label, item.origin, terminus)
    if len(node.children) == 0:
        node.children = family
    elif not node.has_family(family):
        # A second family makes the node ambiguous: wrap the existing
        # children in a pack node, then add the new family as another pack
        if not isinstance(node.children[0], PackNode):
            node.children = [PackNode(node.children)]
        node.children.append(PackNode(family))
    return node
```
Our presentation differs from the paper's. The following table maps our notation back to the original paper:
| Paper's Notation | Our Notation | Description |
|---|---|---|
| $(B ::= \beta \cdot \delta, j)$ | `item` | An Earley item in set $i$ |
| $B ::= \beta \cdot \delta$ | `item.slot` | Represents progress of parsing some production |
| $(B ::= \beta \cdot, j)$ | A complete item | A fully recognized item |
| position of the $\cdot$ | `item.dot` | The 0-based index of the $\cdot$ within the slot |
| $B$ | `subject` | The nonterminal to the left of the `::=` |
| $j$ | `item.origin` | The index in the input string where an item began |
| $i$ | `terminus` | The index in the input string where an item ended |
| $w$ | `item.node` | The node attached to an item (may be null) |
| $w$ | `left_node` | The node representing an item's parse up to this point |
| $v$ | `right_node` | A node corresponding to a term that was just parsed |
The `make_node` function is responsible for creating new parse nodes that correspond to an item that has just been created. It is called after we have scanned a terminal or completed a nonterminal. Its job is to create a new node that represents the parsed portions of `left_node` and `right_node` together, deduplicating nodes that may be created multiple times due to ambiguities.
It's important to note that all SPPF nodes have a label, an origin, and a terminus. The label may be a slot, a terminal, or a nonterminal. The origin and terminus are indices into the input string, just as with items.
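As a concrete sketch of these shapes (the class layout is my own, not from the paper), the nodes used by the pseudo-code above could be modeled as:

```python
from dataclasses import dataclass, field

# Hypothetical node shapes for the make_node pseudo-code. Every SPPF node
# carries a label and the input span it covers; a pack node exists only to
# group one family of children under an ambiguous parent, so it needs no
# span of its own.
@dataclass
class Node:
    label: object                 # a slot, a terminal, or a nonterminal
    origin: int                   # input index where the node's span begins
    terminus: int                 # input index where the node's span ends
    children: list = field(default_factory=list)

@dataclass
class PackNode:
    children: list                # one family: [left, right] or [right]
```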
This part of the algorithm is tricky to implement. The naive approach is to iterate through all of the node's children and compare them to the current family. This can be slow for particularly ambiguous grammars -- so slow, in fact, that the algorithm can no longer run in cubic time.
```
def has_family_naive(node, family):
    if not isinstance(node.children[0], PackNode):
        return node.children == family
    for pack in node.children:
        if pack.children == family:
            return True
    return False
```
The `has_family` function can actually be executed in constant time. The paper sketches the technique:

> We implemented unit time lookup for SPPF nodes and families by exploiting the fact that a family of $(s, j, i)$ is determined by an item $t = (X ::= \alpha x \cdot \beta)$ and the left index, $k$, of its right child, $(x, k, i)$. (So $s$ will be either $t$ or $X$.) We create a two dimensional array, indexed by $j$ and $s$, whose entries are triples $(L, N, K)$. When an SPPF node $w = (s, j, i)$ is created, the corresponding cell's $L$-field is set to $i$, and the $N$-field is set to point to $w$. When a second family is added to $w$, a vector of possible $k$ values is created, and pointed to by the cell's $K$-field. As each subsequent family is created, the vector's $k$-th element is set to $i$.
I think the logic of why this works is made simpler by thinking about which parts of the SPPF structure are "fixed" and which parts may vary.
Let's assume a call to `make_node` is made with the following values for its arguments:

```
make_node(
    item       = Item(label₁, i₁, j₁),
    terminus   = j₁,
    left_node  = Node(label₂, i₂, j₂),
    right_node = Node(label₃, i₃, j₃))
```
This call has 9 variables (three labels, three origins, and three termini). Our goal is to reduce these to the minimum number of independent variables.
Let's start with labels.
Assume an item `Item(A ::= α β • δ, i, j)` as an example. Its associated node will always have the label `A ::= α β • δ` because the label of a node is determined purely by the slot: the implementation of `make_node` above decides the label based only on `label₁`.
The `left_node` argument will be the node attached to the item that was advanced through Scanning or Completion to produce our current item. The predecessor of our assumed item must have had the slot `A ::= α • β δ`. The node for such an item will be optimized away by the first branch in `make_node`, yielding the node for `α`.

The `right_node` argument is the node that represents the terminal used to execute the Scan step or the nonterminal used to execute the Complete step that produced our current item. The `right_node` of our assumed item must have a label of `β` because that is the only term our assumed item could have been advanced over.

This line of reasoning applies to all possible items and does a lot of lifting for us. Using only `label₁`, we are able to know what `label₂` and `label₃` must be. This makes them dependent variables (or "fixed" values) that can be ignored when seeking out families. In other words, we know that all families of children under a node will have identical labels, so we never need to check them to verify a match.
Let's focus on node indices now.
In the case where `left_node` is not null, `make_node` effectively pastes these two nodes together to produce a new node that spans the contiguous input of `left_node` and `right_node`. In the case where `left_node` is null, we produce a node that spans the input of `right_node`.

Knowing that the pasted span must be contiguous, we know that `left_node` and `right_node` share an index: the terminus of the `left_node` is identical to the origin of the `right_node`. Equationally, we can write `j₂ = i₃`.

The nodes attached to an item cover the same input span as the item itself. This means that the `left_node` (which represents all of the already parsed input matched by this item) must begin where the item itself begins, so `i₂ = i₁`.

By symmetric reasoning, we know that our `right_node` has to end at the same place that the item does, so `j₁ = j₃`.
Using substitution in our call to `make_node` (using `_` as a "don't care, can ignore" wildcard) produces:

```
make_node(
    item       = Item(label₁, i₁, j₁),
    terminus   = j₁,
    left_node  = Node(_, i₁, i₃),
    right_node = Node(_, i₃, j₁))
```
Overall, this means we have four pieces of independent information to consider:

- `label₁`, corresponding to the item and node label
- `i₁`, corresponding to the item and node origin
- `j₁`, corresponding to the item and node terminus
- `i₃`, corresponding to where the two nodes meet (or, equivalently, where the predecessor item ended and the current item resumed), also called the "midpoint"
We can reduce this a bit more.
The Earley parser (and usually the recognizer) completely processes one item set before advancing to the next through scanning, making the `j₁` terminus constant throughout the entirety of an item set. If we clear the cached node set (the set that implicitly backs the `get_or_create_node` function) before advancing to the next item set, we can consider `j₁` to be fixed and ignorable.
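A minimal sketch of this arrangement (the helper names are mine, not from the paper): with `j₁` fixed per item set, the node cache only needs `(label, origin)` as its key and is wiped at each set boundary.

```python
# Hypothetical sketch of a per-item-set node cache. Because every node
# created while processing one item set shares the same terminus j₁, the
# cache key omits it; clearing the cache when the parser advances to the
# next set keeps lookups correct.
class Node:
    def __init__(self, label, origin, terminus):
        self.label, self.origin, self.terminus = label, origin, terminus
        self.children = []

node_cache = {}

def get_or_create_node(label, origin, terminus):
    # terminus is the index of the current item set; it is identical for
    # every call between two cache clears, so it is not part of the key.
    key = (label, origin)
    if key not in node_cache:
        node_cache[key] = Node(label, origin, terminus)
    return node_cache[key]

def advance_to_next_item_set():
    # Called when scanning moves the parser to the next item set.
    node_cache.clear()
```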
This leaves us with three pieces of data that can uniquely identify a node and its families: `label₁`, `i₁`, `i₃`; or `item.label`, `item.origin`, `right_node.origin`.

- The `item.label` is either a slot or a nonterminal, both of which are constant according to the grammar. These can be given unique numeric indexes, allowing them to be used as integer keys into a fixed-size array.
- The `item.origin` spans the entirety of the input and can be represented as an array.
- The `right_node.origin` similarly spans the entirety of the input.
This suggests the three-dimensional array the original paper recommended. In my own experience implementing this structure, I found that their suggestion to lazily create the third dimension corresponding to the `right_node.origin` was crucial to getting acceptable performance.
```
def has_family_optimized(node, origin, midpoint):
    first = nodes[node.label]
    if first is None:
        return False
    second = first[origin]
    if second is None:
        return False
    # Assuming a dynamic language, we create the third dimension lazily.
    # The entry holds the midpoint itself after the first family is added
    # and is promoted to an array of values when a second family arrives.
    if isinstance(second, Array):
        return second[midpoint] is not None
    else:
        return second == midpoint
```
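`has_family_optimized` only answers the membership query; a matching record step has to maintain the structure as families are added. The sketch below is my own simplification, using Python dicts in place of the lazily allocated arrays (and a set of midpoints in place of the paper's $K$ vector):

```python
# nodes[label][origin] holds either the single midpoint seen so far (an
# int) or a set of midpoints once the node has acquired multiple families.
nodes = {}

def has_family(label, origin, midpoint):
    entry = nodes.get(label, {}).get(origin)
    if entry is None:
        return False
    if isinstance(entry, set):
        return midpoint in entry
    return entry == midpoint

def record_family(label, origin, midpoint):
    per_label = nodes.setdefault(label, {})
    entry = per_label.get(origin)
    if entry is None:
        per_label[origin] = midpoint            # first family: store the int
    elif isinstance(entry, set):
        entry.add(midpoint)                     # already ambiguous: extend
    else:
        per_label[origin] = {entry, midpoint}   # second family: promote to set
```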
Even though we're able to implement `has_family` in constant time, the three-dimensional array structure enabling this is unsatisfyingly complicated.

If we rethink the problem and alter the Earley algorithm, we can avoid the `has_family` check by doing a different check entirely. I think this different check is easy to understand and easy to implement.
The revision to the algorithm is in the Complete step:
- Complete: Given `Item(B ::= ... •, i, j)`, add `Item(A ::= ... B • ..., h, j)` to the chart for every `Item(A ::= ... • B ..., h, i)` in the chart, if `(B, i)` has not been completed in set `j`.
The Complete step is identical to the Earley algorithm except that it first checks to see if the nonterminal being completed for the same item origin has already been done; if it has, it doesn't bother doing the Complete step again.
To understand why this check obviates the `has_family` check, it's important to understand what situations lead to the `has_family` check in the first place. The simplest grammar I know of that demonstrates this is the following:
| Productions |
|---|
| `S ::= A` |
| `A ::= B` |
| `A ::= D` |
| `B ::= b` |
| `D ::= b` |
The following item sets show an in-progress parse of the above grammar. We've scanned a `b` token and performed completions to produce the second item set. The explicit nodes in the items aren't shown where they aren't of particular interest.
| Set₀ | Set₁ |
|---|---|
| `S ::= • A, 0, 0` | `B ::= b •, 0, 1` |
| `A ::= • B, 0, 0` | `D ::= b •, 0, 1` |
| `A ::= • D, 0, 0` | `A ::= B •, 0, 1` |
| `B ::= • b, 0, 0` | `A ::= D •, 0, 1` |
| `D ::= • b, 0, 0` | `S ::= A •, 0, 1` |
`A` is completed twice: once through `B` and once through `D`. Each completion of `Node(A, 0, 1)` triggers a completion producing `Node(S, 0, 1)`. Without the `has_family` check, we would add the same `Node(A, 0, 1)` family to `Node(S, 0, 1)` twice. This produces the following incorrect SPPF: `Node(A, 0, 1)` should only be a child of `Node(S, 0, 1)` a single time.
```mermaid
flowchart TD
    S[S, 0, 1]
    A[A, 0, 1]
    B[B, 0, 1]
    D[D, 0, 1]
    b[b, 0, 1]
    Amb1((.))
    Amb2((.))
    Amb3((.))
    Amb4((.))
    Amb1 --> A
    Amb2 --> A
    S --> Amb1
    S --> Amb2
    Amb3 --> B
    Amb4 --> D
    A --> Amb3
    A --> Amb4
    B --> b
    D --> b
```
This situation occurs when (1) we have a nonterminal `B` as the right-most child of an item; and (2) `B` has multiple productions matching the same input span. By refusing to perform multiple completions of the same `(nonterminal, origin, terminus)` triple, we prevent derivations beyond the first from being added to another node. A nice benefit is that, for certain ambiguous grammars, we do less work by skipping repeated and wasteful completions.
Implementing this duplicate completion check is simpler than the `has_family` check. The check can be done using a two-dimensional array with one dimension over indices in the input and the other over the (constant-sized) nonterminals in the grammar.
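A sketch of that guard (helper names are mine, not from the paper); here a set stands in for the two-dimensional array, and it is cleared when the parser moves to the next item set:

```python
# Hypothetical duplicate-completion guard: one entry per (nonterminal,
# origin) pair completed within the current item set.
already_completed = set()

def should_complete(nonterminal, origin):
    """Return True the first time (nonterminal, origin) is completed in
    the current item set, recording it so later attempts are skipped."""
    key = (nonterminal, origin)
    if key in already_completed:
        return False
    already_completed.add(key)
    return True

def reset_for_next_item_set():
    # The guard only needs to deduplicate within a single item set.
    already_completed.clear()
```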
In my experiments with the C++ grammar and inputs, this yielded a large improvement over the three-dimensional array approach despite being simpler to implement. When parsing 442,590 tokens of Eigen's preprocessed source code on an Apple MacBook Pro M1, the parser using `has_family` ran in 10.4s, while the parser using the `already_completed` check ran in 3.2s.