Skip to content

Instantly share code, notes, and snippets.

@jeffreykegler
Last active December 15, 2015 02:09
Show Gist options
  • Save jeffreykegler/5185355 to your computer and use it in GitHub Desktop.
Save jeffreykegler/5185355 to your computer and use it in GitHub Desktop.
A plan for allowing external scanners with the SLIF interface

Name

External SLIF Scanning - External scanning for the Scanless interface

About this document

In this document, use of the Scanless interface's (SLIF's) internal scanner is called internal scanning. The current SLIF only allows internal scanning. This document describes an interface that allows the use of an external scanner. Use of the external scanner is called external scanning.

The term "scanless" interface is from the parsing literature. It is sometimes criticized as misleading because, of course, an internal scanner is a scanner.

But in the parsing theory literature, the context of the term is the classic strict definition of parsing, which required a separate, external scanner as a previous phase. In that context, "scannerless parser" meant a parser which did not require a separate, external scanner.

Design goals

The interface is designed to be implemented more quickly than more fully featured alternatives: It is low-level but open-ended.

This interface allows the user to bring in, or to create, other scanning tools. The interface allows the SLIF's internal scanner to by completely bypassed in favor of an external scanner. It also allows internal and external scanners to take turns reading the input string.

The other tools can be arbitrarily powerful. Technically, in this interface, they will always be "children" of a "parent" Marpa parser, but the boundary between scanning and parsing is malleable enough that, if an application thinks it necessary, Marpa and the other tools can in effect share the task of parsing.

Methods for internal scanning

Changes to Marpa::R2::Scanless::R::read().

The Marpa::R2::Scanless::R::read() method remains, but has two additional arguments, so that it can be called as $slr->read($input_string, $start, $length). The two new arguments, $start and $length, are optional.

The first argument, $input_string, is as before. As a reminder, this first argument sets the input string for the SLIF. The input string can only be set once for an SLIF recognizer and cannot be changed once set.

In $slr->read($input_string, $start, $length), $start is the position in the input string to begin at, and $length is the number of codepoints to read. Negative lengths follow the usual convention so that -1 is the length of the input string. $start defaults to the "current" position, and $length defaults to -1, so that the default In $slr->read($input_string) continues to behave as currently implemented.

At present read() is documented as returning a non-negative integer. In this interface, its return value will be the new current position, as defined in the next section.

Current position

The current position in the input stream, more usually simply called the current position, is a position in the input string. Positions are zero-based, so that position 0 points to the first character, if it exists. A current position equal to the length of the input string indicates "end of string", and does not point to any character.

In this interface, when the current position moves, it does not necessarily advance -- it can skip forward, or can be positioned to an earlier location. The application can skip sections of the input string. The application is also free to revisit spans of the input string as often as it wants.

Here are the guarantees:

  • Initially, the current position is 0.

  • The current position will never be negative.

  • The current position will never be past the "end of string".

Current position can be queried with $slr->pos() and set with $slr->pos_set($new_pos).

How internal scanning works

After the read() method, the current position will indicate how far in the input stream read() actually read.

  • As already stated, the current position will never advance past the end of string.

  • If a non-negative $length argument was specified to read(), read() will not read past a current position of $start+$length.

  • If a negative $length argument was specified to read(), read() will not read past a current position of $length + 1 + length $input_string.

  • The application can specify that the read will "break" when it encounters certain lexemes, as specified below.

  • An abend in the read() will cause the current position to be an unspecified value.

New $slr->resume($start, $length) method

The $slr->resume($start, $length) method resumes the SLIF's internal scanning. The scan continues as specified by $start and $length, according to the rules for internal scanning, as described above.

Methods for external scanning

New $slr->g1_alternative($symbol, $value) method

The $slr->g1_alternative($symbol, $value) method bypasses G0, reading an alternative directly into G1 with symbol name $symbol, lexeme value $value.

New $slr->g1_lexeme_complete($start, $length) method

A new $slr->g1_lexeme_complete($start, $length) method will correspond to an earleme_complete() for G1. Current position in the input stream will be moved to $start+$length.

New $slr->g1_read_lexeme($symbol_name, $value, $start, $length) method

A new $slr->g1_read_lexeme($symbol_name, $value, $start, $length) method will be the equivalent of

$slr->g1_alternative($symbol, $value)
$slr->g1_lexeme_complete($start, $length)

Current position in the input stream will be moved to $start+$length.

The input string

For error message purposes, even external lexemes are required to correspond to a span of the input string. If an application uses external lexemes which have no natural relationship to a span of the input string, it must arrange an artificial relationship.

One way to do this is to have a artificial preamble to the real text. For example, the first 7 characters of the input string could be the artificial preamble containing the characters "NO TEXT", which is then followed by what the application considers the "real" text. In this case, the first read() could take the form $slr->read($input_string, 7). Lexemes corresponding to the artificial preamble would be read with method calls similar to $slr->g1_read_lexeme($symbol_name, $value, 0, 42).

Breaking away from the internal scan

:lexeme psueudo-rules

:lexeme ::= variable break => reject

The SLIF DSL will be extended to include lexeme pseudo-rules. Initially, their purpose would be to support the break adverb, as described in the next section.

As a side effect, the :lexeme will allow applications to declare symbols to be lexemes. The previous definition of lexeme will continue to apply, so that the only effect of these "lexeme declarations" will be to cause a fatal error if the declared lexeme is not a lexeme according to the previous rules. Applications may find this behavior useful for debugging, and for documentating their grammars.

The break adverb

The new :lexeme pseudo-rules of the SLIF DSL will allow a new break adverb. If break is set, a Marpa::R2::Scanless::R::read() will stop when that lexeme is encountered. Possible values of break are accept, reject and skip. If the value of break is accept, the lexeme is read by G1. Otherwise, the lexeme will be discarded.

Multiple lexemes may be found at a single G1 parse location, and these may have any mixture of break adverbs. The following rules govern the location of the current position in the input stream when internal scanning is broken off because of one or more lexemes with a break setting:

  • If the break value of any of the lexemes causing the break is either accept or skip, the current position in the input stream will be the position where the lexeme ended.

  • If the break value of all of the lexemes causing the break is reject, the current position in the input stream will be the position where the lexeme started.

Capabilities

The intent of the design is that possibilities for external scanning be extremely open-ended. The following ideas are intended to be suggestive rather than exhaustive.

Replacing the Stuifzand (BNF) interface

In a pedantic sense, the SLIF must always start off with the internal scanner, but it can end the internal scan after zero codepoints, One way to do this is with the following code:

$slr->read($input_string, 0, 0)

Once an application has switched to external scanning, it need never switch back. This means that a separate Stuifzand (BNF) interface no longer serves any purpose. The BNF interface will be de-emphasized, and the documentation rearranged to reflect this.

Lookahead and backtracking

Most of the desire to use lookahead and backtracking with Marpa comes from an underappreciation of its capabilities. Nonetheless, there are some situations where lookahead and/or backtracking could be genuinely useful in combination with Marpa. Both can be performed in an external scanner.

Sub-parsers

An external scanner could, in fact, be a sub-parser. The sub-parser could be another, more specialized, Marpa grammar. Or it could be another grammar that is thought to work better for portions of the parse, hooked up as a sub-parser. For example, the application could switch back and forth between Marpa and regular expressions, or between Marpa and recursive descent.

G0 actions

Currently the G0 grammar cannot perform actions -- it can only recognize literal lexemes. As a special case of sub-parser use, another Marpa parser can be hooked up that performs actions and returns the result to the parent grammar via g1_read.

Copyright and License

Copyright 2013 Jeffrey Kegler
This file is part of Marpa::R2.  Marpa::R2 is free software: you can
redistribute it and/or modify it under the terms of the GNU Lesser
General Public License as published by the Free Software Foundation,
either version 3 of the License, or (at your option) any later version.

Marpa::R2 is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
Lesser General Public License for more details.

You should have received a copy of the GNU Lesser
General Public License along with Marpa::R2.  If not, see
http://www.gnu.org/licenses/.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment