TL;DR: Regression tests for the SuperCollider lexer, parser, and interpreter.
Brian Heim, 28 Jan 2017
This document proposes the form and high-level implementation details of a series of rigorous regression tests for the SuperCollider lexer, parser, and interpreter. Although these are treated as three separate components for the purposes of testing, testing one component through the compiled language naturally involves some testing of the other two.
Lexer: essentially, the method yylex() in PyrLexer.cpp. The component that handles character-to-character understanding of an input text, identifying tokens for the parser.
Parser: the method yyparse(), generated by the Bison file lang11d. The component that handles token-to-token understanding of an input text, identifying parse nodes for the interpreter.
Interpreter (or compiler): the component that transforms the AST generated by the parser into a sequence of byte codes to be interpreted. The code for this component lies in PyrParseNode.cpp.
There are three general approaches to testing:
- Full component stress-testing
- Targeted sub-component stress-testing
- Pseudorandom testing
Full component stress-testing involves passing the component input consisting of all possible strings of a given length. In addition to verifying the component's behavior over a clearly bounded input space, this stress-tests garbage handling.
Targeted sub-component stress-testing means testing a specialized behavior of the component, again with all possible strings of a given length, but limiting the alphabet and/or grammar so that a longer string length can be tested.
Pseudorandom testing means traversing a very large input space by passing a random (usually uniformly distributed) subset of the possible input strings. Imagine the string as a number written in base (alphabet size), with a maximum value of (alphabet size)^(string length). For example, if the alphabet size is 16 and the desired string length is 6, a good pseudorandom test might involve starting with the string represented by 0x000000, then testing every "number" that is a multiple of a number coprime with the alphabet size, such as 11 or 13.
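As a sketch (the alphabet, length, and stride here are all placeholders, not part of any existing test class), converting a numeric index into its string and stepping by multiples of the coprime number might look like this in sclang:

```supercollider
(
var alphabet = "0123456789abcdef";   // placeholder 16-character alphabet
var stringLen = 6;
var total = (alphabet.size ** stringLen).asInteger;   // (alphabet size)^(string length)
var stride = 11;                      // coprime with the alphabet size
var indexToString = { |index|
    var str = String.new;
    stringLen.do {
        str = str.add(alphabet[index % alphabet.size]);
        index = index.div(alphabet.size);
    };
    str
};
// visit only the "numbers" that are multiples of the stride
0.forBy(total - 1, stride, { |i| indexToString.(i) /* ...feed to the test... */ });
)
```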
A reasonable number of inputs for any stress test is (in the author's opinion) 10-20 million. This results in the following (soft) limits on alphabet size, assuming there are no limitations on grammar (i.e., all combinations are to be tested):
String length | Max alphabet size (10 million tests) | Max alphabet size (20 million tests) |
---|---|---|
1 | 10000000 | 20000000 |
2 | 3162 | 4472 |
3 | 215 | 271 |
4 | 56 | 67 |
5 | 25 | 29 |
6 | 15 | 16 |
7 | 10 | 11 |
8 | 7 | 8 |
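These soft limits come from requiring (alphabet size)^(string length) ≈ (number of tests), i.e. alphabet size ≈ (number of tests)^(1 / string length); rounding to the nearest integer reproduces the table:

```supercollider
// Approximate maximum alphabet sizes for 10 and 20 million tests.
(1..8).do { |len|
    var tenM = (10e6 ** len.reciprocal).round.asInteger;
    var twentyM = (20e6 ** len.reciprocal).round.asInteger;
    "length %: % / %".format(len, tenM, twentyM).postln;
};
```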
For the lexer, the alphabet is character literals in the range -128 to 127, excluding 0 (a sketch of this alphabet follows the list below).
- Full component test of strings up to 3 characters in length
- Sub-component test of literal recognition (symbols, floats, radix floats, etc.) of strings 4+ characters in length. (By my reckoning, the longest relevant string worth testing is 7 characters: -1.1e-5)
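A minimal sketch of constructing this alphabet in sclang; the variable name is illustrative, and it assumes .asAscii wraps negative values to the corresponding unsigned byte:

```supercollider
(
// Full-component alphabet: every nonzero signed byte value, as a Char.
var lexerAlphabet = (-128..127).reject({ |n| n == 0 }).collect({ |n| n.asAscii });
// A full test of length-3 strings is 255 ** 3, about 16.6 million inputs,
// which sits inside the 10-20 million budget above.
[lexerAlphabet.size, lexerAlphabet.size ** 3].postln; // -> [255, 16581375.0]
)
```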
For the parser, the alphabet is any symbol recognized by Bison, plus some duplicates to test more complex behavior. The specific alphabet is TBD; it includes many of the 41 tokens explicitly defined at the top of lang11d, plus []{}();, and maybe more. (A guess at such an alphabet is sketched after the list below.)
- Full component tests of strings up to length 4
- Sub-component tests??
- Pseudorandom testing of strings up to length 10 (?)
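Since the specific alphabet is TBD, the following is only a guess at what it might contain: token spellings rather than single characters, with candidate inputs built by joining tokens with spaces.

```supercollider
(
// Hypothetical parser alphabet: token spellings, not single characters.
var parserAlphabet = [
    "var", "arg", "classvar", "const", "nil", "true", "false",
    "x", "y",            // two variable names
    "Object", "Array",   // two real class names
    "[", "]", "{", "}", "(", ")", ";", ",", ".", "=", "+", "-", "|", "..", "..."
];
// A candidate input of length 3:
["var", "x", ";"].join(" ").postln; // -> "var x ;"
)
```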
The compiler has no specific alphabet. Rather, these tests will reuse input spaces from the lexer and parser tests and surround them with { }.def.code to get the byte code array.
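A sketch of one plausible reading of that wrapping (the input string here is a placeholder):

```supercollider
(
// Wrap the raw input in a function literal and ask for its byte codes.
var input = "1 + 1";   // placeholder: one string drawn from the lexer/parser input space
var result = ("{ " ++ input ++ " }.def.code").interpret;
result.postln;         // an Int8Array of byte codes, or nil if compilation failed
)
```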
In an output file, store the result of the regression test for comparison with pre-validated output. The storage format should include some sort of compression, since many strings in the lexer and parser tests are likely to be garbage and fail, producing nil. One idea is to count the number of consecutive strings that produce the same result and encode that count as a base-10 number at the end of the current output line. An output-comparing utility method would have to be written (and tested!) to properly decode this.
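A minimal sketch of the run-length idea; the helper name and output format are hypothetical:

```supercollider
(
// Collapse runs of identical result strings into "<value> <count>" lines.
var compress = { |results|
    var lines = List.new;
    var current = results.first, count = 0;
    results.do { |r|
        if (r == current) {
            count = count + 1;
        } {
            lines.add(current ++ " " ++ count);
            current = r;
            count = 1;
        };
    };
    lines.add(current ++ " " ++ count);
    lines
};
compress.(["nil", "nil", "nil", "Integer", "nil"]).postln;
// three output lines: "nil 3", "Integer 1", "nil 1"
)
```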
The current idea is to give the input string as a hex sequence (to make sure that non-printing characters are correctly shown), followed by the class of the result of interpretation or compilation. In the case of the lexer, this could be both the class of the object returned by .interpret and the result of .asString called on that object.
Strings should be encoded as strings of hex character values to ensure they do not interfere with delimiters, and so that non-printing characters are made visible.
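A sketch of such an encoding, assuming two hex digits per character (the helper name is illustrative; the bitAnd guards against negative char values):

```supercollider
(
// Encode each character as two hex digits so delimiters and non-printing
// characters cannot corrupt the output format.
var hexEncode = { |str|
    str.ascii.collect({ |n| n.bitAnd(16rFF).asHexString(2) }).join
};
hexEncode.("a c").postln; // -> "612063"
)
```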
For the main test, the string should be interpreted (via .interpret) in three ways:
- first, the string alone;
- second, the string prefixed by a known and easily identified expression;
- third, the string suffixed by a known and easily identified expression.
Other tests do not need to be that rigorous.
Purpose: since we cannot directly test the output of the lexer without significant effort, we can use this to gain some insight into the lexer's behavior. Compare the results of these tests when run on a few test strings:
Test String | str | "\\before;" ++ str | str ++ ";\\after" |
---|---|---|---|
"//" | nil | before | nil |
"10" | 10 | 10 | after |
"abc" | nil | nil | nil |
"\\\'" | (empty symbol) | (empty symbol) | (empty symbol) |
Thoughts: we could also suffix with "\after" to catch inoperative strings like " ", or surround with { } (but this gives less information). How many cases should we limit ourselves to?
Component tests: make little grammars to test radix notation, accidental notation, hex notation, exponential notation, and unholy mixtures of those. Testing should focus on these because the interactions among the notations are complicated.
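For instance (purely illustrative; the real alphabets are still to be chosen), a targeted literal-notation alphabet might be:

```supercollider
(
// Hypothetical restricted alphabet: digits, hex digits, radix/hex/accidental/
// exponent markers, sign, and decimal point.
var literalAlphabet = "0123456789abcdefrsxE.-+";
// 23 characters allows full testing of strings up to length 5:
(literalAlphabet.size ** 5).postln; // -> 6436343.0, about 6.4 million inputs
)
```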
Pseudorandom tests: maybe not needed for the lexer, but could lead to something interesting.
Error handling: some inputs will cause an error when interpreted, not lexed. This should be handled by surrounding the .interpret call with a try-catch. If the catch block is entered, set the result to some unique identifier like \error and use that as the recorded output.
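A sketch of that error handling; \error follows the text above, while the helper name is hypothetical:

```supercollider
(
// Interpret one candidate string, trapping errors raised during evaluation.
var safeInterpret = { |str|
    try {
        str.interpret.class
    } { |err|
        \error   // unique identifier recorded when the catch block is entered
    }
};
safeInterpret.("1 + 1").postln;   // -> Integer
safeInterpret.("nil.foo").postln; // -> error (the runtime error is trapped)
)
```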
Tokens from Bison:
%token NAME INTEGER SC_FLOAT ACCIDENTAL SYMBOL STRING ASCII PRIMITIVENAME CLASSNAME CURRYARG
%token VAR ARG CLASSVAR SC_CONST
%token NILOBJ TRUEOBJ FALSEOBJ
%token PSEUDOVAR
%token ELLIPSIS DOTDOT PIE BEGINCLOSEDFUNC
%token BADTOKEN INTERPRET
%token BEGINGENERATOR LEFTARROW WHILE
%left ':'
%right '='
%left BINOP KEYBINOP '-' '<' '>' '*' '+' '|' READWRITEVAR
%left '.'
%right '`'
%right UMINUS
The alphabet should have at least two variable names and two real class names to test multiple declarations, and should also include punctuation characters not shown here.
We can probably test strings up to length 4; we might want to do the same tests both raw and within { }.
Some important methods here are .interpret, .compile, and .def.code. Which to use when is still up in the air.
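For reference, the three calls behave as follows on a trivial input (this is standard sclang behavior, not something specific to the proposed tests):

```supercollider
"1 + 1".interpret;   // -> 2 (evaluates the string)
"1 + 1".compile;     // -> a Function (compiled but not yet evaluated)
{ 1 + 1 }.def.code;  // -> an Int8Array of byte codes for the function
```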
A lot can be reused among these three sets of tests; the abstracted inputs are essentially the alphabet, the string length, and the method used to evaluate each string.
There should also be methods that test the string prefixed, suffixed, and enclosed with given input strings.
The input string should be represented in a compressed format that appears uniform to the human eye; hex seems like a logical choice. Each output line would then pair the hex-encoded input string with its <output value(s)>.
Pseudocode for one lexer test:
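A sketch in sclang; all names are hypothetical, and the prefix/suffix strings and hex encoding follow the earlier sections:

```supercollider
(
// One lexer test: evaluate the string alone, prefixed, and suffixed,
// then record the hex-encoded input alongside the three result classes.
var testOneString = { |str|
    var evaluate = { |code|
        try { code.interpret.class } { \error }
    };
    var results = [
        evaluate.(str),                 // the string alone
        evaluate.("\\before;" ++ str),  // prefixed by a known expression
        evaluate.(str ++ ";\\after")    // suffixed by a known expression
    ];
    str.ascii.collect({ |n| n.bitAnd(16rFF).asHexString(2) }).join + results.join(" ")
};
testOneString.("10").postln; // -> "3130 Integer Integer Symbol"
)
```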
Pseudocode for testAllCombinationsWithInterpret:
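A sketch of the combination loop; the alphabet, string length, and per-string test function are parameters, and the index-to-string conversion treats each string as a number in base (alphabet size):

```supercollider
(
// Visit every string of a given length over the alphabet, in counting order.
var testAllCombinationsWithInterpret = { |alphabet, stringLen, func|
    var total = (alphabet.size ** stringLen).asInteger;
    total.do { |index|
        var str = String.new;
        stringLen.do {
            str = str.add(alphabet[index % alphabet.size]);
            index = index.div(alphabet.size);
        };
        func.(str);   // e.g. the single-string test above, appended to the output file
    };
};
// example: all 9 length-2 strings over a 3-character alphabet
testAllCombinationsWithInterpret.("abc", 2, { |s| s.postln }); // aa, ba, ca, ab, ...
)
```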
Similar ones would then be written for .compile and .def.code.
One problem with the above: what if "for each possible string" is too complex a loop to house in this one function? (I think it can be done iteratively, though; just check for rollover/overflow.)
It is important to be able to switch out parameters easily: each test + stringLength combination should be its own unit/file.
Other convenience ideas: