Version 0.10.
Authors:
- SoniEx2
ClEx is a regex-like pattern matching language that attempts to K.I.S.S. Basically, everything is a match class. It is also highly flexible and easily extended.
ClEx was designed for use with binary data. Attempting to match non-binary data with ClEx may be met with discrimination and bigotry.
A Simple ClEx Pattern is just a string with no "special characters". Example: example.
Everything from a simple character to a group capture is a "match class". All matches are comparable, and return a number < 0 for less than, == 0 for equal, > 0 for greater than.
Quantifiers are the most basic feature of ClEx. They work differently from the ones in regex so we can KISS.
The + quantifier creates a match class that matches 1 or more times, always matching as much as possible. Returns the result of the first match attempt.
The - quantifier creates a match class that matches 1 or more times, always matching as little as possible. Returns the result of the first match attempt.
The ? quantifier creates a match class that matches 1 or 0 times, in that order. Thus, the construction a+? matches "a" 0 or more times in a greedy way, and the construction a-? matches "a" 0 or more times in a non-greedy way. Returns 0.
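The quantifier rules above can be sketched with a few continuation-passing matchers. This is only an illustration under assumptions: the names lit, seq, plus, minus, opt and end_of_match are hypothetical, the sketch assumes the engine may backtrack (the spec does not say so), and it models only how much input each quantifier consumes, not the comparison result a match class returns.

```python
# Hypothetical sketch, not part of the ClEx spec.
# A matcher is called as matcher(data, pos, k). It returns whatever the
# continuation k returns on success, or None when no match is possible.

def lit(byte):
    """Match a single literal byte."""
    def attempt(data, pos, k):
        if pos < len(data) and data[pos] == byte:
            return k(pos + 1)
        return None
    return attempt

def seq(first, second):
    """Match `first`, then `second`."""
    def attempt(data, pos, k):
        return first(data, pos, lambda p: second(data, p, k))
    return attempt

def plus(item):
    """+ : 1 or more times, trying another repetition before moving on."""
    def attempt(data, pos, k):
        def after(p):
            more = attempt(data, p, k) if p > pos else None
            return more if more is not None else k(p)
        return item(data, pos, after)
    return attempt

def minus(item):
    """- : 1 or more times, trying to move on before repeating."""
    def attempt(data, pos, k):
        def after(p):
            done = k(p)
            if done is not None:
                return done
            return attempt(data, p, k) if p > pos else None
        return item(data, pos, after)
    return attempt

def opt(item):
    """? : 1 time, then 0 times, in that order."""
    def attempt(data, pos, k):
        once = item(data, pos, k)
        return once if once is not None else k(pos)
    return attempt

def end_of_match(matcher, data):
    """Return the position where the match ends, or None."""
    return matcher(data, 0, lambda p: p)

a, b = lit(ord("a")), lit(ord("b"))
print(end_of_match(plus(a), b"aaab"))           # greedy: consumes all three a's
print(end_of_match(minus(a), b"aaab"))          # non-greedy: stops after one a
print(end_of_match(seq(minus(a), b), b"aaab"))  # grows only as far as needed
print(end_of_match(opt(plus(a)), b"bbb"))       # a+? may match zero times
```

Composing opt with plus or minus is what gives the "0 or more" behaviour described for a+? and a-?.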
Groups are created by enclosing anything between ( and ). For example, (abc) is a group that matches the string "abc". Comparing groups is simple: (abc) compares to the string "abc" as 0, "abb" as a positive value, and "abd" as a negative value. The most significant (leftmost) element is compared first, thus (abc) compares to "aad" as a positive value. Groups also capture their contents. As a special case, an empty group () captures the current position. Groups can be made non-capturing by adding a * right after the (.
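A minimal sketch of the group comparison described above, assuming the group contains only literal bytes and that the (expected - found) sign convention is in use. The function name group_cmp and the choice to return None when the input ends early are assumptions for illustration, not part of the spec.

```python
def group_cmp(expected: bytes, data: bytes, pos: int = 0):
    """Compare a literal group against data starting at pos.

    Returns 0 when equal, otherwise the first non-zero (expected - found)
    difference, so the leftmost differing byte decides the sign.
    Returns None if the input ends before the group does (assumed behaviour).
    """
    if pos + len(expected) > len(data):
        return None
    for offset, want in enumerate(expected):
        diff = want - data[pos + offset]
        if diff != 0:
            return diff
    return 0

print(group_cmp(b"abc", b"abc"))  # equal
print(group_cmp(b"abc", b"abb"))  # positive
print(group_cmp(b"abc", b"abd"))  # negative
print(group_cmp(b"abc", b"aad"))  # positive: the second byte decides
```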
Groups can be "backmatched" by using the < modifier. This will make the last non-0 comparison more significant than the first. This, combined with sets, is useful when matching little-endian integers in big-endian strings.
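One possible reading of backmatching, sketched below: walk the group left to right as usual, but let the last non-zero difference decide the result. The name backmatch_cmp is hypothetical, the inputs are assumed to have equal length, and the (expected - found) convention is assumed.

```python
def backmatch_cmp(expected: bytes, found: bytes) -> int:
    """Return the LAST non-zero (expected - found) difference, or 0 if equal."""
    result = 0
    for want, got in zip(expected, found):
        diff = want - got
        if diff != 0:
            result = diff  # later bytes override earlier ones
    return result

# Two 2-byte little-endian integers: the last byte is the most significant.
expected = (513).to_bytes(2, "little")  # b"\x01\x02"
found = (258).to_bytes(2, "little")     # b"\x02\x01"
print(backmatch_cmp(expected, found))   # positive, because 513 > 258
```

With this ordering the sign of the result agrees with the numeric comparison of the two little-endian values, which is what makes it usable inside ranges.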
ClEx supports ranges and alternations, collectively called "sets". Sets are made by putting a [, then things to match, then ]. Empty sets are allowed and match the empty string. A range is made by putting a : between things to match. A set can be negated with a *.
A set that starts with * matches anything "not in set". Thus, the set [*(abc)(123)(welp)] wouldn't match "abcd", "1234", "welp", but would match "help", "1334", "aelp".
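The negated set example can be checked with a small sketch. It assumes every alternative is a literal group tested at the current position, and it only answers whether the set matches there (how much input a negated set consumes is not covered). The names group_matches and negated_set_matches are hypothetical.

```python
def group_matches(literal: bytes, data: bytes, pos: int = 0) -> bool:
    """True if the literal group occurs in data exactly at pos."""
    return data[pos:pos + len(literal)] == literal

def negated_set_matches(alternatives, data: bytes, pos: int = 0) -> bool:
    """True if none of the alternatives match at pos."""
    return not any(group_matches(alt, data, pos) for alt in alternatives)

# Sketch of [*(abc)(123)(welp)]
alternatives = [b"abc", b"123", b"welp"]
for text in (b"abcd", b"1234", b"welp", b"help", b"1334", b"aelp"):
    print(text, negated_set_matches(alternatives, text))
```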
Range matching is done by comparing the start and end matches. For example, the range [(aaa):(acc)] would match "aaa", "acc", "aac", but not "acd", "bbb". (aaa) compares "aaa" < "aac", and (acc) compares "acc" > "aac". It would also match aa\xFF; this is intended.
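A sketch of the range check for [(aaa):(acc)], assuming literal groups and the (expected - found) convention. The helper names group_cmp and in_range are hypothetical. The lower bound is checked first and the upper bound is only evaluated when the lower bound passes.

```python
def group_cmp(expected: bytes, data: bytes, pos: int = 0):
    """First non-zero (expected - found) difference, 0 if equal, None if data is too short."""
    if pos + len(expected) > len(data):
        return None
    for offset, want in enumerate(expected):
        diff = want - data[pos + offset]
        if diff != 0:
            return diff
    return 0

def in_range(lower: bytes, upper: bytes, data: bytes, pos: int = 0) -> bool:
    """True if the input at pos lies between lower and upper, inclusive."""
    low = group_cmp(lower, data, pos)
    if low is None or low > 0:
        return False  # input < lower: the upper limit is never evaluated
    high = group_cmp(upper, data, pos)
    return high is not None and high >= 0

for text in (b"aaa", b"acc", b"aac", b"acd", b"bbb", b"aa\xff"):
    print(text, in_range(b"aaa", b"acc", text))
```

The aa\xFF case falls inside the range because its second byte already sorts below the upper bound, so the third byte never gets a say on that side.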
As a special case, [x:y:z], where x, y and z are match classes, is semantically equivalent to [x:z], although the latter is strongly preferred. (This might change in a later version.)
Ranges follow short-circuit evaluation: If the lower limit of a range doesn't apply (i.e. input < lower), we don't evaluate the upper limit.
Attempting to use a negated set in a less-than or greater-than comparison (as is the case with ranges) is undefined.
A set that doesn't match shall return the last attempt's result.
Matching at start and end of string is done with ^ and $, respectively. ^ returns the current position (thus 0 when position = 0 = start of string), $ returns length-position (thus 0 when position = length = end of string).
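The anchor return values can be written down directly. This is a tiny sketch with hypothetical function names; as with every other match class, a result of 0 means the anchor matched.

```python
def caret(pos: int) -> int:
    """^ : returns the current position, which is 0 only at the start."""
    return pos

def dollar(pos: int, length: int) -> int:
    """$ : returns length - position, which is 0 only at the end."""
    return length - pos

data = b"hello"
print(caret(0), caret(3))                            # 0 3
print(dollar(len(data), len(data)), dollar(3, len(data)))  # 0 2
```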
To match any character, use a . (dot).
To escape characters in any of the above constructs just use %. For example, %(, %-, %+ and %? match the literals (, -, + and ?, respectively. To escape a literal % just prefix it with another %, as in %%. The % character was chosen because it doesn't conflict with most languages' string escaping character (\, used in C, Java, Lua, JavaScript, Python, PHP, etc).
Anything outside the ASCII range [A:Za:z0:9] (that is, anything that's not an uppercase letter, lowercase letter or digit) can be escaped like this.
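A small helper that turns arbitrary text into a literal ClEx pattern, following the escaping rule above. The name clex_escape is hypothetical, and the sketch works on Python strings for readability even though ClEx targets bytes. It escapes every character that is not an ASCII letter or digit, which is more than strictly necessary but always allowed by the rule.

```python
def clex_escape(literal: str) -> str:
    """Prefix every character outside [A-Za-z0-9] with %."""
    out = []
    for ch in literal:
        if ch.isascii() and ch.isalnum():
            out.append(ch)        # letters and digits must NOT be escaped,
        else:                     # since % + letter is a special matcher
            out.append("%" + ch)
    return "".join(out)

print(clex_escape("a+b?"))   # a%+b%?
print(clex_escape("100%"))   # 100%%
print(clex_escape("(x)"))    # %(x%)
```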
Special matchers are written as % followed by one of [A:Za:z], followed by any metadata the matcher requires.
Currently specified special matchers are:
- %nxy, where x is a number between 1 and 8, optionally preceded by < or >, and y is a match object, matches an unsigned number of size x (in bytes), and attempts to match y exactly that many times. The number is read in native endianness by default. < means little-endian, > means big-endian. For example, %n<4. matches a string prefixed by a 4-byte/32-bit little-endian unsigned length.
- %Nxy, where x is a number between 1 and 8, optionally preceded by < or >, and y is a match object, matches a signed number of size x (in bytes), and attempts to match y exactly that many times. The number is read in native endianness by default. < means little-endian, > means big-endian. For example, %N>4. matches a string prefixed by a 4-byte/32-bit big-endian signed length.
- %bxy, where x and y are different characters, matches strings that start with x, end with y, and where the x and y are balanced. This means that, if one reads the string from left to right, counting +1 for an x and -1 for a y, the ending y is the first y where the count reaches 0. For example, %b() matches expressions with balanced parentheses.
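The special matchers above can be sketched as two standalone functions. Everything here is an assumption for illustration: the function names, the use of None for "no match", treating the repeated match object y as . (any byte), and rejecting a negative signed length, which the spec does not address.

```python
import sys

def match_length_prefixed(data: bytes, pos: int, size: int,
                          endian=None, signed: bool = False):
    """Sketch of %n<size>. (signed=False) or %N<size>. (signed=True).

    endian is "<", ">" or None for native order.
    Returns the position after the payload, or None if there is no match.
    """
    if not 1 <= size <= 8:
        raise ValueError("size must be between 1 and 8")
    byteorder = {"<": "little", ">": "big", None: sys.byteorder}[endian]
    if pos + size > len(data):
        return None
    count = int.from_bytes(data[pos:pos + size], byteorder, signed=signed)
    end = pos + size + count
    if count < 0 or end > len(data):
        return None  # negative lengths are rejected here (assumption)
    return end

def match_balanced(data: bytes, pos: int, opener: int, closer: int):
    """Sketch of %bxy: returns the position after the balancing closer, or None."""
    if opener == closer:
        raise ValueError("x and y must be different characters")
    if pos >= len(data) or data[pos] != opener:
        return None
    depth = 0
    for i in range(pos, len(data)):
        if data[i] == opener:
            depth += 1
        elif data[i] == closer:
            depth -= 1
            if depth == 0:
                return i + 1  # first closer where the count reaches 0
    return None

print(match_length_prefixed(b"\x03\x00\x00\x00abcXYZ", 0, 4, "<"))  # 7
print(match_balanced(b"(a(b)c)d", 0, ord("("), ord(")")))            # 7
```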
All unknown special matchers are an error condition.
When reaching end of string, it is recommended to signal an end of string. If an end of string is signaled while matching a range, the range must be discarded, and the set matching should continue.
Any simple character returns the difference between the expected character and the character found. This can be either (expected - found) or (found - expected). Ranges must be coded accordingly.
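To illustrate why ranges depend on the chosen sign convention, here is a single-byte sketch. The names char_cmp and in_byte_range and the boolean flag are hypothetical; an implementation would normally pick one convention and hard-code it.

```python
def char_cmp(expected: int, found: int, expected_minus_found: bool = True) -> int:
    """Difference between the expected byte and the byte found."""
    return expected - found if expected_minus_found else found - expected

def in_byte_range(lo: int, hi: int, found: int,
                  expected_minus_found: bool = True) -> bool:
    """Inclusive range check written against either sign convention."""
    low = char_cmp(lo, found, expected_minus_found)
    high = char_cmp(hi, found, expected_minus_found)
    if expected_minus_found:
        # lo - found <= 0 means found >= lo; hi - found >= 0 means found <= hi
        return low <= 0 and high >= 0
    # found - lo >= 0 means found >= lo; found - hi <= 0 means found <= hi
    return low >= 0 and high <= 0

for flag in (True, False):
    print(flag, in_byte_range(ord("a"), ord("z"), ord("m"), flag),
          in_byte_range(ord("a"), ord("z"), ord("A"), flag))
```

Both conventions give the same answers as long as the range test flips its inequalities along with the subtraction.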
Proprietary extensions are allowed.
ClEx operates on raw byte streams.
Wow! This spec sounds really nice!
Just a small suggestion/question: Could you provide some examples, with example functions and what they are expected to return?
I would be interested in writing a Lua module (probably in C), so seeing how this would look compared to the builtin "RegEx" would be nice. You could even take some of the String Recipes here and show how they would be done with ClEx.
Just tips! I really like the overall look of it (although the examples would help clarify some stuff haha).