Node Type | Usage |
---|---|
NT_STR | String node |
NT_CCLASS | Character class, e.g. [abc] |
NT_CTYPE | Character type, as in \w |
NT_CANY | Any-character node, such as . |
NT_BREF | Backreference node |
NT_QTFR | Quantifier node |
NT_ENCLOSE | Enclosing node, such as (abc) |
NT_ANCHOR | Location anchors, such as \A |
NT_LIST | List of nodes which must occur in order |
NT_ALT | Alternation, such as a\|b |
NT_CALL | Call referencing a previous subexpression |
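
To make the node types above more concrete, here is a minimal sketch of how a pattern such as `/(ab|c)+/` might decompose into nested nodes. The tree is built by hand for illustration and is not output from Onigmo; `Node`, its fields, and `outline` are hypothetical names, and Onigmo's real C structs are laid out differently.

```ruby
# Hypothetical, hand-built tree for the pattern /(ab|c)+/.
# Node and its fields are illustrative names, not Onigmo's C structs.
Node = Struct.new(:type, :value, :children) do
  # Render the tree as an indented outline, one node per line.
  def outline(depth = 0)
    label = value ? "#{type} #{value.inspect}" : type.to_s
    lines = ["#{'  ' * depth}#{label}"]
    (children || []).each { |child| lines.concat(child.outline(depth + 1)) }
    lines
  end
end

tree =
  Node.new(:NT_QTFR, "+", [           # the + quantifier wraps the group
    Node.new(:NT_ENCLOSE, nil, [      # the parentheses
      Node.new(:NT_ALT, nil, [        # ab | c
        Node.new(:NT_STR, "ab", nil),
        Node.new(:NT_STR, "c", nil)
      ])
    ])
  ])

puts tree.outline
```

Reading the outline from the top down gives the nesting: a quantifier around an enclosure around an alternation of two string nodes.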
Last active
April 26, 2023 19:26
Onigmo Tokens
A partial list of common operations.
OP Code (variants) | Arguments | Example or notes |
---|---|---|
OP_FINISH | None | |
OP_END | None | |
OP_EXACT (1-5, MB, IC) | 1-5 bytes, one per character | /abc/ |
OP_EXACTN | 1 byte specifying length, then 1 byte per character | /abcdef/ |
OP_CCLASS (NOT, MB) | 32 bytes (a 256-bit set) for a standard character class | /[abc]/ |
OP_ANYCHAR (ML, STAR, PEEK_NEXT) | None | /./ |
OP_WORD (ASCII) | None | \w |
OP_NOT_WORD (ASCII) | None | \W |
OP_ASCII_WORD_BOUND | None | \b |
OP_NOT_ASCII_WORD_BOUND | None | \B |
OP_BEGIN_BUF | None | \A |
OP_END_BUF | None | \z |
OP_BEGIN_LINE | None | ^ |
OP_END_LINE | None | $ |
OP_BACKREF (1-2) | None | /(a)\1/ |
OP_BACKREFN (MULTI, IC) | 1 byte for backref number | /(a)(b)(c)\3/ |
OP_FAIL | None | Fails the current path and backtracks. |
OP_JUMP | 4 bytes relative offset | Commonly found in alternation and optional patterns. |
OP_PUSH | 4 bytes relative offset | Records an alternative path for backtracking. |
OP_POP | None | |
OP_PUSH_OR_JUMP_EXACT1 | 4 bytes relative offset, 1 byte character to consume | Optimized form of some push/jump patterns. |
OP_PUSH_IF_PEEK_NEXT | 4 bytes relative offset, 1 byte character to peek | Pushes only if the next character is the one this opcode specifies. |
OP_REPEAT (NG) | 2 bytes memory location, 4 bytes relative offset | Start of a repeat pattern. |
OP_PUSH_POS | None | Start of (?=...) |
OP_POP_POS | None | End of (?=...) |
OP_PUSH_POS_NOT | 4 bytes relative offset | Start of (?!...) |
OP_FAIL_POS | None | End of (?!...) |
OP_PUSH_STOP_BT | 4 bytes relative offset | Start of (?>...) |
OP_POP_STOP_BT | None | End of (?>...) |
OP_LOOK_BEHIND | 4 bytes relative offset | Start of (?<=...) |
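
The interplay of OP_PUSH, OP_JUMP, and backtracking is easier to see in a toy interpreter. The sketch below is an assumption-heavy simplification, not Onigmo's matcher: `run_program` is a hypothetical helper, instructions are Ruby arrays rather than packed bytes, and jump targets are absolute indexes instead of relative offsets. It runs a hand-compiled program for `/ab|ac/` anchored at the start of the input.

```ruby
# Toy backtracking VM. Opcode names mirror the table, but the encoding
# (Ruby arrays, absolute targets) is an assumption made for readability.
def run_program(program, input)
  stack = []  # backtrack entries: [instruction index, input position]
  pc = 0      # index of the current instruction
  pos = 0     # index of the next input character
  loop do
    op, arg = program[pc]
    failed = false
    case op
    when :OP_EXACT1
      if input[pos] == arg
        pos += 1
        pc += 1
      else
        failed = true
      end
    when :OP_PUSH
      stack.push([arg, pos])  # remember where to resume on failure
      pc += 1
    when :OP_JUMP
      pc = arg
    when :OP_FAIL
      failed = true
    when :OP_END
      return pos              # match succeeded; report the end position
    else
      raise ArgumentError, "unknown opcode #{op.inspect}"
    end
    next unless failed
    return nil if stack.empty?  # nothing left to try
    pc, pos = stack.pop         # resume at the saved alternative
  end
end

# Hand-compiled /ab|ac/
alt_program = [
  [:OP_PUSH, 4],      # 0: save the second alternative
  [:OP_EXACT1, "a"],  # 1
  [:OP_EXACT1, "b"],  # 2
  [:OP_JUMP, 6],      # 3: first alternative matched, skip the second
  [:OP_EXACT1, "a"],  # 4
  [:OP_EXACT1, "c"],  # 5
  [:OP_END]           # 6
]

p run_program(alt_program, "ab")  # => 2
p run_program(alt_program, "ac")  # => 2
p run_program(alt_program, "ad")  # => nil
```

For the input `"ac"`, the first alternative fails at index 2, the entry saved by OP_PUSH is popped, and matching restarts at index 4 from input position 0.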
Token | example | Usage |
---|---|---|
TK_EOT | | End of token. One of the two tokens that can end a subexpression. |
TK_RAW_BYTE | /\xA1/ | Raw byte. |
TK_CHAR | | Character literal. Used internally and often changes type before parsing finishes. |
TK_STRING | /abc/ | One or many characters. |
TK_CODE_POINT | /\n/, /\t/ | Code point literal for characters, including control characters. |
TK_ANYCHAR | /./ | Any character. |
TK_CHAR_TYPE | /\h/, /\w/ | Represents a type of character, such as hexadecimal digits or word characters. |
TK_BACKREF | /(a)\1/ | Reference to the text matched by an earlier capture group. |
TK_CALL | /(abc)\g'1'/ | Call re-runs the referenced subexpression. In this case it matches the same text as /(abc)(abc)/. |
TK_ANCHOR | /\A/, /\Z/, /^/, /$/ | Start, End or other match locations. |
TK_OP_REPEAT | /a+/, /a*/ | Repetition operators such as one-or-more and zero-or-more. |
TK_INTERVAL | /a{3,4}/ | Repetition bounded by a minimum and maximum count. |
TK_ANYCHAR_ANYTIME | /.*/ | Special token for any character, any number of times. |
TK_ALT | /a\|b/ | Alternation: matches either the left-hand or the right-hand expression. |
TK_SUBEXP_OPEN | /(ab)/ | Start of a subexpression (the opening parenthesis). |
TK_SUBEXP_CLOSE | /(ab)/ | End of a subexpression (the closing parenthesis). |
TK_CC_OPEN | /[a]/ | Opening bracket of a character class, which lists alternative characters to match. |
TK_CC_CLOSE | /[a]/ | Closing bracket of a character class. |
TK_QUOTE_OPEN | /\Q/ | Start of a quote sequence. The \Q itself is not included in the match. (Not enabled in Ruby.) |
TK_CHAR_PROPERTY | /\p{Alnum}/, /\p{Katakana}/ | Match based on a character property, such as whether it is alphanumeric or katakana. |
TK_LINEBREAK | /\R/ | Generic line break: matches \r\n as a unit or a single line-break character. |
TK_EXTENDED_GRAPHEME_CLUSTER | /\X/ | Matches one extended grapheme cluster, which may span several code points. |
TK_KEEP | /abc\Kdef/ | Keep behaves like a lookbehind: everything before the \K must match but is not included in the match. |
TK_CC_RANGE | /[a-z]/ | The - inside a class, meaning that all characters between the two endpoints are included. |
TK_POSIX_BRACKET_OPEN | /[[:word:]]/ | POSIX-style character classes, used inside a character class. |
TK_CC_AND | /[a-k&&h-z]/ | Takes the intersection of two character classes. |
TK_CC_CC_OPEN | /[[ab]c]/ | Start of a character class within a character class. |
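
A few of the tokens above are easier to understand by running them. The following is a short sketch using Ruby's built-in `Regexp` (which is backed by Onigmo); the variable names are arbitrary, and the call example uses `\g<1>` to call capture group 1.

```ruby
# TK_KEEP: "abc" must match, but only the text after \K is reported.
kept = "abcdef"[/abc\Kdef/]

# TK_CALL: \g<1> re-runs group 1, so this accepts "abc" twice in a row.
called = "abcabc".match?(/\A(abc)\g<1>\z/)

# TK_CC_AND: intersection of a-k and h-z leaves only h through k.
both = "hijack".scan(/[a-k&&h-z]/)

# TK_INTERVAL: greedy repetition takes at most four of the five a's.
interval = "aaaaa"[/a{3,4}/]

# TK_CHAR_PROPERTY: \p{Alnum} skips the hyphen.
alnum = "a-1".scan(/\p{Alnum}/)

p kept      # => "def"
p called    # => true
p both      # => ["h", "i", "j", "k"]
p interval  # => "aaaa"
p alnum     # => ["a", "1"]
```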