Skip to content

Instantly share code, notes, and snippets.

@alycda
Created April 23, 2024 01:08
Show Gist options
  • Select an option

  • Save alycda/e926915bebaa41e4b2e0d44c037f1d4a to your computer and use it in GitHub Desktop.

Select an option

Save alycda/e926915bebaa41e4b2e0d44c037f1d4a to your computer and use it in GitHub Desktop.
1. Regex // History
set of symbols representing a text pattern
formal language interpreted by regex processor
matches text if it correctly describes the text
--engines
C/C++
Java
Javascript/Actionscript (ECMAScript)
.NET
Perl
PHP (PCRE)
Python
Ruby
Unix (POSIX BRE, POSIX, ERE)
Apache (v1: POSIX ERE, v2: PCRE)
MySQL (POSIX ERE)
--notation conventions and modes
/re/ standard
/re/g global
/re/i case-insensitive
/re/m multiline
/re/s dot matches all
2. Characters
--literal
space
regular expressions are eager to return results (left-most preferred unless global is turned on)
--metacharacters (like mathematical operators)
. any character except newline (wildcard)
challenge of regex is matching what you want and ONLY what you want
\ escape next character
/ sometimes denotes the start or end of a regex, (not in javascript) sometimes need to escape with backslash
\t tab character
\r return
\n newline
\r\n sometimes need both
\a bell
\e escape
\f form feed
\v vertical tab
ASCII / ANSI codes
\xA( 0xA9
3. Character Sets
[ begins character set
] ends character set
will match any ONE literal character within brackets
- range character, all characters between starting and ending character
outside a set, its a literal dash
only works with single digits, 50-99 is really 0-9
^ excludes all characters in set from being matched (could include whitespace, punctuation, etc)
outside of a set, denotes the beginning of a line
\ escape character, only needed for ] - ^ \
\d [0-9] shorthand for digit
\w [a-zA-Z0-9_] shorthand for word char (upper and lowercase characters, numbers and underscores)
\s [ \t\r\n] shorthand for (white)space, tab or line return
\D [^0-9] not a digit
\W [^a-zA-Z0-9_] not a word character (upper and lowercase characters, numbers and underscores)
\S [^ \t\r\n] not a (white)space, tab or line return
[^\d\d] != [\D\S]
[^\d\d] NOT digit OR space character
[\D\S] EITHER NOT digit OR NOT space character
--POSIX bracket expressions
[:alpha:] [A-Z-a-z] alphabetic characters
[:digit:] [0-9] numeric characters
[:alnum:] [A-Za-z0-9] alphanumeric
[:lower:] [a-z] lower-case
[:upper:] [A-Z] upper-case
[:punct:] punctuation
[:space:] \s space, tab, new line
[:blank:] space, tab
[:print:] printable characters, including space
[:graph:] printable characters, excluding space
[:cntrl:] control characters (non-printing)
[:xdigit:] [A-Fa-f0-9] hexadecimal characters (0-9, A-F, a-f)
must all be placed within character class (set)
not supported by Java, JavaScript/Actionscript (ECMAScript), .NET, Python
[[:alpha]] or [^[:alpha:]]
4. Repetition Expressions
--Repetition Metacharacters
* ≥0 preceding item 0 or more times (all regex engines)
+ ≥1 preceding item 1 or more times (no support in BRE)
? ≥0<2 preceding item 0 or 1 times (no support in BRE)
--Quantified Repetition
{ start of quantified repetition of preceding item
} end of quantified repetition of preceding item
,
{min,max} min is required, must be positive, can be 0
comma is optional (min is max without)
max is optional, no value if infinite
\d{4,8} 4≥x≤8 matches number with 4 to 8 digits
\d{4} x=4 matches numbers with exactly 4 digits
\d{4,} x≥4 matches number with 4 or more digits
\d{0,} same as \d*
\d{1,} same as \d+
--Greedy Expressions
standard repetition quantifiers match as much as possible before giving control to the next expression part. gives back as little as possible
/.*[0-9]+/ wildcard actually matches all of "Page 266", but can't return a match if it doesn't allow subsequent parts of regex to do their job (in this case there has to be at least one digit character found by the [0-9] character set, followed by the +), so it tries to be less greedy, and goes back just one character. /.*/ matches "Page 26" and /[0-9]+/ matches just the 6
by default:
regular expressions are eager
regular expressions are greedy
--Lazy Expressions
match as little as possible before giving control to the next expression part
? 0 or 1 times (lazy)
*? lazy star (prefer 0 instead of 1[greedy])
+? lazy plus (prefer 1 instead of more than 1[greedy])
{min, max}?
?? can be useless
unix tools are always greedy (BRE, ERE) so the lazy strategy is not supported
/.*[0-9]+/ wildcard first tries to match nothing, but since the digit can't match the P, the wild card tries to be a little less greedy and turns over the expression. The digit can't take over until the 2, so /.*/ matches "Page " and /[0-9]+/ matches "266"
--Efficiency
with indeterminate lengths of expressions, engine can do a lot of backtracking because it doesn't have a global view and can't make assumptions, must go through each character one by one.
efficient matching + less backtracking = speedy results
define quantity of repeated expressions
/.+/ is faster than /.*/
/.{5}/ and /.{3,7}/ are even faster
narrow the scope of the repeated expression
/.+/ can become /[A-Za-z]+/
provide cleared starting and ending points (use anchors and word boundaries)
/<.+>/ can become /<[^.]+>/
5. Grouping Metacharacters
( start grouped expression
) end grouped expression
apply repetition operators to a group
makes expressions easier to read
captures group for use in matching and replacing
cannot be used inside a character set
--Alternation
| pipe, matches previous or next expression, OR operator (some languages[php] use ||)
ordered, leftmost expression gets precedence (without scanning entire string)
multiple choices can be daisy-chained
group alternations to keep them distinct
--Logical and Efficient Alternations
regex engines are eager
regex engines are greedy
put simplest and most efficient expression first
--Repeating and Nesting Alternations
first matched alternation does not effect the next matches
check nesting carefully, tradeoff between precision, readability and efficiency
6. Anchored Expressions
^ start of string/line (supported in all regex engines)
$ end of string/line (supported in all regex engines)
\A start of string, never end of line (are supported in Java, .NET, Perl, PHP, Python, Ruby)
\Z end of string, never end of line (are supported in Java, .NET, Perl, PHP, Python, Ruby)
anchors refer to position, not actual character, zero-width
--Single-line mode (Unix tools)
^ and $ do not match at line breaks
\A and \Z do not match at line breaks
--Mulitline mode (most languages)
^ and $ will match at line breaks
\A and \Z do not match at line breaks
Javascript /^regex$/m
Perl m/^regex$/m
PHP preg_match(/^regex$/m, "string")
--Word Boundaries
\b word boundary (start/end of word)
\B not a word boundary
refer to position, not actual character
most regex engines support, except Unix BRE's
less backtracking
a space is not a word boundary (in regex)
7. Capturing Groups and Backreferences
grouped expressions are captured, stores data matched not the expression automatically by default
backreferences allow access to captured data
\1 through \9 back reference for positions 1 through 9 (most regex engines support this)
\10 through \99 some engines will support this
$1 through $9 some engines use this instead (.htaccess/mod_rewrite)
can be used in same expression as group
can be accesses after the match is complete
can't be used inside character classes
/(apples) to \1/ matches "apples to apples"
/<(i|em)>.+?</\1>/ matches "<i>Hello</i>" or "<em>Hello</em>" NOT "<i>Hello</em>"
++++Backreferences to optional expressions
captures occur on zero-width matches, so backreferences become zero-width too
captures do not always occur on optional groups, so backreferences to a group that failed to match (except in (Java|Action)script)
element is optional, group/capture is not optional
element is not optional, group/capture is optional
--Find and Replace Using Backreferences
1. write regex that matches target data (test and revise as needed)
2. add capture groups (capture anything that varies row to row)
3. write replacement expression (use all captures, adding anything not captured but needed)
--Non-Capturing Group Expressions
?: specify a non-capturing group (these must be first 2 characters at beginning of group
can optimize regex for speed, also preserve space for more captures
supported by most regex engines (except Unix tools), came along with Perl
/(?:regex)/ ? = give this group a different meaning
: = this meaning is non-capturing
8. Lookaround Assertions
assertion of what ought to lie ahead/behind, if look(ahead|behind) fails, match will fail
any valid regex can be used. zero-width, does not include group in match
supported by most regex engines (except Unix) introduced by Perl
--Positive Lookahead
?= positive lookahead
/(?=regex)/ ? = give this group a different meaning
= = lookahead assertion(do not put a space after the equals sign)
/(?=seashore)sea/ matches "sea" in "seashore" but not "seaside" --OR--
/sea(?=shore)/ same match
--Double-Testing
testing assertion before matching "sea", or tries to match "sea" before testing assertion
can allow you to match a pattern that also matches another pattern, run 2 or more different regex test on same string
--Negative Lookahead
?! negative lookahead
/(?!regex)/ ? = give this group a different meaning
! = negative lookahead assertion(do not put a space after the exclamation point)
/(?!seashore)sea/ matches "sea" in "seaside" but not "seashore" --OR--
/sea(?!shore)/ same match
when to not match an entire pattern, expression that should be rejected
/online(?! training)/ does not match "online training"
/online(?!.*training)/ does not match "online video training"
(\bblack\b)(?!.*\1) find last occurrence of word (when its not followed by itself)
--Positive Lookbehind
?<= positive lookbehind
/(?<=regex)/ ? = give this group a different meaning
<= = negative lookbehind assertion(do not put a space after the equals sign)
prefer to put the assertion at beginning of regex pattern, to avoid unnecessary backtracking. a few exceptions
--Negative Lookbehind
?<! negative lookbehind
/(?<!regex)/ ? = give this group a different meaning
<! = negative lookbehind assertion(do not put a space after the exclamation point)
support for look behind:
simple expressions in .NET, Java, Perl, PHP, Python, Ruby 1.9
not supported in Javascript, Ruby 1.8, Unix
simple expressions means fixed length
literal text
character classes
no repetition or optional expressions
alternation only with fixed-length items
allowed: (?<=cat|dog|rat) [3 characters each]
not allowed: (?<=apple|banana|plum) [different number of characters in each word]
--Power of positions
going to a position, not selecting any width, but putting something at that position
/(?<=\d)(?=(\d\d\d)+(?!\d))/ "149597870.7" will return this: 149|597|870.9 allowing your cursor to be placed at correct insertion points for commas
9. Unicode and Multibyte Characters
++
single byte
uses one byte (eight bits) to represent a character
allows for 256 characaters
A-Z, a-z, 0-9, punctuation, common symbols
double bytes
uses two bytes (16 bits) to represent a character
allows for 65,536 characters
many more characters than English alphabet
àáâäãåā latin
≤≥≠¢£ symbols
Arabic, Chinese, Greek, Hebrew, Korean, Thai,…
over 109,000 characters
--Unicode
variable byte size
maintains compatibility with one and two-byte encoding
allows for over 1 million characters
mapping between character and a number
"U+" followed by a four-digit hexadecimal number
infinity symbol is written as U+221E
é can be U+00E9 (single byte) or U+0065 U+0301 (double byte)
--Unicode in Regex
complications for regular expressions:
words can be spelled multiple ways
"cafe", "café"
words can be encoded multiple ways
"café" can be encoded as 4 or 5 characters
wildcard matching
backtracking
unicode is relatively new
\u unicode indicator (followed by 4-digit hexadecidmal number (0000-FFFF)
/caf\u00E9/ matches "café" but not "cafe"
supported by Java, Javascript, .NET, Python, Ruby
\x Perl and PHP unicode indicator
not supported in older Unix tools
--Unicode wildcard
\X matches any single character, always matches line breaks
ONLY SUPPORTED IN Perl AND PHP
\p{property} unicode property
\p{Letter} or \p{L}
\p{Mark} or \p{M}
\p{Separator} or \p{Z}
\p{Symbol} or \p{S}
\p{Number} or \p{N}
\p{Punctuation} or \p{P}
\p{Other} or \p{C}
\P not unicode property
supported in Java, .NET, PHP and RUBY
10. Useful Expressions
--Match Names
^([A-Z][A-Za-z.'\- ]+) ([A-Z][A-Za-z.'-]+)$ First name (optional middle name) with Last name (2 refs)
^([A-Z][A-Za-z.'-]+) (?:([A-Z][A-Za-z.'-]+) )?([A-Z][A-Za-z.'-]+)$ First name optional middle name(s) last name (3 references)
--Postal codes
^\d{5}(-\d{4})?$ us zip codes
^[A-Z]\d[A-Z] \d[A-Z]\d$ canada
^([A-Z]{1,2}\d{1,2}|[A-Z]{1,2}\d[A-Z]) \d[A-Z]{2}$ uk http://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom
--Email Addresses
^[\w.%+\-]+@[\w.\-]+\.[A-Za-z]{2,6}$ valid email
^[\w.%+\-]+@[\w.\-]+\.([A-Za-z]{2}|aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel|xxx)$ tld verification
--URLs
^(?:http|https):\/\/[\w\-_]+(?:\.[\w\-_]+)+[\w\-.,@?^=%&:/~\\+#]+$
--Decimal Numbers and Currency
^(\d*\.\d+|\d+)$
^\$(\d*\.\d{2}|\d+)$ U.S. Dollar
^(\$|£)(\d*\.\d{2}|\d+)$ British Pound
^(\$|\u00A3|\u00A5|\uFFE5)(\d*\.\d{2}|\d+)$ Japanese Yen
--IP Addresses
^(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
--Dates
^\d{4}[-/](0?[1-9]|1[012])[-/](0?[1-9]|[12][0-9]|3[01])$ 2012-06-13
--Times
^(0?[1-9]|1[0-2]):[0-5][0-9]([aApP][mM])?$ 12hr time
^([0-1]?[0-9]|[2][0-3]):[0-5][0-9]?$ 24 hour time
^([0-1]?[0-9]|[2][0-3]):[0-5][0-9](:[0-5][0-9])?$ seconds
^([0-1]?[0-9]|2[0-3]):[0-5][0-9](:[0-5][0-9])?( ([A-Z]{3}|GMT [-+]([0-9]|1[0-2])))?$ timezones
--HTML Tags
^<(?:([A-Za-z][A-Za-z0-9]*)\b[^>]*>(?:.*?)</\1>|[A-Za-z][A-Za-z0-9]*\b[^/>]*/>)$
--Passwords
^(?=.*\d)(?=.*[~!@#$%^&*()_\-+=|\\{}[\]:;<>?/])(?=.*[A-Z])(?=.*[a-z])\S{8,15}$ Must be 8-15 (valid) characters in length with at least 1 digit and 1 or more Capital letters, and 1 or more symbols
--Credit Cards
^(?:3[47]\d{2}([\- ]?)\d{6}\1\d{5}|(?:4\d{3}|5[1-5]\d{2}|6011)([\- ]?)\d{4}\2\d{4}\2\d{4})$ amex/visa/mc/discover
--Words near words
# Word1 near word2
Regex: /[Aa].*[Mm]an/
Regex: /[Aa](.+)[Mm]an/
Regex: /[Aa](?:.+)[Mm]an/
Regex: /[Aa](?:.+?)[Mm]an/
# Use word boundaries
Regex: /\b[Aa]\b(?:.+?)\b[Mm]an\b/
# Don't cross punctuation
Regex: /\b[Aa]\b(?:[^.,;]+?)\b[Mm]an\b/
# Up to 20 characters between
Regex: /\b[Aa]\b(?:[^.,;]{1,20}?)\b[Mm]an\b/
# Up to 5 words between
Regex: /\b[Aa]\b (?:\w+[\- :;.]){0,5}\b[Mm]an\b/
# Lookahead assertions can control the match
Regex: /\b[Aa]\b(?= (?:\w+[\- :;.]){0,5}\b[Mm]an\b)/
# Lookbehind assertions are not possible due to repetition and variable length
--(re)Formating
http://regexpal.com/
http://www.gskinner.com/RegExr/
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment