@abdullahbutt
Created March 23, 2014 02:26
regular expressions backslash sequences
Backslash sequences
A backslash sequence is a sequence of characters, the first one of which is a backslash. Perl ascribes special meaning to many such sequences, and some of these are character classes. That is, they match a single character each, provided that the character belongs to the specific set of characters defined by the sequence.
Here's a list of the backslash sequences that are character classes. They are discussed in more detail below. (For the backslash sequences that aren't character classes, see perlrebackslash.)
\d Match a decimal digit character.
\D Match a non-decimal-digit character.
\w Match a "word" character.
\W Match a non-"word" character.
\s Match a whitespace character.
\S Match a non-whitespace character.
\h Match a horizontal whitespace character.
\H Match a character that isn't horizontal whitespace.
\v Match a vertical whitespace character.
\V Match a character that isn't vertical whitespace.
\N Match a character that isn't a newline.
\pP, \p{Prop} Match a character that has the given Unicode property.
\PP, \P{Prop} Match a character that doesn't have the Unicode property.
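As a rough illustration of the short and braced property forms, the sketch below tests one sample character against \pL, \p{Lu}, and \P{Lu}. The sample character and the variable names are chosen here for illustration only.

```perl
use strict;
use warnings;

# Sample character chosen for illustration: GREEK CAPITAL LETTER DELTA.
my $delta = "\x{0394}";

# \pL is the short form for a one-letter property name (L = Letter).
my $is_letter = $delta =~ /\pL/ ? 1 : 0;

# \p{Lu} is the braced form (Lu = Uppercase_Letter).
my $is_upper = $delta =~ /\p{Lu}/ ? 1 : 0;

# \P{...} negates the property, so it fails on an uppercase letter.
my $is_not_upper = $delta =~ /\P{Lu}/ ? 1 : 0;

print "letter=$is_letter upper=$is_upper not_upper=$is_not_upper\n";
# prints "letter=1 upper=1 not_upper=0"
```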
\N
\N, available starting in v5.12, like the dot, matches any character that is not a newline. The difference is that \N is not influenced by the single line regular expression modifier (see The dot above). Note that the form \N{...} may mean something completely different. When the {...} is a quantifier, it means to match a non-newline character that many times. For example, \N{3} means to match 3 non-newlines; \N{5,} means to match 5 or more non-newlines. But if {...} is not a legal quantifier, it is presumed to be a named character. See charnames for those. For example, none of \N{COLON}, \N{4F}, and \N{F4} contain legal quantifiers, so Perl will try to find characters whose names are respectively COLON, 4F, and F4.
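A minimal sketch of the difference between the dot and \N under the /s modifier, plus the quantifier reading of \N{3}. It assumes Perl v5.12 or later; the sample strings and variable names are made up for the example.

```perl
use strict;
use warnings;

my $text = "ab\ncd";

# Under /s the dot also matches a newline, so .+ spans both lines.
my ($with_dot) = $text =~ /(.+)/s;

# \N ignores /s and still stops at the newline.
my ($with_N) = $text =~ /(\N+)/s;

# \N{3} is read as a quantified \N: exactly three non-newline characters.
my ($three) = "hello\nworld" =~ /(\N{3})/;

print length($with_dot), "\n";   # prints 5
print "$with_N\n";               # prints "ab"
print "$three\n";                # prints "hel"
```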
Digits
\d matches a single character considered to be a decimal digit. If the /a regular expression modifier is in effect, it matches [0-9]. Otherwise, it matches anything that is matched by \p{Digit}, which includes [0-9]. (An unlikely possible exception is that under locale matching rules, the current locale might not have [0-9] matched by \d, and/or might match other characters whose code point is less than 256. Such a locale definition would be in violation of the C language standard, but Perl doesn't currently assume anything in regard to this.)
What this means is that unless the /a modifier is in effect, \d matches not only the digits '0' - '9', but also Arabic, Devanagari, and digits from other writing systems. This may cause some confusion, and some security issues.
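The effect of /a on \d can be sketched as follows, assuming Perl v5.14 or later (where /a is available). The Arabic-Indic sample string and the variable names are illustrative choices, not part of the original text.

```perl
use strict;
use warnings;

# "42" written with ASCII digits and with ARABIC-INDIC digits (U+0664, U+0662).
my $ascii  = "42";
my $arabic = "\x{0664}\x{0662}";

# Without /a, \d follows \p{Digit}, so the Arabic-Indic digits match.
my $arabic_default = $arabic =~ /\A\d+\z/ ? 1 : 0;

# With /a, \d is restricted to [0-9].
my $arabic_ascii = $arabic =~ /\A\d+\z/a ? 1 : 0;
my $ascii_ascii  = $ascii  =~ /\A\d+\z/a ? 1 : 0;

print "$arabic_default $arabic_ascii $ascii_ascii\n";   # prints "1 0 1"
```

An application that only expects ASCII digits would typically want the /a form (or an explicit [0-9]) when validating input.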
Some digits that \d matches look like some of the [0-9] ones, but have different values. For example, BENGALI DIGIT FOUR (U+09EA) looks very much like an ASCII DIGIT EIGHT (U+0038). An application that is expecting only the ASCII digits might be misled, or if the match is \d+, the matched string might contain a mixture of digits from different writing systems that look like they signify a number different than they actually do. num() in Unicode::UCD can be used to safely calculate the value, returning undef if the input string contains such a mixture.
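A short sketch of num() from Unicode::UCD applied to a same-script string and to a mixed-script string. It assumes a Perl recent enough to ship num() in Unicode::UCD; the inputs and variable names are made up for the example.

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

# Both digits are ARABIC-INDIC (U+0664 FOUR, U+0662 TWO): one writing system.
my $same_script = num("\x{0664}\x{0662}");

# ASCII "4" followed by ARABIC-INDIC DIGIT TWO: a mixture of writing systems.
my $mixed = num("4\x{0662}");

print defined $same_script ? "same script: $same_script\n" : "same script: undef\n";
print defined $mixed       ? "mixed: $mixed\n"             : "mixed: undef\n";
# prints "same script: 42" and "mixed: undef"
```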
What \p{Digit} means (and hence \d except under the /a modifier) is \p{General_Category=Decimal_Number}, or synonymously, \p{General_Category=Digit}. Starting with Unicode version 4.1, this is the same set of characters matched by \p{Numeric_Type=Decimal}. But Unicode also has a different property with a similar name, \p{Numeric_Type=Digit}, which matches a completely different set of characters. These characters are things such as CIRCLED DIGIT ONE or subscripts, or are from writing systems that lack all ten digits.
The design intent is for \d to exactly match the set of characters that can safely be used with "normal" big-endian positional decimal syntax, where, for example, 123 means one 'hundred', plus two 'tens', plus three 'ones'. This positional notation does not necessarily apply to characters that match the other type of "digit", \p{Numeric_Type=Digit}, and so \d doesn't match them.
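The distinction between the two similarly named properties can be checked directly. This is an illustrative sketch only; the sample characters and variable names are chosen here.

```perl
use strict;
use warnings;

# CIRCLED DIGIT ONE has Numeric_Type=Digit, not Decimal.
my $circled_one = "\x{2460}";

my $circled_is_d        = $circled_one =~ /\d/                      ? 1 : 0;
my $circled_is_nt_digit = $circled_one =~ /\p{Numeric_Type=Digit}/  ? 1 : 0;

# An ordinary ASCII digit has Numeric_Type=Decimal, not Digit.
my $seven_is_nt_digit   = "7" =~ /\p{Numeric_Type=Digit}/   ? 1 : 0;
my $seven_is_nt_decimal = "7" =~ /\p{Numeric_Type=Decimal}/ ? 1 : 0;

print "$circled_is_d $circled_is_nt_digit $seven_is_nt_digit $seven_is_nt_decimal\n";
# prints "0 1 0 1"
```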
The Tamil digits (U+0BE6 - U+0BEF) can also legally be used in old-style Tamil numbers in which they would appear no more than one in a row, separated by characters that mean "times 10", "times 100", etc. (See http://www.unicode.org/notes/tn21.)
Any character not matched by \d is matched by \D.
Word characters
A \w matches a single alphanumeric character (an alphabetic character, or a decimal digit); or a connecting punctuation character, such as an underscore ("_"); or a "mark" character (like some sort of accent) that attaches to one of those. It does not match a whole word. To match a whole word, use \w+. This isn't the same thing as matching an English word, but in the ASCII range it is the same as a string of Perl-identifier characters.
If the /a modifier is in effect ...
    \w matches the 63 characters [a-zA-Z0-9_].
otherwise ...
    For code points above 255 ...
        \w matches the same as \p{Word} matches in this range. That is, it matches Thai letters, Greek letters, etc. This includes connector punctuation (like the underscore), which connects two words together, and diacritics, such as a COMBINING TILDE and the modifier letters, which are generally used to add auxiliary markings to letters.
    For code points below 256 ...
        if locale rules are in effect ...
            \w matches the platform's native underscore character plus whatever the locale considers to be alphanumeric.
        if Unicode rules are in effect ...
            \w matches exactly what \p{Word} matches.
        otherwise ...
            \w matches [a-zA-Z0-9_].
Which rules apply are determined as described in "Which character set modifier is in effect?" in perlre.
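The cases above can be sketched with explicit character set modifiers, assuming Perl v5.14 or later (where /d, /u, and /a exist). The two sample characters and the variable names are illustrative; locale rules are left out of the sketch.

```perl
use strict;
use warnings;

my $e_acute = "\xE9";       # LATIN SMALL LETTER E WITH ACUTE, code point below 256
my $alpha   = "\x{03B1}";   # GREEK SMALL LETTER ALPHA, code point above 255

# Below 256: the answer depends on which rules are in effect.
my $e_native  = $e_acute =~ /\w/d ? 1 : 0;   # native rules: [a-zA-Z0-9_] only
my $e_unicode = $e_acute =~ /\w/u ? 1 : 0;   # Unicode rules: same as \p{Word}
my $e_ascii   = $e_acute =~ /\w/a ? 1 : 0;   # /a: [a-zA-Z0-9_] only

# Above 255: \w follows \p{Word} unless /a is in effect.
my $alpha_default = $alpha =~ /\w/d ? 1 : 0;
my $alpha_ascii   = $alpha =~ /\w/a ? 1 : 0;

print "$e_native $e_unicode $e_ascii $alpha_default $alpha_ascii\n";
# prints "0 1 0 1 0"
```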
There are a number of security issues with the full Unicode list of word characters. See http://unicode.org/reports/tr36.
Also, for a somewhat finer-grained set of characters that are in programming language identifiers beyond the ASCII range, you may wish to instead use the more customized Unicode Properties, \p{ID_Start}, \p{ID_Continue}, \p{XID_Start}, and \p{XID_Continue}. See http://unicode.org/reports/tr31.
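One possible way to use these properties is an identifier check built from \p{XID_Start} and \p{XID_Continue}. The pattern, the candidate strings, and the variable names below are assumptions made for illustration, not a recommended identifier syntax for any particular language.

```perl
use strict;
use warnings;

# Hypothetical identifier rule: one XID_Start character, then any number
# of XID_Continue characters.
my $identifier = qr/\A\p{XID_Start}\p{XID_Continue}*\z/;

my @candidates = (
    "count1",            # ASCII letters then a digit
    "\x{03B1}_beta",     # starts with GREEK SMALL LETTER ALPHA
    "1count",            # a digit is not XID_Start
    "_private",          # the underscore is XID_Continue but not XID_Start
);

my @results = map { $_ =~ $identifier ? 1 : 0 } @candidates;

print "@results\n";   # prints "1 1 0 0"
```

Note that, unlike \w+, this pattern rejects a leading underscore; a language that allows one would need to add it to the first position explicitly.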
Any character not matched by \w is matched by \W.
Whitespace
\s matches any single character considered whitespace.
If the /a modifier is in effect ...
    In all Perl versions, \s matches the 5 characters [\t\n\f\r ]; that is, the horizontal tab, the newline, the form feed, the carriage return, and the space. Starting in Perl v5.18, experimentally, it also matches the vertical tab, \cK. See note [1] below for a discussion of this.
otherwise ...
    For code points above 255 ...
        \s matches exactly the code points above 255 shown with an "s" column in the table below.
    For code points below 256 ...
        if locale rules are in effect ...
            \s matches whatever the locale considers to be whitespace.
        if Unicode rules are in effect ...
            \s matches exactly the characters shown with an "s" column in the table below.
        otherwise ...
            \s matches [\t\n\f\r ] and, starting experimentally in Perl v5.18, the vertical tab, \cK. (See note [1] below for a discussion of this.) Note that this list doesn't include the non-breaking space.
Which rules apply are determined as described in "Which character set modifier is in effect?" in perlre.
Any character not matched by \s is matched by \S.
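The non-breaking space is a convenient test case for the rules above, since it is below 256 but outside ASCII. The sketch assumes Perl v5.14 or later for the explicit /d, /u, and /a modifiers; variable names are illustrative and locale rules are not shown.

```perl
use strict;
use warnings;

my $nbsp = "\xA0";   # NO-BREAK SPACE

my $nbsp_native  = $nbsp =~ /\s/d ? 1 : 0;   # native rules: not whitespace
my $nbsp_unicode = $nbsp =~ /\s/u ? 1 : 0;   # Unicode rules: whitespace
my $nbsp_ascii   = $nbsp =~ /\s/a ? 1 : 0;   # /a: ASCII whitespace only

# The ASCII tab is matched under every rule set.
my $tab_ascii = "\t" =~ /\s/a ? 1 : 0;

print "$nbsp_native $nbsp_unicode $nbsp_ascii $tab_ascii\n";   # prints "0 1 0 1"
```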
\h matches any character considered horizontal whitespace; this includes the platform's space and tab characters and several others listed in the table below. \H matches any character not considered horizontal whitespace. They use the platform's native character set, and do not consider any locale that may otherwise be in use.
\v matches any character considered vertical whitespace; this includes the platform's carriage return and line feed characters (newline) plus several other characters, all listed in the table below. \V matches any character not considered vertical whitespace. They use the platform's native character set, and do not consider any locale that may otherwise be in use.
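A small sketch of \h and \v in use, assuming Perl v5.10 or later: collapsing runs of horizontal whitespace while leaving line breaks alone, and counting vertical whitespace characters. The sample strings and variable names are made up for the example.

```perl
use strict;
use warnings;

# Collapse each run of horizontal whitespace to one space; the newline stays.
my $collapsed = "a \t  b\nc";
$collapsed =~ s/\h+/ /g;

# Count vertical whitespace characters: "\n", then "\r" and "\n" separately.
my $vertical_count = () = "x\ny\r\nz" =~ /\v/g;

# The no-break space is horizontal whitespace regardless of the rules in effect.
my $nbsp_is_h = "\xA0" =~ /\h/ ? 1 : 0;

print length($collapsed), " $vertical_count $nbsp_is_h\n";   # prints "5 3 1"
```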
\R matches anything that can be considered a newline under Unicode rules. It's not a character class, as it can match a multi-character sequence. Therefore, it cannot be used inside a bracketed character class; use \v instead (vertical whitespace). It uses the platform's native character set, and does not consider any locale that may otherwise be in use. Details are discussed in perlrebackslash.
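The practical difference between \R and \v shows up with CRLF line endings, as in this sketch (Perl v5.10 or later assumed; the input string and variable names are illustrative).

```perl
use strict;
use warnings;

my $text = "one\r\ntwo\nthree\rfour";

# \R treats "\r\n" as a single line break.
my @by_R = split /\R/, $text;

# \v matches one character at a time, so "\r\n" yields an empty field.
my @by_v = split /\v/, $text;

print scalar(@by_R), " ", scalar(@by_v), "\n";   # prints "4 5"
```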
Note that unlike \s (and \d and \w), \h and \v always match the same characters, without regard to other factors, such as the active locale or whether the source string is in UTF-8 format.
One might think that \s is equivalent to [\h\v]. This is indeed true starting in Perl v5.18, but prior to that, the sole difference was that the vertical tab ("\cK") was not matched by \s.
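The vertical tab case can be probed directly. The result for \s depends on the running Perl version, so the sketch reports it rather than assuming it; variable names are illustrative.

```perl
use strict;
use warnings;

my $vt = "\cK";   # vertical tab

# The vertical tab is vertical whitespace, so [\h\v] matches it.
my $in_hv = $vt =~ /[\h\v]/ ? 1 : 0;

# \s matches it starting in Perl v5.18, per the text above.
my $in_s = $vt =~ /\s/ ? 1 : 0;

printf "perl %s: [\\h\\v]=%d \\s=%d\n", $], $in_hv, $in_s;
```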