-
-
Save gruber/249502 to your computer and use it in GitHub Desktop.
The regex patterns in this gist are intended to match any URLs, | |
including "mailto:[email protected]", "x-whatever://foo", etc. For a | |
pattern that attempts only to match web URLs (http, https), see: | |
https://gist.github.com/gruber/8891611 | |
# Single-line version of pattern: | |
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])) | |
# Multi-line commented version of same pattern: | |
(?xi) | |
\b | |
( # Capture 1: entire matched URL | |
(?: | |
[a-z][\w-]+: # URL protocol and colon | |
(?: | |
/{1,3} # 1-3 slashes | |
| # or | |
[a-z0-9%] # Single letter or digit or '%' | |
# (Trying not to match e.g. "URI::Escape") | |
) | |
| # or | |
www\d{0,3}[.] # "www.", "www1.", "www2." … "www999." | |
| # or | |
[a-z0-9.\-]+[.][a-z]{2,4}/ # looks like domain name followed by a slash | |
) | |
(?: # One or more: | |
[^\s()<>]+ # Run of non-space, non-()<> | |
| # or | |
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels | |
)+ | |
(?: # End with: | |
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels | |
| # or | |
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct char | |
) | |
) |
Using node 14.2, it hangs when I try to match the string
https://en.wikipedia.org/wiki/Learning_to_Fly_(Tom_Petty_and_the_Heartbreakers)
Looks like some kind of catastrophic backtracking in the balanced parens clauses, but i'm not sure how to fix it.
My version of this:
(?i)\b(?:[a-z][\w.+-]+:(?:/{1,3}|[?+]?[a-z0-9%]))(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s\x60!()\[\]{};:'".,<>?«»“”‘’])
Changes:
- Supports Go (Changed backtick to
\x60
) - Non-URLs like
bit.com/test
aren't recognized - Protocol section is required
- Applied change mentioned above
putting wide characters (Unicode of more than 1 byte) is incorrect into a bracket expression ([ ]
):
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct char
«
is two bytes: "\xc2\xab", which means the pattern will accept \xc2
and \xab
anywhere in the sequence not in a specific order or not even close to each other!
php -r '$s="\xab \xc2 \xc2 \xab"; $v=preg_match_all("/[«]/", $s, $m); var_dump([$v, $m, $s]);' > foo.txt
you need to open foo.txt with a program which can print you bytes.
putting wide characters (Unicode of more than 1 byte) is incorrect into a bracket expression (
[ ]
)
It depends on the language/library. Works fine in Python and node.js
Hi
Sorry for asking but regex like this are a bit over my head :-)
I was trying to parse some wsdl files (basicaly xml) and I was wondering: Is there any way to avoid matching things like
ab:1234
,xs:complexType
orthis:isnotanurl
?