Last active
November 4, 2024 20:04
Liberal, Accurate Regex Pattern for Matching All URLs
The regex patterns in this gist are intended to match any URLs,
including "mailto:[email protected]", "x-whatever://foo", etc. For a
pattern that attempts only to match web URLs (http, https), see:
https://gist.github.com/gruber/8891611
# Single-line version of pattern:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
# Multi-line commented version of same pattern:
(?xi)
\b
(                                       # Capture 1: entire matched URL
  (?:
    [a-z][\w-]+:                        # URL protocol and colon
    (?:
      /{1,3}                            # 1-3 slashes
      |                                 #   or
      [a-z0-9%]                         # Single letter or digit or '%'
                                        # (Trying not to match e.g. "URI::Escape")
    )
    |                                   #   or
    www\d{0,3}[.]                       # "www.", "www1.", "www2." … "www999."
    |                                   #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/          # looks like domain name followed by a slash
  )
  (?:                                   # One or more:
    [^\s()<>]+                          # Run of non-space, non-()<>
    |                                   #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                                   # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                                   #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]      # not a space or one of these punct char
  )
)
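
A minimal usage sketch in Python, assuming the single-line pattern is used unchanged with the standard `re` module. The names `URL_PATTERN`, `find_urls`, and the sample text are hypothetical and only there for illustration. It shows capture group 1 returning the whole URL while trailing punctuation is left out of the match.

```python
import re

# The single-line pattern from the gist, copied verbatim. A raw triple-quoted
# string lets the pattern contain both ' and " without extra escaping.
URL_PATTERN = re.compile(
    r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)


def find_urls(text):
    """Return capture group 1 (the entire matched URL) for every match."""
    # finditer is used instead of findall because the pattern has nested
    # groups, and findall would return a tuple of all groups per match.
    return [m.group(1) for m in URL_PATTERN.finditer(text)]


# Hypothetical sample text: three URLs followed by punctuation, plus one
# non-URL ("URI::Escape") that the pattern is designed to skip.
sample = (
    "Mail mailto:foo@example.com, see http://example.com/foo_(bar), "
    "or www.example.org. Not URI::Escape."
)

for url in find_urls(sample):
    print(url)
```

The pattern can backtrack heavily on some inputs, so it may be worth putting a length limit on untrusted text before matching.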
Putting wide characters (Unicode characters that take more than one byte) into a bracket expression ([ ]) is incorrect:
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct char
« is two bytes, "\xc2\xab", which means the pattern will accept \xc2 and \xab anywhere in the sequence, in any order and not even next to each other!
php -r '$s="\xab \xc2 \xc2 \xab"; $v=preg_match_all("/[«]/", $s, $m); var_dump([$v, $m, $s]);' > foo.txt
You need to open foo.txt with a program that can display raw bytes.
> putting wide characters (Unicode of more than 1 byte) is incorrect into a bracket expression ([ ])
It depends on the language/library. It works fine in Python and Node.js.
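
To illustrate the difference in Python, here is a small sketch assuming the standard `re` module. The variable names and the sample strings are hypothetical. A `str` pattern treats « as one code point, while the same class encoded to UTF-8 and used as a `bytes` pattern behaves like the PHP example above and matches each byte on its own.

```python
import re

# str pattern: the class holds a single code point, U+00AB («).
# The sample also contains Â (U+00C2), which must not match.
str_hits = re.findall("[«]", "Â « Â")

# bytes pattern: "[«]" encoded as UTF-8 is b"[\xc2\xab]", i.e. a class
# holding two independent bytes.
byte_pattern = "[«]".encode("utf-8")

# Same byte sequence as the PHP test string: \xab and \xc2 out of order.
data = b"\xab \xc2 \xc2 \xab"
byte_hits = re.findall(byte_pattern, data)

print(str_hits)   # only the full « character
print(byte_hits)  # every stray \xab and \xc2 byte
```

So the problem only appears when the engine matches byte by byte. In PHP, the `u` pattern modifier makes `preg_*` functions treat pattern and subject as UTF-8.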
My version of this:
Changes:
\x60
)bit.com/test
aren't recognized