The regex patterns in this gist are intended to match any URLs, including "mailto:foo@example.com", "x-whatever://foo", etc. For a pattern that attempts only to match web URLs (http, https), see: https://gist.github.com/gruber/8891611
# Single-line version of pattern:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
# Multi-line commented version of same pattern:
(?xi)
\b
(                           # Capture 1: entire matched URL
  (?:
    [a-z][\w-]+:            # URL protocol and colon
    (?:
      /{1,3}                # 1-3 slashes
      |                     #   or
      [a-z0-9%]             # Single letter or digit or '%'
                            # (Trying not to match e.g. "URI::Escape")
    )
    |                       #   or
    www\d{0,3}[.]           # "www.", "www1.", "www2." … "www999."
    |                       #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
  )
  (?:                       # One or more:
    [^\s()<>]+              # Run of non-space, non-()<>
    |                       #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                       # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                       #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]      # not a space or one of these punct chars
  )
)
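For reference, a minimal usage sketch in Python (my addition, not part of the gist; the sample text is made up):

import re

# Minimal usage sketch of the single-line pattern above.
URL_PAT = re.compile(
    r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)"""
    r"""(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+"""
    r"""(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)

text = "Read http://en.wikipedia.org/wiki/Regex_(disambiguation), then mail mailto:someone@example.com."
for m in URL_PAT.finditer(text):
    print(m.group(1))
# http://en.wikipedia.org/wiki/Regex_(disambiguation)
# mailto:someone@example.com

Note how the trailing comma and period are excluded from the matches, while the balanced parentheses are kept.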
I've come up with the following version. It requires the URL to begin with a protocol like http://, https://, or even mailto:.
No, I'm not a regex genius, but I've been plugging away at testing this variation, and it seems to work so far.
_(?i)\b((?:(?:https?|ftps?)://|ftp\.|ftps\.|mailto:|www\d{0,3}[.])(?:[^\s()<>]|\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))_iuS
Those experiencing hanging problems should try the pattern in a real regular expression engine, that is, one which does not backtrack.
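For anyone wanting to try this variant outside PHP, here is a quick sketch in Python with the PHP delimiters and flags (_..._iuS) stripped; the sample text is made up:

import re

# The protocol-required variant above. The i flag is kept via (?i);
# the PHP-specific u and S modifiers are dropped.
PROTOCOL_PAT = re.compile(
    r"""(?i)\b((?:(?:https?|ftps?)://|ftp\.|ftps\.|mailto:|www\d{0,3}[.])"""
    r"""(?:[^\s()<>]|\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\))+"""
    r"""(?:\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)

text = "Try https://example.com/a(b) or mailto:foo@example.com today."
print([m.group(1) for m in PROTOCOL_PAT.finditer(text)])
# ['https://example.com/a(b)', 'mailto:foo@example.com']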
Could anybody help me out with a version of this that also optionally allows the URL to be enclosed like so: URL:thefullurl? This is a format I still come across rather often in old forums.
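Not an authoritative answer, but one possible sketch in Python: prepend an optional, non-captured URL: prefix (the wrapper group here is my assumption, not a tested recommendation):

import re

# Gruber's single-line pattern, minus \b and the outer capture,
# which are re-added below around the optional "URL:" prefix.
GRUBER_PAT = (
    r"""(?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)"""
    r"""(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+"""
    r"""(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])"""
)
WRAPPED = re.compile(r"(?i)\b(?:URL:)?(" + GRUBER_PAT + r")")

print(WRAPPED.search("see URL:http://example.com/foo here").group(1))
# http://example.com/foo

Note that the original pattern already matches the whole URL:http://… run, since url: itself looks like a scheme; the optional prefix group just keeps the URL: wrapper out of the captured URL.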
To add support for URIs with schemes like stratum+tcp and xmlrpc.beep, or paths starting with + or ? (e.g. sms:, magnet:), I'm using a version with [a-z][\w.+-]+:(?:/{1,3}|[?+]?[a-z0-9%]) as the first section.
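Here is that first-section swap applied to the single-line pattern, as a Python sketch (the example URIs are mine, for illustration):

import re

# Gruber's single-line pattern with the first section replaced by
# [a-z][\w.+-]+:(?:/{1,3}|[?+]?[a-z0-9%]) as described above.
EXTENDED_PAT = re.compile(
    r"""(?i)\b((?:[a-z][\w.+-]+:(?:/{1,3}|[?+]?[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)"""
    r"""(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+"""
    r"""(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)

for s in ["stratum+tcp://pool.example.com:3333",
          "xmlrpc.beep://stateserver.example.com/NumberToName",
          "sms:+15551234567",
          "magnet:?xt=urn:btih:c12fe1c06bba254a9dc9f519b335aa7c1367a88a"]:
    m = EXTENDED_PAT.search(s)
    print(s, "->", m.group(1) if m else "no match")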
> you can omit the domain extension, and it is still considered a valid URL. Surely that's a bit of a hole in the logic, or is there a reason for that?

Belated @mattauckland - guessing the reason is for URLs like http://localhost/ to be matched.
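A quick check of that guess in Python (pattern trimmed to the relevant branches for brevity, so this is a simplified sketch, not the full pattern):

import re

# http://localhost/ has no dot-extension, but the protocol branch
# [a-z][\w-]+:/{1,3} still lets it through.
TRIMMED_PAT = re.compile(
    r"(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.][a-z]{2,4}/)[^\s()<>]+)"
)
print(TRIMMED_PAT.search("served at http://localhost/ during dev").group(1))
# http://localhost/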
The two balanced-parens parts use capturing groups, while the rest of the regex uses non-capturing groups (except for the outermost match, obviously). May I suggest changing ( to (?: in those four places?
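The change is mechanical; a quick Python check that it only affects the group count, not what is matched:

import re

# The balanced-parens subpattern with capturing vs. non-capturing inner groups.
capturing = re.compile(r"\(([^\s()<>]+|(\([^\s()<>]+\)))*\)")
non_capturing = re.compile(r"\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\)")

s = "(foo(bar))"
print(capturing.match(s).group(0), capturing.groups)          # (foo(bar)) 2
print(non_capturing.match(s).group(0), non_capturing.groups)  # (foo(bar)) 0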
Hi,
To make the regex also match bare domains like example.com, abv.bg, and google.com (though unfortunately it then also matches filename.txt), I changed the + after the 'balanced parens, up to 2 levels' section to {0,}, gave the ending group a {0,} quantifier as well, and made the trailing slash after [a-z]{2,4} optional (the last two groups were eating characters from the matched string when matching against domain.[a-z]{2,4}, so the {2,4} bound became incorrect in that case).
My final regex is:
\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/?)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\)){0,}(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s\!()\[\]{};:\'\"\.\,<>?«»“”‘’]){0,})
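Running that variant in Python over the examples (filename.txt matching is the acknowledged false positive):

import re

# The modified pattern above, verbatim.
BARE_DOMAIN_PAT = re.compile(
    r"""\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/?)"""
    r"""(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\)){0,}"""
    r"""(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s\!()\[\]{};:\'\"\.\,<>?«»“”‘’]){0,})"""
)

for s in ["example.com", "abv.bg", "google.com", "filename.txt"]:
    m = BARE_DOMAIN_PAT.search(s)
    print(s, "->", m.group(1) if m else "no match")
# all four match, including filename.txt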
I have the same problem. I used

GRUBER_URLINTEXT_PAT = re.compile(ur'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')

but it returns some URLs like:
https://t.co/h…
I need your help, please!
What would the actual code look like? I have the string $str = "Blaa lorem ipsum domain-name.studio blaa blaa another.com blaa blaa"; and I want to get this output:

Yes it contains one or more domains:
domain-name.studio
another.com

Thanks if you have time to help!
I tried:
$regex = "(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"; // SCHEME
$found_url = "";
if(preg_match("~^$regex$~i", $description, $m)) $found_url = $m;
if(preg_match("~^$regex$~i", $description, $m)) $found_url .= $m;
But got this error: PHP Parse error: syntax error, unexpected ','
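The parse error is almost certainly the unescaped " inside the double-quoted $regex string: the quote in the final character class closes the string literal early, and PHP then trips over the comma that follows. A nowdoc, or escaping that quote as \", avoids it. For the extraction itself, here is a sketch of the same idea in Python, since the pattern is language-agnostic; note that the original [a-z]{2,4} can't match .studio (six letters), so this sketch widens it to {2,7} (my assumption):

import re

# Sketch: pull bare domains out of free text using just the domain branch.
DOMAIN_PAT = re.compile(r"(?i)\b([a-z0-9.-]+[.][a-z]{2,7})\b")

s = "Blaa lorem ipsum domain-name.studio blaa blaa another.com blaa blaa"
domains = DOMAIN_PAT.findall(s)
if domains:
    print("Yes it contains one or more domains:")
    for d in domains:
        print(d)
# Yes it contains one or more domains:
# domain-name.studio
# another.com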
Hi,
Sorry for asking, but regexes like this are a bit over my head :-) I was trying to parse some WSDL files (basically XML) and I was wondering: is there any way to avoid matching things like ab:1234, xs:complexType, or this:isnotanurl?
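There is no official fix in this gist, but one workaround sketch (my assumption, in Python): only accept a scheme when it is followed by //, plus an explicit whitelist for bare schemes you care about, so namespace-style prefixes like xs: never qualify:

import re

# Sketch: require "scheme://", or a whitelisted scheme such as mailto:,
# so XML namespace prefixes (xs:, this:, ab:) are not treated as URLs.
STRICT_PAT = re.compile(
    r"(?i)\b((?:[a-z][\w.+-]*://|mailto:|www\d{0,3}[.])[^\s()<>]+)"
)

for s in ["ab:1234", "xs:complexType", "this:isnotanurl",
          "http://schemas.xmlsoap.org/wsdl/"]:
    m = STRICT_PAT.search(s)
    print(s, "->", m.group(1) if m else "no match")
# only the last one matches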
Using Node 14.2, it hangs when I try to match the string
https://en.wikipedia.org/wiki/Learning_to_Fly_(Tom_Petty_and_the_Heartbreakers)
Looks like some kind of catastrophic backtracking in the balanced-parens clauses, but I'm not sure how to fix it.
My version of this:
(?i)\b(?:[a-z][\w.+-]+:(?:/{1,3}|[?+]?[a-z0-9%]))(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s\x60!()\[\]{};:'".,<>?«»“”‘’])
Changes:
- Supports Go (changed the backtick to \x60)
- Non-URLs like bit.com/test aren't recognized
- Protocol section is required
- Applied the change mentioned above
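A quick timing check of this version against the URL that hung Node (in Python; its re engine also backtracks, so this is only a rough proxy for the JS behaviour):

import re, time

# The version above, verbatim; \x60 is the backtick.
GO_FRIENDLY_PAT = re.compile(
    r"""(?i)\b(?:[a-z][\w.+-]+:(?:/{1,3}|[?+]?[a-z0-9%]))"""
    r"""(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+"""
    r"""(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s\x60!()\[\]{};:'".,<>?«»“”‘’])"""
)

url = "https://en.wikipedia.org/wiki/Learning_to_Fly_(Tom_Petty_and_the_Heartbreakers)"
t0 = time.perf_counter()
m = GO_FRIENDLY_PAT.search(url)
print(m.group(0) if m else "no match", f"({time.perf_counter() - t0:.4f}s)")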
Putting wide characters (Unicode characters wider than 1 byte) into a bracket expression ([ ]) is incorrect:

[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct chars

« is two bytes, "\xc2\xab", which means the pattern will accept \xc2 and \xab anywhere in the sequence: not in a specific order, and not even close to each other!

php -r '$s="\xab \xc2 \xc2 \xab"; $v=preg_match_all("/[«]/", $s, $m); var_dump([$v, $m, $s]);' > foo.txt

You need to open foo.txt with a program that can show you the raw bytes.
> putting wide characters (Unicode of more than 1 byte) into a bracket expression ([ ]) is incorrect

It depends on the language/library. It works fine in Python and Node.js.
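For example, in Python 3 both the pattern and the subject are Unicode strings, so the php -r experiment above behaves as expected:

import re

# Python 3 analogue of the php -r test: "\xab" is the single character «
# and "\xc2" is Â, so [«] matches only the real « characters.
s = "\xab \xc2 \xc2 \xab"
print(re.findall(r"[«]", s))
# ['«', '«']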
After a little testing of both the original @gruber version and the @cscott version, I've found that you can omit the domain extension, and it is still considered a valid URL. Surely that's a bit of a hole in the logic, or is there a reason for that?