@gruber
Last active November 4, 2024 20:04
Liberal, Accurate Regex Pattern for Matching All URLs
The regex patterns in this gist are intended to match any URLs,
including "mailto:foo@example.com", "x-whatever://foo", etc. For a
pattern that attempts only to match web URLs (http, https), see:
https://gist.github.com/gruber/8891611
# Single-line version of pattern:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
# Multi-line commented version of same pattern:
(?xi)
\b
( # Capture 1: entire matched URL
(?:
[a-z][\w-]+: # URL protocol and colon
(?:
/{1,3} # 1-3 slashes
| # or
[a-z0-9%] # Single letter or digit or '%'
# (Trying not to match e.g. "URI::Escape")
)
| # or
www\d{0,3}[.] # "www.", "www1.", "www2." … "www999."
| # or
[a-z0-9.\-]+[.][a-z]{2,4}/ # looks like domain name followed by a slash
)
(?: # One or more:
[^\s()<>]+ # Run of non-space, non-()<>
| # or
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
)+
(?: # End with:
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
| # or
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct char
)
)
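
A minimal usage sketch in Python, assuming the single-line pattern above is pasted unchanged into the standard `re` module. The helper name `extract_urls` and the sample sentence are made up for illustration; they are not part of the gist.

```python
import re

# Single-line pattern copied from the gist. The triple-quoted raw string
# lets the ' and " inside the final character class stay unescaped.
GRUBER_URL = re.compile(
    r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)


def extract_urls(text):
    """Return capture group 1 (the whole URL) for every match in text."""
    return [m.group(1) for m in GRUBER_URL.finditer(text)]


# Hypothetical sample text with three different URL styles.
sample = (
    "Write to mailto:someone@example.org, see www.example.com/page. "
    "or open x-whatever://foo now"
)
print(extract_urls(sample))
# ['mailto:someone@example.org', 'www.example.com/page', 'x-whatever://foo']
```

`finditer` with `group(1)` is used here instead of `findall` because the pattern contains several capturing groups, so `findall` would return tuples rather than plain strings.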
@wbolster
wbolster commented Nov 9, 2015

The two balanced-parens parts use capturing groups, while the rest of the regex uses non-capturing groups (except for the outermost group, obviously). May I suggest changing ( to (?: in those four places?
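
A sketch of what this suggestion changes in practice, using Python's `re` module as an assumed test bed. `ORIGINAL` is the gist's single-line pattern; `NON_CAPTURING` is the same pattern with `(?:` substituted in the four places inside the balanced-parens parts. Both names and the sample text are illustrative only.

```python
import re

# Gist pattern as published: five capturing groups in total.
ORIGINAL = re.compile(
    r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)

# Same pattern with the four inner groups made non-capturing.
NON_CAPTURING = re.compile(
    r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)

text = "see http://example.com/a and www.example.org/b."

# With several capturing groups, findall yields one tuple per match.
with_groups = ORIGINAL.findall(text)

# With only the outermost group left, findall yields plain strings.
without_groups = NON_CAPTURING.findall(text)

print(with_groups)
print(without_groups)
```

The matched text should be the same either way; only the shape of the result differs, which matters for APIs such as `findall` that return every capturing group.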

@vaderdan
vaderdan commented Dec 13, 2016

Hi

To make the regex also match bare domains such as

example.com
abv.bg
google.com

(which unfortunately means it also matches things like

filename.txt),

I put {0,} at the end of the two groups containing 'balanced parens, up to 2 levels'
and made the slash optional.
(Those last two groups were consuming characters from the matched string when matching against domain.[a-z]{2,4}, so {2,4} became incorrect in that case.)

My final regex is:
\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/?)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\)){0,}(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s\!()\[\]{};:\'\"\.\,<>?«»“”‘’]){0,})

@takwaIMR
takwaIMR commented Jul 4, 2017

I have the same problem. I used
GRUBER_URLINTEXT_PAT = re.compile(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')

but it returns some URLs like:
https://t.co/h…

I need your help, please!
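
One possible workaround, sketched in Python under the assumption that such URLs were already cut short in the source text (for example in a truncated tweet) and cannot be recovered. The final character class of the pattern does not exclude "…", so the ellipsis appears to be kept as the last character of the match. The helper name `drop_truncated` is hypothetical.

```python
ELLIPSIS = "\u2026"  # the single-character horizontal ellipsis "…"


def drop_truncated(urls):
    """Keep only URLs that do not end in an ellipsis.

    Assumption: a trailing "…" means the source text cut the link short,
    so the remaining prefix is not a usable URL.
    """
    return [url for url in urls if not url.endswith(ELLIPSIS)]


# Hypothetical list, as it might come back from the URL pattern.
found = ["https://t.co/h\u2026", "https://t.co/abc123"]
print(drop_truncated(found))
# ['https://t.co/abc123']
```

This filters the results after matching instead of changing the regex, so it can be combined with any variant of the pattern in this thread.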

@e-stonia
e-stonia commented Sep 7, 2018

What does the actual code look like? I have the string $str = "Blaa lorem ipsum domain-name.studio blaa blaa another.com blaa blaa"; and I want to get this output:

Yes it contains one or more domains:
domain-name.studio
another.com

Thanks if you have time to help!
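
A rough sketch of one way to get that output, written in Python rather than PHP and using a much simpler, swapped-in pattern for bare domain names instead of the gist's URL pattern. `DOMAIN` and `find_domains` are illustrative names. As noted earlier in the thread, any pattern of this kind will also accept things like filename.txt.

```python
import re

# One or more dot-terminated labels followed by an alphabetic TLD.
# This is a simplification, not a full hostname validator.
DOMAIN = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}\b",
    re.IGNORECASE,
)


def find_domains(text):
    """Return every substring of text that looks like a bare domain."""
    return DOMAIN.findall(text)


text = "Blaa lorem ipsum domain-name.studio blaa blaa another.com blaa blaa"
domains = find_domains(text)
if domains:
    print("Yes it contains one or more domains:")
    for domain in domains:
        print(domain)
```

All groups in `DOMAIN` are non-capturing, so `findall` returns the full matched strings.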

@e-stonia
e-stonia commented Sep 7, 2018

I tried:

$regex = "(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|(([^\s()<>]+|(([^\s()<>]+)))))+(?:(([^\s()<>]+|(([^\s()<>]+))))|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"; // SCHEME

$found_url = "";
if(preg_match("~^$regex$~i", $description, $m)) $found_url = $m;
if(preg_match("~^$regex$~i", $description, $m)) $found_url .= $m;

But I got this error: PHP Parse error: syntax error, unexpected ','

@DanieleQ97
Hi

Sorry for asking, but regexes like this are a bit over my head :-)

I was trying to parse some WSDL files (basically XML) and I was wondering: is there any way to avoid matching things like ab:1234, xs:complexType or this:isnotanurl?
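
One possible approach, sketched in Python: keep the liberal pattern as is and filter its matches afterwards against an allowlist of schemes. The allowlist contents and the helper name `has_allowed_scheme` are assumptions chosen for illustration; adjust them to the schemes you actually expect.

```python
import re

# Leading "scheme:" part, using the same scheme shape as the gist pattern.
SCHEME = re.compile(r"^([a-z][\w-]+):", re.IGNORECASE)

# Hypothetical allowlist; anything else with a scheme is rejected.
ALLOWED_SCHEMES = {"http", "https", "ftp", "mailto"}


def has_allowed_scheme(candidate):
    """Return True for scheme-less matches and for allowlisted schemes."""
    match = SCHEME.match(candidate)
    if match is None:
        # No scheme at all, e.g. "www.example.com/x": leave it alone.
        return True
    return match.group(1).lower() in ALLOWED_SCHEMES


candidates = [
    "ab:1234",
    "xs:complexType",
    "this:isnotanurl",
    "https://example.com/service?wsdl",
    "www.example.com/x",
]
print([c for c in candidates if has_allowed_scheme(c)])
# ['https://example.com/service?wsdl', 'www.example.com/x']
```

The trade-off is that custom schemes such as "x-whatever://foo" are rejected too unless they are added to the allowlist.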

@jonpincus
Using Node 14.2, it hangs when I try to match the string

https://en.wikipedia.org/wiki/Learning_to_Fly_(Tom_Petty_and_the_Heartbreakers)

Looks like some kind of catastrophic backtracking in the balanced-parens clauses, but I'm not sure how to fix it.
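
A possible mitigation, sketched in Python and not verified against Node. The likely culprit is the nested quantifier: inside the balanced-parens clause, `[^\s()<>]+` sits under a `*`, so a run of n ordinary characters can be split in exponentially many ways once the overall match has to backtrack. The variant below drops the inner `+` where it is repeated by an outer `*` or `+`, which accepts the same strings but leaves only one way to consume each character. `SAFER_URL` is an illustrative name, and the inner groups are also made non-capturing.

```python
import re

# Verbose rewrite of the gist pattern: "[^\s()<>]+" under a repeat
# becomes "[^\s()<>]", so the repetition is no longer ambiguous.
SAFER_URL = re.compile(
    r"""
    \b
    (
      (?: [a-z][\w-]+:(?:/{1,3}|[a-z0-9%])
        | www\d{0,3}[.]
        | [a-z0-9.\-]+[.][a-z]{2,4}/
      )
      (?: [^\s()<>]
        | \((?:[^\s()<>]|\([^\s()<>]+\))*\)
      )+
      (?: \((?:[^\s()<>]|\([^\s()<>]+\))*\)
        | [^\s`!()\[\]{};:'".,<>?«»“”‘’]
      )
    )
    """,
    re.VERBOSE | re.IGNORECASE,
)

url = (
    "https://en.wikipedia.org/wiki/"
    "Learning_to_Fly_(Tom_Petty_and_the_Heartbreakers)"
)
match = SAFER_URL.search(url)
print(match.group(1) == url)  # True
```

Engines with atomic groups or possessive quantifiers offer another route, but the rewrite above only uses syntax that the original pattern already relies on.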

@makew0rld
My version of this:

(?i)\b(?:[a-z][\w.+-]+:(?:/{1,3}|[?+]?[a-z0-9%]))(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s\x60!()\[\]{};:'".,<>?«»“”‘’])

Changes:

  • Supports Go (Changed backtick to \x60)
  • Non-URLs like bit.com/test aren't recognized
  • Protocol section is required
  • Applied change mentioned above
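
A quick check of the behaviour listed above, sketched in Python instead of Go. The pattern is copied from this comment; Python's `re` also understands `\x60`, so it can be used unchanged. The constant name and the sample text are made up for illustration.

```python
import re

# Pattern from this comment: the scheme is mandatory and the backtick
# is written as \x60.
SCHEME_ONLY_URL = re.compile(
    r"""(?i)\b(?:[a-z][\w.+-]+:(?:/{1,3}|[?+]?[a-z0-9%]))(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s\x60!()\[\]{};:'".,<>?«»“”‘’])"""
)

text = (
    "bit.com/test is skipped but https://bit.com/test "
    "and mailto:someone@example.org are found"
)

# There is no outer capturing group in this variant, so use group(0).
urls = [m.group(0) for m in SCHEME_ONLY_URL.finditer(text)]
print(urls)
# ['https://bit.com/test', 'mailto:someone@example.org']
```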

@glensc
glensc commented Dec 27, 2021

Putting wide characters (Unicode characters encoded in more than 1 byte) into a bracket expression ([ ]) is incorrect:

    [^\s`!()\[\]{};:'".,<>?«»“”‘’]		# not a space or one of these punct char

« is two bytes: "\xc2\xab", which means the pattern will accept \xc2 and \xab anywhere in the sequence, in any order and not necessarily next to each other!

php -r '$s="\xab \xc2 \xc2 \xab"; $v=preg_match_all("/[«]/", $s, $m); var_dump([$v, $m, $s]);' > foo.txt

You need to open foo.txt with a program that can show you the raw bytes.

@solaluset
Putting wide characters (Unicode characters encoded in more than 1 byte) into a bracket expression ([ ]) is incorrect

It depends on the language/library. It works fine in Python and Node.js.
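
Both observations can be reproduced in Python, which treats a pattern as bytes or as text depending on the type it is given. This is a small sketch; the variable names and sample inputs are illustrative.

```python
import re

# Bytes pattern: "«" encodes to the two bytes C2 AB in UTF-8, so the
# bracket expression ends up with two independent one-byte members.
byte_class = re.compile("[«]".encode("utf-8"))
byte_hits = byte_class.findall(b"\xab \xc2 \xc2 \xab")
print(byte_hits)  # [b'\xab', b'\xc2', b'\xc2', b'\xab']

# Text pattern: the bracket expression holds the single code point
# U+00AB, so a lone U+00C2 is not matched.
text_class = re.compile("[«]")
text_hits = text_class.findall("« \u00c2 «")
print(text_hits)  # ['«', '«']
```

So the byte-level behaviour described above shows up when the engine works on raw bytes, and disappears when pattern and subject are both handled as Unicode text.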
