| # Single-line version: | |
| (?i)\b(https?:\/{1,3})?((?:(?:[\w.\-]+\.(?:[a-z]{2,13})|(?<=http:\/\/|https:\/\/)[\w.\-]+)\/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])|(?:(?<!@)(?:\w+(?:[.\-]+\w+)*\.(?:[a-z]{2,13})|(?:(?:[0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)|2[0-4][0-9](?!\d)|25[0-5](?!\d))[.]?){4})\b\/?(?!@)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))*(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])?)) | |
| # Commented multi-line version: | |
| (?xi) | |
| \b | |
| (https?:\/{1,3})? # Capture $1: (optional) URL scheme, colon, and slashes | |
| ( # Capture $2: Entire matched URL (other than optional protocol://) | |
| (?: | |
| (?: | |
| [\w.\-]+\. # looks like domain name | |
| (?:[a-z]{2,13}) # ending in any 2-13 letter TLD | |
| | # | |
| (?<=http:\/\/|https:\/\/)[\w.\-]+ # hostname preceded by http:// or https:// | |
| ) | |
| \/ # followed by a slash | |
| ) | |
| (?: # One or more: | |
| [^\s()<>{}\[\]]+ # Run of non-space, non-()<>{}[] | |
| | # or | |
| \([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…) | |
| | | |
| \([^\s]+?\) # balanced parens, non-recursive: (…) | |
| )+ | |
| (?: # End with: | |
| \([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…) | |
| | | |
| \([^\s]+?\) # balanced parens, non-recursive: (…) | |
| | # or | |
| [^\s`!()\[\]{};:'\".,<>?«»“”‘’] # not a space or one of these punct chars | |
| ) | |
| | # OR, the following to match naked domains: | |
| (?: | |
| (?<!@) # not preceded by a @, avoid matching foo@_gmail.com_(?<![@.]) | |
| (?: | |
| \w+ | |
| (?:[.\-]+\w+)* | |
| \. # avoid matching the last two parts of an email domain like co.uk in [email protected] | |
| (?:[a-z]{2,13}) # ending in any 2-13 letter TLD | |
| | # or | |
| (?:(?:[0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)|2[0-4][0-9](?!\d)|25[0-5](?!\d))[.]?){4} # IPv4 address, as seen in https://stackoverflow.com/a/13166657/650558 | |
| ) | |
| \b | |
| \/? | |
| (?!@) # not followed by a @, avoid matching "foo.na" in "[email protected]" | |
| (?: # Zero or more: | |
| [^\s()<>{}\[\]]+ # Run of non-space, non-()<>{}[] | |
| | # or | |
| \([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…) | |
| | | |
| \([^\s]+?\) # balanced parens, non-recursive: (…) | |
| )* | |
| (?: # End with: | |
| \([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (…(…)…) | |
| | | |
| \([^\s]+?\) # balanced parens, non-recursive: (…) | |
| | # or | |
| [^\s`!()\[\]{};:'\".,<>?«»“”‘’] # not a space or one of these punct chars | |
| )? | |
| ) | |
| ) |
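The IPv4 branch of the pattern above can be tried in isolation. The following is a minimal sketch in Python, assuming only the standard `re` module; the names `IPV4` and `is_ipv4` are illustrative and not part of the gist. It checks whole strings with `fullmatch`, which is stricter than how the branch behaves inside the full URL pattern.

```python
import re

# IPv4 branch copied from the gist; IPV4 and is_ipv4 are illustrative names.
IPV4 = re.compile(
    r"(?:(?:[0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)"
    r"|2[0-4][0-9](?!\d)|25[0-5](?!\d))[.]?){4}"
)


def is_ipv4(text):
    """Return True when the whole string matches the IPv4 branch."""
    return IPV4.fullmatch(text) is not None


# Print each sample next to the verdict for a quick manual check.
for sample in ["192.168.0.1", "255.255.255.255", "256.1.1.1", "01.2.3.4", "1.2.3"]:
    print(sample, is_ipv4(sample))
```

Each octet alternative ends in `(?!\d)`, so an octet can never be followed directly by another digit. In practice this makes the `[.]?` separator mandatory between octets and optional only after the fourth one, which means a string such as `1.2.3.4.` also passes this check.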
Public domain is fine by me. I published this to be freely used.
❤️
Thanks!
Thank you both for the prompt reply!
It seems the regex still suffers from catastrophic backtracking when the string ends in a run of punctuation:
e.g. https://www.google.co.jp/search?q=hello&client=safari?????????????
Check https://regex101.com
Possibly, but I'm able to run it in an environment (.NET) where I'm able to specify a timeout for my regex, to handle edge cases like this that have never come up for me.
That being said, if you solve the backtracking issue, definitely let me know.
To put it in context: I just tested your URL on regex101. When I end your URL with 12 question marks, it executes in TWELVE MILLISECONDS. When I add the 13th question mark, regex101 complains about catastrophic backtracking...
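For anyone who wants to poke at the trailing-punctuation case, here is a rough Python sketch of where the blow-up seems to come from. It is a reduced fragment, not the full gist pattern: it keeps only the path part, one of the two parenthesis alternatives, and the final-character class. `NESTED` mirrors the gist's shape (a `+` run inside a `+` group), while `FLAT` is a hypothetical variant that consumes one character per iteration. The names `TAIL`, `NESTED`, `FLAT`, and `match_path` are assumptions made for this example, and the variant has not been checked against the full pattern.

```python
import re

# Final-character class copied from the gist's "End with" group.
TAIL = r"[^\s`!()\[\]{};:'\".,<>?«»“”‘’]"

# Reduced path matcher with the gist's shape: a '+' run inside a '+' group.
# A tail of k rejected characters can be split between iterations in
# roughly 2**k ways, and a backtracking engine may try all of them.
NESTED = re.compile(r"/(?:[^\s()<>{}\[\]]+|\([^\s]+?\))+" + TAIL)

# Hypothetical variant: one character per iteration, so a run of trailing
# punctuation can only be consumed (and given back) one way.
FLAT = re.compile(r"/(?:[^\s()<>{}\[\]]|\([^\s]+?\))+" + TAIL)


def match_path(pattern, text):
    """Return the text matched at the start of the string, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None


path = "/search?q=hello&client=safari"

# Short tail: both variants are cheap and should agree.
print(match_path(NESTED, path + "???"))
print(match_path(FLAT, path + "???"))

# Long tail: only the flat variant is exercised here on purpose.
print(match_path(FLAT, path + "?" * 40))
```

The two variants should accept the same strings, since the character class excludes `(` and each parenthesis alternative starts with `(`, so every position has only one way to be consumed in `FLAT`. Whether the same change is safe in the complete pattern, with the two-level parenthesis alternative restored, would still need testing.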
"But is it illegal though."
License for this code?
Thanks!