The regex patterns in this gist are intended only to match web URLs -- http,
https, and naked domains like "example.com". For a pattern that attempts to
match all URLs, regardless of protocol, see: https://gist.github.com/gruber/249502
# Single-line version:
(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))
# Commented multi-line version:
(?xi)
\b
(  # Capture 1: entire matched URL
  (?:
    https?:  # URL protocol and colon
    (?:
      /{1,3}  # 1-3 slashes
      |  # or
      [a-z0-9%]  # Single letter or digit or '%'
      # (Trying not to match e.g. "URI::Escape")
    )
    |  # or
    # looks like domain name followed by a slash:
    [a-z0-9.\-]+[.]
    (?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)
    /
  )
  (?:  # One or more:
    [^\s()<>{}\[\]]+  # Run of non-space, non-()<>{}[]
    |  # or
    \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
    |
    \([^\s]+?\)  # balanced parens, non-recursive: (…)
  )+
  (?:  # End with:
    \([^\s()]*?\([^\s()]+\)[^\s()]*?\)  # balanced parens, one level deep: (…(…)…)
    |
    \([^\s]+?\)  # balanced parens, non-recursive: (…)
    |  # or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]  # not a space or one of these punct chars
  )
  |  # OR, the following to match naked domains:
  (?:
    (?<!@)  # not preceded by a @, avoid matching foo@_gmail.com_
    [a-z0-9]+
    (?:[.\-][a-z0-9]+)*
    [.]
    (?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)
    \b
    /?
    (?!@)  # not succeeded by a @, avoid matching "foo.na" in "foo.na@example.com"
  )
)
I just created a regex pattern that aims to help with this. If you feel so inclined, give it a try:
https://github.com/Traumatizn/RegEx
This regex helped me a lot, but it crashed my React project when used to check the following string:
https://avatars.githubusercontent.com/u/65315866?
It gives the following error:
Unhandled Rejection (InternalError): too much recursion
    test C:/source/front-end/node_modules/yup/es/string.js:113
    validate C:/source/front-end/node_modules/yup/es/util/createValidation.js:59
    runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30
    _validate/< C:/source/front-end/node_modules/yup/es/schema.js:229
    once/< C:/source/front-end/node_modules/yup/es/util/runTests.js:8
    finishTestRun C:/source/front-end/node_modules/yup/es/util/runTests.js:58
    validate/< C:/source/front-end/node_modules/yup/es/util/createValidation.js:60
    promise callback*validate C:/source/front-end/node_modules/yup/es/util/createValidation.js:59
    runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30
    _validate C:/source/front-end/node_modules/yup/es/schema.js:220
    validate C:/source/front-end/node_modules/yup/es/schema.js:245
    _validate/</tests</< C:/source/front-end/node_modules/yup/es/object.js:160
    runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30
    _validate/< C:/source/front-end/node_modules/yup/es/object.js:177
    once/< C:/source/front-end/node_modules/yup/es/util/runTests.js:8
    runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:26
    _validate/< C:/source/front-end/node_modules/yup/es/schema.js:229
    once/< C:/source/front-end/node_modules/yup/es/util/runTests.js:8
    finishTestRun C:/source/front-end/node_modules/yup/es/util/runTests.js:58
    validate/< C:/source/front-end/node_modules/yup/es/util/createValidation.js:60
    promise callback*validate C:/source/front-end/node_modules/yup/es/util/createValidation.js:59
    runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30
    _validate C:/source/front-end/node_modules/yup/es/schema.js:220
    _validate C:/source/front-end/node_modules/yup/es/object.js:139
    validate C:/source/front-end/node_modules/yup/es/schema.js:245
    _validate/</tests[idx] C:/source/front-end/node_modules/yup/es/array.js:102
    runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30
    _validate/< C:/source/front-end/node_modules/yup/es/array.js:105
    once/< C:/source/front-end/node_modules/yup/es/util/runTests.js:8
    finishTestRun C:/source/front-end/node_modules/yup/es/util/runTests.js:58
    validate/< C:/source/front-end/node_modules/yup/es/util/createValidation.js:60
    promise callback*validate C:/source/front-end/node_modules/yup/es/util/createValidation.js:59
    runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30
    _validate/< C:/source/front-end/node_modules/yup/es/schema.js:229
    once/< C:/source/front-end/node_modules/yup/es/util/runTests.js:8
    finishTestRun C:/source/front-end/node_modules/yup/es/util/runTests.js:58
    validate/< C:/source/front-end/node_modules/yup/es/util/createValidation.js:60
    promise callback*validate C:/source/front-end/node_modules/yup/es/util/createValidation.js:59
    runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30
    _validate C:/source/front-end/node_modules/yup/es/schema.js:220
    _validate C:/source/front-end/node_modules/yup/es/array.js:72
    validate C:/source/front-end/node_modules/yup/es/schema.js:245
    _validate/</tests</< C:/source/front-end/node_modules/yup/es/object.js:160
    runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30
    _validate/< C:/source/front-end/node_modules/yup/es/object.js:177
    once/< C:/source/front-end/node_modules/yup/es/util/runTests.js:8
    runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:26
    _validate/< C:/source/front-end/node_modules/yup/es/schema.js:229
    once/< C:/source/front-end/node_modules/yup/es/util/runTests.js:8
    finishTestRun C:/source/front-end/node_modules/yup/es/util/runTests.js:58
    validate/< C:/source/front-end/node_modules/yup/es/util/createValidation.js:60
    promise callback*validate C:/source/front-end/node_modules/yup/es/util/createValidation.js:59
    runTests C:/source/front-end/node_modules/yup/es/util/runTests.js:30
    _validate C:/source/front-end/node_modules/yup/es/schema.js:220
    _validate C:/source/front-end/node_modules/yup/es/object.js:139
    validate/< C:/source/front-end/node_modules/yup/es/schema.js:245
    validate C:/source/front-end/node_modules/yup/es/schema.js:245
I have absolutely no idea whether the problem lies in my own code or in this regex, but I was only able to fix it by replacing this regex with a simpler and less complete one that I wrote myself.
P.S. In this project, I'm using React, Formik and yup.
Does anybody have an example with Python named groups, e.g. (?P<tld>...), and so on?
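For what it's worth, here is a minimal sketch of how named groups could look in Python. It covers only the naked-domain branch of the gist's pattern and assumes a shortened TLD list (com, net, org, io) purely for illustration; the group names `domain`, `host` and `tld` are hypothetical, not something the gist defines.

```python
import re

# Shortened TLD list for illustration only; the gist's full list would go here.
TLDS = r"com|net|org|io"

# Naked-domain branch of the gist's pattern with named groups added.
NAKED_DOMAIN = re.compile(
    r"(?<!@)\b"                                    # not preceded by '@'
    r"(?P<domain>"
    r"(?P<host>[a-z0-9]+(?:[.\-][a-z0-9]+)*)"      # labels before the TLD
    r"[.](?P<tld>" + TLDS + r")"                   # the TLD itself
    r")\b/?(?!@)",                                 # not followed by '@'
    re.IGNORECASE,
)

match = NAKED_DOMAIN.search("Docs live at docs.example.org today.")
if match:
    print(match.group("domain"), match.group("host"), match.group("tld"))
```

The same `(?P<name>...)` wrapping could be applied to the protocol branch; it is left out here to keep the sketch short.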
Does not work with Python.
I feel the TLDs should just be generalized, given how many new ones pop up. The explicit list is not maintainable and is hard to read, so IMO it would be better to match any alphabetic run of 2 or more characters. That way some poor sap with a bizarre TLD doesn't have issues just because the regex doesn't match the domain.
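A rough sketch of that idea in Python, again using only the naked-domain branch of the gist's pattern. The assumption here is that any run of two or more ASCII letters counts as a TLD; `find_domains` is a hypothetical helper name.

```python
import re

# Assumption: any run of two or more ASCII letters is accepted as a TLD.
GENERIC_TLD = r"[a-z]{2,}"

NAKED_DOMAIN_ANY_TLD = re.compile(
    r"(?<!@)\b[a-z0-9]+(?:[.\-][a-z0-9]+)*[.]" + GENERIC_TLD + r"\b/?(?!@)",
    re.IGNORECASE,
)

def find_domains(text):
    """Return every naked-domain-like substring found in text."""
    return [m.group(0) for m in NAKED_DOMAIN_ANY_TLD.finditer(text)]

print(find_domains("see example.photography and readme.txt"))
```

The trade-off is that anything shaped like `word.letters` now matches, so file names such as `readme.txt` are picked up alongside real domains.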
It doesn't work for URLs with Russian letters. I guess any letter may appear in a plain URL without Punycode. Can anyone provide another regexp?
@danila-schelkov Can you provide an example URL that isn't matching?
Of course. URL: https://www.bagandwallet.ru/collection/sumki-bellroy/product/sumka-bellroy-venture-hip-pack-15l-kupit?variant_id=617976795&utm_source=pnn&utm_medium=email&utm_campaign=Сумка%20Bellroy%20Venture%20Hip%20Pack%201.5L
@danila-schelkov When I try that URL in regex101 along with Gruber's one-liner, it seems to match it correctly?
Is it possible that your code is not treating the URL string as Unicode (e.g. UTF-8), and therefore might not be handling the Cyrillic correctly? (I'm not that familiar with the Cyrillic alphabet.)
Thanks, this works for me in both JS and PCRE; here is a demo.
Many thanks, gruber, for making this available!
@lukapaunovic your PHP version works for me. Thank you.
Because of this gist's popularity, I wanted to call a few things out:
Important detail: This is a Perl-style regex, but not every language has the same regex engine. PHP uses PCRE itself, and Perl, Ruby, and JavaScript have their own broadly Perl-like engines, so they should be more or less compatible with the above, although some tweaking MAY be required.
Java, C#, Go, Rust, and Python use different engines. They tend to be Perl-like, but not Perl-identical. Keep that in mind if you work in these languages. https://regex101.com (no affiliation) is a great tool for working with these other engines.
There is an official spec for URLs, and I'm aware of libraries in JavaScript and Go which implement it and pass all relevant Web Platform Tests.
If you want a complete list of all canonical TLDs, ICANN provides an official list. I also have a project which runs weekly and converts the list to a JSON array. With a little code generation, one could produce a regex with a complete list of supported TLDs for their preferred language. The list changed a dozen times in the 365 days before this post.
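As a sketch of what that code generation might look like in Python: the function below turns a list of TLD strings into the alternation that sits inside `(?:...)` in the gist's pattern. `build_tld_alternation` and `SAMPLE_TLDS` are hypothetical names, and the sample is a tiny hand-written stand-in for the real list, which would be loaded from wherever you keep it.

```python
import re

def build_tld_alternation(tlds):
    """Build a regex alternation such as 'museum|com|co|io' from TLD strings."""
    # Normalise case and drop blanks and duplicates.
    cleaned = {t.strip().lower() for t in tlds if t.strip()}
    # Longest first, so 'com' is tried before 'co'.
    ordered = sorted(cleaned, key=lambda t: (-len(t), t))
    # Escape each entry in case it contains regex metacharacters.
    return "|".join(re.escape(t) for t in ordered)

# Hypothetical sample; a real run would pass in the full list.
SAMPLE_TLDS = ["COM", "io", "museum", "co", "com"]
alternation = build_tld_alternation(SAMPLE_TLDS)
print(alternation)
```

The resulting string can be spliced into the pattern in place of the hard-coded list. Sorting longest-first is a design choice of this sketch, not something the gist does.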
If I try to define a variable in Python like this:
a = r' Single-line version'
it gives me an invalid syntax error.
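A likely cause is quoting: the pattern contains both `'` and `"` in its final character class, so a raw string delimited by a single quote character ends early. A triple-quoted raw string is one way around it. The sketch below uses the gist's trailing character class verbatim, but the rest of the pattern is an abbreviated stand-in (an assumption made for brevity), not the full single-line version.

```python
import re

# The gist's trailing character class contains both ' and ", so r'...' or
# r"..." would be terminated in the middle. Triple quotes avoid that.
TRAILING_CHAR = r"""[^\s`!()\[\]{};:'".,<>?«»“”‘’]"""

# Abbreviated stand-in for the full pattern: only an "https?://" prefix and a
# simplified body are kept here. Paste the real pattern the same way.
URL_PATTERN = re.compile(r"""(?i)\bhttps?://[^\s()<>{}\[\]]*""" + TRAILING_CHAR)

text = 'See "https://example.com/path?x=1", then reply.'
found = URL_PATTERN.findall(text)
print(found)
```

When pasting the real pattern, leave out the "# Single-line version:" label; it is a heading in the gist, not part of the regex.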