Last active
October 6, 2015 06:38
Regular expression to grab (most) URLs from any string, pre gTLD craze
# Regex to grab (most) URLs from any string. Most new gTLDs are not matched, but they can be added manually.
# This will never be perfect (see [1] and [2]), but it does the job well enough. Currently only two TLDs longer than four letters are listed as exceptions (museum and travel).
#
# Capture groups:
# 1. Full URL
# 2. The protocol
# 3. Hostname+path including `www*.`
# 4. Hostname+path excluding `www*.`
# TODO: Capture group for path
#
# [1] http://en.wikipedia.org/wiki/Top-level_domain
# [2] http://newgtlds.icann.org/en/program-status/application-results/strings-1200utc-13jun12-en
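/
(                                   # 1: full URL
  \b
  (?:(https?|ftp):\/\/)?            # 2: protocol (optional)
  (                                 # 3: hostname+path including `www*.`
    (?:www\d{0,3}\.)?
    (                               # 4: hostname+path excluding `www*.`
      [a-z0-9.-]+\.
      (?:[a-z]{2,4}|museum|travel)  # TLD: 2-4 letters, or a listed exception
      (?:\/[^\/\s]+)*               # path segments
    )
  )
  \b
)
/ix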
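
As a rough illustration of how the capture groups come out in practice, here is a minimal Python sketch of the same pattern. It assumes the `/ix` flags map onto `re.IGNORECASE | re.VERBOSE`. The helper name `build_url_regex` and its `extra_tlds` parameter are hypothetical, added only to show one way the "can be added manually" part for longer TLDs might look. The sample strings are made up.

```python
import re


def build_url_regex(extra_tlds=("museum", "travel")):
    """Hypothetical helper: compile the URL pattern with a list of extra TLDs."""
    # 2-4 letter TLDs are always accepted; longer ones must be listed explicitly.
    tld_alternatives = "|".join(["[a-z]{2,4}"] + [re.escape(t) for t in extra_tlds])
    pattern = r"""
        (                               # 1: full URL
          \b
          (?:(https?|ftp)://)?          # 2: protocol (optional)
          (                             # 3: hostname+path including www*.
            (?:www\d{0,3}\.)?
            (                           # 4: hostname+path excluding www*.
              [a-z0-9.-]+\.
              (?:%s)                    # TLD alternatives
              (?:/[^/\s]+)*             # path segments
            )
          )
          \b
        )
    """ % tld_alternatives
    return re.compile(pattern, re.IGNORECASE | re.VERBOSE)


url_regex = build_url_regex()

sample = "See http://www.example.com/foo/bar and ftp://files.example.org today"
matches = [m.groups() for m in url_regex.finditer(sample)]
for full, protocol, with_www, without_www in matches:
    print(full, protocol, with_www, without_www)

# A longer TLD is only matched once it has been added to the exception list.
extended_regex = build_url_regex(("museum", "travel", "photography"))
```

With the default list, a host such as `example.photography` is not matched at all, while `extended_regex` picks it up. This is the trade-off the header comment describes: the pattern stays short, and the TLD list has to be maintained by hand.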