Matching a string against a URL pattern is useful in form or database validation. For example, a job-finding platform that asks users for a portfolio website could validate the input to make sure it isn't gibberish. The following regex checks whether a string has the general shape of a URL that most popular browsers will accept: /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})(:\d{1,5})?([\/\w \.-]*)\/?$/
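As a minimal sketch of the form-validation use case, the pattern can be wrapped in a small helper. This assumes the pattern is written as a JavaScript regex literal with the forward slashes and periods escaped; the names `urlPattern` and `isUrlLike` are hypothetical and only used for illustration.

```javascript
// The pattern from this write-up, written as a JavaScript regex literal
// (assumption: forward slashes and periods are escaped).
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})(:\d{1,5})?([\/\w \.-]*)\/?$/;

// Hypothetical helper: true when the whole input has the shape of a URL.
function isUrlLike(input) {
  return typeof input === "string" && urlPattern.test(input);
}

console.log(isUrlLike("https://example.com/portfolio")); // true
console.log(isUrlLike("example.com"));                   // true
console.log(isUrlLike("just some gibberish"));           // false
```

A check like this only confirms the shape of the input; it does not confirm that the site exists.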
I used the MDN article https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL as a reference to break down what the regex matches. In that article's terms, the regex covers the scheme, domain name, port, and path to resource; it does not cover the parameters or anchor.
I include a beginning '^' anchor and an ending '$' anchor so that the regex has to match the entire string as a URL rather than a URL somewhere inside it. This is consistent with the form and database validation use cases described earlier.
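To see what the anchors change, here is a small comparison of the pattern with and without them. The variable names and the sample sentence are made up for illustration, and the pattern is assumed to be a JavaScript regex literal.

```javascript
// Same pattern twice: once with the ^ and $ anchors, once without them.
const anchored = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})(:\d{1,5})?([\/\w \.-]*)\/?$/;
const unanchored = /(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})(:\d{1,5})?([\/\w \.-]*)\/?/;

const input = "my site is example.com, check it out";

console.log(anchored.test(input));      // false: the whole string is not a URL
console.log(unanchored.test(input));    // true: a URL-shaped piece exists somewhere inside
console.log(unanchored.exec(input)[0]); // "example.com"
```

For validation the anchored form is the safer choice, since the unanchored form accepts any text that merely contains something URL-shaped.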
Quantifiers are explained one at a time going through the regex from left to right.
- '?' in 'https?' matches zero or one 's' character, so both http and https URLs are accepted.
- '?' in '(https?:\/\/)?' makes the whole scheme group optional, since browsers don't require that part of the URL to be typed in.
- '+' in '[\da-z\.-]+' matches one or more characters that make up the domain name, including any subdomains.
- '{2,6}' in '[a-z\.]{2,6}' matches 2 to 6 characters that make up the top-level domain.
- '{1,5}' in ':\d{1,5}' matches a port of 1 to 5 digits. That covers the 0 to 65535 range used in the IANA port assignments, but it also lets through five-digit numbers above 65535, so the range itself needs a separate check.
- '?' in '(:\d{1,5})?' allows a port but doesn't require one.
- '*' in '[\/\w \.-]*' matches zero or more characters that make up the path to resource.
- '?' in '\/?' matches a trailing forward slash, if there is one.
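The quantifiers above can be exercised one at a time with a handful of sample inputs. This is only a sketch: the sample strings and the `samples` and `urlPattern` names are assumptions chosen for illustration.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})(:\d{1,5})?([\/\w \.-]*)\/?$/;

// Each entry pairs an input with the result the quantifier in the comment leads to.
const samples = [
  ["http://example.com", true],      // 's?': zero 's' characters
  ["https://example.com", true],     // 's?': one 's' character
  ["example.com", true],             // '(...)?': scheme left out entirely
  ["example.c", false],              // '{2,6}': top-level domain needs at least 2 characters
  ["example.com:8080", true],        // '(:\d{1,5})?': port present
  ["example.com:", false],           // '{1,5}': a colon needs at least one digit after it
  ["example.com/docs/intro/", true], // '*' and '\/?': path with a trailing slash
];

for (const [input, expected] of samples) {
  const actual = urlPattern.test(input);
  console.log(input, actual, actual === expected ? "(as expected)" : "(unexpected)");
}
```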
There are five main grouping constructs in the regular expression.
- (https?:\/\/) matches the scheme section of a URL
- ([\da-z\.-]+) matches the domain name section of a URL
- ([a-z\.]{2,6}) matches the top-level domain section of a URL
- (:\d{1,5}) matches the port section of a URL
- ([\/\w \.-]*) matches the path to resource section of a URL
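Because each section sits in its own capture group, a match can be split into its parts. The sketch below assumes JavaScript; `splitUrl` and `portInRange` are hypothetical helpers, and the 0 to 65535 limit is the usual 16-bit port range rather than something the regex enforces.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})(:\d{1,5})?([\/\w \.-]*)\/?$/;

// Hypothetical helper: returns the five captured sections, or null when nothing matches.
function splitUrl(input) {
  const match = urlPattern.exec(input);
  if (match === null) return null;
  const [, scheme, domain, tld, port, path] = match;
  return { scheme, domain, tld, port, path };
}

// Hypothetical helper: the regex only counts digits, so the numeric range is checked here.
function portInRange(parts) {
  if (parts.port === undefined) return true; // no port given
  return Number(parts.port.slice(1)) <= 65535; // drop the leading ':'
}

const parts = splitUrl("https://www.example.co.uk:8080/path/to/page.html");
console.log(parts);
// scheme "https://", domain "www.example.co", tld "uk", port ":8080", path "/path/to/page.html"
console.log(portInRange(parts));                          // true
console.log(portInRange(splitUrl("example.com:99999")));  // false
```

The domain group is greedy, so the split between the domain name and the top-level domain falls at the last period that still lets the rest of the pattern match. Optional groups that did not take part in the match come back as `undefined`.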
There are three bracket expressions in this regular expression that help match unique parts of a URL. Each one matches a single character from the set it lists, in any order.
- [\da-z\.-] matches a digit, a lowercase letter a through z, a period (escaped here, although '.' is already literal inside brackets), or a dash
- [a-z\.] matches a lowercase letter a through z or a period
- [\/\w \.-] matches a forward slash, a word character, a space, a period, or a dash
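A quick way to check what a bracket expression accepts is to anchor it and test single characters. The constant names below are made up for this sketch, which assumes JavaScript regex literals.

```javascript
// The three bracket expressions, anchored so each must match exactly one character.
const domainChar = /^[\da-z\.-]$/;
const tldChar = /^[a-z\.]$/;
const pathChar = /^[\/\w \.-]$/;

console.log(domainChar.test("7"), domainChar.test("z"), domainChar.test("-")); // true true true
console.log(domainChar.test("_"), domainChar.test("A"));                       // false false
console.log(tldChar.test("."), tldChar.test("5"));                             // true false
console.log(pathChar.test(" "), pathChar.test("_"), pathChar.test("?"));       // true true false
```

The order of the items inside the brackets has no effect on what matches, with one exception: the dash is placed last so that it is read as a literal dash rather than as part of a range.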
The character classes used in this regex are the digit class (\d), the lowercase letter range (a-z), and the word class (\w). They match a domain name made of digits and letters, a top-level domain made of letters and periods, and a path to resource made of word characters and forward slashes.
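One consequence of using the a-z range is that uppercase letters in the domain are rejected. The sketch below shows this, along with one possible workaround using the `i` flag; that flag is not part of the original regex, and the names here are hypothetical.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})(:\d{1,5})?([\/\w \.-]*)\/?$/;

// Assumption for illustration: rebuild the same pattern with the case-insensitive flag.
const caseInsensitive = new RegExp(urlPattern.source, "i");

console.log(/^\d$/.test("5"), /^\d$/.test("a")); // true false: \d is digits only
console.log(/^\w$/.test("_"), /^\w$/.test("-")); // true false: \w is letters, digits, underscore
console.log(urlPattern.test("Example.COM"));      // false: a-z is lowercase only
console.log(caseInsensitive.test("Example.COM")); // true
```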
Character escapes are used to match the forward slashes in the scheme section, the period characters throughout the URL structure, and a potential trailing forward slash at the end of the URL.
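The effect of the escapes is easiest to see on small patterns. This sketch assumes JavaScript; the shortened patterns and variable names are for illustration only and are not part of the original regex.

```javascript
// Outside brackets, an unescaped period matches any single character.
const unescapedDot = /^example.com$/;
const escapedDot = /^example\.com$/;

console.log(unescapedDot.test("example-com")); // true: '.' matched the dash
console.log(escapedDot.test("example-com"));   // false: only a literal period is accepted
console.log(escapedDot.test("example.com"));   // true

// In a regex literal the forward slash ends the pattern, so it has to be escaped.
// When the pattern is built from a string, the slash needs no escape.
const schemeLiteral = /^https?:\/\//;
const schemeFromString = new RegExp("^https?://");

console.log(schemeLiteral.test("http://x"), schemeFromString.test("http://x")); // true true
```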
Regex run-through by Lane Pemberton. Let me know if you were able to use my work or improve upon it!
🔗 LinkedIn Github 📧 Email