@vicly
Created September 24, 2017 10:18
[URL fundamental] #Note #Fundamental

What every web developer must know about URL encoding

http://blog.lunatech.com/2009/02/03/what-every-web-developer-must-know-about-url-encoding

General URL syntax

https://bob:bobby@www.lunatech.com:8080/file;p=1?q=2#third

Part              Data
Scheme            https
User              bob
Password          bobby
Host address      www.lunatech.com
Port              8080
Path              /file
Path parameter    p=1
Query parameter   q=2
Fragment          third
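
To see how this breakdown looks from code, here is a minimal sketch that feeds the example URL to the JDK's java.net.URI. The class name UrlParts and the map keys are made up for illustration. Note that java.net.URI reports user and password together as one "user info" string and does not split path parameters from the path.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

class UrlParts {
    /** Returns the still-encoded parts of a URL, keyed by illustrative names. */
    static Map<String, String> describe(String url) {
        URI uri = URI.create(url);
        Map<String, String> parts = new LinkedHashMap<>();
        parts.put("scheme", uri.getScheme());
        parts.put("userInfo", uri.getRawUserInfo()); // "user:password", not split further
        parts.put("host", uri.getHost());
        parts.put("port", String.valueOf(uri.getPort()));
        parts.put("path", uri.getRawPath());         // path parameters stay attached
        parts.put("query", uri.getRawQuery());
        parts.put("fragment", uri.getRawFragment());
        return parts;
    }

    public static void main(String[] args) {
        String url = "https://bob:bobby@www.lunatech.com:8080/file;p=1?q=2#third";
        describe(url).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

The path comes back as /file;p=1, so separating the path parameter p=1 from the segment is left to the caller.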

HTTP URL syntax

An HTTP URL is a URL with the http or https scheme.

"/photos/egypt/cairo/first.jpg" has four path segments: "photos", "egypt", "cairo" and "first.jpg"

Each path segment can have optional path parameters (also known as matrix parameters), e.g. /photos;px=1;py=2/egypt/..
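
The segment and matrix-parameter structure can be pulled apart with plain string splitting, as long as it is done on the raw (still encoded) path. The following is only a sketch: PathSegments and Segment are hypothetical names, empty segments are skipped for simplicity, and names and values are left percent-encoded.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PathSegments {
    /** One path segment: its name plus any ;key=value parameters, all still encoded. */
    static final class Segment {
        final String name;
        final Map<String, String> params = new LinkedHashMap<>();
        Segment(String name) { this.name = name; }
    }

    static List<Segment> parse(String rawPath) {
        List<Segment> segments = new ArrayList<>();
        for (String rawSegment : rawPath.split("/")) {
            if (rawSegment.isEmpty()) continue;          // e.g. the empty string before the leading "/"
            String[] pieces = rawSegment.split(";", -1); // first piece is the name, the rest are parameters
            Segment segment = new Segment(pieces[0]);
            for (int i = 1; i < pieces.length; i++) {
                if (pieces[i].isEmpty()) continue;
                int eq = pieces[i].indexOf('=');
                if (eq < 0) {
                    segment.params.put(pieces[i], "");   // parameter without a value
                } else {
                    segment.params.put(pieces[i].substring(0, eq), pieces[i].substring(eq + 1));
                }
            }
            segments.add(segment);
        }
        return segments;
    }

    public static void main(String[] args) {
        for (Segment s : parse("/photos;px=1;py=2/egypt/cairo/first.jpg")) {
            System.out.println(s.name + " " + s.params);
        }
    }
}
```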

URL grammar

Reserved characters such as :, /, ?, and & must be URL-encoded when they appear as data rather than as delimiters. For example, if a file is named xyz?.jpg, then http://example.com/xyz?.jpg must be encoded as http://example.com/xyz%3F.jpg

Common pitfalls of URLs

What to be encoded?

ASCII characters do not need to be escaped, except for the reserved characters. For non-ASCII characters, we must know which character encoding was used to turn them into bytes before they are percent-encoded. The latest version of the URI standard specifies UTF-8 for new URI schemes and for host names, but what about the path?

The reserved chars are different for each part

In a path segment, a space is encoded as %20, while + can be left unencoded.

In the query part, a space can be encoded as either + (for backwards compatibility) or %20, while + is encoded as %2B.

For example, the string "blue+light blue" used first as a path segment and then as the query: http://example.com/blue+light%20blue?blue%2Blight+blue
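
The same URL can be reproduced with JDK classes, which shows the two rules side by side. This is a sketch under assumptions: SpaceAndPlus and buildUrl are illustrative names, the multi-argument java.net.URI constructor is used only because it quotes illegal path characters such as the space (it does not escape "/" inside a segment), and java.net.URLEncoder is applied to the query text alone, never to the whole URL.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

class SpaceAndPlus {
    /** Uses the same text once as a path segment and once as the query. */
    static String buildUrl(String text) {
        try {
            // Path rules: the space becomes %20, the plus sign is legal and stays as-is.
            String base = new URI("http", "example.com", "/" + text, null).toASCIIString();
            // Query (form) rules: the space becomes +, the plus sign becomes %2B.
            String query = URLEncoder.encode(text, "UTF-8");
            return base + "?" + query;
        } catch (URISyntaxException | UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints http://example.com/blue+light%20blue?blue%2Blight+blue
        System.out.println(buildUrl("blue+light blue"));
    }
}
```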

The reserved chars are not what you think they are

  • "?" is allowed unescaped anywhere within a query part,
  • "/" is allowed unescaped anywhere within a query part,
  • "=" is allowed unescaped anywhere within a path parameter or query parameter value, and within a path segment,
  • ":@-._~!$&'()*+,;=" are allowed unescaped anywhere within a path segment part,
  • "/?:@-._~!$&'()*+,;=" are allowed unescaped anywhere within a fragment part.
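
The list above translates fairly directly into an encoder for a single path segment. The sketch below is one possible reading, not the article's implementation: PathSegmentEncoder is a hypothetical name, UTF-8 is assumed as the byte encoding, and every byte outside letters, digits, and the characters listed for a path segment is percent-encoded.

```java
import java.nio.charset.StandardCharsets;

class PathSegmentEncoder {
    // Characters allowed unescaped in a path segment, besides letters and digits.
    private static final String SAFE = ":@-._~!$&'()*+,;=";

    static String encodePathSegment(String segment) {
        StringBuilder out = new StringBuilder();
        for (byte b : segment.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF; // treat the byte as unsigned
            boolean alnum = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
            if (alnum || SAFE.indexOf(c) >= 0) {
                out.append((char) c);
            } else {
                out.append(String.format("%%%02X", c)); // e.g. '/' becomes %2F
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodePathSegment("a/b?c"));           // a%2Fb%3Fc
        System.out.println(encodePathSegment("blue+light blue")); // blue+light%20blue
    }
}
```

A segment whose name contains a literal ";" or "=" would still collide with path-parameter syntax, so a stricter encoder might escape those two as well.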

A URL cannot be analysed after decoding

Analysis of reserved chars and URL parts has to be done before URL-decoding.

The implication is that URL-rewriting filters should NEVER decode a URL before attempting to match it if reserved chars are allowed to be URL-encoded.
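
A small experiment makes the point concrete. The rule "/view/<one segment>" and the class RewriteMatch are invented for this sketch; the only JDK facts relied on are that URI.getRawPath() returns the path as written and URI.getPath() returns it with percent-escapes decoded.

```java
import java.net.URI;
import java.util.regex.Pattern;

class RewriteMatch {
    // Hypothetical rewrite rule: "/view/" followed by exactly one path segment.
    private static final Pattern VIEW_RULE = Pattern.compile("^/view/([^/]+)$");

    /** Matches against the path as it was sent, with escapes intact. */
    static boolean matchesRaw(URI uri) {
        return VIEW_RULE.matcher(uri.getRawPath()).matches();
    }

    /** Matches against the decoded path, where %2F has already become "/". */
    static boolean matchesDecoded(URI uri) {
        return VIEW_RULE.matcher(uri.getPath()).matches();
    }

    public static void main(String[] args) {
        URI uri = URI.create("http://example.com/view/a%2Fb");
        System.out.println(matchesRaw(uri));     // true: one segment, "a%2Fb"
        System.out.println(matchesDecoded(uri)); // false: looks like two segments, "a" and "b"
    }
}
```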

Handling URLs correctly in Java

Do not use java.net.URLEncoder or java.net.URLDecoder for whole URLs

Do not construct URLs without encoding each part

// BAD - http://example.com/a/b?c is INCORRECT
String pathSegment = "a/b?c";
String badUrl = "http://example.com/" + pathSegment;
// GOOD - http://example.com/a%2Fb%3Fc
// (URLUtils is a helper class assumed to come from the linked article; it is not part of the JDK)
String goodUrl = "http://example.com/" + URLUtils.encodePathSegment(pathSegment);


// BAD
//   "http://example.com/?query=a&b==c" is not what we want
//   "http://example.com/?query=a%26b==c" is what we want
String value = "a&b==c";
String badQueryUrl = "http://example.com/?query=" + value;
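
For the query case, one workable fix with JDK classes alone is to run each parameter name and value through java.net.URLEncoder before concatenating. This is a sketch, not the article's helper: QueryValue and withQuery are illustrative names. URLEncoder also escapes "=" as %3D, which is stricter than the a%26b==c shown above but decodes to the same value.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryValue {
    /** Appends one encoded name=value pair to a base URL that has no query yet. */
    static String withQuery(String base, String name, String value) {
        return base + "?" + encode(name) + "=" + encode(value);
    }

    private static String encode(String text) {
        try {
            // Form encoding: suitable for a single query name or value, not for a whole URL.
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints http://example.com/?query=a%26b%3D%3Dc
        System.out.println(withQuery("http://example.com/", "query", "a&b==c"));
    }
}
```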

Do not expect URI.getPath() to give you structured data

Parsing a URL should happen before URL-decoding, but getPath() returns the path already decoded, so an encoded delimiter such as %2F can no longer be told apart from a real "/".

URI uri = new URI("http://example.com/a%2Fb%3Fc");
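// BAD: getPath() returns "/a/b?c", so the encoded slash is split like a real separator
for(String pathSegment : uri.getPath().split("/"))
  System.err.println(pathSegment);
// GOOD: split the raw path first, then decode each segment on its own
for(String pathSegment : uri.getRawPath().split("/"))
  System.err.println(URLUtils.decodePathSegment(pathSegment));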
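
If no URLUtils helper is at hand, the split-then-decode step can be approximated with the JDK. In this sketch RawPathSegments and its decodePathSegment are stand-ins written for illustration. Because java.net.URLDecoder follows form rules and would turn "+" into a space, literal plus signs are protected as %2B before decoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.List;

class RawPathSegments {
    /** Percent-decodes one raw path segment, keeping "+" as a literal plus sign. */
    static String decodePathSegment(String rawSegment) {
        try {
            return URLDecoder.decode(rawSegment.replace("+", "%2B"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Splits the raw path on "/" first, then decodes each segment separately. */
    static List<String> decodedSegments(URI uri) {
        List<String> result = new ArrayList<>();
        for (String rawSegment : uri.getRawPath().split("/")) {
            if (!rawSegment.isEmpty()) {
                result.add(decodePathSegment(rawSegment));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [a/b?c]: a single segment whose name contains "/" and "?"
        System.out.println(decodedSegments(URI.create("http://example.com/a%2Fb%3Fc")));
    }
}
```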

Do not expect Apache Commons HTTPClient's URI to get this right

Fixing URL encoding at every level in a web application

Always encode URLs as you build them

// In HTML
//    BAD
var url = "#{vl:encodeURL(contextPath + '/view/' + resource.name)}";
//    GOOD
var url = "#{contextPath}/view/#{vl:encodeURLPathSegment(resource.name)}";

Ensure your URL-rewrite filters deal with URLs correctly

Using Apache mod-rewrite correctly
