Intro

urllib.unquote doesn't fit Python's paradigms so well:

>>> from urllib import quote, unquote
>>> unquote(u'%C3')
u'\xc3'
>>> unquote('%C3')
'\xc3'

No errors, because Python doesn't care. Realistically what's happening is that a decode (specifically, a percent_decode) can yield bytes. This is the opposite of Python's "decode always yields text" paradigm.

It's not super important for the example above, but the reason nothing fails is that looks can be deceiving. We're actually looking at two very different things:

>>> u'\xc3'.encode('utf8')
'\xc3\x83'
>>> '\xc3'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: unexpected end of data

The u'\xc3' is a Unicode code point (a valid Unicode character, Ã). The '\xc3' is a single byte that UTF-8 only allows as the first byte of a two-byte sequence, so it is invalid on its own (hence the error on decode). The key is that the code point != the byte, and only through Python 2's nonchalance do we end up equating the two.

Thus, no roundtripping:

>>> quote(unquote(u'%C3').encode('utf8'))
'%C3%83'

TLDR Percent decodes always result in bytes. Or rather, they can always result in bytes.
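
The same point can be checked on Python 3, where the standard library makes the bytes explicit. This is only a sketch using `urllib.parse` (not hyperlink's internals): `unquote_to_bytes` returns bytes, the lone `0xC3` byte refuses to decode as UTF-8, and the roundtrip only survives if we stay in bytes.

```python
from urllib.parse import quote, unquote_to_bytes

# Percent-decoding yields bytes, not text.
raw = unquote_to_bytes('%C3')

# The lone byte is not valid UTF-8, so turning it into text fails.
try:
    raw.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Re-encoding the bytes directly roundtrips cleanly.
roundtrip = quote(raw)

print(repr(raw), decoded_ok, roundtrip)
```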

Options

Here are the options for hyperlink._url._percent_decode, and other parts of hyperlink's internals, to support DecodedURL.

Use charmap

Sort of two options:

Keep internal state = roundtrippable
Not keep internal state = Not roundtrippable
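
A minimal sketch of the two variants, assuming "charmap" here means a one-byte-to-one-code-point mapping such as Latin-1. The helpers `decode_tracked` and `encode_tracked` are hypothetical names for illustration, written against Python 3's `urllib.parse`, and are not part of hyperlink.

```python
from urllib.parse import quote, quote_from_bytes, unquote_to_bytes


def decode_tracked(percent_encoded):
    """Hypothetical helper: return (text, codec), remembering the codec used."""
    raw = unquote_to_bytes(percent_encoded)
    try:
        return raw.decode('utf-8'), 'utf-8'
    except UnicodeDecodeError:
        # Latin-1 maps each byte to the code point with the same value,
        # so this fallback never fails.
        return raw.decode('latin-1'), 'latin-1'


def encode_tracked(text, codec):
    """Hypothetical helper: re-encode with the codec recorded at decode time."""
    return quote_from_bytes(text.encode(codec))


text, codec = decode_tracked('hello%C3goodbye')

# With the state kept, the original comes back.
with_state = encode_tracked(text, codec)

# Without the state, the text gets UTF-8 encoded and the URL changes.
without_state = quote(text)

print(codec, with_state, without_state)
```

The cost of the roundtrippable variant is that the codec has to travel alongside the text everywhere it goes.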

Use surrogateescape handler

State tracked by being folded into the string
Not available in Python 2 by default; use the following implementation by haypo:
- https://github.com/PythonCharmers/python-future/blob/master/src/future/utils/surrogateescape.py
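
For reference, this is roughly what the handler buys us on Python 3, where it is built in. The sketch uses `urllib.parse` as a stand-in for hyperlink's own decoder (an assumption): the undecodable byte is folded into the string as a lone surrogate, and encoding with the same handler restores it.

```python
from urllib.parse import quote, unquote

# The undecodable byte 0xC3 becomes the lone surrogate U+DCC3.
text = unquote('hello%C3goodbye', errors='surrogateescape')

# A strict encode refuses the surrogate...
try:
    text.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...but the same handler on the way out restores the original byte.
restored = quote(text, errors='surrogateescape')

print(ascii(text), strict_ok, restored)
```

Note that the resulting string is text only in name: anything that encodes it strictly will raise.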

Get unicode or bytes back

Having DecodedURL, especially with that URL, return either bytes or text silently and dynamically is definitely a confusing API.

Non-option: leave it percent-encoded

No way to tell whether that percent is from the user or from a failed decoding.

What this really means for hyperlink

Right now this happens:

>>> hyperlink._url._decode_path_part(u'hello%c3goodbye')
u'hello%c3goodbye'

Basically, if hyperlink can't UTF-8 decode the bytes underlying the percent encoding, it will just give up and return the percent-encoded text without error. See the code here.
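
The give-up behavior can be approximated in a few lines. This is not hyperlink's code, just a Python 3 sketch with a hypothetical `lenient_percent_decode` helper built on `urllib.parse`, showing the same shape: decode if the bytes are valid UTF-8, otherwise hand the input back untouched.

```python
from urllib.parse import unquote_to_bytes


def lenient_percent_decode(text):
    """Hypothetical helper: percent-decode, or return the input unchanged."""
    try:
        return unquote_to_bytes(text).decode('utf-8')
    except UnicodeDecodeError:
        # Give up silently and keep the percent-encoded text.
        return text


print(lenient_percent_decode('caf%C3%A9'))        # valid UTF-8, decoded
print(lenient_percent_decode('hello%c3goodbye'))  # invalid, left as-is
```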

This worked before, because in almost all cases, URLs with extra percent-encoding were still valid URLs and would still roundtrip fine, even across URI-IRI and back. And any Python code using hyperlink was fine, too, because nothing about the unicode sandwich got broken here. No bytes were returned, no "Big Mac" with an extra internal bun of bytes.

But now, with DecodedURL in the works, we want every part of the URL to be in its totally decoded state. Every reserved character resolved, every structure applied, every byte turned into text, for the highest-level URL yet.

That means every percent symbol (%) has to literally mean a percent symbol (%). Thus, any percent sign left over from a failed decode would lead to the dreaded double escaping. Any % would become %25, so on a single roundtrip of the un-utf8-decodable %C3, we would get %25C3, with more and more 25s being added for each further roundtrip.
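
The escalation is easy to reproduce with the standard library's `quote` on Python 3, used here only as a stand-in for whatever encoder DecodedURL ends up with (an assumption):

```python
from urllib.parse import quote

# The leftover '%C3' is treated as literal text, so its '%' gets escaped.
once = quote('%C3')

# If that result is again treated as literal text, another 25 appears.
twice = quote(once)

print(once, twice)
```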

Update on surrogateescaping

From https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF

However... UTF-8.. can encode these codepoints [(U+D800 to U+DFFF)] in trivial and obvious ways, and large amounts of software does so even though the standard states that such arrangements should be treated as encoding errors.

Basically Python didn't respect the Unicode standard up until 3.3 and those older versions do not raise errors when encoding these illegal characters.
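
On a current Python 3 interpreter the strict behavior is easy to confirm. The sketch below only shows what a modern interpreter does, and says nothing about which release introduced the check: a lone surrogate is rejected by the UTF-8 codec unless the `surrogatepass` handler is requested explicitly.

```python
lone_surrogate = '\ud800'

# Strict UTF-8 encoding rejects the surrogate code point.
try:
    lone_surrogate.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 'surrogatepass' produces the "trivial and obvious" three-byte form.
forced = lone_surrogate.encode('utf-8', 'surrogatepass')

print(strict_ok, forced)
```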

mahmoud/percent_decode_notes.md