@mahmoud
Last active December 30, 2017 22:25

Intro

urllib.unquote doesn't fit Python's paradigms so well:

>>> unquote(u'%C3')
u'\xc3'
>>> unquote('%C3')
'\xc3'

No errors, because Python doesn't care. Realistically what's happening is that a decode (specifically, a percent_decode) can yield bytes. This is the opposite of Python's "decode always yields text" paradigm.

Not super important here, but the reason nothing fails above is that looks can be deceiving. We're actually looking at two very different things:

>>> u'\xc3'.encode('utf8')
'\xc3\x83'
>>> '\xc3'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: unexpected end of data

The u'\xc3' is a Unicode codepoint (a valid Unicode character, Ã). The '\xc3' is a single byte that utf8 only uses as the lead byte of a two-byte sequence, so on its own it is incomplete (hence the error on decode). The key is that the codepoint != the byte and only by Python's nonchalance do we end up equating the two.

Thus, no roundtripping:

>>> quote(unquote(u'%C3').encode('utf8'))
'%C3%83'

TLDR Percent decodes always result in bytes. Or rather, they can always result in bytes.
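For comparison, Python 3's urllib.parse makes the same split explicit. This is a minimal sketch for Python 3 only (the sessions above are Python 2), using the standard library's unquote_to_bytes and quote:

```python
from urllib.parse import quote, unquote_to_bytes

# Percent-decoding is a bytes-producing operation.
raw = unquote_to_bytes('%C3')

# Turning those bytes into text is a separate step, and it can fail.
try:
    raw.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Staying in bytes the whole way roundtrips cleanly.
roundtripped = quote(raw)
```

Here the lone 0xC3 byte comes back as bytes, fails to decode as UTF-8, and re-quotes to the original '%C3' as long as it never passes through text.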

Options

Here are options for hyperlink._url._percent_decode and other parts of the internal hyperlink workings to support DecodedURL.

Use charmap

Sort of two options:

  • Keep internal state = roundtrippable
  • Not keep internal state = Not roundtrippable
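A rough sketch of the difference between the two, written for Python 3 and using latin-1 as a stand-in for 'charmap' (an assumption: both map each byte to the codepoint of the same value here). The names decode_tracked and encode_tracked are hypothetical, not part of hyperlink:

```python
def decode_tracked(raw):
    """Return (text, codec) so the caller can re-encode with the same codec."""
    try:
        return raw.decode('utf-8'), 'utf-8'
    except UnicodeDecodeError:
        # latin-1 maps every byte 0x00-0xFF to the codepoint of the same value,
        # so this fallback never fails.
        return raw.decode('latin-1'), 'latin-1'


def encode_tracked(text, codec):
    """Re-encode with the codec that was remembered at decode time."""
    return text.encode(codec)


original = b'hello\xc3goodbye'
text, codec = decode_tracked(original)

# Keeping the state: the original bytes come back.
kept = encode_tracked(text, codec)

# Dropping the state and assuming UTF-8: different bytes come back.
lossy = text.encode('utf-8')
```

The cost of the roundtrippable variant is that the codec has to travel with every decoded piece of text.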

Use surrogateescape handler
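As a minimal Python 3 illustration of what this handler does (Python 2 has no built-in surrogateescape), undecodable bytes are smuggled through as lone surrogates in the range U+DC80 to U+DCFF and restored on encode:

```python
raw = b'hello\xc3goodbye'

# The undecodable byte 0xC3 becomes the lone surrogate U+DCC3.
text = raw.decode('utf-8', 'surrogateescape')

# Encoding with the same handler restores the original byte.
back = text.encode('utf-8', 'surrogateescape')

# A strict encode refuses the surrogate instead of writing it out.
try:
    text.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The result is always text and always roundtrips, but the text contains codepoints that a strict UTF-8 encoder will reject.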

Get unicode or bytes back

Having DecodedURL, especially with that URL, return either bytes or text silently and dynamically is definitely a confusing API.

Non-option: leave it percent-encoded

  • No way to tell whether that percent is from the user or from a failed decoding

What this really means for hyperlink

Right now this happens:

>>> hyperlink._url._decode_path_part(u'hello%c3goodbye')
u'hello%c3goodbye'

Basically if hyperlink can't utf-8 decode the bytes underlying the percent encoding, it will just give up and return the percent-encoded text without error. See the code here.
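The fallback described here can be approximated in a few lines of Python 3. This is not hyperlink's actual code: decode_or_give_up is a hypothetical name, and the real function also has to leave reserved characters alone, which this sketch ignores:

```python
from urllib.parse import unquote_to_bytes


def decode_or_give_up(text):
    """Percent-decode to UTF-8 text, or return the input untouched on failure."""
    try:
        return unquote_to_bytes(text).decode('utf-8')
    except UnicodeDecodeError:
        # The underlying bytes are not valid UTF-8: silently keep the
        # percent-encoded form instead of raising or returning bytes.
        return text


undecodable = decode_or_give_up('hello%c3goodbye')
decodable = decode_or_give_up('caf%C3%A9')
```

The caller always gets text back, but cannot tell from the result alone whether decoding happened.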

This worked before, because in almost all cases, URLs with extra percent-encoding were still valid URLs and would still roundtrip fine, even across URI-IRI and back. And any Python code using hyperlink was fine, too, because nothing about the unicode sandwich got broken here. No bytes were returned, no "Big Mac" with an extra internal bun of bytes.

But now, with DecodedURL in the works, we want every part of the URL to be in its totally decoded state. Every reserved character resolved, every structure applied, every byte turned into text, for the highest-level URL yet.

That means we want every percent symbol (%) to literally mean a percent symbol (%). Thus, any percent sign left over from a failed decode would result in the dreaded double escaping. Any % would become %25, so on a single roundtrip of the un-utf8-decodable %C3, we would get %25C3, with more and more 25s being added for each further roundtrip.
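The growth is easy to reproduce with Python 3's urllib.parse.quote, used here only as a stand-in for whatever encoder a DecodedURL would apply on the way out:

```python
from urllib.parse import quote

# A part that failed to decode and was left percent-encoded.
part = '%C3'

# Treated as fully decoded text, its % is a literal percent sign
# and gets escaped on every encode.
once = quote(part)
twice = quote(once)
```

After one encode the part is '%25C3', after two it is '%2525C3', and neither decodes back to the original single byte.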

Update on surrogateescaping

From https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF

However... UTF-8.. can encode these codepoints [(U+D800 to U+DFFF)] in trivial and obvious ways, and large amounts of software does so even though the standard states that such arrangements should be treated as encoding errors.

Basically Python didn't respect the Unicode standard up until 3.3 and those older versions do not raise errors when encoding these illegal characters.

See also:

So, reassessing options:

  1. Write a custom utf8 encoding/decoding wrapper that checks for surrogates in pure Python for Python 2, and rely on surrogateescape in Python 3.
    • re.search(u"[\ud800-\udfff]", possible_surrogate_having_text)
  2. Use surrogatepass in Python 3 and use the same surrogateescaping code for both 2 and 3
    • Suggested by Mark: seems unnecessarily slow for Py3
  3. Back to using 'charmap' decoding if 'utf8' fails and try to keep track of which we did
    • Content is bytes, probably containing text, probably in ascii or utf8, it's not so simple as "if not utf8 then bytes"
  4. Reduce scope and only do UTF8. Raise an exception if any encoding/decoding fails. No handling for undecodable bytes. Use Hyperlink's normal URL for that.
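For option 1, the pure-Python check could look roughly like the sketch below. encode_utf8_checked is a hypothetical name, and the pattern assumes the surrogate range U+D800 to U+DFFF quoted from Wikipedia above. On a narrow Python 2 build, characters outside the BMP are stored as surrogate pairs and would also match, which this sketch does not handle:

```python
import re

# Matches any codepoint in the surrogate range U+D800 to U+DFFF.
_SURROGATES = re.compile(u'[\ud800-\udfff]')


def encode_utf8_checked(text):
    """Encode to UTF-8, refusing surrogates even where the codec would allow them."""
    if _SURROGATES.search(text):
        raise ValueError('text contains surrogate codepoints')
    return text.encode('utf-8')


encoded = encode_utf8_checked(u'caf\xe9')

try:
    encode_utf8_checked(u'hello\udcc3goodbye')
    rejected = False
except ValueError:
    rejected = True
```

A matching decode wrapper would need the same check on its output, since the older codecs also decode encoded surrogates without complaint.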

Right now I'm leaning toward #4.
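In Python 3 terms, #4 amounts to something like the sketch below. percent_decode_strict is a hypothetical name, not the eventual hyperlink API, and reserved-character handling is left out:

```python
from urllib.parse import unquote_to_bytes


def percent_decode_strict(text):
    """Percent-decode to text, raising UnicodeDecodeError on non-UTF-8 bytes."""
    return unquote_to_bytes(text).decode('utf-8')


decoded = percent_decode_strict('caf%C3%A9')

try:
    percent_decode_strict('hello%c3goodbye')
    raised = False
except UnicodeDecodeError:
    raised = True
```

Callers that need to handle undecodable bytes would catch the exception and fall back to the normal URL type.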
