urllib.unquote doesn't fit Python's paradigms so well:

>>> from urllib import unquote
>>> unquote(u'%C3')
u'\xc3'
>>> unquote('%C3')
'\xc3'
No errors, because Python doesn't care. Realistically, what's happening is that a decode (specifically, a percent_decode) can yield bytes. This is the opposite of Python's "decode always yields text" paradigm.
Not super important, but the reason the calls above don't fail is that looks can be deceiving. We're actually looking at two very different things:
>>> u'\xc3'.encode('utf8')
'\xc3\x83'
>>> '\xc3'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: unexpected end of data
The u'\xc3' is a Unicode codepoint (a valid Unicode character), while the '\xc3' is a byte which UTF-8 reserves as the lead byte of a two-byte sequence (hence the error on decode). The key is that the codepoint != the byte, and only by Python's nonchalance do we end up equating the two.
Thus, no roundtripping:
>>> quote(unquote(u'%C3').encode('utf8'))
'%C3%83'
TL;DR: Percent decodes result in bytes. Or rather, they can always result in bytes, and there's no guarantee those bytes decode to text.
Here are the options for hyperlink._url._percent_decode and the other parts of hyperlink's internals that need to support DecodedURL. There are essentially two:
- Keep internal state = roundtrippable
- Don't keep internal state = not roundtrippable
- State tracked by being folded into the string (see the sketch after this list)
  - Not in Python 2 by default; use the following by haypo:
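A minimal sketch of the fold-the-state-into-the-string approach, using Python 3's surrogateescape error handler (the Python 2 situation is covered below): undecodable bytes become lone surrogates in the text, so the string itself carries everything needed to round trip.

>>> raw = b'hello\xc3goodbye'  # 0xC3 alone is not valid UTF-8
>>> text = raw.decode('utf8', 'surrogateescape')
>>> text
'hello\udcc3goodbye'
>>> text.encode('utf8', 'surrogateescape') == raw  # lossless round trip
True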
Having DecodedURL, especially with that name, silently and dynamically return either bytes or text is definitely a confusing API.
- There's no way to tell whether a given percent sign came from the user or from a failed decode.
Right now this happens:
>>> hyperlink._url._decode_path_part(u'hello%c3goodbye')
u'hello%c3goodbye'
Basically, if hyperlink can't UTF-8 decode the bytes underlying the percent encoding, it will just give up and return the percent-encoded text without error. See the code here.
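A rough sketch of that fallback (not hyperlink's exact code; _decode_part is a hypothetical stand-in):

>>> from urllib import unquote
>>> def _decode_part(text):
...     # Try to percent-decode and UTF-8 decode; if the underlying
...     # bytes aren't valid UTF-8, silently hand back the original
...     # percent-encoded text.
...     try:
...         return unquote(text.encode('ascii')).decode('utf8')
...     except UnicodeDecodeError:
...         return text
...
>>> _decode_part(u'hello%c3goodbye')
u'hello%c3goodbye'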
This worked before because, in almost all cases, URLs with extra percent-encoding were still valid URLs and would still roundtrip fine, even across URI-to-IRI conversion and back. And any Python code using hyperlink was fine too, because nothing about the unicode sandwich got broken here: no bytes were returned, no "Big Mac" with an extra internal bun of bytes.
But now, with DecodedURL in the works, we want every part of the URL to be in its totally decoded state: every reserved character resolved, every structure applied, every byte turned into text, for the highest-level URL yet.
That means we want every percent symbol (%) to literally mean a percent symbol (%). Thus, any percent sign left in the result would cause the dreaded double escaping: any % would become %25, so on a single roundtrip of the un-utf8-decodable %C3, we would get %25C3, with more and more 25s being added for each further roundtrip.
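The snowballing is easy to reproduce with the stdlib's quote, assuming a failed decode has left the % in the text:

>>> from urllib import quote
>>> quote('%C3')    # the leftover '%' is escaped itself
'%25C3'
>>> quote('%25C3')  # and again on the next roundtrip
'%2525C3'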
From https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF:
However... UTF-8... can encode these codepoints [(U+D800 to U+DFFF)] in trivial and obvious ways, and large amounts of software does so even though the standard states that such arrangements should be treated as encoding errors.
Basically, Python didn't respect this part of the Unicode standard until 3.3, and those older versions do not raise errors when encoding these illegal codepoints.
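For example, encoding a lone surrogate succeeds on Python 2.7 but fails on 3.3+:

>>> u'\udcc3'.encode('utf8')  # Python 2.7: no complaint
'\xed\xb3\x83'
>>> u'\udcc3'.encode('utf8')  # Python 3.3+
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 0: surrogates not allowed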
See also:
- http://unicodebook.readthedocs.io/issues.html#non-strict-utf-8-decoder-overlong-byte-sequences-and-surrogates
- https://tools.ietf.org/html/rfc3629#section-3
- https://www.python.org/dev/peps/pep-0383/
So, reassessing options:

1. Write a custom utf8 encoding/decoding wrapper that checks for surrogates in pure Python for Python 2, and rely on surrogateescape in Python 3.
   - For example: re.match(u"[\u1000-\U00021111]", possible_surrogate_having_text)
2. Use surrogatepass in Python 3 and use the same surrogateescaping code for both 2 and 3.
   - Suggested by Mark: seems unnecessarily slow for Py3.
3. Go back to using 'charmap' decoding if 'utf8' fails and try to keep track of which we did.
   - Content is bytes, probably containing text, probably in ascii or utf8; it's not so simple as "if not utf8 then bytes".
4. Reduce scope and only do UTF8. Raise an exception if any encoding/decoding fails. No handling for undecodable bytes. Use hyperlink's normal URL for that.
Right now I'm leaning toward #4.
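A minimal sketch of what option #4 could look like (strict_percent_decode is a hypothetical name, not hyperlink's actual API):

>>> from urllib import unquote
>>> def strict_percent_decode(text):
...     # Percent-decode to bytes, then require valid UTF-8. Raises
...     # instead of silently returning percent-encoded text; URLs with
...     # undecodable bytes stay on hyperlink's normal URL class.
...     return unquote(text.encode('ascii')).decode('utf8')
...
>>> strict_percent_decode(u'%C3%83')
u'\xc3'
>>> strict_percent_decode(u'hello%c3goodbye')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 5: invalid continuation byte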