Last active
June 12, 2023 12:24
"""
This file contains code that, when run on Python 2.7.5 or earlier, creates
a string that should not exist: u'\Udeadbeef'. That's a single "character"
that's illegal in Python because it's outside the valid Unicode range.
It then uses it to crash various things in the Python standard library and
corrupt a database.
On Python 3... well, this file is full of syntax errors on Python 3. But
if you were to change the print statements and byte literals and stuff:
* You'd probably see the same bug on Python 3.2.
* On Python 3.3, you'd just get an error making the string on the first line.
* On Python 3.3.3, the error even makes sense.
On narrow builds of Python, u'\Udeadbeef' gets immediately truncated to
u'\ubeef', a totally safe character. (It's a nonsense syllable in
Korean.) For once, narrow Python's half-assed Unicode support has saved you.
The relevant bug is: http://bugs.python.org/issue19279
"""
# Use a bug in the UTF-7 decoder to create a string containing codepoint
# U+DEADBEEF. (Keep in mind that Unicode ends at U+10FFFF.)
deadbeef = '+d,+6t,+vu8-'.decode('utf-7', 'replace')[-1]
print repr(deadbeef)
# outputs u'\Udeadbeef'. That's not a valid string literal.
import codecs
with codecs.open('deadbeef.txt', 'w', encoding='utf-8') as outfile:
    print >> outfile, deadbeef
    # writes a non-UTF-8 file
try:
    with codecs.open('deadbeef.txt', encoding='utf-8') as infile:
        print infile.read()
except UnicodeDecodeError:
    print "Boom! Broke your text file."
import re
try:
    re.match(u'[A-%s]' % deadbeef, u'test')
except MemoryError:
    print "Boom! Broke your regular expression."
import sqlite3
db = sqlite3.connect('deadbeef.db')
db.execute(u'CREATE TABLE deadbeef (id integer primary key, value text)')
# Parameters are passed as one-element tuples so each value binds to the
# single "?" placeholder regardless of how long the string is.
db.execute(u'INSERT INTO deadbeef (value) VALUES (?)', (u'\U0001f602',))
db.execute(u'SELECT * FROM deadbeef').fetchall()
# This works fine. I'm just convincing you that SQLite has no problem with
# Unicode itself.
db.execute(u'INSERT INTO deadbeef (value) VALUES (?)', (deadbeef,))
try:
    db.execute(u'SELECT * FROM deadbeef').fetchall()
except sqlite3.OperationalError:
    print "Boom! Corrupted your database."
# As a bonus, if you run that SQLite query at the IPython prompt, it gets
# a second error trying to print out the error message.
A slightly simpler way to get hold of deadbeef: '+d,+6t,+vu8-'.decode('utf-7', 'ignore')
It works fine on OSX only because OSX's default Python is a narrow build. (Kind of disappointing for an OS with otherwise good support for lots of characters, including emoji.) The character just ends up being '\ubeef'.
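To make the narrow-build result concrete, here is a minimal sketch of the arithmetic. It assumes the truncation simply keeps the low 16 bits of the code point, which is consistent with the '\ubeef' reported above but has not been checked against the CPython source. The helper name `truncate_to_16_bits` is made up for illustration.

```python
def truncate_to_16_bits(code_point):
    """Keep only the low 16 bits, as a 16-bit storage unit would (assumed behavior)."""
    return code_point & 0xFFFF


truncated = truncate_to_16_bits(0xDEADBEEF)
print(hex(truncated))  # 0xbeef

# U+AC00..U+D7A3 is the Hangul Syllables block, so U+BEEF is an ordinary
# Korean syllable rather than anything out of range.
is_hangul_syllable = 0xAC00 <= truncated <= 0xD7A3
print(is_hangul_syllable)  # True
```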
Might be related: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=469644
@peterbe: it's a Python compile flag which controls whether Unicode support includes only the Basic Multilingual Plane or the full range of Unicode characters (i.e. does it end at 0xFFFF or 0x10FFFF). See http://www.python.org/dev/peps/pep-0261/
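A quick way to see which kind of build you're running is to look at `sys.maxunicode`, which is 0xFFFF on narrow builds and 0x10FFFF on wide builds (and always 0x10FFFF from Python 3.3 on, where the distinction went away). A minimal sketch; the function name `build_kind` is only for illustration:

```python
import sys


def build_kind(max_code_point=None):
    """Classify a build as 'narrow' or 'wide' from its largest code point."""
    if max_code_point is None:
        max_code_point = sys.maxunicode
    return 'narrow' if max_code_point == 0xFFFF else 'wide'


print(build_kind())
```

The optional argument is there only so both cases can be checked without having two interpreters installed.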
This used to only be of interest to those of us working with relatively obscure multilingual content but has become a lot more important for most people now that things outside the BMP like Emoji have become very common. It means that len() won't work as expected on those characters in most Python 2.x builds. Try running https://github.com/acdha/unix_tools/blob/master/bin/unicode-characters.py under both Python 2 and 3 if you're severely bored.
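The len() point can be illustrated without a narrow build at hand. A narrow build stores text as UTF-16 code units, so the length it reports should match the number of UTF-16 code units rather than the number of characters. The sketch below computes both counts; `utf16_length` is a hypothetical helper, and the snippet is written for Python 3.

```python
def utf16_length(text):
    """Count UTF-16 code units, i.e. what a narrow build's len() would report."""
    return len(text.encode('utf-16-le')) // 2


emoji = u'\U0001f602'  # outside the BMP, needs a surrogate pair in UTF-16
hangul = u'\ubeef'     # inside the BMP, fits in one code unit

print(len(emoji), utf16_length(emoji))    # 1 2
print(len(hangul), utf16_length(hangul))  # 1 1
```

Anything that slices or indexes strings on a narrow build is working with the second number, which is how an emoji can end up cut in half.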