Python & Unicode

# Python 2: plain open() returns the file's raw bytes as a str
with open('a_unicode_file.txt', 'r') as f:
    text = f.read()
print text
print 'type:', type(text)       # str is a container for binary data
print 'bytes:', len(text)       # The number of bytes, not characters!
print ' '.join(repr(b) for b in text)
print 'first byte:', text[:1]   # An incomplete (invalid) character if the file starts with a multi-byte one!

try:
    # In Python 2, str.encode() first does an implicit decode('ascii'),
    # which fails here because the first bytes are not valid ASCII!
    print text.encode('utf-8')
except UnicodeDecodeError as e:
    print e

print 'codecs'
print '------'
import codecs
with codecs.open('a_unicode_file.txt', 'r', 'utf-8') as f:
    text = f.read()             # Decoded from UTF-8 into unicode while reading
print text
print 'type:', type(text)
print 'chars:', len(text)
print ' '.join(repr(c) for c in text)
print 'first char:', text[:1]

Basics

An encoding is a set of rules for converting between characters and sequences of one or more bytes.

Unicode is not an encoding!

Unicode does not map bytes to characters! Unicode maps each character to a number, called a code point, which is essentially an id for that character.
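A minimal sketch of the difference between a code point and its bytes. The variable names are illustrative, and the snippet is written so that it should run unchanged on Python 2.7 and Python 3. The same code point comes out as different bytes depending on which encoding is applied.

```python
euro = u'\u20ac'                          # the euro sign as a unicode character

code_point = ord(euro)                    # Unicode's numeric id for the character
utf8_bytes = euro.encode('utf-8')         # one byte representation of that id
utf16_bytes = euro.encode('utf-16-le')    # a different byte representation

print(hex(code_point))                    # 0x20ac
print(len(utf8_bytes))                    # 3
print(len(utf16_bytes))                   # 2
```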

UTF-8 is an encoding

UTF-8 is variable width and is a superset of ASCII. ASCII characters take 1 byte each; characters beyond ASCII are represented with 2, 3, or 4 bytes.

Character	Code point	ASCII (hex)	UTF-8 (hex)
A	0x41	41	41
B	0x42	42	42
€	0x20AC	N/A	E2 82 AC
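The variable width can be checked directly. This is a small sketch; the sample characters beyond A and € (an accented e and an emoji) are picked here for illustration and are not from the table above.

```python
# One sample character for each UTF-8 width: A, e-acute, euro sign, an emoji
samples = [u'A', u'\u00e9', u'\u20ac', u'\U0001F600']

# Number of bytes each character occupies once encoded as UTF-8
widths = [len(ch.encode('utf-8')) for ch in samples]
print(widths)                             # [1, 2, 3, 4]

# Pure-ASCII text is byte-for-byte identical in ASCII and UTF-8
ascii_compatible = u'AB'.encode('ascii') == u'AB'.encode('utf-8')
print(ascii_compatible)                   # True
```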

How does code point 0x20AC become the bytes E2 82 AC?

E2	82	AC
11100010	10000010	10101100

The first byte begins with 1110. This means it starts a 3-byte sequence, so it belongs with the next 2 bytes. Each of the following bytes begins with 10.
Removing the prefix from each byte leaves: 0010 000010 101100 = 0x20AC
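The prefix-stripping above can be sketched in a few lines. The function name `decode_three_byte_utf8` is hypothetical, and it only handles the 3-byte case; real decoders also check for things like overlong encodings.

```python
def decode_three_byte_utf8(raw):
    """Return the code point stored in one 3-byte UTF-8 sequence."""
    b0, b1, b2 = bytearray(raw)           # bytearray yields ints on Python 2 and 3
    if (b0 >> 4) != 0b1110:
        raise ValueError('lead byte must start with 1110')
    if (b1 >> 6) != 0b10 or (b2 >> 6) != 0b10:
        raise ValueError('continuation bytes must start with 10')
    # Keep the 4 + 6 + 6 payload bits and join them into one number
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

code_point = decode_three_byte_utf8(b'\xe2\x82\xac')
print(hex(code_point))                    # 0x20ac
```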

Other Common Misunderstandings

str does not contain text data! str contains binary data, i.e. bytes.
Only unicode contains text, i.e. characters.
You can't reliably tell the encoding of a text file from its contents. You can only guess.
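A short sketch of why the encoding can only be guessed: the same bytes decode without error under more than one encoding, and nothing in the bytes says which result was intended. The choice of Latin-1 as the second encoding is just an example.

```python
raw = b'\xe2\x82\xac'                     # three bytes of unknown origin

as_utf8 = raw.decode('utf-8')             # one character: the euro sign
as_latin1 = raw.decode('latin-1')         # three unrelated characters

print(len(as_utf8))                       # 1
print(len(as_latin1))                     # 3
```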

Golden Rules

Decode everything that comes in
Keep everything as unicode inside your program
Encode everything as it goes out (UTF-8 is safest)
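The three rules can be sketched as one small function. The name `shout` and its `encoding` parameter are made up for illustration; the input encoding is assumed to be known by the caller.

```python
def shout(raw_in, encoding='utf-8'):
    """Upper-case some encoded text, following the three rules."""
    text = raw_in.decode(encoding)        # 1. decode everything that comes in
    text = text.upper()                   # 2. work on unicode inside the program
    return text.encode('utf-8')           # 3. encode everything as it goes out

result = shout(b'caf\xc3\xa9')            # UTF-8 bytes for "cafe" with e-acute
print(repr(result))
```

Because the upper-casing happens on unicode, the accented character is handled as one character rather than as two separate bytes.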

Reading files

Use codecs.open to read text files. This will give you unicode.
Use open to read binary files. This will give you str.
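A sketch of both ways of reading the same file. The temporary file and its contents are set up here only so the example is self-contained, and the explicit 'rb' mode is an assumption made so that the binary read behaves the same on Python 2 and Python 3.

```python
import codecs
import os
import tempfile

# Create a small UTF-8 file to read back (euro sign followed by "100")
path = os.path.join(tempfile.mkdtemp(), 'a_unicode_file.txt')
with open(path, 'wb') as f:
    f.write(b'\xe2\x82\xac100')

with codecs.open(path, 'r', 'utf-8') as f:
    text = f.read()                       # decoded text: 4 characters

with open(path, 'rb') as f:
    raw = f.read()                        # undecoded bytes: 6 bytes

print(len(text))                          # 4
print(len(raw))                           # 6
```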

Writing files

Use codecs.open to write text files. This means you only have to worry about encoding when you open the file for writing.
Use open to write binary files. The bytes in any str you write to the file will be written byte-for-byte.
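A sketch of the writing side, using a throwaway temporary path chosen for illustration. The encoding is named once when the file is opened, and the program only ever hands unicode to `write`.

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'out.txt')

# Text: the encoding is chosen once, at open time
with codecs.open(path, 'w', 'utf-8') as f:
    f.write(u'\u20ac100\n')

# Reading the file back in binary mode shows the bytes that were written
with open(path, 'rb') as f:
    written = f.read()

print(repr(written))
```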

Converting between str and unicode

str.decode(encoding) -> unicode
unicode.encode(encoding) -> str

Do not make a mistake!

str.encode(encoding) implicitly does decode('ascii') first and then encode(encoding). If your str contains non-ASCII bytes, the first step will fail with a UnicodeDecodeError!
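A sketch of the mistake and the fix. The implicit decode('ascii') is written out explicitly here, as an assumption made so the snippet also runs on Python 3, where bytes has no encode method. UTF-16 is used as the target encoding only as an example.

```python
raw = b'caf\xc3\xa9'                      # UTF-8 bytes (a str in Python 2)

# The hidden first step of Python 2's str.encode(), spelled out
try:
    raw.decode('ascii').encode('utf-16-le')
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(exc)

# Naming the real source encoding makes the conversion work
converted = raw.decode('utf-8').encode('utf-16-le')
print(repr(converted))
```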

judy2k/unicode.md