    text = open('a_unicode_file.txt', 'r').read()
    print text
    print 'type:', type(text)  # str is a container for binary data
    print 'bytes:', len(text)  # The number of bytes, not characters!
    print ' '.join(repr(b) for b in text)
    print 'first byte:', text[:1]  # Prints an invalid character!
    try:
        # This fails: str.encode() first does an implicit decode('ascii'),
        # and the first bytes are not valid ASCII.
        print text.encode('utf-8')
    except Exception as e:
        print e

    print 'codecs'
    print '------'
    import codecs
    text = codecs.open('a_unicode_file.txt', 'r', 'utf-8').read()
    print text
    print 'type:', type(text)  # unicode holds characters
    print 'chars:', len(text)  # The number of characters this time
    print ' '.join(repr(c) for c in text)
    print 'first char:', text[:1]

An encoding is a set of rules for converting one or more bytes into characters.
Unicode does not map bytes to characters! Unicode is a numeric mapping, essentially an ID (a code point) for each character.
UTF-8 is a variable-width encoding and a superset of ASCII: ASCII characters take 1 byte, and characters beyond ASCII take 2, 3, or 4 bytes.

| Character | Codepoint | ASCII | UTF-8 |
|---|---|---|---|
| A | 0x41 | 41 | 41 |
| B | 0x42 | 42 | 42 |
| € | 0x20AC | N/A | E2 82 AC |
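
The rows of the table can be reproduced with a small sketch. The helper name `describe` is made up for illustration, and the `u'...'` literals are used so the snippet should behave the same on Python 2 and Python 3.

```python
import binascii

def describe(char):
    """Return (code point, UTF-8 bytes as an uppercase hex string)."""
    codepoint = ord(char)                      # the Unicode number
    encoded = char.encode('utf-8')             # the bytes UTF-8 uses for it
    utf8_hex = binascii.hexlify(encoded).decode('ascii').upper()
    return codepoint, utf8_hex

rows = [describe(c) for c in (u'A', u'B', u'\u20ac')]
for codepoint, utf8_hex in rows:
    print('0x%X -> %s' % (codepoint, utf8_hex))
```

The code point is the same whatever encoding is chosen. Only the byte column depends on UTF-8.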

In binary, the three UTF-8 bytes for € are:

| E2 | 82 | AC |
|---|---|---|
| 11100010 | 10000010 | 10101100 |

- The first byte begins with `1110`. This means it belongs with the next 2 bytes. All following bytes begin with `10`.
- Removing the prefix from each byte leaves `0010 000010 101100` = `0x20AC`.
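
The prefix-stripping step can be written out with bit masks. This is a minimal sketch that handles only the three-byte case discussed here. The function name `decode_three_byte_utf8` is hypothetical, and a real decoder would also handle 1-, 2-, and 4-byte sequences.

```python
def decode_three_byte_utf8(b1, b2, b3):
    """Combine three UTF-8 byte values (ints) into one code point."""
    # The lead byte must look like 1110xxxx.
    if b1 >> 4 != 0b1110:
        raise ValueError('not a three-byte lead byte')
    # Each continuation byte must look like 10xxxxxx.
    if b2 >> 6 != 0b10 or b3 >> 6 != 0b10:
        raise ValueError('bad continuation byte')
    # Drop the prefixes, keeping 4 + 6 + 6 payload bits, and concatenate.
    return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)

codepoint = decode_three_byte_utf8(0xE2, 0x82, 0xAC)
print(hex(codepoint))  # 0x20ac
```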

- `str` does not contain text data! `str` contains binary data, i.e. bytes.
- Only `unicode` contains text, i.e. characters.
- You can't tell the encoding of a text file, only guess.
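
To see why the encoding can only be guessed, here is a small sketch that decodes the same three bytes in two ways. It assumes nothing beyond the standard codecs. The `b'...'` literal is `str` on Python 2 and `bytes` on Python 3, and the decoded results are `unicode` on Python 2 and `str` on Python 3.

```python
raw = b'\xe2\x82\xac'                # three bytes, no encoding attached

as_utf8 = raw.decode('utf-8')        # one character: the euro sign
as_latin1 = raw.decode('latin-1')    # three unrelated characters

# Both decodes succeed, so the bytes alone cannot say which one is right.
print('%d bytes, %d char as UTF-8, %d chars as Latin-1'
      % (len(raw), len(as_utf8), len(as_latin1)))
```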

- Decode everything that comes in
- Keep everything as `unicode` inside your program
- Encode everything as it goes out (UTF-8 is safest)
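
The three rules above can be sketched as a single function: bytes at the edges, characters in the middle. The helper `normalize_name` and its Latin-1 input are made up for illustration. The snippet is written to run on both Python 2 and Python 3.

```python
def normalize_name(raw_name, source_encoding):
    """Hypothetical helper: bytes in, bytes out, characters in between."""
    text = raw_name.decode(source_encoding)   # 1. decode on the way in
    text = text.strip().upper()               # 2. work on characters only
    return text.encode('utf-8')               # 3. encode on the way out

latin1_input = b'  caf\xe9 '                  # "café" encoded as Latin-1
result = normalize_name(latin1_input, 'latin-1')
print(repr(result))
```

Because the uppercasing happens on decoded text, the accented letter is treated as one character rather than as a stray byte.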

- Use `codecs.open` to read text files. This will give you `unicode`
- Use `open` to read binary files. This will give you `str`
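
A minimal sketch of both ways of reading, using a throwaway temporary file so it is self-contained. The binary read uses mode `'rb'`, an assumption added here so the snippet also returns raw bytes on Python 3 and on Windows.

```python
import codecs
import os
import tempfile

# Create a small UTF-8 file to read back.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(b'\xe2\x82\xac10')                # the text "€10" as UTF-8 bytes

with codecs.open(path, 'r', 'utf-8') as f:
    text = f.read()                           # decoded characters
with open(path, 'rb') as f:
    data = f.read()                           # raw bytes, untouched

os.remove(path)
print('%d chars, %d bytes' % (len(text), len(data)))  # 3 chars, 5 bytes
```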

- Use `codecs.open` to write text files. This means you only have to worry about encoding when you open the file for writing.
- Use `open` to write binary files. The bytes in any `str` you write to the file will be written byte-for-byte.
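
As a sketch of the writing side, the snippet below picks the encoding once in `codecs.open`, writes characters, and then inspects the bytes that ended up on disk. The temporary file and the sample text are assumptions made for illustration.

```python
import codecs
import os
import tempfile

fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

# The encoding is chosen once, when the file is opened for writing.
with codecs.open(path, 'w', 'utf-8') as f:
    f.write(u'price: \u20ac5')                # write characters, not bytes

with open(path, 'rb') as f:
    on_disk = f.read()                        # the bytes that were written

os.remove(path)
print(len(on_disk))  # 11
```

The euro sign accounts for three of the eleven bytes. Every other character in the sample is ASCII and takes one byte.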

- `str.decode(encoding)` -> `unicode`
- `unicode.encode(encoding)` -> `str`

Do not make this mistake!
`str.encode(encoding)` does an implicit `decode('ascii')` and then `encode(encoding)`.
If your `str` contains non-ASCII bytes, the first step will fail!
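
The failure can be reproduced by spelling out the two implicit steps. This is a sketch, and `naive_encode` is a hypothetical stand-in for what Python 2's `str.encode` does internally. It is written this way so the snippet also runs on Python 3, where `bytes` has no `encode` method at all.

```python
raw = b'\xe2\x82\xac'                         # UTF-8 bytes for the euro sign

def naive_encode(data, encoding):
    """Mimic the implicit decode('ascii') followed by encode(encoding)."""
    return data.decode('ascii').encode(encoding)

try:
    naive_encode(raw, 'utf-8')
    failed = False
except UnicodeDecodeError as e:
    failed = True
    print('failed: %s' % e.reason)

# The fix: decode with the encoding the bytes really use, then encode.
text = raw.decode('utf-8')
round_trip = text.encode('utf-8')
```

Note that the exception is a `UnicodeDecodeError` even though the call that triggered it was an encode, which is what makes this mistake confusing to debug.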