text = open('a_unicode_file.txt', 'r').read()
print text
print 'type:', type(text) # str is a container for binary data
print 'bytes:', len(text) # The number of bytes, not characters!
print ' '.join(repr(b) for b in text)
print 'first byte:', text[:1] # Prints an invalid character!
try:
    # This fails: str.encode() first does an implicit decode('ascii'),
    # and the first bytes of the file are not valid ASCII.
    print text.encode('utf-8')
except UnicodeDecodeError as e:
    print e
print 'codecs'
print '------'
import codecs
text = codecs.open('a_unicode_file.txt', 'r', 'utf-8').read()
print text
print 'type:', type(text)
print 'chars:', len(text)
print ' '.join(repr(c) for c in text)
print 'first char:', text[:1]
An encoding is a set of rules for converting one or more bytes into characters, and characters back into bytes.
Unicode does not map bytes to characters! Unicode is a numeric mapping: it assigns each character a number (its code point), essentially an ID for that character.
UTF-8 is a variable-width encoding and a superset of ASCII. ASCII characters take 1 byte; characters beyond ASCII take 2, 3, or 4 bytes.
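
A minimal sketch of the variable-width claim, written so that it should run unchanged on Python 2.7 and Python 3. The sample characters are my own picks (one for each byte length), not taken from the original file.

```python
# One sample character per UTF-8 length, written as escapes so the
# source itself stays pure ASCII.
samples = [
    u'A',           # U+0041, inside ASCII
    u'\u00e9',      # U+00E9, LATIN SMALL LETTER E WITH ACUTE
    u'\u20ac',      # U+20AC, EURO SIGN
    u'\U0001F600',  # U+1F600, outside the Basic Multilingual Plane
]

# Encode each character and count the resulting bytes.
utf8_lengths = [len(ch.encode('utf-8')) for ch in samples]  # [1, 2, 3, 4]

# ASCII-only text produces identical bytes under both encodings,
# which is what "superset of ASCII" means in practice.
ascii_is_subset = u'AB'.encode('ascii') == u'AB'.encode('utf-8')  # True
```

The length of the byte string grows with the code point, while the number of characters stays at one per sample.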
Character | Code point | ASCII (hex) | UTF-8 (hex) |
---|---|---|---|
A | 0x41 | 41 | 41 |
B | 0x42 | 42 | 42 |
€ | 0x20AC | N/A | E2 82 AC |
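
The rows of the table can be reproduced with a few lines of code. This is only a sketch: `utf8_hex` is a hypothetical helper introduced here for illustration, and the code is meant to run on both Python 2.7 and Python 3.

```python
def utf8_hex(ch):
    """Return the UTF-8 bytes of one character as space-separated hex."""
    # bytearray yields integers on both Python 2 and Python 3.
    return ' '.join('%02X' % b for b in bytearray(ch.encode('utf-8')))

euro = u'\u20ac'  # EURO SIGN

# (code point, UTF-8 bytes) for each character in the table.
rows = [(hex(ord(ch)), utf8_hex(ch)) for ch in (u'A', u'B', euro)]
# rows == [('0x41', '41'), ('0x42', '42'), ('0x20ac', 'E2 82 AC')]

# The euro sign has no ASCII representation, hence the N/A in the table.
try:
    euro.encode('ascii')
    has_ascii_form = True
except UnicodeEncodeError:
    has_ascii_form = False
```

`ord` gives the code point, which is independent of any encoding; the bytes only appear once an encoding is chosen.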

The three UTF-8 bytes of € in binary:

E2 | 82 | AC |
---|---|---|
11100010 | 10000010 | 10101100 |
- The first byte begins with `1110`. This means it belongs with the next 2 bytes. All following bytes begin with `10`.
- Removing the prefixes on each byte leaves: `0010 000010 101100` = `0x20AC`
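
The prefix-stripping steps above can be sketched with plain bit operations. This handles only the 3-byte case from the example; it is an illustration, not a general UTF-8 decoder. It should run on both Python 2.7 and Python 3.

```python
b1, b2, b3 = 0xE2, 0x82, 0xAC  # the three bytes from the table

# A lead byte of the form 1110xxxx announces a 3-byte sequence;
# continuation bytes have the form 10xxxxxx.
is_three_byte_lead = (b1 >> 4) == 0b1110
are_continuations = (b2 >> 6) == 0b10 and (b3 >> 6) == 0b10

# Mask off the prefixes and join the 4 + 6 + 6 payload bits.
code_point = ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)

# Cross-check against the built-in decoder.
decoded = bytearray([b1, b2, b3]).decode('utf-8')
matches_builtin = ord(decoded) == code_point
```

Here `code_point` comes out as `0x20AC`, the same value the built-in decoder produces.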
- `str` does not contain text data! `str` contains binary data, i.e. bytes.
- Only `unicode` contains text, i.e. characters.
- You can't tell the encoding of a text file; you can only guess.
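
To see why the encoding can only be guessed, here is a small sketch that decodes the same bytes two ways. The choice of `latin-1` as the competing guess is my own; any single-byte encoding would make the same point. It should run on both Python 2.7 and Python 3.

```python
raw = b'\xe2\x82\xac'  # three bytes read from some file

# Guess 1: the file is UTF-8. The bytes form one character, U+20AC.
as_utf8 = raw.decode('utf-8')

# Guess 2: the file is Latin-1. Every byte is its own character,
# giving U+00E2, U+0082 and U+00AC.
as_latin1 = raw.decode('latin-1')

both_succeed_but_differ = (len(as_utf8), len(as_latin1)) == (1, 3)
```

Both calls succeed without error, and nothing in the bytes themselves says which result is the intended text.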
- Decode everything that comes in
- Keep everything as `unicode` inside your program
- Encode everything as it goes out (UTF-8 is safest)
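
The three rules above might look like this in a small function. `shout` is a hypothetical example and not part of the original material; it assumes the input arrives as UTF-8 bytes. It should run on both Python 2.7 and Python 3.

```python
def shout(raw_in, encoding='utf-8'):
    """Upper-case a byte string, treating its contents as text."""
    text = raw_in.decode(encoding)   # 1. decode as it comes in
    text = text.upper()              # 2. work on characters only
    return text.encode(encoding)     # 3. encode as it goes out

# 'caf' followed by U+00E9 in UTF-8; the accented letter is upper-cased
# to U+00C9, which encodes as C3 89.
result = shout(b'caf\xc3\xa9')  # b'CAF\xc3\x89'
```

Because the upper-casing happens on characters rather than bytes, the two-byte accented letter is handled as one unit.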
- Use `codecs.open` to read text files. This will give you `unicode`.
- Use `open` to read binary files. This will give you `str`.
- Use `codecs.open` to write text files. This means you only have to worry about encoding when you open the file for writing.
- Use `open` to write binary files. The bytes in any `str` you write to the file will be written byte-for-byte.
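
A short sketch of the file rules above, assuming a scratch file in a temporary directory (the path and the sample text are made up for the demonstration). The binary read uses mode `'rb'` so that it behaves the same on Python 2.7 and Python 3.

```python
import codecs
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# Text file: write characters, let codecs.open do the encoding.
with codecs.open(path, 'w', 'utf-8') as f:
    f.write(u'price: \u20ac5')

# Binary file: plain open gives the raw bytes as stored on disk.
with open(path, 'rb') as f:
    raw = f.read()

# Text file again: codecs.open decodes while reading.
with codecs.open(path, 'r', 'utf-8') as f:
    text = f.read()

# Clean up the scratch file and its directory.
os.remove(path)
os.rmdir(os.path.dirname(path))
```

The same file yields 9 characters through `codecs.open` but 11 bytes through `open`, because the euro sign occupies 3 bytes in UTF-8.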
- `str.decode(encoding)` -> `unicode`
- `unicode.encode(encoding)` -> `str`
Do not mix these up! `str.encode(encoding)` first does an implicit `decode('ascii')` and then `encode(encoding)`. If your `str` contains non-ASCII bytes, the first step will fail!
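
The hidden two-step behaviour can be made visible by writing both steps out by hand. This is a sketch of what the text describes, not the interpreter's actual code. Spelling the steps out also lets the example run on Python 3, where byte strings have no `encode` method at all.

```python
raw = b'\xe2\x82\xac'  # UTF-8 bytes of the euro sign

# The two steps that Python 2's str.encode('utf-8') performs implicitly.
try:
    raw.decode('ascii').encode('utf-8')
    failed_step = None
except UnicodeDecodeError:
    # The error is a *decode* error even though encode() was requested.
    failed_step = 'decode'

# Pure-ASCII bytes survive the same round trip unchanged, which is why
# the mistake often goes unnoticed until non-ASCII data shows up.
ascii_round_trip = b'abc'.decode('ascii').encode('utf-8')
```

Calling `raw.decode('utf-8')` instead is the correct operation when the goal is to turn these bytes into text.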