I know I'm late with this article for about 5 years or so, but people are still using Python 2.x, so this subject is relevant I think.
Some facts first:
- Unicode is an international encoding standard for use with different languages and scripts
- In python-2.x, there are two types that deal with text.
str
is an 8-bit string.unicode
is for strings of unicode code points.
A code point is a number that maps to a particular abstract character. It is written using the notation U+12ca to mean the character with value 0x12ca (4810 decimal)
- Encoding (noun) is a map of Unicode code points to a sequence of bytes. (Synonyms: character encoding, character set, codeset). Popular encodings: UTF-8, ASCII, Latin-1, etc.
- Encoding (verb) is a process of converting
unicode
to bytes ofstr
, and decoding is the reverce operation. - Python 2.x uses ASCII as a default encoding. (More about this later)
When you sees something like this
SyntaxError: Non-ASCII character '\xd0' in file /tmp/p.py on line 2, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
you just need to define encoding in the first or second line of your file.
All you need is to have string coding=utf8
or coding: utf8
somewhere in your comments.
Python doesn't care what goes before or after those string, so the following will work fine too:
# -*- encoding: utf-8 -*-
Notice the dash in utf-8. Python has many aliases for UTF-8 encoding, so you should not worry about dashes or case sensitivity.
>>> str(u'café')
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
str()
function encodes a string. We passed a unicode
string, and it tried to encode it using a default encoding, which is ASCII. Now the error makes sence because ASCII is 7-bit encoding which doesn't know how to represent characters outside of range 0..128.
Here we called str()
explicitly, but something in your code may call it implicitly and you will also get UnicodeEncodeError
.
How to fix: encode unicode
string manually using .encode('utf8')
before passing to str()
>>> utf_string = u'café'
>>> byte_string = utf_string.encode('utf8')
>>> unicode(byte_string)
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
Let's say we somehow obtained a byte string byte_string
which contains encoded UTF-8 characters. We could get this by simply using a library that returns str
type.
Then we passed the string to a function that converts it to unicode
. In this example we explicitly call unicode()
, but some functions may call it implicitly and you'll get the same error.
Now again, Python uses ASCII encoding by default, so it tries to convert bytes to a default encoding ASCII. Since there is no ASCII symbol that converts to 0xc3
(195 decimal) it fails with UnicodeDecodeError
.
How to fix: decode str
manually using .decode('utf8')
before passing to your function.
Make sure your code works only with Unicode strings internally, converting to a particular encoding on output, and decoding str
on input.
Learn the libraries you are using, and find places where they return str
. Decode str
before return value is passed further in your code.
I use this helper function in my code:
def force_to_unicode(text):
"If text is unicode, it is returned as is. If it's str, convert it to Unicode using UTF-8 encoding"
return text if isinstance(text, unicode) else text.decode('utf8')
Good information to me.
Thank you.