My quick summary of Python 2.7 strings and Unicode.
Python has two string types:
- type str
- type unicode
Examples:
>>> type("abc")
<type 'str'>
>>> type(u"def")
<type 'unicode'>
They come with the following restrictions:
- type str strings may only contain byte values (0-255)
- type unicode strings may contain any Unicode code point (up to U+FFFF on narrow Python builds, up to U+10FFFF on wide builds)
In general, do not mix the two types - i.e. only concatenate str strings with other str strings, only use unicode strings as arguments to operations on other unicode strings, etc.
In practice, there are three categories of string values:
- (I) str strings which do not contain any byte values > 127
- (II) str strings which contain byte values >= 128
- (III) unicode strings
The main error people run into is mixing category II strings with category III strings in string operations. For instance, each of the following will produce an error:
ustr = u"Joe" # category III
bstr = "\xa0" # category II
ustr + bstr
ustr.find( bstr )
bstr.find( ustr )
ustr.replace(bstr, " ")
The error message:
UnicodeDecodeError: 'ascii' codec can't decode byte ...
in position ...: ordinal not in range(128)
is a tell-tale sign that you are mixing a category II str string with a unicode string.
Mixing category I and category III strings is generally OK:
"My name is " + ustr
ustr.find('e')
chr(127) + ustr
When concatenating a unicode string and a str string (in either order) the result will be a unicode string.
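For example, in an interactive session you would see something like:

>>> type("My name is " + u"Joe")
<type 'unicode'>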
Construct str strings with the conventional single or double quotes. Use \xXX for byte values represented in hex. The function chr() creates a string consisting of a single byte:
"Hello, world"
'Non-breaking space: \xa0'
'Non-breaking space: ' + chr(160)
Construct unicode literals with u"..." or u'...' quotes. Inside these quotes use the escape \uXXXX to create a single code point with hex value XXXX. The unichr() function creates a unicode string consisting of a single code point:
u"Hello, world"
u'Non-breaking space: \u00a0'
u"Non-breaking space: " + unichr(160)
(Note: The character escape \xXX also works in unicode strings and is equivalent to \u00XX.)
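A quick interactive check of that equivalence:

>>> u'\xa0' == u'\u00a0'
True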
How strings are printed depends on your LANG environment variable (on Unix-like systems).
Printing a str string always works. The bytes of the string are simply written to stdout.
When printing a unicode string, each code point will be translated according to the encoding specified by the LANG environment variable. An error will result if the encoding does not support the code point you are trying to print.
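A quick way to see what encoding will be used (a sketch; it assumes stdout is a terminal, and the exact encoding name depends on your locale):

import sys
print sys.stdout.encoding   # e.g. "UTF-8" under LANG=en_US.UTF-8; None when piped
print u"caf\u00e9"          # OK under a UTF-8 (or Latin-1) locale
print u"\u263a"             # UnicodeEncodeError under LANG=C (ASCII)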
Both raw_input() and sys.stdin.read(...) return str strings.
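For example (an illustrative session; the typed input is arbitrary):

>>> s = raw_input()
hello
>>> type(s)
<type 'str'>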
To read stdin (or any other file handle) as code points, you can create a wrapper with codecs.getreader(...):
import codecs
char_stream = codecs.getreader("utf-8")(sys.stdin)
Reading from char_stream will return unicode strings.
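Continuing the snippet above, for example:

>>> line = char_stream.readline()
>>> type(line)
<type 'unicode'>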
This will read the contents of a file as a str string:
with open("foo", "r") as f: contents = f.read()
And this will write the str string contents to a file:
with open("bar", "w") as f: f.write(contents)
To read and write unicode strings, use codecs.open(...) with an encoding setting:
import codecs
with codecs.open("foo", "r", encoding="utf-8") as f:
    contents = f.read()
...
with codecs.open("bar", "w", encoding="utf-16-le") as f:
    f.write(contents)
A str string may be decoded to a unicode string using the .decode(...) method:
ustr = "\xe2\x96\xba".decode("utf-8")
# results in u"\u2631"
A unicode string may be encoded to a str string using the .encode(...) method:
bytes = u"\u263a".encode("utf-8")
# results in "\xe2\x98\xba"
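These two methods are also the way to resolve the category II / category III mixing error shown earlier: decode the str string to unicode first. (The encoding below is assumed to be Latin-1 for illustration; use whatever encoding your bytes actually are in.)

ustr = u"Joe"                   # category III
bstr = "\xa0"                   # category II
ustr + bstr.decode("latin-1")   # assuming Latin-1 bytes; gives u"Joe\xa0", no error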