My quick summary of Python 2.7 strings and Unicode.
Python has two string types:
- type str
- type unicode
Examples:
>>> type("abc")
<type 'str'>
>>> type(u"def")
<type 'unicode'>
They come with the following restrictions:
- type str strings may only contain byte values (0-255)
- type unicode strings may contain any Unicode code point (up to 65535 on a narrow Python build)
In general, do not mix the two types - i.e. only concatenate str strings with other str strings, only use unicode strings as arguments to operations on other unicode strings, etc.
In practice, there are three categories of string values:
- (I) str strings which do not contain any byte values > 127
- (II) str strings which contain byte values >= 128
- (III) unicode strings
The main error people run into is mixing category II strings with category III strings in string operations. For instance, each of the following will produce an error:
ustr = u"Joe" # category III
bstr = "\xa0" # category II
ustr + bstr
ustr.find(bstr)
bstr.find(ustr)
ustr.replace(bstr, " ")
The error message:
UnicodeDecodeError: 'ascii' codec can't decode byte ...
in position ...: ordinal not in range(128)
is a tell-tale sign that you are mixing a category II str string with a unicode string.
Mixing category I and category III strings is generally OK:
"My name is " + ustr
ustr.find('e')
chr(127) + ustr
When concatenating a unicode string and a str string (in either order) the result will be a unicode string.
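For example:
>>> type("My name is " + u"Joe")
<type 'unicode'>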
Construct str strings with the conventional single or double quotes. Use \xXX for byte values represented in hex. The function chr() creates a string consisting of a single byte:
"Hello, world"
'Non-breaking space: \xa0'
'Non-breaking space: ' + chr(160)
Construct unicode literals with u"..." or u'...' quotes. Inside these quotes use the escape \uXXXX to create a single code point with hex value XXXX. The unichr() function creates a unicode string consisting of a single code point:
u"Hello, world"
u'Non-breaking space: \u00a0'
u"Non-breaking space: " + unichr(160)
(Note: The character escape \xXX also works in unicode strings and is equivalent to \u00XX.)
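For example:
>>> u'\xa0' == u'\u00a0'
True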
How strings are printed depends on your LANG environment variable (on Unix-like systems).
Printing a str string always works. The bytes of the string are simply written to stdout.
When printing a unicode string each code point will be translated according to the encoding specified by the LANG environment variable. An error will result if the encoding does not support the code point you are trying to print.
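For example, in a UTF-8 locale (e.g. LANG=en_US.UTF-8) the following prints the smiley character, while under an ASCII locale such as LANG=C it raises a UnicodeEncodeError:
>>> print u"\u263a"
☺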
Both raw_input() and sys.stdin.read(...) return str strings.
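For example, typing a line of text at the prompt:
>>> type(raw_input())
some text
<type 'str'>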
To read stdin (or any other file handle) as code points, you can create a wrapper with codecs.getreader(...):
import sys
import codecs
char_stream = codecs.getreader("utf-8")(sys.stdin)
Reading from char_stream will return unicode strings.
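For example:
line = char_stream.readline()
print type(line)   # <type 'unicode'>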
This will read the contents of a file as a str string:
with open("foo", "r") as f: contents = f.read()
And this will write the str string contents to a file:
with open("bar", "w") as f: f.write(contents)
To read and write unicode strings, use codecs.open(...) with an encoding setting:
import codecs
with codecs.open("foo", "r", encoding="utf-8") as f:
    contents = f.read()
...
with codecs.open("bar", "w", encoding="utf-16-le") as f:
    f.write(contents)
A str string may be decoded to a unicode string using the .decode(...) method:
ustr = "\xe2\x96\xba".decode("utf-8")
# results in u"\u25ba"
A unicode string may be encoded to a str string using the .encode(...) method:
bytes = u"\u263a".encode("utf-8")
# results in "\xe2\x98\xba"
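A quick check that encoding and then decoding with the same codec returns the original unicode string:
>>> u"\u263a".encode("utf-8").decode("utf-8") == u"\u263a"
True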