My quick summary of Python 2.7 strings and Unicode.
Python has two string types:
- type str
- type unicode
Examples:
>>> type("abc")
<type 'str'>
>>> type(u"def")
<type 'unicode'>
They come with the following restrictions:
- type str strings may only contain byte values (0-255)
- type unicode strings may contain any Unicode code point (up to 65535 on a narrow Python build)
In general, do not mix the two types - i.e. only concatenate str strings with other str strings, only use unicode strings as arguments to operations on other unicode strings, etc.
In practice, there are three categories of string values:
- (I) str strings which do not contain any byte values > 127
- (II) str strings which contain byte values >= 128
- (III) unicode strings
The main error people run into is mixing category II strings with category III strings in string operations. For instance, each of the following will produce an error:
ustr = u"Joe" # category III
bstr = "\xa0" # category II
ustr + bstr
ustr.find(bstr)
bstr.find(ustr)
ustr.replace(bstr, " ")
The error message:
UnicodeDecodeError: 'ascii' codec can't decode byte ...
in position ...: ordinal not in range(128)
is a tell-tale sign that you are mixing a category II str string with a unicode string.
Mixing category I and category III strings is generally OK:
"My name is " + ustr
ustr.find('e')
chr(127) + ustr
When concatenating a unicode string and a str string (in either order) the result will be a unicode string.
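For example:
>>> type("My name is " + u"Joe")
<type 'unicode'>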
Construct str strings with the conventional single or double quotes. Use \xXX for byte values represented in hex. The function chr() creates a string consisting of a single byte:
"Hello, world"
'Non-breaking space: \xa0'
'Non-breaking space: ' + chr(160)
Construct unicode literals with u"..." or u'...' quotes. Inside these quotes use the escape \uXXXX to create a single code point with hex value XXXX. The unichr() function creates a unicode string consisting of a single code point:
u"Hello, world"
u'Non-breaking space: \u00a0'
u"Non-breaking space: " + unichr(160)
(Note: The character escape \xXX also works in unicode strings and is equivalent to \u00XX.)
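For example:
>>> u'\xa0' == u'\u00a0'
True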
How strings are printed depends on your LANG environment variable (on Unix-like systems).
Printing a str string always works. The bytes of the string are simply written to stdout.
When printing a unicode string each code point will be translated according to the encoding specified by the LANG environment variable. An error will result if the encoding does not support the code point you are trying to print.
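For example, in a UTF-8 locale (e.g. LANG=en_US.UTF-8) the following prints the smiley character, while under an ASCII locale such as LANG=C it raises a UnicodeEncodeError:
>>> print u"\u263a"
☺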
Both raw_input() and sys.stdin.read(...) return str strings.
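For example, typing a line of text at the prompt:
>>> type(raw_input())
some text
<type 'str'>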
To read stdin (or any other file handle) as code points, you can create a wrapper with codecs.getreader(...):
import sys
import codecs
char_stream = codecs.getreader("utf-8")(sys.stdin)
Reading from char_stream will return unicode strings.
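For example:
line = char_stream.readline()
print type(line)   # <type 'unicode'>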
This will read the contents of a file as a str string:
with open("foo", "r") as f: contents = f.read()
And this will write the str string contents to a file:
with open("bar", "w") as f: f.write(contents)
To read and write unicode strings, use codecs.open(...) with an encoding setting:
import codecs
with codecs.open("foo", "r", encoding="utf-8") as f:
    contents = f.read()
...
with codecs.open("bar", "w", encoding="utf-16-le") as f:
    f.write(contents)
A str string may be decoded to a unicode string using the .decode(...) method:
ustr = "\xe2\x96\xba".decode("utf-8")
# results in u"\u25ba"
A unicode string may be encoded to a str string using the .encode(...) method:
bytes = u"\u263a".encode("utf-8")
# results in "\xe2\x98\xba"
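A quick check that encoding and then decoding with the same codec returns the original unicode string:
>>> u"\u263a".encode("utf-8").decode("utf-8") == u"\u263a"
True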