MessagePack-Python and msgpack-string-ext

Background of Python

Python 2

Python 2 has two string types: str and unicode. There is also a bytes type, but it is just an alias for str (str is bytes).

bytearray is a type dedicated to binary data that is not text. It is the mutable counterpart of bytes, which is immutable.

When an operation such as concatenation mixes bytes and unicode, the bytes operand is cast to unicode with the 'ascii' codec. The cast succeeds if the bytes contain only ASCII characters; otherwise it raises UnicodeDecodeError.

The default encoding used for this cast can be changed to UTF-8. Doing so is strongly discouraged, because it is a per-environment setting that affects all code running in that interpreter.
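The implicit cast can be imitated explicitly. The following is a minimal sketch written for Python 3, where no implicit cast exists; the helper `py2_style_concat` is hypothetical and simply decodes with the 'ascii' codec by hand, which is what Python 2 does behind the scenes when bytes and unicode are mixed.

```python
def py2_style_concat(raw, text):
    """Imitate Python 2's `bytes + unicode` by decoding the bytes as ASCII.

    Hypothetical helper for illustration only; Python 2 performs this
    decode implicitly.
    """
    return raw.decode('ascii') + text


# ASCII-only bytes are promoted without any visible error.
print(py2_style_concat(b'foo', u'bar'))

# Non-ASCII bytes (here, UTF-8 encoded "e with acute accent") fail to decode.
try:
    py2_style_concat(b'caf\xc3\xa9', u'bar')
except UnicodeDecodeError as exc:
    print('UnicodeDecodeError:', exc.reason)
```

The failure depends on the data rather than on the types, which is why such bugs tend to surface only when non-ASCII input arrives.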

Python 3

The unicode type of Python 2 is renamed to str, and it is the only string type.

bytes keeps string-like methods, so it can be manipulated much like a string, but it is not a string: indexing, for example, returns an integer.

Mixing bytes and unicode (str) in an operation never triggers an implicit cast; it always raises TypeError.
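The Python 3 behavior described above can be checked directly. This is a small sketch for Python 3; the helper `try_concat` is a hypothetical name used only to capture the exception.

```python
data = b'abc'
text = 'abc'

# bytes still offers string-like methods...
print(data.upper())

# ...but indexing yields an integer, while str yields a length-1 string.
print(data[0], text[0])


def try_concat(raw, string):
    """Return raw + string, or None when Python refuses to mix the types."""
    try:
        return raw + string
    except TypeError:
        return None


# No implicit cast happens, even for ASCII-only data.
print(try_concat(data, text))
```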

Is Python a strong-string language from the msgpack-string-ext point of view?

Python 3 is clearly a strong-string language.

For Python 2, on the other hand, it depends on the application.

For example, json behaves as follows:

>>> import json
>>> json.dumps(['foo', u'bar'])
'["foo", "bar"]'
>>> json.loads(_)
[u'foo', u'bar']

When supporting the msgpack string extension in Python 2, the natural design is for the Packer to treat everything, including bytes, as string (weak-string behavior) and for the Unpacker to unpack string as unicode (strong-string behavior).

However, a widespread best practice is to use unicode for every string an application handles as data. Such applications will expect the Packer to behave as strong-string.

msgpack-python's design

Current behavior of msgpack-python

msgpack-python 0.3 already has two options for string handling: encoding and unicode_errors.

The Packer's encoding option specifies the encoding used to convert unicode passed to the Packer into raw. The default is 'utf-8'. Setting it to None makes the Packer raise an error when unicode is passed.

The Unpacker's encoding option specifies the encoding used to decode raw into unicode. The default is None, in which case raw is unpacked to bytes.

The unicode_errors option specifies how errors during encoding and decoding are handled, with the same behavior as the errors argument of unicode.encode and bytes.decode. The default is 'strict' for both Packer and Unpacker, which raises an exception.
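Since unicode_errors follows the error handlers of the standard codecs, their effect can be illustrated without msgpack at all. The sketch below uses only built-in `str.encode` and `bytes.decode` in Python 3 (where str plays the role of unicode); the choice of the 'ascii' codec for the encoding half is just an assumption that makes the failure easy to trigger.

```python
text = u'caf\u00e9'

# 'strict' raises as soon as a character cannot be encoded.
try:
    text.encode('ascii', 'strict')
except UnicodeEncodeError:
    print('strict: UnicodeEncodeError')

# 'replace' substitutes a placeholder; 'ignore' drops the character.
print(text.encode('ascii', 'replace'))
print(text.encode('ascii', 'ignore'))

# The same handlers apply when decoding bytes that are not valid UTF-8.
raw = b'caf\xe9'  # Latin-1 encoded, not UTF-8
try:
    raw.decode('utf-8', 'strict')
except UnicodeDecodeError:
    print('strict: UnicodeDecodeError')
print(raw.decode('utf-8', 'ignore'))
```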

Packer

Storing non-UTF-8 data in string (raw) is now deprecated. The Packer's encoding and unicode_errors options are therefore deprecated immediately.

If an application wants to store a non-UTF-8 string in msgpack, it must encode the string itself and pass the resulting bytes. This applies to both MessagePack 1.0 and 2.0.

In place of the encoding option, a mode option is added to control which msgpack types the Packer emits.

type \ mode	'unicode'	'bytes'	'compat'
bytearray	binary	binary	raw
bytes	binary	string	raw
unicode	string	string	raw
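The table can be read as a lookup from (mode, Python type) to msgpack type. The sketch below is not msgpack-python code: `WIRE_TYPE` and `wire_type` are hypothetical names, and because it is written for Python 3, `str` stands in for the unicode row.

```python
# Rows and columns of the mode table, keyed by mode first.
# In Python 3, `str` plays the role of Python 2's `unicode`.
WIRE_TYPE = {
    'unicode': {bytearray: 'binary', bytes: 'binary', str: 'string'},
    'bytes':   {bytearray: 'binary', bytes: 'string', str: 'string'},
    'compat':  {bytearray: 'raw',    bytes: 'raw',    str: 'raw'},
}


def wire_type(obj, mode='compat'):
    """Return the msgpack type name the proposed Packer would emit for obj."""
    if mode not in WIRE_TYPE:
        raise ValueError('unknown mode: %r' % (mode,))
    for py_type, wire in WIRE_TYPE[mode].items():
        if isinstance(obj, py_type):
            return wire
    raise TypeError('not a string-like object: %r' % (type(obj),))


print(wire_type(b'abc', mode='unicode'))
print(wire_type(b'abc', mode='bytes'))
print(wire_type(u'abc'))
```

Only the bytes row differs between 'unicode' and 'bytes' mode, which matches the strong-string versus weak-string distinction discussed earlier.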

The new string type does not merely replace the old raw type; it also uses a new format for some lengths. The compat mode therefore emits only the raw formats, so that old decoders can still unpack the output.
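To make the length-dependent difference concrete, here is a sketch of header selection. It assumes the header bytes of the published MessagePack specification (fixstr 0xa0-0xbf, str 8 0xd9, str 16 0xda, str 32 0xdb), which may differ from the draft this memo refers to; `string_header` and its `compat` flag are hypothetical names.

```python
import struct


def string_header(length, compat=True):
    """Build the header for a string payload of the given byte length.

    Assumes the published MessagePack spec; with compat=True the
    str 8 format, which old decoders do not know, is never used.
    """
    if length < 32:
        return struct.pack('B', 0xa0 | length)      # fixraw / fixstr
    if length < 256 and not compat:
        return struct.pack('BB', 0xd9, length)      # str 8 (new format)
    if length < 2 ** 16:
        return struct.pack('>BH', 0xda, length)     # raw 16 / str 16
    if length < 2 ** 32:
        return struct.pack('>BI', 0xdb, length)     # raw 32 / str 32
    raise ValueError('string too long for msgpack')


# A 100-byte string: 3-byte header in compat mode, 2-byte header otherwise.
print(string_header(100, compat=True))
print(string_header(100, compat=False))
```

Under these assumptions, only payloads of 32 to 255 bytes are encoded differently between the two settings.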

The default mode is 'compat' until old decoders that do not support string-ext have almost entirely fallen out of use.

Unpacker

The Unpacker's encoding and unicode_errors options are deprecated as well.

In their place, two new keyword arguments specify how string and binary are unpacked.

binary:

'bytes': Unpack to bytes type (default)
'bytearray': Unpack to bytearray type

string:

'unicode': Unpack to unicode type. Raises UnicodeDecodeError when UTF-8 decoding fails.
'fallback': Unpack to unicode type. Falls back to bytes type when UTF-8 decoding fails.
'bytes': Unpack to bytes type. (default)

The default value of string will be changed to 'unicode' once the deprecation process is complete.
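The option values listed above can be sketched as two small conversion functions. These are not part of msgpack-python: `unpack_string` and `unpack_binary` are hypothetical helpers that take an already extracted payload, written for Python 3 where unicode is spelled str.

```python
def unpack_string(payload, string='bytes'):
    """Convert a string payload according to the proposed `string` option."""
    if string == 'bytes':
        return bytes(payload)
    if string in ('unicode', 'fallback'):
        try:
            return bytes(payload).decode('utf-8')
        except UnicodeDecodeError:
            if string == 'fallback':
                # Keep the undecodable data as bytes instead of failing.
                return bytes(payload)
            raise
    raise ValueError('unknown string option: %r' % (string,))


def unpack_binary(payload, binary='bytes'):
    """Convert a binary payload according to the proposed `binary` option."""
    if binary == 'bytes':
        return bytes(payload)
    if binary == 'bytearray':
        return bytearray(payload)
    raise ValueError('unknown binary option: %r' % (binary,))


print(unpack_string(b'caf\xc3\xa9', string='unicode'))
print(unpack_string(b'caf\xe9', string='fallback'))
```

With 'fallback', a caller has to be prepared to receive either type, which is the price of never raising on invalid UTF-8.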

Roadmap

0.4 (or 0.5)

schedule: ~2013-04-01

deprecate: encoding and unicode_errors options for Packer and Unpacker
add: mode option for Packer
add: binary and string options for Unpacker. The default value of string is 'bytes', but omitting the string parameter causes a warning.
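One way to warn only when the string parameter is omitted is a sentinel default. The sketch below is an assumption about how this could be done, not the planned implementation: `resolve_string_option`, the `_MISSING` sentinel, the message text, and the choice of FutureWarning as the category are all hypothetical.

```python
import warnings

# Sentinel that distinguishes "not passed" from an explicit 'bytes'.
_MISSING = object()


def resolve_string_option(string=_MISSING):
    """Return the effective `string` option, warning if it was omitted."""
    if string is _MISSING:
        warnings.warn(
            "the default of 'string' will change from 'bytes' to 'unicode'; "
            "pass string= explicitly",
            FutureWarning,  # assumed category; the memo only says "warning"
            stacklevel=2,
        )
        return 'bytes'
    return string


print(resolve_string_option('unicode'))
```

Passing string='bytes' explicitly keeps the current behavior without triggering the warning.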

1.0

schedule: ~2014-04-01

remove: encoding and unicode_errors options for Packer and Unpacker.
change: the default value of the Unpacker's string option becomes 'unicode'.

methane/msgpack-python-memo.rst