Skip to content

Instantly share code, notes, and snippets.

@tail-call
Created September 21, 2024 05:46
Show Gist options
  • Save tail-call/af50760561cfe497c5af752c0d9ec6be to your computer and use it in GitHub Desktop.
Save tail-call/af50760561cfe497c5af752c0d9ec6be to your computer and use it in GitHub Desktop.
Python's Pickle
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pickle\n",
"\n",
"Welcome to `pickle`, the most unsecure data serialization format.\n",
"\n",
"Let's define a simple data structure."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"list = (1 . (2 . (3 . None)))\n"
]
}
],
"source": [
"class Cons:\n",
" \"\"\"\n",
" A pair of values. `car` is the first value, `cdr` is the second\n",
" value.\n",
" \"\"\"\n",
" def __init__(self, car, cdr):\n",
" self.car = car\n",
" self.cdr = cdr\n",
"\n",
" def __str__(self):\n",
" return f\"({self.car} . {self.cdr})\"\n",
"\n",
"list = Cons(1, Cons(2, Cons(3, None)))\n",
"\n",
"print(f\"list = {list}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we import `loads()` and `dumps()` from the `pickle` module. Let's\n",
"try using `dumps()` on the object we created earlier."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"data = b'\\x80\\x04\\x95L\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x8c\\x08__main__\\x94\\x8c\\x04Cons\\x94\\x93\\x94)\\x81\\x94}\\x94(\\x8c\\x03car\\x94K\\x01\\x8c\\x03cdr\\x94h\\x02)\\x81\\x94}\\x94(h\\x05K\\x02h\\x06h\\x02)\\x81\\x94}\\x94(h\\x05K\\x03h\\x06Nububub.'\n",
"type(data) = <class 'bytes'>\n",
"len(data) = 87 bytes\n"
]
}
],
"source": [
"from pickle import loads, dumps\n",
"\n",
"data = dumps(list)\n",
"\n",
"print(f\"data = {data}\")\n",
"print(f\"type(data) = {type(data)}\")\n",
"print(f\"len(data) = {len(data)} bytes\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now deserialize `data` by passing it to `loads()`."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(1 . (2 . (3 . None)))\n"
]
}
],
"source": [
"loaded_list = loads(data)\n",
"\n",
"print(loaded_list)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cooler stuff\n",
"\n",
"`pickle` can do much more than simply packing objects into bytes\n",
"and back. In this section we will explore its possibilities.\n",
"\n",
"For now of particular interest is the `pickletools` sibling module."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 0: \\x80 PROTO 4\n",
" 2: \\x95 FRAME 76\n",
" 11: \\x8c SHORT_BINUNICODE '__main__'\n",
" 21: \\x94 MEMOIZE (as 0)\n",
" 22: \\x8c SHORT_BINUNICODE 'Cons'\n",
" 28: \\x94 MEMOIZE (as 1)\n",
" 29: \\x93 STACK_GLOBAL\n",
" 30: \\x94 MEMOIZE (as 2)\n",
" 31: ) EMPTY_TUPLE\n",
" 32: \\x81 NEWOBJ\n",
" 33: \\x94 MEMOIZE (as 3)\n",
" 34: } EMPTY_DICT\n",
" 35: \\x94 MEMOIZE (as 4)\n",
" 36: ( MARK\n",
" 37: \\x8c SHORT_BINUNICODE 'car'\n",
" 42: \\x94 MEMOIZE (as 5)\n",
" 43: K BININT1 1\n",
" 45: \\x8c SHORT_BINUNICODE 'cdr'\n",
" 50: \\x94 MEMOIZE (as 6)\n",
" 51: h BINGET 2\n",
" 53: ) EMPTY_TUPLE\n",
" 54: \\x81 NEWOBJ\n",
" 55: \\x94 MEMOIZE (as 7)\n",
" 56: } EMPTY_DICT\n",
" 57: \\x94 MEMOIZE (as 8)\n",
" 58: ( MARK\n",
" 59: h BINGET 5\n",
" 61: K BININT1 2\n",
" 63: h BINGET 6\n",
" 65: h BINGET 2\n",
" 67: ) EMPTY_TUPLE\n",
" 68: \\x81 NEWOBJ\n",
" 69: \\x94 MEMOIZE (as 9)\n",
" 70: } EMPTY_DICT\n",
" 71: \\x94 MEMOIZE (as 10)\n",
" 72: ( MARK\n",
" 73: h BINGET 5\n",
" 75: K BININT1 3\n",
" 77: h BINGET 6\n",
" 79: N NONE\n",
" 80: u SETITEMS (MARK at 72)\n",
" 81: b BUILD\n",
" 82: u SETITEMS (MARK at 58)\n",
" 83: b BUILD\n",
" 84: u SETITEMS (MARK at 36)\n",
" 85: b BUILD\n",
" 86: . STOP\n",
"highest protocol among opcodes = 4\n"
]
}
],
"source": [
"import pickletools\n",
"\n",
"pickletools.dis(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`pickletools.dis()` prints the internal structure of a pickle object.\n",
"This allows us to get an idea of how our data is stored on a very low\n",
"level.\n",
"\n",
"We can even programmatically iterate over a series of opcodes using\n",
"`pickletools.genops()`. Let's utilize this feature to see all opcodes\n",
"used in our data and their description."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'PROTO': 'Protocol version indicator.\\n\\n For protocol 2 and above, a pickle must start with this opcode.\\n The argument is the protocol version, an int in range(2, 256).\\n ',\n",
" 'FRAME': 'Indicate the beginning of a new frame.\\n\\n The unpickler may use this opcode to safely prefetch data from its\\n underlying stream.\\n ',\n",
" 'SHORT_BINUNICODE': 'Push a Python Unicode string object.\\n\\n There are two arguments: the first is a 1-byte little-endian signed int\\n giving the number of bytes in the string. The second is that many\\n bytes, and is the UTF-8 encoding of the Unicode string.\\n ',\n",
" 'MEMOIZE': 'Store the stack top into the memo. The stack is not popped.\\n\\n The index of the memo location to write is the number of\\n elements currently present in the memo.\\n ',\n",
" 'STACK_GLOBAL': 'Push a global object (module.attr) on the stack.\\n ',\n",
" 'EMPTY_TUPLE': 'Push an empty tuple.',\n",
" 'NEWOBJ': 'Build an object instance.\\n\\n The stack before should be thought of as containing a class\\n object followed by an argument tuple (the tuple being the stack\\n top). Call these cls and args. They are popped off the stack,\\n and the value returned by cls.__new__(cls, *args) is pushed back\\n onto the stack.\\n ',\n",
" 'EMPTY_DICT': 'Push an empty dict.',\n",
" 'MARK': 'Push markobject onto the stack.\\n\\n markobject is a unique object, used by other opcodes to identify a\\n region of the stack containing a variable number of objects for them\\n to work on. See markobject.doc for more detail.\\n ',\n",
" 'BININT1': 'Push a one-byte unsigned integer.\\n\\n This is a space optimization for pickling very small non-negative ints,\\n in range(256).\\n ',\n",
" 'BINGET': 'Read an object from the memo and push it on the stack.\\n\\n The index of the memo object to push is given by the 1-byte unsigned\\n integer following.\\n ',\n",
" 'NONE': 'Push None on the stack.',\n",
" 'SETITEMS': 'Add an arbitrary number of key+value pairs to an existing dict.\\n\\n The slice of the stack following the topmost markobject is taken as\\n an alternating sequence of keys and values, added to the dict\\n immediately under the topmost markobject. Everything at and after the\\n topmost markobject is popped, leaving the mutated dict at the top\\n of the stack.\\n\\n Stack before: ... pydict markobject key_1 value_1 ... key_n value_n\\n Stack after: ... pydict\\n\\n where pydict has been modified via pydict[key_i] = value_i for i in\\n 1, 2, ..., n, and in that order.\\n ',\n",
" 'BUILD': 'Finish building an object, via __setstate__ or dict update.\\n\\n Stack before: ... anyobject argument\\n Stack after: ... anyobject\\n\\n where anyobject may have been mutated, as follows:\\n\\n If the object has a __setstate__ method,\\n\\n anyobject.__setstate__(argument)\\n\\n is called.\\n\\n Else the argument must be a dict, the object must have a __dict__, and\\n the object is updated via\\n\\n anyobject.__dict__.update(argument)\\n ',\n",
" 'STOP': \"Stop the unpickling machine.\\n\\n Every pickle ends with this opcode. The object at the top of the stack\\n is popped, and that's the result of unpickling. The stack should be\\n empty then.\\n \"}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"opcodes = {}\n",
"\n",
"for (opcode, arg, pos) in pickletools.genops(data):\n",
" opcodes[opcode.name] = opcode.doc\n",
"\n",
"opcodes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So pickle isn't just convenient, it's wondrously transparent,\n",
"interactive, and self-documenting.\n",
"\n",
"The only downside of piclke is that every other serialization system\n",
"from now on will seem bland compared to pickle."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment