About my endeavor to automatically retrieve the grammar of CPython .pyc bytecode files from Python's own sources (success) and to build a parser on it (failed).
The most recent version of this article is always on GitHub.
In 2011, as part of my diploma thesis, I came up with an assembly language for .pyc files and wrote a simple proof-of-concept assembler for it based on the ANTLR3 parser generator. Back then I stumbled upon the issue that CPython bytecode is not quite stable and differs from version to version. For instance, from Python 3.0 to 3.4 there are 28 incompatible bytecode versions (you may find all bytecode versions with their descriptions in Python34/Lib/importlib/_bootstrap.py).
So, if we aim to devise an assembler or disassembler not for one bytecode version but for many, we face the routine task of extracting these bytecode differences. This task practically begs for automation, which I decided to tackle; here is my report.
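Each bytecode version is tagged by a magic number at the start of the .pyc file, and importlib.util.MAGIC_NUMBER (Python 3.4+) exposes the one the running interpreter expects. A minimal sketch of a version check (the .pyc path is hypothetical):
import importlib.util

# Read the first four bytes of a .pyc file: its magic number (hypothetical path).
with open('__pycache__/spam.cpython-34.pyc', 'rb') as f:
    file_magic = f.read(4)

print('file magic:       ', file_magic)
print('interpreter magic:', importlib.util.MAGIC_NUMBER)
if file_magic != importlib.util.MAGIC_NUMBER:
    print('this .pyc was compiled for a different bytecode version')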
To handle many versions of bytecode in an automated way, I've decided to use the Python distribution's source files as cartridges for, e.g., a disassembler.
.pyc bytecode files are just marshaled Python objects with some headers, so .pyc files differ in their headers, the marshaling protocol, and the bytecodes used. As of now there are only 4 different marshaling protocols. The headers are written in Python34/Lib/importlib/_bootstrap.py by this function:
def _code_to_bytecode(code, mtime=0, source_size=0):
    data = bytearray(MAGIC_NUMBER)
    data.extend(_w_long(mtime))
    data.extend(_w_long(source_size))
    data.extend(marshal.dumps(code))
    return data
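For concreteness, here is a rough counterpart to this function that splits a .pyc back into its parts. It assumes the 12-byte header written above (Python 3.7+ inserts an extra 4-byte flags field), a hypothetical file path, and an interpreter whose marshal format matches the file, since marshal.loads only understands its own format:
import dis
import marshal
import struct

# Read a .pyc produced by a compatible interpreter (hypothetical path).
with open('__pycache__/spam.cpython-34.pyc', 'rb') as f:
    data = f.read()

magic = data[:4]                                    # bytecode version tag, ends with b'\r\n'
mtime, source_size = struct.unpack('<II', data[4:12])
code = marshal.loads(data[12:])                     # the rest is one marshaled code object

print('magic:', magic, 'mtime:', mtime, 'source size:', source_size)
dis.dis(code)                                       # disassemble it with the running interpreter's dis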
You may see that _code_to_bytecode calls marshal.dumps from the built-in marshal module, which is defined in Python3Sources\Python\marshal.c. marshal.c is the heart of the marshaling protocol, and the entire remaining structure of a .pyc file is defined there in the w_object(..) function.
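You can observe w_object's handiwork in the small by dumping a few simple values and looking at the leading type byte it writes. The letters below ('N' for None, 'i' for a 32-bit int, 's' for bytes) are the type codes from CPython 3.4's marshal.c; the high bit (0x80) may additionally mark an object as a shared reference in marshal version 3 and later:
import marshal

# Every marshaled object starts with a one-byte type code written by w_object.
for obj in (None, 42, b'spam'):
    dumped = marshal.dumps(obj)
    type_code = chr(dumped[0] & 0x7f)   # strip the shared-reference flag, if set
    print(repr(obj).ljust(7), '->', type_code, dumped.hex())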
There are several ways to automate, e.g., disassembling with marshal.c as a cartridge:
- Instrument all the important functions in marshal.c with code that conveys the structure of the .pyc file under consideration.
- Analyse marshal.c with some syntax analyser and retrieve a grammar from it to parse .pyc files.
For the first option we could implement instrumentation in the compiler with some compiler plugin that plants sentinels in the target functions. That sounds tough, so I've chosen the second approach and decided to implement my analyser with the ANTLR4 parser generator.
ANTLR is a tool which generates a lexer and a parser from an input grammar file (e.g., a C grammar to parse C files and get their parse trees).
marshal.c is written in C and CPP (the C preprocessor language). To sift out the preprocessing directives you may call cpp, but then you'll lose some useful semantic information about the bytecode. It's not hard to implement your own custom CPP lexer, though it takes time. Here is a part of my CppLexer grammar:
lexer grammar CppLexer;
...
DefineCommand
: '#' WS* 'define' WS+ -> mode(DEFINE)
;
OtherCommand
:
'#' WS* { _input.LA(1) != ' ' }? (~[d] ~[\n\r]*?)? NewLine -> skip
;
Text
: .+? -> channel(2)
;
mode DEFINE;
DefineIdentifier
: Identifier { definitions.add(getText()); }
;
...
DefineTerminator
: NewLine -> mode(DEFAULT_MODE)
;
Next, after we have gathered the important information from the directives, stripped them out, and done some other preprocessing, our input file is still not valid C but raw C with unexpanded macros, which I preserved for later use. To parse this RawC I've used a modified C grammar by Sam Harwell from the official ANTLR grammars repository.
After successfully parsing marshal.c we have its parse tree, in which we can seek out the functions responsible for writing bytecode. These functions are all prefixed with w_ and call each other. I've written a tool in Java that builds a simplified control flow graph of their calls, convertible to an ANTLR grammar. I've slightly modified the resulting grammar by hand to get the following .pyc grammar:
grammar PycBc;
pycFile
: w_long w_long w_object
;
w_pylong
: w_type w_long? w_long w_short* w_short+
;
w_ref
: w_byte w_long
;
w_short
: w_byte w_byte
;
w_short_pstring
: w_byte w_string
;
w_pstring
: w_size w_string
;
w_string
: w_byte+
;
w_long
: w_byte w_byte w_byte w_byte
;
w_object
: ( w_byte
| w_complex_object
)
;
w_complex_object
: ( w_pylong
| w_type w_long
| w_type w_string
| w_type w_byte w_string
| w_type w_string w_string
| w_type w_byte w_string w_byte w_string
| w_type w_short_pstring
| w_type w_pstring
| ( w_type w_byte
| w_type w_size
) w_object*
| w_type (w_object w_object)* w_object
| w_type w_size w_object*
| w_type w_long w_long w_long w_long w_long w_object w_object w_object w_object w_object w_object w_object w_object w_long w_object
| w_byte? w_type w_pstring
| w_type
)
;
w_more
: BYTE
;
w_type
: BYTE
;
w_byte
: BYTE
;
w_size
: BYTE BYTE BYTE BYTE
;
Unfortunately this grammar is not optimized, and if you try to parse a .pyc file with it the ANTLR-based parser just freezes. I've tried to optimize it, to no avail.
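As an illustration of the call-graph extraction described above (before the grammar), here is a rough, regex-based sketch in Python. It heavily simplifies the ANTLR-based Java tool: there is no real C parsing, so calls hidden behind preprocessor macros are missed, and the path to marshal.c is hypothetical:
import re
from collections import OrderedDict

# Match CPython-style definitions like "static void\nw_long(..." and any call "w_xxx(".
DEF_RE = re.compile(r'^static\s+\w+[\s*]+?(w_\w+)\s*\(', re.MULTILINE)
CALL_RE = re.compile(r'\b(w_\w+)\s*\(')

with open('Python-3.4.0/Python/marshal.c') as f:     # hypothetical path to the sources
    source = f.read()

graph = OrderedDict()
defs = list(DEF_RE.finditer(source))
for i, match in enumerate(defs):
    name = match.group(1)
    end = defs[i + 1].start() if i + 1 < len(defs) else len(source)
    body = source[match.end():end]                    # crude function body: up to the next w_* definition
    graph[name] = sorted(set(CALL_RE.findall(body)) - {name})

for caller, callees in graph.items():                 # print the simplified call graph
    print(caller, '->', ', '.join(callees) or '(leaf)')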
The alternative to grammar files is C code instrumentation, which gives me hope that the idea of a cartridge-driven disassembler is still feasible. I have little time for it though, so I'm publishing my intermediate results for anyone interested.
I got a bit carried away while writing this. Now I think that to support all .pyc files it's sufficient to implement the four major marshaling protocols, not every existing marshal.c version.
https://github.com/ilyaigpetrov/pyc-grammar