About my endeavor to automatically retrieve the grammar of CPython .pyc bytecode files from Python's own sources (success) and to build a parser on it (failed).
The most recent version of this article is always on GitHub.
In 2011, as part of my diploma thesis, I came up with an assembly language for .pyc files and wrote a simple proof-of-concept assembler for it based on the ANTLR3 parser generator. Back then I stumbled upon the issue that CPython bytecode is not quite stable and differs from version to version. For instance, from Python 3.0 to 3.4 there are 28 incompatible bytecode versions (you may find all bytecode versions with their descriptions in Python34/Lib/importlib/_bootstrap.py).
So, if we aim to devise an assembler or disassembler not for one bytecode version but for many, we face the routine task of extracting these bytecode differences. This task practically begs for automation, which I decided to tackle; here is my report.
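Each bytecode version is tagged by a magic number at the start of the .pyc file, and importlib.util.MAGIC_NUMBER (Python 3.4+) exposes the one the running interpreter expects. A minimal sketch of a version check (the .pyc path is hypothetical):
import importlib.util

# Read the first four bytes of a .pyc file: its magic number (hypothetical path).
with open('__pycache__/spam.cpython-34.pyc', 'rb') as f:
    file_magic = f.read(4)

print('file magic:       ', file_magic)
print('interpreter magic:', importlib.util.MAGIC_NUMBER)
if file_magic != importlib.util.MAGIC_NUMBER:
    print('this .pyc was compiled for a different bytecode version')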
To handle many versions of bytecode in an automated way, I've decided to use the Python distribution's source files as cartridges for, e.g., a disassembler.
.pyc bytecode files are just marshaled Python objects with some headers, so .pyc files differ in their headers, the marshaling protocol, and the bytecodes used. As of now there are only 4 different marshaling protocols. The headers are written in Python34/Lib/importlib/_bootstrap.py by this function:
def _code_to_bytecode(code, mtime=0, source_size=0):
    data = bytearray(MAGIC_NUMBER)
    data.extend(_w_long(mtime))
    data.extend(_w_long(source_size))
    data.extend(marshal.dumps(code))
    return data
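For concreteness, here is a rough counterpart to this function that splits a .pyc back into its parts. It assumes the 12-byte header written above (Python 3.7+ inserts an extra 4-byte flags field), a hypothetical file path, and an interpreter whose marshal format matches the file, since marshal.loads only understands its own format:
import dis
import marshal
import struct

# Read a .pyc produced by a compatible interpreter (hypothetical path).
with open('__pycache__/spam.cpython-34.pyc', 'rb') as f:
    data = f.read()

magic = data[:4]                                    # bytecode version tag, ends with b'\r\n'
mtime, source_size = struct.unpack('<II', data[4:12])
code = marshal.loads(data[12:])                     # the rest is one marshaled code object

print('magic:', magic, 'mtime:', mtime, 'source size:', source_size)
dis.dis(code)                                       # disassemble it with the running interpreter's dis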
You may see that _code_to_bytecode calls marshal.dumps from the built-in marshal module, which is defined in Python3Sources\Python\marshal.c. marshal.c is the heart of the marshaling protocol, and the entire remaining structure of a .pyc file is defined there in the w_object(..) function.
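You can observe w_object's handiwork in the small by dumping a few simple values and looking at the leading type byte it writes. The letters below ('N' for None, 'i' for a 32-bit int, 's' for bytes) are the type codes from CPython 3.4's marshal.c; the high bit (0x80) may additionally mark an object as a shared reference in marshal version 3 and later:
import marshal

# Every marshaled object starts with a one-byte type code written by w_object.
for obj in (None, 42, b'spam'):
    dumped = marshal.dumps(obj)
    type_code = chr(dumped[0] & 0x7f)   # strip the shared-reference flag, if set
    print(repr(obj).ljust(7), '->', type_code, dumped.hex())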
There are several ways to automate, e.g., disassembling with marshal.c as a cartridge:
- Instrument all the important functions in marshal.c with code that conveys the structure of the .pyc file under consideration.
- Analyse marshal.c with some syntax analyser and retrieve a grammar from it to parse .pyc files.
For the first option we could implement instrumentation in the compiler with some compiler plugin that plants sentinels in the target functions. That sounds tough, so I've chosen the second approach and decided to implement my analyser with the ANTLR4 parser generator.
ANTLR is a tool which generates a lexer and a parser from an input grammar file (e.g., a C grammar to parse C files and get their parse trees).
marshal.c is written in C and CPP (the C preprocessor language). To sift out the preprocessing directives you may call cpp, but then you'll lose some useful semantic information about the bytecode. It's not hard to implement your own custom CPP lexer, though it takes time. Here is a part of my CppLexer grammar:
lexer grammar CppLexer;
...
DefineCommand
: '#' WS* 'define' WS+ -> mode(DEFINE)
;
OtherCommand
:
'#' WS* { _input.LA(1) != ' ' }? (~[d] ~[\n\r]*?)? NewLine -> skip
;
Text
: .+? -> channel(2)
;
mode DEFINE;
DefineIdentifier
: Identifier { definitions.add(getText()); }
;
...
DefineTerminator
: NewLine -> mode(DEFAULT_MODE)
;
Next, after we have gathered the important information from the directives, stripped them out, and done some other preprocessing, our input file is still not valid C but raw C with unexpanded macros, which I preserved for later use. To parse this RawC I've used a modified C grammar by Sam Harwell from the official ANTLR grammars repository.
After successfully parsing marshal.c we have its parse tree, in which we can seek out the functions responsible for writing bytecode. These functions are all prefixed with w_ and call each other. I've written a tool in Java that builds a simplified control flow graph of their calls, convertible to an ANTLR grammar. I've slightly modified the resulting grammar by hand to get the following .pyc grammar:
grammar PycBc;
pycFile
: w_long w_long w_object
;
w_pylong
: w_type w_long? w_long w_short* w_short+
;
w_ref
: w_byte w_long
;
w_short
: w_byte w_byte
;
w_short_pstring
: w_byte w_string
;
w_pstring
: w_size w_string
;
w_string
: w_byte+
;
w_long
: w_byte w_byte w_byte w_byte
;
w_object
: ( w_byte
| w_complex_object
)
;
w_complex_object
: ( w_pylong
| w_type w_long
| w_type w_string
| w_type w_byte w_string
| w_type w_string w_string
| w_type w_byte w_string w_byte w_string
| w_type w_short_pstring
| w_type w_pstring
| ( w_type w_byte
| w_type w_size
) w_object*
| w_type (w_object w_object)* w_object
| w_type w_size w_object*
| w_type w_long w_long w_long w_long w_long w_object w_object w_object w_object w_object w_object w_object w_object w_long w_object
| w_byte? w_type w_pstring
| w_type
)
;
w_more
: BYTE
;
w_type
: BYTE
;
w_byte
: BYTE
;
w_size
: BYTE BYTE BYTE BYTE
;
Unfortunately this grammar is not optimized, and if you try to parse a .pyc file with it the ANTLR-based parser just freezes. I've tried to optimize it, to no avail.
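As an illustration of the call-graph extraction described above (before the grammar), here is a rough, regex-based sketch in Python. It heavily simplifies the ANTLR-based Java tool: there is no real C parsing, so calls hidden behind preprocessor macros are missed, and the path to marshal.c is hypothetical:
import re
from collections import OrderedDict

# Match CPython-style definitions like "static void\nw_long(..." and any call "w_xxx(".
DEF_RE = re.compile(r'^static\s+\w+[\s*]+?(w_\w+)\s*\(', re.MULTILINE)
CALL_RE = re.compile(r'\b(w_\w+)\s*\(')

with open('Python-3.4.0/Python/marshal.c') as f:     # hypothetical path to the sources
    source = f.read()

graph = OrderedDict()
defs = list(DEF_RE.finditer(source))
for i, match in enumerate(defs):
    name = match.group(1)
    end = defs[i + 1].start() if i + 1 < len(defs) else len(source)
    body = source[match.end():end]                    # crude function body: up to the next w_* definition
    graph[name] = sorted(set(CALL_RE.findall(body)) - {name})

for caller, callees in graph.items():                 # print the simplified call graph
    print(caller, '->', ', '.join(callees) or '(leaf)')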
The alternative to grammar files is C code instrumentation, which gives me hope that the idea of a cartridge-driven disassembler is still feasible. I have little time for it though, so I'm publishing my intermediate results for anyone interested.
I got a bit carried away while writing this. Now I think that to support all .pyc files it's sufficient to implement the four major marshaling protocols, not every existing marshal.c version.
https://github.com/ilyaigpetrov/pyc-grammar