- CPython for greater understanding of the Python programming language (but "reference implementations always overspecify")
- Reading source to solve problems
- getting involved, contributing to the project
This workshop will cover the basics of the CPython runtime and interpreter. There is an enormous amount of material to cover, and I'll try to rush through as much as I can.
This introduction will be rushed, so it may not be perfectly accurate.
CPython is the reference implementation of the Python programming language. The Python programming language exists separate from this implementation, and, in fact, alternate implementations of the language exist. Julian spoke about PyPy. There are also implementations built on top of the .NET runtime (IronPython), the JVM (Jython), etc.
CPython is still the dominant, most commonly used implementation of the Python language. (Whether this should or should not be the case is an independent question, one which we can see Julian feels passionately about.) CPython is probably what you are using when you type python at the command line.
CPython is written in the C programming language. It should compile without errors on a C89 or C99 compliant compiler.
If you want to learn more about the C programming language, stop by office hours. There are many C/C++/Java programmers who might be able to help you out. There are also on-line resources such as c.learncodethehardway.org/book. You may find the syntax of C to be not completely alien to you as a Python programmer. The CPython interpreter is written to be fairly straightforward to understand and to contribute to.
We are going to be using a couple of other tools, too: autoconf, gcc, gdb, and coreutils.
We recommend using our Ubuntu virtual machine (locally or remotely), since it removes a lot of the difficulty from this process.
These tools are extremely rich, and we could do a workshop on each of them individually. We're going to use them in this workshop, but going into depth on anything outside of basic gdb is out of our scope.
We're going to start this workshop by downloading the C source code for CPython 3.3.3. We're going to build it, install it, and run it under gdb.
# first, go to Python.org/download
# here, we see links to "Python 3.3.3 xzipped source tarball (for Linux, Unix or Mac OS X, better compression)"
lynx http://python.org/download/
# this is the source code for the CPython interpreter
# let's download & extract it
wget 'http://python.org/ftp/python/3.3.3/Python-3.3.3.tar.xz' # get the source code
tar xJvf Python-3.3.3.tar.xz # extract
# a quick, incomplete, and imprecise tour of the contents (paths are relative to the extracted Python-3.3.3/ directory)
ls Doc/ # reStructuredText documentation
less Grammar/Grammar # Python 'grammar' EBNF
ls Include/ # header files
ls Lib/ # the Python standard library, Python code, e.g., the source for collections.namedtuple
ls Modules/ # Python modules written in C, e.g., the source for math.pow
ls Objects/ # Python basic types
ls Parser/ # language parser
ls Python/ # interpreter
# the difficulty of building software from source is often getting all the dependent parts
# we can solve this on Ubuntu
sudo apt-get install build-essential # first, get all the tools necessary to build anything
sudo apt-get build-dep python3.3 # gives you all the other software Python depends on
# apt-get build-dep is a GREAT tool; it will install all the dependencies necessary for building a given package!
# Python uses autoconf; the standard procedure for building and installing software is ./configure; make; make install
cd Python-3.3.3
# configure
# the CFLAGS below give us some extra debugging information in `gdb`: very, very handy!
# e.g., we get the ability to view macro definitions
# --with-pydebug enables basic CPython interpreter debugging features (e.g., ref counts)
# --prefix=$PWD-build lets us install this locally (not system-wide)
CFLAGS="-g3 -ggdb -gdwarf-4" ./configure --with-pydebug --prefix=$PWD-build
# make -j9 for parallel make; make install to install into $PWD-build
# $PWD should be something like /home/nycpython/Python-3.3.3
# our Python will be installed into /home/nycpython/Python-3.3.3-build
make -j9
make install
# run Python, see if it works!
$PWD-build/bin/python3
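Once the freshly built interpreter starts, it can be handy to confirm that it really is the debug build we just configured. The following is a minimal sketch, not part of the original workshop; the helper name is_debug_build is made up for illustration. It relies on sys.gettotalrefcount, which CPython only provides in debug builds such as those configured with --with-pydebug.

```python
import platform
import sys


def is_debug_build():
    """Return True when running on a CPython debug build.

    sys.gettotalrefcount is only defined in debug builds, so its
    presence is used here as the indicator (hypothetical helper).
    """
    return hasattr(sys, "gettotalrefcount")


# Collect a few facts about the interpreter that is running this code.
info = {
    "implementation": platform.python_implementation(),
    "version": platform.python_version(),
    "prefix": sys.prefix,  # should point into $PWD-build for our local install
    "debug_build": is_debug_build(),
}

for key in sorted(info):
    print("{}: {}".format(key, info[key]))
```

If debug_build prints False, the interpreter on your path is probably not the one installed into $PWD-build.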
Let's try some sample code.
We have our neat swapping syntax in Python. Let's try to figure out what it does. Does it allocate a temporary variable behind the scenes?
def f(x, y):
    x, y = y, x
    return x, y

from dis import dis
dis(f)
We get:
  2           0 LOAD_FAST                1 (y)
              3 LOAD_FAST                0 (x)
              6 ROT_TWO
              7 STORE_FAST               0 (x)
             10 STORE_FAST               1 (y)

  3          13 LOAD_FAST                0 (x)
             16 LOAD_FAST                1 (y)
             19 BUILD_TUPLE              2
             22 RETURN_VALUE
- First column is the line number (formal args and def are line 1).
- Second column is the bytecode index. (Notice that many bytecodes are 3 bytes long.)
- Third column is the bytecode. (Notice ROT_TWO!)
- Fourth column is the argument to that bytecode.
- Fifth column is an interpretation of that argument.
CPython is a stack-based VM. Values are pushed onto and popped from the stack. In this example, we load two local variables, y and x, onto the stack. ROT_TWO swaps the two topmost entries. Then we store the top elements of the stack back to x and y. No temporary storage needed! Then we load x and y back up, make a tuple out of them, and return that.
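To make the push/pop discipline concrete, here is a toy stack machine that replays the nine instructions from the disassembly. It is only a sketch for illustration: the run function, the SWAP_PROGRAM list, and the use of variable names instead of numeric indices are all assumptions of this example, and the real loop in ceval.c is written in C and looks quite different.

```python
def run(instructions, local_vars):
    """Execute a tiny, illustrative subset of stack-machine instructions."""
    stack = []
    for opname, arg in instructions:
        if opname == "LOAD_FAST":
            stack.append(local_vars[arg])       # push the value of a local
        elif opname == "STORE_FAST":
            local_vars[arg] = stack.pop()       # pop the top entry into a local
        elif opname == "ROT_TWO":
            top = stack.pop()                   # exchange the two topmost entries
            second = stack.pop()
            stack.append(top)
            stack.append(second)
        elif opname == "BUILD_TUPLE":
            start = len(stack) - arg            # take the top `arg` entries...
            items = stack[start:]
            del stack[start:]
            stack.append(tuple(items))          # ...and push them back as one tuple
        elif opname == "RETURN_VALUE":
            return stack.pop()
        else:
            raise ValueError("unsupported instruction: {}".format(opname))
    raise RuntimeError("program ended without RETURN_VALUE")


# The same sequence that dis(f) printed, with names in place of indices.
SWAP_PROGRAM = [
    ("LOAD_FAST", "y"),
    ("LOAD_FAST", "x"),
    ("ROT_TWO", None),
    ("STORE_FAST", "x"),
    ("STORE_FAST", "y"),
    ("LOAD_FAST", "x"),
    ("LOAD_FAST", "y"),
    ("BUILD_TUPLE", 2),
    ("RETURN_VALUE", None),
]

result = run(SWAP_PROGRAM, {"x": 10, "y": 20})
print(result)  # (20, 10)
```

Tracing it by hand: after the two loads the stack holds [20, 10], ROT_TWO makes it [10, 20], and the two stores pop 20 into x and then 10 into y.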
dis.dis is a fantastic tool for figuring out what is going on behind the scenes. We will use it throughout this workshop.
I learnt my way around CPython using exactly this tool. I saw bytecodes, then I grepped for them. This led me to ceval.c:
find -iname '*.c' -print0 | xargs -0 grep -n ROT_TWO
We see:
./Python/compile.c:793: case ROT_TWO:
./Python/compile.c:2877: ADDOP(c, ROT_TWO);
./Python/compile.c:3393: ADDOP(c, ROT_TWO);
./Python/peephole.c:542: codestr[i] = ROT_TWO;
./Python/peephole.c:547: codestr[i+1] = ROT_TWO;
./Python/ceval.c:1389: TARGET(ROT_TWO)
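If find and xargs are not available (say, outside the virtual machine), roughly the same search can be done from Python. This is a small sketch, not something the workshop itself uses; grep_sources is a hypothetical helper name, and it does a plain substring match rather than a regular-expression search.

```python
import os


def grep_sources(root, needle, suffix=".c"):
    """Yield (path, line_number, line) for every line containing needle.

    Walks the tree under root and only looks at files whose names end
    with suffix (compared case-insensitively, like find -iname).
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.lower().endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the search going on odd encodings.
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if needle in line:
                        yield path, number, line.rstrip("\n")
```

From the top of the source tree, something like `for hit in grep_sources(".", "ROT_TWO"): print("{}:{}:{}".format(*hit))` should list the same locations as the grep command, though possibly in a different order.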
Let's look at ceval.c on line 1389.
Oh, my! This is the file that implements all the bytecodes! It's the interpreter loop! (If we look up, there's even a switch(opcode) up there... though the situation here is a bit more complicated, since CPython 3.3.3 is a "direct threaded interpreter" when built with a compiler that supports computed gotos.) Let's use gdb to put a breakpoint on that line and see what happens!
Run your Python under gdb:
gdb --args $PWD-build/bin/python3
Type... (note: ^C means hit control-c)
(gdb) source Tools/gdb/libpython.py # nice helpers!
(gdb) run # start the program
>>> def f(x, y):
...     x, y = y, x
...     return x, y
...
>>>
^C
(gdb) tbreak ceval.c:1389 # put a temporary (one-shot) breakpoint on that line
(gdb) continue
>>> f(10, 20) # evaluate our code
(gdb) list # show the source code around where we broke
(gdb) info macro TARGET # get information about a C macro
(gdb) info function PyObject_GetAttr # get information about a C function
(gdb) backtrace # see all the C function calls leading up to this point
(gdb) next # step to the next line of C-code (stepping over functions)
(gdb) step # step to the next line of C-code (stepping into functions)
(gdb) finish # run the current C function to completion
This is our first exercise. Let's start with some Python code.
class Foo(int):
    def __mul__(self, other):
        print('Foo.__mul__({}, {})'.format(self, other))

class Bar(int):
    def __mul__(self, other):
        print('Bar.__mul__({}, {})'.format(self, other))

We have custom objects with __mul__ implemented. Let's try them out!
foo, bar = Foo(10), Bar(20)
foo * bar
bar * foo
10 * foo
bar * 10
This makes sense. In some cases we see Foo.__mul__ being used; in others, Bar.__mul__.
Let's make this example a bit richer and implement __rmul__.
class Foo(int):
    def __mul__(self, other):
        print('Foo.__mul__({}, {})'.format(self, other))
    def __rmul__(self, other):
        print('Foo.__rmul__({}, {})'.format(self, other))

class Bar(int):
    def __mul__(self, other):
        print('Bar.__mul__({}, {})'.format(self, other))
    def __rmul__(self, other):
        print('Bar.__rmul__({}, {})'.format(self, other))
Try it out:
foo, bar = Foo(10), Bar(20)
foo * bar
bar * foo
10 * foo
bar * 10
This behaviour makes sense, too!
What about:
class Foo(int):
    def __mul__(self, other):
        print('Foo.__mul__({}, {})'.format(self, other))
    def __rmul__(self, other):
        print('Foo.__rmul__({}, {})'.format(self, other))

class Bar(Foo):
    def __mul__(self, other):
        print('Bar.__mul__({}, {})'.format(self, other))
    def __rmul__(self, other):
        print('Bar.__rmul__({}, {})'.format(self, other))

foo, bar = Foo(10), Bar(20)
foo * bar
bar * foo
10 * foo
bar * 10
That's interesting! Bar.__rmul__ seems to be preferred! Why?
Here is our standard procedure. Use dis to find where to put a breakpoint, then step through the code.
from dis import dis
def f(x, y):
    return x * y
dis(f)
[insert live coding stepping into PyNumber_Multiply]
Here's what the documentation says.
Note If the right operand’s type is a subclass of the left operand’s type and that subclass provides the reflected method for the operation, this method will be called before the left operand’s non-reflected method. This behavior allows subclasses to override their ancestors’ operations.
But in this last example, we were able to determine this behaviour conclusively by ourselves by just reading (& debugging) the source!
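The rule can also be checked without a debugger. The sketch below uses made-up class names (Base, Derived, Unrelated) and returns strings instead of printing, so the outcome of each multiplication can be compared directly. It is an illustration of the documented behaviour, not a replacement for stepping through PyNumber_Multiply.

```python
class Base(int):
    def __mul__(self, other):
        return "Base.__mul__"

    def __rmul__(self, other):
        return "Base.__rmul__"


class Derived(Base):
    def __mul__(self, other):
        return "Derived.__mul__"

    def __rmul__(self, other):
        return "Derived.__rmul__"


class Unrelated(int):
    def __mul__(self, other):
        return "Unrelated.__mul__"

    def __rmul__(self, other):
        return "Unrelated.__rmul__"


base, derived, unrelated = Base(10), Derived(20), Unrelated(30)

results = {
    # Right operand is a subclass of the left operand's type and
    # overrides __rmul__, so the reflected method is tried first.
    "base * derived": base * derived,
    # Left operand is the subclass: ordinary __mul__ on the left.
    "derived * base": derived * base,
    # Neither class derives from the other: left operand's __mul__.
    "base * unrelated": base * unrelated,
    # Derived is also a subclass of int, so its __rmul__ wins here too.
    "10 * derived": 10 * derived,
}

for expression in sorted(results):
    print("{:18} -> {}".format(expression, results[expression]))
```

Only the cases where the right operand's type is a subclass of the left operand's type go to the reflected method first.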
Here's our next example.
Run python3:
hash(100)
hash(10)
hash(2)
hash(1)
hash(-100)
hash(-10)
hash(-2)
hash(-1) # weird!!
But look at pypy:
hash(-1) # different answer than above
What is going on?! Why does hash(-1) == -2 only in CPython?!
Let's do the same thing. Use dis to find an entry point, then step through the code.
from dis import dis
def f():
    hash(-1)
dis(f)
Let's put a breakpoint on CALL_FUNCTION.
(gdb) tb ceval.c:2671
[live coding stepping through CALL_FUNCTION to builtin_hash in Python/bltinmodule.c; end up at Objects/longobject.c:2785:long_hash where we see this behaviour hard-coded]
if (x == (Py_uhash_t)-1)
    x = (Py_uhash_t)-2;
Side note: gdb is powerful enough to let us evaluate code live. This helps us step through code and inspect things.
(gdb) print PyCFunction_Check(func)
(gdb) print ((PyFunctionObject*)(((PyMethodObject*)(func))->im_func))->func_name
What about a custom object?
class Foo(object):
    def __hash__(self):
        return -1
foo = Foo()
hash(foo)
Even custom objects do this! We can never get a -1 return value from hash!
Add a breakpoint.
(gdb) tb builtin_hash
[live-coding; step through ending up at Objects/typeobject.c:5309:slot_tp_hash; slot_tp_hash is the slot wrapper for __hash__; also hard-codes this behaviour but gives us a hint]
/* -1 is reserved for errors. */
if (h == -1)
    h = -2;
-1 is a common convention for error values in C. If we look at PyObject_Hash, we can see this in use. This is why PyPy shows different behaviour (it's not written in C).
We see this in Python/bltinmodule.c:1238:builtin_hash:
if (x == -1)
    return NULL;
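Both observations can be summarised in a few lines of Python. This is a small sketch with a made-up class name (AlwaysMinusOne); the comparison is guarded by an implementation check because, as noted above, the remapping of -1 to -2 is a CPython detail and other implementations may answer differently.

```python
import platform


class AlwaysMinusOne(object):
    """Hypothetical class whose __hash__ tries to return the reserved value."""

    def __hash__(self):
        return -1


int_hash = hash(-1)                    # goes through long_hash in CPython
custom_hash = hash(AlwaysMinusOne())   # goes through slot_tp_hash in CPython
is_cpython = platform.python_implementation() == "CPython"

print("implementation:", platform.python_implementation())
print("hash(-1) =", int_hash)
print("hash(AlwaysMinusOne()) =", custom_hash)

if is_cpython:
    # -1 is reserved as the C-level error value, so both are remapped.
    print("both remapped to -2:", int_hash == -2 and custom_hash == -2)
```

A side effect worth noticing is that hash(-1) and hash(-2) are equal on CPython, which is harmless: equal hashes only mean the two values may collide in a dict or set.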
[discussion on difference between the contents of these files]
Objects/abstract.c vs Objects/object.c vs Objects/typeobject.c
First, Nick Coghlan wanted me to mention docs.python.org/devguide and pythonmentors.com
docs.python.org/devguide is the comprehensive guide to contributing to the CPython project. It contains every resource you need.
pythonmentors.com contains a lot of resources for mentorship around contributing to the CPython project.
Briefly: check out the #python-dev, #python-ideas, and #core-mentorship mailing lists. Check out bugs.python.org, too.
Okay, so we know about decorators in Python. The following decorator doesn't do anything interesting.
def dec(func):
    return func

@dec
def foo(x, y):
    return x * y

We know we can write dec as follows, too:
dec = lambda func: func
But notice we can't write:
@(lambda func:func)
def foo(x, y):
    return x * y
Why not? Guido made a "gut feeling" decision about this: http://mail.python.org/pipermail/python-dev/2004-August/046711.html
Let's put our knowledge to use to figure out how to lift this restriction.
First, let's look at Grammar/Grammar:22:
decorator: '@' dotted_name [ '(' [arglist] ')' ] NEWLINE
So, after the @, we can only have a dotted_name (the same thing we can have in from ... import ...): just a bunch of names with dots between them, like a.b.c.
We are allowed one function call at the very end. This means that f().g() is invalid as a decorator.
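A quick way to see which decorator forms a given interpreter accepts is to hand small snippets to compile() and watch for SyntaxError, much like the test suite does. The sketch below is an illustration only; decorator_compiles is a hypothetical helper. The two dotted-name forms should be accepted everywhere, while the result for the other two depends on the interpreter: our unpatched 3.3.3 build rejects them, and a build with a relaxed grammar accepts them.

```python
def decorator_compiles(expression):
    """Return True if '@<expression>' is accepted as a decorator.

    The snippet is only compiled, never executed, so the names used in
    the expression do not need to exist.
    """
    source = "@{}\ndef f(): pass\n".format(expression)
    try:
        compile(source, "<decorator-check>", "exec")
    except SyntaxError:
        return False
    return True


candidates = [
    "a.b.c",                # plain dotted_name
    "a.b.c(1, 2)",          # dotted_name with one call at the very end
    "f().g()",              # a call in the middle
    "(lambda func: func)",  # an arbitrary expression
]

report = {expr: decorator_compiles(expr) for expr in candidates}

for expr in candidates:
    print("{:22} {}".format(expr, "ok" if report[expr] else "SyntaxError"))
```

Running this before and after changing the grammar is a cheap sanity check that the change did what we intended.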
Let's switch this to a testlist, which is the grammar's term for arbitrary expressions (one or more, separated by commas).
diff -r 177e0254fdee Grammar/Grammar
--- a/Grammar/Grammar Tue Nov 19 11:06:44 2013 -0500
+++ b/Grammar/Grammar Tue Nov 19 15:15:33 2013 -0500
@@ -19,7 +19,7 @@
 file_input: (NEWLINE | stmt)* ENDMARKER
 eval_input: testlist NEWLINE* ENDMARKER
 
-decorator: '@' dotted_name [ '(' [arglist] ')' ] NEWLINE
+decorator: '@' testlist NEWLINE
 decorators: decorator+
 decorated: decorators (classdef | funcdef)
 funcdef: 'def' NAME parameters ['->' test] ':' suite
We need to make one more change to get this to work. In Python/ast.c we have some code that helps create the AST. Where we used to expect a dotted_name for a decorator, we now expect a testlist.
diff -r 177e0254fdee Python/ast.c
--- a/Python/ast.c Tue Nov 19 11:06:44 2013 -0500
+++ b/Python/ast.c Tue Nov 19 15:15:33 2013 -0500
@@ -1429,7 +1429,7 @@
     REQ(CHILD(n, 0), AT);
     REQ(RCHILD(n, -1), NEWLINE);
 
-    name_expr = ast_for_dotted_name(c, CHILD(n, 1));
+    name_expr = ast_for_testlist(c, CHILD(n, 1));
     if (!name_expr)
         return NULL;
Now, everything just works! Isn't that amazing! Fundamentally changing the language with just two changes to two files.
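The code in Python/ast.c turns the parse tree into AST nodes, and the result is visible from Python through the ast module. As a rough sketch (decorator_nodes is a hypothetical helper, not part of the patch), we can parse a decorated function and look at what ends up in its decorator_list. It only uses forms that the original grammar already allows.

```python
import ast


def decorator_nodes(source):
    """Return the AST node type names of the decorators on the first
    statement in source, which is assumed to be a decorated function."""
    module = ast.parse(source)
    function = module.body[0]
    return [type(node).__name__ for node in function.decorator_list]


name_only = decorator_nodes("@spam\ndef f(): pass\n")
dotted = decorator_nodes("@a.b.c\ndef f(): pass\n")
called = decorator_nodes("@a.b.c(1)\ndef f(): pass\n")

print(name_only)  # ['Name']
print(dotted)     # ['Attribute']
print(called)     # ['Call']
```

On a patched interpreter, the same helper can be pointed at forms such as @(lambda x: x) to see which expression node the new code path produces.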
But our job isn't over. If we want to submit a patch, we need to make sure all tests pass with make test or $PWD-build/bin/python3 -m test -j3.
We see two tests that fail: Lib/test/test_decorators.py and Lib/test/test_parser.py.
Let's fix those.
The first is easy. There is a test case that is no longer appropriate (it checks for invalid forms that we now consider valid).
Just remove the test.
diff -r 177e0254fdee Lib/test/test_decorators.py
--- a/Lib/test/test_decorators.py Tue Nov 19 11:06:44 2013 -0500
+++ b/Lib/test/test_decorators.py Tue Nov 19 15:15:33 2013 -0500
@@ -152,15 +152,6 @@
         self.assertEqual(counts['double'], 4)
 
     def test_errors(self):
-        # Test syntax restrictions - these are all compile-time errors:
-        #
-        for expr in [ "1+2", "x[3]", "(1, 2)" ]:
-            # Sanity check: is expr is a valid expression by itself?
-            compile(expr, "testexpr", "exec")
-
-            codestr = "@%s\ndef f(): pass" % expr
-            self.assertRaises(SyntaxError, compile, codestr, "test", "exec")
-
         # You can't put multiple decorators on a single line:
         #
         self.assertRaises(SyntaxError, compile,
The second is also easy. We have some code that validates syntax. Instead of expecting a dotted_name, we now expect a testlist:
diff -r 177e0254fdee Modules/parsermodule.c
--- a/Modules/parsermodule.c Tue Nov 19 11:06:44 2013 -0500
+++ b/Modules/parsermodule.c Tue Nov 19 15:15:33 2013 -0500
@@ -2541,7 +2541,7 @@
     ok = (validate_ntype(tree, decorator) &&
           (nch == 3 || nch == 5 || nch == 6) &&
           validate_at(CHILD(tree, 0)) &&
-          validate_dotted_name(CHILD(tree, 1)) &&
+          validate_testlist(CHILD(tree, 1)) &&
           validate_newline(RCHILD(tree, -1)));
 
     if (ok && nch != 3) {
All done! Rebuild and rerun the tests:
make && make install
$PWD-build/bin/python3 -m test -j3
Now all tests pass!
Let's submit this as a patch to bugs.python.org/
Here's my text:
Decorator syntax currently allows only a dotted_name after the @. As far as I can tell, this was a gut-feeling decision made by Guido. [1]
I spoke with Nick Coghlan at PyTexas about this, and he suggested that if someone did the work, there might be interest in revisiting this restriction.
The attached patch allows any testlist to follow the @.
The following are now valid:
@(lambda x:x)
def f():
    pass

@(spam if p else eggs)
def f():
    pass

@spam().ham().eggs()
def f():
    pass

[1] http://mail.python.org/pipermail/python-dev/2004-August/046711.html
Here's the issue I created: http://bugs.python.org/issue19660. Let's see if it gets accepted (or if it needs more work!)
Awesome!