@dutc
Last active July 1, 2022 20:57
larger principles of CPython debugging
> I'm able to follow most of your instructions in the tutorial. However,
> there are things that are not yet obvious to me. If, for instance, I
> want to find the implementation of a specific built-in, where do I go?
> The `dis()` decompilation of `id()`, which I want to study, isn't
> instructive. Using `find ... -print0 | xargs -0 grep ...` also doesn't
> seem to get me anything useful. But I don't even see "builtins" file or
> a section of http://docs.python.org/3.3/c-api/index.html dealing with
> built-ins at all. Where should I be looking; how should I generalize
> this type of search for the future?
>>> def f(x):
...     return id(x)
...
>>> from dis import dis
>>> dis(f)
  2           0 LOAD_GLOBAL              0 (id)
              3 LOAD_FAST                0 (x)
              6 CALL_FUNCTION            1
              9 RETURN_VALUE
`id` is just a function in the global scope. Where does it come from?
>>> from inspect import getmodule
>>> getmodule(id)
<module '__builtin__' (built-in)>
`__builtin__` is one of a few special modules in Python. You won't find
this in the directory of modules written in C (Modules/) or in the
directory of modules written in Python (Lib/).
(Another special module is `sys`.)
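The same facts can be checked from Python itself. This is a small sketch, not anything taken from the CPython sources. It assumes CPython, and the `try`/`except` import is only there because the module is named `__builtin__` on Python 2 and `builtins` on Python 3.

```python
import sys
from inspect import getmodule

try:
    import builtins                    # Python 3 name
except ImportError:
    import __builtin__ as builtins     # Python 2 name

def f(x):
    return id(x)

# LOAD_GLOBAL looks names up by string; 'id' is the only global name f uses.
global_names = f.__code__.co_names

# The module that owns id is compiled into the interpreter itself,
# so it is listed in sys.builtin_module_names and has no __file__.
owner = getmodule(id)
is_compiled_in = owner.__name__ in sys.builtin_module_names
has_file = hasattr(owner, "__file__")

print(global_names, owner.__name__, is_compiled_in, has_file)
```

The missing `__file__` is a quick hint that there is no file under Modules/ or Lib/ to look for.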
In general, a module written in C explicitly exposes the names you can
access from Python. Therefore, the literal string "id" (with quotes)
should appear wherever this name is bound to the underlying C function
that implements it, or wherever another piece of C code wants to find
this function by name (i.e., via some standardised lookup mechanism;
there's a lot more to be said on this point, so think on it).
$ grep '"id"' Python/bltinmodule.c
{"id", builtin_id, METH_O, id_doc},
The `__builtin__` and `sys` modules live in Python/bltinmodule.c and
Python/sysmodule.c, respectively.
The above code is part of a lookup table connecting the exported symbol
`"id"` to a corresponding C-function (`builtin_id`) with a specification
for what kind of method it is (in terms of what kinds of parameters it
takes) and with an associated piece of documentation.
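As a rough mental model, the lookup table can be pictured in Python as a list of rows searched by name. Everything here is a hypothetical stand-in: `MethodDef`, `builtin_methods`, and `find_method` are illustrative names, the Python `builtin_id` merely defers to `id`, and only the field names and the `METH_O` value are meant to mirror the C header.

```python
from collections import namedtuple

# Hypothetical Python stand-in for one row of a C method table.
MethodDef = namedtuple("MethodDef", ["ml_name", "ml_meth", "ml_flags", "ml_doc"])

METH_O = 0x0008  # flag meaning "takes exactly one object argument"

def builtin_id(self, v):
    # Stand-in for the C implementation; here we simply defer to id.
    return id(v)

# Illustrative table with a single row, shaped like the grep hit above.
builtin_methods = [
    MethodDef("id", builtin_id, METH_O, "id(object) -> integer"),
]

def find_method(table, name):
    """Return the row whose exported name matches, or None."""
    for row in table:
        if row.ml_name == name:
            return row
    return None

row = find_method(builtin_methods, "id")
print(row.ml_meth.__name__, hex(row.ml_flags))
```

This is why grepping for the quoted string works: the exported name only exists as a string in a row like this one.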
This is `builtin_id`. It's common for the exported name ("id") to map to
an internal implementation name ("builtin_id") by just prefixing the
former with the name of the module.
static PyObject *
builtin_id(PyObject *self, PyObject *v)
{
    return PyLong_FromVoidPtr(v);
}
PyDoc_STRVAR(id_doc,
"id(object) -> integer\n\
\n\
Return the identity of an object. This is guaranteed to be unique among\n\
simultaneously existing objects. (Hint: it's the object's memory address.)");
`id` just converts the PyObject pointer argument (`v`) to a Python
integer. Therefore `id(x)` gives the value of the C pointer behind `x`:
the memory location of `x`'s `PyObject`.
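One way to convince yourself of this without reading C is to go the other way: treat the integer as an address and ask for the object stored there. The sketch below assumes CPython (other implementations need not use addresses for `id`) and uses `ctypes` purely for illustration. Doing this with an arbitrary integer can crash the interpreter.

```python
import ctypes

x = object()
address = id(x)

# Reinterpret the integer as a PyObject* and fetch the object it points to.
recovered = ctypes.cast(address, ctypes.py_object).value

print(hex(address), recovered is x)
```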
Another way to determine this is as follows:
>>> def f(x):
...     return id(x)
...
>>> from dis import dis
>>> dis(f)
  2           0 LOAD_GLOBAL              0 (id)
              3 LOAD_FAST                0 (x)
              6 CALL_FUNCTION            1
              9 RETURN_VALUE
We're calling a function.
$ grep -A15 -n 'case CALL_FUNCTION:' Python/ceval.c
2658: case CALL_FUNCTION:
2659- {
2660- PyObject **sp;
2661- PCALL(PCALL_ALL);
2662- sp = stack_pointer;
2663-#ifdef WITH_TSC
2664- x = call_function(&sp, oparg, &intr0, &intr1);
2665-#else
2666- x = call_function(&sp, oparg);
2667-#endif
2668- stack_pointer = sp;
2669- PUSH(x);
2670- if (x != NULL)
2671- continue;
2672- break;
2673- }
$ gdb --args $PWD-build/bin/python
(gdb) run
>>> ^C
(gdb) break ceval.c:2658
(gdb) continue
>>> id(10)
This breakpoint will be hit on ANY function call, and it'll catch on
some function call that's part of displaying and processing the
interactive console. This breakpoint will probably NOT trigger on id first.
It may trigger on something like `utf_8_decode`, and if you follow
execution of it, you'll see how a C function usually executes. You can
also do this by tracing the code by hand.
Follow `CALL_FUNCTION` into `call_function`, then probably into
`PyCFunction_Call`, then into `(*meth)(self, arg)` which calls the C
function itself.
Of course, we want to trace `id` not `utf_8_decode`!
We could keep tracing every entrance of this code-path, but this is
going to exhaust us. The breakpoint will be hit a lot of times before we
see `id`.
We need to put in a better breakpoint, and this requires a bit more
knowledge about what a C function is.
In Python, you can get the name of a C function just like you can get it
for a regular Python function.
>>> id.__name__
'id'
This means we should be able to access this name from the C code and put
in a CONDITIONAL breakpoint to break only on this name.
Look at the code for the first step in processing the `CALL_FUNCTION`
opcode, call_function. The first branch says:
if (PyCFunction_Check(func) && nk == 0) {
Let's look up that macro. In `vim`, we can just ^] to jump to the tag.
This requires we have our ctags set up correctly.
Include/methodobject.h:16
#define PyCFunction_Check(op) (Py_TYPE(op) == &PyCFunction_Type)
So a C function is an object whose type is `PyCFunction_Type`. We can ^]
on that tag to get to:
Objects/methodobject.c:281
PyTypeObject PyCFunction_Type = {
PyVarObject_HEAD_INIT(&PyType_Type, 0)
"builtin_function_or_method",
sizeof(PyCFunctionObject),
0,
(destructor)meth_dealloc, /* tp_dealloc */
0, /* tp_print */
0, /* tp_getattr */
0, /* tp_setattr */
(cmpfunc)meth_compare, /* tp_compare */
(reprfunc)meth_repr, /* tp_repr */
0, /* tp_as_number */
0, /* tp_as_sequence */
0, /* tp_as_mapping */
(hashfunc)meth_hash, /* tp_hash */
PyCFunction_Call, /* tp_call */
0, /* tp_str */
PyObject_GenericGetAttr, /* tp_getattro */
0, /* tp_setattro */
0, /* tp_as_buffer */
Py_TPFLAGS_DEFAULT | Py_TPFLAGS_HAVE_GC,/* tp_flags */
0, /* tp_doc */
(traverseproc)meth_traverse, /* tp_traverse */
0, /* tp_clear */
meth_richcompare, /* tp_richcompare */
0, /* tp_weaklistoffset */
0, /* tp_iter */
0, /* tp_iternext */
0, /* tp_methods */
meth_members, /* tp_members */
meth_getsets, /* tp_getset */
0, /* tp_base */
0, /* tp_dict */
};
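The type name string in this struct is visible from Python, which gives a quick cross-check that we are looking at the right type object. A minimal sketch, assuming CPython; `is_c_function` is a hypothetical helper that imitates the exact type comparison done by `PyCFunction_Check`.

```python
import types

def py_func(x):
    return x

def is_c_function(obj):
    # Imitates PyCFunction_Check: an exact type comparison, not isinstance.
    return type(obj) is type(id)

c_function_type = type(id)

print(c_function_type.__name__)                      # the tp_name string
print(c_function_type is types.BuiltinFunctionType)  # the stdlib alias for it
print(is_c_function(len), is_c_function(py_func))
```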
In this file, we can search for `"__name__"`. This file should contain
all the implementation code for a C function.
Objects/methodobject.c:187
static PyGetSetDef meth_getsets [] = {
    {"__doc__", (getter)meth_get__doc__, NULL, NULL},
    {"__name__", (getter)meth_get__name__, NULL, NULL},
    {"__self__", (getter)meth_get__self__, NULL, NULL},
    {0}
};
Okay, let's find `meth_get__name__`.
Objects/methodobject.c:157
static PyObject *
meth_get__name__(PyCFunctionObject *m, void *closure)
{
    return PyString_FromString(m->m_ml->ml_name);
}
This is what happens when we evaluate `id.__name__`, and this is how we
can determine the name of the `PyCFunctionObject` we're looking at.
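The getter table is also observable from Python: each row shows up as a descriptor in the type's `__dict__`. The sketch below assumes CPython and only demonstrates that `__name__` on a C function is computed by a getter and not stored as an ordinary attribute.

```python
c_function_type = type(id)

# The "__name__" row of meth_getsets surfaces as a descriptor on the type.
name_descriptor = c_function_type.__dict__["__name__"]

# Calling the descriptor by hand is what attribute access does for us.
looked_up_by_hand = name_descriptor.__get__(id)

print(type(name_descriptor).__name__, looked_up_by_hand)
```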
Let's use that to better our breakpoint.
We'll put the breakpoint in call_function in the branch we know
corresponds to processing of C functions.
$ grep -A20 -n '^call_function(' Python/ceval.c
3980:call_function(PyObject ***pp_stack, int oparg
3981-#ifdef WITH_TSC
3982- , uint64* pintr0, uint64* pintr1
3983-#endif
3984- )
3985-{
3986- int na = oparg & 0xff;
3987- int nk = (oparg>>8) & 0xff;
3988- int n = na + 2 * nk;
3989- PyObject **pfunc = (*pp_stack) - n - 1;
3990- PyObject *func = *pfunc;
3991- PyObject *x, *w;
3992-
3993- /* Always dispatch PyCFunction first, because these are
3994- presumed to be the most frequent callable object.
3995- */
3996- if (PyCFunction_Check(func) && nk == 0) {
3997- int flags = PyCFunction_GET_FLAGS(func);
3998- PyThreadState *tstate = PyThreadState_GET();
3999-
4000- PCALL(PCALL_CFUNCTION);
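The first few lines of `call_function` are plain bit arithmetic, and they can be replayed in Python to see where `func` comes from. This is a sketch of the listing above only: `decode_call_oparg` is a hypothetical helper, and the list standing in for the value stack is an assumption made for illustration.

```python
def decode_call_oparg(oparg):
    """Split oparg the way the listing does: low byte and next byte."""
    na = oparg & 0xff            # number of positional arguments
    nk = (oparg >> 8) & 0xff     # number of keyword arguments
    n = na + 2 * nk              # stack slots used: each keyword is a name/value pair
    return na, nk, n

# Hypothetical value stack for the call id(10): the callable, then its argument.
stack = ["<id>", 10]

na, nk, n = decode_call_oparg(1)
func = stack[len(stack) - n - 1]  # mirrors (*pp_stack) - n - 1

print(na, nk, n, func)
```

With `oparg` equal to 1, as in the disassembly of `f`, there is one positional argument and the callable sits directly beneath it on the stack.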
We'll put it on line 3996 and only trigger it if the function's name is
"id".
`func` is a PyObject pointer, so we need to cast it to a more specific
type before we can look into the structure.
((PyCFunctionObject*)func)->m_ml->ml_name
`ml_name` is already a plain C string (a `char *`); it is
`meth_get__name__` that wraps it in a PyString via `PyString_FromString`.
That means we can hand it straight to strcmp in C, with no conversion
macro such as `PyString_AS_STRING`.
`strcmp` is a string comparison function in C that returns 0 on match.
0 == strcmp(((PyCFunctionObject*)func)->m_ml->ml_name, "id")
$ gdb --args $PWD-build/bin/python
(gdb) run
>>> ^C
(gdb) break ceval.c:3996 if 0 == strcmp(((PyCFunctionObject*)func)->m_ml->ml_name, "id")
(gdb) continue
>>> id(10)
This breakpoint will only trigger on calls to `id`.
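The same filter-by-name idea can be tried at the Python level, without gdb. The sketch below is an analogue, not a replacement: it assumes CPython's `sys.setprofile` hook, which reports a `c_call` event with the C function object as `arg`, and it records a hit only when that function's name is `id`. The names `profiler` and `hits` are illustrative.

```python
import sys

hits = []

def profiler(frame, event, arg):
    # For 'c_call' events, arg is the C function object about to be called.
    if event == "c_call" and getattr(arg, "__name__", None) == "id":
        hits.append(frame.f_code.co_name)

def f(x):
    return id(x)

sys.setprofile(profiler)
try:
    f(10)        # calls id from inside f: should be recorded
    len("abc")   # a different C function: should be ignored
finally:
    sys.setprofile(None)

print(hits)
```

Unlike the gdb breakpoint, this does not stop inside the C code; it only confirms when the call happens and from which Python frame.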
Okay, let's follow this as before.
(There is one additional set of branches in `call_function` once it has
been determined that we have a C function. The second level of branches
either calls the C function directly, if it is a function that takes no
arguments or takes a single object argument, or goes through
`PyCFunction_Call` otherwise. In the case of `id`, we follow the former
path. In the case of `utf_8_decode` or some other C function, we may
have followed the latter path.)
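That second level of branching can be summarised as a small decision function. This is a model written from the description above, not a transcription of ceval.c: `dispatch_path` is a hypothetical name, the flag values are the ones from Include/methodobject.h, and the "argument-count error" outcome is an assumption about what happens when the flags and the argument count disagree.

```python
METH_VARARGS = 0x0001
METH_KEYWORDS = 0x0002
METH_NOARGS = 0x0004
METH_O = 0x0008

def dispatch_path(flags, na):
    """Describe which path a C function call with na positional args takes."""
    if flags & (METH_NOARGS | METH_O):
        if (flags & METH_NOARGS) and na == 0:
            return "direct: (*meth)(self, NULL)"
        if (flags & METH_O) and na == 1:
            return "direct: (*meth)(self, arg)"
        return "argument-count error"   # assumed outcome, see lead-in
    return "via PyCFunction_Call"

print(dispatch_path(METH_O, 1))        # the id(10) case
print(dispatch_path(METH_VARARGS, 2))  # e.g. a function taking an args tuple
```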
From `call_function`, we step directly into `(*meth)(self, arg)`, which
calls the C function itself, and we end up in the corresponding C code
for `id`.
builtin_id (self=0x0, v=0x8529e8) at Python/bltinmodule.c:918
Let's see how we got there.
(gdb) bt
#0 builtin_id (self=0x0, v=0x8529e8) at Python/bltinmodule.c:918
#1 0x00000000004d671e in call_function (pp_stack=0x7fffffffdb90,
oparg=1) at Python/ceval.c:4009
#2 0x00000000004d172b in PyEval_EvalFrameEx (f=0xa9c800, throwflag=0)
at Python/ceval.c:2666
#3 0x00000000004d4119 in PyEval_EvalCodeEx (co=0x9da0f0,
globals=0x8b83d0, locals=0x8b83d0, args=0x0, argcount=0, kws=0x0,
kwcount=0,
defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:3253
#4 0x00000000004ca18a in PyEval_EvalCode (co=0x9da0f0,
globals=0x8b83d0, locals=0x8b83d0) at Python/ceval.c:667
#5 0x0000000000506807 in run_mod (mod=0xa7ab88, filename=0x58e42a
"<stdin>", globals=0x8b83d0, locals=0x8b83d0, flags=0x7fffffffe060,
arena=0x9a52a0) at Python/pythonrun.c:1363
#6 0x0000000000504b1e in PyRun_InteractiveOneFlags (fp=0x7ffff74b4360
<_IO_2_1_stdin_>, filename=0x58e42a "<stdin>",
flags=0x7fffffffe060) at Python/pythonrun.c:850
#7 0x0000000000504788 in PyRun_InteractiveLoopFlags (fp=0x7ffff74b4360
<_IO_2_1_stdin_>, filename=0x58e42a "<stdin>",
flags=0x7fffffffe060) at Python/pythonrun.c:770
#8 0x00000000005045cb in PyRun_AnyFileExFlags (fp=0x7ffff74b4360
<_IO_2_1_stdin_>, filename=0x58e42a "<stdin>", closeit=0,
flags=0x7fffffffe060) at Python/pythonrun.c:739
#9 0x000000000041600a in Py_Main (argc=1, argv=0x7fffffffe278) at
Modules/main.c:640
#10 0x0000000000414aac in main (argc=1, argv=0x7fffffffe278) at
./Modules/python.c:23
Via live inspection and debugging, we can answer many of these
questions, though the process is longer and more tedious than grepping
the source. Of course, after doing this a couple of times, we'll learn
the lay of the land and a few short-cuts.