Last active
July 1, 2022 20:57
larger principles of CPython debugging
> I'm able to follow most of your instructions in the tutorial. However,
> there are things that are not yet obvious to me. If, for instance, I
> want to find the implementation of a specific built-in, where do I go?
> The `dis()` decompilation of `id()`, which I want to study, isn't
> instructive. Using `find ... -print0 | xargs -0 grep ...` also doesn't
> seem to get me anything useful. But I don't even see "builtins" file or
> a section of http://docs.python.org/3.3/c-api/index.html dealing with
> built-ins at all. Where should I be looking; how should I generalize
> this type of search for the future?
>>> def f(x):
...     return id(x)
...
>>> from dis import dis
>>> dis(f)
  2           0 LOAD_GLOBAL              0 (id)
              3 LOAD_FAST                0 (x)
              6 CALL_FUNCTION            1
              9 RETURN_VALUE
`id` is just a function in the global scope. Where does it come from?
>>> from inspect import getmodule
>>> getmodule(id)
<module '__builtin__' (built-in)>
`__builtin__` is one of a few special modules in Python. You won't find
it in the directory of modules written in C (Modules/) or in the
directory of modules written in Python (Lib/).
(Another special module is `sys`.)
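As a quick cross-check from the interpreter, the sketch below asks Python itself whether the module that owns `id` is compiled into the binary. It assumes CPython; the module is called `__builtin__` on Python 2 and `builtins` on Python 3, and the variable names are only illustrative.

```python
import inspect
import sys
import types

# The module that owns `id`: '__builtin__' on Python 2, 'builtins' on Python 3.
module = inspect.getmodule(id)
module_name = module.__name__

# Modules compiled into the interpreter binary are listed in
# sys.builtin_module_names and carry no __file__ attribute.
is_compiled_in = module_name in sys.builtin_module_names
has_source_file = hasattr(module, "__file__")

# Functions implemented in C have this type.
is_c_function = isinstance(id, types.BuiltinFunctionType)

print(module_name, is_compiled_in, has_source_file, is_c_function)
```

The same check applied to `sys` should also report a compiled-in module, which matches the remark that `sys` is special in the same way.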
In general, a module written in C explicitly exposes the names you can
access from Python. Therefore, the literal string "id" (with quotes)
should appear wherever this name is bound to the underlying C function
that implements it, or wherever another piece of C code wants to find
this code by name (i.e., via some standardised lookup mechanism -
there's a lot more to be said on this point; think on it.)
$ grep '"id"' Python/bltinmodule.c
{"id", builtin_id, METH_O, id_doc},
The `__builtin__` and `sys` modules are in Python/bltinmodule.c and
Python/sysmodule.c, respectively.
The above code is part of a lookup table connecting the exported symbol
`"id"` to a corresponding C function (`builtin_id`) with a specification
for what kind of method it is (in terms of what kinds of parameters it
takes) and with an associated piece of documentation.
This is `builtin_id`. It's common for the exported name ("id") to map to
an internal implementation name ("builtin_id") by just prefixing the
former with the name of the module.
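To make the shape of such a table row concrete, here is a small sketch that pulls the four fields out of the line `grep` found. The regular expression and the `parse_row` helper are hypothetical, written only for this illustration; rows that cast the function pointer (for example with `(PyCFunction)`) would need a looser pattern.

```python
import re

# One row of the method table, copied from the grep output above.
ROW = '{"id", builtin_id, METH_O, id_doc},'

# Layout: {"exported name", c_function, flags, docstring_variable}
ROW_PATTERN = re.compile(
    r'\{\s*"(?P<name>[^"]+)"\s*,'      # exported Python name
    r'\s*(?P<func>\w+)\s*,'            # C function implementing it
    r'\s*(?P<flags>[\w| ]+?)\s*,'      # calling-convention flags
    r'\s*(?P<doc>\w+)\s*\}'            # docstring variable
)

def parse_row(line):
    """Return the four fields of a method-table row, or None if no match."""
    match = ROW_PATTERN.search(line)
    return match.groupdict() if match else None

entry = parse_row(ROW)
print(entry)
```

Note that `entry["func"]` is `entry["name"]` with a `builtin_` prefix, which is the naming convention described above.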
static PyObject *
builtin_id(PyObject *self, PyObject *v)
{
    return PyLong_FromVoidPtr(v);
}
PyDoc_STRVAR(id_doc,
"id(object) -> integer\n\
\n\
Return the identity of an object. This is guaranteed to be unique among\n\
simultaneously existing objects. (Hint: it's the object's memory address.)");
`id` just converts the PyObject pointer argument (`v`) to a Python
integer. Therefore `id(x)` gives the value of the C pointer to `x`,
that is, the memory location of `x`'s `PyObject`.
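One way to convince ourselves of this from Python is to turn the integer back into an object reference. This is a sketch that relies on a CPython implementation detail, using `ctypes` to reinterpret the address; it is only safe while the object is still alive.

```python
import ctypes

obj = ["any", "object"]
address = id(obj)

# Reinterpret the integer as a PyObject* and hand it back to Python.
# Only safe while `obj` is still alive; a stale address can crash the process.
recovered = ctypes.cast(address, ctypes.py_object).value

print(hex(address), recovered is obj)
```

If `id` returned anything other than the object's address, `recovered` would not be the same object as `obj`.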
Another way to determine this is as follows:
>>> def f(x):
...     return id(x)
...
>>> from dis import dis
>>> dis(f)
  2           0 LOAD_GLOBAL              0 (id)
              3 LOAD_FAST                0 (x)
              6 CALL_FUNCTION            1
              9 RETURN_VALUE
We're calling a function.
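The same observation can be made programmatically. The sketch below assumes Python 3.4 or later, where `dis.get_instructions` exists; opcode names differ between versions (for instance, newer interpreters use `CALL` rather than `CALL_FUNCTION`), so it only looks for a name beginning with `CALL`.

```python
import dis

def f(x):
    return id(x)

# Collect the opcode names instead of reading the printed listing.
opnames = [instruction.opname for instruction in dis.get_instructions(f)]

# The exact call opcode depends on the interpreter version.
call_ops = [name for name in opnames if name.startswith("CALL")]

print(opnames)
print(call_ops)
```

Whatever name shows up in `call_ops` is the one to search for in Python/ceval.c of the matching source tree.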
$ grep -A15 -n 'case CALL_FUNCTION:' Python/ceval.c
2658: case CALL_FUNCTION:
2659- {
2660- PyObject **sp;
2661- PCALL(PCALL_ALL);
2662- sp = stack_pointer;
2663-#ifdef WITH_TSC
2664- x = call_function(&sp, oparg, &intr0, &intr1);
2665-#else
2666- x = call_function(&sp, oparg);
2667-#endif
2668- stack_pointer = sp;
2669- PUSH(x);
2670- if (x != NULL)
2671- continue;
2672- break;
2673- }
$ gdb --args $PWD-build/bin/python
(gdb) run
>>> ^C
(gdb) break ceval.c:2658
(gdb) continue
>>> id(10)
This breakpoint will be hit on ANY function call, and it'll catch on
some function call that's part of displaying and processing the
interactive console. This breakpoint will probably NOT trigger on `id`
first. It may trigger on something like `utf_8_decode`, and if you follow
execution of it, you'll see how a C function usually executes. You can
also do this by tracing the code by hand.
Follow `CALL_FUNCTION` into `call_function`, then probably into
`PyCFunction_Call`, then into `(*meth)(self, arg)`, which calls the C
function itself.
Of course, we want to trace `id`, not `utf_8_decode`!
We could keep tracing every entrance of this code-path, but this is
going to exhaust us. The breakpoint will be hit a lot of times before we
see `id`.
We need to put in a better breakpoint, and this requires a bit more
knowledge about what a C function is.
In Python, you can get the name of a C function just like you can get it
for a regular Python function.
>>> id.__name__
'id'
This means we should be able to access this name from the C code and put
in a CONDITIONAL breakpoint to break only on this name.
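The idea of filtering on the function's name can be tried out at the Python level before touching gdb. This is only an analogue, not the gdb breakpoint itself: `sys.setprofile` reports a `c_call` event for every C function call, and the hypothetical `profiler` below ignores everything except `id`.

```python
import sys

hits = []

def profiler(frame, event, arg):
    # For 'c_call' events, `arg` is the C function object about to run.
    if event == "c_call" and getattr(arg, "__name__", None) == "id":
        hits.append(frame.f_code.co_name)

def workload():
    len("abc")   # a C function we do not care about
    id(10)       # the call we want to catch

sys.setprofile(profiler)
try:
    workload()
finally:
    sys.setprofile(None)

print(hits)
```

Without the name test, the profiler would fire on every C call, much like the unconditional breakpoint on `CALL_FUNCTION`.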
Look at the code for the first step in processing the `CALL_FUNCTION`
opcode, `call_function`. The first branch says:
if (PyCFunction_Check(func) && nk == 0) {
Let's look up that macro. In `vim`, we can just press ^] to jump to the
tag. This requires that we have our ctags set up correctly.
Include/methodobject.h:16
#define PyCFunction_Check(op) (Py_TYPE(op) == &PyCFunction_Type)
So a C function is an object whose type is `PyCFunction_Type`. We can ^]
on that tag to get to:
Objects/methodobject.c:281
PyTypeObject PyCFunction_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "builtin_function_or_method",
    sizeof(PyCFunctionObject),
    0,
    (destructor)meth_dealloc,                   /* tp_dealloc */
    0,                                          /* tp_print */
    0,                                          /* tp_getattr */
    0,                                          /* tp_setattr */
    (cmpfunc)meth_compare,                      /* tp_compare */
    (reprfunc)meth_repr,                        /* tp_repr */
    0,                                          /* tp_as_number */
    0,                                          /* tp_as_sequence */
    0,                                          /* tp_as_mapping */
    (hashfunc)meth_hash,                        /* tp_hash */
    PyCFunction_Call,                           /* tp_call */
    0,                                          /* tp_str */
    PyObject_GenericGetAttr,                    /* tp_getattro */
    0,                                          /* tp_setattro */
    0,                                          /* tp_as_buffer */
    Py_TPFLAGS_DEFAULT | Py_TPFLAGS_HAVE_GC,    /* tp_flags */
    0,                                          /* tp_doc */
    (traverseproc)meth_traverse,                /* tp_traverse */
    0,                                          /* tp_clear */
    meth_richcompare,                           /* tp_richcompare */
    0,                                          /* tp_weaklistoffset */
    0,                                          /* tp_iter */
    0,                                          /* tp_iternext */
    0,                                          /* tp_methods */
    meth_members,                               /* tp_members */
    meth_getsets,                               /* tp_getset */
    0,                                          /* tp_base */
    0,                                          /* tp_dict */
};
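The type name string in this struct is visible from Python, which gives a cheap way to confirm we are looking at the right type. The `is_c_function` helper below is a hypothetical Python-level analogue of `PyCFunction_Check`, using the same exact-type comparison.

```python
import types

def is_c_function(obj):
    # Python-level analogue of PyCFunction_Check: an exact type comparison.
    return type(obj) is types.BuiltinFunctionType

def py_function():
    pass

# The string stored in the type struct is what Python reports as the type name.
print(type(id).__name__)
print(is_c_function(id), is_c_function(py_function))
```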
In this file, we can search for `"__name__"`. This file should contain
all the implementation code for a C function.
Objects/methodobject.c:187
static PyGetSetDef meth_getsets [] = {
    {"__doc__", (getter)meth_get__doc__, NULL, NULL},
    {"__name__", (getter)meth_get__name__, NULL, NULL},
    {"__self__", (getter)meth_get__self__, NULL, NULL},
    {0}
};
Okay, let's find `meth_get__name__`.
Objects/methodobject.c:157
static PyObject *
meth_get__name__(PyCFunctionObject *m, void *closure)
{
    return PyString_FromString(m->m_ml->ml_name);
}
This is what happens when we evaluate `id.__name__`, and this is how we
can determine the name of the `PyCFunctionObject` we're looking at.
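The getter table can also be observed from Python: entries from `tp_getset` show up in the type's `__dict__` as getset descriptors. The sketch below assumes CPython and calls the descriptor by hand, which is roughly what attribute lookup does for us.

```python
import types

c_function_type = type(id)

# Attributes listed in tp_getset appear in the type's __dict__ as
# getset descriptors.
name_descriptor = c_function_type.__dict__["__name__"]

# Calling the descriptor's getter by hand ends up in the C getter.
name = name_descriptor.__get__(id, c_function_type)

print(type(name_descriptor).__name__, name)
```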
Let's use that to better our breakpoint.
We'll put the breakpoint in `call_function` in the branch we know
corresponds to processing of C functions.
$ grep -A20 -n '^call_function(' Python/ceval.c
3980:call_function(PyObject ***pp_stack, int oparg
3981-#ifdef WITH_TSC
3982- , uint64* pintr0, uint64* pintr1
3983-#endif
3984- )
3985-{
3986- int na = oparg & 0xff;
3987- int nk = (oparg>>8) & 0xff;
3988- int n = na + 2 * nk;
3989- PyObject **pfunc = (*pp_stack) - n - 1;
3990- PyObject *func = *pfunc;
3991- PyObject *x, *w;
3992-
3993- /* Always dispatch PyCFunction first, because these are
3994- presumed to be the most frequent callable object.
3995- */
3996- if (PyCFunction_Check(func) && nk == 0) {
3997- int flags = PyCFunction_GET_FLAGS(func);
3998- PyThreadState *tstate = PyThreadState_GET();
3999-
4000- PCALL(PCALL_CFUNCTION);
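The first three lines of `call_function` unpack the opcode argument. Here is the same arithmetic as a small Python sketch, following the listing above; `decode_call_oparg` is a hypothetical helper name.

```python
def decode_call_oparg(oparg):
    """Split a CALL_FUNCTION oparg the way call_function does in the listing."""
    na = oparg & 0xFF            # number of positional arguments
    nk = (oparg >> 8) & 0xFF     # number of keyword arguments
    n = na + 2 * nk              # stack slots: one per positional, two per keyword
    return na, nk, n

# id(10) compiles to CALL_FUNCTION 1: one positional argument, no keywords.
print(decode_call_oparg(1))
# Two positional arguments and one keyword argument.
print(decode_call_oparg(0x0102))
```

For `id(10)` we get `nk == 0`, so the `PyCFunction_Check(func) && nk == 0` branch is the one taken.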
We'll put it on line 3996 and only trigger it if the function's name is
"id".
`func` is a PyObject pointer, so we need to cast it to a more specific
type before we can look into the structure.
((PyCFunctionObject*)func)->m_ml->ml_name
`ml_name` is already a C-string (a `char *`); that is why
`meth_get__name__` has to wrap it in `PyString_FromString` before handing
it to Python. We can therefore compare it using strcmp in C without any
conversion.
`strcmp` is a string comparison function in C that returns 0 on match.
0 == strcmp(((PyCFunctionObject*)func)->m_ml->ml_name, "id")
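The same pointer chase can be sketched from Python with `ctypes`, which shows that `ml_name` really is a plain C string. This is a sketch under loud assumptions: it mirrors only the leading fields of the structs, and it assumes a regular (non-debug, non-free-threaded) CPython build where the object header is just a reference count and a type pointer.

```python
import ctypes

class PyMethodDef(ctypes.Structure):
    _fields_ = [
        ("ml_name", ctypes.c_char_p),   # the exported name, a plain C string
        ("ml_meth", ctypes.c_void_p),   # pointer to the C implementation
        ("ml_flags", ctypes.c_int),     # METH_* calling-convention flags
        ("ml_doc", ctypes.c_char_p),    # docstring, may be NULL
    ]

class PyCFunctionObject(ctypes.Structure):
    # Assumed layout: standard object header followed by m_ml.
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
        ("m_ml", ctypes.POINTER(PyMethodDef)),
    ]

# Equivalent of ((PyCFunctionObject*)func)->m_ml->ml_name with func = id.
func = PyCFunctionObject.from_address(id(id))
ml_name = func.m_ml.contents.ml_name

print(ml_name)
```

`ctypes` hands the field back as a byte string rather than a Python string object, which is the Python-side view of a `char *`.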
$ gdb --args $PWD-build/bin/python
(gdb) run
>>> ^C
(gdb) break ceval.c:3996 if 0 == strcmp(((PyCFunctionObject*)func)->m_ml->ml_name, "id")
(gdb) continue
>>> id(10)
This breakpoint will only trigger on calls to `id`.
Okay, let's follow this as before.
(There is one additional set of branches in `call_function` once it has
been determined that we have a C function. The second level of branches
either calls the C function directly, if it is a function that takes no
arguments or takes a single object argument, or goes through
`PyCFunction_Call` otherwise. In the case of `id`, we follow the former
path. In the case of `utf_8_decode` or some other C function, we may
have followed the latter path.)
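As a rough model of that second level of branches, the sketch below decides which path a call takes from the method's flags and the number of positional arguments. The flag values are the ones defined in Include/methodobject.h; the `dispatch_path` function and its return strings are hypothetical, and the error branch reflects my reading of the 2.7 source rather than anything shown above.

```python
# Flag values from Include/methodobject.h.
METH_VARARGS = 0x0001
METH_KEYWORDS = 0x0002
METH_NOARGS = 0x0004
METH_O = 0x0008

def dispatch_path(flags, na):
    """Model the branch taken for a C function called without keyword args."""
    if flags & (METH_NOARGS | METH_O):
        if flags & METH_NOARGS and na == 0:
            return "direct: (*meth)(self, NULL)"
        if flags & METH_O and na == 1:
            return "direct: (*meth)(self, arg)"
        return "error: wrong number of arguments"
    return "via PyCFunction_Call"

# `id` is registered with METH_O and called with one argument.
print(dispatch_path(METH_O, 1))
# A function registered with METH_VARARGS goes the long way round.
print(dispatch_path(METH_VARARGS, 2))
```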
From `call_function`, we step directly into `(*meth)(self, arg)`, which
calls the C function itself, and we end up in the corresponding C code
for `id`.
builtin_id (self=0x0, v=0x8529e8) at Python/bltinmodule.c:918
Let's see how we got there.
(gdb) bt
#0 builtin_id (self=0x0, v=0x8529e8) at Python/bltinmodule.c:918
#1 0x00000000004d671e in call_function (pp_stack=0x7fffffffdb90,
    oparg=1) at Python/ceval.c:4009
#2 0x00000000004d172b in PyEval_EvalFrameEx (f=0xa9c800, throwflag=0)
    at Python/ceval.c:2666
#3 0x00000000004d4119 in PyEval_EvalCodeEx (co=0x9da0f0,
    globals=0x8b83d0, locals=0x8b83d0, args=0x0, argcount=0, kws=0x0,
    kwcount=0,
    defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:3253
#4 0x00000000004ca18a in PyEval_EvalCode (co=0x9da0f0,
    globals=0x8b83d0, locals=0x8b83d0) at Python/ceval.c:667
#5 0x0000000000506807 in run_mod (mod=0xa7ab88, filename=0x58e42a
    "<stdin>", globals=0x8b83d0, locals=0x8b83d0, flags=0x7fffffffe060,
    arena=0x9a52a0) at Python/pythonrun.c:1363
#6 0x0000000000504b1e in PyRun_InteractiveOneFlags (fp=0x7ffff74b4360
    <_IO_2_1_stdin_>, filename=0x58e42a "<stdin>",
    flags=0x7fffffffe060) at Python/pythonrun.c:850
#7 0x0000000000504788 in PyRun_InteractiveLoopFlags (fp=0x7ffff74b4360
    <_IO_2_1_stdin_>, filename=0x58e42a "<stdin>",
    flags=0x7fffffffe060) at Python/pythonrun.c:770
#8 0x00000000005045cb in PyRun_AnyFileExFlags (fp=0x7ffff74b4360
    <_IO_2_1_stdin_>, filename=0x58e42a "<stdin>", closeit=0,
    flags=0x7fffffffe060) at Python/pythonrun.c:739
#9 0x000000000041600a in Py_Main (argc=1, argv=0x7fffffffe278) at
    Modules/main.c:640
#10 0x0000000000414aac in main (argc=1, argv=0x7fffffffe278) at
    ./Modules/python.c:23
Via live inspection and debugging, we can answer many of these questions,
though the process is longer and more tedious. Of course, after doing
this a couple of times, we'll learn the lay of the land and a few
short-cuts.