Example Remote debug session

If you need to do C level debugging, then you need to compile the relevant files / subdirectories with -ggdb and -O0. Don't compile everything with -O0 because the resulting firmware image will be too large.

In this case I am debuggin the lua core so I have modified the app/lua/Makefile to set the debug defines and select the correct compiler options. I have also enabled break at startup on GPIO pin low. With the pin pulled low, the RTS traps into the gdb client, and I can connect to it using the xtensa remote debugger on the host.

What this debugging feature does is to call a common routine which

OK, I don't have programm access to UART0, but what I tend to do is you have my test scripts in a short set of lua files (possibly controlled by a telnet client that I image into a 32Kb or 64kb SPIFFS image at the 1Mb boundary. (I guess that I should occasionally walk this SPIFFS base up the flash address space to avoid exhausting these hot pages.)

Anyway after using ESPTOOL to flash the firware and the SPIFFS image, I can reboot the chip and enter the remote debugger which displays up the following after a few seconds:

terry@ellison8:~/nodemcu-firmware$ xtensa-lx106-elf-gdb
...
luaL_dbgbreak () at lauxlib.c:982
982	  asm("break 0,0" ::);
(gdb) bt
#0  luaV_execute (L=0x3fff00e8, nexeccalls=2) at lvm.c:460
#1  0x402496a5 in luaD_call (L=0x3fff00e8, func=0x3fff49c0, nResults=0) at ldo.c:393
#2  0x40245484 in lua_call (L=0x3fff00e8, nargs=1, nresults=0) at lapi.c:824
#3  0x40267072 in do_node_task (task_fn_ref=2, prio=1 '\001') at node.c:329
#4  0x4024206d in task_dispatch (e=<optimized out>) at task.c:32
#5  0x40000f49 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

I am now at the start gate. (Another option is not to enable this hook, but instead to fast forward to the bt of code that want to debug by including a gdbstub.brk() call at the relevant point in the Lua program.)

If I just continue, then I hit this exception:

(gdb) c
Continuing.

Program received signal SIGTRAP, Trace/breakpoint trap.
luaL_dbgbreak () at lauxlib.c:982
982	  asm("break 0,0" ::);
(gdb) bt
#0  luaL_dbgbreak () at lauxlib.c:982
#1  0x40243008 in index2adr (L=0x3fff00e8, idx=2) at lapi.c:50
#2  0x40243dbc in lua_tolstring (L=0x3fff00e8, idx=2, len=0x0) at lapi.c:347
#3  0x40246093 in luaL_checklstring (L=0x3fff00e8, narg=2, len=0x0) at lauxlib.c:339
#4  0x4025090b in luaR_findfunction (L=0x3fff00e8, ptable=0x4027084c <base_funcs_list>) at lrotable.c:54
#5  0x4025aa56 in luaB_index (L=0x3fff00e8) at lbaselib.c:505
#6  0x40249479 in luaD_precall (L=0x3fff00e8, func=0x3fff4a80, nresults=1) at ldo.c:335
#7  0x4024969c in luaD_call (L=0x3fff00e8, func=0x3fff4a80, nResults=1) at ldo.c:392
#8  0x4025639b in callTMres (L=0x3fff00e8, res=0x3fff4a10, f=0x3fff3b18, p1=0x3ffffed0, p2=0x3fff2860) at lvm.c:108
#9  0x40256788 in luaV_gettable (L=0x3fff00e8, t=0x3ffffed0, key=0x3fff2860, val=0x3fff4a10) at lvm.c:146
#10 0x40257d90 in luaV_execute (L=0x3fff00e8, nexeccalls=2) at lvm.c:491
#11 0x402496a5 in luaD_call (L=0x3fff00e8, func=0x3fff49c0, nResults=0) at ldo.c:393
#12 0x40245484 in lua_call (L=0x3fff00e8, nargs=1, nresults=0) at lapi.c:824
#13 0x40267072 in do_node_task (task_fn_ref=2, prio=1 '\001') at node.c:329
#14 0x4024206d in task_dispatch (e=<optimized out>) at task.c:32
#15 0x40000f49 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

The function index2adr() is double check that a stack index is valid but for runtime efficiency, many of its checks are optised out if lua_assert isn't enabled. So this error only occurs on builds with debugging enabled in the lua subdirectory.

On non-debug builds the VM carries blindly on and later manifests other secondary failures. But with debug enabled, the VM detects the error very soon after it occurs. It fails here because there is now a mismatch between the top and bottom markers (and the top is always above the base and is typically ~20 greater). Something here has gone fundamentally wrong.

(gdb) stacklen
-400  top: 0x3fff3470, base: 0x3fff40f0

So it is now a case of retry and binary chop from the opcode that went ape shit, and thanks to a bit of Terry magic (see .gdb_init in the NodeMCU root directory), I can tell where this is, and the good thing is that we only execute this VM opcode once.

BTW, having a luac.cross -l of the relevant module isessential here.

(gdb) where
<main.lua:47,74>,  opcode 10
(gdb) q
terry@ellison8:~/nodemcu-firmware$ xtensa-lx106-elf-gdb
...
luaL_dbgbreak () at lauxlib.c:982
982	  asm("break 0,0" ::);
(gdb)

You can't rerun a target session, so you need to reboot the ESP (which I've just done and then reenter the debugger from the known startpoint. Have look at my startline macro which does this and take me to

<main.lua:47,74>, opcode 10 [50] GETGLOBAL 3 -1 ; print

(In practice I'd replace this when working a different bug, or slide in a gdbstub.break() call).

(gdb) startline
Hardware assisted breakpoint 1 at 0x40268afc: file sjson.c, line 503.

Breakpoint 1, sjson_decode (L=0x3fff00e8) at sjson.c:503
503	static int sjson_decode(lua_State *L) {
Hardware assisted breakpoint 2 at 0x40257989: file lvm.c, line 460.

Breakpoint 2, luaV_execute (L=0x3fff00e8, nexeccalls=3) at lvm.c:460
460	    switch (GET_OPCODE(i)) {

Breakpoint 2, luaV_execute (L=0x3fff00e8, nexeccalls=2) at lvm.c:460
460	    switch (GET_OPCODE(i)) {
<main.lua:47,74>,  opcode 10
0x3fff0280:  $1 = {base = 0x3fff49f8, func = 0x3fff49f0, top = 0x3fff4a80, savedpc = 0x3fff262c, nresults = 0, tailcalls = 0}

Incidentally, it is probably worth my explaining where we are in the stack trace.

The GETGLOBAL does a gettable from _G, but on NodeMCU this includes a hook for missing entries to one of our routines luaB_index which does a luaR_findfunction to look it up in the base functions list. The first thing that this does is a luaL_checkstring() on the stacked entry (2="print"), and by now the call frames are corrupted causing the lua_assert failure.

So step and next only work within an individual function since the debugger uses break points to enter down stacks and these don't work in ROM. But we don't have any looping involved so I just use H/W breakpoints to step down this stack frame-by-frame on this rerun.

(gdb) luastack   #   another of my fancy internals macros to print the Lua call stack
 Ndx   top     base   func  (stack = 0x3fff49b8)
  0     21      1      0
  1     10      2      1
  2     25      8      7

Note that the slots for a Lua function are pre-computed by the compiler so this 21 / 10 / 25 isn't unusual.

The convention on C routines is somewhat different. The RTS will always ensure that LUA_MINSTACK slots (20) are free before a C function is called. In retrun, the C function must tell the RTS if it needs more by using a lua_checkstack() call. (and note that only the rtcmem and sqlite3 modules do this). What is also suspicous is that this crash occurs pretty much immediately after an sjson_decode call which makes heavy use of the Lua stack.

So now step down into a few more frames where we call lbaselib.c:505

Note that upto deletes and replaces the hardware brakepoint (HB) and continues. The ESP has one HB and normal breakpoints don't work in flash, so next and step only work if control does not leave the current module. So upto becomes your main tool in stepping through a program to diagnose errors. And you've got it wrong, and you don't hit the breakpoint, then its a Q of restarting the debug seesion and trying a smaller step. In the land of the blind, a one-eyed man is king.

(gdb) upto lbaselib.c:505  
Hardware assisted breakpoint 3 at 0x4025aa4e: file lbaselib.c, line 505.

Breakpoint 3, luaB_index (L=0x3fff00e8) at lbaselib.c:505
505	  if ((fres = luaR_findfunction(L, base_funcs_list)) != 0)
(gdb) bt 6
#0  luaB_index (L=0x3fff00e8) at lbaselib.c:505
#1  0x40249479 in luaD_precall (L=0x3fff00e8, func=0x3fff4a80, nresults=1) at ldo.c:335
#2  0x4024969c in luaD_call (L=0x3fff00e8, func=0x3fff4a80, nResults=1) at ldo.c:392
#3  0x4025639b in callTMres (L=0x3fff00e8, res=0x3fff4a10, f=0x3fff3b18, p1=0x3ffffed0, p2=0x3fff2860) at lvm.c:108
#4  0x40256788 in luaV_gettable (L=0x3fff00e8, t=0x3ffffed0, key=0x3fff2860, val=0x3fff4a10) at lvm.c:146
#5  0x40257d90 in luaV_execute (L=0x3fff00e8, nexeccalls=2) at lvm.c:491
(More stack frames follow...)
(gdb) luastack
 Ndx   top     base   func  (stack = 0x3fff4020)
  0   -401   -421   -422
  1   -412   -420   -421
  2   -397   -414   -415
  3   -374     26     25

The precall check ensures that there is enough stack for the C routine, and so has reallocated the stack but something has gone wrong with the call info relocation function. The stack has been relocated because it needs to grow, bu something has gone wrong with the recomputation of the Lua CallInfo blocks, and this is what is causing index2adr() to throw an exception.

(gdb) c
Continuing.

Program received signal SIGTRAP, Trace/breakpoint trap.
luaL_dbgbreak () at lauxlib.c:982
982	  asm("break 0,0" ::);
(gdb) fr 1
#1  0x40243008 in index2adr (L=0x3fff00e8, idx=2) at lapi.c:50
50	    api_check(L, idx <= L->ci->top - L->base);
(gdb) bt 6
#0  luaL_dbgbreak () at lauxl.c:982
#1  0x40243008 in index2adr (L=0x3fff00e8, idx=2) at lapi.c:50
#2  0x40243dbc in lua_tolstring (L=0x3fff00e8, idx=2, len=0x0) at lapi.c:347
#3  0x40246093 in luaL_checklstring (L=0x3fff00e8, narg=2, len=0x0) at lauxlib.c:339
#4  0x4025090b in luaR_findfunction (L=0x3fff00e8, ptable=0x4027084c <base_funcs_list>) at lrotable.c:54
#5  0x4025aa56 in luaB_index (L=0x3fff00e8) at lbaselib.c:505
(More stack frames follow...)
(gdb) luastack
 Ndx   top     base   func  (stack = 0x3fff4020)
  0   -401   -421   -422
  1   -412   -420   -421
  2   -397   -414   -415
  3   -374     26     25
(gdb) stacklen
-400  top: 0x3fff3470, base: 0x3fff40f0

So why has the stack reallocation routine cocked this up? More exploration is needed. Indicentally if I replace the sjson.decode call by the equivalent direct Lua assignment then the program runs solidly.

My current guess is that this could etiher be (i) sjson_decode() is overrunning the stack top and corrupting other variables or (ii) it has near the limit and the ldo.c function which checks the stack for calling the next C routine, luaB_index(), is executing a path that we've just not typically exercised and there is a bug in this code. Either way, doing a stack realloc prior to a C Routine is a pretty rare event, and it is going wrong in this case.

This needs more thought, but I note that all of the API routines which push values do a stack check, for example lua_pushinteger (lua_State *L, lua_Integer n)includes the code:

  setnvalue(L->top, cast_num(n));
  api_incr_top(L);

which generates the following raw C (when debugging is enabled):

  lua_Number i_x = (lua_Number)(n); 
  TValue *i_o=(L->top); i_o->value.n=i_x; i_o->tt=5;
  (L->top < L->ci->top ? (void) 0 : luaL_dbgbreak()); L->top++;

In otherwords, if a stack overrun in sjson_decode() occurs, for example, then this should raise a debug trap, and it doesn't so my suspicion is that we are facing (ii) above.

I also added another debug macro to do a stack check as a debug macro, and stepped opcode by opcode over the sjson_decode() call through to the GETGLOBAL instruction at main.lua:47,74, opcode 10. Everything is fine up to this point. But working down the call stack, we have a headroom of 1 stackend but we hit

static void callTMres (lua_State *L, StkId res, const TValue *f,
                        const TValue *p1, const TValue *p2) {
  ptrdiff_t result = savestack(L, res);
  setobj2s(L, L->top, f);  /* push function */
  setobj2s(L, L->top+1, p1);  /* 1st argument */
  setobj2s(L, L->top+2, p2);  /* 2nd argument */
  luaD_checkstack(L, 3);
  L->top += 3;
  luaD_call(L, L->top - 3, 1);
  res = restorestack(L, result);
  L->top--;
  setobjs2s(L, res, L->top);
}

The headroom goes to -2 and the stackcheck is done after adding the values to the stack Yuk! The check should be done before the pushes, but as I said the Lua functions set the stack head precisiely and L->ci->top is generally less than L->stack_last so in this case no error was caused. Even so, entering the luaD_call() to invoke luaB_index() we have Used: 19, Headroom: -2, Total: 17. Not good. In the case of C functions this passes control to luaD_precall() to do the work, which then calls luaB_index. And on entry the Lua stack frames are:

(gdb) luastack
 Ndx   top     base   func  (stack = 0x3fff0b00)
  0     21      1      0    Lua
  1     10      2      1    Lua
  2     25      8      7    Lua
  3     34     12     11    C   << Why is this base at 12?  This makes no sense to me. 
  4     37     15     14    C 
(gdb) stackcheck
Used: 1, Headroom: 21, Total: 22

After haf a dozen more drill-downs and stepping through the code, the window where it all goes ape shit is very small. sjson.decode() is a thin convenience wrapper around the full implementation which creates and then dereferences a decoder object before returning the results. However this decoder object hangs around as an orphan unit the next GC is executed.

As this plays out, control is returned from the Lua routine that execute the sjson.decode() to the caller whith then prints an informatory "I'm back" so the instruction immediately following the CALL is a GETGLOBAL 3 "print", and retrieving flash based routines such as print is done by a C helper routine luaB_index() (see the first backtrace). By now the stack is almost full and to call this with contracted 20 slots head room requires resizing the stack.

Whenever, this type of resize is done the GC is also run for luck and it is this sweep that the sjson.decoder_destructor() is called.

Another restarted session and upto sjson.decoder_destructor get me to this backtrace:

#0  sjson_decoder_destructor (L=0x3fff00e8) at sjson.c:524
#1  0x40249479 in luaD_precall (L=0x3fff00e8, func=0x3fff4a98, nresults=0) at ldo.c:335
#2  0x4024969c in luaD_call (L=0x3fff00e8, func=0x3fff4a98, nResults=0) at ldo.c:392
#3  0x4024bed5 in GCTM (L=0x3fff00e8) at lgc.c:495
#4  0x4024c3db in singlestep (L=0x3fff00e8) at lgc.c:626
#5  0x4024c64e in luaC_fullgc (L=0x3fff00e8) at lgc.c:703
#6  0x40246be2 in l_alloc (ud=0x3fff00e8, ptr=0x3fff49b8, osize=432, nsize=912) at lauxlib.c:940
#7  0x4024cdc9 in luaM_realloc_ (L=0x3fff00e8, block=0x3fff49b8, osize=432, nsize=912) at lmem.c:78
#8  0x40248ad8 in luaD_reallocstack (L=0x3fff00e8, newsize=108) at ldo.c:148
#9  0x40248bd5 in luaD_growstack (L=0x3fff00e8, n=20) at ldo.c:166
#10 0x402493a0 in luaD_precall (L=0x3fff00e8, func=0x3fff4a80, nresults=1) at ldo.c:322
#11 0x4024969c in luaD_call (L=0x3fff00e8, func=0x3fff4a80, nResults=1) at ldo.c:392
#12 0x4025639b in callTMres (L=0x3fff00e8, res=0x3fff4a10, f=0x3fff3b18, p1=0x3ffffed0, p2=0x3fff2860) at lvm.c:108
#13 0x40256788 in luaV_gettable (L=0x3fff00e8, t=0x3ffffed0, key=0x3fff2860, val=0x3fff4a10) at lvm.c:146
#14 0x40257d90 in luaV_execute (L=0x3fff00e8, nexeccalls=2) at lvm.c:491
#15 0x402496a5 in luaD_call (L=0x3fff00e8, func=0x3fff49c0, nResults=0) at ldo.c:393
#16 0x40245484 in lua_call (L=0x3fff00e8, nargs=1, nresults=0) at lapi.c:824
#17 0x4026709e in do_node_task (task_fn_ref=2, prio=1 '\001') at node.c:329
#18 0x4024206d in task_dispatch (e=<optimized out>) at task.c:32
#19 0x40000f49 in ?? ()

(gdb) luastack
stack = 0x3fff3c88, last: 108, size: 114, (OK)
 Ndx   top     base   func
  0     21      1      0
  1     10      2      1
  2     25      8      7
  3     50     29     28

and lua stack is healthy at this point. So comparing this with the first error traceback, it all goes wrong beween lines ldo.c:322 and 335 and another trial narrows this down to something in the callchain of line 322: luaD_checkstack(L, LUA_MINSTACK);

Examining the code along the call path, one thing that I notice is thatluaD_reallocstack() is not reentrant but that with eLua, this can trigger a full GC which can call C library destructors such as in this case sjson.decoder_destructor() which first requires the stack to be resized!! And setting a breakpoint at luaD_reallocstack and conituing through get me to this:

 (gdb) bt
#0  luaD_reallocstack (L=0x3fff00e8, newsize=108) at ldo.c:145
#1  0x40248bd5 in luaD_growstack (L=0x3fff00e8, n=20) at ldo.c:166
#2  0x402493a0 in luaD_precall (L=0x3fff00e8, func=0x3fff4a98, nresults=0) at ldo.c:322
#3  0x4024969c in luaD_call (L=0x3fff00e8, func=0x3fff4a98, nResults=0) at ldo.c:392
#4  0x4024bed5 in GCTM (L=0x3fff00e8) at lgc.c:495
#5  0x4024c3db in singlestep (L=0x3fff00e8) at lgc.c:626
#6  0x4024c64e in luaC_fullgc (L=0x3fff00e8) at lgc.c:703
#7  0x40246be2 in l_alloc (ud=0x3fff00e8, ptr=0x3fff49b8, osize=432, nsize=912) at lauxlib.c:940
#8  0x4024cdc9 in luaM_realloc_ (L=0x3fff00e8, block=0x3fff49b8, osize=432, nsize=912) at lmem.c:78
#9  0x40248ad8 in luaD_reallocstack (L=0x3fff00e8, newsize=108) at ldo.c:148
#10 0x40248bd5 in luaD_growstack (L=0x3fff00e8, n=20) at ldo.c:166
#11 0x402493a0 in luaD_precall (L=0x3fff00e8, func=0x3fff4a80, nresults=1) at ldo.c:322
#12 0x4024969c in luaD_call (L=0x3fff00e8, func=0x3fff4a80, nResults=1) at ldo.c:392
#13 0x4025639b in callTMres (L=0x3fff00e8, res=0x3fff4a10, f=0x3fff3b18, p1=0x3ffffed0, p2=0x3fff2860) at lvm.c:108
#14 0x40256788 in luaV_gettable (L=0x3fff00e8, t=0x3ffffed0, key=0x3fff2860, val=0x3fff4a10) at lvm.c:146
#15 0x40257d90 in luaV_execute (L=0x3fff00e8, nexeccalls=2) at lvm.c:491
#16 0x402496a5 in luaD_call (L=0x3fff00e8, func=0x3fff49c0, nResults=0) at ldo.c:393
#17 0x40245484 in lua_call (L=0x3fff00e8, nargs=1, nresults=0) at lapi.c:824
#18 0x4026709e in do_node_task (task_fn_ref=2, prio=1 '\001') at node.c:329
#19 0x4024206d in task_dispatch (e=<optimized out>) at task.c:32
#20 0x40000f49 in ?? ()

So now all we need to do is bracket this call in luaD_reallocstack to disable GC during the primitive operation. Another bug killed.

TerryE/debug_log.md

pjsg commented Mar 25, 2018

Uh oh!

TerryE commented Mar 26, 2018

Uh oh!