If you need to do C level debugging, then you need to compile the relevant files /
subdirectories with -ggdb and -O0
. Don't compile everything with -O0
because the
resulting firmware image will be too large.
In this case I am debuggin the lua core so I have modified the app/lua/Makefile
to
set the debug defines and select the correct compiler options. I have also enabled
break at startup on GPIO pin low. With the pin pulled low, the RTS traps into the gdb
client, and I can connect to it using the xtensa remote debugger on the host.
What this debugging feature does is to call a common routine which
OK, I don't have programm access to UART0, but what I tend to do is you have my test scripts in a short set of lua files (possibly controlled by a telnet client that I image into a 32Kb or 64kb SPIFFS image at the 1Mb boundary. (I guess that I should occasionally walk this SPIFFS base up the flash address space to avoid exhausting these hot pages.)
Anyway after using ESPTOOL to flash the firware and the SPIFFS image, I can reboot the chip and enter the remote debugger which displays up the following after a few seconds:
terry@ellison8:~/nodemcu-firmware$ xtensa-lx106-elf-gdb
...
luaL_dbgbreak () at lauxlib.c:982
982 asm("break 0,0" ::);
(gdb) bt
#0 luaV_execute (L=0x3fff00e8, nexeccalls=2) at lvm.c:460
#1 0x402496a5 in luaD_call (L=0x3fff00e8, func=0x3fff49c0, nResults=0) at ldo.c:393
#2 0x40245484 in lua_call (L=0x3fff00e8, nargs=1, nresults=0) at lapi.c:824
#3 0x40267072 in do_node_task (task_fn_ref=2, prio=1 '\001') at node.c:329
#4 0x4024206d in task_dispatch (e=<optimized out>) at task.c:32
#5 0x40000f49 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
I am now at the start gate. (Another option is not to enable this hook, but instead to fast forward to the bt of code that want to debug by including a gdbstub.brk()
call at the relevant point in the Lua program.)
If I just continue, then I hit this exception:
(gdb) c
Continuing.
Program received signal SIGTRAP, Trace/breakpoint trap.
luaL_dbgbreak () at lauxlib.c:982
982 asm("break 0,0" ::);
(gdb) bt
#0 luaL_dbgbreak () at lauxlib.c:982
#1 0x40243008 in index2adr (L=0x3fff00e8, idx=2) at lapi.c:50
#2 0x40243dbc in lua_tolstring (L=0x3fff00e8, idx=2, len=0x0) at lapi.c:347
#3 0x40246093 in luaL_checklstring (L=0x3fff00e8, narg=2, len=0x0) at lauxlib.c:339
#4 0x4025090b in luaR_findfunction (L=0x3fff00e8, ptable=0x4027084c <base_funcs_list>) at lrotable.c:54
#5 0x4025aa56 in luaB_index (L=0x3fff00e8) at lbaselib.c:505
#6 0x40249479 in luaD_precall (L=0x3fff00e8, func=0x3fff4a80, nresults=1) at ldo.c:335
#7 0x4024969c in luaD_call (L=0x3fff00e8, func=0x3fff4a80, nResults=1) at ldo.c:392
#8 0x4025639b in callTMres (L=0x3fff00e8, res=0x3fff4a10, f=0x3fff3b18, p1=0x3ffffed0, p2=0x3fff2860) at lvm.c:108
#9 0x40256788 in luaV_gettable (L=0x3fff00e8, t=0x3ffffed0, key=0x3fff2860, val=0x3fff4a10) at lvm.c:146
#10 0x40257d90 in luaV_execute (L=0x3fff00e8, nexeccalls=2) at lvm.c:491
#11 0x402496a5 in luaD_call (L=0x3fff00e8, func=0x3fff49c0, nResults=0) at ldo.c:393
#12 0x40245484 in lua_call (L=0x3fff00e8, nargs=1, nresults=0) at lapi.c:824
#13 0x40267072 in do_node_task (task_fn_ref=2, prio=1 '\001') at node.c:329
#14 0x4024206d in task_dispatch (e=<optimized out>) at task.c:32
#15 0x40000f49 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
The function index2adr()
is double check that a stack index is valid but for runtime
efficiency, many of its checks are optised out if lua_assert
isn't enabled. So this
error only occurs on builds with debugging enabled in the lua
subdirectory.
On non-debug builds the VM carries blindly on and later manifests other secondary failures. But with debug enabled, the VM detects the error very soon after it occurs. It fails here because there is now a mismatch between the top and bottom markers (and the top is always above the base and is typically ~20 greater). Something here has gone fundamentally wrong.
(gdb) stacklen
-400 top: 0x3fff3470, base: 0x3fff40f0
So it is now a case of retry and binary chop from the opcode that went ape shit, and
thanks to a bit of Terry magic (see .gdb_init
in the NodeMCU root directory), I can
tell where this is, and the good thing is that we only execute this VM opcode once.
BTW, having a luac.cross -l
of the relevant module isessential here.
(gdb) where
<main.lua:47,74>, opcode 10
(gdb) q
terry@ellison8:~/nodemcu-firmware$ xtensa-lx106-elf-gdb
...
luaL_dbgbreak () at lauxlib.c:982
982 asm("break 0,0" ::);
(gdb)
You can't rerun a target session, so you need to reboot the ESP (which I've just done and then reenter the debugger from the known startpoint. Have look at my startline macro which does this and take me to
<main.lua:47,74>, opcode 10 [50] GETGLOBAL 3 -1 ; print
(In practice I'd replace this when working a different bug, or slide in a gdbstub.break()
call).
(gdb) startline
Hardware assisted breakpoint 1 at 0x40268afc: file sjson.c, line 503.
Breakpoint 1, sjson_decode (L=0x3fff00e8) at sjson.c:503
503 static int sjson_decode(lua_State *L) {
Hardware assisted breakpoint 2 at 0x40257989: file lvm.c, line 460.
Breakpoint 2, luaV_execute (L=0x3fff00e8, nexeccalls=3) at lvm.c:460
460 switch (GET_OPCODE(i)) {
Breakpoint 2, luaV_execute (L=0x3fff00e8, nexeccalls=2) at lvm.c:460
460 switch (GET_OPCODE(i)) {
<main.lua:47,74>, opcode 10
0x3fff0280: $1 = {base = 0x3fff49f8, func = 0x3fff49f0, top = 0x3fff4a80, savedpc = 0x3fff262c, nresults = 0, tailcalls = 0}
Incidentally, it is probably worth my explaining where we are in the stack trace.
The GETGLOBAL does a gettable from _G, but on NodeMCU this includes a hook for missing entries to one of our routines luaB_index which does a luaR_findfunction to look it up in the base functions list. The first thing that this does is a luaL_checkstring() on the stacked entry (2="print"), and by now the call frames are corrupted causing the lua_assert failure.
So step and next only work within an individual function since the debugger uses break points to enter down stacks and these don't work in ROM. But we don't have any looping involved so I just use H/W breakpoints to step down this stack frame-by-frame on this rerun.
(gdb) luastack # another of my fancy internals macros to print the Lua call stack
Ndx top base func (stack = 0x3fff49b8)
0 21 1 0
1 10 2 1
2 25 8 7
Note that the slots for a Lua function are pre-computed by the compiler so this 21 / 10 / 25 isn't unusual.
The convention on C routines is somewhat different. The RTS will always ensure that
LUA_MINSTACK
slots (20) are free before a C function is called. In retrun, the C
function must tell the RTS if it needs more by using a lua_checkstack()
call. (and
note that only the rtcmem
and sqlite3
modules do this). What is also suspicous is that this crash occurs pretty much immediately after an sjson_decode
call which makes heavy use of the Lua stack.
So now step down into a few more frames where we call lbaselib.c:505
Note that upto
deletes and replaces the hardware brakepoint (HB) and continues. The
ESP has one HB and normal breakpoints don't work in flash, so next
and step
only
work if control does not leave the current module. So upto
becomes your main tool in
stepping through a program to diagnose errors. And you've got it wrong, and you don't
hit the breakpoint, then its a Q of restarting the debug seesion and trying a smaller
step. In the land of the blind, a one-eyed man is king.
(gdb) upto lbaselib.c:505
Hardware assisted breakpoint 3 at 0x4025aa4e: file lbaselib.c, line 505.
Breakpoint 3, luaB_index (L=0x3fff00e8) at lbaselib.c:505
505 if ((fres = luaR_findfunction(L, base_funcs_list)) != 0)
(gdb) bt 6
#0 luaB_index (L=0x3fff00e8) at lbaselib.c:505
#1 0x40249479 in luaD_precall (L=0x3fff00e8, func=0x3fff4a80, nresults=1) at ldo.c:335
#2 0x4024969c in luaD_call (L=0x3fff00e8, func=0x3fff4a80, nResults=1) at ldo.c:392
#3 0x4025639b in callTMres (L=0x3fff00e8, res=0x3fff4a10, f=0x3fff3b18, p1=0x3ffffed0, p2=0x3fff2860) at lvm.c:108
#4 0x40256788 in luaV_gettable (L=0x3fff00e8, t=0x3ffffed0, key=0x3fff2860, val=0x3fff4a10) at lvm.c:146
#5 0x40257d90 in luaV_execute (L=0x3fff00e8, nexeccalls=2) at lvm.c:491
(More stack frames follow...)
(gdb) luastack
Ndx top base func (stack = 0x3fff4020)
0 -401 -421 -422
1 -412 -420 -421
2 -397 -414 -415
3 -374 26 25
The precall check ensures that there is enough stack for the C routine,
and so has reallocated the stack but something has gone wrong with the
call info relocation function. The stack has been relocated because it needs to
grow, bu something has gone wrong with the recomputation of the Lua CallInfo
blocks, and this is what is causing index2adr()
to throw an exception.
(gdb) c
Continuing.
Program received signal SIGTRAP, Trace/breakpoint trap.
luaL_dbgbreak () at lauxlib.c:982
982 asm("break 0,0" ::);
(gdb) fr 1
#1 0x40243008 in index2adr (L=0x3fff00e8, idx=2) at lapi.c:50
50 api_check(L, idx <= L->ci->top - L->base);
(gdb) bt 6
#0 luaL_dbgbreak () at lauxl.c:982
#1 0x40243008 in index2adr (L=0x3fff00e8, idx=2) at lapi.c:50
#2 0x40243dbc in lua_tolstring (L=0x3fff00e8, idx=2, len=0x0) at lapi.c:347
#3 0x40246093 in luaL_checklstring (L=0x3fff00e8, narg=2, len=0x0) at lauxlib.c:339
#4 0x4025090b in luaR_findfunction (L=0x3fff00e8, ptable=0x4027084c <base_funcs_list>) at lrotable.c:54
#5 0x4025aa56 in luaB_index (L=0x3fff00e8) at lbaselib.c:505
(More stack frames follow...)
(gdb) luastack
Ndx top base func (stack = 0x3fff4020)
0 -401 -421 -422
1 -412 -420 -421
2 -397 -414 -415
3 -374 26 25
(gdb) stacklen
-400 top: 0x3fff3470, base: 0x3fff40f0
So why has the stack reallocation routine cocked this up? More exploration is needed.
Indicentally if I replace the sjson.decode
call by the equivalent direct Lua assignment then the program runs solidly.
My current guess is that this could etiher be (i) sjson_decode()
is overrunning the
stack top and corrupting other variables or (ii) it has near the limit and the
ldo.c
function which checks the stack for calling the next C routine, luaB_index()
,
is executing a path that we've just not typically exercised and there is a bug in this
code. Either way, doing a stack realloc prior to a C Routine is a pretty rare event,
and it is going wrong in this case.
This needs more thought, but I note that all of the API routines which push values do a
stack check, for example lua_pushinteger (lua_State *L, lua_Integer n)
includes the
code:
setnvalue(L->top, cast_num(n));
api_incr_top(L);
which generates the following raw C (when debugging is enabled):
lua_Number i_x = (lua_Number)(n);
TValue *i_o=(L->top); i_o->value.n=i_x; i_o->tt=5;
(L->top < L->ci->top ? (void) 0 : luaL_dbgbreak()); L->top++;
In otherwords, if a stack overrun in sjson_decode()
occurs, for example, then this
should raise a debug trap, and it doesn't so my suspicion is that we are facing (ii)
above.
I also added another debug macro to do a stack check as a debug macro, and stepped
opcode by opcode over the sjson_decode()
call through to the GETGLOBAL
instruction
at main.lua:47,74, opcode 10. Everything is fine up to this point. But working down the call stack, we have a headroom of 1 stackend but we hit
static void callTMres (lua_State *L, StkId res, const TValue *f,
const TValue *p1, const TValue *p2) {
ptrdiff_t result = savestack(L, res);
setobj2s(L, L->top, f); /* push function */
setobj2s(L, L->top+1, p1); /* 1st argument */
setobj2s(L, L->top+2, p2); /* 2nd argument */
luaD_checkstack(L, 3);
L->top += 3;
luaD_call(L, L->top - 3, 1);
res = restorestack(L, result);
L->top--;
setobjs2s(L, res, L->top);
}
The headroom goes to -2 and the stackcheck is done after adding the values to the
stack Yuk! The check should be done before the pushes, but as I said the Lua
functions set the stack head precisiely and L->ci->top
is generally less than
L->stack_last
so in this case no error was caused. Even so, entering the
luaD_call()
to invoke luaB_index()
we have Used: 19, Headroom: -2, Total: 17.
Not good. In the case of C functions this passes control to luaD_precall()
to do the
work, which then calls luaB_index
. And on entry the Lua stack frames are:
(gdb) luastack
Ndx top base func (stack = 0x3fff0b00)
0 21 1 0 Lua
1 10 2 1 Lua
2 25 8 7 Lua
3 34 12 11 C << Why is this base at 12? This makes no sense to me.
4 37 15 14 C
(gdb) stackcheck
Used: 1, Headroom: 21, Total: 22
After haf a dozen more drill-downs and stepping through the code, the window where it all goes ape shit is very small. sjson.decode()
is a thin convenience wrapper around the full implementation which creates and then dereferences a decoder object before returning the results. However this decoder object hangs around as an orphan unit the next GC is executed.
As this plays out, control is returned from the Lua routine that execute the sjson.decode()
to the caller whith then prints an informatory "I'm back" so the instruction immediately following the CALL
is a GETGLOBAL 3 "print"
, and retrieving flash based routines such as print
is done by a C helper routine luaB_index()
(see the first backtrace). By now the stack is almost full and to call this with contracted 20 slots head room requires resizing the stack.
Whenever, this type of resize is done the GC is also run for luck and it is this sweep that the sjson.decoder_destructor()
is called.
Another restarted session and upto sjson.decoder_destructor
get me to this backtrace:
#0 sjson_decoder_destructor (L=0x3fff00e8) at sjson.c:524
#1 0x40249479 in luaD_precall (L=0x3fff00e8, func=0x3fff4a98, nresults=0) at ldo.c:335
#2 0x4024969c in luaD_call (L=0x3fff00e8, func=0x3fff4a98, nResults=0) at ldo.c:392
#3 0x4024bed5 in GCTM (L=0x3fff00e8) at lgc.c:495
#4 0x4024c3db in singlestep (L=0x3fff00e8) at lgc.c:626
#5 0x4024c64e in luaC_fullgc (L=0x3fff00e8) at lgc.c:703
#6 0x40246be2 in l_alloc (ud=0x3fff00e8, ptr=0x3fff49b8, osize=432, nsize=912) at lauxlib.c:940
#7 0x4024cdc9 in luaM_realloc_ (L=0x3fff00e8, block=0x3fff49b8, osize=432, nsize=912) at lmem.c:78
#8 0x40248ad8 in luaD_reallocstack (L=0x3fff00e8, newsize=108) at ldo.c:148
#9 0x40248bd5 in luaD_growstack (L=0x3fff00e8, n=20) at ldo.c:166
#10 0x402493a0 in luaD_precall (L=0x3fff00e8, func=0x3fff4a80, nresults=1) at ldo.c:322
#11 0x4024969c in luaD_call (L=0x3fff00e8, func=0x3fff4a80, nResults=1) at ldo.c:392
#12 0x4025639b in callTMres (L=0x3fff00e8, res=0x3fff4a10, f=0x3fff3b18, p1=0x3ffffed0, p2=0x3fff2860) at lvm.c:108
#13 0x40256788 in luaV_gettable (L=0x3fff00e8, t=0x3ffffed0, key=0x3fff2860, val=0x3fff4a10) at lvm.c:146
#14 0x40257d90 in luaV_execute (L=0x3fff00e8, nexeccalls=2) at lvm.c:491
#15 0x402496a5 in luaD_call (L=0x3fff00e8, func=0x3fff49c0, nResults=0) at ldo.c:393
#16 0x40245484 in lua_call (L=0x3fff00e8, nargs=1, nresults=0) at lapi.c:824
#17 0x4026709e in do_node_task (task_fn_ref=2, prio=1 '\001') at node.c:329
#18 0x4024206d in task_dispatch (e=<optimized out>) at task.c:32
#19 0x40000f49 in ?? ()
(gdb) luastack
stack = 0x3fff3c88, last: 108, size: 114, (OK)
Ndx top base func
0 21 1 0
1 10 2 1
2 25 8 7
3 50 29 28
and lua stack is healthy at this point. So comparing this with the first error
traceback, it all goes wrong beween lines ldo.c
:322 and 335 and another trial
narrows this down to something in the callchain of line 322: luaD_checkstack(L, LUA_MINSTACK);
Examining the code along the call path, one thing that I notice is
thatluaD_reallocstack()
is not reentrant but that with eLua, this can trigger a full
GC which can call C library destructors such as in this case
sjson.decoder_destructor()
which first requires the stack to be resized!! And setting
a breakpoint at luaD_reallocstack
and conituing through get me to this:
(gdb) bt
#0 luaD_reallocstack (L=0x3fff00e8, newsize=108) at ldo.c:145
#1 0x40248bd5 in luaD_growstack (L=0x3fff00e8, n=20) at ldo.c:166
#2 0x402493a0 in luaD_precall (L=0x3fff00e8, func=0x3fff4a98, nresults=0) at ldo.c:322
#3 0x4024969c in luaD_call (L=0x3fff00e8, func=0x3fff4a98, nResults=0) at ldo.c:392
#4 0x4024bed5 in GCTM (L=0x3fff00e8) at lgc.c:495
#5 0x4024c3db in singlestep (L=0x3fff00e8) at lgc.c:626
#6 0x4024c64e in luaC_fullgc (L=0x3fff00e8) at lgc.c:703
#7 0x40246be2 in l_alloc (ud=0x3fff00e8, ptr=0x3fff49b8, osize=432, nsize=912) at lauxlib.c:940
#8 0x4024cdc9 in luaM_realloc_ (L=0x3fff00e8, block=0x3fff49b8, osize=432, nsize=912) at lmem.c:78
#9 0x40248ad8 in luaD_reallocstack (L=0x3fff00e8, newsize=108) at ldo.c:148
#10 0x40248bd5 in luaD_growstack (L=0x3fff00e8, n=20) at ldo.c:166
#11 0x402493a0 in luaD_precall (L=0x3fff00e8, func=0x3fff4a80, nresults=1) at ldo.c:322
#12 0x4024969c in luaD_call (L=0x3fff00e8, func=0x3fff4a80, nResults=1) at ldo.c:392
#13 0x4025639b in callTMres (L=0x3fff00e8, res=0x3fff4a10, f=0x3fff3b18, p1=0x3ffffed0, p2=0x3fff2860) at lvm.c:108
#14 0x40256788 in luaV_gettable (L=0x3fff00e8, t=0x3ffffed0, key=0x3fff2860, val=0x3fff4a10) at lvm.c:146
#15 0x40257d90 in luaV_execute (L=0x3fff00e8, nexeccalls=2) at lvm.c:491
#16 0x402496a5 in luaD_call (L=0x3fff00e8, func=0x3fff49c0, nResults=0) at ldo.c:393
#17 0x40245484 in lua_call (L=0x3fff00e8, nargs=1, nresults=0) at lapi.c:824
#18 0x4026709e in do_node_task (task_fn_ref=2, prio=1 '\001') at node.c:329
#19 0x4024206d in task_dispatch (e=<optimized out>) at task.c:32
#20 0x40000f49 in ?? ()
So now all we need to do is bracket this call in luaD_reallocstack
to disable GC during the primitive operation. Another bug killed.
From my rading of the code, in a debug build, it appears that there is an assert on each push to make sure that we don't exceed the number of slots available. Can you share the lua source code with me?