Debugging workers and executors is hard because they are started automatically. One approach is to make them sleep for a few seconds at startup, which gives us time to attach a debugger before they do any work.
To enable this, create one or both of these files: /tmp/r_worker_startup_sleep_secs and /tmp/r_executor_startup_sleep_secs. The first thing each worker and executor does is check whether its file exists. If it does, the process sleeps for the number of seconds specified in the file (the value must be greater than 0 and less than 120):
$ cat /tmp/r_executor_startup_sleep_secs
30
// Sleep for N secs if that file exists
FILE* f = fopen("/tmp/r_worker_startup_sleep_secs", "r");
if(f) {
  char buffer[10] = {0};  // zero-filled so atoi always sees a terminated string
  fread(buffer, 1, sizeof(buffer) - 1, f);
  fclose(f);
  int secs = atoi(buffer);
  if(secs > 0 and secs < 120) {
    LOG_INFO("Sleeping for %d secs, pid: %d\n", secs, getpid());
    sleep(secs);
  }
}
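The check above relies on atoi, which returns 0 for non-numeric text and cannot report overflow. A stricter parse of the file contents could look like the following sketch. The name parse_sleep_secs is hypothetical and not part of Distributed R; the accepted range (greater than 0, less than 120) is taken from the snippet above.

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the number of seconds to sleep, or 0 if the
// text is not a whole number in the accepted range (0 < secs < 120).
int parse_sleep_secs(const std::string& text) {
    const char* begin = text.c_str();
    char* end = nullptr;
    errno = 0;
    long value = std::strtol(begin, &end, 10);
    if (end == begin || errno == ERANGE) {
        return 0;  // no digits found, or the value does not fit in a long
    }
    // Allow trailing whitespace such as the newline written by echo.
    while (*end == ' ' || *end == '\t' || *end == '\n' || *end == '\r') {
        ++end;
    }
    if (*end != '\0') {
        return 0;  // trailing garbage such as "30abc"
    }
    return (value > 0 && value < 120) ? static_cast<int>(value) : 0;
}
```

With this helper, a file containing "30" followed by a newline yields 30, while an empty file, "abc", or "500" all yield 0 and the process starts without sleeping.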
When we start Distributed R we can see the following line in the worker's log:
2015-Mar-06 10:22:25.360285 [INFO] Sleeping for 30 secs, pid: 111020
At that point we can attach to the process using gdb. The worker has many threads running:
$ sudo gdb attach 111020
(gdb) info threads
Id Target Id Frame
19 Thread 0x7f8b8dffb700 (LWP 111052) "R-worker-bin" pthread_cond_timedwait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
18 Thread 0x7f8b8e7fc700 (LWP 111051) "R-worker-bin" pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
17 Thread 0x7f8b8effd700 (LWP 111050) "R-worker-bin" pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
16 Thread 0x7f8b8f7fe700 (LWP 111049) "R-worker-bin" pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
15 Thread 0x7f8b8ffff700 (LWP 111048) "R-worker-bin" pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
14 Thread 0x7f8bacff9700 (LWP 111047) "R-worker-bin" pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
13 Thread 0x7f8bad7fa700 (LWP 111046) "R-worker-bin" pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
12 Thread 0x7f8badffb700 (LWP 111045) "R-worker-bin" pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
11 Thread 0x7f8bae7fc700 (LWP 111044) "R-worker-bin" pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
10 Thread 0x7f8baeffd700 (LWP 111043) "R-worker-bin" pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
9 Thread 0x7f8baf7fe700 (LWP 111042) "R-worker-bin" pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
8 Thread 0x7f8baffff700 (LWP 111041) "R-worker-bin" pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
7 Thread 0x7f8bbc8fc700 (LWP 111040) "R-worker-bin" pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
6 Thread 0x7f8bbd0fd700 (LWP 111039) "R-worker-bin" pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
5 Thread 0x7f8bbd8fe700 (LWP 111038) "R-worker-bin" pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
4 Thread 0x7f8bbe0ff700 (LWP 111037) "R-worker-bin" pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
3 Thread 0x7f8bbe900700 (LWP 111035) "R-worker-bin" 0x00007f8bc09f26a3 in epoll_wait ()
at ../sysdeps/unix/syscall-template.S:81
2 Thread 0x7f8bbf101700 (LWP 111034) "R-worker-bin" 0x00007f8bc09f26a3 in epoll_wait ()
at ../sysdeps/unix/syscall-template.S:81
* 1 Thread 0x7f8bc32c07c0 (LWP 111020) "R-worker-bin" 0x00007f8bc09e4cbd in poll ()
at ../sysdeps/unix/syscall-template.S:81
(gdb) continue
We can select different threads, set breakpoints, and so on:
(gdb) thread 3
(gdb) b function
The process is the same for the executor. In this case the file is /tmp/r_executor_startup_sleep_secs:
// Sleep for N secs if that file exists
FILE* f = fopen("/tmp/r_executor_startup_sleep_secs", "r");
if(f) {
  char buffer[10] = {0};  // zero-filled so atoi always sees a terminated string
  fread(buffer, 1, sizeof(buffer) - 1, f);
  fclose(f);
  int secs = atoi(buffer);
  if(secs > 0 and secs < 120) {
    LOG_INFO("Sleeping for %d secs, pid: %d\n", secs, getpid());
    sleep(secs);
  }
}
We can see this line in the log:
2015-Mar-06 10:22:55.392888 [INFO] Sleeping for 30 secs, pid: 111036
And we can attach to the process using gdb:
$ sudo gdb attach 111036
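The worker and executor checks differ only in the file name, so they could share one helper. The sketch below shows one way to do that. The function names (startup_sleep_file, read_startup_sleep_secs, maybe_sleep_at_startup) and the "role" parameter are hypothetical, and fprintf stands in for the project's LOG_INFO macro so that the example compiles on its own.

```cpp
#include <cstdio>
#include <fstream>
#include <string>
#include <unistd.h>  // sleep, getpid (POSIX)

// Hypothetical helper: builds the control file path for a process role,
// e.g. "worker" -> "/tmp/r_worker_startup_sleep_secs".
std::string startup_sleep_file(const std::string& role) {
    return "/tmp/r_" + role + "_startup_sleep_secs";
}

// Returns the sleep time requested in `path`, or 0 if the file is missing,
// unreadable, or holds a value outside the accepted range (0 < secs < 120).
int read_startup_sleep_secs(const std::string& path) {
    std::ifstream in(path);
    int secs = 0;
    if (!(in >> secs)) {
        return 0;  // file absent or no integer could be read
    }
    return (secs > 0 && secs < 120) ? secs : 0;
}

// Sleeps at startup if the control file for `role` asks for it.
void maybe_sleep_at_startup(const std::string& role) {
    int secs = read_startup_sleep_secs(startup_sleep_file(role));
    if (secs > 0) {
        // The original code logs through LOG_INFO; fprintf is a stand-in.
        std::fprintf(stderr, "Sleeping for %d secs, pid: %d\n",
                     secs, static_cast<int>(getpid()));
        sleep(secs);
    }
}
```

Under these assumptions, the worker would call maybe_sleep_at_startup("worker") as the first statement of its main function, and the executor would call maybe_sleep_at_startup("executor").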
Debugging the master is easier because we start it ourselves. We get the PID of the R session with Sys.getpid() and attach gdb to it:
R> Sys.getpid()
[1] 108575
$ sudo gdb attach 108575
(gdb) info threads
Id Target Id Frame
8 Thread 0x7f7be5ffb700 (LWP 111473) "R" 0x00007f7bf4b3a6a3 in epoll_wait ()
at ../sysdeps/unix/syscall-template.S:81
7 Thread 0x7f7be67fc700 (LWP 111474) "R" 0x00007f7bf4b3a6a3 in epoll_wait ()
at ../sysdeps/unix/syscall-template.S:81
6 Thread 0x7f7bede6d700 (LWP 111570) "R" 0x00007f7bf4b3a6a3 in epoll_wait ()
at ../sysdeps/unix/syscall-template.S:81
5 Thread 0x7f7bed66c700 (LWP 111571) "R" 0x00007f7bf4b3a6a3 in epoll_wait ()
at ../sysdeps/unix/syscall-template.S:81
4 Thread 0x7f7be57fa700 (LWP 111572) "R" 0x00007f7bf4b2ccbd in poll ()
at ../sysdeps/unix/syscall-template.S:81
3 Thread 0x7f7be7fff700 (LWP 111655) "R" pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
2 Thread 0x7f7be77fe700 (LWP 111656) "R" pthread_cond_timedwait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
* 1 Thread 0x7f7bf58307c0 (LWP 108575) "R" 0x00007f7bf4b31933 in select ()
at ../sysdeps/unix/syscall-template.S:81