@rzezeski
Last active September 4, 2015 16:56
Mono bug-18026 stuck on CentOS 6.6 lx branded zone

Summary

The Mono VM appears to have a race/deadlock issue when using the sgen GC in an lx branded zone. The condition can almost always be reproduced by running test bug-18026 (only once did I see it run through to completion). This test stresses the thread pool, task queue, and allocation facilities of the VM -- spawning many tasks that perform allocation while concurrently bouncing the thread pool size up and down.

I have seen the VM get stuck in three different ways, but unfortunately I only have notes for two of them.

(1) Unable to escape stop-the-world (STW) GC

It enters STW and never leaves:

# dtrace -n 'pid$target:*mono*:sgen_stop_world:entry,pid$target:*mono*:sgen_stop_world:return {}' -p 54449
dtrace: description 'pid$target:*mono*:sgen_stop_world:entry,pid$target:*mono*:sgen_stop_world:return ' matched 2 probes
CPU     ID                    FUNCTION:NAME
 48  77343            sgen_stop_world:entry
 48  77344           sgen_stop_world:return
 48  77343            sgen_stop_world:entry
 48  77344           sgen_stop_world:return
 48  77343            sgen_stop_world:entry
 48  77344           sgen_stop_world:return
 48  77343            sgen_stop_world:entry

Specifically, it is stuck in this loop:

https://github.com/mono/mono/blob/mono-4.0.1.44/mono/metadata/sgen-stw.c#L114

While in that loop the restart_count is ALWAYS 1 and the info pointer is always the same.

I also saw this assertion twice:

Assertion at sgen-os-posix.c:103, condition `!info->stack_start' not met

(2) Stuck in threadpool

This seems to be the case that occurs most often. There is some activity at first, and then the process gets stuck bouncing between two threads.

Given the functions called by thread 9, I'm pretty sure it's stuck in the async_invoke_thread() function used by the thread pool for running tasks.

https://github.com/mono/mono/blob/mono-4.0.1.44/mono/metadata/threadpool.c#L1673

# dtrace -n 'pid$target:*mono*::entry,pid$target:*mono*::return { printf("[%2d] %d", tid, arg0); }' -p 21570

<... SNIP ...>

  7  87866        mono_sem_timedwait:return [ 9] 406
  7  80232 mono_runtime_is_shutting_down:entry [ 9] 9708008
  7  85702 mono_runtime_is_shutting_down:return [ 9] 6
  7  80548 check_for_interruption_critical:entry [ 9] 9708008
  7  81358    mono_gc_set_skip_thread:entry [ 9] 0
  7  82657   mono_thread_info_current:entry [ 9] 0
  7  87964  mono_thread_info_current:return [ 9] 24
  7  86773   mono_gc_set_skip_thread:return [ 9] 54
  7  81357             sgen_gc_unlock:entry [ 9] 1
  7  86772            sgen_gc_unlock:return [ 9] 32
  7  80337 mono_thread_internal_current:entry [ 9] 9803904
  7  85804 mono_thread_internal_current:return [ 9] 16
  7  86013 check_for_interruption_critical:return [ 9] 36
  7  81358    mono_gc_set_skip_thread:entry [ 9] 1
  7  82657   mono_thread_info_current:entry [ 9] 1
  7  87964  mono_thread_info_current:return [ 9] 24
  7  86773   mono_gc_set_skip_thread:return [ 9] 54
  7  81357             sgen_gc_unlock:entry [ 9] 1
  7  86772            sgen_gc_unlock:return [ 9] 32
  7  81995              mono_cq_count:entry [ 9] 10320848
  7  87358             mono_cq_count:return [ 9] 10
  7  82544         mono_sem_timedwait:entry [ 9] 9708008
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  87681              wapi_SleepEx:return [ 3] 205
 59  80232 mono_runtime_is_shutting_down:entry [ 3] 256
 59  85702 mono_runtime_is_shutting_down:return [ 3] 6
 59  81995              mono_cq_count:entry [ 3] 10419040
 59  87358             mono_cq_count:return [ 3] 10
 59  82523            mono_msec_ticks:entry [ 3] 10419040
 59  82522           mono_100ns_ticks:entry [ 3] 10419040
 59  87847          mono_100ns_ticks:return [ 3] 190
 59  87848           mono_msec_ticks:return [ 3] 82
 59  82346               wapi_SleepEx:entry [ 3] 500
 59  82330  get_current_thread_handle:entry [ 3] 500
 59  82657   mono_thread_info_current:entry [ 3] 500
 59  87964  mono_thread_info_current:return [ 3] 24
 59  87665 get_current_thread_handle:return [ 3] 30
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  87681              wapi_SleepEx:return [ 3] 205
 59  80232 mono_runtime_is_shutting_down:entry [ 3] 256
 59  85702 mono_runtime_is_shutting_down:return [ 3] 6
 59  81995              mono_cq_count:entry [ 3] 10419040
 59  87358             mono_cq_count:return [ 3] 10
 59  82523            mono_msec_ticks:entry [ 3] 10419040
 59  82522           mono_100ns_ticks:entry [ 3] 10419040
 59  87847          mono_100ns_ticks:return [ 3] 190
 59  87848           mono_msec_ticks:return [ 3] 82
 59  82346               wapi_SleepEx:entry [ 3] 500
 59  82330  get_current_thread_handle:entry [ 3] 500
 59  82657   mono_thread_info_current:entry [ 3] 500
 59  87964  mono_thread_info_current:return [ 3] 24
 59  87665 get_current_thread_handle:return [ 3] 30
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  87681              wapi_SleepEx:return [ 3] 205
 59  80232 mono_runtime_is_shutting_down:entry [ 3] 256
 59  85702 mono_runtime_is_shutting_down:return [ 3] 6
 59  81995              mono_cq_count:entry [ 3] 10419040
 59  87358             mono_cq_count:return [ 3] 10
 59  82523            mono_msec_ticks:entry [ 3] 10419040
 59  82522           mono_100ns_ticks:entry [ 3] 10419040
 59  87847          mono_100ns_ticks:return [ 3] 190
 59  87848           mono_msec_ticks:return [ 3] 82
 59  82346               wapi_SleepEx:entry [ 3] 500
 59  82330  get_current_thread_handle:entry [ 3] 500
 59  82657   mono_thread_info_current:entry [ 3] 500
 59  87964  mono_thread_info_current:return [ 3] 24
 59  87665 get_current_thread_handle:return [ 3] 30
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  87681              wapi_SleepEx:return [ 3] 205
 59  80232 mono_runtime_is_shutting_down:entry [ 3] 256
 59  85702 mono_runtime_is_shutting_down:return [ 3] 6
 59  81995              mono_cq_count:entry [ 3] 10419040
 59  87358             mono_cq_count:return [ 3] 10
 59  82523            mono_msec_ticks:entry [ 3] 10419040
 59  82522           mono_100ns_ticks:entry [ 3] 10419040
 59  87847          mono_100ns_ticks:return [ 3] 190
 59  87848           mono_msec_ticks:return [ 3] 82
 59  82346               wapi_SleepEx:entry [ 3] 500
 59  82330  get_current_thread_handle:entry [ 3] 500
 59  82657   mono_thread_info_current:entry [ 3] 500
 59  87964  mono_thread_info_current:return [ 3] 24
 59  87665 get_current_thread_handle:return [ 3] 30
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
  7  87866        mono_sem_timedwait:return [ 9] 406
  7  80232 mono_runtime_is_shutting_down:entry [ 9] 9708008
  7  85702 mono_runtime_is_shutting_down:return [ 9] 6
  7  80548 check_for_interruption_critical:entry [ 9] 9708008
  7  81358    mono_gc_set_skip_thread:entry [ 9] 0
  7  82657   mono_thread_info_current:entry [ 9] 0
  7  87964  mono_thread_info_current:return [ 9] 24
  7  86773   mono_gc_set_skip_thread:return [ 9] 54
  7  81357             sgen_gc_unlock:entry [ 9] 1
  7  86772            sgen_gc_unlock:return [ 9] 32
  7  80337 mono_thread_internal_current:entry [ 9] 9803904
  7  85804 mono_thread_internal_current:return [ 9] 16
  7  86013 check_for_interruption_critical:return [ 9] 36
  7  81358    mono_gc_set_skip_thread:entry [ 9] 1
  7  82657   mono_thread_info_current:entry [ 9] 1
  7  87964  mono_thread_info_current:return [ 9] 24
  7  86773   mono_gc_set_skip_thread:return [ 9] 54
  7  81357             sgen_gc_unlock:entry [ 9] 1
  7  86772            sgen_gc_unlock:return [ 9] 32
  7  81995              mono_cq_count:entry [ 9] 10320848
  7  87358             mono_cq_count:return [ 9] 10
  7  82544         mono_sem_timedwait:entry [ 9] 9708008

Misc Notes

  • With sgen disabled the test always runs to completion: ./configure --prefix=/opt/mono-boehm --with-gc=boehm --with-sgen=no.
  • No issues running the test on CentOS 6.6 on KVM.
  • I saw lx code trip this assert three times (but not yet on the newest platform image): Assertion failed: ucp != NULL, file ../common/lx_brand.c, line 797
  • Mono's sgen GC and thread pool make use of signals (SIGXCPU and SIGPWR) to interrupt threads.
  • The test completes when running under strace.
  • When the test is stuck, it can sometimes be unstuck by tracing with DTrace or running pstack.

Steps to Reproduce

Create CentOS 6 lx zone:

# vmadm create <<< '
{
    "alias": "centos66-lx",
    "hostname": "centos66-lx",
    "brand": "lx",
    "kernel_version": "2.6.32",
    "max_physical_memory": 7168,
    "image_uuid": "aa11a466-15c4-11e5-add5-0700f90a1098",
    "internal_metadata": {
        "root_authorized_keys":"@@your pub key@@"
    },
    "resolvers": [ "8.8.8.8" ],
    "nics": [
        {
            "nic_tag": "admin",
            "ip": "10.0.1.221",
            "netmask": "255.255.255.0",
            "gateway": "10.0.0.1",
            "primary": true
        }
    ],
    "quota": 20
}
'

Login to lx zone:

# zlogin <UUID returned from vmadm>

Build mono 4.0.1.44 (also tried 4.0.2.4):

# yum install git autoconf libtool automake gettext gcc-c++ tar wget
# git clone git://github.com/mono/mono.git
# cd mono
# git checkout mono-4.0.1.44
# ./autogen.sh --prefix=/opt/mono
# make get-monolite-latest
# make -j 32

Run VM tests:

# cd mono
# make check

Run bug-18026 only:

# cd tests
# MONO_GC_DEBUG=9 MONO_PATH=/root/mono/mcs/class/lib/net_4_5 ../../runtime/mono-wrapper ./test-runner.exe -j a bug-18026.exe

Even at level 9 the VM seems to log only a small fraction of the GC messages; it feels like something is broken there too.

Configure Output

   Engine:
	Host:	       x86_64-unknown-linux-gnu
	Target:	       x86_64-unknown-linux-gnu
	GC:	       sgen and Included Boehm GC with typed GC and parallel mark
	TLS:           __thread
	SIGALTSTACK:   yes
	Engine:        Building and using the JIT
	oprofile:      no
	BigArrays:     no
	DTrace:        no
	LLVM Back End: no (dynamically loaded: no)

   Libraries:
	.NET 4.5:      yes
	MonoDroid:     no
	MonoTouch:     no
	Xamarin.Mac:   no
	JNI support:   IKVM Native
	libgdiplus:    assumed to be installed
	zlib:
@rzezeski:

This morning I was able to replicate what I saw during the stuck build, but this time running the original test: bug-18026. Like the build, it had threads stuck in futex.

Terminal 1 (lx zone)

Uncomment WriteLine calls in test:

cd mono/mono/tests
vim bug-18026.cs

Run the test:

# MONO_PATH=/root/mono/mcs/class/lib/net_4_5 ../../runtime/mono-wrapper ./test-runner.exe -j a bug-18026.exe
Running tests:

Terminal 2 (lx zone)

tail the stdout for the test:

# cd mono
# tail -f mono/tests/bug-18026.exe.std*
10000
new
9999
done 0
9998
done 1
9997
done 2
done 3
9996
9995
9994
done 4
done 5
9993
9992
done 6
9991
done 7
done 8
done 9
...

Test gets stuck...

Terminal 3 (lx zone)

Run pstack on stuck process:

# /native/usr/bin/ptree $(pgrep mono)
5330  zsched
  5440  init
    5662  /usr/sbin/sshd
      9115  /usr/sbin/sshd -R
        396   -bash
          603   /root/mono/mono/mini/mono --config /root/mono/runtime/etc/mono/config ./test-ru
            620   /root/mono/mono/mini/mono --config /root/mono/runtime/etc/mono/config bug-18026

# /native/usr/bin/pstack 620
620:    /root/mono/mono/mini/mono --config /root/mono/runtime/etc/mono/config
-----------------  lwp# 1 / thread# 1  --------------------
 00007ffffe00b5bc SYS#0    ()
-----------------  lwp# 2 / thread# 2  --------------------
 00007ffffe00d930 SYS#0    ()
-----------------  lwp# 3 / thread# 3  --------------------
 00007fffff34da4a nanosleep (7ffffeabeb10, 0)
 00007fffff4cce46 lx_clock_nanosleep (1, 1, 7ffffeb0dcc0, 0) + 96
 00007fffff4d02ef lx_emulate (7ffffeabeca0, e6, 7ffffeabec70) + bf
-----------------  lwp# 4 / thread# 4  --------------------
 00007ffffe00e6fd SYS#0    ()
-----------------  lwp# 5 / thread# 5  --------------------
 00007ffffe00b5bc SYS#0    ()
-----------------  lwp# 6 / thread# 6  --------------------
 00007ffffe00b5bc SYS#0    ()
-----------------  lwp# 7 / thread# 7  --------------------
 00007ffffe00b5bc SYS#0    ()
-----------------  lwp# 8 / thread# 8  --------------------
 00007ffffe00b5bc SYS#0    ()
-----------------  lwp# 9 / thread# 9  --------------------
 00007ffffe00b5bc SYS#0    ()
-----------------  lwp# 10 / thread# 10  --------------------
 00007ffffe00b5bc SYS#0    ()
-----------------  lwp# 11 / thread# 11  --------------------
 00007ffffe00da21 SYS#0    ()

Terminal 4 (global zone)

Determine which Linux syscalls these threads are stuck in:

# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs sd lofs idm sata crypto random cpc logindmux ptm kvm sppp nsmb smbsrv nfs ipc ]
> 0t620 ::pid2proc | ::walk thread | ::print kthread_t t_sysnum
t_sysnum = 0
t_sysnum = 0
t_sysnum = 0xc7
t_sysnum = 0
t_sysnum = 0
t_sysnum = 0
t_sysnum = 0
t_sysnum = 0
t_sysnum = 0
t_sysnum = 0
t_sysnum = 0
> 0t620 ::pid2proc | ::walk thread | ::print kthread_t t_lwp | ::print klwp_t lwp_brand | ::print lx_lwp_data_t br_syscall_num
br_syscall_num = 0xca
br_syscall_num = 0xca
br_syscall_num = 0xe6
br_syscall_num = 0x1
br_syscall_num = 0xca
br_syscall_num = 0xca
br_syscall_num = 0xca
br_syscall_num = 0xca
br_syscall_num = 0xca
br_syscall_num = 0xca
br_syscall_num = 0xca

Nearly all are stuck in futex: 0xca is futex(2) in the 64-bit Linux syscall table; the two exceptions are in clock_nanosleep (0xe6) and write (0x1), matching the nanosleep and fifo_write stacks.

@rzezeski:

This thread about the recent futex_wait bug in the Linux kernel sounds a lot like what I'm experiencing.

  • User application mysteriously locking up.
  • Using debugging tools causes it to get unstuck.
  • Often happens around GC.

https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64

The catch is that there is no Linux kernel in lx, so it can't literally be this bug, but perhaps a similar mistake was made in the lx futex implementation? FWIW, I've reproduced this on an older Sandy Bridge i7 and a newer Haswell Xeon.

I also see no direct usage of futex(2) in the mono VM code, so I imagine other calls are mapped to futex by glibc?

@rzezeski:

I dug into my stuck proc a bit more to determine the arguments to
lx_futex() and see if that might shed some light on the problem. I'm
not 100% sure I pulled out the arguments correctly, but the values
seem to make sense: the uaddr argument for all 8 LWPs points to the
same address, and the op argument for all of them is 0x80, i.e.
FUTEX_WAIT | FUTEX_PRIVATE_FLAG -- yet another thing in common with
the recent Linux futex bug.

First the stacks:

> 0t620 ::pid2proc | ::walk thread | ::stacks -a
THREAD           STATE    SOBJ                COUNT
fffff085cb860080 SLEEP    CV                      8
fffff6c94a4950c0 SLEEP    CV                      -
fffff6c94a187b80 SLEEP    CV                      -
fffff6c94a1790c0 SLEEP    CV                      -
fffff6c94a13d820 SLEEP    CV                      -
fffff6c94a129000 SLEEP    CV                      -
fffff6c93dc387e0 SLEEP    CV                      -
fffff53096b03ba0 SLEEP    CV                      -
                 swtch+0x141
                 cv_wait_sig_swap_core+0x1b9
                 cv_wait_sig_swap+0x17
                 cv_waituntil_sig+0xbd
                 futex_wait+0x148
                 lx_futex+0x290
                 lx_syscall_enter+0x16a
                 sys_syscall+0x145

THREAD           STATE    SOBJ                COUNT
fffff54dba7a2180 SLEEP    CV                      1
                 swtch+0x141
                 cv_timedwait_sig_hires+0x39d
                 cv_waituntil_sig+0xfa
                 futex_wait+0x148
                 lx_futex+0x290
                 lx_syscall_enter+0x16a
                 sys_syscall+0x145

THREAD           STATE    SOBJ                COUNT
fffff08768909480 SLEEP    CV                      1
                 swtch+0x141
                 cv_timedwait_sig_hires+0x39d
                 cv_waituntil_sig+0xfa
                 nanosleep+0x19f

THREAD           STATE    SOBJ                COUNT
fffff6c94a1a3040 SLEEP    CV                      1
                 swtch+0x141
                 cv_wait_sig_swap_core+0x1b9
                 cv_wait_sig_swap+0x17
                 fifo_write+0x293
                 fop_write+0x8b
                 write+0x250
                 lx_write+0x3a
                 lx_syscall_enter+0x16a
                 sys_syscall+0x145

Then I teased out the arguments to each lx_futex() call. Note that p_model = 0x200000, so I assume the first branch of lx_emulate_args() was taken.

https://github.com/joyent/illumos-joyent/blob/master/usr/src/uts/common/brand/lx/os/lx_syscall.c#L118

> fffff085cb860080::print kthread_t t_lwp | ::print klwp_t lwp_regs | ::print struct regs r_rdi r_rsi r_rdx r_r10 r_r8 r_r9
r_rdi = 0x99084c
r_rsi = 0x80
r_rdx = 0x107e
r_r10 = 0
r_r8 = 0x990800
r_r9 = 0x83d
> fffff6c94a4950c0::print kthread_t t_lwp | ::print klwp_t lwp_regs | ::print struct regs r_rdi r_rsi r_rdx r_r10 r_r8 r_r9
r_rdi = 0x99084c
r_rsi = 0x80
r_rdx = 0x107b
r_r10 = 0
r_r8 = 0x990800
r_r9 = 0x83d
> fffff6c94a187b80::print kthread_t t_lwp | ::print klwp_t lwp_regs | ::print struct regs r_rdi r_rsi r_rdx r_r10 r_r8 r_r9
r_rdi = 0x99084c
r_rsi = 0x80
r_rdx = 0x107c
r_r10 = 0
r_r8 = 0x990800
r_r9 = 0x83d
> fffff6c94a1790c0::print kthread_t t_lwp | ::print klwp_t lwp_regs | ::print struct regs r_rdi r_rsi r_rdx r_r10 r_r8 r_r9
r_rdi = 0x99084c
r_rsi = 0x80
r_rdx = 0x107d
r_r10 = 0
r_r8 = 0x990800
r_r9 = 0x83d
> fffff6c94a13d820::print kthread_t t_lwp | ::print klwp_t lwp_regs | ::print struct regs r_rdi r_rsi r_rdx r_r10 r_r8 r_r9
r_rdi = 0x99084c
r_rsi = 0x80
r_rdx = 0x107f
r_r10 = 0
r_r8 = 0x990800
r_r9 = 0x83d
> fffff6c94a129000::print kthread_t t_lwp | ::print klwp_t lwp_regs | ::print struct regs r_rdi r_rsi r_rdx r_r10 r_r8 r_r9
r_rdi = 0x942700
r_rsi = 0x80
r_rdx = 0
r_r10 = 0
r_r8 = 0
r_r9 = 0x27b
> fffff6c93dc387e0::print kthread_t t_lwp | ::print klwp_t lwp_regs | ::print struct regs r_rdi r_rsi r_rdx r_r10 r_r8 r_r9
r_rdi = 0x99084c
r_rsi = 0x80
r_rdx = 0x1080
r_r10 = 0
r_r8 = 0x990800
r_r9 = 0x83d
> fffff53096b03ba0::print kthread_t t_lwp | ::print klwp_t lwp_regs | ::print struct regs r_rdi r_rsi r_rdx r_r10 r_r8 r_r9
r_rdi = 0x99084c
r_rsi = 0x80
r_rdx = 0x1081
r_r10 = 0
r_r8 = 0x990800
r_r9 = 0x83d

All are waiting on the same address. I have very little knowledge of futex, but I find two things odd:

  1. The val argument (r_rdx) seems way high (though for FUTEX_WAIT, val is just the expected value of the futex word, so a large counter value may be normal).
  2. One of the threads has val=0, compared to the much higher values
    for the other threads.

Finally, here are the args for the last remaining futex call, which has op=0x189 (FUTEX_WAIT_BITSET | FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME).

> fffff54dba7a2180::print kthread_t t_lwp | ::print klwp_t lwp_regs | ::print struct regs r_rdi r_rsi r_rdx r_r10 r_r8 r_r9
r_rdi = 0x9421e8
r_rsi = 0x189
r_rdx = 0
r_r10 = 0x7fffeddffca0
r_r8 = 0
r_r9 = 0xffffffff
>

@jjelinek:

Some of this seems similar to the issue we have reported as OS-4562, although the version of mono we get by default on Ubuntu 14.04 is 3.2.8. However, it looks like we're also stuck in GC in a similar way.

@rzezeski commented Sep 4, 2015:

> Some of this seems similar to the issue we have reported as OS-4562, although the version of mono we get by default on Ubuntu 14.04 is 3.2.8. However, it looks like we're also stuck in GC in a similar way.

Yes, the fix for OS-4562 has resolved the issue I reported here. Thank you.

Commit and platform image URL for anyone else who happens upon this gist:

TritonDataCenter/illumos-joyent@5b40dcb

http://us-east.manta.joyent.com/Joyent_Dev/public/SmartOS/smartos.html#20150903T073920Z
