The Mono VM appears to have a race/deadlock issue when using sgen GC on an lx branded zone. The condition can almost ALWAYS (only once did I see it run through to completion) be reproduced by running test bug-18026. This test stresses the thread pool, task queue, and allocation facilities of the VM -- spawning many tasks performing allocation while concurrently bouncing the thread pool size up and down.
I have seen the VM get stuck in three different ways, but I unfortunately I only have notes for two of them.
It enters STW and never leaves:
# dtrace -n 'pid$target:*mono*:sgen_stop_world:entry,pid$target:*mono*:sgen_stop_world:return {}' -p 54449
dtrace: description 'pid$target:*mono*:sgen_stop_world:entry,pid$target:*mono*:sgen_stop_world:return ' matched 2 probes
CPU ID FUNCTION:NAME
48 77343 sgen_stop_world:entry
48 77344 sgen_stop_world:return
48 77343 sgen_stop_world:entry
48 77344 sgen_stop_world:return
48 77343 sgen_stop_world:entry
48 77344 sgen_stop_world:return
48 77343 sgen_stop_world:entry
Specifically, it is stuck in this loop:
https://github.com/mono/mono/blob/mono-4.0.1.44/mono/metadata/sgen-stw.c#L114
While in that loop the restart_count
is ALWAYS 1
and the info
pointer is always the same.
I also saw this assetion two times:
Assertion at sgen-os-posix.c:103, condition `!info->stack_start' not met
This seems to be the case the occurs most often. There is some activity at first and then the process gets stuck bouncing between two threads.
Given the functions called by thread 9
I'm pretty sure it's stuck in
the async_invoke_thread()
function used by the thread pool for
running tasks.
https://github.com/mono/mono/blob/mono-4.0.1.44/mono/metadata/threadpool.c#L1673
# dtrace -n 'pid$target:*mono*::entry,pid$target:*mono*::return { printf("[%2d] %d", tid, arg0); }' -p 21570
<... SNIP ...>
7 87866 mono_sem_timedwait:return [ 9] 406
7 80232 mono_runtime_is_shutting_down:entry [ 9] 9708008
7 85702 mono_runtime_is_shutting_down:return [ 9] 6
7 80548 check_for_interruption_critical:entry [ 9] 9708008
7 81358 mono_gc_set_skip_thread:entry [ 9] 0
7 82657 mono_thread_info_current:entry [ 9] 0
7 87964 mono_thread_info_current:return [ 9] 24
7 86773 mono_gc_set_skip_thread:return [ 9] 54
7 81357 sgen_gc_unlock:entry [ 9] 1
7 86772 sgen_gc_unlock:return [ 9] 32
7 80337 mono_thread_internal_current:entry [ 9] 9803904
7 85804 mono_thread_internal_current:return [ 9] 16
7 86013 check_for_interruption_critical:return [ 9] 36
7 81358 mono_gc_set_skip_thread:entry [ 9] 1
7 82657 mono_thread_info_current:entry [ 9] 1
7 87964 mono_thread_info_current:return [ 9] 24
7 86773 mono_gc_set_skip_thread:return [ 9] 54
7 81357 sgen_gc_unlock:entry [ 9] 1
7 86772 sgen_gc_unlock:return [ 9] 32
7 81995 mono_cq_count:entry [ 9] 10320848
7 87358 mono_cq_count:return [ 9] 10
7 82544 mono_sem_timedwait:entry [ 9] 9708008
59 82334 _wapi_thread_apc_pending:entry [ 3] 65546
59 82332 lookup_thread:entry [ 3] 65546
59 82061 _wapi_lookup_handle:entry [ 3] 65546
59 87411 _wapi_lookup_handle:return [ 3] 100
59 87667 lookup_thread:return [ 3] 32
59 87669 _wapi_thread_apc_pending:return [ 3] 28
59 87681 wapi_SleepEx:return [ 3] 205
59 80232 mono_runtime_is_shutting_down:entry [ 3] 256
59 85702 mono_runtime_is_shutting_down:return [ 3] 6
59 81995 mono_cq_count:entry [ 3] 10419040
59 87358 mono_cq_count:return [ 3] 10
59 82523 mono_msec_ticks:entry [ 3] 10419040
59 82522 mono_100ns_ticks:entry [ 3] 10419040
59 87847 mono_100ns_ticks:return [ 3] 190
59 87848 mono_msec_ticks:return [ 3] 82
59 82346 wapi_SleepEx:entry [ 3] 500
59 82330 get_current_thread_handle:entry [ 3] 500
59 82657 mono_thread_info_current:entry [ 3] 500
59 87964 mono_thread_info_current:return [ 3] 24
59 87665 get_current_thread_handle:return [ 3] 30
59 82334 _wapi_thread_apc_pending:entry [ 3] 65546
59 82332 lookup_thread:entry [ 3] 65546
59 82061 _wapi_lookup_handle:entry [ 3] 65546
59 87411 _wapi_lookup_handle:return [ 3] 100
59 87667 lookup_thread:return [ 3] 32
59 87669 _wapi_thread_apc_pending:return [ 3] 28
59 82334 _wapi_thread_apc_pending:entry [ 3] 65546
59 82332 lookup_thread:entry [ 3] 65546
59 82061 _wapi_lookup_handle:entry [ 3] 65546
59 87411 _wapi_lookup_handle:return [ 3] 100
59 87667 lookup_thread:return [ 3] 32
59 87669 _wapi_thread_apc_pending:return [ 3] 28
59 87681 wapi_SleepEx:return [ 3] 205
59 80232 mono_runtime_is_shutting_down:entry [ 3] 256
59 85702 mono_runtime_is_shutting_down:return [ 3] 6
59 81995 mono_cq_count:entry [ 3] 10419040
59 87358 mono_cq_count:return [ 3] 10
59 82523 mono_msec_ticks:entry [ 3] 10419040
59 82522 mono_100ns_ticks:entry [ 3] 10419040
59 87847 mono_100ns_ticks:return [ 3] 190
59 87848 mono_msec_ticks:return [ 3] 82
59 82346 wapi_SleepEx:entry [ 3] 500
59 82330 get_current_thread_handle:entry [ 3] 500
59 82657 mono_thread_info_current:entry [ 3] 500
59 87964 mono_thread_info_current:return [ 3] 24
59 87665 get_current_thread_handle:return [ 3] 30
59 82334 _wapi_thread_apc_pending:entry [ 3] 65546
59 82332 lookup_thread:entry [ 3] 65546
59 82061 _wapi_lookup_handle:entry [ 3] 65546
59 87411 _wapi_lookup_handle:return [ 3] 100
59 87667 lookup_thread:return [ 3] 32
59 87669 _wapi_thread_apc_pending:return [ 3] 28
59 82334 _wapi_thread_apc_pending:entry [ 3] 65546
59 82332 lookup_thread:entry [ 3] 65546
59 82061 _wapi_lookup_handle:entry [ 3] 65546
59 87411 _wapi_lookup_handle:return [ 3] 100
59 87667 lookup_thread:return [ 3] 32
59 87669 _wapi_thread_apc_pending:return [ 3] 28
59 87681 wapi_SleepEx:return [ 3] 205
59 80232 mono_runtime_is_shutting_down:entry [ 3] 256
59 85702 mono_runtime_is_shutting_down:return [ 3] 6
59 81995 mono_cq_count:entry [ 3] 10419040
59 87358 mono_cq_count:return [ 3] 10
59 82523 mono_msec_ticks:entry [ 3] 10419040
59 82522 mono_100ns_ticks:entry [ 3] 10419040
59 87847 mono_100ns_ticks:return [ 3] 190
59 87848 mono_msec_ticks:return [ 3] 82
59 82346 wapi_SleepEx:entry [ 3] 500
59 82330 get_current_thread_handle:entry [ 3] 500
59 82657 mono_thread_info_current:entry [ 3] 500
59 87964 mono_thread_info_current:return [ 3] 24
59 87665 get_current_thread_handle:return [ 3] 30
59 82334 _wapi_thread_apc_pending:entry [ 3] 65546
59 82332 lookup_thread:entry [ 3] 65546
59 82061 _wapi_lookup_handle:entry [ 3] 65546
59 87411 _wapi_lookup_handle:return [ 3] 100
59 87667 lookup_thread:return [ 3] 32
59 87669 _wapi_thread_apc_pending:return [ 3] 28
59 82334 _wapi_thread_apc_pending:entry [ 3] 65546
59 82332 lookup_thread:entry [ 3] 65546
59 82061 _wapi_lookup_handle:entry [ 3] 65546
59 87411 _wapi_lookup_handle:return [ 3] 100
59 87667 lookup_thread:return [ 3] 32
59 87669 _wapi_thread_apc_pending:return [ 3] 28
59 87681 wapi_SleepEx:return [ 3] 205
59 80232 mono_runtime_is_shutting_down:entry [ 3] 256
59 85702 mono_runtime_is_shutting_down:return [ 3] 6
59 81995 mono_cq_count:entry [ 3] 10419040
59 87358 mono_cq_count:return [ 3] 10
59 82523 mono_msec_ticks:entry [ 3] 10419040
59 82522 mono_100ns_ticks:entry [ 3] 10419040
59 87847 mono_100ns_ticks:return [ 3] 190
59 87848 mono_msec_ticks:return [ 3] 82
59 82346 wapi_SleepEx:entry [ 3] 500
59 82330 get_current_thread_handle:entry [ 3] 500
59 82657 mono_thread_info_current:entry [ 3] 500
59 87964 mono_thread_info_current:return [ 3] 24
59 87665 get_current_thread_handle:return [ 3] 30
59 82334 _wapi_thread_apc_pending:entry [ 3] 65546
59 82332 lookup_thread:entry [ 3] 65546
59 82061 _wapi_lookup_handle:entry [ 3] 65546
59 87411 _wapi_lookup_handle:return [ 3] 100
59 87667 lookup_thread:return [ 3] 32
59 87669 _wapi_thread_apc_pending:return [ 3] 28
7 87866 mono_sem_timedwait:return [ 9] 406
7 80232 mono_runtime_is_shutting_down:entry [ 9] 9708008
7 85702 mono_runtime_is_shutting_down:return [ 9] 6
7 80548 check_for_interruption_critical:entry [ 9] 9708008
7 81358 mono_gc_set_skip_thread:entry [ 9] 0
7 82657 mono_thread_info_current:entry [ 9] 0
7 87964 mono_thread_info_current:return [ 9] 24
7 86773 mono_gc_set_skip_thread:return [ 9] 54
7 81357 sgen_gc_unlock:entry [ 9] 1
7 86772 sgen_gc_unlock:return [ 9] 32
7 80337 mono_thread_internal_current:entry [ 9] 9803904
7 85804 mono_thread_internal_current:return [ 9] 16
7 86013 check_for_interruption_critical:return [ 9] 36
7 81358 mono_gc_set_skip_thread:entry [ 9] 1
7 82657 mono_thread_info_current:entry [ 9] 1
7 87964 mono_thread_info_current:return [ 9] 24
7 86773 mono_gc_set_skip_thread:return [ 9] 54
7 81357 sgen_gc_unlock:entry [ 9] 1
7 86772 sgen_gc_unlock:return [ 9] 32
7 81995 mono_cq_count:entry [ 9] 10320848
7 87358 mono_cq_count:return [ 9] 10
7 82544 mono_sem_timedwait:entry [ 9] 9708008
- With sgen disabled the test always runs to completion:
./configure –prefix=/opt/mono-boehm –with-gc=boehm –with-sgen=no
. - No issues running test on CentOS 6.6 on KVM.
- I saw lx code trip this assert three times (but not yet on newest platform image):
Assertion failed: ucp != NULL, file ../common/lx_brand.c, line 797
- Mono's sgen GC and thread pool makes use of signals to interrupt threads:
SIGXCPU
&SIGPWR
. - Test completes when running under strace.
- When test is stuck can sometimes unstick it when tracing with DTrace or using pstack.
Create CentOS 6 lx zone:
# vmadm create <<< '
{
"alias": "centos66-lx",
"hostname": "centos66-lx",
"brand": "lx",
"kernel_version": "2.6.32",
"max_physical_memory": 7168,
"image_uuid": "aa11a466-15c4-11e5-add5-0700f90a1098",
"internal_metadata": {
"root_authorized_keys":"@@your pub key@@"
},
"resolvers": [ "8.8.8.8" ],
"nics": [
{
"nic_tag": "admin",
"ip": "10.0.1.221",
"netmask": "255.255.255.0",
"gateway": "10.0.0.1",
"primary": true
}
],
"quota": 20
}
'
Login to lx zone:
# zlogin <UUID returned from vmadm>
Build mono 4.0.1.44 (also tried 4.0.2.4):
# yum install git autoconf libtool automake gettext gcc-c++ tar wget
# git clone git://github.com/mono/mono.git
# cd mono
# git checkout mono-4.0.1.44
# ./autogen.sh --prefix=/opt/mono
# make get-monolite-latest
# make -j 32
Run VM tests:
# cd mono
# make check
Run bug-18026 only:
# cd tests
# MONO_GC_DEBUG=9 MONO_PATH=/root/mono/mcs/class/lib/net_4_5 ../../runtime/mono-wrapper ./test-runner.exe -j a bug-18026.exe
Even at level 9 the VM seems to only log a small fraction of the GC messages, feels like something is broken there too.
Engine:
Host: x86_64-unknown-linux-gnu
Target: x86_64-unknown-linux-gnu
GC: sgen and Included Boehm GC with typed GC and parallel mark
TLS: __thread
SIGALTSTACK: yes
Engine: Building and using the JIT
oprofile: no
BigArrays: no
DTrace: no
LLVM Back End: no (dynamically loaded: no)
Libraries:
.NET 4.5: yes
MonoDroid: no
MonoTouch: no
Xamarin.Mac: no
JNI support: IKVM Native
libgdiplus: assumed to be installed
zlib:
I forgot to mention that you can witness the test get stuck by uncommenting the
WriteLine()
calls and tailing the stdout file: