The Mono VM appears to have a race/deadlock issue when using sgen GC on an lx branded zone. The condition can almost ALWAYS (only once did I see it run through to completion) be reproduced by running test bug-18026. This test stresses the thread pool, task queue, and allocation facilities of the VM -- spawning many tasks performing allocation while concurrently bouncing the thread pool size up and down.
I have seen the VM get stuck in three different ways, but I unfortunately I only have notes for two of them.
It enters STW and never leaves:
# dtrace -n 'pid$target:*mono*:sgen_stop_world:entry,pid$target:*mono*:sgen_stop_world:return {}' -p 54449
dtrace: description 'pid$target:*mono*:sgen_stop_world:entry,pid$target:*mono*:sgen_stop_world:return ' matched 2 probes
CPU     ID                    FUNCTION:NAME
 48  77343            sgen_stop_world:entry
 48  77344           sgen_stop_world:return
 48  77343            sgen_stop_world:entry
 48  77344           sgen_stop_world:return
 48  77343            sgen_stop_world:entry
 48  77344           sgen_stop_world:return
 48  77343            sgen_stop_world:entry
Specifically, it is stuck in this loop:
https://github.com/mono/mono/blob/mono-4.0.1.44/mono/metadata/sgen-stw.c#L114
While in that loop the restart_count is ALWAYS 1 and the info
pointer is always the same.
I also saw this assetion two times:
Assertion at sgen-os-posix.c:103, condition `!info->stack_start' not met
This seems to be the case the occurs most often. There is some activity at first and then the process gets stuck bouncing between two threads.
Given the functions called by thread 9 I'm pretty sure it's stuck in
the async_invoke_thread() function used by the thread pool for
running tasks.
https://github.com/mono/mono/blob/mono-4.0.1.44/mono/metadata/threadpool.c#L1673
# dtrace -n 'pid$target:*mono*::entry,pid$target:*mono*::return { printf("[%2d] %d", tid, arg0); }' -p 21570
<... SNIP ...>
  7  87866        mono_sem_timedwait:return [ 9] 406
  7  80232 mono_runtime_is_shutting_down:entry [ 9] 9708008
  7  85702 mono_runtime_is_shutting_down:return [ 9] 6
  7  80548 check_for_interruption_critical:entry [ 9] 9708008
  7  81358    mono_gc_set_skip_thread:entry [ 9] 0
  7  82657   mono_thread_info_current:entry [ 9] 0
  7  87964  mono_thread_info_current:return [ 9] 24
  7  86773   mono_gc_set_skip_thread:return [ 9] 54
  7  81357             sgen_gc_unlock:entry [ 9] 1
  7  86772            sgen_gc_unlock:return [ 9] 32
  7  80337 mono_thread_internal_current:entry [ 9] 9803904
  7  85804 mono_thread_internal_current:return [ 9] 16
  7  86013 check_for_interruption_critical:return [ 9] 36
  7  81358    mono_gc_set_skip_thread:entry [ 9] 1
  7  82657   mono_thread_info_current:entry [ 9] 1
  7  87964  mono_thread_info_current:return [ 9] 24
  7  86773   mono_gc_set_skip_thread:return [ 9] 54
  7  81357             sgen_gc_unlock:entry [ 9] 1
  7  86772            sgen_gc_unlock:return [ 9] 32
  7  81995              mono_cq_count:entry [ 9] 10320848
  7  87358             mono_cq_count:return [ 9] 10
  7  82544         mono_sem_timedwait:entry [ 9] 9708008
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  87681              wapi_SleepEx:return [ 3] 205
 59  80232 mono_runtime_is_shutting_down:entry [ 3] 256
 59  85702 mono_runtime_is_shutting_down:return [ 3] 6
 59  81995              mono_cq_count:entry [ 3] 10419040
 59  87358             mono_cq_count:return [ 3] 10
 59  82523            mono_msec_ticks:entry [ 3] 10419040
 59  82522           mono_100ns_ticks:entry [ 3] 10419040
 59  87847          mono_100ns_ticks:return [ 3] 190
 59  87848           mono_msec_ticks:return [ 3] 82
 59  82346               wapi_SleepEx:entry [ 3] 500
 59  82330  get_current_thread_handle:entry [ 3] 500
 59  82657   mono_thread_info_current:entry [ 3] 500
 59  87964  mono_thread_info_current:return [ 3] 24
 59  87665 get_current_thread_handle:return [ 3] 30
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  87681              wapi_SleepEx:return [ 3] 205
 59  80232 mono_runtime_is_shutting_down:entry [ 3] 256
 59  85702 mono_runtime_is_shutting_down:return [ 3] 6
 59  81995              mono_cq_count:entry [ 3] 10419040
 59  87358             mono_cq_count:return [ 3] 10
 59  82523            mono_msec_ticks:entry [ 3] 10419040
 59  82522           mono_100ns_ticks:entry [ 3] 10419040
 59  87847          mono_100ns_ticks:return [ 3] 190
 59  87848           mono_msec_ticks:return [ 3] 82
 59  82346               wapi_SleepEx:entry [ 3] 500
 59  82330  get_current_thread_handle:entry [ 3] 500
 59  82657   mono_thread_info_current:entry [ 3] 500
 59  87964  mono_thread_info_current:return [ 3] 24
 59  87665 get_current_thread_handle:return [ 3] 30
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  87681              wapi_SleepEx:return [ 3] 205
 59  80232 mono_runtime_is_shutting_down:entry [ 3] 256
 59  85702 mono_runtime_is_shutting_down:return [ 3] 6
 59  81995              mono_cq_count:entry [ 3] 10419040
 59  87358             mono_cq_count:return [ 3] 10
 59  82523            mono_msec_ticks:entry [ 3] 10419040
 59  82522           mono_100ns_ticks:entry [ 3] 10419040
 59  87847          mono_100ns_ticks:return [ 3] 190
 59  87848           mono_msec_ticks:return [ 3] 82
 59  82346               wapi_SleepEx:entry [ 3] 500
 59  82330  get_current_thread_handle:entry [ 3] 500
 59  82657   mono_thread_info_current:entry [ 3] 500
 59  87964  mono_thread_info_current:return [ 3] 24
 59  87665 get_current_thread_handle:return [ 3] 30
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  87681              wapi_SleepEx:return [ 3] 205
 59  80232 mono_runtime_is_shutting_down:entry [ 3] 256
 59  85702 mono_runtime_is_shutting_down:return [ 3] 6
 59  81995              mono_cq_count:entry [ 3] 10419040
 59  87358             mono_cq_count:return [ 3] 10
 59  82523            mono_msec_ticks:entry [ 3] 10419040
 59  82522           mono_100ns_ticks:entry [ 3] 10419040
 59  87847          mono_100ns_ticks:return [ 3] 190
 59  87848           mono_msec_ticks:return [ 3] 82
 59  82346               wapi_SleepEx:entry [ 3] 500
 59  82330  get_current_thread_handle:entry [ 3] 500
 59  82657   mono_thread_info_current:entry [ 3] 500
 59  87964  mono_thread_info_current:return [ 3] 24
 59  87665 get_current_thread_handle:return [ 3] 30
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
  7  87866        mono_sem_timedwait:return [ 9] 406
  7  80232 mono_runtime_is_shutting_down:entry [ 9] 9708008
  7  85702 mono_runtime_is_shutting_down:return [ 9] 6
  7  80548 check_for_interruption_critical:entry [ 9] 9708008
  7  81358    mono_gc_set_skip_thread:entry [ 9] 0
  7  82657   mono_thread_info_current:entry [ 9] 0
  7  87964  mono_thread_info_current:return [ 9] 24
  7  86773   mono_gc_set_skip_thread:return [ 9] 54
  7  81357             sgen_gc_unlock:entry [ 9] 1
  7  86772            sgen_gc_unlock:return [ 9] 32
  7  80337 mono_thread_internal_current:entry [ 9] 9803904
  7  85804 mono_thread_internal_current:return [ 9] 16
  7  86013 check_for_interruption_critical:return [ 9] 36
  7  81358    mono_gc_set_skip_thread:entry [ 9] 1
  7  82657   mono_thread_info_current:entry [ 9] 1
  7  87964  mono_thread_info_current:return [ 9] 24
  7  86773   mono_gc_set_skip_thread:return [ 9] 54
  7  81357             sgen_gc_unlock:entry [ 9] 1
  7  86772            sgen_gc_unlock:return [ 9] 32
  7  81995              mono_cq_count:entry [ 9] 10320848
  7  87358             mono_cq_count:return [ 9] 10
  7  82544         mono_sem_timedwait:entry [ 9] 9708008
- With sgen disabled the test always runs to completion: ./configure –prefix=/opt/mono-boehm –with-gc=boehm –with-sgen=no.
- No issues running test on CentOS 6.6 on KVM.
- I saw lx code trip this assert three times (but not yet on newest platform image): Assertion failed: ucp != NULL, file ../common/lx_brand.c, line 797
- Mono's sgen GC and thread pool makes use of signals to interrupt threads: SIGXCPU&SIGPWR.
- Test completes when running under strace.
- When test is stuck can sometimes unstick it when tracing with DTrace or using pstack.
Create CentOS 6 lx zone:
# vmadm create <<< '
{
    "alias": "centos66-lx",
    "hostname": "centos66-lx",
    "brand": "lx",
    "kernel_version": "2.6.32",
    "max_physical_memory": 7168,
    "image_uuid": "aa11a466-15c4-11e5-add5-0700f90a1098",
    "internal_metadata": {
        "root_authorized_keys":"@@your pub key@@"
    },
    "resolvers": [ "8.8.8.8" ],
    "nics": [
        {
            "nic_tag": "admin",
            "ip": "10.0.1.221",
            "netmask": "255.255.255.0",
            "gateway": "10.0.0.1",
            "primary": true
        }
    ],
    "quota": 20
}
'
Login to lx zone:
# zlogin <UUID returned from vmadm>
Build mono 4.0.1.44 (also tried 4.0.2.4):
# yum install git autoconf libtool automake gettext gcc-c++ tar wget
# git clone git://github.com/mono/mono.git
# cd mono
# git checkout mono-4.0.1.44
# ./autogen.sh --prefix=/opt/mono
# make get-monolite-latest
# make -j 32
Run VM tests:
# cd mono
# make check
Run bug-18026 only:
# cd tests
# MONO_GC_DEBUG=9 MONO_PATH=/root/mono/mcs/class/lib/net_4_5 ../../runtime/mono-wrapper ./test-runner.exe -j a bug-18026.exe
Even at level 9 the VM seems to only log a small fraction of the GC messages, feels like something is broken there too.
   Engine:
	Host:	       x86_64-unknown-linux-gnu
	Target:	       x86_64-unknown-linux-gnu
	GC:	       sgen and Included Boehm GC with typed GC and parallel mark
	TLS:           __thread
	SIGALTSTACK:   yes
	Engine:        Building and using the JIT
	oprofile:      no
	BigArrays:     no
	DTrace:        no
	LLVM Back End: no (dynamically loaded: no)
   Libraries:
	.NET 4.5:      yes
	MonoDroid:     no
	MonoTouch:     no
	Xamarin.Mac:   no
	JNI support:   IKVM Native
	libgdiplus:    assumed to be installed
	zlib:
I forgot to mention that you can witness the test get stuck by uncommenting the
WriteLine()calls and tailing the stdout file: