Skip to content

Instantly share code, notes, and snippets.

@rzezeski
Last active September 4, 2015 16:56
Show Gist options
  • Save rzezeski/8a967f9cfdd3976869eb to your computer and use it in GitHub Desktop.
Save rzezeski/8a967f9cfdd3976869eb to your computer and use it in GitHub Desktop.
Mono bug-18026 stuck on CentOS 6.6 lx branded zone

Summary

The Mono VM appears to have a race/deadlock issue when using sgen GC on an lx branded zone. The condition can almost ALWAYS (only once did I see it run through to completion) be reproduced by running test bug-18026. This test stresses the thread pool, task queue, and allocation facilities of the VM -- spawning many tasks performing allocation while concurrently bouncing the thread pool size up and down.

I have seen the VM get stuck in three different ways, but I unfortunately I only have notes for two of them.

(1) Unable to escape stop-the-world (STW) GC

It enters STW and never leaves:

# dtrace -n 'pid$target:*mono*:sgen_stop_world:entry,pid$target:*mono*:sgen_stop_world:return {}' -p 54449
dtrace: description 'pid$target:*mono*:sgen_stop_world:entry,pid$target:*mono*:sgen_stop_world:return ' matched 2 probes
CPU     ID                    FUNCTION:NAME
 48  77343            sgen_stop_world:entry
 48  77344           sgen_stop_world:return
 48  77343            sgen_stop_world:entry
 48  77344           sgen_stop_world:return
 48  77343            sgen_stop_world:entry
 48  77344           sgen_stop_world:return
 48  77343            sgen_stop_world:entry

Specifically, it is stuck in this loop:

https://github.com/mono/mono/blob/mono-4.0.1.44/mono/metadata/sgen-stw.c#L114

While in that loop the restart_count is ALWAYS 1 and the info pointer is always the same.

I also saw this assetion two times:

Assertion at sgen-os-posix.c:103, condition `!info->stack_start' not met

(2) Stuck in threadpool

This seems to be the case the occurs most often. There is some activity at first and then the process gets stuck bouncing between two threads.

Given the functions called by thread 9 I'm pretty sure it's stuck in the async_invoke_thread() function used by the thread pool for running tasks.

https://github.com/mono/mono/blob/mono-4.0.1.44/mono/metadata/threadpool.c#L1673

# dtrace -n 'pid$target:*mono*::entry,pid$target:*mono*::return { printf("[%2d] %d", tid, arg0); }' -p 21570

<... SNIP ...>

  7  87866        mono_sem_timedwait:return [ 9] 406
  7  80232 mono_runtime_is_shutting_down:entry [ 9] 9708008
  7  85702 mono_runtime_is_shutting_down:return [ 9] 6
  7  80548 check_for_interruption_critical:entry [ 9] 9708008
  7  81358    mono_gc_set_skip_thread:entry [ 9] 0
  7  82657   mono_thread_info_current:entry [ 9] 0
  7  87964  mono_thread_info_current:return [ 9] 24
  7  86773   mono_gc_set_skip_thread:return [ 9] 54
  7  81357             sgen_gc_unlock:entry [ 9] 1
  7  86772            sgen_gc_unlock:return [ 9] 32
  7  80337 mono_thread_internal_current:entry [ 9] 9803904
  7  85804 mono_thread_internal_current:return [ 9] 16
  7  86013 check_for_interruption_critical:return [ 9] 36
  7  81358    mono_gc_set_skip_thread:entry [ 9] 1
  7  82657   mono_thread_info_current:entry [ 9] 1
  7  87964  mono_thread_info_current:return [ 9] 24
  7  86773   mono_gc_set_skip_thread:return [ 9] 54
  7  81357             sgen_gc_unlock:entry [ 9] 1
  7  86772            sgen_gc_unlock:return [ 9] 32
  7  81995              mono_cq_count:entry [ 9] 10320848
  7  87358             mono_cq_count:return [ 9] 10
  7  82544         mono_sem_timedwait:entry [ 9] 9708008
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  87681              wapi_SleepEx:return [ 3] 205
 59  80232 mono_runtime_is_shutting_down:entry [ 3] 256
 59  85702 mono_runtime_is_shutting_down:return [ 3] 6
 59  81995              mono_cq_count:entry [ 3] 10419040
 59  87358             mono_cq_count:return [ 3] 10
 59  82523            mono_msec_ticks:entry [ 3] 10419040
 59  82522           mono_100ns_ticks:entry [ 3] 10419040
 59  87847          mono_100ns_ticks:return [ 3] 190
 59  87848           mono_msec_ticks:return [ 3] 82
 59  82346               wapi_SleepEx:entry [ 3] 500
 59  82330  get_current_thread_handle:entry [ 3] 500
 59  82657   mono_thread_info_current:entry [ 3] 500
 59  87964  mono_thread_info_current:return [ 3] 24
 59  87665 get_current_thread_handle:return [ 3] 30
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  87681              wapi_SleepEx:return [ 3] 205
 59  80232 mono_runtime_is_shutting_down:entry [ 3] 256
 59  85702 mono_runtime_is_shutting_down:return [ 3] 6
 59  81995              mono_cq_count:entry [ 3] 10419040
 59  87358             mono_cq_count:return [ 3] 10
 59  82523            mono_msec_ticks:entry [ 3] 10419040
 59  82522           mono_100ns_ticks:entry [ 3] 10419040
 59  87847          mono_100ns_ticks:return [ 3] 190
 59  87848           mono_msec_ticks:return [ 3] 82
 59  82346               wapi_SleepEx:entry [ 3] 500
 59  82330  get_current_thread_handle:entry [ 3] 500
 59  82657   mono_thread_info_current:entry [ 3] 500
 59  87964  mono_thread_info_current:return [ 3] 24
 59  87665 get_current_thread_handle:return [ 3] 30
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  87681              wapi_SleepEx:return [ 3] 205
 59  80232 mono_runtime_is_shutting_down:entry [ 3] 256
 59  85702 mono_runtime_is_shutting_down:return [ 3] 6
 59  81995              mono_cq_count:entry [ 3] 10419040
 59  87358             mono_cq_count:return [ 3] 10
 59  82523            mono_msec_ticks:entry [ 3] 10419040
 59  82522           mono_100ns_ticks:entry [ 3] 10419040
 59  87847          mono_100ns_ticks:return [ 3] 190
 59  87848           mono_msec_ticks:return [ 3] 82
 59  82346               wapi_SleepEx:entry [ 3] 500
 59  82330  get_current_thread_handle:entry [ 3] 500
 59  82657   mono_thread_info_current:entry [ 3] 500
 59  87964  mono_thread_info_current:return [ 3] 24
 59  87665 get_current_thread_handle:return [ 3] 30
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  87681              wapi_SleepEx:return [ 3] 205
 59  80232 mono_runtime_is_shutting_down:entry [ 3] 256
 59  85702 mono_runtime_is_shutting_down:return [ 3] 6
 59  81995              mono_cq_count:entry [ 3] 10419040
 59  87358             mono_cq_count:return [ 3] 10
 59  82523            mono_msec_ticks:entry [ 3] 10419040
 59  82522           mono_100ns_ticks:entry [ 3] 10419040
 59  87847          mono_100ns_ticks:return [ 3] 190
 59  87848           mono_msec_ticks:return [ 3] 82
 59  82346               wapi_SleepEx:entry [ 3] 500
 59  82330  get_current_thread_handle:entry [ 3] 500
 59  82657   mono_thread_info_current:entry [ 3] 500
 59  87964  mono_thread_info_current:return [ 3] 24
 59  87665 get_current_thread_handle:return [ 3] 30
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
  7  87866        mono_sem_timedwait:return [ 9] 406
  7  80232 mono_runtime_is_shutting_down:entry [ 9] 9708008
  7  85702 mono_runtime_is_shutting_down:return [ 9] 6
  7  80548 check_for_interruption_critical:entry [ 9] 9708008
  7  81358    mono_gc_set_skip_thread:entry [ 9] 0
  7  82657   mono_thread_info_current:entry [ 9] 0
  7  87964  mono_thread_info_current:return [ 9] 24
  7  86773   mono_gc_set_skip_thread:return [ 9] 54
  7  81357             sgen_gc_unlock:entry [ 9] 1
  7  86772            sgen_gc_unlock:return [ 9] 32
  7  80337 mono_thread_internal_current:entry [ 9] 9803904
  7  85804 mono_thread_internal_current:return [ 9] 16
  7  86013 check_for_interruption_critical:return [ 9] 36
  7  81358    mono_gc_set_skip_thread:entry [ 9] 1
  7  82657   mono_thread_info_current:entry [ 9] 1
  7  87964  mono_thread_info_current:return [ 9] 24
  7  86773   mono_gc_set_skip_thread:return [ 9] 54
  7  81357             sgen_gc_unlock:entry [ 9] 1
  7  86772            sgen_gc_unlock:return [ 9] 32
  7  81995              mono_cq_count:entry [ 9] 10320848
  7  87358             mono_cq_count:return [ 9] 10
  7  82544         mono_sem_timedwait:entry [ 9] 9708008

Misc Notes

  • With sgen disabled the test always runs to completion: ./configure –prefix=/opt/mono-boehm –with-gc=boehm –with-sgen=no.
  • No issues running test on CentOS 6.6 on KVM.
  • I saw lx code trip this assert three times (but not yet on newest platform image): Assertion failed: ucp != NULL, file ../common/lx_brand.c, line 797
  • Mono's sgen GC and thread pool makes use of signals to interrupt threads: SIGXCPU & SIGPWR.
  • Test completes when running under strace.
  • When test is stuck can sometimes unstick it when tracing with DTrace or using pstack.

Steps to Reproduce

Create CentOS 6 lx zone:

# vmadm create <<< '
{
    "alias": "centos66-lx",
    "hostname": "centos66-lx",
    "brand": "lx",
    "kernel_version": "2.6.32",
    "max_physical_memory": 7168,
    "image_uuid": "aa11a466-15c4-11e5-add5-0700f90a1098",
    "internal_metadata": {
        "root_authorized_keys":"@@your pub key@@"
    },
    "resolvers": [ "8.8.8.8" ],
    "nics": [
        {
            "nic_tag": "admin",
            "ip": "10.0.1.221",
            "netmask": "255.255.255.0",
            "gateway": "10.0.0.1",
            "primary": true
        }
    ],
    "quota": 20
}
'

Login to lx zone:

# zlogin <UUID returned from vmadm>

Build mono 4.0.1.44 (also tried 4.0.2.4):

# yum install git autoconf libtool automake gettext gcc-c++ tar wget
# git clone git://github.com/mono/mono.git
# cd mono
# git checkout mono-4.0.1.44
# ./autogen.sh --prefix=/opt/mono
# make get-monolite-latest
# make -j 32

Run VM tests:

# cd mono
# make check

Run bug-18026 only:

# cd tests
# MONO_GC_DEBUG=9 MONO_PATH=/root/mono/mcs/class/lib/net_4_5 ../../runtime/mono-wrapper ./test-runner.exe -j a bug-18026.exe

Even at level 9 the VM seems to only log a small fraction of the GC messages, feels like something is broken there too.

Configure Output

   Engine:
	Host:	       x86_64-unknown-linux-gnu
	Target:	       x86_64-unknown-linux-gnu
	GC:	       sgen and Included Boehm GC with typed GC and parallel mark
	TLS:           __thread
	SIGALTSTACK:   yes
	Engine:        Building and using the JIT
	oprofile:      no
	BigArrays:     no
	DTrace:        no
	LLVM Back End: no (dynamically loaded: no)

   Libraries:
	.NET 4.5:      yes
	MonoDroid:     no
	MonoTouch:     no
	Xamarin.Mac:   no
	JNI support:   IKVM Native
	libgdiplus:    assumed to be installed
	zlib:
@rzezeski
Copy link
Author

rzezeski commented Sep 4, 2015

Some of this seems similar to the issue we have reported as OS-4562, although the version of mono we get by default on Ubuntu 14.04 is 3.2.8. However, it looks like we're also stuck in GC in a similar way.

Yes, OS-4562 has fixed the issue I reported here. Thank you.

Commit and platform image URL for anyone else who happens upon this gist:

TritonDataCenter/illumos-joyent@5b40dcb

http://us-east.manta.joyent.com/Joyent_Dev/public/SmartOS/smartos.html#20150903T073920Z

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment