Skip to content

Instantly share code, notes, and snippets.

@rzezeski
Last active September 4, 2015 16:56
Show Gist options
  • Save rzezeski/8a967f9cfdd3976869eb to your computer and use it in GitHub Desktop.
Save rzezeski/8a967f9cfdd3976869eb to your computer and use it in GitHub Desktop.
Mono bug-18026 stuck on CentOS 6.6 lx branded zone

Summary

The Mono VM appears to have a race/deadlock issue when using sgen GC on an lx branded zone. The condition can almost ALWAYS (only once did I see it run through to completion) be reproduced by running test bug-18026. This test stresses the thread pool, task queue, and allocation facilities of the VM -- spawning many tasks performing allocation while concurrently bouncing the thread pool size up and down.

I have seen the VM get stuck in three different ways, but I unfortunately I only have notes for two of them.

(1) Unable to escape stop-the-world (STW) GC

It enters STW and never leaves:

# dtrace -n 'pid$target:*mono*:sgen_stop_world:entry,pid$target:*mono*:sgen_stop_world:return {}' -p 54449
dtrace: description 'pid$target:*mono*:sgen_stop_world:entry,pid$target:*mono*:sgen_stop_world:return ' matched 2 probes
CPU     ID                    FUNCTION:NAME
 48  77343            sgen_stop_world:entry
 48  77344           sgen_stop_world:return
 48  77343            sgen_stop_world:entry
 48  77344           sgen_stop_world:return
 48  77343            sgen_stop_world:entry
 48  77344           sgen_stop_world:return
 48  77343            sgen_stop_world:entry

Specifically, it is stuck in this loop:

https://github.com/mono/mono/blob/mono-4.0.1.44/mono/metadata/sgen-stw.c#L114

While in that loop the restart_count is ALWAYS 1 and the info pointer is always the same.

I also saw this assetion two times:

Assertion at sgen-os-posix.c:103, condition `!info->stack_start' not met

(2) Stuck in threadpool

This seems to be the case the occurs most often. There is some activity at first and then the process gets stuck bouncing between two threads.

Given the functions called by thread 9 I'm pretty sure it's stuck in the async_invoke_thread() function used by the thread pool for running tasks.

https://github.com/mono/mono/blob/mono-4.0.1.44/mono/metadata/threadpool.c#L1673

# dtrace -n 'pid$target:*mono*::entry,pid$target:*mono*::return { printf("[%2d] %d", tid, arg0); }' -p 21570

<... SNIP ...>

  7  87866        mono_sem_timedwait:return [ 9] 406
  7  80232 mono_runtime_is_shutting_down:entry [ 9] 9708008
  7  85702 mono_runtime_is_shutting_down:return [ 9] 6
  7  80548 check_for_interruption_critical:entry [ 9] 9708008
  7  81358    mono_gc_set_skip_thread:entry [ 9] 0
  7  82657   mono_thread_info_current:entry [ 9] 0
  7  87964  mono_thread_info_current:return [ 9] 24
  7  86773   mono_gc_set_skip_thread:return [ 9] 54
  7  81357             sgen_gc_unlock:entry [ 9] 1
  7  86772            sgen_gc_unlock:return [ 9] 32
  7  80337 mono_thread_internal_current:entry [ 9] 9803904
  7  85804 mono_thread_internal_current:return [ 9] 16
  7  86013 check_for_interruption_critical:return [ 9] 36
  7  81358    mono_gc_set_skip_thread:entry [ 9] 1
  7  82657   mono_thread_info_current:entry [ 9] 1
  7  87964  mono_thread_info_current:return [ 9] 24
  7  86773   mono_gc_set_skip_thread:return [ 9] 54
  7  81357             sgen_gc_unlock:entry [ 9] 1
  7  86772            sgen_gc_unlock:return [ 9] 32
  7  81995              mono_cq_count:entry [ 9] 10320848
  7  87358             mono_cq_count:return [ 9] 10
  7  82544         mono_sem_timedwait:entry [ 9] 9708008
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  87681              wapi_SleepEx:return [ 3] 205
 59  80232 mono_runtime_is_shutting_down:entry [ 3] 256
 59  85702 mono_runtime_is_shutting_down:return [ 3] 6
 59  81995              mono_cq_count:entry [ 3] 10419040
 59  87358             mono_cq_count:return [ 3] 10
 59  82523            mono_msec_ticks:entry [ 3] 10419040
 59  82522           mono_100ns_ticks:entry [ 3] 10419040
 59  87847          mono_100ns_ticks:return [ 3] 190
 59  87848           mono_msec_ticks:return [ 3] 82
 59  82346               wapi_SleepEx:entry [ 3] 500
 59  82330  get_current_thread_handle:entry [ 3] 500
 59  82657   mono_thread_info_current:entry [ 3] 500
 59  87964  mono_thread_info_current:return [ 3] 24
 59  87665 get_current_thread_handle:return [ 3] 30
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  87681              wapi_SleepEx:return [ 3] 205
 59  80232 mono_runtime_is_shutting_down:entry [ 3] 256
 59  85702 mono_runtime_is_shutting_down:return [ 3] 6
 59  81995              mono_cq_count:entry [ 3] 10419040
 59  87358             mono_cq_count:return [ 3] 10
 59  82523            mono_msec_ticks:entry [ 3] 10419040
 59  82522           mono_100ns_ticks:entry [ 3] 10419040
 59  87847          mono_100ns_ticks:return [ 3] 190
 59  87848           mono_msec_ticks:return [ 3] 82
 59  82346               wapi_SleepEx:entry [ 3] 500
 59  82330  get_current_thread_handle:entry [ 3] 500
 59  82657   mono_thread_info_current:entry [ 3] 500
 59  87964  mono_thread_info_current:return [ 3] 24
 59  87665 get_current_thread_handle:return [ 3] 30
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  87681              wapi_SleepEx:return [ 3] 205
 59  80232 mono_runtime_is_shutting_down:entry [ 3] 256
 59  85702 mono_runtime_is_shutting_down:return [ 3] 6
 59  81995              mono_cq_count:entry [ 3] 10419040
 59  87358             mono_cq_count:return [ 3] 10
 59  82523            mono_msec_ticks:entry [ 3] 10419040
 59  82522           mono_100ns_ticks:entry [ 3] 10419040
 59  87847          mono_100ns_ticks:return [ 3] 190
 59  87848           mono_msec_ticks:return [ 3] 82
 59  82346               wapi_SleepEx:entry [ 3] 500
 59  82330  get_current_thread_handle:entry [ 3] 500
 59  82657   mono_thread_info_current:entry [ 3] 500
 59  87964  mono_thread_info_current:return [ 3] 24
 59  87665 get_current_thread_handle:return [ 3] 30
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
 59  87681              wapi_SleepEx:return [ 3] 205
 59  80232 mono_runtime_is_shutting_down:entry [ 3] 256
 59  85702 mono_runtime_is_shutting_down:return [ 3] 6
 59  81995              mono_cq_count:entry [ 3] 10419040
 59  87358             mono_cq_count:return [ 3] 10
 59  82523            mono_msec_ticks:entry [ 3] 10419040
 59  82522           mono_100ns_ticks:entry [ 3] 10419040
 59  87847          mono_100ns_ticks:return [ 3] 190
 59  87848           mono_msec_ticks:return [ 3] 82
 59  82346               wapi_SleepEx:entry [ 3] 500
 59  82330  get_current_thread_handle:entry [ 3] 500
 59  82657   mono_thread_info_current:entry [ 3] 500
 59  87964  mono_thread_info_current:return [ 3] 24
 59  87665 get_current_thread_handle:return [ 3] 30
 59  82334   _wapi_thread_apc_pending:entry [ 3] 65546
 59  82332              lookup_thread:entry [ 3] 65546
 59  82061        _wapi_lookup_handle:entry [ 3] 65546
 59  87411       _wapi_lookup_handle:return [ 3] 100
 59  87667             lookup_thread:return [ 3] 32
 59  87669  _wapi_thread_apc_pending:return [ 3] 28
  7  87866        mono_sem_timedwait:return [ 9] 406
  7  80232 mono_runtime_is_shutting_down:entry [ 9] 9708008
  7  85702 mono_runtime_is_shutting_down:return [ 9] 6
  7  80548 check_for_interruption_critical:entry [ 9] 9708008
  7  81358    mono_gc_set_skip_thread:entry [ 9] 0
  7  82657   mono_thread_info_current:entry [ 9] 0
  7  87964  mono_thread_info_current:return [ 9] 24
  7  86773   mono_gc_set_skip_thread:return [ 9] 54
  7  81357             sgen_gc_unlock:entry [ 9] 1
  7  86772            sgen_gc_unlock:return [ 9] 32
  7  80337 mono_thread_internal_current:entry [ 9] 9803904
  7  85804 mono_thread_internal_current:return [ 9] 16
  7  86013 check_for_interruption_critical:return [ 9] 36
  7  81358    mono_gc_set_skip_thread:entry [ 9] 1
  7  82657   mono_thread_info_current:entry [ 9] 1
  7  87964  mono_thread_info_current:return [ 9] 24
  7  86773   mono_gc_set_skip_thread:return [ 9] 54
  7  81357             sgen_gc_unlock:entry [ 9] 1
  7  86772            sgen_gc_unlock:return [ 9] 32
  7  81995              mono_cq_count:entry [ 9] 10320848
  7  87358             mono_cq_count:return [ 9] 10
  7  82544         mono_sem_timedwait:entry [ 9] 9708008

Misc Notes

  • With sgen disabled the test always runs to completion: ./configure –prefix=/opt/mono-boehm –with-gc=boehm –with-sgen=no.
  • No issues running test on CentOS 6.6 on KVM.
  • I saw lx code trip this assert three times (but not yet on newest platform image): Assertion failed: ucp != NULL, file ../common/lx_brand.c, line 797
  • Mono's sgen GC and thread pool makes use of signals to interrupt threads: SIGXCPU & SIGPWR.
  • Test completes when running under strace.
  • When test is stuck can sometimes unstick it when tracing with DTrace or using pstack.

Steps to Reproduce

Create CentOS 6 lx zone:

# vmadm create <<< '
{
    "alias": "centos66-lx",
    "hostname": "centos66-lx",
    "brand": "lx",
    "kernel_version": "2.6.32",
    "max_physical_memory": 7168,
    "image_uuid": "aa11a466-15c4-11e5-add5-0700f90a1098",
    "internal_metadata": {
        "root_authorized_keys":"@@your pub key@@"
    },
    "resolvers": [ "8.8.8.8" ],
    "nics": [
        {
            "nic_tag": "admin",
            "ip": "10.0.1.221",
            "netmask": "255.255.255.0",
            "gateway": "10.0.0.1",
            "primary": true
        }
    ],
    "quota": 20
}
'

Login to lx zone:

# zlogin <UUID returned from vmadm>

Build mono 4.0.1.44 (also tried 4.0.2.4):

# yum install git autoconf libtool automake gettext gcc-c++ tar wget
# git clone git://github.com/mono/mono.git
# cd mono
# git checkout mono-4.0.1.44
# ./autogen.sh --prefix=/opt/mono
# make get-monolite-latest
# make -j 32

Run VM tests:

# cd mono
# make check

Run bug-18026 only:

# cd tests
# MONO_GC_DEBUG=9 MONO_PATH=/root/mono/mcs/class/lib/net_4_5 ../../runtime/mono-wrapper ./test-runner.exe -j a bug-18026.exe

Even at level 9 the VM seems to only log a small fraction of the GC messages, feels like something is broken there too.

Configure Output

   Engine:
	Host:	       x86_64-unknown-linux-gnu
	Target:	       x86_64-unknown-linux-gnu
	GC:	       sgen and Included Boehm GC with typed GC and parallel mark
	TLS:           __thread
	SIGALTSTACK:   yes
	Engine:        Building and using the JIT
	oprofile:      no
	BigArrays:     no
	DTrace:        no
	LLVM Back End: no (dynamically loaded: no)

   Libraries:
	.NET 4.5:      yes
	MonoDroid:     no
	MonoTouch:     no
	Xamarin.Mac:   no
	JNI support:   IKVM Native
	libgdiplus:    assumed to be installed
	zlib:
@rzezeski
Copy link
Author

I forgot to mention that you can witness the test get stuck by uncommenting the WriteLine() calls and tailing the stdout file:

# cd mono/mono/tests
# vim bug-18026.cs
# make bug-18026.exe
# tail -f bug-18026.exe.stdout

@migueldeicaza
Copy link

Would you mind testing Mono/master?

@rzezeski
Copy link
Author

@migueldeicaza Working on running master now.

@rzezeski
Copy link
Author

The test has now run to completion 4 times on master. So it gets stuck on 4.0.1.44 and 4.0.2.4 but not master.

@rzezeski
Copy link
Author

I've seen the build lockup a few times now while building
documentation -- which makes use of mono runtime. Here is a report of
the last lockup. This happened while trying to build 4.0.2.4 with
alternate signal stack disabled --with-sigaltstack=no.

First I start a build in one session:

time make -j 12 2>&1 | tee /var/tmp/make.log

Eventually it gets stuck building docs:

Warning: couldn't process type file `./../class/Novell.Directory.Ldap/Documentation/en/global/SupportClass+SecureRandomSupport.xml' as it doesn't exist
Error reading namespace XML for Novell.Directory.Ldap.Asn1 at ./../class/Novell.Directory.Ldap/Documentation/en/ns-Novell.Directory.Ldap.Asn1.xml
Warning: couldn't process directory `./../class/Commons.Xml.Relaxng/Documentation/en' as it has no index.xml file
Error reading namespace XML for Novell.Directory.Ldap.Rfc2251 at ./../class/Novell.Directory.Ldap/Documentation/en/ns-Novell.Directory.Ldap.Rfc2251.xml
Error reading namespace XML for Novell.Directory.Ldap at ./../class/Novell.Directory.Ldap/Documentation/en/ns-Novell.Directory.Ldap.xml
Error reading namespace XML for Novell.Directory.Ldap.Utilclass at ./../class/Novell.Directory.Ldap/Documentation/en/ns-Novell.Directory.Ldap.Utilclass.xml
Error reading namespace XML for Novell.Directory.Ldap.Extensions at ./../class/Novell.Directory.Ldap/Documentation/en/ns-Novell.Directory.Ldap.Extensions.xml
Warning: couldn't process type file `./../class/Mono.Debugger.Soft/Documentation/en/Mono.Debugger.Soft/AssemblyLoadEventRequest.xml' as it doesn't exist
Error reading namespace XML for Novell.Directory.Ldap.Controls at ./../class/Novell.Directory.Ldap/Documentation/en/ns-Novell.Directory.Ldap.Controls.xml
Warning: couldn't process type file `./../class/Mono.Posix/Documentation/en/Mono.Unix.Native/EpollEvent.xml' as it doesn't exist
Warning: couldn't process type file `./../class/Mono.Posix/Documentation/en/Mono.Unix.Native/EpollEvents.xml' as it doesn't exist
Warning: couldn't process type file `./../class/Mono.Posix/Documentation/en/Mono.Unix.Native/EpollFlags.xml' as it doesn't exist
Warning: couldn't process type file `./../class/Mono.Posix/Documentation/en/Mono.Unix.Native/EpollOp.xml' as it doesn't exist
Warning: couldn't process type file `./../class/Mono.Posix/Documentation/en/Mono.Unix.Native/RealTimeSignum.xml' as it doesn't exist
Warning: couldn't process type file `./../class/Mono.Security.Win32/Documentation/en/Mono.Security.Cryptography/MD2.xml' as it doesn't exist
Warning: couldn't process type file `./../class/Mono.Security.Win32/Documentation/en/Mono.Security.Cryptography/MD4.xml' as it doesn't exist
Error reading namespace XML for Mono.Security.Cryptography at ./../class/System.Core/Documentation/en/ns-Mono.Security.Cryptography.xml
Warning: couldn't process type file `./../class/System.Design/Documentation/en/System.Web.UI.Design.WebControls/DataSourceIDConverter.xml' as it doesn't exist
Warning: couldn't process directory `./../class/System.DirectoryServices/Documentation/en' as it has no index.xml file
Error reading namespace XML for System.Resources at ./../class/System.Web/Documentation/en/ns-System.Resources.xml
Warning: couldn't process type file `./../class/System.Web/Documentation/en/System.Resources/ResXFileRef.xml' as it doesn't exist
Warning: couldn't process type file `./../class/System.Web/Documentation/en/System.Resources/ResXFileRef+Converter.xml' as it doesn't exist
Warning: couldn't process type file `./../class/System.Web/Documentation/en/System.Resources/ResXResourceReader.xml' as it doesn't exist
Warning: couldn't process type file `./../class/System.Web/Documentation/en/System.Resources/ResXResourceWriter.xml' as it doesn't exist

After waiting a while and seeing no progress I open another session to
ddebug. First I get the process tree:

[root@centos66-lx ~]# /native/usr/bin/ptree $(pgrep mono)
3522  zsched
  14475 /bin/login -h zone:global -f root
    14477 -bash
      7174  make -j 12
        7538  make all-recursive
          7539  /bin/sh -c fail= failcom='exit 1'; \ for f in x $MAKEFLAGS; do \   case $f in \
            12936 /bin/sh -c fail= failcom='exit 1'; \ for f in x $MAKEFLAGS; do \   case $f in \
              12937 make all
                12967 /bin/sh -c cd /root/mono/mcs && make --no-print-directory -s NO_DIR_CHECK=1 PRO
                  12968 make --no-print-directory -s NO_DIR_CHECK=1 PROFILES=binary_reference_assemblie
                    12969 make profile-do--binary_reference_assemblies--all profile-do--net_4_5--all prof
                      13774 make PROFILE=net_4_5 all
                        13778 /bin/sh -c set . $MAKEFLAGS; final_exit=:; \ case $2 in --unix) shift ;; esac;
                          24465 /bin/sh -c set . $MAKEFLAGS; final_exit=:; \ case $2 in --unix) shift ;; esac;
                            24466 make all
                              24468 make all-local
                                24469 /bin/sh -c echo "MDOC    [net_4_5] cs-errors.tree"; MONO_PATH="./../class/lib/n
                                  24474 /root/mono/mono/mini/mono --config /root/mono/runtime/etc/mono/config ./../clas
                                    24578 /root/mono/mono/mini/mono --config /root/mono/runtime/etc/mono/config ./../clas

Then I use the Linux pstack to see a trace (I actually mean to run
native illumos pstack but forgot to prefix with /native/usr/bin):

[root@centos66-lx ~]# pstack 24578
#0  0x00007ffffe00e75d in read () from /lib64/libpthread.so.0
#1  0x00000000006071f8 in wapi_CreateProcess ()
#2  0x0000000000575368 in ves_icall_System_Diagnostics_Process_CreateProcess_internal ()
#3  0x000000007fd6e70d in ?? ()
#4  0x00007fffe40014b0 in ?? ()
#5  0x00007ffffd451bb0 in ?? ()
#6  0x0000000000000001 in ?? ()
#7  0x00007ffffd405178 in ?? ()
#8  0x00007ffffd7ebe68 in ?? ()
#9  0x00007fffe40013f0 in ?? ()
#10 0x000000007fd6e28c in ?? ()
#11 0x00007ffffa7ff440 in ?? ()
#12 0x00007ffffa7ff180 in ?? ()
#13 0x000000007fd6d334 in ?? ()
#14 0x00007ffffd7ebcd0 in ?? ()
#15 0x000000007ffc0173 in ?? ()
#16 0x00007fffe40014b0 in ?? ()
#17 0x00000000005827ae in mono_thread_force_interruption_checkpoint ()
#18 0x000000007ffc0173 in ?? ()
#19 0x00007ffffd42afa0 in ?? ()
#20 0x00007ffffa8006d8 in ?? ()
#21 0x0000000000000398 in ?? ()
#22 0x0000000000ad9190 in ?? ()
#23 0x0000000000000025 in ?? ()
#24 0x00007fffe40013f0 in ?? ()
#25 0x00007ffffd42b1a0 in ?? ()
#26 0x00007ffffa7ff2a0 in ?? ()
#27 0x00007ffffa7ff200 in ?? ()
#28 0x0000000000000000 in ?? ()

pstack never returned, had to ^C

Then I used illumos native pstack:

[root@centos66-lx ~]# /native/usr/bin/pstack 24578
24578:  /root/mono/mono/mini/mono --config /root/mono/runtime/etc/mono/config
 00007ffffe00e75d SYS#0    ()

Then I pstack'd the parent:

[root@centos66-lx ~]# /native/usr/bin/pstack 24474
24474:  /root/mono/mono/mini/mono --config /root/mono/runtime/etc/mono/config
-----------------  lwp# 1 / thread# 1  --------------------
 00007fffff34db7a nanosleep (7fffffefe350, 7fffffefe340)
 00007fffff4d28d5 lx_nanosleep (7fffffefe350, 7fffffefe340) + 15
 00007fffff4d01ef lx_emulate (7fffff09eca0, 23, 7fffff09ec70) + bf
-----------------  lwp# 2 / thread# 2  --------------------
 00007ffffe00d930 SYS#0    ()
-----------------  lwp# 3 / thread# 3  --------------------
 00007fffff34db7a nanosleep (7ffffdfeeb10, 0)
 00007fffff4ccd46 lx_clock_nanosleep (1, 1, 7ffffe26dcc0, 0) + 96
 00007fffff4d01ef lx_emulate (7ffffdfeeca0, e6, 7ffffdfeec70) + bf
-----------------  lwp# 4 / thread# 4  --------------------
 00007fffff34e94a sigsuspend (7ffffd26dcf0)
 00007fffff4da1c0 lx_rt_sigsuspend (941fe0, 8) + 40
 00007fffff4d01ef lx_emulate (7ffffd26ddd0, 82, 7ffffd26dda0) + bf
-----------------  lwp# 5 / thread# 5  --------------------
 00007ffffe00e264 SYS#0    ()

Then I used varous dtrace scripts to look for any activity on stuck pid:

Check for any mono calls:

[root@centos66-lx ~]# dtrace -n 'pid$target::*mono*:entry,pid$target::*mono*:return {}' -p 24578
dtrace: description 'pid$target::*mono*:entry,pid$target::*mono*:return ' matched 5684 probes
^C

Check for any calls in the binary at all:

[root@centos66-lx ~]# dtrace -n 'pid$target:a.out::entry,pid$target:a.out::return {}' -p 24578
dtrace: description 'pid$target:a.out::entry,pid$target:a.out::return ' matched 10983 probes
^C

What about libc?

[root@centos66-lx ~]# dtrace -n 'pid$target:libc::entry,pid$target:libc::return {}' -p 24578
dtrace: description 'pid$target:libc::entry,pid$target:libc::return ' matched 5930 probes
^C

Throw hands in the air and try the entire damn pid:

[root@centos66-lx ~]# dtrace -n 'pid$target:::entry,pid$target:::return {}' -p 24578
dtrace: description 'pid$target:::entry,pid$target:::return ' matched 21853 probes
^C

Okay, maybe something is happening kernel side, hop into GZ and check fbt provider:

[root@lx-test ~]# dtrace -n 'fbt:::entry,fbt:::return /pid == $target/ {}' -p 24578
dtrace: description 'fbt:::entry,fbt:::return ' matched 69399 probes
^C

Nothing, okay how about a core dump:

[root@lx-test ~]# gcore 24578
gcore: core.24578 dump failed: unexpected short write

Sigh...maybe the earlier linux pstack call, which uses gdb, is causing
an issue (return to lx zone):

[root@centos66-lx ~] # ps aux
...
root     24598  0.0  0.0  40584   432 ?        T    20:12   0:00 /usr/bin/gdb --quiet --readnever -nx /proc/24578/exe 24578
root     24578  0.0  0.4  68508 32644 ?        T    20:01   0:00 /root/mono/mono/mini/mono --config /root/mono/runtime/etc/mono/config ./..

Start the proc back up:

[root@centos66-lx ~]# prun 24578
[root@centos66-lx ~]# ps u 24578
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND

Okay the pid is gone now:

[root@centos66-lx ~]# /native/usr/bin/ptree $(pgrep mono)
3522  zsched
  14475 /bin/login -h zone:global -f root
    14477 -bash
      7174  make -j 12
        7538  make all-recursive
          7539  /bin/sh -c fail= failcom='exit 1'; \ for f in x $MAKEFLAGS; do \   case $f in \
            12936 /bin/sh -c fail= failcom='exit 1'; \ for f in x $MAKEFLAGS; do \   case $f in \
              12937 make all
                12967 /bin/sh -c cd /root/mono/mcs && make --no-print-directory -s NO_DIR_CHECK=1 PRO
                  12968 make --no-print-directory -s NO_DIR_CHECK=1 PROFILES=binary_reference_assemblie
                    12969 make profile-do--binary_reference_assemblies--all profile-do--net_4_5--all prof
                      13774 make PROFILE=net_4_5 all
                        13778 /bin/sh -c set . $MAKEFLAGS; final_exit=:; \ case $2 in --unix) shift ;; esac;
                          24465 /bin/sh -c set . $MAKEFLAGS; final_exit=:; \ case $2 in --unix) shift ;; esac;
                            24466 make all
                              24468 make all-local
                                24469 /bin/sh -c echo "MDOC    [net_4_5] cs-errors.tree"; MONO_PATH="./../class/lib/n
                                  24474 /root/mono/mono/mini/mono --config /root/mono/runtime/etc/mono/config ./../class

Parent still alive? What is it doing?

[root@centos66-lx ~]# ps u 24474
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     24474  0.6  0.6  72536 48244 ?        Sl   20:01   0:10 /root/mono/mono/mini/mono --config /root/m

[root@centos66-lx ~]# /native/usr/bin/pstack 24474
24474:  /root/mono/mono/mini/mono --config /root/mono/runtime/etc/mono/config
-----------------  lwp# 1 / thread# 1  --------------------
 00007fffff34db7a nanosleep (7fffffefe350, 7fffffefe340)
 00007fffff4d28d5 lx_nanosleep (7fffffefe350, 7fffffefe340) + 15
 00007fffff4d01ef lx_emulate (7fffff09eca0, 23, 7fffff09ec70) + bf
-----------------  lwp# 2 / thread# 2  --------------------
 00007ffffe00d930 SYS#0    ()
-----------------  lwp# 3 / thread# 3  --------------------
 00007fffff34db7a nanosleep (7ffffdfeeb10, 0)
 00007fffff4ccd46 lx_clock_nanosleep (1, 1, 7ffffe26dcc0, 0) + 96
 00007fffff4d01ef lx_emulate (7ffffdfeeca0, e6, 7ffffdfeec70) + bf
-----------------  lwp# 4 / thread# 4  --------------------
 00007fffff34e94a sigsuspend (7ffffd26dcf0)
 00007fffff4da1c0 lx_rt_sigsuspend (941fe0, 8) + 40
 00007fffff4d01ef lx_emulate (7ffffd26ddd0, 82, 7ffffd26dda0) + bf
-----------------  lwp# 5 / thread# 5  --------------------
 00007ffffe00e264 SYS#0    ()
-----------------  lwp# 6 / thread# 6  --------------------
 00007ffffe00e264 SYS#0    ()
-----------------  lwp# 7 / thread# 7  --------------------
 00007ffffe00e264 SYS#0    ()

What the heck, I'll try linux pstack again:

[root@centos66-lx ~]# pstack 24474

[root@centos66-lx ~]# ps u 24474
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND

And it died. Here's the final build terminal output:

/bin/sh: line 1: 24474 Killed                  MONO_PATH="./../class/lib/net_4_5:$MONO_PATH" /root/mono/runtime/mono-wrapper ./../class/lib/net_4_5/mdoc.exe --debug assemble -o cs-errors -f error cs-errors.config
make[7]: *** [cs-errors.tree] Error 137
make[6]: *** [do-all] Error 2
make[5]: *** [all-recursive] Error 1
make[4]: *** [profile-do--net_4_5--all] Error 2
make[3]: *** [profiles-do--all] Error 2
make[2]: *** [all-local] Error 2
make[2]: Leaving directory `/root/mono/runtime'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/mono'
make: *** [all] Error 2

real    34m51.890s
user    7m1.720s
sys 2m16.718s

@rzezeski
Copy link
Author

Build process got stuck again.

make[2]: Entering directory `/root/mono/runtime'
if test -w /root/mono/mcs; then :; else chmod -R +w /root/mono/mcs; fi
cd /root/mono/mcs && make --no-print-directory -s NO_DIR_CHECK=1 PROFILES='binary_reference_assemblies net_4_5 xbuild_12 xbuild_14   ' CC='gcc' all-profiles
make[6]: mcs: Command not found
make[6]: *** [build/deps/basic-profile-check.exe] Error 127
*** The compiler 'mcs' doesn't appear to be usable.
*** Trying the 'monolite' directory.
Bootstrap compiler: Mono C# compiler version 4.1.0.0
Makefile:437: warning: overriding commands for target `clean-local'
../../build/library.make:206: warning: ignoring old commands for target `clean-local'
Makefile:437: warning: overriding commands for target `clean-local'
../../build/library.make:206: warning: ignoring old commands for target `clean-local'
MDOC    [net_4_5] cs-errors.tree

Here is pstack:

[root@centos66-lx ~]# /native/usr/bin/pstack 26750
26750:  /root/mono/mono/mini/mono --config /root/mono/runtime/etc/mono/config
-----------------  lwp# 1 / thread# 1  --------------------
 00007fffff34db7a nanosleep (7fffffefe350, 7fffffefe340)
 00007fffff4d28d5 lx_nanosleep (7fffffefe350, 7fffffefe340) + 15
 00007fffff4d01ef lx_emulate (7fffff08eca0, 23, 7fffff08ec70) + bf
-----------------  lwp# 2 / thread# 2  --------------------
 00007ffffe00d930 SYS#0    ()
-----------------  lwp# 3 / thread# 3  --------------------
 00007fffff34db7a nanosleep (7ffffd272b10, 0)
 00007fffff4ccd46 lx_clock_nanosleep (1, 1, 7ffffdffdcc0, 0) + 96
 00007fffff4d01ef lx_emulate (7ffffd272ca0, e6, 7ffffd272c70) + bf
-----------------  lwp# 4 / thread# 4  --------------------
 00007fffff34e94a sigsuspend (7ffffc979cf0)
 00007fffff4da1c0 lx_rt_sigsuspend (941fe0, 8) + 40
 00007fffff4d01ef lx_emulate (7ffffc979dd0, 82, 7ffffc979da0) + bf
-----------------  lwp# 5 / thread# 5  --------------------
 00007ffffe00e264 SYS#0    ()

Use mdb -k from GZ to determine which Linux syscall threads 2 & 5
are stuck on:

> 0t26750 ::pid2proc | ::walk thread | ::print kthread_t t_sysnum
t_sysnum = 0xc7
t_sysnum = 0
t_sysnum = 0xc7
t_sysnum = 0x60
t_sysnum = 0
> 0t26750 ::pid2proc | ::walk thread | ::print kthread_t t_lwp | ::print klwp_t lwp_brand | ::print lx_lwp_data_t br_syscall_num
br_syscall_num = 0x23
br_syscall_num = 0xca
br_syscall_num = 0xe6
br_syscall_num = 0x82
br_syscall_num = 0xca

0xca is 202 which maps to futex on 64-bit. Implemented here:

https://github.com/joyent/illumos-joyent/blob/master/usr/src/uts/common/brand/lx/syscall/lx_futex.c

Now that I understand how to easily tease out the Linux syscall these
threads (thanks to Josh Clulow!) are stuck on I can go back to the
bug-18026 test and verify it is stuck for the same reason. I'll post
my findings in a separate comment.

@bcantrill
Copy link

Following your steps to reproduce. As per the mono README, I did a make get-monolite-latest to get a version of Mono suitable enough to bootstrap the build process.

@rzezeski
Copy link
Author

@bcantrill I included that in my list of steps but I inserted the configure output before it so it was easy to miss. I cleaned up my gist so it's easier to follow.

https://gist.github.com/rzezeski/8a967f9cfdd3976869eb#steps-to-reproduce

@rzezeski
Copy link
Author

This morning I was able to replicate what I saw during the stuck build
but instead running the original test: bug-18026. Like the build, it
had threads stuck in futex.

Terminal 1 (lx zone)

Uncomment WriteLine calls in test:

cd mono/mono/tests
vim bug-18026.cs

Run the test:

# MONO_PATH=/root/mono/mcs/class/lib/net_4_5 ../../runtime/mono-wrapper ./test-runner.exe -j a bug-18026.exe
Running tests:

Terminal 2 (lx zone)

tail the stdout for the test:

# cd mono
# tail -f mono/tests/bug-18026.exe.std*
10000
new
9999
done 0
9998
done 1
9997
done 2
done 3
9996
9995
9994
done 4
done 5
9993
9992
done 6
9991
done 7
done 8
done 9
...

Test gets stuck...

Terminal 3 (lx zone)

Run pstack on stuck process:

# /native/usr/bin/ptree $(pgrep mono)
5330  zsched
  5440  init
    5662  /usr/sbin/sshd
      9115  /usr/sbin/sshd -R
        396   -bash
          603   /root/mono/mono/mini/mono --config /root/mono/runtime/etc/mono/config ./test-ru
            620   /root/mono/mono/mini/mono --config /root/mono/runtime/etc/mono/config bug-18026

# /native/usr/bin/pstack 620
620:    /root/mono/mono/mini/mono --config /root/mono/runtime/etc/mono/config
-----------------  lwp# 1 / thread# 1  --------------------
 00007ffffe00b5bc SYS#0    ()
-----------------  lwp# 2 / thread# 2  --------------------
 00007ffffe00d930 SYS#0    ()
-----------------  lwp# 3 / thread# 3  --------------------
 00007fffff34da4a nanosleep (7ffffeabeb10, 0)
 00007fffff4cce46 lx_clock_nanosleep (1, 1, 7ffffeb0dcc0, 0) + 96
 00007fffff4d02ef lx_emulate (7ffffeabeca0, e6, 7ffffeabec70) + bf
-----------------  lwp# 4 / thread# 4  --------------------
 00007ffffe00e6fd SYS#0    ()
-----------------  lwp# 5 / thread# 5  --------------------
 00007ffffe00b5bc SYS#0    ()
-----------------  lwp# 6 / thread# 6  --------------------
 00007ffffe00b5bc SYS#0    ()
-----------------  lwp# 7 / thread# 7  --------------------
 00007ffffe00b5bc SYS#0    ()
-----------------  lwp# 8 / thread# 8  --------------------
 00007ffffe00b5bc SYS#0    ()
-----------------  lwp# 9 / thread# 9  --------------------
 00007ffffe00b5bc SYS#0    ()
-----------------  lwp# 10 / thread# 10  --------------------
 00007ffffe00b5bc SYS#0    ()
-----------------  lwp# 11 / thread# 11  --------------------
 00007ffffe00da21 SYS#0    ()

Terminal 4 (global zone)

Determine which Linux syscalls these threads are stuck in:

# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs sd lofs idm sata crypto random cpc logindmux ptm kvm sppp nsmb smbsrv nfs ipc ]
> 0t620 ::pid2proc | ::walk thread | ::print kthread_t t_sysnum
t_sysnum = 0
t_sysnum = 0
t_sysnum = 0xc7
t_sysnum = 0
t_sysnum = 0
t_sysnum = 0
t_sysnum = 0
t_sysnum = 0
t_sysnum = 0
t_sysnum = 0
t_sysnum = 0
> 0t620 ::pid2proc | ::walk thread | ::print kthread_t t_lwp | ::print klwp_t lwp_brand | ::print lx_lwp_data_t br_syscall_num
br_syscall_num = 0xca
br_syscall_num = 0xca
br_syscall_num = 0xe6
br_syscall_num = 0x1
br_syscall_num = 0xca
br_syscall_num = 0xca
br_syscall_num = 0xca
br_syscall_num = 0xca
br_syscall_num = 0xca
br_syscall_num = 0xca
br_syscall_num = 0xca

All stuck in futex.

@rzezeski
Copy link
Author

This thread about the recent futex_wait bug in the Linux sounds a lot like what I'm experiencing.

  • User application mysteriously locking up.
  • Using debugging tools causes it to get unstuck.
  • Often happens around GC.

https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64

The problem is there is no Linux kernel in lx. So it can't literally be this bug but perhaps a similar mistake was made in the lx futex implementation? FWIW, I've reproduced this on an older Sandybridge i7 and a new Haswell Xeon.

I also see no direct usage of futex(2) in the mono VM code, so I imagine other calls are mapped to futex by glibc?

@rzezeski
Copy link
Author

I dug into my stuck proc a bit more to determine the arguments to
lx_futex() and see if that might shed some light on the problem. I'm
not 100% sure I pulled out the arguments correctly but the values seem
to make sense (e.g. arg1 for all 8 lwps points to the same address and
all of them are 0x80 which is FUTEX_PRIVATE_FLAG, yet another thing
in common with the recent Linux futex bug).

First the stacks:

> 0t620 ::pid2proc | ::walk thread | ::stacks -a
THREAD           STATE    SOBJ                COUNT
fffff085cb860080 SLEEP    CV                      8
fffff6c94a4950c0 SLEEP    CV                      -
fffff6c94a187b80 SLEEP    CV                      -
fffff6c94a1790c0 SLEEP    CV                      -
fffff6c94a13d820 SLEEP    CV                      -
fffff6c94a129000 SLEEP    CV                      -
fffff6c93dc387e0 SLEEP    CV                      -
fffff53096b03ba0 SLEEP    CV                      -
                 swtch+0x141
                 cv_wait_sig_swap_core+0x1b9
                 cv_wait_sig_swap+0x17
                 cv_waituntil_sig+0xbd
                 futex_wait+0x148
                 lx_futex+0x290
                 lx_syscall_enter+0x16a
                 sys_syscall+0x145

THREAD           STATE    SOBJ                COUNT
fffff54dba7a2180 SLEEP    CV                      1
                 swtch+0x141
                 cv_timedwait_sig_hires+0x39d
                 cv_waituntil_sig+0xfa
                 futex_wait+0x148
                 lx_futex+0x290
                 lx_syscall_enter+0x16a
                 sys_syscall+0x145

THREAD           STATE    SOBJ                COUNT
fffff08768909480 SLEEP    CV                      1
                 swtch+0x141
                 cv_timedwait_sig_hires+0x39d
                 cv_waituntil_sig+0xfa
                 nanosleep+0x19f

THREAD           STATE    SOBJ                COUNT
fffff6c94a1a3040 SLEEP    CV                      1
                 swtch+0x141
                 cv_wait_sig_swap_core+0x1b9
                 cv_wait_sig_swap+0x17
                 fifo_write+0x293
                 fop_write+0x8b
                 write+0x250
                 lx_write+0x3a
                 lx_syscall_enter+0x16a
                 sys_syscall+0x145

Then I teased out the the arguments to each lx_futex() call. Note that p_model = 0x200000 so I assume the first branch of lx_emulate_args() was taken.

https://github.com/joyent/illumos-joyent/blob/master/usr/src/uts/common/brand/lx/os/lx_syscall.c#L118

> fffff085cb860080::print kthread_t t_lwp | ::print klwp_t lwp_regs | ::print struct regs r_rdi r_rsi r_rdx r_r10 r_r8 r_r9
r_rdi = 0x99084c
r_rsi = 0x80
r_rdx = 0x107e
r_r10 = 0
r_r8 = 0x990800
r_r9 = 0x83d
> fffff6c94a4950c0::print kthread_t t_lwp | ::print klwp_t lwp_regs | ::print struct regs r_rdi r_rsi r_rdx r_r10 r_r8 r_r9
r_rdi = 0x99084c
r_rsi = 0x80
r_rdx = 0x107b
r_r10 = 0
r_r8 = 0x990800
r_r9 = 0x83d
> fffff6c94a187b80::print kthread_t t_lwp | ::print klwp_t lwp_regs | ::print struct regs r_rdi r_rsi r_rdx r_r10 r_r8 r_r9
r_rdi = 0x99084c
r_rsi = 0x80
r_rdx = 0x107c
r_r10 = 0
r_r8 = 0x990800
r_r9 = 0x83d
> fffff6c94a1790c0::print kthread_t t_lwp | ::print klwp_t lwp_regs | ::print struct regs r_rdi r_rsi r_rdx r_r10 r_r8 r_r9
r_rdi = 0x99084c
r_rsi = 0x80
r_rdx = 0x107d
r_r10 = 0
r_r8 = 0x990800
r_r9 = 0x83d
> fffff6c94a13d820::print kthread_t t_lwp | ::print klwp_t lwp_regs | ::print struct regs r_rdi r_rsi r_rdx r_r10 r_r8 r_r9
r_rdi = 0x99084c
r_rsi = 0x80
r_rdx = 0x107f
r_r10 = 0
r_r8 = 0x990800
r_r9 = 0x83d
> fffff6c94a129000::print kthread_t t_lwp | ::print klwp_t lwp_regs | ::print struct regs r_rdi r_rsi r_rdx r_r10 r_r8 r_r9
r_rdi = 0x942700
r_rsi = 0x80
r_rdx = 0
r_r10 = 0
r_r8 = 0
r_r9 = 0x27b
> fffff6c93dc387e0::print kthread_t t_lwp | ::print klwp_t lwp_regs | ::print struct regs r_rdi r_rsi r_rdx r_r10 r_r8 r_r9
r_rdi = 0x99084c
r_rsi = 0x80
r_rdx = 0x1080
r_r10 = 0
r_r8 = 0x990800
r_r9 = 0x83d
> fffff53096b03ba0::print kthread_t t_lwp | ::print klwp_t lwp_regs | ::print struct regs r_rdi r_rsi r_rdx r_r10 r_r8 r_r9
r_rdi = 0x99084c
r_rsi = 0x80
r_rdx = 0x1081
r_r10 = 0
r_r8 = 0x990800
r_r9 = 0x83d

All are waiting on the same address. I have very little knowledge of futex but I find two things odd:

  1. The val argument (r_rdx) seems way high.
  2. One of the treads has val=0, compared to the much higher values
    for the other threads.

Finally, here are the args for the last remaining futex call which has
an op=0x189 (FUTEX_WAIT_BITSET | FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME).

> fffff54dba7a2180::print kthread_t t_lwp | ::print klwp_t lwp_regs | ::print struct regs r_rdi r_rsi r_rdx r_r10 r_r8 r_r9
r_rdi = 0x9421e8
r_rsi = 0x189
r_rdx = 0
r_r10 = 0x7fffeddffca0
r_r8 = 0
r_r9 = 0xffffffff
>

@jjelinek
Copy link

Some of this seems similar to the issue we have reported as OS-4562, although the version of mono we get by default on Ubuntu 14.04 is 3.2.8. However, it looks like we're also stuck in GC in a similar way.

@rzezeski
Copy link
Author

rzezeski commented Sep 4, 2015

Some of this seems similar to the issue we have reported as OS-4562, although the version of mono we get by default on Ubuntu 14.04 is 3.2.8. However, it looks like we're also stuck in GC in a similar way.

Yes, OS-4562 has fixed the issue I reported here. Thank you.

Commit and platform image URL for anyone else who happens upon this gist:

TritonDataCenter/illumos-joyent@5b40dcb

http://us-east.manta.joyent.com/Joyent_Dev/public/SmartOS/smartos.html#20150903T073920Z

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment