Issue: RETURNN: Horovod hang with stalled rank, after some time #323
Files generated via:
gdb -p 1594 -ex 'thread apply all bt' -ex="set confirm off" -ex quit > gdblog.p1494.txt
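The command above attaches to PID 1594 but writes to a file named after 1494. A minimal sketch, assuming the file name is meant to follow the PID, derives both from a single value so they cannot drift apart. The helper names `gdb_backtrace_command` and `as_shell_line` are hypothetical and not part of RETURNN or gdb; the gdb flags are the ones used in the original command.

```python
import shlex
from typing import List, Tuple


def gdb_backtrace_command(pid: int) -> Tuple[List[str], str]:
    """Build the gdb argv and the log file name for one process.

    Hypothetical helper: mirrors the flags of the command above and
    assumes the log file should be named after the same PID.
    """
    argv = [
        "gdb", "-p", str(pid),
        "-ex", "thread apply all bt",  # backtrace of every thread
        "-ex", "set confirm off",      # do not prompt when quitting
        "-ex", "quit",
    ]
    out_file = "gdblog.p%d.txt" % pid
    return argv, out_file


def as_shell_line(pid: int) -> str:
    """Render the command as a single shell line with output redirection."""
    argv, out_file = gdb_backtrace_command(pid)
    return "%s > %s" % (shlex.join(argv), shlex.quote(out_file))


if __name__ == "__main__":
    print(as_shell_line(1594))
```

To actually run it, the argv list can be passed to `subprocess.run(argv, stdout=open(out_file, "w"))`. This requires gdb to be installed and permission to attach to the target process.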
| [639970.703933] Xorg: page allocation failure: order:5, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0 | |
| [639970.703937] CPU: 9 PID: 1823 Comm: Xorg Tainted: P OE 5.4.0-58-generic #64-Ubuntu | |
| [639970.703938] Hardware name: System manufacturer System Product Name/TUF GAMING X570-PLUS (WI-FI), BIOS 1407 04/01/2020 | |
| [639970.703938] Call Trace: | |
| [639970.703942] dump_stack+0x6d/0x9a | |
| [639970.703945] warn_alloc.cold+0x7b/0xdf | |
| [639970.703946] __alloc_pages_slowpath+0xe07/0xe50 | |
| [639970.703947] ? get_page_from_fre |
| /usr/local/bin/python3 "/Users/az/Library/Application Support/JetBrains/Toolbox/apps/PyCharm-C/ch-0/202.7660.27/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/pydevd.py" --multiproc --qt-support=auto --client 127.0.0.1 --port 57798 --file /Users/az/Programmierung/import-parallel-wavegan/pytorch_to_returnn.py --pwg_config mb_melgan.v2.yaml --pwg_checkpoint mb_melgan_models/checkpoint-1000000steps.pkl --features data/features.npy | |
| pydev debugger: process 58079 is connecting | |
| Connected to pydev debugger (build 202.7660.27) | |
| 2020-11-26 18:21:44.088082: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA | |
| 2020-11-26 18:21:44.101822: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fabda13afa0 initialized for platform Host (this does not guarantee that XLA will be used). Devices: | |
| 2020-11-26 18:21:44.101846: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Ve |
| { | |
| 'melgan': { | |
| 'class': 'subnetwork', | |
| 'from': 'data', | |
| 'subnetwork': { | |
| 'layer0': {'class': 'pad', 'mode': 'reflect', 'axes': 'spatial', 'padding': (3, 3), 'from': 'data'}, | |
| 'layer1': { | |
| 'class': 'conv', | |
| 'from': 'layer0', | |
| 'activation': None, |
| .tmp_root: (hidden) | |
| data: None -> None | |
| melgan: <ModuleEntry <Sequential>> -> <TensorEntry name:? tensor:? returnn_data:'layer23_output' [B,T|'spatial:0:melgan/layer16/stack/layer2',F|4] axes {0:0,2:1,1:2}> | |
| data: None -> None | |
| layer0: <ModuleEntry <ReflectionPad1d>> -> <TensorEntry name:? tensor:? returnn_data:'layer0_output' [B,F|80,T|'spatial:1:melgan/layer0'] axes id> | |
| layer1: <ModuleEntry <Conv1d>> -> <TensorEntry name:? tensor:? returnn_data:'layer1_output' [B,F|384,T|'time:var:extern_data:data'] axes id> | |
| layer2: <ModuleEntry <LeakyReLU>> -> <TensorEntry name:? tensor:? returnn_data:'layer2_output' [B,F|384,T|'time:var:extern_data:data'] axes id> | |
| layer3: <ModuleEntry <ConvTranspose1d>> -> <TensorEntry name:? tensor:? returnn_data:'layer3_output' [B,T|'spatial:0:melgan/layer3',F|192] axes {0:0,2:1,1:2}> | |
| layer4: <ModuleEntry <ResidualStack>> -> <TensorEntry name:? tensor:? returnn_data:'add_output' [B,T|'spatial:0:melgan/layer4/stack/layer2',F|192] axes {0:0,2:1,1:2}> | |
| data: None -> None |
| #0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 | |
| #1 0x00007ffff7620dbd in __GI___pthread_mutex_lock (mutex=0xe6cfb0) at ../nptl/pthread_mutex_lock.c:80 | |
| #2 0x00007fffd03f5d91 in google::protobuf::DescriptorPool::FindFileByName(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const () | |
| from /work/tools/asr/python/3.8.0_tf_2.3-v1-generic+cuda10.1/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2 | |
| #3 0x00007fffd044328e in google::protobuf::(anonymous namespace)::AssignDescriptorsImpl(google::protobuf::internal::DescriptorTable const*) () | |
| from /work/tools/asr/python/3.8.0_tf_2.3-v1-generic+cuda10.1/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2 | |
| #4 0x00007ffff7625a99 in __pthread_once_slow ( | |
| once_control=0x7fffd0ee5d74 <descriptor_table_google_2fprotobuf_2fdescriptor_2eproto_once>, | |
| init_routine=0x7fffed339ac0 <std::__once_proxy()>) at pthread_once.c:116 | |
| #5 0x00007fffd0436 |
| #0 0x00007f32dfcaa96f in __GI___poll (fds=0x55d0b6aaa880, nfds=10, timeout=1999) at ../sysdeps/unix/sysv/linux/poll.c:29 | |
| #1 0x00007f32dba101ae in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0 | |
| #2 0x00007f32dba102e3 in g_main_context_iteration () from /lib/x86_64-linux-gnu/libglib-2.0.so.0 | |
| #3 0x00007f32dda07565 in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () | |
| from /lib/x86_64-linux-gnu/libQt5Core.so.5 | |
| #4 0x00007f32dd9ae4db in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /lib/x86_64-linux-gnu/libQt5Core.so.5 | |
| #5 0x00007f32de5cdc6d in QDialog::exec() () from /lib/x86_64-linux-gnu/libQt5Widgets.so.5 | |
| #6 0x00007f32dec1d731 in KMessageBox::createKMessageBox(QDialog*, QDialogButtonBox*, QIcon const&, QString const&, QStringList const&, QString const&, bool*, QFlags<KMessageBox::Option>, QString const&, QMessageBox::Icon) () from /lib/x86_64-linux-gnu/libKF5WidgetsAddons.so.5 | |
| #7 0x00007f32dec1dd01 in KMessageBox::createKMessageBox(QDialog*, QDialogButtonB |
| %% Copyright 2007 Ulf Lindgren | |
| % | |
| % This work may be distributed and/or modified under the conditions of the LaTeX | |
| % Project Public License, either version 1.3 of this license or (at your option) | |
| % any later version. The latest version of this license is in | |
| % http://www.latex-project.org/lppl.txt | |
| % and version 1.3 or later is part of all distributions of LaTeX version | |
| % 2005/12/01 or later. | |
| % | |
| % This work has the LPPL maintenance status `maintained'. |
| Exception in thread Thread-107: | |
| Traceback (most recent call last): | |
| File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner | |
| self.run() | |
| File "/usr/lib/python3.5/threading.py", line 862, in run | |
| self._target(*self._args, **self._kwargs) | |
| File "/home/zeyer/.local/lib/python3.5/site-packages/kivy/input/providers/hidinput.py", line 685, in _thread_run | |
| data = fd.read(struct_input_event_sz) | |
| OSError: [Errno 19] No such device |
I collect screenshots and screencasts for the Python package better_exchook here.