#0 0x00007ffe84fe4b39 in clock_gettime ()
#1 0x00007ff84f337876 in __GI___clock_gettime (clock_id=4, tp=0x7ffe84ec4950) at ../sysdeps/unix/clock_gettime.c:115
#2 0x00007ff83f9a9c4e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007ff83fa388d3 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007ff83f9927cc in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5 0x00007ff83f992929 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6 0x00007ff83f8a38e7 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7 0x00007ff83f9ea2f2 in cuStreamSynchronize () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#8 0x00007ff82c7d6964 in cudart::cudaApiStreamSynchronize(CUstream_st*) () from /home/ubuntu/anaconda3/envs/pytorch_source/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so
#9 0x00007ff82c814c9d in cudaStreamSynchronize () from /home/ubuntu/anaconda3/envs/pytorch_source/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so
CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr=192.168.62.15 --master_port=6006 train_minimal.py; echo $? > /tmp/tmux/10.1542748959789204.out
Distributed initializing process group
Distributed initializing process group
Distributed initializing process group
Distributed initializing process group
Distributed: success (0/8)
Loading model
Distributed: success (1/8)
Loading model
Attaching to program: /home/ubuntu/anaconda3/envs/pytorch_source/bin/python, process 3936
[New LWP 3963]
[New LWP 3966]
[New LWP 3989]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007ffd67f80b39 in clock_gettime ()
(gdb) bt
#0 0x00007ffd67f80b39 in clock_gettime ()
#1 0x00007f1b4d5f3876 in __GI___clock_gettime (clock_id=4, tp=0x7ffd67ef2780) at ../sysdeps/unix/clock_gettime.c:115
import os
import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn
import torch.utils.data
import torch.utils.data.distributed
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
cudnn.benchmark = True
Distributed initializing process group
Loading model
Loading distributed
Forward
Backward
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-1-a0b9fdb95505> in <module>()
33 print('Backward')
34 loss = out.sum()
algo-1-X9QH2_1 | [2018-12-11 01:17:04 +0000] [60] [ERROR] Socket error processing request.
algo-1-X9QH2_1 | Traceback (most recent call last):
algo-1-X9QH2_1 | File "/usr/local/lib/python3.7/dist-packages/gunicorn/workers/base_async.py", line 66, in handle
algo-1-X9QH2_1 | six.reraise(*sys.exc_info())
algo-1-X9QH2_1 | File "/usr/local/lib/python3.7/dist-packages/gunicorn/six.py", line 625, in reraise
algo-1-X9QH2_1 | raise value
algo-1-X9QH2_1 | File "/usr/local/lib/python3.7/dist-packages/gunicorn/workers/base_async.py", line 56, in handle
algo-1-X9QH2_1 | self.handle_request(listener_name, req, client, addr)
algo-1-X9QH2_1 | File "/usr/local/lib/python3.7/dist-packages/gunicorn/workers/ggevent.py", line 160, in handle_request
algo-1-X9QH2_1 | addr)
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import torch.nn as nn\n",