Andrew Shaw (bearpelican)

GDB backtrace of a worker process hung in cuStreamSynchronize:
#0 0x00007ffe84fe4b39 in clock_gettime ()
#1 0x00007ff84f337876 in __GI___clock_gettime (clock_id=4, tp=0x7ffe84ec4950) at ../sysdeps/unix/clock_gettime.c:115
#2 0x00007ff83f9a9c4e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007ff83fa388d3 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007ff83f9927cc in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5 0x00007ff83f992929 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6 0x00007ff83f8a38e7 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7 0x00007ff83f9ea2f2 in cuStreamSynchronize () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#8 0x00007ff82c7d6964 in cudart::cudaApiStreamSynchronize(CUstream_st*) () from /home/ubuntu/anaconda3/envs/pytorch_source/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so
#9 0x00007ff82c814c9d in cudaStreamSynchronize () from /home/ubuntu/anaconda3/envs/pytorch_source/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so
bearpelican / distributed_nccl_debug.txt
Last active November 20, 2018 21:29
2 × p3.8xlarge instances (4 GPUs each, 8 GPUs total)
CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr=192.168.62.15 --master_port=6006 train_minimal.py; echo $? > /tmp/tmux/10.1542748959789204.out
Distributed initializing process group
Distributed initializing process group
Distributed initializing process group
Distributed initializing process group
Distributed: success (0/8)
Loading model
Distributed: success (1/8)
Loading model
Attaching to program: /home/ubuntu/anaconda3/envs/pytorch_source/bin/python, process 3936
[New LWP 3963]
[New LWP 3966]
[New LWP 3989]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007ffd67f80b39 in clock_gettime ()
(gdb) bt
#0 0x00007ffd67f80b39 in clock_gettime ()
#1 0x00007f1b4d5f3876 in __GI___clock_gettime (clock_id=4, tp=0x7ffd67ef2780) at ../sysdeps/unix/clock_gettime.c:115
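Attaching gdb, as above, gives the C-level view of the hang. The Python-side thread stacks of a stuck worker can be dumped with the stdlib `faulthandler` module instead. A minimal sketch, not from the gist (Unix-only, since it registers a signal handler):

```python
import faulthandler
import signal
import tempfile

# Registering a signal lets you ask a live, hung process for its Python
# thread stacks from the outside: `kill -USR1 <pid>`, then read the
# process's stderr. Unix-only.
faulthandler.register(signal.SIGUSR1)

# The same dump can also be produced programmatically. faulthandler writes
# to a real file descriptor, so use a temporary file rather than StringIO.
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback(file=f)
    f.seek(0)
    dump = f.read()

print("Current thread" in dump)
```

This is complementary to gdb: gdb shows the spin inside libcuda, while `faulthandler` shows which Python frame (e.g. `loss.backward()`) issued the blocked call.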
# training script imports (excerpt)
import os
import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn
import torch.utils.data
import torch.utils.data.distributed
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

cudnn.benchmark = True  # autotune conv kernels for fixed input shapes
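The log lines in this gist ("Distributed initializing process group", "Loading model", "Loading distributed", "Forward", "Backward") map onto a standard DDP setup. A minimal runnable sketch of that sequence, assuming a single-process `gloo` group on CPU so it runs anywhere (the real run uses the NCCL backend across 8 GPUs, and `run_step` is an illustrative name, not from the gist):

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def run_step():
    # torch.distributed.launch exports MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE
    # in the real run; fake a single-process gloo group here so the sketch
    # runs on CPU.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)
    print("Distributed initializing process group")

    model = nn.Linear(8, 4)                 # "Loading model"
    model = DistributedDataParallel(model)  # "Loading distributed"

    out = model(torch.randn(2, 8))          # "Forward"
    loss = out.sum()
    loss.backward()                         # "Backward": gradient all-reduce fires here
    dist.destroy_process_group()
    return loss.item()

if __name__ == "__main__":
    print(run_step())
```

With NCCL, the all-reduce triggered during `backward()` is where a rank mismatch or network problem surfaces as the `cuStreamSynchronize` hang seen in the backtraces above.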
Distributed initializing process group
Loading model
Loading distributed
Forward
Backward
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-1-a0b9fdb95505> in <module>()
33 print('Backward')
34 loss = out.sum()
gunicorn worker error log (algo-1 serving container):
algo-1-X9QH2_1 | [2018-12-11 01:17:04 +0000] [60] [ERROR] Socket error processing request.
algo-1-X9QH2_1 | Traceback (most recent call last):
algo-1-X9QH2_1 | File "/usr/local/lib/python3.7/dist-packages/gunicorn/workers/base_async.py", line 66, in handle
algo-1-X9QH2_1 | six.reraise(*sys.exc_info())
algo-1-X9QH2_1 | File "/usr/local/lib/python3.7/dist-packages/gunicorn/six.py", line 625, in reraise
algo-1-X9QH2_1 | raise value
algo-1-X9QH2_1 | File "/usr/local/lib/python3.7/dist-packages/gunicorn/workers/base_async.py", line 56, in handle
algo-1-X9QH2_1 | self.handle_request(listener_name, req, client, addr)
algo-1-X9QH2_1 | File "/usr/local/lib/python3.7/dist-packages/gunicorn/workers/ggevent.py", line 160, in handle_request
algo-1-X9QH2_1 | addr)
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import torch.nn as nn\n",