Attaching to program: /home/ubuntu/anaconda3/envs/pytorch_source/bin/python, process 3936
[New LWP 3963]
[New LWP 3966]
[New LWP 3989]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007ffd67f80b39 in clock_gettime ()
(gdb) bt
#0 0x00007ffd67f80b39 in clock_gettime ()
#1 0x00007f1b4d5f3876 in __GI___clock_gettime (clock_id=4, tp=0x7ffd67ef2780) at ../sysdeps/unix/clock_gettime.c:115
bearpelican / distributed_nccl_debug.txt
Andrew Shaw (bearpelican), last active November 20, 2018
Setup: 2 x p3.8xlarge (8 GPUs total, 4 per node)
CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr=192.168.62.15 --master_port=6006 train_minimal.py; echo $? > /tmp/tmux/10.1542748959789204.out
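For context: torch.distributed.launch starts --nproc_per_node worker processes on this machine, passes each one a --local_rank argument, and exports MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK so that init_process_group(init_method='env://') can rendezvous; the second machine presumably runs the same command with --node_rank=1. A minimal sketch (mine, not from the gist) of what each launched worker sees:

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # injected by torch.distributed.launch
args = parser.parse_args()

# With --nproc_per_node=4 --nnodes=2 there are 8 ranks in total;
# local_rank is 0-3 on each node, RANK is 0-7 across the whole job.
print('local_rank', args.local_rank,
      'rank', os.environ['RANK'],
      'world_size', os.environ['WORLD_SIZE'],
      'master', os.environ['MASTER_ADDR'] + ':' + os.environ['MASTER_PORT'])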
Distributed initializing process group
Distributed initializing process group
Distributed initializing process group
Distributed initializing process group
Distributed: success (0/8)
Loading model
Distributed: success (1/8)
Loading model
#0 0x00007ffe84fe4b39 in clock_gettime ()
#1 0x00007ff84f337876 in __GI___clock_gettime (clock_id=4, tp=0x7ffe84ec4950) at ../sysdeps/unix/clock_gettime.c:115
#2 0x00007ff83f9a9c4e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007ff83fa388d3 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007ff83f9927cc in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5 0x00007ff83f992929 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6 0x00007ff83f8a38e7 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7 0x00007ff83f9ea2f2 in cuStreamSynchronize () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#8 0x00007ff82c7d6964 in cudart::cudaApiStreamSynchronize(CUstream_st*) () from /home/ubuntu/anaconda3/envs/pytorch_source/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so
#9 0x00007ff82c814c9d in cudaStreamSynchronize () from /home/ubuntu/anaconda3/envs/pytorch_source/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so
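Both backtraces bottom out in cuStreamSynchronize / cudaStreamSynchronize inside libcuda and libcaffe2_gpu, which is what a rank blocked on an NCCL collective that never completes looks like from gdb. One way to localize which collective is stuck is to synchronize and log around each one per rank; a rough sketch (the helper name and messages are mine, not from the gist):

import torch
import torch.distributed as dist

def logged_all_reduce(tensor, tag):
    # Print before and after the collective, and force the NCCL kernel to
    # finish with torch.cuda.synchronize(); on each rank, the last "entering"
    # line without a matching "finished" line points at the stuck collective.
    rank = dist.get_rank()
    print('[rank %d] entering all_reduce: %s' % (rank, tag), flush=True)
    dist.all_reduce(tensor)
    torch.cuda.synchronize()
    print('[rank %d] finished all_reduce: %s' % (rank, tag), flush=True)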
import argparse
import os
import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn
import torch.utils.data
import torch.utils.data.distributed
cudnn.benchmark = True
import argparse
import os
import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import torch.utils.data
import torch.utils.data.distributed
from torch.nn.parallel import DistributedDataParallel
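The log phases below ("Distributed: init_process_group success", "Loaded model", "Defined loss and optimizer", ...) map onto a setup roughly like the following sketch, using the imports above; the model, loss and optimizer here are placeholders, not the actual contents of train_minimal.py:

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

# Bind this process to its GPU before creating the NCCL process group.
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl', init_method='env://')
print('Distributed: init_process_group success')

model = nn.Linear(128, 1000).cuda()  # placeholder model
model = DistributedDataParallel(model, device_ids=[args.local_rank])
print('Loaded model')

criterion = nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
print('Defined loss and optimizer')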
~~epoch hours top1 top5
Dataset changed.
Image size: 128
Batch size: 128
Train Directory: /home/ubuntu/data/imagenet-sz/160/train
Validation Directory: /home/ubuntu/data/imagenet-sz/160/validation
Changing LR from None to 1.9220382165605094
Changing LR from 2.2379617834394905 to 2.2399999999999998
~~0 0.01241 4.248 12.422
~~epoch hours top1Accuracy
Distributed: init_process_group success
Loaded model
Defined loss and optimizer
Created data loaders
Begin training
Changing LR from None to 1.4
~~0 0.01853289027777778 14.500
~~epoch hours top1Accuracy
Distributed: init_process_group success
Loaded model
Defined loss and optimizer
Created data loaders
Begin training
Begin training loop: 1530911465.107739
Prefetcher first preload complete
Received input: 3.9817962646484375
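The "Prefetcher first preload complete" and "Received input: 3.98..." lines suggest a CUDA-stream data prefetcher with timing around the first batch. A rough sketch of that common pattern (class and method names are guesses, not the gist's code):

import time
import torch

class DataPrefetcher:
    # Overlap host-to-device copies of the next batch with compute on the
    # current one by staging the copy on a side CUDA stream.
    def __init__(self, loader):
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()
        self.preload()
        print('Prefetcher first preload complete')

    def preload(self):
        try:
            self.next_input, self.next_target = next(self.loader)
        except StopIteration:
            self.next_input = None
            self.next_target = None
            return
        with torch.cuda.stream(self.stream):
            self.next_input = self.next_input.cuda(non_blocking=True)
            self.next_target = self.next_target.cuda(non_blocking=True)

    def next(self):
        # Make the default stream wait for the staged copy before using it.
        torch.cuda.current_stream().wait_stream(self.stream)
        input, target = self.next_input, self.next_target
        self.preload()
        return input, target

# "Received input: <seconds>" would then be the wall-clock time to get the
# first batch onto the GPU, e.g.:
#   start = time.time(); input, target = prefetcher.next()
#   print('Received input:', time.time() - start)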