Andrew Shaw (bearpelican)

GDB backtrace of a worker process hung in cuStreamSynchronize:
#0 0x00007ffe84fe4b39 in clock_gettime ()
#1 0x00007ff84f337876 in __GI___clock_gettime (clock_id=4, tp=0x7ffe84ec4950) at ../sysdeps/unix/clock_gettime.c:115
#2 0x00007ff83f9a9c4e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007ff83fa388d3 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007ff83f9927cc in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5 0x00007ff83f992929 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6 0x00007ff83f8a38e7 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7 0x00007ff83f9ea2f2 in cuStreamSynchronize () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#8 0x00007ff82c7d6964 in cudart::cudaApiStreamSynchronize(CUstream_st*) () from /home/ubuntu/anaconda3/envs/pytorch_source/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so
#9 0x00007ff82c814c9d in cudaStreamSynchronize () from /home/ubuntu/anaconda3/envs/pytorch_source/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so
bearpelican / distributed_nccl_debug.txt
Last active November 20, 2018 21:29
2 × p3.8xlarge instances (4 GPUs each, 8 GPUs total)
CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr=192.168.62.15 --master_port=6006 train_minimal.py; echo $? > /tmp/tmux/10.1542748959789204.out
Distributed initializing process group
Distributed initializing process group
Distributed initializing process group
Distributed initializing process group
Distributed: success (0/8)
Loading model
Distributed: success (1/8)
Loading model
Attaching to program: /home/ubuntu/anaconda3/envs/pytorch_source/bin/python, process 3936
[New LWP 3963]
[New LWP 3966]
[New LWP 3989]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007ffd67f80b39 in clock_gettime ()
(gdb) bt
#0 0x00007ffd67f80b39 in clock_gettime ()
#1 0x00007f1b4d5f3876 in __GI___clock_gettime (clock_id=4, tp=0x7ffd67ef2780) at ../sysdeps/unix/clock_gettime.c:115
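Attaching gdb, as above, gives the C-level view of the hang. The Python-side thread stacks of a stuck worker can be dumped with the stdlib `faulthandler` module instead. A minimal sketch, not from the gist (Unix-only, since it registers a signal handler):

```python
import faulthandler
import signal
import tempfile

# Registering a signal lets you ask a live, hung process for its Python
# thread stacks from the outside: `kill -USR1 <pid>`, then read the
# process's stderr. Unix-only.
faulthandler.register(signal.SIGUSR1)

# The same dump can also be produced programmatically. faulthandler writes
# to a real file descriptor, so use a temporary file rather than StringIO.
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback(file=f)
    f.seek(0)
    dump = f.read()

print("Current thread" in dump)
```

This is complementary to gdb: gdb shows the spin inside libcuda, while `faulthandler` shows which Python frame (e.g. `loss.backward()`) issued the blocked call.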
# training script imports (excerpt)
import os
import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn
import torch.utils.data
import torch.utils.data.distributed
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

cudnn.benchmark = True  # autotune conv kernels for fixed input shapes
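The log lines in this gist ("Distributed initializing process group", "Loading model", "Loading distributed", "Forward", "Backward") map onto a standard DDP setup. A minimal runnable sketch of that sequence, assuming a single-process `gloo` group on CPU so it runs anywhere (the real run uses the NCCL backend across 8 GPUs, and `run_step` is an illustrative name, not from the gist):

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def run_step():
    # torch.distributed.launch exports MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE
    # in the real run; fake a single-process gloo group here so the sketch
    # runs on CPU.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)
    print("Distributed initializing process group")

    model = nn.Linear(8, 4)                 # "Loading model"
    model = DistributedDataParallel(model)  # "Loading distributed"

    out = model(torch.randn(2, 8))          # "Forward"
    loss = out.sum()
    loss.backward()                         # "Backward": gradient all-reduce fires here
    dist.destroy_process_group()
    return loss.item()

if __name__ == "__main__":
    print(run_step())
```

With NCCL, the all-reduce triggered during `backward()` is where a rank mismatch or network problem surfaces as the `cuStreamSynchronize` hang seen in the backtraces above.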
Distributed initializing process group
Loading model
Loading distributed
Forward
Backward
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-1-a0b9fdb95505> in <module>()
33 print('Backward')
34 loss = out.sum()
gunicorn worker error log (algo-1 serving container):
algo-1-X9QH2_1 | [2018-12-11 01:17:04 +0000] [60] [ERROR] Socket error processing request.
algo-1-X9QH2_1 | Traceback (most recent call last):
algo-1-X9QH2_1 | File "/usr/local/lib/python3.7/dist-packages/gunicorn/workers/base_async.py", line 66, in handle
algo-1-X9QH2_1 | six.reraise(*sys.exc_info())
algo-1-X9QH2_1 | File "/usr/local/lib/python3.7/dist-packages/gunicorn/six.py", line 625, in reraise
algo-1-X9QH2_1 | raise value
algo-1-X9QH2_1 | File "/usr/local/lib/python3.7/dist-packages/gunicorn/workers/base_async.py", line 56, in handle
algo-1-X9QH2_1 | self.handle_request(listener_name, req, client, addr)
algo-1-X9QH2_1 | File "/usr/local/lib/python3.7/dist-packages/gunicorn/workers/ggevent.py", line 160, in handle_request
algo-1-X9QH2_1 | addr)
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import torch.nn as nn\n",