Attaching to program: /home/ubuntu/anaconda3/envs/pytorch_source/bin/python, process 3936
[New LWP 3963]
[New LWP 3966]
[New LWP 3989]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007ffd67f80b39 in clock_gettime ()
(gdb) bt
#0 0x00007ffd67f80b39 in clock_gettime ()
#1 0x00007f1b4d5f3876 in __GI___clock_gettime (clock_id=4, tp=0x7ffd67ef2780) at ../sysdeps/unix/clock_gettime.c:115
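The native backtrace above shows where the process is stuck at the C level. A Python-level view of the same hang can be obtained without gdb; this is only a supplementary sketch using the standard-library faulthandler module (not part of the original scripts), which dumps every thread's Python stack after a timeout or on demand via a signal.

```python
import faulthandler
import signal
import sys

# If the process is still alive after 10 minutes, dump every thread's Python
# stack trace to stderr, and keep repeating while the hang persists.
faulthandler.dump_traceback_later(timeout=600, repeat=True, file=sys.stderr)

# Also allow an on-demand dump: `kill -USR1 <pid>` prints the stacks.
faulthandler.register(signal.SIGUSR1)
```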
CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr=192.168.62.15 --master_port=6006 train_minimal.py; echo $? > /tmp/tmux/10.1542748959789204.out
Distributed initializing process group
Distributed initializing process group
Distributed initializing process group
Distributed initializing process group
Distributed: success (0/8)
Loading model
Distributed: success (1/8)
Loading model
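The command above launches 4 workers per node on 2 nodes (world size 8), passes each worker a --local_rank argument, and sets MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE in the environment. train_minimal.py itself is not reproduced here, so the following is only a minimal sketch of the process-group setup that this launch mode expects, using the env:// init method.

```python
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to every worker it spawns.
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

# Bind this process to its GPU before creating the NCCL process group.
torch.cuda.set_device(args.local_rank)

print('Distributed initializing process group')
# env:// reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE, which the
# launcher derives from --master_addr, --master_port, --node_rank,
# --nnodes and --nproc_per_node.
dist.init_process_group(backend='nccl', init_method='env://')
print('Distributed: success (%d/%d)' % (dist.get_rank(), dist.get_world_size()))
```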
#0 0x00007ffe84fe4b39 in clock_gettime ()
#1 0x00007ff84f337876 in __GI___clock_gettime (clock_id=4, tp=0x7ffe84ec4950) at ../sysdeps/unix/clock_gettime.c:115
#2 0x00007ff83f9a9c4e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007ff83fa388d3 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007ff83f9927cc in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5 0x00007ff83f992929 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6 0x00007ff83f8a38e7 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7 0x00007ff83f9ea2f2 in cuStreamSynchronize () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#8 0x00007ff82c7d6964 in cudart::cudaApiStreamSynchronize(CUstream_st*) () from /home/ubuntu/anaconda3/envs/pytorch_source/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so
#9 0x00007ff82c814c9d in cudaStreamSynchronize () from /home/ubuntu/anaconda3/envs/pytorch_source/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so
import argparse
import os
import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn
import torch.utils.data
import torch.utils.data.distributed
cudnn.benchmark = True
import argparse
import os
import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import torch.utils.data
import torch.utils.data.distributed
from torch.nn.parallel import DistributedDataParallel
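Only the import block of this second script survives above. As an illustration of how these imports (torch.utils.data.distributed and DistributedDataParallel) typically fit together, here is a hedged sketch; the ResNet-50 model, the random tensor dataset, and the batch size are placeholders rather than the original script's choices.

```python
import argparse
import torch
import torch.distributed as dist
import torch.utils.data
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel
import torchvision.models as models  # placeholder model source

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl', init_method='env://')

# One GPU per process: restrict the replica to this process's device so that
# gradients are averaged across all ranks during backward().
model = models.resnet50().cuda()
model = DistributedDataParallel(model, device_ids=[args.local_rank],
                                output_device=args.local_rank)

# DistributedSampler gives each rank a disjoint shard, so the ranks together
# cover the dataset exactly once per epoch.
dataset = torch.utils.data.TensorDataset(
    torch.randn(1024, 3, 128, 128),
    torch.randint(0, 1000, (1024,)))
sampler = DistributedSampler(dataset)
loader = torch.utils.data.DataLoader(dataset, batch_size=128, sampler=sampler,
                                     num_workers=4, pin_memory=True)
```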
~~epoch hours top1 top5
Dataset changed.
Image size: 128
Batch size: 128
Train Directory: /home/ubuntu/data/imagenet-sz/160/train
Validation Directory: /home/ubuntu/data/imagenet-sz/160/validation
Changing LR from None to 1.9220382165605094
Changing LR from 2.2379617834394905 to 2.2399999999999998
~~0 0.01241 4.248 12.422
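The "Changing LR from ... to ..." lines indicate the learning rate is rewritten directly on the optimizer during training; the exact warmup/step schedule is not shown. As an illustration only, retargeting the learning rate in PyTorch amounts to updating every parameter group, roughly like this:

```python
import torch

def update_lr(optimizer, new_lr):
    # Rewrite the learning rate on every parameter group and log the change
    # in the same style as the training output above.
    old_lr = optimizer.param_groups[0]['lr']
    print('Changing LR from %s to %s' % (old_lr, new_lr))
    for group in optimizer.param_groups:
        group['lr'] = new_lr

params = [torch.nn.Parameter(torch.zeros(10))]
optimizer = torch.optim.SGD(params, lr=1.9220382165605094, momentum=0.9)
update_lr(optimizer, 2.24)  # e.g. the final step of a warmup schedule
```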
~~epoch hours top1Accuracy
Distributed: init_process_group success
Loaded model
Defined loss and optimizer
Created data loaders
Begin training
Changing LR from None to 1.4
~~0 0.01853289027777778 14.500
~~epoch hours top1Accuracy
Distributed: init_process_group success
Loaded model
Defined loss and optimizer
Created data loaders
Begin training
Begin training loop: 1530911465.107739
Prefetcher first preload complete
Received input: 3.9817962646484375
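"Prefetcher first preload complete" suggests the input pipeline stages the next batch's host-to-device copy on a side CUDA stream while the current batch is being consumed. The original prefetcher is not shown, so the class below is only a sketch of that general pattern, assuming a DataLoader with pin_memory=True that yields (input, target) pairs.

```python
import torch

class DataPrefetcher:
    """Overlap host-to-device copies of the next batch with compute on the
    current one by staging the copy on a separate CUDA stream."""

    def __init__(self, loader):
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()
        self.next_input = None
        self.next_target = None
        self.preload()

    def preload(self):
        try:
            self.next_input, self.next_target = next(self.loader)
        except StopIteration:
            self.next_input = None
            self.next_target = None
            return
        with torch.cuda.stream(self.stream):
            # non_blocking copies require pin_memory=True on the DataLoader.
            self.next_input = self.next_input.cuda(non_blocking=True)
            self.next_target = self.next_target.cuda(non_blocking=True)

    def next(self):
        # Make the default stream wait for the staged copy before using it.
        torch.cuda.current_stream().wait_stream(self.stream)
        batch_input, batch_target = self.next_input, self.next_target
        self.preload()
        return batch_input, batch_target
```

A fuller version would also call record_stream(torch.cuda.current_stream()) on the returned tensors so the caching allocator does not reuse their memory while the default stream is still consuming them; the sketch omits that detail.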