Last active
November 20, 2018 21:29
2 x p3.8xlarge (8 GPUs total)
| CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr=192.168.62.15 --master_port=6006 train_minimal.py; echo $? > /tmp/tmux/10 | |
| .1542748959789204.out | |
| Distributed initializing process group | |
| Distributed initializing process group | |
| Distributed initializing process group | |
| Distributed initializing process group | |
| Distributed: success (0/8) | |
| Loading model | |
| Distributed: success (1/8) | |
| Loading model | |
| Distributed: success (3/8) | |
| Loading model | |
| Distributed: success (2/8) | |
| Loading model | |
| Network to fp16 | |
| Network to fp16 | |
| 1 Deadlock may happen here | |
| 1 Creating tensor: 1.0 | |
| ip-192-168-62-15:3933:3933 [0] NCCL INFO NET : Using interface ens3:192.168.62.15<0> | |
| ip-192-168-62-15:3933:3933 [0] NCCL INFO NET/IB : Using interface ens3 for sideband communication | |
| ip-192-168-62-15:3933:3933 [0] NCCL INFO Using internal Network Socket | |
| ip-192-168-62-15:3933:3933 [0] NCCL INFO NET : Using interface ens3:192.168.62.15<0> | |
| ip-192-168-62-15:3933:3933 [0] NCCL INFO NET/Socket : 1 interfaces found | |
| NCCL version 2.3.7+cuda10.0 | |
| ip-192-168-62-15:3933:3933 [0] NCCL INFO rank 0 nranks 8 | |
| 1 Deadlock may happen here | |
| 1 Creating tensor: 1.0 | |
| ip-192-168-62-15:3934:3934 [1] NCCL INFO NET : Using interface ens3:192.168.62.15<0> | |
| ip-192-168-62-15:3934:3934 [1] NCCL INFO NET/IB : Using interface ens3 for sideband communication | |
| ip-192-168-62-15:3934:3934 [1] NCCL INFO Using internal Network Socket | |
| ip-192-168-62-15:3934:3934 [1] NCCL INFO rank 1 nranks 8 | |
| Network to fp16 | |
| 1 Deadlock may happen here | |
| 1 Creating tensor: 1.0 | |
| ip-192-168-62-15:3936:3936 [3] NCCL INFO NET : Using interface ens3:192.168.62.15<0> | |
| ip-192-168-62-15:3936:3936 [3] NCCL INFO NET/IB : Using interface ens3 for sideband communication | |
| ip-192-168-62-15:3936:3936 [3] NCCL INFO Using internal Network Socket | |
| ip-192-168-62-15:3936:3936 [3] NCCL INFO rank 3 nranks 8 | |
| ip-192-168-62-15:3934:3985 [1] NCCL INFO comm 0x7f704c01e510 rank 1 nranks 8 | |
| ip-192-168-62-15:3934:3985 [1] NCCL INFO NET : Using interface ens3:192.168.62.15<0> | |
| ip-192-168-62-15:3934:3985 [1] NCCL INFO NET/Socket : 1 interfaces found | |
| ip-192-168-62-15:3933:3984 [0] NCCL INFO comm 0x7f097c01e510 rank 0 nranks 8 | |
| ip-192-168-62-15:3936:3986 [3] NCCL INFO comm 0x7f1a9801e510 rank 3 nranks 8 | |
| ip-192-168-62-15:3936:3986 [3] NCCL INFO NET : Using interface ens3:192.168.62.15<0> | |
| ip-192-168-62-15:3936:3986 [3] NCCL INFO NET/Socket : 1 interfaces found | |
| Network to fp16 | |
| 1 Deadlock may happen here | |
| 1 Creating tensor: 1.0 | |
| ip-192-168-62-15:3935:3935 [2] NCCL INFO NET : Using interface ens3:192.168.62.15<0> | |
| ip-192-168-62-15:3935:3935 [2] NCCL INFO NET/IB : Using interface ens3 for sideband communication | |
| ip-192-168-62-15:3935:3935 [2] NCCL INFO Using internal Network Socket | |
| ip-192-168-62-15:3935:3935 [2] NCCL INFO rank 2 nranks 8 | |
| ip-192-168-62-15:3935:3987 [2] NCCL INFO comm 0x7faf3801e510 rank 2 nranks 8 | |
| ip-192-168-62-15:3935:3987 [2] NCCL INFO NET : Using interface ens3:192.168.62.15<0> | |
| ip-192-168-62-15:3935:3987 [2] NCCL INFO NET/Socket : 1 interfaces found | |
| ip-192-168-62-15:3935:3987 [2] NCCL INFO CUDA Dev 2, IP Interfaces : ens3(PHB) | |
| ip-192-168-62-15:3934:3985 [1] NCCL INFO CUDA Dev 1, IP Interfaces : ens3(PHB) | |
| ip-192-168-62-15:3936:3986 [3] NCCL INFO CUDA Dev 3, IP Interfaces : ens3(PHB) | |
| ip-192-168-62-15:3933:3984 [0] NCCL INFO CUDA Dev 0, IP Interfaces : ens3(PHB) | |
| ip-192-168-62-15:3933:3984 [0] NCCL INFO Using 256 threads | |
| ip-192-168-62-15:3933:3984 [0] NCCL INFO Min Comp Cap 7 | |
| ip-192-168-62-15:3933:3984 [0] NCCL INFO Ring 00 : 0 1 2 3 4 5 6 7 | |
| ip-192-168-62-15:3933:3984 [0] NCCL INFO Ring 00 : 7 -> 0 via NET/Socket/0 | |
| ip-192-168-62-15:3933:3984 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC | |
| ip-192-168-62-15:3935:3987 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC | |
| ip-192-168-62-15:3934:3985 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC | |
| ip-192-168-62-15:3936:3986 [3] NCCL INFO nvmlDeviceGetNvLinkState() failed: Not Supported | |
| ip-192-168-62-15:3935:3987 [2] NCCL INFO comm 0x7faf3801e510 rank 2 nranks 8 - COMPLETE | |
| ip-192-168-62-15:3935:3935 [2] NCCL INFO AllReduce: opCount 0 sendbuff 0x7faf403f8a00 recvbuff 0x7faf403f8a00 count 1 datatype 7 op 0 root 0 comm 0x7faf3801e510 [nranks=8] stream 0x562e3fa686e0 | |
| ip-192-168-62-15:3936:3986 [3] NCCL INFO comm 0x7f1a9801e510 rank 3 nranks 8 - COMPLETE | |
| ip-192-168-62-15:3933:3984 [0] NCCL INFO comm 0x7f097c01e510 rank 0 nranks 8 - COMPLETE | |
| ip-192-168-62-15:3934:3985 [1] NCCL INFO comm 0x7f704c01e510 rank 1 nranks 8 - COMPLETE | |
| ip-192-168-62-15:3933:3933 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f09903f8a00 recvbuff 0x7f09903f8a00 count 1 datatype 7 op 0 root 0 comm 0x7f097c01e510 [nranks=8] stream 0x5559e79b1aa0 | |
| ip-192-168-62-15:3936:3936 [3] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f1aa03f8a00 recvbuff 0x7f1aa03f8a00 count 1 datatype 7 op 0 root 0 comm 0x7f1a9801e510 [nranks=8] stream 0x555c4a102d90 | |
| ip-192-168-62-15:3934:3934 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f70583f8a00 recvbuff 0x7f70583f8a00 count 1 datatype 7 op 0 root 0 comm 0x7f704c01e510 [nranks=8] stream 0x55b2bf549cf0 | |
| ip-192-168-62-15:3933:3933 [0] NCCL INFO Launch mode Parallel | |