@bearpelican
Last active November 20, 2018 21:29
2 p3-8xlarge (8 gpus)
CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr=192.168.62.15 --master_port=6006 train_minimal.py; echo $? > /tmp/tmux/10.1542748959789204.out
Distributed initializing process group
Distributed initializing process group
Distributed initializing process group
Distributed initializing process group
Distributed: success (0/8)
Loading model
Distributed: success (1/8)
Loading model
Distributed: success (3/8)
Loading model
Distributed: success (2/8)
Loading model
Network to fp16
Network to fp16
1 Deadlock may happen here
1 Creating tensor: 1.0
ip-192-168-62-15:3933:3933 [0] NCCL INFO NET : Using interface ens3:192.168.62.15<0>
ip-192-168-62-15:3933:3933 [0] NCCL INFO NET/IB : Using interface ens3 for sideband communication
ip-192-168-62-15:3933:3933 [0] NCCL INFO Using internal Network Socket
ip-192-168-62-15:3933:3933 [0] NCCL INFO NET : Using interface ens3:192.168.62.15<0>
ip-192-168-62-15:3933:3933 [0] NCCL INFO NET/Socket : 1 interfaces found
NCCL version 2.3.7+cuda10.0
ip-192-168-62-15:3933:3933 [0] NCCL INFO rank 0 nranks 8
1 Deadlock may happen here
1 Creating tensor: 1.0
ip-192-168-62-15:3934:3934 [1] NCCL INFO NET : Using interface ens3:192.168.62.15<0>
ip-192-168-62-15:3934:3934 [1] NCCL INFO NET/IB : Using interface ens3 for sideband communication
ip-192-168-62-15:3934:3934 [1] NCCL INFO Using internal Network Socket
ip-192-168-62-15:3934:3934 [1] NCCL INFO rank 1 nranks 8
Network to fp16
1 Deadlock may happen here
1 Creating tensor: 1.0
ip-192-168-62-15:3936:3936 [3] NCCL INFO NET : Using interface ens3:192.168.62.15<0>
ip-192-168-62-15:3936:3936 [3] NCCL INFO NET/IB : Using interface ens3 for sideband communication
ip-192-168-62-15:3936:3936 [3] NCCL INFO Using internal Network Socket
ip-192-168-62-15:3936:3936 [3] NCCL INFO rank 3 nranks 8
ip-192-168-62-15:3934:3985 [1] NCCL INFO comm 0x7f704c01e510 rank 1 nranks 8
ip-192-168-62-15:3934:3985 [1] NCCL INFO NET : Using interface ens3:192.168.62.15<0>
ip-192-168-62-15:3934:3985 [1] NCCL INFO NET/Socket : 1 interfaces found
ip-192-168-62-15:3933:3984 [0] NCCL INFO comm 0x7f097c01e510 rank 0 nranks 8
ip-192-168-62-15:3936:3986 [3] NCCL INFO comm 0x7f1a9801e510 rank 3 nranks 8
ip-192-168-62-15:3936:3986 [3] NCCL INFO NET : Using interface ens3:192.168.62.15<0>
ip-192-168-62-15:3936:3986 [3] NCCL INFO NET/Socket : 1 interfaces found
Network to fp16
1 Deadlock may happen here
1 Creating tensor: 1.0
ip-192-168-62-15:3935:3935 [2] NCCL INFO NET : Using interface ens3:192.168.62.15<0>
ip-192-168-62-15:3935:3935 [2] NCCL INFO NET/IB : Using interface ens3 for sideband communication
ip-192-168-62-15:3935:3935 [2] NCCL INFO Using internal Network Socket
ip-192-168-62-15:3935:3935 [2] NCCL INFO rank 2 nranks 8
ip-192-168-62-15:3935:3987 [2] NCCL INFO comm 0x7faf3801e510 rank 2 nranks 8
ip-192-168-62-15:3935:3987 [2] NCCL INFO NET : Using interface ens3:192.168.62.15<0>
ip-192-168-62-15:3935:3987 [2] NCCL INFO NET/Socket : 1 interfaces found
ip-192-168-62-15:3935:3987 [2] NCCL INFO CUDA Dev 2, IP Interfaces : ens3(PHB)
ip-192-168-62-15:3934:3985 [1] NCCL INFO CUDA Dev 1, IP Interfaces : ens3(PHB)
ip-192-168-62-15:3936:3986 [3] NCCL INFO CUDA Dev 3, IP Interfaces : ens3(PHB)
ip-192-168-62-15:3933:3984 [0] NCCL INFO CUDA Dev 0, IP Interfaces : ens3(PHB)
ip-192-168-62-15:3933:3984 [0] NCCL INFO Using 256 threads
ip-192-168-62-15:3933:3984 [0] NCCL INFO Min Comp Cap 7
ip-192-168-62-15:3933:3984 [0] NCCL INFO Ring 00 : 0 1 2 3 4 5 6 7
ip-192-168-62-15:3933:3984 [0] NCCL INFO Ring 00 : 7 -> 0 via NET/Socket/0
ip-192-168-62-15:3933:3984 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
ip-192-168-62-15:3935:3987 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
ip-192-168-62-15:3934:3985 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
ip-192-168-62-15:3936:3986 [3] NCCL INFO nvmlDeviceGetNvLinkState() failed: Not Supported
ip-192-168-62-15:3935:3987 [2] NCCL INFO comm 0x7faf3801e510 rank 2 nranks 8 - COMPLETE
ip-192-168-62-15:3935:3935 [2] NCCL INFO AllReduce: opCount 0 sendbuff 0x7faf403f8a00 recvbuff 0x7faf403f8a00 count 1 datatype 7 op 0 root 0 comm 0x7faf3801e510 [nranks=8] stream 0x562e3fa686e0
ip-192-168-62-15:3936:3986 [3] NCCL INFO comm 0x7f1a9801e510 rank 3 nranks 8 - COMPLETE
ip-192-168-62-15:3933:3984 [0] NCCL INFO comm 0x7f097c01e510 rank 0 nranks 8 - COMPLETE
ip-192-168-62-15:3934:3985 [1] NCCL INFO comm 0x7f704c01e510 rank 1 nranks 8 - COMPLETE
ip-192-168-62-15:3933:3933 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f09903f8a00 recvbuff 0x7f09903f8a00 count 1 datatype 7 op 0 root 0 comm 0x7f097c01e510 [nranks=8] stream 0x5559e79b1aa0
ip-192-168-62-15:3936:3936 [3] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f1aa03f8a00 recvbuff 0x7f1aa03f8a00 count 1 datatype 7 op 0 root 0 comm 0x7f1a9801e510 [nranks=8] stream 0x555c4a102d90
ip-192-168-62-15:3934:3934 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f70583f8a00 recvbuff 0x7f70583f8a00 count 1 datatype 7 op 0 root 0 comm 0x7f704c01e510 [nranks=8] stream 0x55b2bf549cf0
ip-192-168-62-15:3933:3933 [0] NCCL INFO Launch mode Parallel