@gangliao
Last active May 24, 2024 10:53
Horovod with TCP and IB
import matplotlib.pyplot as plt
import numpy as np

# Wall-clock seconds per run, copied from the benchmark table
gpu_counts = np.array([8, 16, 32, 64])
tcp_vals = [304, 255, 118, 83]
ib_vals = [304, 151, 92, 50]

# Grouped bar chart: one TCP bar and one InfiniBand bar per GPU count
fig, ax = plt.subplots()
bar_width = 0.15
opacity = 0.8
index = np.arange(len(gpu_counts))

ax.bar(index, tcp_vals, bar_width,
       alpha=opacity, color='#595959', label='TCP')
ax.bar(index + bar_width, ib_vals, bar_width,
       alpha=opacity, color='#F6921E', label='InfiniBand')

# Center each tick label between the two bars of its group
ax.set_xticks(index + bar_width / 2)
ax.set_xticklabels(gpu_counts)
ax.set_xlabel('# GPUs')
ax.set_ylabel('# Secs')
ax.set_title('Ring Allreduce on Acoustic Model (LSTM + WarpCTC)')
ax.legend()
fig.tight_layout()
plt.show()
docker run --net=host --runtime=nvidia -d -v /search/odin/tf-code-acoustics/train-data:/search/odin/tf-code-acoustics/train-data -it -P --privileged  -v /dev/infiniband/:/dev/infiniband --name horovod_v3   10.142.104.73:8043/dlp/horovod:latest bash -c "/usr/sbin/sshd -p 55557; sleep infinity"

8 GPUs 1 Node

mpirun  --allow-run-as-root  -np 8 -H 10.141.186.118:8 -x NCCL_IB_DISABLE=1   -x NCCL_SOCKET_IFNAME=eth0 --mca btl_tcp_if_include eth0 --mca plm_rsh_args "-p 55557" bash run-restore.sh ctc &> test_1x8_256.log &


mpirun  --allow-run-as-root  -np 8 -H 10.141.250.15:8 -x NCCL_IB_DISABLE=0   -x NCCL_SOCKET_IFNAME=ib0 --mca btl_tcp_if_include ib0 --mca plm_rsh_args "-p 55557" bash run-restore.sh ctc &> test_1x8_256_ib.log &

16 GPUs 2 Nodes

mpirun  --allow-run-as-root  -np 16 -H 10.141.186.118:8,10.141.186.119:8 -x NCCL_IB_DISABLE=1   -x NCCL_SOCKET_IFNAME=eth0 --mca btl_tcp_if_include eth0 --mca plm_rsh_args "-p 55557" bash run-restore.sh ctc &> test_2x8_512.log &


mpirun  --allow-run-as-root  -np 16 -H 10.141.250.15:8,10.141.250.16:8 -x NCCL_IB_DISABLE=0   -x NCCL_SOCKET_IFNAME=ib0 --mca btl_tcp_if_include ib0 --mca plm_rsh_args "-p 55557" bash run-restore.sh ctc &> test_2x8_512_ib.log &

32 GPUs 4 Nodes

mpirun  --allow-run-as-root  -np 32 -H 10.141.186.118:8,10.141.186.119:8,10.141.186.111:8,10.141.186.117:8 -x NCCL_IB_DISABLE=1   -x NCCL_SOCKET_IFNAME=eth0 --mca btl_tcp_if_include eth0 --mca plm_rsh_args "-p 55557" bash run-restore.sh ctc &> test_4x8_1024.log &


mpirun  --allow-run-as-root  -np 32 -H 10.141.250.15:8,10.141.250.16:8,10.141.250.8:8,10.141.250.14:8 -x NCCL_IB_DISABLE=0   -x NCCL_SOCKET_IFNAME=ib0 --mca btl_tcp_if_include ib0 --mca plm_rsh_args "-p 55557" bash run-restore.sh ctc &> test_4x8_1024_ib.log &

64 GPUs 8 Nodes

mpirun  --allow-run-as-root  -np 64 -H 10.141.186.118:8,10.141.186.119:8,10.141.186.111:8,10.141.186.117:8,10.141.162.80:8,10.141.170.36:8,10.141.202.77:8,10.141.202.71:8 -x NCCL_IB_DISABLE=1   -x NCCL_SOCKET_IFNAME=eth0 --mca btl_tcp_if_include eth0 --mca plm_rsh_args "-p 55557" bash run-restore.sh ctc &> test_8x8_2048.log &


mpirun  --allow-run-as-root  -np 64 -H 10.141.250.15:8,10.141.250.16:8,10.141.250.8:8,10.141.250.14:8,10.141.250.33:8,10.141.251.7:8,10.141.251.45:8,10.141.251.55:8 -x NCCL_IB_DISABLE=0   -x NCCL_SOCKET_IFNAME=ib0 --mca btl_tcp_if_include ib0 --mca plm_rsh_args "-p 55557" bash run-restore.sh ctc &> test_8x8_2048_ib.log &
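
The eight commands above differ only in the host list and in the interface and `NCCL_IB_DISABLE` settings. As a rough sketch, the pattern could be generated with a small Python helper. The function name `build_mpirun_cmd` and its parameters are hypothetical and not part of the original gist; the flags themselves are copied from the commands above, and the default of 8 slots per host mirrors the `:8` suffix used there.

```python
import shlex


def build_mpirun_cmd(hosts, transport, slots=8, ssh_port=55557,
                     script="run-restore.sh", script_args=("ctc",)):
    """Assemble an mpirun argument list shaped like the commands above.

    transport: "tcp" -> eth0 with NCCL_IB_DISABLE=1
               "ib"  -> ib0  with NCCL_IB_DISABLE=0
    (Hypothetical helper; interface names are the ones used in this gist.)
    """
    if transport == "tcp":
        iface, ib_disable = "eth0", 1
    elif transport == "ib":
        iface, ib_disable = "ib0", 0
    else:
        raise ValueError(f"unknown transport: {transport!r}")

    # Each host contributes `slots` processes, e.g. "10.141.186.118:8"
    host_list = ",".join(f"{h}:{slots}" for h in hosts)
    return [
        "mpirun", "--allow-run-as-root",
        "-np", str(len(hosts) * slots),
        "-H", host_list,
        "-x", f"NCCL_IB_DISABLE={ib_disable}",
        "-x", f"NCCL_SOCKET_IFNAME={iface}",
        "--mca", "btl_tcp_if_include", iface,
        "--mca", "plm_rsh_args", f"-p {ssh_port}",
        "bash", script, *script_args,
    ]


# Two-node TCP run, matching the "16 GPUs 2 Nodes" case
cmd = build_mpirun_cmd(["10.141.186.118", "10.141.186.119"], "tcp")
print(shlex.join(cmd))  # shlex.join needs Python 3.8+
```

The helper returns a list rather than a string so it can be passed to `subprocess.run` without shell quoting issues; redirecting output to a log file and backgrounding the job, as the original commands do, is left to the caller.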

Benchmark

Nodes  GPUs  Batch Size  TCP   InfiniBand
1      8     256         304s  304s
2      16    512         255s  151s
4      32    1024        118s  92s
8      64    1024        83s   50s
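
To put the table in perspective, the following sketch derives the InfiniBand-over-TCP ratio and a scaling efficiency relative to the single-node run. It assumes every timing covers the same amount of work, which the gist does not state, so the efficiency figures should be read as indicative only. The variable and function names are illustrative.

```python
# Timings in seconds, copied from the benchmark table
gpus = [8, 16, 32, 64]
tcp_secs = [304, 255, 118, 83]
ib_secs = [304, 151, 92, 50]


def scaling_efficiency(gpu_counts, times):
    """Speedup over the first entry divided by the increase in GPU count.

    Assumes each run processes the same total workload (not confirmed
    by the source). 1.0 would mean perfectly linear scaling.
    """
    base_gpus, base_time = gpu_counts[0], times[0]
    return [(base_time / t) / (g / base_gpus)
            for g, t in zip(gpu_counts, times)]


# How many times faster InfiniBand is than TCP at each GPU count
ib_over_tcp = [t / i for t, i in zip(tcp_secs, ib_secs)]

tcp_eff = scaling_efficiency(gpus, tcp_secs)
ib_eff = scaling_efficiency(gpus, ib_secs)

for g, r, te, ie in zip(gpus, ib_over_tcp, tcp_eff, ib_eff):
    print(f"{g:>2} GPUs: IB/TCP speedup {r:.2f}x, "
          f"efficiency TCP {te:.0%}, IB {ie:.0%}")
```

On a single node both transports report the same time, which is consistent with the allreduce traffic staying inside the machine and never touching the network interface.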
@abidmalikwaterloo

How did you measure the execution performance?

@GoodJoey

GoodJoey commented Apr 2, 2018

Nice! Which GPU are you using?
