Jacob Kahn jacobkahn
@jacobkahn
jacobkahn / gist:4e43bfccd79bf864970a98c191cd6c9b
Created July 24, 2019 06:41
NCCL Tests on 1 node p3dn.24xlarge + efa - all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100
#
#                                                   out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw   error     time   algbw   busbw   error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
ip-172-31-39-141:50814:50814 [0] NCCL INFO Launch mode Parallel
           8             2   float     sum    13.84    0.00    0.00   1e-07    13.75    0.00    0.00   1e-07
          16             4   float     sum    13.82    0.00    0.00   1e-07    13.96    0.00    0.00   1e-07
          32             8   float     sum    13.79    0.00    0.00   6e-08    13.79    0.00    0.00   6e-08
          64            16   float     sum    13.88    0.00    0.01   6e-08    13.73    0.00    0.01   6e-08
         128            32   float     sum    14.04    0.01    0.02   6e-08    13.94    0.01    0.02   6e-08
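The algbw and busbw columns above can be reproduced from the size and time columns. The sketch below uses the definitions given in the nccl-tests performance notes: algorithm bandwidth is bytes divided by elapsed time, and for all-reduce the bus bandwidth scales that by 2(n-1)/n, where n is the number of ranks. The rank count of 8 is an assumption (one p3dn.24xlarge has 8 V100 GPUs and the run uses -g 1), and the function names are illustrative, not part of nccl-tests.

```python
def algbw_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s, with 1 GB taken as 1e9 bytes."""
    return size_bytes / (time_us * 1e-6) / 1e9


def allreduce_busbw_gbps(size_bytes: int, time_us: float, n_ranks: int) -> float:
    """Bus bandwidth for all-reduce: algbw scaled by 2 * (n - 1) / n."""
    return algbw_gbps(size_bytes, time_us) * 2 * (n_ranks - 1) / n_ranks


N_RANKS = 8  # assumption: 1 node x 8 GPUs, one rank per GPU

# Out-of-place 128-byte row from the 1-node table: 14.04 us.
alg = algbw_gbps(128, 14.04)
bus = allreduce_busbw_gbps(128, 14.04, N_RANKS)
print(f"algbw={alg:.2f} GB/s busbw={bus:.2f} GB/s")
```

With these inputs the rounded values agree with the 0.01 and 0.02 reported in the 128-byte row, which is consistent with an 8-rank run.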
jacobkahn / gist:71a0c1c7ae2f7260471d5bb5da829a67
Last active July 24, 2019 07:02
NCCL Tests on 32 node p3dn.24xlarge + ethernet - all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100
#
#                                                   out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw   error     time   algbw   busbw   error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
           8             2   float     sum   1790.6    0.00    0.00   1e-06   1787.9    0.00    0.00   1e-06
          16             4   float     sum   1844.8    0.00    0.00   5e-07   1798.8    0.00    0.00   5e-07
          32             8   float     sum   1789.6    0.00    0.00   1e-06   1795.4    0.00    0.00   1e-06
          64            16   float     sum   1799.5    0.00    0.00   1e-06   1801.5    0.00    0.00   1e-06
         128            32   float     sum   1816.3    0.00    0.00   1e-06   1814.9    0.00    0.00   1e-06
         256            64   float     sum   1826.6    0.00    0.00   1e-06   1883.2    0.00    0.00   1e-06
jacobkahn / gist:43076dbce3922677058dc8094eb726b7
Last active July 24, 2019 06:49
NCCL Tests on 32 node p3dn.24xlarge + EFA - all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100
#
#                                                   out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw   error     time   algbw   busbw   error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
           8             2   float     sum   2460.5    0.00    0.00   1e-06   2458.0    0.00    0.00   5e-07
          16             4   float     sum   2457.7    0.00    0.00   5e-07   2467.2    0.00    0.00   5e-07
          32             8   float     sum   2466.2    0.00    0.00   5e-07   2466.3    0.00    0.00   5e-07
          64            16   float     sum   2459.3    0.00    0.00   1e-06   2460.7    0.00    0.00   1e-06
         128            32   float     sum   2457.8    0.00    0.00   1e-06   2456.7    0.00    0.00   1e-06
         256            64   float     sum   2453.9    0.00    0.00   1e-06   2457.4    0.00    0.00   1e-06
jacobkahn / gist:786d5c573bd9fb2a80843fe006d21030
Created July 24, 2019 05:10
NCCL Tests on 16 node p3dn.24xlarge + ethernet - all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100
#
#                                                   out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw   error     time   algbw   busbw   error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
           8             2   float     sum    118.3    0.00    0.00   2e-07    117.2    0.00    0.00   1e-07
          16             4   float     sum    116.2    0.00    0.00   0e+00    117.3    0.00    0.00   1e-07
          32             8   float     sum    118.5    0.00    0.00   1e-07    118.3    0.00    0.00   1e-07
          64            16   float     sum    118.7    0.00    0.00   1e-07    117.1    0.00    0.00   6e-08
         128            32   float     sum    118.5    0.00    0.00   6e-08    118.2    0.00    0.00   6e-08
         256            64   float     sum    118.7    0.00    0.00   6e-08    119.1    0.00    0.00   6e-08
jacobkahn / gist:210a827162960df11df3c9666ab54719
Created July 24, 2019 04:38
NCCL Tests on 16 node p3dn.24xlarge + efa - all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100
#
#                                                   out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw   error     time   algbw   busbw   error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
           8             2   float     sum   1247.4    0.00    0.00   1e-06   1239.1    0.00    0.00   0e+00
          16             4   float     sum   1239.1    0.00    0.00   2e-07   1239.6    0.00    0.00   2e-07
          32             8   float     sum   1242.5    0.00    0.00   2e-07   1241.4    0.00    0.00   2e-07
          64            16   float     sum   1237.8    0.00    0.00   5e-07   1240.8    0.00    0.00   5e-07
         128            32   float     sum   1240.6    0.00    0.00   5e-07   1238.0    0.00    0.00   5e-07
         256            64   float     sum   1238.2    0.00    0.00   5e-07   1237.4    0.00    0.00   5e-07
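To compare runs such as the 16-node ethernet and 16-node efa previews, the result rows can be parsed by splitting on whitespace. The following is only a convenience sketch for reading these previews: the `Row` type and `parse_row` helper are hypothetical names, and the 12-field layout is assumed from the column headers shown in the gists.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    size_bytes: int
    count: int
    oop_time_us: float  # out-of-place time
    ip_time_us: float   # in-place time


def parse_row(line: str) -> Optional[Row]:
    """Parse one result row; return None for headers, comments, and log lines."""
    fields = line.split()
    # A result row has 12 fields and starts with the message size in bytes.
    if len(fields) != 12 or not fields[0].isdigit():
        return None
    return Row(int(fields[0]), int(fields[1]), float(fields[4]), float(fields[8]))


# 8-byte rows copied from the two 16-node previews.
ethernet = parse_row("8 2 float sum 118.3 0.00 0.00 2e-07 117.2 0.00 0.00 1e-07")
efa = parse_row("8 2 float sum 1247.4 0.00 0.00 1e-06 1239.1 0.00 0.00 0e+00")

ratio = efa.oop_time_us / ethernet.oop_time_us
print(f"efa / ethernet out-of-place time at 8 B: {ratio:.2f}x")
```

Header lines beginning with `#` and interleaved `NCCL INFO` log lines are skipped because they fail the field-count or leading-digit check.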
jacobkahn / gist:5fd4cd3e49a10c04105b777611327bbe
Created July 24, 2019 01:13
Output of `cat /opt/amazon/efa/installed_packages` on each node
172.31.39.141
# EFA installer version: 1.1.0
# Debug packages installed: no
# Packages installed:
efa-0.9.2-1.amzn1.x86_64
libfabric-1.7.0amzn1.1-1.amzn1.x86_64
libfabric-devel-1.7.0amzn1.1-1.amzn1.x86_64
openmpi-3.1.3-1.amzn1.x86_64
172.31.38.14
# EFA installer version: 1.1.0
# Debug packages installed: no
# Packages installed:
# nThread 1 nGpus 1 minBytes 8 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 100 validation: 1
#
# Using devices
# Rank 0 Pid 20303 on ip-172-31-39-141 device 0 [0x00] Tesla V100-SXM2-32GB
# Rank 1 Pid 20304 on ip-172-31-39-141 device 1 [0x00] Tesla V100-SXM2-32GB
# Rank 2 Pid 20305 on ip-172-31-39-141 device 2 [0x00] Tesla V100-SXM2-32GB
# Rank 3 Pid 20306 on ip-172-31-39-141 device 3 [0x00] Tesla V100-SXM2-32GB
# Rank 4 Pid 20307 on ip-172-31-39-141 device 4 [0x00] Tesla V100-SXM2-32GB
# Rank 5 Pid 20308 on ip-172-31-39-141 device 5 [0x00] Tesla V100-SXM2-32GB
# Rank 6 Pid 20309 on ip-172-31-39-141 device 6 [0x00] Tesla V100-SXM2-32GB
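The header line `minBytes 8 maxBytes 1073741824 step: 2(factor)` corresponds to the `-b 8 -e 1G -f 2` arguments: the message size starts at 8 bytes and is multiplied by 2 until it exceeds 1 GiB. A minimal sketch of that sweep follows, assuming a simple multiplicative loop; `sweep_sizes` is an illustrative name, and the element count assumes 4-byte floats, as in the `count` column of the result tables.

```python
def sweep_sizes(min_bytes: int, max_bytes: int, factor: int) -> list[int]:
    """Message sizes from min_bytes to max_bytes inclusive, multiplying by factor."""
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= factor
    return sizes


FLOAT_BYTES = 4  # assumption: the tests use 4-byte floats ("float" in the type column)

sizes = sweep_sizes(8, 1073741824, 2)
counts = [s // FLOAT_BYTES for s in sizes]

# The first few (size, count) pairs line up with the table rows: 8 B -> 2 elements, etc.
for size, count in list(zip(sizes, counts))[:3]:
    print(size, count)
print("number of sizes:", len(sizes))
```

Under these assumptions the sweep covers 28 sizes, from 8 B up to 1073741824 B.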
jacobkahn / gist:527f7eb2fe85e56074163544ede4e491
Created July 24, 2019 01:06
aws ec2 describe-security-groups --group-ids sg-0d37d17f642362f03
{
    "SecurityGroups": [
        {
            "IpPermissionsEgress": [
                {
                    "IpProtocol": "-1",
                    "PrefixListIds": [],
                    "IpRanges": [
                        {
                            "CidrIp": "0.0.0.0/0"
jacobkahn / gist:218cebce3aca0b1328dd68da05f28f73
Created July 24, 2019 01:03
fi_info -p efa on each host
172.31.39.141
provider: efa
    fabric: EFA-fe80::cbb:82ff:fef5:d306
    domain: efa_0-rdm
    version: 3.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::cbb:82ff:fef5:d306
    domain: efa_0-dgrm
jacobkahn / gist:3510a154e50b18defd236eeea461e3fb
Created July 24, 2019 00:35
NCCL Tests on 2 node p3dn.24xlarge + ethernet - all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100
#
#                                                   out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw   error     time   algbw   busbw   error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
           8             2   float     sum    112.7    0.00    0.00   2e-07    110.9    0.00    0.00   1e-07
          16             4   float     sum    110.5    0.00    0.00   0e+00    111.3    0.00    0.00   1e-07
          32             8   float     sum    110.6    0.00    0.00   1e-07    111.1    0.00    0.00   1e-07
          64            16   float     sum    112.2    0.00    0.00   1e-07    112.1    0.00    0.00   6e-08
         128            32   float     sum    113.1    0.00    0.00   6e-08    113.3    0.00    0.00   6e-08
         256            64   float     sum    112.5    0.00    0.00   6e-08    113.8    0.00    0.00   6e-08