@Stonesjtu
Forked from understeer/latency.txt
Created June 15, 2019 22:05
HPC-oriented Latency Numbers Every Programmer Should Know
Latency Comparison Numbers
--------------------------
L1 cache reference/hit                               1.5 ns                        4 cycles
Floating-point add/mult/FMA operation                1.5 ns                        4 cycles
L2 cache reference/hit                                 5 ns                        12 ~ 17 cycles
Branch mispredict                                      6 ns                        15 ~ 20 cycles
L3 cache hit (unshared cache line)                    16 ns                        42 cycles
L3 cache hit (shared line in another core)            25 ns                        65 cycles
Mutex lock/unlock                                     25 ns
L3 cache hit (modified in another core)               29 ns                        75 cycles
L3 cache hit (on a remote CPU socket)                 40 ns                        100 ~ 300 cycles (40 ~ 116 ns)
QPI hop to another CPU (time per hop)                 40 ns
64MB main memory reference (local CPU)                46 ns                        TinyMemBench on "Broadwell" E5-2690v4
64MB main memory reference (remote CPU)               70 ns                        TinyMemBench on "Broadwell" E5-2690v4
256MB main memory reference (local CPU)               75 ns                        TinyMemBench on "Broadwell" E5-2690v4
256MB main memory reference (remote CPU)             120 ns                        TinyMemBench on "Broadwell" E5-2690v4
Send 4KB over 100 Gbps HPC fabric                  1,040 ns         1 us           MVAPICH2 over Intel Omni-Path / Mellanox EDR
Compress 1KB with Google Snappy                    3,000 ns         3 us
Send 4KB over 10 Gbps ethernet                    10,000 ns        10 us
Write 4KB randomly to NVMe SSD                    30,000 ns        30 us           DC P3608 NVMe SSD (best case; QOS 99% is 500us)
Transfer 1MB to/from NVLink GPU                   30,000 ns        30 us           ~33GB/sec on NVIDIA 40GB NVLink
Transfer 1MB to/from PCI-E GPU                    80,000 ns        80 us           ~12GB/sec on PCI-Express x16 gen 3.0 link
Read 4KB randomly from NVMe SSD                  120,000 ns       120 us           DC P3608 NVMe SSD (QOS 99%)
Read 1MB sequentially from NVMe SSD              208,000 ns       208 us           ~4.8GB/sec DC P3608 NVMe SSD
Write 4KB randomly to SATA SSD                   500,000 ns       500 us           DC S3510 SATA SSD (QOS 99.9%)
Read 4KB randomly from SATA SSD                  500,000 ns       500 us           DC S3510 SATA SSD (QOS 99.9%)
Round trip within same datacenter                500,000 ns       500 us           One-way ping across Ethernet is ~250us
Read 1MB sequentially from SATA SSD            1,818,000 ns     1,818 us    2 ms   ~550MB/sec DC S3510 SATA SSD
Read 1MB sequentially from disk                5,000,000 ns     5,000 us    5 ms   ~200MB/sec server hard disk (seek time would be additional latency)
Random Disk Access (seek+rotation)            10,000,000 ns    10,000 us   10 ms
Send packet CA->Netherlands->CA              150,000,000 ns   150,000 us  150 ms
Total CPU pipeline length?
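The 1MB transfer rows above follow from the bandwidth quoted in their notes column: time is size divided by bandwidth. The sketch below reproduces that arithmetic. It assumes decimal units (1MB = 10^6 bytes, 1GB = 10^9 bytes), which is what the table's figures appear to use, and the helper name transfer_time_us is made up for illustration.

```python
def transfer_time_us(size_bytes: float, bandwidth_bytes_per_sec: float) -> float:
    """Time to move size_bytes at a sustained bandwidth, in microseconds.

    Ignores setup, seek, and protocol latency, so real transfers take longer.
    """
    return size_bytes / bandwidth_bytes_per_sec * 1e6


MB = 1e6  # assumption: decimal megabyte
GB = 1e9  # assumption: decimal gigabyte

# (label, bandwidth in bytes/sec), taken from the notes column of the table
links = [
    ("NVLink GPU", 33 * GB),
    ("PCI-E x16 gen 3.0 GPU", 12 * GB),
    ("NVMe SSD sequential read", 4.8 * GB),
    ("SATA SSD sequential read", 550 * MB),
    ("Hard disk sequential read", 200 * MB),
]

for label, bandwidth in links:
    print(f"{label:28s} {transfer_time_us(1 * MB, bandwidth):10.1f} us")
```

The results come out at roughly 30, 83, 208, 1,818 and 5,000 us, in line with the table (which rounds the PCI-E figure down to 80 us).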
NVIDIA Tesla GPU values
-----------------------
GPU Shared Memory access                              30 ns                        30~90 cycles (bank conflicts will introduce more latency)
GPU Global Memory access                             200 ns                        200~800 cycles, depending upon GPU generation and access patterns
Launch CUDA kernel on GPU                         10,000 ns        10 us           Host CPU instructs GPU to start executing a kernel
Transfer 1MB to/from NVLink GPU                   30,000 ns        30 us           ~33GB/sec on NVIDIA 40GB NVLink
Transfer 1MB to/from PCI-E GPU                    80,000 ns        80 us           ~12GB/sec on PCI-Express x16 link
Floating-point add/mult operation?
Shift operation?
Atomic operation in GPU Global Memory?
Total GPU pipeline length?
Launch CUDA kernel (via dynamic parallelism)?
Intel Xeon CPU values
---------------------
Wake up from C1 state                                500 ns                        varies from <0.5us to 2us
Wake up from C3 state                             15,000 ns        15 us           varies from 10us to 50us
Wake up from C6 state                             30,000 ns        30 us           varies from 20us to 60us
Warm up Intel SkyLake AVX units                   14,000 ns        14 us           AVX units go to sleep after ~675 us
Notes
-----
1 ns = 10^-9 seconds
1 us = 10^-6 seconds = 1,000 ns
1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns
Assumes a CPU clock frequency of 2.6GHz (common for Xeon server CPUs). That's ~0.385ns per clock cycle.
Assumes a GPU clock frequency of 1GHz (NVIDIA Tesla GPUs range from 0.8~1.4GHz). That's 1ns per clock cycle.
"Local" and "Remote" cache/memory values are from dual-socket Intel Xeon. Larger SMP systems have more hops.
GPU NVLink connections are not always 40GB/sec. They range from 20GB/sec to 80GB/sec, depending upon the server platform design.
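The cycle counts and nanosecond figures in the tables are tied together by the clock frequencies assumed in these notes. A minimal sketch of the conversion, using the 2.6GHz CPU and 1GHz GPU clocks stated above (the function and constant names are illustrative, not from any library):

```python
CPU_GHZ = 2.6  # CPU clock assumed in the notes
GPU_GHZ = 1.0  # GPU clock assumed in the notes


def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """One cycle lasts 1/clock_ghz nanoseconds."""
    return cycles / clock_ghz


def ns_to_cycles(ns: float, clock_ghz: float) -> float:
    """Inverse conversion: cycles elapsed in the given number of nanoseconds."""
    return ns * clock_ghz


# Cycle counts taken from the CPU table
rows = [
    ("L1 cache reference/hit", 4),
    ("L3 cache hit (unshared cache line)", 42),
    ("L3 cache hit (modified in another core)", 75),
]

for name, cycles in rows:
    print(f"{name:42s} {cycles:3d} cycles = {cycles_to_ns(cycles, CPU_GHZ):5.1f} ns")
```

At 2.6GHz these come out at about 1.5, 16 and 29 ns, matching the table. On a machine with a different clock, change the constant and the same cycle counts give different wall-clock times.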
Credit
------
Adapted from: https://gist.github.com/jboner/2841832
Original curator: http://research.google.com/people/jeff/
Originally by Peter Norvig: http://norvig.com/21-days.html#answers
Additional Data Gathered/Correlated from:
-----------------------------------------
Memory latency tool: https://github.com/ssvb/tinymembench
CPU data from Agner Fog: http://www.agner.org/optimize/
CPU cache and QPI data: https://mechanical-sympathy.blogspot.com/2013/02/cpu-cache-flushing-fallacy.html
Intel performance analysis: https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
Intel Broadwell CPU data: http://users.atw.hu/instlatx64/GenuineIntel00306D4_Broadwell2_NewMemLat.txt
Intel SkyLake CPU data: http://www.7-cpu.com/cpu/Skylake.html
MVAPICH2 fabric testing: http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/2016/DK_Status_and_Roadmap_MUG16.pdf
NVMe SSD: http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-p3608-spec.pdf
SATA SSD: http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3510-spec.pdf
GPU optimization: https://www.olcf.ornl.gov/wp-content/uploads/2013/02/GPU_Opt_Fund-CW1.pdf
CPU/GPU data locality: https://people.maths.ox.ac.uk/gilesm/cuda/lecs/lecs.pdf
GPU Memory Hierarchy: https://arxiv.org/pdf/1509.02308
Intel Xeon C-state data: http://ena-hpc.org/2014/pdf/paper_06.pdf