HPC-oriented Latency Numbers Every Programmer Should Know
Latency Comparison Numbers
--------------------------
L1 cache reference/hit                              1.5 ns                      4 cycles
Floating-point add/mult/FMA operation               1.5 ns                      4 cycles
L2 cache reference/hit                                5 ns                      12 ~ 17 cycles
Branch mispredict                                     6 ns                      15 ~ 20 cycles
L3 cache hit (unshared cache line)                   16 ns                      42 cycles
L3 cache hit (shared line in another core)           25 ns                      65 cycles
Mutex lock/unlock                                    25 ns
L3 cache hit (modified in another core)              29 ns                      75 cycles
L3 cache hit (on a remote CPU socket)                40 ns                      100 ~ 300 cycles (40 ~ 116 ns)
QPI hop to another CPU (time per hop)                40 ns
64MB main memory reference (local CPU)               46 ns                      TinyMemBench on "Broadwell" E5-2690v4
64MB main memory reference (remote CPU)              70 ns                      TinyMemBench on "Broadwell" E5-2690v4
256MB main memory reference (local CPU)              75 ns                      TinyMemBench on "Broadwell" E5-2690v4
256MB main memory reference (remote CPU)            120 ns                      TinyMemBench on "Broadwell" E5-2690v4
Send 4KB over 100 Gbps HPC fabric                 1,040 ns        1 us          MVAPICH2 over Intel Omni-Path / Mellanox EDR
Compress 1KB with Google Snappy                   3,000 ns        3 us
Send 4KB over 10 Gbps ethernet                   10,000 ns       10 us
Write 4KB randomly to NVMe SSD                   30,000 ns       30 us          DC P3608 NVMe SSD (best case; QOS 99% is 500us)
Transfer 1MB to/from NVLink GPU                  30,000 ns       30 us          ~33GB/sec on NVIDIA 40GB NVLink
Transfer 1MB to/from PCI-E GPU                   80,000 ns       80 us          ~12GB/sec on PCI-Express x16 gen 3.0 link
Read 4KB randomly from NVMe SSD                 120,000 ns      120 us          DC P3608 NVMe SSD (QOS 99%)
Read 1MB sequentially from NVMe SSD             208,000 ns      208 us          ~4.8GB/sec DC P3608 NVMe SSD
Write 4KB randomly to SATA SSD                  500,000 ns      500 us          DC S3510 SATA SSD (QOS 99.9%)
Read 4KB randomly from SATA SSD                 500,000 ns      500 us          DC S3510 SATA SSD (QOS 99.9%)
Round trip within same datacenter               500,000 ns      500 us          One-way ping across Ethernet is ~250us
Read 1MB sequentially from SATA SSD           1,818,000 ns    1,818 us    2 ms  ~550MB/sec DC S3510 SATA SSD
Read 1MB sequentially from disk               5,000,000 ns    5,000 us    5 ms  ~200MB/sec server hard disk (seek time would be additional latency)
Random Disk Access (seek+rotation)           10,000,000 ns   10,000 us   10 ms
Send packet CA->Netherlands->CA             150,000,000 ns  150,000 us  150 ms
Total CPU pipeline length?
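
The 1MB transfer and sequential-read rows above follow from the bandwidth quoted in their notes. The following is a minimal Python sketch of that arithmetic, not part of the original list. It assumes 1MB = 10^6 bytes and ignores any fixed per-transfer latency; the helper name `transfer_time_us` and the `links` dictionary are hypothetical, introduced only for illustration.

```python
def transfer_time_us(size_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to move size_bytes at a sustained bandwidth, in microseconds."""
    return size_bytes / bandwidth_bytes_per_s * 1e6


MB = 1e6  # assumption: decimal megabyte, which matches the table's figures
GB = 1e9  # assumption: decimal gigabyte

# Bandwidth figures copied from the notes column of the table.
links = {
    "NVLink GPU (~33GB/sec)": 33 * GB,
    "PCI-E x16 gen 3.0 GPU (~12GB/sec)": 12 * GB,
    "NVMe SSD sequential (~4.8GB/sec)": 4.8 * GB,
    "SATA SSD sequential (~550MB/sec)": 0.55 * GB,
    "Hard disk sequential (~200MB/sec)": 0.2 * GB,
}

for name, bandwidth in links.items():
    print(f"{name:36s} {transfer_time_us(1 * MB, bandwidth):8.0f} us")
```

Under these assumptions the results come out at roughly 30, 83, 208, 1,818 and 5,000 us, so the table appears to round the PCI-E figure down to 80 us.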

NVIDIA Tesla GPU values
-----------------------
GPU Shared Memory access                             30 ns                      30~90 cycles (bank conflicts will introduce more latency)
GPU Global Memory access                            200 ns                      200~800 cycles, depending upon GPU generation and access patterns
Launch CUDA kernel on GPU                        10,000 ns       10 us          Host CPU instructs GPU to start executing a kernel
Transfer 1MB to/from NVLink GPU                  30,000 ns       30 us          ~33GB/sec on NVIDIA 40GB NVLink
Transfer 1MB to/from PCI-E GPU                   80,000 ns       80 us          ~12GB/sec on PCI-Express x16 link
Floating-point add/mult operation?
Shift operation?
Atomic operation in GPU Global Memory?
Total GPU pipeline length?
Launch CUDA kernel (via dynamic parallelism)?

Intel Xeon CPU values
---------------------
Wake up from C1 state                               500 ns                      varies from <0.5us to 2us
Wake up from C3 state                            15,000 ns       15 us          varies from 10us to 50us
Wake up from C6 state                            30,000 ns       30 us          varies from 20us to 60us
Warm up Intel SkyLake AVX units                  14,000 ns       14 us          AVX units go to sleep after ~675 us

Notes
-----
1 ns = 10^-9 seconds
1 us = 10^-6 seconds = 1,000 ns
1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns
Assumes a CPU clock frequency of 2.6GHz (common for Xeon server CPUs). That's ~0.385ns per clock cycle.
Assumes a GPU clock frequency of 1GHz (NVIDIA Tesla GPUs range from 0.8~1.4GHz). That's 1ns per clock cycle.
"Local" and "Remote" cache/memory values are from dual-socket Intel Xeon. Larger SMP systems have more hops.
GPU NVLink connections are not always 40GB/sec. They range from 20GB/sec to 80GB/sec, depending upon the server platform design.
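
The cycle counts and nanosecond values in the tables are tied together by the clock frequencies assumed in these notes. A minimal Python sketch of the conversion follows; the function and constant names are hypothetical and not part of the original list, and real parts run at different (and varying) clock speeds.

```python
CPU_CLOCK_GHZ = 2.6  # assumed CPU clock from the notes
GPU_CLOCK_GHZ = 1.0  # assumed GPU clock from the notes


def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert clock cycles to nanoseconds at the given clock frequency."""
    return cycles / clock_ghz


def ns_to_cycles(ns: float, clock_ghz: float) -> float:
    """Convert nanoseconds to clock cycles at the given clock frequency."""
    return ns * clock_ghz


# A few cycle counts taken from the CPU table.
for label, cycles in [("L1 cache hit", 4), ("L3 hit (unshared)", 42), ("L3 hit (shared)", 65)]:
    print(f"{label:20s} {cycles:4d} cycles = {cycles_to_ns(cycles, CPU_CLOCK_GHZ):6.2f} ns")
```

At 2.6GHz, 4 cycles works out to about 1.54 ns and 65 cycles to 25 ns, in line with the rounded table entries. At the assumed 1GHz GPU clock, cycles and nanoseconds are numerically equal.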

Credit
------
Adapted from: https://gist.github.com/jboner/2841832
Original curator: http://research.google.com/people/jeff/
Originally by Peter Norvig: http://norvig.com/21-days.html#answers

Additional Data Gathered/Correlated from:
-----------------------------------------
Memory latency tool: https://github.com/ssvb/tinymembench
CPU data from Agner Fog: http://www.agner.org/optimize/
CPU cache and QPI data: https://mechanical-sympathy.blogspot.com/2013/02/cpu-cache-flushing-fallacy.html
Intel performance analysis: https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
Intel Broadwell CPU data: http://users.atw.hu/instlatx64/GenuineIntel00306D4_Broadwell2_NewMemLat.txt
Intel SkyLake CPU data: http://www.7-cpu.com/cpu/Skylake.html
MVAPICH2 fabric testing: http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/2016/DK_Status_and_Roadmap_MUG16.pdf
NVMe SSD: http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-p3608-spec.pdf
SATA SSD: http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3510-spec.pdf
GPU optimization: https://www.olcf.ornl.gov/wp-content/uploads/2013/02/GPU_Opt_Fund-CW1.pdf
CPU/GPU data locality: https://people.maths.ox.ac.uk/gilesm/cuda/lecs/lecs.pdf
GPU Memory Hierarchy: https://arxiv.org/pdf/1509.02308&ved...qHEz78QnmcIVCSXvg&sig2=IdzxfrzQgNv8yq7e1mkeVg
Intel Xeon C-state data: http://ena-hpc.org/2014/pdf/paper_06.pdf