Full source code can be found here
It is well-known that a hard disk has a long delay from that we request the data to that we get the data. Usually we measure the hard disk latency in milliseconds which is an eternity for a CPU. The bandwidth of a hard disk is decent good as SSD:s today can reach 1 GiB/second.
What is less known is that RAM has the same characteristics, bad latency with good bandwidth.
You can measure RAM latency and badndwidth using Intel® Memory Latency Checker. On my machine the RAM latency under semi-high load is ~120 ns (The 3r:1w
bandwidth is 16GiB/second). This means that the CPU on my machine has to wait for ~400 cycles for data, an eternity.
To "fix" the latency problem over the years CPUs has accumulated more and more cache as well as advanced techniques to schedule reads earlier and delaying writes to avoid starving for data. So much in fact that only ~1% of the CPU die is used for computation and the rest is used to fix the problem of slow RAM. In addition, because bandwidth is good and latency is poor CPUs reads more data than what the CPU requested (prefetching).
The CPU on my machine has 3 layers of cache:
L1: 256KiB
L2: 1MiB
L3: 6MiB
A simple way of thinking about the cost of reading data is that if reading from L1 has the cost of 1
, the cost of reading from L2 is 10
, the cost of reading from L3 is 100
and the cost of reading from RAM is 1000
.
When we need to write highly performant code that process a lot of data we want to minimize the foot-print of the data to ensure as much as possible fits into the L1 cache and that we uses the bandwidth as efficiently as possible. Because of the prefetcher we should read data sequentially.
What lot of data means depends on the context. Sometimes it's 16KiB, sometimes it's 16TiB but as the L3 cache on my machine is 6MiB any data set bigger than 6MiB is considered "big" in my examples.
One of the fundamental principles of .NET (and Java) is that memory is managed. .NET prevents developers from unintentionally corrupting data and has a GC that can defragment the heap for us. .NET also hides the location of the memory for us as it's not important for the correctness of a program. The location is important when we need to ensure we access memory sequentially for performance.
One of the simplest ways to access data sequentially is to use arrays. Arrays are a continuous block of memory, if we traverse it from 0..last
we access the array data sequentially.
The actual data in an array is often a reference to an object which may or may not lay sequential in memory. The GC tries to preserve the order of the objects in the way they were created but it's not certain that correlates to how you will access the data.
.NET allows us to create arrays of struct
. A struct
in .NET is a value object which means that when we iterate an array of struct
we are indeed accessing the data sequentially.
In order to measure different data access schemes, I have created a few test cases that uses various patterns to access data.
The basic premise that we have a large set of particles that contain:
- Mass
- Current position
- Previous position
- Hot Data (data that would be accessed often)
- Cold Data (data that would be accessed seldom)
We will perform something called verlet integration over all particles, this is common in game physics. It's a very simple computation
let newPos = currentPos + currentPos - previousPos + globalAcceleration
previousPos <- currentPos
currentPos <- newPos
This computation only access current and previous position and doesn't look at neither the hot data or the cold data.
We obviously like the verlet integration to be as fast as possible.
With the mass, current and previous position, hot data and cold data a particle takes 216 bytes. We will have up to 1,000,000 particles which means the particles will take 216,000,0000 bytes which is way more than my 6MiB of L3 cache. Therefore we expect that the way we access data will have an impact.
For the test cases Class and Class (Shuffle) we create a Particle
class like this:
[<NoComparison>]
[<NoEquality>]
type Particle =
class
val mutable mass : float
val mutable current : Vector
val mutable previous : Vector
val mutable hot : HotData
val mutable cold : ColdData
new (mass, current, previous) = { mass = mass; current = current; previous = previous; hot = HotData (); cold = ColdData () }
member x.Mass = x.mass
member x.Position = x.current
member x.Verlet globalAcceleration =
let next = x.current + x.current - x.previous + globalAcceleration
x.previous <- x.current
x.current <- next
static member New timeStep mass position velocity =
let current = position
let previous = position - (timeStep * (velocity : Vector))
Particle (mass, current, previous)
end
We then create two arrays.
- For test case Class we iterate over the first array where all object references in the order the objects were created. This mean we access the particle data sequentially when we iterate the array.
- For test Class (Shuffle) we iterate over the second array where all object references are randomly shuffled. This mean we access the particle data randomly when we iterate the array.Note: if you store particles in a dictionary and iterate it the order is effectively random as well
For small sets of data we expect no difference in performance when iterating the arrays as the data fits into L1. When the data sets grow we are expecting Class to perform better because we make better use of the prefetcher.
For the test case Struct and Struct (Shuffle) we create a Particle
struct like this:
[<NoComparison>]
[<NoEquality>]
type Particle =
struct
val mutable mass : float
val mutable current : Vector
val mutable previous : Vector
val mutable hot : HotData
val mutable cold : ColdData
new (mass, current, previous) = { mass = mass; current = current; previous = previous; hot = HotData(); cold = ColdData () }
member x.Mass = x.mass
member x.Position = x.current
member x.Verlet globalAcceleration =
let next = x.current + x.current - x.previous + globalAcceleration
x.previous <- x.current
x.current <- next
static member New timeStep mass position velocity =
let current = position
let previous = position - (timeStep * (velocity : Vector))
Particle (mass, current, previous)
end
The only difference is that we use struct
over class
. We create two arrays like before, one in order of creation, the other shuffled. We expect no difference in performance between these two as struct
should guarantee that always access data sequentially.
For test case Hot & Cold in our fictitious program we have identified hot data (data that is accessed often) that is grouped together, cold data (accessed more seldomly) is put into a different group. For this particular algorithm we only need to access the hot data.
This is common approach to make sure we don't waste memory bandwidth on loading cold data which we don't access during verlet integration. We still have some overhead as there are hot data the verlet interation doesn't need but that are needed for the constraint relaxation.
It looks like this:
[<NoComparison>]
[<NoEquality>]
type Particle =
struct
val mutable mass : float
val mutable current : Vector
val mutable previous : Vector
val mutable hot : HotData
new (mass, current, previous) = { mass = mass; current = current; previous = previous; hot = HotData(); }
member x.Mass = x.mass
member x.Position = x.current
member x.Verlet globalAcceleration =
let next = x.current + x.current - x.previous + globalAcceleration
x.previous <- x.current
x.current <- next
static member New timeStep mass position velocity =
let current = position
let previous = position - (timeStep * (velocity : Vector))
Particle (mass, current, previous)
end
So the full particle data set is now two arrays hotDatum, coldDatum
. For most algorithms we just process the hot data.
When analyzing the jitted assembly code I found several inefficiencies like this:
; Loads Vector from array and store in local variable
00007ffb`8aa3436d f30f6f00 movdqu xmm0,xmmword ptr [rax]
00007ffb`8aa34371 f30f7f442418 movdqu xmmword ptr [rsp+18h],xmm0
00007ffb`8aa34377 4c8b4010 mov r8,qword ptr [rax+10h]
00007ffb`8aa3437b 4c89442428 mov qword ptr [rsp+28h],r8
; Loads same Vector from array and store in local variable
00007ffb`8aa34380 f30f6f00 movdqu xmm0,xmmword ptr [rax]
00007ffb`8aa34384 f30f7f0424 movdqu xmmword ptr [rsp],xmm0
00007ffb`8aa34389 4c8b4010 mov r8,qword ptr [rax+10h]
00007ffb`8aa3438d 4c89442410 mov qword ptr [rsp+10h],r8
; Loads first local variable into xmm0, xmm1, xmm2
00007ffb`8aa34392 4c8d442418 lea r8,[rsp+18h]
00007ffb`8aa34397 f2410f1000 movsd xmm0,mmword ptr [r8]
00007ffb`8aa3439c f2410f104808 movsd xmm1,mmword ptr [r8+8]
00007ffb`8aa343a2 f2410f105010 movsd xmm2,mmword ptr [r8+10h]
; Loads second local variable into xmm3, xmm4, xmm5, same content
00007ffb`8aa343a8 4c8d0424 lea r8,[rsp]
00007ffb`8aa343ac f2410f1018 movsd xmm3,mmword ptr [r8]
00007ffb`8aa343b1 f2410f106008 movsd xmm4,mmword ptr [r8+8]
00007ffb`8aa343b7 f2410f106810 movsd xmm5,mmword ptr [r8+10h]
00007ffb`8aa343bd f20f58c3 addsd xmm0,xmm3
00007ffb`8aa343c1 f20f58cc addsd xmm1,xmm4
00007ffb`8aa343c5 f20f58d5 addsd xmm2,xmm5
00007ffb`8aa343c9 4c8d442430 lea r8,[rsp+30h]
; Store intermediate result in third local variable
00007ffb`8aa343ce f2410f1100 movsd mmword ptr [r8],xmm0
00007ffb`8aa343d3 f2410f114808 movsd mmword ptr [r8+8],xmm1
00007ffb`8aa343d9 f2410f115010 movsd mmword ptr [r8+10h],xmm2
Overall there seems to be lots of roundtrips to and from invisible local variables that cost precious cycles as well as making the jitted code harder to understand.
I don't understand exactly why this is so but I suspect the jitter struggles to establish a holistic view that allows it to eliminate the intermediate results from each operation and therefore stores the results as invisible local variables to be sure. If the jitter was more advanced or had more CPU time available to it it should be able to eliminate the intermediate results.
To address this test case Hot & Cold (No Algebra) no longer rely on Vector algebra but instead uses a helper for the verlet integration:
let inline verlet (current : Vector) (previous : Vector) (globalAcceleration : Vector) : Vector =
let x = current.X + current.X - previous.X + globalAcceleration.X
let y = current.Y + current.Y - previous.Y + globalAcceleration.Y
let z = current.Z + current.Y - previous.Z + globalAcceleration.Z
Vector (x, y, z)
This function triplicate the logic but now only rely on float algebra. Floats are well-known to the jitter and it might be able to eliminate local variables better in this case.
Let's have a look:
; Load ea for current into r8
00007ffb`8aa246c9 488d4108 lea rax,[rcx+8]
00007ffb`8aa246cd 4c8bc0 mov r8,rax
; Load x,y,z for current into xmm0, xmm1, xmm2
00007ffb`8aa246d0 f2410f1000 movsd xmm0,mmword ptr [r8]
00007ffb`8aa246d5 f2410f104808 movsd xmm1,mmword ptr [r8+8]
00007ffb`8aa246db f2410f105010 movsd xmm2,mmword ptr [r8+10h]
; Load ea for previous into r8
00007ffb`8aa246e1 4c8d4120 lea r8,[rcx+20h]
; Load x,y,z for current into xmm3, xmm4, xmm5
00007ffb`8aa246e5 f2410f1018 movsd xmm3,mmword ptr [r8]
00007ffb`8aa246ea f2410f106008 movsd xmm4,mmword ptr [r8+8]
00007ffb`8aa246f0 f2410f106810 movsd xmm5,mmword ptr [r8+10h]
; current.X + current.X - previous.X + globalAcceleration.X
00007ffb`8aa246f6 f20f58c0 addsd xmm0,xmm0
00007ffb`8aa246fa f20f5cc3 subsd xmm0,xmm3
00007ffb`8aa246fe f20f5802 addsd xmm0,mmword ptr [rdx]
; current.Y + current.Y - previous.Y + globalAcceleration.Y
00007ffb`8aa24702 0f28d9 movaps xmm3,xmm1
00007ffb`8aa24705 f20f58d9 addsd xmm3,xmm1
00007ffb`8aa24709 f20f5cdc subsd xmm3,xmm4
00007ffb`8aa2470d f20f585a08 addsd xmm3,mmword ptr [rdx+8]
; current.Z + current.Y - previous.Z + globalAcceleration.Z
00007ffb`8aa24712 f20f58ca addsd xmm1,xmm2
00007ffb`8aa24716 f20f5ccd subsd xmm1,xmm5
00007ffb`8aa2471a f20f584a10 addsd xmm1,mmword ptr [rdx+10h]
00007ffb`8aa2471f 488d1424 lea rdx,[rsp]
; Vector (x, y, z)
00007ffb`8aa24723 f20f1102 movsd mmword ptr [rdx],xmm0
00007ffb`8aa24727 f20f115a08 movsd mmword ptr [rdx+8],xmm3
00007ffb`8aa2472c f20f114a10 movsd mmword ptr [rdx+10h],xmm1
It looks much better as we don't see writes of intermediate results. In addition this is the full verlet computation, before we only had the initial steps of the computation in about the same number of lines.
The test case Hot & Cold (No algebra) should perform better thanks to this.
Typically we create a class/struct and from that create an array that we iterate. This is somewhat comparable to Row-Stores
in databases. Each row/array entry holds a full object. Many databases also supports Column-Stores
as this is more efficient if you have lots of properties in an object but only accesses a few of them (like during report generation). We can apply the same approach to objects in memory which is called Structures of Arrays (SOA)
. What we normally do is called Arrays of Structures (AOS)
.
This will give a near optimal usage of cache and bandwidth but we get extra overhead because we will need to dereferenc several arrays for each computation.
For test case Structures of Arrays we implement it as this:
type Selection =
| A
| B
[<NoComparison>]
[<NoEquality>]
type Particles (timeStep, initParticles : InitParticle []) =
class
let count = initParticles.Length
let masses : float [] = initParticles |> Array.map (fun ip -> ip.Mass)
let positionsA : Vector [] = initParticles |> Array.map (fun ip -> ip.Position)
let positionsB : Vector [] = initParticles |> Array.mapi (fun i ip ->
let position = positionsA.[i]
let velocity = ip.Velocity
position - (timeStep * velocity)
)
let hotData : HotData [] = Array.zeroCreate count
let coldData : ColdData[] = Array.zeroCreate count
let mutable selection = Selection.A
let rec loop globalAcceleration (a : Vector []) (b : Vector []) i =
b.[i] <- verlet a.[i] b.[i] globalAcceleration
if i > 0 then loop globalAcceleration a b (i - 1)
member x.Verlet globalAcceleration =
selection <-
match selection with
| Selection.A ->
loop globalAcceleration positionsA positionsB (count - 1)
Selection.B
| Selection.B ->
loop globalAcceleration positionsB positionsA (count - 1)
Selection.A
member x.Positions =
match selection with
| Selection.A -> positionsA
| Selection.B -> positionsB
end
So instead of having a Particle
class we now have a class Particles
that manages all particles. The idea is that mass, currentPosition and nextPosition properties are stored as arrays. A "this
" pointer for a particle object is the index into the arrays.
In addition, the example avoids copying floats by letting positionsA and positionsB switch the roles of currentPosition and nextPosition.
Another small optimization is to loop towards 0 as this saves a register which saves us loading the length from the stack if we run out of registers. The prefetcher fetches both before and after the address we accessed so it gives no negative impact on performance.
Checking the jitted code we see that the end of loop checking is not optimal
00007ffb`8aa04bc3 85ff test edi,edi
00007ffb`8aa04bc5 7e07 jle 00007ffb`8aa04bce
00007ffb`8aa04bc7 ffcf dec edi
00007ffb`8aa04bc9 e977ffffff jmp 00007ffb`8aa04b4a
It would be better if it looked like this instead:
00007ffb`8aa04bc7 ffcf dec edi
00007ffb`8aa04bc5 7e07 jne 00007ffb`8aa04b4a
For tight loops this can make a difference. I suspect this is due to how tail recursion is unpacked in F# and should be possible to improve upon.
In .NET 4.6 we got SIMD enabled data structures like System.Numerics.Vector3
. This should give a performance improvement. An issue with .NET 4.6 SIMD is that System.Numerics.Vector3
only supports single precision floats. This divides the size of the test data by two so it's not a fair comparison. However, I thought it could be interesting to include anyway so I added test case Structures of Arrays (SIMD).
Since the jitter recognizes the SIMD types we can use vector algebra and avoid lot hidden local variables in the jitted code:
; Checking out of bounds for current array
00007ffb`8aa14c8a 8b4308 mov eax,dword ptr [rbx+8]
00007ffb`8aa14c8d 3bf0 cmp esi,eax
00007ffb`8aa14c8f 7367 jae 00007ffb`8aa14cf8
00007ffb`8aa14c91 4863c6 movsxd rax,esi
00007ffb`8aa14c94 486bc003 imul rax,rax,3
; Loads current into xmm0 (xmm0 contains all three float32s as this is SIMD)
00007ffb`8aa14c98 488d548310 lea rdx,[rbx+rax*4+10h]
00007ffb`8aa14c9d f30f104a08 movss xmm1,dword ptr [rdx+8]
00007ffb`8aa14ca2 f20f1002 movsd xmm0,mmword ptr [rdx]
00007ffb`8aa14ca6 0fc6c144 shufps xmm0,xmm1,44h
; Checking out of bounds for previous array
00007ffb`8aa14caa 8b5508 mov edx,dword ptr [rbp+8]
00007ffb`8aa14cad 3bf2 cmp esi,edx
00007ffb`8aa14caf 7347 jae 00007ffb`8aa14cf8
; Loads previous into xmm1 (xmm1 contains all three float32s as this is SIMD)
00007ffb`8aa14cb1 488d448510 lea rax,[rbp+rax*4+10h]
00007ffb`8aa14cb6 f30f105008 movss xmm2,dword ptr [rax+8]
00007ffb`8aa14cbb f20f1008 movsd xmm1,mmword ptr [rax]
00007ffb`8aa14cbf 0fc6ca44 shufps xmm1,xmm2,44h
; current + current
00007ffb`8aa14cc3 0f58c0 addps xmm0,xmm0
; - previous
00007ffb`8aa14cc6 0f5cc1 subps xmm0,xmm1
; Loads globalAcceleration (would be better to have this preloaded into xmm2 but oh well)
00007ffb`8aa14cc9 f30f105708 movss xmm2,dword ptr [rdi+8]
00007ffb`8aa14cce f20f100f movsd xmm1,mmword ptr [rdi]
00007ffb`8aa14cd2 0fc6ca44 shufps xmm1,xmm2,44h
; + globalAcceleration
00007ffb`8aa14cd6 0f58c1 addps xmm0,xmm1
; Stores result
00007ffb`8aa14cd9 f20f1100 movsd mmword ptr [rax],xmm0
00007ffb`8aa14cdd 660f70c802 pshufd xmm1,xmm0,2
00007ffb`8aa14ce2 f30f114808 movss dword ptr [rax+8],xmm1
; End of loop?
00007ffb`8aa14ce7 85f6 test esi,esi
00007ffb`8aa14ce9 7e04 jle 00007ffb`8aa14cef
; loop globalAcceleration a b (i - 1)
00007ffb`8aa14ceb ffce dec esi
00007ffb`8aa14ced eb96 jmp 00007ffb`8aa14c8a
Apart from not preloading globalAcceleration
into a xmm2
and the slight suboptimal end of loop check it looks good and we expect Structures of Arrays (SIMD) to do well.
These are the different test cases we like to measure:
- Class - Array of Particle classes in the order of creation
- Class (Shuffle) - Array of Particle classes in random order
- Struct - Array of Particle structs in the order of creation
- Struct (Shuffle) - Array of Particle structs in random order
- Hot & Cold - Separated Hot & Cold into two arrays
- Hot & Cold (No Algebra) - Separated Hot & Cold into two arrays and avoids vector algebra
- Structures of Arrays - Uses
SOA
overAOS
- Structures of Arrays (SIMD) - Uses
SOA
overAOS
with .NET 4.6 SIMD Note; .NET SIMD only support single float precision so not exactly equivalent test. - Structures of Arrays (SSE2) - Uses
SOA
overAOS
with C++ SSE2 intrinsics - Structures of Arrays (AVX) - Uses
SOA
overAOS
with C++ AVX intrinsics - Structures of Arrays (SIMD2) - Uses
SOA
overAOS
with .NET 4.6 + System.Numerics.Vectors for 64 bit floats
We clearly see that Class (Shuffle) degrades a lot compared to the alternatives. This is because when the data no longer fits into the L3 cache and we access the data randomly we get no benefit of the prefetcher and the CPU has to wait a long time for data. We also see that the Class (Shuffle) starts degrading in several steps. This is because there is prefetching between L1, L2 and L3 as well.
In no other cases shuffling seems to effect performance negatively as expected.
Class, Struct and Struct (Shuffle) degrades somewhat when we run out of L3 cache but because of prefetching it never gets as bad as Class (Shuffle).
Hot & Cold (both variants) don't seem to degrade even though we run out L3 cache. This is because the prefetcher manages to stay ahead of the CPU.
Hot & Cold (No Algebra) does better because we managed to eliminate writes of intermediate results by refactoring the code.
Structures of Arrays (both variants) does the best. As Hot & Cold doesn't degrade due to data starvation it sholdn't be because of more efficient use bandwidth. Instead, it's mainly because the way the code is written we avoid a method call for each particle. When we are calling a method we first marshalls the input arguments and then the method unmarshalls them. As the verlet computation is quite cheap the method call itself is taking some time.
Structures of Arrays (SIMD) is the fastest alternative which we were expecting.
In order to easier estimate the order on how the performance depend on the data size I provide a chart where x-axis and y-axis is logarithmic
Added results for Structures of Arrays (SSE2) and Structures of Arrays (AVX).
The assembler code looks good for (AVX)
; Note: AVX processes 4 doubles per operation
; Load current x 4
00007FF6E4ED14E0 vmovupd ymm0,ymmword ptr [r10+rdx]
; Load previous x 4
00007FF6E4ED14E6 vmovupd ymm1,ymmword ptr [rdx]
; Save rdx for when storing result
00007FF6E4ED14EA mov rax,rdx
; Increment pointer
00007FF6E4ED14ED add rdx,20h
; current + current x 4
00007FF6E4ED14F1 vaddpd ymm0,ymm0,ymm0
; - previous x 4
00007FF6E4ED14F5 vsubpd ymm1,ymm0,ymm1
; + globalAccelerator x 4
00007FF6E4ED14F9 vaddpd ymm2,ymm1,ymm4
; Store result x 4
00007FF6E4ED14FD vmovapd ymmword ptr [rax],ymm2
; Decrement counter
00007FF6E4ED1501 sub r8,1
; If > 0 then loop
00007FF6E4ED1505 jne `anonymous namespace'::particles<`anonymous namespace'::avx>::verlet_axis+80h (07FF6E4ED14E0h)
The results are interesting. The AVX code is as expected doing really well but then surprisingly drops off sharply at 586 particles. My assumption is that we run out of L1 cache at this point but because we are hitting the bandwidth limit of L2 the prefetcher can't keep up. L1 peak rate seems to be around 74 GiB/second, when we hit L2 the rate falls to around 43 GiB/second)
The reason AVX and SSE2 falls behind Structures of Arrays (SIMD) should be that the latter uses float32
over float
which means the bandwidth requirement is about half.
As pointed out by Jack Mott (@jackmott) there is actually 64 bit SIMD support in .NET. Frank Niemeyer (@FrankNiemeyer) pointed out that with basic types in SOA one should be able to make better use of SIMD.
Structures of Arrays (SIMD2) makes use of 64 bit SIMD and basic types. The code processes each axis independently like this:
let rec loop globalAcceleration (a : Vector2 []) (b : Vector2 []) i =
let current = a.[i]
let previous = b.[i]
let next = current + current - previous + globalAcceleration
b.[i] <- next
if i > 0 then loop globalAcceleration a b (i - 1)
The assembler code looks decent:
; let current = a.[i]
; Array out of bounds?
00007ffc`6cfd5190 3bc1 cmp eax,ecx
00007ffc`6cfd5192 733b jae 00007ffc`6cfd51cf
00007ffc`6cfd5194 4c63d8 movsxd r11,eax
00007ffc`6cfd5197 49c1e304 shl r11,4
; Load current (two doubles) into xmm0
00007ffc`6cfd519b 430f10441810 movups xmm0,xmmword ptr [r8+r11+10h]
; let previous = b.[i]
; Array out of bounds?
00007ffc`6cfd51a1 413bc2 cmp eax,r10d
00007ffc`6cfd51a4 7329 jae 00007ffc`6cfd51cf
00007ffc`6cfd51a6 4f8d5c1910 lea r11,[r9+r11+10h]
; Load previous (two doubles) into xmm1
00007ffc`6cfd51ab 410f100b movups xmm1,xmmword ptr [r11]
; current + current
00007ffc`6cfd51af 660f58c0 addpd xmm0,xmm0
; - previous
00007ffc`6cfd51b3 660f5cc1 subpd xmm0,xmm1
; Load global accleration (two doubles) into xmm1
00007ffc`6cfd51b7 0f100a movups xmm1,xmmword ptr [rdx]
; + globalAcceleration
00007ffc`6cfd51ba 660f58c1 addpd xmm0,xmm1
; b.[i] <- next
00007ffc`6cfd51be 410f1103 movups xmmword ptr [r11],xmm0
; End of loop?
00007ffc`6cfd51c2 85c0 test eax,eax
00007ffc`6cfd51c4 7e04 jle 00007ffc`6cfd51ca
; No, let's loop backwards
00007ffc`6cfd51c6 ffc8 dec eax
00007ffc`6cfd51c8 ebc6 jmp 00007ffc`6cfd5190
Apart from unnecessary checking of array out of bounds and a inefficient loop check at the end the rest of the code looks ok.
From the results we see that the performance of Structures of Arrays (SIMD2) is close Structures of Arrays (SIMD) at smaller data sets however since Structures of Arrays (SIMD2) uses 64 bit floats it seems we run out of main memory bandwidth at larger data sets. We reach a plateau at ~300,000 particles. Structures of Arrays (SIMD) should have the same plateau at ~600,000 particles (data size is half as big) but it seems the memory bandwidth can keep up and the performance remains constant for all data sizes.
For Class (Shuffle) one sees three plateaus in the test data that are delimited by a linear growth of performance (as the axes are both logarithmic).
I have marked the three plateaus with A
, B
and C
.
An explanation for the three plateaus could be that the A
plateau is when the data fits into L1 cache. For plateau B
the data still fit into L2 but because L2 is slower the performance degrades somewhat. C
plateau then matches L3 cache.
As nice explanation as this is it doesn't match the input data.
Consider plateau B
. It starts at 210 particles and end at 1485 particles. Since the size of a particle reported by .NET is 216 bytes (also matches manual estimate) that would mean the A
runs out at 44KiB and B
runs out at 313KiB but that doesn't match the numbers for my machine. In addition, plateau C
ends at 1830KiB.
The plateaus are interesting but I don't know how to correctly interpret them at this point.
Finally, it seems that the performance for Class (Shuffle) asymptotically goes towards 1200 ms. As it starts at 200ms it means that the memory latency overhead is 1 second. We are processing 10,000,000 particles that means that the latency per particle is 100 ns which is quite close to the 120 ns reported by Intel® Memory Latency Checker tool.
A former colleague of mine, Henning, pointed out that the Core I5 has 64KiB of L1 cache per core, the total for all 4 cores is 256KiB. In addition, only 32KiB is data cache the other 32KiB is instruction cache. The L2 cache is also just 256KiB per core but 1MiB in total. The L3 cache is 6MiB shared for all 4 cores.
So plateau A
ends at 210 particles which is then 44KiB, this number is close to 32KiB L1 data cache available to the core. Looking more closely at the charts it seems plateau B
ends at at 1123 particles which is 236KiB which is close to the 256KiB L2 cache available to the core.
So the numbers and the input data matches decently for L1 and L2 cache now.
To be honest most developers don't need top-notch CPU performance, most are happy with decent CPU performance. However, I know that there are a few out there that needs to process a lot of data in a short amount of time.
For those guys it's very important to minimize the size of the data (in order to fit as much in cache as possible) as well as making sure to access the data sequentially (to make efficient use of the prefetcher).
A managed language environment like .NET and Java doesn't allow direct control of the memory layout of objects but since the GC tries to maintain the order the objects were created as long as we allocate them in one go (to avoid hole in the allocations) and access them in that order we make good use of the prefetcher.
In addition, .NET has struct
types that allows us to create arrays of struct
which guarantees that as long as we iterate the array in order we access the memory sequentially.
A common pattern to make better use of the memory bandwidth and cache is to split the data into hot and cold data. However, this requires intimate knowledge of the data access patterns or testing to get right.
That's why I think Structures of Arrays (SOA)
is interesting as it will allow optimum usage of the cache and bandwidth but with the drawback the we need to dereference several arrays during the computation. When we run out of registers that means loading the pointers from the stack which itself is costly. Testing again is important to get it right.
The blog post was intended to just be about data sizes and how it effects performance but when I started looking at the jitted code I found some efficiency problems I just had to address. I hope it's ok.
I hope this was interesting to you.
Mårten
.net SIMD works on doubles, assuming your cpu has AVX2
you may be interested in: https://github.com/jackmott/SIMDArray