In a recent paper, a method is described for calculating the loop-carried dependencies (LCD), if any, of an assembly-level loop, as follows:
OSACA can detect LCDs by creating a DAG of a code comprising two back-to-back copies of the loop body. It can thus analyze all possible paths from each vertex of the first kernel section and detect cyclic LCDs if there exists a dependency chain from one instruction form to its corresponding duplicate.
However, I don't think this is sufficient to catch all loop-carried dependencies. In particular, you can have a case where a dependency is loop carried, but where no vertex (instruction) in the second iteration depends on the same vertex1 in the first. Rather, some other vertex in the second iteration depends on the first, and the cycle is only completed in some subsequent iteration.
An example:
add eax, 1 ; A (deps: E-previous)
add eax, 1 ; B (deps: A)
add eax, 1 ; C (deps: B)
; swap eax, ebx (exploded version of xchg eax, ebx)
mov ecx, eax ; D (deps: C)
mov eax, ebx ; E (deps: F-previous)
mov ebx, ecx ; F (deps: D)
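To make the gap concrete, here is a small Python sketch (the modeling and all names are mine, not OSACA's) that encodes the six instructions as a dependency graph, unrolls k back-to-back copies of the body, and checks, as the paper describes, whether any instruction in the last copy can reach its corresponding duplicate in the first:

```python
# Intra-iteration deps: consumer -> producers (same iteration)
intra = {"B": ["A"], "C": ["B"], "D": ["C"], "F": ["D"]}
# Cross-iteration deps: consumer -> producers from the *previous* iteration
cross = {"A": ["E"], "E": ["F"]}

def unrolled_edges(k):
    """Edges (consumer, producer) for k back-to-back copies of the body.
    Vertices are (name, copy_index)."""
    edges = []
    for i in range(k):
        for x, ys in intra.items():
            for y in ys:
                edges.append(((x, i), (y, i)))
        if i > 0:
            for x, ys in cross.items():
                for y in ys:
                    edges.append(((x, i), (y, i - 1)))
    return edges

def has_path(edges, src, dst):
    """Depth-first search over the (consumer -> producer) edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
    seen, stack = set(), [src]
    while stack:
        v = stack.pop()
        if v == dst:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(adj.get(v, []))
    return False

def lcd_detected(k):
    """True if some instruction in the last copy depends on its duplicate
    in the first copy -- the check described in the quoted paper."""
    edges = unrolled_edges(k)
    return any(has_path(edges, (n, k - 1), (n, 0)) for n in "ABCDEF")

print(lcd_detected(2))  # False: two copies miss the carried dependency
print(lcd_detected(3))  # True: the cycle only closes after two iterations
```

With two copies, no vertex reaches its own duplicate (A's chain dead-ends at E of the prior copy, E's chain dead-ends at A), so the check as quoted reports no LCD; with three copies the cycle appears. A check that instead looked for any cycle in the graph of the loop body with back-edges to itself (a strongly connected component) would catch this case regardless of how many iterations the cycle spans.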
Is there a loop-carried dependency here? Yes, there is: the chain of add instructions in each even iteration depends on the result of the chain in the preceding even iteration, and the same is true for odd iterations. In particular, there is an intra-iteration chain A->B->C->D->F, but this chain only depends on E from the previous iteration, which doesn't appear in the chain. The other chain is simply E, which depends on F from the previous iteration. So the dependency is carried, but it takes two iterations for any dependency to flow back to the same vertex.
We see then that there are two equivalent carried dependency chains of length 3 (if we assume the movs are eliminated with zero latency), but each takes 2 iterations, so the total LCD length is 1.5 per iteration. This also shows that you can have non-integer LCD lengths when measured per iteration!
I would show the OSACA-generated graphs here, but I couldn't get it to run, so you'll have to use your imagination. If anyone wants to contribute ASCII art, I'll include it :).
1 By same I mean "its corresponding duplicate" as in the quoted portion of the paper.
Here, I would have to disagree, unless HPC code is very different from most other code. Most memory-to-memory dependencies in a loop would be dynamic, due to occasionally or often overlapping writes and reads, not something you could determine statically. That's because if you could determine it statically, so could the compiler (in fact, the compiler has even more semantic information at the HLL level, so it has an easier job than one analyzing the assembly), and it would try as hard as it could to keep these values in a register. If it does need to spill due to register pressure, it would choose read-only values first, since memory write-read dependencies are costly due to their long latency.
So you would usually find this case only in the unusual situation that the compiler has to spill so many values that they include locations both read and written in the same loop.
Now, you do find this case all the time in regular code - but not inside high-performance loops. You find it, e.g., across function call boundaries, where arguments or return values are written on one side and read on the other, but I think that's outside the scope of OSACA.
All that is to say: I wouldn't put too much faith in static analysis for memory dependencies; annotations, I think, are much more promising.
I admit I could be totally wrong about this, e.g., by not considering some scenario that leads to lots of write-read dependencies that can't be kept in registers.