In a recent paper, a method is described for calculating the loop-carried dependencies (LCD), if any, of a assembly level loop, as follows:
OSACA can detect LCDs by creating a DAG of a code comprising two back-to-back copies of the loop body. It can thus analyze all possible paths from each vertex of the first kernel section and detect cyclic LCDs if there exists a dependency chain from one instruction form to its corresponding duplicate.
However, I don't think this is sufficient to catch all loop carried dependencies. In particular, you can have a case where a dependency is loop carried, but where no vertex (instruction) in the second iteration depends on the same vertex1 in the first. Rather, some other vertex in the second depends on the first, and in some subsequent iteration, the cycle is completed.
An example:
add eax, 1 ; A (deps: E-previous)
add eax, 1 ; B (deps: A)
add eax, 1 ; C (deps: B)
; swap eax, ebx (exploded version of xchg eax, ebx)
mov ecx, eax ; D (deps: C)
mov eax, ebx ; E (deps: F-previous)
mov ebx, ecx ; F (deps: D)
Is there a loop carried dependecy here? Yes, there is: the chain of add
instructions in all even iterations depends on the result of the chain in the preceeding even iteration, and the same is true for odd iterations. In particular, there is an intra-iteration chain A->B->C->D->F
, but this chain only depends on E
from the previous iteration which doens't appear in the chain. The other chain is simply E
, which depends on F
from the previous iteration. So the dependency is carried, but it takes two iterations for any dependency to flow back to the same vertex.
We see then that there are two equivalent carried dependecy chains of length 3 (if we assume the mov
are eliminated with zero latency), but each takes 2 iterations, so the total LCD length is 1.5. This also shows you can have non-integer LCD lengths when measured per iteration!
I would show the OSACA generated graphs here, but I couldn't get it to run, so you'll have to use your imagination. If anyone wants to contribute ASCII art, I'll include it :).
1 By same I mean "its corresponding dulicate" as in the quoted portion of the paper.
I created this other issue for the memory dependency thing. On Twitter, Georg Hager mentioned that the type of "skips an iteration" thing mentioned here usually go though memory dependencies, but I'm not sure about that. I could be - but in any case I guess at a fundamental level the issues are more or less orthogonal so it makes sense to have two separate issues?
BTW, I while I think you can solve the first issue fairly easily, I am not sure how you could solve the memory dependence issue complete, since this is static analysis without knowledge of memory addresses - certainly you can prove that certain loads must alias certain earlier stores, but of course this is not sufficient in general since other loads may alias depending on unknown register values. I left a more detailed comment in the issue.