In a recent paper, a method is described for calculating the loop-carried dependencies (LCD), if any, of a assembly level loop, as follows:
OSACA can detect LCDs by creating a DAG of a code comprising two back-to-back copies of the loop body. It can thus analyze all possible paths from each vertex of the first kernel section and detect cyclic LCDs if there exists a dependency chain from one instruction form to its corresponding duplicate.
However, I don't think this is sufficient to catch all loop carried dependencies. In particular, you can have a case where a dependency is loop carried, but where no vertex (instruction) in the second iteration depends on the same vertex1 in the first. Rather, some other vertex in the second depends on the first, and in some subsequent iteration, the cycle is completed.
An example:
add eax, 1 ; A (deps: E-previous)
add eax, 1 ; B (deps: A)
add eax, 1 ; C (deps: B)
; swap eax, ebx (exploded version of xchg eax, ebx)
mov ecx, eax ; D (deps: C)
mov eax, ebx ; E (deps: F-previous)
mov ebx, ecx ; F (deps: D)
Is there a loop carried dependecy here? Yes, there is: the chain of add
instructions in all even iterations depends on the result of the chain in the preceeding even iteration, and the same is true for odd iterations. In particular, there is an intra-iteration chain A->B->C->D->F
, but this chain only depends on E
from the previous iteration which doens't appear in the chain. The other chain is simply E
, which depends on F
from the previous iteration. So the dependency is carried, but it takes two iterations for any dependency to flow back to the same vertex.
We see then that there are two equivalent carried dependecy chains of length 3 (if we assume the mov
are eliminated with zero latency), but each takes 2 iterations, so the total LCD length is 1.5. This also shows you can have non-integer LCD lengths when measured per iteration!
I would show the OSACA generated graphs here, but I couldn't get it to run, so you'll have to use your imagination. If anyone wants to contribute ASCII art, I'll include it :).
1 By same I mean "its corresponding dulicate" as in the quoted portion of the paper.
Hi Travis,
I'm one of the developers of OSACA and first of all want to thank you for being interested in our tool :)!
We are always happy about discussing strengths and weaknesses and are open for any suggestions of enhancement.
Second of all, you are right with your hypothesis, OSACA can not identify the dependency chain created by you.
I quickly reproduced your example
and this is how the output including the dependency graph would look like:
As you can see, no loop-carried dependency is detected, which is because we only unroll the loop twice.
Your and similar examples therefore could be covered by unrolling the loop N times, which still would not be totally sound for all possible loop-carried dependencies since
N
is finite.This also applies to dependencies inside the memory, e.g.:
Here in each iteration the value in
rax
is stored in the memory (movq %rax, 8(%rbx)
)and will be loaded again in the upcoming iteration (movq (%rbx), %rax
).This is not typical for the types of code we analyze, but at least simple cases should be recognized in future releases.
Nevertheless, we are aware of these weaknesses and they are on our list of improvement in the future!