Last active
November 19, 2017 06:45
-
-
Save rygorous/22180ced9c7a00bd68dd to your computer and use it in GitHub Desktop.
FMA3 codegen
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
So the general idea with FMA3 is that it's designed to have enough degrees of | |
freedom so you can just have a more general FMA4 op in the IR and pick the right | |
FMA3 op *very* late (ideally, after register allocation!). | |
The generalized FMA op looks like this: | |
dst = FMA ±src0, ±src1, ±src2 | |
where at most one of the src's can be a memory operand. This computes | |
(±src0 * ±src1) ± src2. The two signs for src0 and src1 combine | |
into a sign for the multiply result and a sign for the add, so in terms of | |
instruction encoding we need two bits for signs, for the four possible combinations: | |
+(src0 * src1) + src2 [VFMADD] | |
+(src0 * src1) - src2 [VFMSUB] | |
-(src0 * src1) + src2 [VFNMADD] | |
-(src0 * src1) - src2 [VFNMSUB] | |
and as you can see all these options exist in the ISA. | |
- If dst, src0, src1, src2 are all pair-wise distinct, then we can't generate this | |
instruction directly. To legalize, we need to insert a MOV to seed one of the | |
operands (doesn't matter which) into the destination register, after which we're | |
in the restricted "at most 3 distinct registers" case. | |
- If dst appears as one of src0/src1/src2, we use the permutation of VFM* that has | |
the dst register as its first operand. There are 6 permutations of 3 elements; | |
we can swap the two arguments to multiply without changing the result, so we only | |
need to encode 3 such permutations, and FMA3 has a sufficient set. | |
This works even with memory operands (despite the additional constraint that the | |
memory operand has to be the 3rd source op of the encoded instruction). I don't | |
remember the details anymore though, but it all works out. | |
So in practice you just treat it as a FMA4 architecture and only really pick the | |
instruction you use very late. In some cases you end up having to insert an extra | |
MOV, but unless all 3 source operands stay live after the operation, you can always | |
avoid this - provided your register allocator knows that you'd like it to reuse | |
dead registers ASAP (i.e. ideally within the same instruction) if they appear as | |
source operands. | |
---- | |
Explicit enumeration of cases: | |
1. |{dst, src0, src1, src2}| = 4, i.e. 4 distinct operands - need a temp MOV. | |
This is true whether there's a memory operand involved or not. | |
2. VFop dst, dst, src1, src2 (dst=src0 and no mem ops) | |
VFop dst, src0, dst, src2 (dst=src1 and no mem ops) | |
VFop dst, dst, [src1], src2 (dst=src0 with mem op on mul) | |
VFop dst, [src0], dst, src2 (dst=src1 with mem op on mul) | |
The two pairs are equivalent since src0/src1 commute, so you can swap src0 and src1 | |
operands on the source op if necessary. So without loss of generality, dst=src0. | |
-> Emit: VFop132 dst, src2, src1 | |
(note src1 is in the mem-op position already) | |
3. VFop dst, dst, src1, [src2] (dst=src0 with mem op on add) | |
VFop dst, src0, dst, [src2] (dst=src1 with mem op on add) | |
Symmetric again, so w.l.o.g. dst=src0. | |
-> Emit: VFop213 dst, src1, [src2] | |
4. VFop dst, src0, src1, dst (dst=src2 and no mem ops) | |
VFop dst, [src0], src1, dst (dst=src2 with mem op on mul) | |
VFop dst, src0, [src1], dst (dst=src2 with mem op on mul) | |
Again symmetric, so w.l.o.g. src0 is the memory operand. | |
-> Emit: VFop231 dst, src1, src0 | |
That covers all cases (unless I missed something, but I don't think so). | |
So, the actual algorithm is this: | |
// input: VFop dst, src0, src1, src2; <=1 of the three sources a mem operand. | |
if (all_4_distinct) { | |
emit(MOV(dst, src0)); | |
src0 = dst; // now we're in 3-distinct case. | |
} | |
if (dst == src1) | |
swap(src0, src1); | |
// now either dst=src0 or dst=src2. | |
if (dst == src0) { | |
if (is_mem_op(src2)) | |
emit(VFop213(dst, src1, src2)); | |
else | |
emit(VFop132(dst, src2, src1)); | |
} else { | |
if (is_mem_op(src0)) | |
emit(VFop231(dst, src1, src0)); | |
else | |
emit(VFop231(dst, src0, src1)); | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment