Skip to content

Instantly share code, notes, and snippets.

@rygorous
Last active November 19, 2017 06:45
Show Gist options
  • Save rygorous/22180ced9c7a00bd68dd to your computer and use it in GitHub Desktop.
Save rygorous/22180ced9c7a00bd68dd to your computer and use it in GitHub Desktop.
FMA3 codegen
So the general idea with FMA3 is that it's designed to have enough degrees of
freedom so you can just have a more general FMA4 op in the IR and pick the right
FMA3 op *very* late (ideally, after register allocation!).
The generalized FMA op looks like this:
dst = FMA ±src0, ±src1, ±src2
where at most one of the src's can be a memory operand. This computes
(±src0 * ±src1) ± src2. The two signs for src0 and src1 combine
into a sign for the multiply result and a sign for the add, so in terms of
instruction encoding we need two bits for signs, for the four possible combinations:
+(src0 * src1) + src2 [VFMADD]
+(src0 * src1) - src2 [VFMSUB]
-(src0 * src1) + src2 [VFNMADD]
-(src0 * src1) - src2 [VFNMSUB]
and as you can see all these options exist in the ISA.
- If dst, src0, src1, src2 are all pair-wise distinct, then we can't generate this
instruction directly. To legalize, we need to insert a MOV to seed one of the
operands (doesn't matter which) into the destination register, after which we're
in the restricted "at most 3 distinct registers" case.
- If dst appears as one of src0/src1/src2, we use the permutation of VFM* that has
the dst register as its first operand. There are 6 permutations of 3 elements;
we can swap the two arguments to multiply without changing the result, so we only
need to encode 3 such permutations, and FMA3 has a sufficient set.
This works even with memory operands (despite the additional constraint that the
memory operand has to be the 3rd source op of the encoded instruction). I don't
remember the details anymore though, but it all works out.
So in practice you just treat it as a FMA4 architecture and only really pick the
instruction you use very late. In some cases you end up having to insert an extra
MOV, but unless all 3 source operands stay live after the operation, you can always
avoid this - provided your register allocator knows that you'd like it to reuse
dead registers ASAP (i.e. ideally within the same instruction) if they appear as
source operands.
----
Explicit enumeration of cases:
1. |{dst, src0, src1, src2}| = 4, i.e. 4 distinct operands - need a temp MOV.
This is true whether there's a memory operand involved or not.
2. VFop dst, dst, src1, src2 (dst=src0 and no mem ops)
VFop dst, src0, dst, src2 (dst=src1 and no mem ops)
VFop dst, dst, [src1], src2 (dst=src0 with mem op on mul)
VFop dst, [src0], dst, src2 (dst=src1 with mem op on mul)
The two pairs are equivalent since src0/src1 commute, so you can swap src0 and src1
operands on the source op if necessary. So without loss of generality, dst=src0.
-> Emit: VFop132 dst, src2, src1
(note src1 is in the mem-op position already)
3. VFop dst, dst, src1, [src2] (dst=src0 with mem op on add)
VFop dst, src0, dst, [src2] (dst=src1 with mem op on add)
Symmetric again, so w.l.o.g. dst=src0.
-> Emit: VFop213 dst, src1, [src2]
4. VFop dst, src0, src1, dst (dst=src2 and no mem ops)
VFop dst, [src0], src1, dst (dst=src2 with mem op on mul)
VFop dst, src0, [src1], dst (dst=src2 with mem op on mul)
Again symmetric, so w.l.o.g. src0 is the memory operand.
-> Emit: VFop231 dst, src1, src0
That covers all cases (unless I missed something, but I don't think so).
So, the actual algorithm is this:
// input: VFop dst, src0, src1, src2; <=1 of the three sources a mem operand.
if (all_4_distinct) {
emit(MOV(dst, src0));
src0 = dst; // now we're in 3-distinct case.
}
if (dst == src1)
swap(src0, src1);
// now either dst=src0 or dst=src2.
if (dst == src0) {
if (is_mem_op(src2))
emit(VFop213(dst, src1, src2));
else
emit(VFop132(dst, src2, src1));
} else {
if (is_mem_op(src0))
emit(VFop231(dst, src1, src0));
else
emit(VFop231(dst, src0, src1));
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment