rygorous · November 19, 2017 06:45
diff --git a/fma3_codegen.txt b/fma3_codegen.txt
 So the general idea with FMA3 is that it's designed to have enough degrees of
 freedom so you can just have a more general FMA4 op in the IR and pick the right
 FMA3 op *very* late (ideally, after register allocation!).

 The generalized FMA op looks like this:

  dst = FMA ±src0, ±src1, ±src2

 where at most one of the src's can be a memory operand. This computes
 (±src0 * ±src1) ± src2. The two signs for src0 and src1 combine
 into a sign for the multiply result and a sign for the add, so in terms of
 instruction encoding we need two bits for signs, for the four possible combinations:

  +(src0 * src1) + src2   [VFMADD]
  +(src0 * src1) - src2   [VFMSUB]
  -(src0 * src1) + src2   [VFNMADD]
  -(src0 * src1) - src2   [VFNMSUB]
  
 and as you can see all these options exist in the ISA.

 - If dst, src0, src1, src2 are all pair-wise distinct, then we can't generate this
  instruction directly. To legalize, we need to insert a MOV to seed one of the
  operands (doesn't matter which) into the destination register, after which we're
  in the restricted "at most 3 distinct registers" case.
 - If dst appears as one of src0/src1/src2, we use the permutation of VFM* that has
  the dst register as its first operand. There are 6 permutations of 3 elements;
  we can swap the two arguments to multiply without changing the result, so we only
  need to encode 3 such permutations, and FMA3 has a sufficient set.
  
 This works even with memory operands (despite the additional constraint that the
 memory operand has to be the 3rd source op of the encoded instruction). I don't
 remember the details anymore though, but it all works out.

 So in practice you just treat it as a FMA4 architecture and only really pick the
 instruction you use very late. In some cases you end up having to insert an extra
 MOV, but unless all 3 source operands stay live after the operation, you can always
 avoid this - provided your register allocator knows that you'd like it to reuse
 dead registers ASAP (i.e. ideally within the same instruction) if they appear as
 source operands.

 ----

 Explicit enumeration of cases:

 1. |{dst, src0, src1, src2}| = 4, i.e. 4 distinct operands - need a temp MOV.
   This is true whether there's a memory operand involved or not.
   
 2. VFop dst, dst, src1, src2    (dst=src0 and no mem ops)
   VFop dst, src0, dst, src2    (dst=src1 and no mem ops)
   VFop dst, dst, [src1], src2  (dst=src0 with mem op on mul)
   VFop dst, [src0], dst, src2  (dst=src1 with mem op on mul)
   
   The two pairs are equivalent since src0/src1 commute, so you can swap src0 and src1
   operands on the source op if necessary. So without loss of generality, dst=src0.
 
   -> Emit: VFop132 dst, src2, src1
      (note src1 is in the mem-op position already)

 3. VFop dst, dst, src1, [src2] (dst=src0 with mem op on add)
   VFop dst, src0, dst, [src2] (dst=src1 with mem op on add)
   
   Symmetric again, so w.l.o.g. dst=src0.
   
   -> Emit: VFop213 dst, src1, [src2]
   
 4. VFop dst, src0, src1, dst  (dst=src2 and no mem ops)
   VFop dst, [src0], src1, dst (dst=src2 with mem op on mul)
   VFop dst, src0, [src1], dst (dst=src2 with mem op on mul)

   Again symmetric, so w.l.o.g. src0 is the memory operand.

   -> Emit: VFop231 dst, src1, src0

 That covers all cases (unless I missed something, but I don't think so).

 So, the actual algorithm is this:

    // input: VFop dst, src0, src1, src2; <=1 of the three sources a mem operand.

    if (all_4_distinct) {
        emit(MOV(dst, src0));
        src0 = dst; // now we're in 3-distinct case.
    }
    
    if (dst == src1)
        swap(src0, src1);
    // now either dst=src0 or dst=src2.

    if (dst == src0) {
        if (is_mem_op(src2))
            emit(VFop213(dst, src1, src2));
        else
            emit(VFop132(dst, src2, src1));
    } else {
        if (is_mem_op(src0))
            emit(VFop231(dst, src1, src0));
        else
            emit(VFop231(dst, src0, src1));
    }
	So the general idea with FMA3 is that it's designed to have enough degrees of
	freedom so you can just have a more general FMA4 op in the IR and pick the right
	FMA3 op very late (ideally, after register allocation!).

	The generalized FMA op looks like this:

	dst = FMA ±src0, ±src1, ±src2

	where at most one of the src's can be a memory operand. This computes
	(±src0 * ±src1) ± src2. The two signs for src0 and src1 combine
	into a sign for the multiply result and a sign for the add, so in terms of
	instruction encoding we need two bits for signs, for the four possible combinations:

	+(src0 * src1) + src2 [VFMADD]
	+(src0 * src1) - src2 [VFMSUB]
	-(src0 * src1) + src2 [VFNMADD]
	-(src0 * src1) - src2 [VFNMSUB]

	and as you can see all these options exist in the ISA.

	- If dst, src0, src1, src2 are all pair-wise distinct, then we can't generate this
	instruction directly. To legalize, we need to insert a MOV to seed one of the
	operands (doesn't matter which) into the destination register, after which we're
	in the restricted "at most 3 distinct registers" case.
	- If dst appears as one of src0/src1/src2, we use the permutation of VFM* that has
	the dst register as its first operand. There are 6 permutations of 3 elements;
	we can swap the two arguments to multiply without changing the result, so we only
	need to encode 3 such permutations, and FMA3 has a sufficient set.

	This works even with memory operands (despite the additional constraint that the
	memory operand has to be the 3rd source op of the encoded instruction). I don't
	remember the details anymore though, but it all works out.

	So in practice you just treat it as a FMA4 architecture and only really pick the
	instruction you use very late. In some cases you end up having to insert an extra
	MOV, but unless all 3 source operands stay live after the operation, you can always
	avoid this - provided your register allocator knows that you'd like it to reuse
	dead registers ASAP (i.e. ideally within the same instruction) if they appear as
	source operands.

	----

	Explicit enumeration of cases:

	1. \|{dst, src0, src1, src2}\| = 4, i.e. 4 distinct operands - need a temp MOV.
	This is true whether there's a memory operand involved or not.

	2. VFop dst, dst, src1, src2 (dst=src0 and no mem ops)
	VFop dst, src0, dst, src2 (dst=src1 and no mem ops)
	VFop dst, dst, [src1], src2 (dst=src0 with mem op on mul)
	VFop dst, [src0], dst, src2 (dst=src1 with mem op on mul)

	The two pairs are equivalent since src0/src1 commute, so you can swap src0 and src1
	operands on the source op if necessary. So without loss of generality, dst=src0.

	-> Emit: VFop132 dst, src2, src1
	(note src1 is in the mem-op position already)

	3. VFop dst, dst, src1, [src2] (dst=src0 with mem op on add)
	VFop dst, src0, dst, [src2] (dst=src1 with mem op on add)

	Symmetric again, so w.l.o.g. dst=src0.

	-> Emit: VFop213 dst, src1, [src2]

	4. VFop dst, src0, src1, dst (dst=src2 and no mem ops)
	VFop dst, [src0], src1, dst (dst=src2 with mem op on mul)
	VFop dst, src0, [src1], dst (dst=src2 with mem op on mul)

	Again symmetric, so w.l.o.g. src0 is the memory operand.

	-> Emit: VFop231 dst, src1, src0

	That covers all cases (unless I missed something, but I don't think so).

	So, the actual algorithm is this:

	// input: VFop dst, src0, src1, src2; <=1 of the three sources a mem operand.

	if (all_4_distinct) {
	emit(MOV(dst, src0));
	src0 = dst; // now we're in 3-distinct case.
	}

	if (dst == src1)
	swap(src0, src1);
	// now either dst=src0 or dst=src2.

	if (dst == src0) {
	if (is_mem_op(src2))
	emit(VFop213(dst, src1, src2));
	else
	emit(VFop132(dst, src2, src1));
	} else {
	if (is_mem_op(src0))
	emit(VFop231(dst, src1, src0));
	else
	emit(VFop231(dst, src0, src1));
	}