instead of emitting this...
vmul.f32 q10, q15, q9
vmul.f32 q11, q13, q9
vfma.f32 q10, q3, q8
vfma.f32 q11, q14, q8
the compiler is emitting this (note the extra vorr register copies)...
vmul.f32 q12, q15, q9
vfma.f32 q12, q3, q8
vorr q10, q12, q12
vmul.f32 q12, q13, q9
vfma.f32 q12, q14, q8
vorr q11, q12, q12
this appears to be the result of vectorizable_store
https://github.com/gcc-mirror/gcc/blob/releases/gcc-9.3.0/gcc/tree-vect-stmts.c#L6328-L6330
on targets that support load/store-lanes instructions
https://github.com/gcc-mirror/gcc/blob/releases/gcc-9.3.0/gcc/tree-vect-stmts.c#L7212
immediately marking a register as clobbered before assigning to it
https://github.com/gcc-mirror/gcc/blob/releases/gcc-9.3.0/gcc/tree-vect-stmts.c#L7222
which prevents the compiler from coalescing the registers, so each result is computed in a separate register (q12 above) and then copied into place with vorr, for no good reason.
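
For illustration, a loop of roughly the following shape is the kind of code that could go through this store-lanes path. This is only a sketch: the original source is not shown here, so the function name interleave_fma, its parameters, and the loop body are all hypothetical, and it may or may not reproduce the exact listings above. The assumption is that two multiply-add chains sharing the same pair of factors are written to adjacent elements of an interleaved output, which the vectorizer may turn into a vst2.

```c
#include <stddef.h>

/* Hypothetical reproducer: the name, signature and loop shape are
   assumptions, not taken from the original report. */
void interleave_fma(float *restrict out,
                    const float *restrict a, const float *restrict b,
                    const float *restrict c, const float *restrict d,
                    float k0, float k1, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Two multiply-add chains that share k0 and k1, mirroring the
           vmul/vfma pairs in the listings above. */
        out[2 * i]     = a[i] * k0 + b[i] * k1;
        /* Stores to adjacent elements form an interleaved group, which
           the vectorizer may implement as a store-lanes operation. */
        out[2 * i + 1] = c[i] * k0 + d[i] * k1;
    }
}
```

Building this for 32-bit ARM with something like -O3 -mfpu=neon-vfpv4 -funsafe-math-optimizations (an assumed flag set, not one given above) and inspecting the assembly for vorr copies ahead of the vst2 would be one way to check whether the same pattern shows up.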