Tanner Gooding vs SingleAccretion

[6:31] SingleAccretion: (Completely unrelated)

Let's test my API design ability. What is the difference between SetHWIntrinsic and ChangeHWIntrinsic?

[6:33] Tanner Gooding: I would assume Change might do other transformations required if they aren't directly compatible or something

I don't particularly like the name 😄

[6:37] SingleAccretion: (Alternatives are very welcome of course) I would assume Change might do other transformations required if they aren't directly compatible or something So it looks like it was the right call for me not to be in the API proposals game, since the situation is exactly the opposite 😄.

Set: sets the intrinsic id and takes the new operands as arguments. Change: asserts the new intrinsic has the same number of operands as the node itself, then just changes the intrinsic id.

Things I considered as well: ResetHWIntrinsic/ChangeHWIntrinsic, ResetHWIntrinsic/SetWIntrinsicId, SetHWIntrinsicId/SetHWIntrinsicIdKeepOperands...

[6:38] Tanner Gooding: normally I'd just have GetHWIntrinsicID and SetHWIntrinsicID typically get/set are just as they are in .NET, they just read or write a given field, possibly with minimal validation that the input isn't completely invalid

[6:39] Tanner Gooding: (assuming you are looking at removing the public field access and moving it to helper methods?)

[6:42] Tanner Gooding: we also have some prior art for Change*, such as ChangeOper and ChangeOperUnchecked in both cases, they do some extra checks/fixups as appropriate but I don't like the name still: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/compiler.hpp#L1389-L1437

[6:43] Tanner Gooding: change isn't super intuitive, IMO and violates a number of other design practices

I think its generally better to create a new node of the right type and then replace but its a JIT so I'm guessing that's why they support changing in place

[6:43] SingleAccretion: (assuming you are looking at removing the public field access and moving it to helper methods?) I have that already done indeed. But the reason I need to do that at all is because I need to run some more complex logic when changing the id now because the underlying operand array may need to be expanded or switched (from a dynamic one to an inline representation), so just setting the id directly is very dangerous. So I am looking at ways to make the API more palatable. Currently I have this: https://paste.mod.gg/qjzpjscgysyu/0But I do not like it a lot because of lack of validation.

[6:44] SingleAccretion: but its a JIT so I'm guessing that's why they support changing in place Indeed, it is all about TP...

[6:44] Tanner Gooding: I'm still not convinced the inline array approach is the "right" way to get rid of gtlist >.>

[6:45] Tanner Gooding: I think the worst part about gtlist is that it is inefficient due to 1 operand per node when nodes are plenty big enough to store a count + several operands

[6:46] Tanner Gooding: basically a LinkedList, rather than LinkedList

[6:46] Tanner Gooding: needing to allocate extra space doesn't sound ideal for TP, when we have a super efficient node allocator already

[6:48] SingleAccretion: That to me sounds like exactly what the inline array solution provides...? (For reference, here's the current design, it is rather simple: https://github.com/SingleAccretion/runtime/blob/d6c083b94dd0a41a898964fb590ed39e49431c8f/src/coreclr/jit/gentree.h#L5013)

[6:49] Tanner Gooding: but why get rid of GenTreeList at all, rather than just change it to store more than 1 operand then?

[6:50] SingleAccretion: Ah, the reason is that the base GenTree takes up space as well (48 bytes to be precise). It also has other problems w.r.t tree traversal (basically we have two family of functions that traverse trees: ones that do expand lists and ones that do not).

[6:51] Tanner Gooding: couldn't you just move that functionality down to the base GenTree (the iterator support)?

[6:51] Tanner Gooding: and have it generally handle 1, 2, or n operands?

[6:52] SingleAccretion: It already exists: GenTreeUseEdgeIterator, what is invoked when you do op->Operands(). I should really rename the functions to something else so that they do not shadow the base iterator.

[6:53] Tanner Gooding: I thought UseEdge might include more than strictly the operands?

[6:54] SingleAccretion: I mean, what else would it be...? It operates in execution order I suppose but other than that it just gives you uses one by one.

[6:54] Tanner Gooding: right, but you might track other children or things than just the function inputs

[6:55] SingleAccretion: Ah, you mean like GenTreeCall::Use and friends? Well, you don't need anything else if all you're interested in are the child nodes themselves...

[6:56] SingleAccretion: (And MultiOp does not need its own Use class because it doesn't use a linked list)

[6:56] Tanner Gooding: How can you not be a linked list, given we have 2 node sizes and sometimes need 32+ args?

[6:56] Tanner Gooding: does it just allocate in that case, outside the node allocator?

[6:57] SingleAccretion: You would allocate an array of 32 GenTree* pointers in that case, yes. The allocator is the standard compiler one, passed as an argument.

[6:57] Tanner Gooding: that's what I was saying about sounds potentially problematic most of the gentree stuff completely avoids the standard allocator

[6:57] Tanner Gooding: everything is a node and comes from the bump allocator

[6:57] Tanner Gooding: which is a lot faster, particularly given its 1 size

[6:58] Tanner Gooding: (well 2)

[6:58] SingleAccretion: Everything uses one and only bump allocator, the compiler's one. We don't have a separate allocator for nodes specifically (though we might consider it actually, it could improve the IR density)

[6:58] reflectronic: i think that's what was meant by "the standard compiler one, passed as an argument"

[6:58] Tanner Gooding: I thought the one for gentree nodes was different

[6:59] SingleAccretion: Nope.

[7:00] SingleAccretion: We have an allocator that pools 64K pages and then a wrapper around it that enables some debug niceties, and that's what is used for all the allocations.

[7:00] Tanner Gooding: are you sure? getAllocator(CMK_ASTNode)

[7:01] SingleAccretion: Yes. The CMK_* arguments is debug-only, for tracking where the allocations come from.

[7:02] Tanner Gooding: hmmm, ok. I thought those were actually different in the underlying thing but I guess not

[7:03] Tanner Gooding: so you'd continue replacing gtOp1 with this new MultiOp node rather than GenTreeList?

[7:03] Tanner Gooding: and then have the iterator walk through it?

[7:05] SingleAccretion: Yep. I have already replaced the gtOp1 references, what's left is actually getting rid of the list. (Right now I have a "fake" MultiOp with lists as the first operand, only the 0-operand case is implemented properly).

[7:05] Tanner Gooding: if its not a normal GenTree node, how are you detecting its a MultiOp node instead?

[7:07] SingleAccretion: OperIs(GT_HWINTRINSIC, GT_SIMD)...? There are only two opers that will use MultiOp (right now, I will be looking at putting PHI on the same plan as well).

[7:07] Tanner Gooding: what about GT_CALL?

[7:08] SingleAccretion: Calls are complicated and probably better left as linked lists like they are right now. Plus for them it is actually beneficial to have the uses be actual "fat" classes with additional information. (We might migrate the arg info to them someday)

[7:08] Tanner Gooding: also, couldn't you just make it a GenTree node, use a kind to identify it, and still have the array allocated?

[7:09] SingleAccretion: As in not create a new class...? I don't think so, I need to have the relevant logic live somewhere.

[7:10] Tanner Gooding: right, in GenTreeMultiOp, which would be a normal GenTreeNode where its: count, op1, op2, op3, + pointer to remaining arg buffer?

[7:10] Tanner Gooding: which I think is what will fit in a small node today

[7:10] Tanner Gooding: might actually be count + 4 inline args + an extra pointer for remaining arg buffer

[7:12] Tanner Gooding: e.g. GenTreeJitIntrinsic is a small node. It contains a pointer + 2 uints + can contain another pointer before it becomes large and already contains 2 operand nodes + some other stuff from un/binop

[7:13] Tanner Gooding: the common case will be 2-4 args. The uncommon will be more, which is handled by that one extra trailing buffer

[7:20] SingleAccretion: So, today, the base GenTree is 48 bytes on 64 bit, and a small node is 80 bytes, which means we have 4 pointers to work with. (It is actually 5 pointers on 32 bit).

For the intrinsic nodes, we must have 8 bytes reserved for the node data (simd size etc). That leaves us with 3 pointers. We can do with them in two ways. Have an inline array be 3 pointers (unioned with a pointer to a dynamic one), and then switch on some discriminator when accessing the operands. -or- (which is what I chose) Have the inline array be 2 pointers, and have one pointer point to always valid memory (either to the inline array, or to a dynamic one). This makes indexing cheaper, as you no loner need the discriminator, but penalizes the 3 operand case. I have collected some statistics here: https://gist.github.com/SingleAccretion/f83892a07748b0f214556799dccc5c5d and concluded that this case is not common enough to warrant a penalty on every operands access.

Now, keen readers may have noticed I left off the operand count... And indeed, it does not fit. So I stole 1 byte from the padding in the base node and put it there.

So the current layout of GenTreeJitIntrinsic (and thus HWIntrinsic/SIMD as they add no new field) is as follows, nicely fitting into a small node and being very efficient for the most common scenarios (not a lot of bashing, a l lot of operands accesses): https://paste.mod.gg/fhoqkvkjnufc/0

[7:21] Tanner Gooding: whatever you have planned for GenTreeJitIntrinsic probably won't mesh with some future changes that are needed here.... in particular the need to eventually keep the original call info, so we can rewrite in rationalize/lowering

[7:22] Tanner Gooding: its not going to be fun making that all work eventually

[7:24] Tanner Gooding: in particular, we'll likely eventually need to carry CORINFO_METHOD_HANDLE and for R2R CORINFO_CONST_LOOKUP, just like GenTreeIntrinsic so we can rewrite like GTIntrinsic does

[7:25] SingleAccretion: I do not see how is that the case...? Keeping the call info just means forgoing the inline operands. With the current design it would be close to a one-line change, since the amount in the inline array is configurable (it will work just ok if we change it to zero).

[7:26] Tanner Gooding: It sounds like a very drastic change from how everything works today

[7:26] Tanner Gooding: and it sounds like its not going to be trivial to make work with GT_CALL

[7:26] Tanner Gooding: is my concern

[7:27] SingleAccretion: It sounds like a very drastic change from how everything works today But why? That is how GenTreeCall, GenTreeFieldList and GenTreePhi are today, just they use linked lists instead of an array.

[7:27] SingleAccretion: I could use a linked list as well, but that is just less efficient.

[7:29] Tanner Gooding: but they do it via an explicit GenTree node that is today, we have:

struct GenTreeJitIntrinsic : public GenTreeOp
{
private:
    ClassLayout* gtLayout;

    unsigned char  gtAuxiliaryJitType; // For intrinsics than need another type (e.g. Avx2.Gather* or SIMD (by element))
    regNumberSmall gtOtherReg;         // For intrinsics that return 2 registers

    unsigned char gtSimdBaseJitType; // SIMD vector base JIT type
    unsigned char gtSimdSize;        // SIMD vector size in bytes, use 0 for scalar intrinsics

public:
#if defined(FEATURE_SIMD)
    union {
        SIMDIntrinsicID gtSIMDIntrinsicID; // operation Id
        NamedIntrinsic  gtHWIntrinsicId;
    };
#else
    NamedIntrinsic gtHWIntrinsicId;
#endif
  // more ...
}

So GenTreeJitIntrinsic is a binop (space for w/e in the base + gtOp1 + gtOp2) plus a pointer for gtLayout plus a pointers worth of space for gtAuxType through gtSimdSize (4 bytes) and NamedIntrinsic (4 bytes)

[7:30] Tanner Gooding: so, even considered "wasted" space, you could have

struct GenTreeOperands : public GenTreeOp
{
    GenTree** remainingArgs;
    unsigned count;
}

[7:31] Tanner Gooding: in this case, GenTreeOperands tracks 2 operands + any remaining via a remainingArgs buffer, which is space for count - 2

[7:32] Tanner Gooding: if you handle 3+ operands in gtOp2 rather than in gtOp1, you could also make it so you track 3 operands + remaining in the buffer

[7:32] Tanner Gooding: so its basically GenTreeList, but more efficient requiring very minimal changes to how anything works or how the existing handlers of gtList need to handle things

[7:33] Tanner Gooding: and can simplify some of the handling via iterators and a helper that allows swapping two operand indices

[7:36] Tanner Gooding: a different refactoring would allow things to work more inline with how you want, without also breaking everything being a proper node namely, just ensure gtOper is the first field, with a comment that's the case and hopefully things like gtNext and the like don't break under it

[7:38] SingleAccretion: requiring very minimal changes to how anything works or how the existing handlers of gtList need to handle things I do not think that this would be the case. Code that only needed/cared about the first two operands would continue doing Op(1/2) in my version and in the GenTreeOp version. Code that needs to handle more operands would need to be aware of GenTreeOperands the same way it will be made aware of GenTreeMultiOp (which is about the same thing, just the remainingArgs are before the inline arguments 😄).

Take for example, all the endless functions that switch on the oper we have and then "for each operand, do X". They care about all the operands => they need to know they are dealing with GenTreeOperands, not just any old GenTreeSmpOp. The nodes with a variable number of operands are special, there is no going around it really.

But actually we should not lose the sight of the fact that the real purpose of this refactoring is to collapse all these function that iterate trees to use one common visitor (which lists complicate).

[7:40] Tanner Gooding: --- this is mainly coming at it from the perspective that

an increasing number of arguments is rare. Its commonly small, except for calls which can sometimes go up to 16; or intrinsics which go to 32 or maybe 64 in the future
the biggest "issue" with GenTreeArgList` is that it is "inefficient" with 1 arg per node. It could be more tightly packed to reduce allocations and indirection costs for these cases

However, its also not clear the inefficiency is bad even in intrinsic heavy code (and we did some analysis on this previously and didn't see anything obvious). The more general issue is that its different from other operand cases and not everything in the JIT handles this well or correctly. Simply switching to a new type of non-node won't improve this scenario

[7:44] Tanner Gooding: But actually we should not lose the sight of the fact that the real purpose of this refactoring is to collapse all these function that iterate trees to use one common visitor (which lists complicate) Maybe this should be an independent change that works with GT_LIST first? Iterating operands for any node kind shouldn't be very different regardless of whether its GenTreeArgList or GenTreeMultiOp or even pointer hacks and flags pointing to arrays

[7:45] Tanner Gooding: because as it is, such an iterator should really just be:

if un op, return gtOp1
if bin op
  and gtOp1 is gtArgList
     iterate until argList empty
  else return gtOp1 and gtOp2

[7:45] SingleAccretion: However, its also not clear the inefficiency is bad even in intrinsic heavy code (and we did some analysis on this previously and didn't see anything obvious). Indeed, the inefficiencies are not really that noticeable (indeed, if we work the math with the number of intrinsic nodes created when compiling CoreLib I collected, we'll not be seeing anywhere near the gains we saw with GT_CALL.

Maybe this should be an independent change that works with GT_LIST first? Iterating operands for any node kind shouldn't be very different regardless of whether its GenTreeArgList or GenTreeMultiOp or even pointer hacks and flags pointing to arrays The issue is precisely that there are two ways to iterate the nodes: without and with "expanding" the lists. The two are fundamentally not compatible, both are used.

[7:46] Tanner Gooding: The issue is precisely that there are two ways to iterate the nodes: without and with "expanding" the lists. The two are fundamentally not compatible, both are used. right, and the same would continue with any new approach you'd still have normal ops which have gtOp1/gtOp2 and call/simd/intrinsic which have "magic"

[7:47] Tanner Gooding: simply creating an iterator that hides gtOp1/gtOp2 and provides a general purpose way to iterate would work regardless of this and resolve the general issue

[7:47] SingleAccretion: (Will be responding in a dozen minutes or so...)

[7:52] Tanner Gooding: like, GenTreeUnOp could simply expose a:

gtOp1(), gtOp2() and gtOp(unsigned index);
+
OperandIterator getOps()


this could then do the trivial case of
if (OperIsConst() || OperIsLeaf())
{
    // nothing to iterate
}
else if (OperIsUnary())
{
    yield return gtOp1;
}
else
{
    assert(OperIsBinary());

    if(gtOp1->OperIs(GT_ARG_LIST))
    {
        // return all args
    }
    else
    {
        yield return gtOp1;
        yield return gtOp2;
    }
}

[7:53] Tanner Gooding: we already do similar things in unsigned GenTree::NumChildren()

[7:54] Tanner Gooding: so this would just be generalizing and centralizing reliance on things that should be "private fields" to be public methods that hide that internal detail

[7:54] Tanner Gooding: and then, from there, we can change the backing representation however without worrying about anything breaking or being overly complex to review

[7:55] Tanner Gooding: we even somewhat have this today as Genree::GetChild: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cpp#L9076, just that

its not strictly limited to operands
it doesn't handle everything it should today 😄

[7:57] Tanner Gooding: -- part of this is also that new nodes aren't forced to be handled here its the "downside" of the assert less default case in much of the JIT we should really have those cases assert its a one of a set of well known oper kinds, like we do in the hwintrinsics since that forces new nodes to be considered for "should this be specially handled or is the default case fine" but that's just my opinion

[8:23] SingleAccretion: I will be departing to sleep at this point...

I think it is not worth it trying to fit everything into the "only 2 operands" model because we already have a model where not everything fits into 2 operands. GT_LIST's original purpose was to make that model work, but it didn't work too well (indeed, the GT_LIST model didn't work much if at all, it has to be special-cased everywhere: in the GenTreeUseEdgeIterator (https://github.com/dotnet/runtime/blob/787c266ae88400b4352a2dcf9def851ace357bb3/src/coreclr/jit/gentree.cpp#L9758), in TryGetUse (https://github.com/dotnet/runtime/blob/787c266ae88400b4352a2dcf9def851ace357bb3/src/coreclr/jit/gentree.cpp#L5475, in gtSetEvalOrder(https://github.com/dotnet/runtime/blob/787c266ae88400b4352a2dcf9def851ace357bb3/src/coreclr/jit/gentree.cpp#L2517), etc).

So it is true that another node to add there will be another node to add. But it will also be another node to remove (the list itself), so it is a fair tradeoff in my opinion. And I am explicitly designing it so that in the future, if we need to add another variadic node (though I hope that does not happen...) with a fairly "fixed" number of operands, it can just derive from GenTreeMultiOp, provide the number of inline operands that is the right number for it and "just work". Indeed, I am looking at applying it to at least GenTreeFieldList (this one has a Use, but it ought to be straightforward to extend the current design to handle any arbitrary type of Use, not just raw GenTree**).

I note that deleting the list from IR (almost, because of the call placeholder nodes) removes a very special case: a node that is not a node, that does not produce any value, or side effects, or generate any code, or need a VN (though it does get it for some reason, wasting VN memory...), or a register, or be in the LIR sequence...

we already do similar things in unsigned GenTree::NumChildren() Or a lot of other methods. Some of them are "custom" for TP reasons...

[8:23] SingleAccretion: ...others will see their code deleted.

[8:24] SingleAccretion: In any case, thank you Tanner very much for this discussion. I will for sure be thinking about it in the coming weeks.

[8:44] Tanner Gooding: I think its probably worth doing a writeup/general doc of what your goal is here and putting it up as an issue on the dotnet/runtime repo for review by the JIT team. It's not immediately clear to me that this approach is the right one or necessarily inline with anything the JIT is wanting to do in general.

For big changes, especially ones that could be particularly invasive, its generally good to have the proposal up with some simple examples and callouts for what's being replaced and why. If the JIT team signs off, then it can be implemented and contributed

But, in my own experience there is often some level of feedback or changes requested (possibly even just on timing to reduce overall friction), so getting that sign-off first is good

[8:47] Tanner Gooding: Also, the main reason for the bin-op + node split was to handle the very common case of basic operations vs calls intrinsics complicated this a bit more, but not overly so; just making it more obvious where things aren't always getting handled

and that's where, again, I think switching over to a general iterator first is "better" and avoids any problems once you enforce everyone going through the general purpose case and prevent them from directly accessing the "nitty gritty" you can change that nitty gritty to be almost anything without issue

put another way, accessing fields directly is bad, because it means you can't trivially change things the compiler will correctly handle and inline simple wrapper methods, so there is no reason to not use them to abstract things a bit

so switching over to this first, even on top of GenTreeArgList, will help simplify the second PR and make the benefits of it more obvious/visible since the first PR is just a refactoring, not any actual logic or other design change

kant2002/conversation.md