I plumbed through
support today for the __array_function__ hook. This is defined in NEP-18, and there is a newer proposal for __array_module__ in NEP-37. For a from-scratch implementation, __array_function__ is quite nice. I believe that __array_module__ may be of more help to existing implementations
or those needing to layer multiple hooks together.
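As a minimal sketch of how the NEP-18 protocol enables this kind of tracing, an ndarray-like class that defines __array_function__ gets first crack at any NumPy API call it participates in. The TracedArray class below is illustrative only (not the actual implementation); it just records each dispatched function while still computing the result:

```python
import numpy as np

class TracedArray:
    """Illustrative ndarray-like that intercepts NumPy API calls
    via the NEP-18 __array_function__ protocol."""

    trace = []  # names of dispatched functions, recorded in call order

    def __init__(self, value):
        self.value = np.asarray(value)

    def __array_function__(self, func, types, args, kwargs):
        # Record the op; a real tracer would emit IR here instead.
        TracedArray.trace.append(func.__name__)
        unwrapped = [a.value if isinstance(a, TracedArray) else a
                     for a in args]
        return TracedArray(func(*unwrapped, **kwargs))

a = TracedArray(np.eye(2))
b = TracedArray(np.ones((2, 2)))
c = np.dot(a, b)        # dispatches to TracedArray.__array_function__
d = np.transpose(c)     # likewise
```

Because dispatch happens per call site with the original args/kwargs, the tracer sees exactly what the user wrote, which matters for the transpose discussion below.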
This let me get tracing working for the numpy.dot and numpy.transpose ops. For the latter, we run into, for the first time, the limits
of what can be extracted in a local transformation: the axes parameter can be either None or a list of ints. In more static op sets,
it is often represented as an attribute (as opposed to an SSA value/operand), and in general, the more that is known about it statically, the better any
code generation strategy will work. However, at the point of extraction, we have not yet done any dtype or shape inference, and there
is not even enough information to produce the list of ints representing the default (N-1..0, where N is the rank). This leads me to
the conclusion that at the frontend level, transpose is actually two ops: one that performs the default action and another that takes
a user-specified list of ints (or possibly a variadic of permutation indices). I've implemented the former so far.
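The split can be sketched at the point of extraction. The op names and the IRBuilder below are hypothetical stand-ins, not the real IR; the point is only that the emitter branches on whether axes is statically known:

```python
class IRBuilder:
    """Toy IR builder that just records emitted ops (illustrative only)."""
    def __init__(self):
        self.ops = []
    def emit(self, name, *operands, **attrs):
        self.ops.append((name, operands, attrs))
        return f"%{len(self.ops) - 1}"   # SSA result name

def emit_transpose(ir, operand, axes=None):
    if axes is None:
        # Rank is unknown at extraction time, so we cannot materialize
        # the default permutation (N-1..0); emit a dedicated op that
        # defers that to later shape inference.
        return ir.emit("numpy.transpose_default", operand)
    # Statically known axes become an attribute on the generic op.
    return ir.emit("numpy.transpose", operand, axes=tuple(axes))

ir = IRBuilder()
r0 = emit_transpose(ir, "%arg0")                  # default form
r1 = emit_transpose(ir, "%arg0", axes=(1, 0, 2))  # explicit form
```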
There are many such ops where, based on the parameters and the constraints of an SSA-based IR, full generality requires multiple ops. Typically, I have observed that ML op sets which try to formalize this struggle greatly due to the impedance mismatch, often imposing arbitrary restrictions on the representation that infect everything and yet still fail to provide enough static information to perform great code generation without (usually substantial) additional analysis, which is itself often thwarted by the inability to represent the fully generic forms of the ops. In my observation, this is often made worse by prematurely tying these frontend ops to concrete implementation kernels, which are themselves then required to implement the full generality (this is how TensorFlow can end up with ~250MB of canned CPU kernels for doing elementwise computations).
For this work, we just embrace the fully generic forms and leave open the door for full analysis and lowering to the next level down without committing to any execution strategy yet.
The astute reader will realize that I've also been carrying forward a pretty serious impedance mismatch of my own so far: I've just been
blindly representing ndarray with the MLIR tensor type. The former represents an array in memory that can be mutated, aliased, etc.,
while the latter is a value type. There is a method to this (and a resolution) that bridges the two worlds, but that will wait for
another day. Suffice it to say that most ops are well formed enough (barring things like at methods and out= params) that there
are a lot of advantages to defining them in terms of value semantics. My argument is that locally (at the point of tracing or
AST translation), we have sufficient context to emit additional ops (loads, stores, etc.) and SSA-level manipulations to bridge these
worlds, hopefully creating programs where only the buffer aliases that escape actually persist in the high-level IR, leaving
the lower-level compiler free to perform buffer allocation and load/store optimizations mostly as it sees fit while still
retaining the ability to do user-level assignment and aliasing operations when necessary.
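To make the bridging idea concrete, here is a toy sketch under stated assumptions: the copy_update helper is hypothetical (not an actual op), standing in for the store-like ops a tracer could emit so that a user-level mutation like a[0] = x becomes a fresh SSA value rather than an in-place write:

```python
import numpy as np

def copy_update(tensor, index, value):
    # Value-semantic stand-in for a store: yield a new tensor value
    # instead of mutating the operand in place. Each user-level
    # mutation thus defines a new SSA value.
    result = np.array(tensor, copy=True)
    result[index] = value
    return result

a0 = np.zeros(4)               # %a0 = initial tensor value
a1 = copy_update(a0, 0, 7.0)   # %a1 = result of "a[0] = 7.0"
```

If %a0 never escapes, a later pass is free to elide the copy and lower %a1 to an in-place store, which is exactly the freedom the paragraph above argues for.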
For now, I'm assuming that resolution will work out and just working at the tensor-value level to get things bootstrapped.