Context:
`argmin`/`argmax` are among the last bits whose tests still need to be handled. I am having the following troubles with integrating these reductions into the new function interface.
Outline of how `ArgReductionOp` works:
- During the creation phase, it gets identified and tagged as a reduction operation.
- During type inference, it gets interpreted as a single instruction (which is a good thing); type inference redirects to the `result_dtypes` method of the class `ArgReductionOp`.
- At the `realize_reduction` step in the preprocess part of the pipeline, it gets converted to 3 instructions, one of which is a call `Call(<class ArgExtOp>, (*parameters))` (see the sketch after this list).
- In the end, we append the `inline dtype1 loopy_argmin_dtype1_dtype2_op(...)` function to the set of preambles of the device code.
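The following is a minimal sketch of how to observe this on a plain loopy kernel; the `argmin(i, a[i], i)` reduction syntax is my assumption of the current form, and the snippet is only meant to show where the `ArgExtOp` call shows up:

```python
import numpy as np
import loopy as lp

# Sketch: a kernel using the built-in argmin reduction (syntax assumed).
knl = lp.make_kernel(
        "{[i]: 0 <= i < 10}",
        """
        min_val, min_index = argmin(i, a[i], i)
        """)
knl = lp.add_and_infer_dtypes(knl, {"a": np.float64})

# realize_reduction runs inside preprocess_kernel; afterwards one of the
# resulting instructions contains a Call whose .function is an ArgExtOp
# instance rather than a plain pymbolic Variable.
preprocessed = lp.preprocess_kernel(knl)
for insn in preprocessed.instructions:
    print(insn)
```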
What are the problems with integrating this into our function interface?
- Up to the preprocess stage everything is fine, as the reductions are not covered by our current function interface.
- But after the preprocess stage, we get the "not so pretty" `p.Call(<class ArgExtOp>, (*parameters))`, which is unlike the other calls encountered in a kernel, i.e. `p.Call(p.Variable('function_name'), (*parameters))`.
- The effect this has is that it is not compatible with the current structure of the loopy-specific expression node `ScopedFunction`, which is a variant of `p.Variable`. Currently, `p.Call(p.Variable('sin'), (p.Variable('x'),))` after scoping would get converted to `p.Call(p.ScopedFunction('sin'), (p.Variable('x'),))`.
- To make it compatible, we would need to add the following block in every `Mapper` when dealing with `ScopedFunction`s:
```python
if isinstance(expr.function, ArgExtOp):
    ...  # do something differently
elif isinstance(expr.function, Variable):
    ...  # do things the normal way
```
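In a typical pymbolic expression mapper, that special case would end up looking roughly like the sketch below; the class name `ScopingMapper` is hypothetical, and the `ArgExtOp` import path is my assumption:

```python
# Sketch only: illustrates the extra branch every mapper would need.
from pymbolic.mapper import IdentityMapper
from pymbolic.primitives import Variable

from loopy.library.reduction import ArgExtOp  # assumed import path


class ScopingMapper(IdentityMapper):
    def map_call(self, expr):
        if isinstance(expr.function, ArgExtOp):
            # ArgExtOp is not a Variable, so it cannot be wrapped in a
            # ScopedFunction -- leave it untouched
            function = expr.function
        elif isinstance(expr.function, Variable):
            # the normal way: recurse so the name gets scoped
            function = self.rec(expr.function)
        else:
            raise NotImplementedError("unexpected function node")

        return type(expr)(
                function,
                tuple(self.rec(par) for par in expr.parameters))
```

Having to repeat this branch in every mapper is exactly the duplication we would like to avoid.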
Conclusion: I strongly feel that we should deal with `ArgExtOp` in a different manner, so that we can achieve our target of unifying the loopy pipeline.

One way could be to convert the argmin/argmax reduction into a kernel callable and link it to a kernel performing argmin, with all of this happening at the creation phase.
The following snippet shows an example of a potential argmin kernel.
```python
import loopy as lp
import pymbolic.primitives as p


def get_argmin_kernel(n):
    acc_i = p.Variable("acc_i")
    i = p.Variable("i")
    index = p.Variable("index")
    a_i = p.Subscript(p.Variable("a"), p.Variable("i"))

    argmin_kernel = lp.make_kernel(
            "{[i]: 0 <= i < n}",
            [
                lp.Assignment(id="init1", assignee=index,
                    expression=0),
                lp.Assignment(id="init2", assignee=acc_i,
                    # large sentinel value to start the minimum search from
                    expression=214748367),
                lp.Assignment(id="insn", assignee=index,
                    expression=p.If(p.Comparison(acc_i, "<", a_i), i, index),
                    depends_on="update"),
                lp.Assignment(id="update", assignee=acc_i,
                    expression=p.Variable("min")(acc_i, a_i),
                    depends_on="init1,init2")])

    # fix n to a constant so the loop bound is known at code generation time
    # (as in the generated code below)
    argmin_kernel = lp.fix_parameters(argmin_kernel, n=n)
    return argmin_kernel
```
```python
knl = lp.make_kernel(
        "{[i]: 0 <= i < 10}",
        """
        min_val, min_index = custom_argmin([i]: b[i])
        """)
knl = lp.register_callable_kernel(knl, "custom_argmin", get_argmin_kernel(10))
```

We get the following OpenCL code:
```c
void custom_argmin_1(__global double const *__restrict__ a, __global double *__restrict__ acc_i, __global int *__restrict__ index)
{
  index[0] = 0;
  acc_i[0] = 214748367.0;
  for (int i = 0; i <= 9; ++i)
  {
    acc_i[0] = fmin(acc_i[0], a[i]);
    index[0] = (acc_i[0] < a[i] ? i : index[0]);
  }
}

__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global double const *__restrict__ b, __global int *__restrict__ min_index, __global double *__restrict__ min_val)
{
  custom_argmin_1(&(b[0]), &(min_val[0]), &(min_index[0]));
}
```

Result:
- This kernel generates (almost) the same code as the `argmin` being employed in loo.py right now.
- Side advantage: no need to write type inference specifically for `custom_argmin`, as we already employ loopy's present type inference to infer the output types of the function (see the sketch below).
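For instance, relying on the existing type inference could look like the following sketch, where only the dtype of the input `b` is supplied and the dtypes of `min_val`/`min_index` are inferred through the callee (assuming the `register_callable_kernel` interface from the WIP branch):

```python
import numpy as np
import loopy as lp

# Caller kernel with the argmin callee registered, repeated from above.
knl = lp.make_kernel(
        "{[i]: 0 <= i < 10}",
        """
        min_val, min_index = custom_argmin([i]: b[i])
        """)
knl = lp.register_callable_kernel(knl, "custom_argmin", get_argmin_kernel(10))

# Only the input dtype is specified; the output dtypes are obtained by
# running type inference on the callee kernel.
knl = lp.add_and_infer_dtypes(knl, {"b": np.float64})
print(lp.generate_code_v2(knl).device_code())
```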
Status on the WIP MR:
- I am handling the issues in decreasing order of difficulty. I will let you know if I run into any issues.
I would be happy to receive feedback on Riot/Etherpad.
Thanks.