
@kaushikcfd
Last active April 8, 2018 16:25
  • Context: argmin/argmax are among the last pieces whose tests still need to be handled. I am having the following trouble integrating these reductions into the new function interface.

Problems about argmin/argmax

Outline of how ArgReductionOp works:

  • During the creation phase, it gets identified and tagged as a reduction operation.
  • During type inference, it is treated as a single instruction (which is a good thing). Type inference is redirected to the result_dtypes method of the class ArgReductionOp.
  • At the realize_reduction step in the preprocess part of the pipeline, it gets converted to 3 instructions, one of which is a call of the form Call(<class ArgExtOp>, (*parameters)).
  • In the end, we append the inline dtype1 loopy_argmin_dtype1_dtype2_op(...) function to the set of preambles of the device code.
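
The outline above can be pictured with a plain-Python sketch of what an argmin reduction accumulates: a (value, index) pair that is updated once per element. The names `argmin_update` and `argmin_reduce` are illustrative and are not loopy functions; the per-element update only plays roughly the role of the `loopy_argmin_..._op` preamble function, and the neutral element chosen here is an assumption.

```python
import math


def argmin_update(acc_val, acc_idx, val, idx):
    """Combine the running (value, index) pair with one new element."""
    if val < acc_val:
        # Strict comparison: on ties the earliest index is kept.
        return val, idx
    return acc_val, acc_idx


def argmin_reduce(values):
    """Fold argmin_update over all elements, starting from a neutral pair."""
    acc_val, acc_idx = math.inf, -1  # assumed neutral element for the sketch
    for idx, val in enumerate(values):
        acc_val, acc_idx = argmin_update(acc_val, acc_idx, val, idx)
    return acc_val, acc_idx


print(argmin_reduce([4.0, 1.5, 3.0, 1.5]))  # (1.5, 1)
```

The point of the sketch is that the reduction carries two results at once, which is why it needs its own result-dtype logic and a dedicated combining function.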

What are the problems with integrating this into our function interface?

  • Up to the preprocess stage everything is fine, as reductions are not covered by our current function interface.
  • But after the preprocess stage, we get the "not so pretty" p.Call(<class ArgExtOp>, (*parameters)), which is unlike the other calls encountered in a kernel, i.e. p.Call(p.Variable('function_name'), (*parameters)).
  • The effect is that this is incompatible with the current structure of the loopy-specific expression node ScopedFunction, which is a variant of p.Variable. Currently p.Call(p.Variable('sin'), (p.Variable('x'),)) after scoping gets converted to p.Call(p.ScopedFunction('sin'), (p.Variable('x'),)).
  • To make it compatible, we would need to add the following block in every Mapper that deals with ScopedFunctions:
if isinstance(expr.function, ArgExtOp):
    # do something differently
elif isinstance(expr.function, Variable):
    # do things the normal way
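
To make the cost of that branch concrete, here is a minimal, self-contained sketch. All classes below are hypothetical stand-ins written for this illustration; they are not the pymbolic or loopy classes, and `scope_call` is not an existing mapper method.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for expression nodes (NOT the pymbolic/loopy classes).
@dataclass(frozen=True)
class Variable:
    name: str


@dataclass(frozen=True)
class ScopedFunction:
    name: str


@dataclass(frozen=True)
class Call:
    function: object
    parameters: tuple


class ArgExtOp:
    """Stand-in for a reduction-op object used directly as the callee."""


def scope_call(expr):
    """Replace a named callee by a ScopedFunction."""
    if isinstance(expr.function, ArgExtOp):
        # The callee has no name to scope, so it is passed through unchanged.
        # Every mapper touching calls would need a branch like this one.
        return expr
    elif isinstance(expr.function, Variable):
        return Call(ScopedFunction(expr.function.name), expr.parameters)
    raise TypeError(f"unexpected callee: {expr.function!r}")


sin_call = Call(Variable("sin"), (Variable("x"),))
argext_call = Call(ArgExtOp(), (Variable("acc"), Variable("a_i")))

print(scope_call(sin_call))
print(scope_call(argext_call) is argext_call)  # True
```

If every callee were a named variable, the first branch would disappear and the mapper would have a single code path.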

Conclusion: I strongly feel that we should deal with ArgExtOp in a different manner, so that we could achieve our target of unifying the loopy pipeline.

One way could be to convert the argmin/argmax reduction to a kernel callable at the creation phase and link it to a kernel performing argmin. (All of this should happen at the creation phase.)

The following snippet shows an example of a potential argmin kernel.

import loopy as lp
import pymbolic.primitives as p


def get_argmin_kernel(n):
    # Symbolic names used by the instructions below.
    acc_i = p.Variable("acc_i")
    i = p.Variable("i")
    index = p.Variable("index")
    a_i = p.Subscript(p.Variable("a"), p.Variable("i"))
    argmin_kernel = lp.make_kernel(
            "{[i]: 0 <= i < n}",
            [
                # Initialize the index and the running minimum.
                lp.Assignment(id="init1", assignee=index,
                    expression=0),
                lp.Assignment(id="init2", assignee=acc_i,
                    expression="214748367"),
                # Select i or keep index; scheduled after the accumulator update.
                lp.Assignment(id="insn", assignee=index,
                    expression=p.If(p.Expression.lt(acc_i, a_i), i, index),
                    depends_on="update"),
                # Fold a[i] into the running minimum.
                lp.Assignment(id="update", assignee=acc_i,
                    expression=p.Variable("min")(acc_i, a_i),
                    depends_on="init1,init2")])

    return argmin_kernel


knl = lp.make_kernel(
        "{[i]:0<=i<10}",
        """
        min_val, min_index = custom_argmin([i]:b[i])
        """)
knl = lp.register_callable_kernel(knl, "custom_argmin", get_argmin_kernel(10))

We get the following OpenCL code:

void custom_argmin_1(__global double const *__restrict__ a, __global double *__restrict__ acc_i, __global int *__restrict__ index)
{
  index[0] = 0;
  acc_i[0] = 214748367.0;
  for (int i = 0; i <= 9; ++i)
  {
    acc_i[0] = fmin(acc_i[0], a[i]);
    index[0] = (acc_i[0] < a[i] ? i : index[0]);
  }
}
__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global double const *__restrict__ b, __global int *__restrict__ min_index, __global double *__restrict__ min_val)
{
  custom_argmin_1(&(b[0]), &(min_val[0]), &(min_index[0]));
}
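
The update order in the generated loop can be checked without a device by transliterating it into plain Python. The sketch below does that (the function names are illustrative and not part of loopy) and places it next to a variant that compares before updating the accumulator. The second variant is an assumption about the intended behaviour, not something the generated code does.

```python
def argmin_as_generated(a):
    """Line-by-line transliteration of custom_argmin_1 above."""
    index = 0
    acc = 214748367.0
    for i in range(len(a)):
        acc = min(acc, a[i])
        index = i if acc < a[i] else index
    return acc, index


def argmin_compare_first(a):
    """Assumed intent: compare against the old minimum, then update it."""
    index = 0
    acc = 214748367.0
    for i in range(len(a)):
        index = i if a[i] < acc else index
        acc = min(acc, a[i])
    return acc, index


data = [3.0, 1.0, 2.0]
print(argmin_as_generated(data))   # (1.0, 2)
print(argmin_compare_first(data))  # (1.0, 1)
```

Because the accumulator is updated first, `acc_i[0] < a[i]` holds exactly when `a[i]` is not the new minimum, so the two versions agree on the minimum value but can disagree on the index. This may be worth checking before relying on `min_index`.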

Result:

  • This kernel generates (almost) the same code as the argmin currently employed in loo.py.
  • Side advantage: there is no need to write type inference specifically for custom_argmin, as loopy's existing type inference already infers the output types of the function.

Status on the WIP MR:

  • I am handling the issues in decreasing order of difficulty. I will let you know if I run into any problems.

I would be happy to receive feedback on Riot/Etherpad.

Thanks.
