Context:
`argmin`/`argmax` are among the last bits whose tests still need to be handled. I am having the following troubles with integrating these reductions into the new function interface.
Outline of how `ArgReductionOp` works:
- During the creation phase, it gets identified and tagged as a reduction operation.
- During type inference, it gets interpreted as a single instruction (which is a good thing); type inference redirects to the `result_dtypes` method of the class `ArgReductionOp`.
- At the `realize_reduction` step in the preprocess part of the pipeline, it gets converted to 3 instructions, one of which is a call `Call(<class ArgExtOp>, (*parameters))` (see the sketch after this list).
- In the end, we append the `inline dtype1 loopy_argmin_dtype1_dtype2_op(...)` function to the set of preambles of the device code.
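The following is a minimal sketch of how to observe this on a plain loopy kernel; the `argmin(i, a[i], i)` reduction syntax is my assumption of the current form, and the snippet is only meant to show where the `ArgExtOp` call shows up:

```python
import numpy as np
import loopy as lp

# Sketch: a kernel using the built-in argmin reduction (syntax assumed).
knl = lp.make_kernel(
        "{[i]: 0 <= i < 10}",
        """
        min_val, min_index = argmin(i, a[i], i)
        """)
knl = lp.add_and_infer_dtypes(knl, {"a": np.float64})

# realize_reduction runs inside preprocess_kernel; afterwards one of the
# resulting instructions contains a Call whose .function is an ArgExtOp
# instance rather than a plain pymbolic Variable.
preprocessed = lp.preprocess_kernel(knl)
for insn in preprocessed.instructions:
    print(insn)
```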
What are the problems with integrating this into our function interface?
- Up to the preprocess stage everything is fine, as the reductions are not covered by our current function interface.
- But after the preprocess stage, we get the "not so pretty" `p.Call(<class ArgExtOp>, (*parameters))`, which is unlike the other calls encountered in a kernel, i.e. `p.Call(p.Variable('function_name'), (*parameters))`.
- The effect this has is that it is not compatible with the current structure of the loopy-specific expression node `ScopedFunction`, which is a variant of `p.Variable`. Currently, `p.Call(p.Variable('sin'), (p.Variable('x'),))` after scoping would get converted to `p.Call(p.ScopedFunction('sin'), (p.Variable('x'),))`.
- To make it compatible, we would need to add the following block in every `Mapper` when dealing with `ScopedFunction`s:
```python
if isinstance(expr.function, ArgExtOp):
    ...  # do something differently
elif isinstance(expr.function, Variable):
    ...  # do things the normal way
```
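In a typical pymbolic expression mapper, that special case would end up looking roughly like the sketch below; the class name `ScopingMapper` is hypothetical, and the `ArgExtOp` import path is my assumption:

```python
# Sketch only: illustrates the extra branch every mapper would need.
from pymbolic.mapper import IdentityMapper
from pymbolic.primitives import Variable

from loopy.library.reduction import ArgExtOp  # assumed import path


class ScopingMapper(IdentityMapper):
    def map_call(self, expr):
        if isinstance(expr.function, ArgExtOp):
            # ArgExtOp is not a Variable, so it cannot be wrapped in a
            # ScopedFunction -- leave it untouched
            function = expr.function
        elif isinstance(expr.function, Variable):
            # the normal way: recurse so the name gets scoped
            function = self.rec(expr.function)
        else:
            raise NotImplementedError("unexpected function node")

        return type(expr)(
                function,
                tuple(self.rec(par) for par in expr.parameters))
```

Having to repeat this branch in every mapper is exactly the duplication we would like to avoid.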
Conclusion: I strongly feel that we should deal with `ArgExtOp` in a different manner, so that we can achieve our target of unifying the loopy pipeline.

One way could be to convert the argmin/argmax reduction into a kernel callable and link it to a kernel performing argmin, with all of this happening at the creation phase.
The following snippet shows an example of a potential argmin kernel.
```python
import loopy as lp
import pymbolic.primitives as p


def get_argmin_kernel(n):
    acc_i = p.Variable("acc_i")
    i = p.Variable("i")
    index = p.Variable("index")
    a_i = p.Subscript(p.Variable("a"), p.Variable("i"))

    argmin_kernel = lp.make_kernel(
            "{[i]: 0 <= i < n}",
            [
                lp.Assignment(id="init1", assignee=index,
                    expression=0),
                lp.Assignment(id="init2", assignee=acc_i,
                    # large sentinel value to start the minimum search from
                    expression=214748367),
                lp.Assignment(id="insn", assignee=index,
                    expression=p.If(p.Comparison(acc_i, "<", a_i), i, index),
                    depends_on="update"),
                lp.Assignment(id="update", assignee=acc_i,
                    expression=p.Variable("min")(acc_i, a_i),
                    depends_on="init1,init2")])

    # fix n to a constant so the loop bound is known at code generation time
    # (as in the generated code below)
    argmin_kernel = lp.fix_parameters(argmin_kernel, n=n)
    return argmin_kernel
```
```python
knl = lp.make_kernel(
        "{[i]: 0 <= i < 10}",
        """
        min_val, min_index = custom_argmin([i]: b[i])
        """)
knl = lp.register_callable_kernel(knl, "custom_argmin", get_argmin_kernel(10))
```

We get the following OpenCL code:
```c
void custom_argmin_1(__global double const *__restrict__ a, __global double *__restrict__ acc_i, __global int *__restrict__ index)
{
  index[0] = 0;
  acc_i[0] = 214748367.0;
  for (int i = 0; i <= 9; ++i)
  {
    acc_i[0] = fmin(acc_i[0], a[i]);
    index[0] = (acc_i[0] < a[i] ? i : index[0]);
  }
}

__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global double const *__restrict__ b, __global int *__restrict__ min_index, __global double *__restrict__ min_val)
{
  custom_argmin_1(&(b[0]), &(min_val[0]), &(min_index[0]));
}
```

Result:
- This kernel generates (almost) the same code as the `argmin` being employed in loo.py right now.
- Side advantage: no need to write type inference specifically for `custom_argmin`, as we already employ loopy's present type inference to infer the output types of the function (see the sketch below).
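For instance, relying on the existing type inference could look like the following sketch, where only the dtype of the input `b` is supplied and the dtypes of `min_val`/`min_index` are inferred through the callee (assuming the `register_callable_kernel` interface from the WIP branch):

```python
import numpy as np
import loopy as lp

# Caller kernel with the argmin callee registered, repeated from above.
knl = lp.make_kernel(
        "{[i]: 0 <= i < 10}",
        """
        min_val, min_index = custom_argmin([i]: b[i])
        """)
knl = lp.register_callable_kernel(knl, "custom_argmin", get_argmin_kernel(10))

# Only the input dtype is specified; the output dtypes are obtained by
# running type inference on the callee kernel.
knl = lp.add_and_infer_dtypes(knl, {"b": np.float64})
print(lp.generate_code_v2(knl).device_code())
```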
Status on the WIP MR:
- I am handling the issues in decreasing order of difficulty. I will let you know if I run into any issues.
I would be happy to receive feedback on Riot/Etherpad.
Thanks.