csullivan · July 21, 2020 16:40
diff --git a/mace_conv2d_1x1_buffer_analysis.ipynb b/mace_conv2d_1x1_buffer_analysis.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Annotations for XiaoMi/mace's Conv2d (1x1) buffer-based OpenCL kernel**\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "**High level observations:**\n",
    "* No use of local memory, possibly to avoid barrier / memory fence synchronization\n",
    "* Use of 24 bit multiplication for index space transformations\n",
    "* Vector loads and stores\n",
    "* Basic fusions of activation and bias addition\n",
    "* Performs matrix multiplication via 2x4 (spatial x Co) outerproduct\n",
    "* Unrolls inner loop on filters input channels by a factor of 4\n",
    "* Use of multiply-accumulate intrinsics on vector types\n",
    "* Private memory register tiling used for accumulation.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reference: https://github.com/XiaoMi/mace/blob/master/mace/ops/opencl/cl/conv_2d_1x1_buffer.cl\n",
    "\n",
    "```c++\n",
    "#include <common.h>\n",
    "\n",
    "__kernel void conv2d(BUFFER_OUT_OF_RANGE_PARAMS\n",
    "                     GLOBAL_WORK_GROUP_SIZE_DIM2\n",
    "                     __global IN_DATA_TYPE *padded_input,\n",
    "                     __global IN_DATA_TYPE *filter,\n",
    "#ifdef BIAS\n",
    "                     __global IN_DATA_TYPE *bias,\n",
    "#endif\n",
    "                     __private const int in_height,\n",
    "                     __private const int in_width,\n",
    "                     __private const int in_chan,\n",
    "                     __private const int filter_in_chan,\n",
    "                     __private const int out_height,\n",
    "                     __private const int out_width,\n",
    "                     __private const int out_chan,\n",
    "                     __private const int stride_h,\n",
    "                     __private const int stride_w,\n",
    "                     __private const float relux_max_limit,\n",
    "                     __private const float leakyrelu_coefficient,\n",
    "                     __global OUT_DATA_TYPE *output) {\n",
    "\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-warning\">\n",
    "From below access pattern I infer the tensor layouts to be: \n",
    "\n",
    "**NHWCi x CoCico4 -> NHWCo**\n",
    "\n",
    "Outer loops over output pixels are bound to OpenCl global group indices:\n",
    "</div>\n",
    "\n",
    "```c++\n",
    "  const int out_wc_blk_idx = get_global_id(0); // in [0, Wo/2 * Co/4)\n",
    "  const int out_hb_idx = get_global_id(1); // in [0, N * Ho)\n",
    "\n",
    "```\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "OpenCL local_id abstraction is not utilized as hardware compute units are fully expressed via global work items.\n",
    "</div>\n",
    "\n",
    "```c++\n",
    "\n",
    "#ifndef NON_UNIFORM_WORK_GROUP\n",
    "  if (out_wc_blk_idx >= global_size_dim0 ||\n",
    "      out_hb_idx >= global_size_dim1) {\n",
    "    return;\n",
    "  }\n",
    "#endif\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-warning\">\n",
    "Output pixel coordinate calculation:\n",
    "\n",
    "\n",
    "This is `Co/4 = roundup(float(out_chan)/4.0))`\n",
    "</div>\n",
    "\n",
    "```c++\n",
    "  const int out_chan_blk = (out_chan + 3) >> 2; \n",
    "\n",
    "  const int out_width_blk_idx = out_wc_blk_idx / out_chan_blk;\n",
    "  // out chan block index = (wo*co) % (Co/4)\n",
    "  const int out_chan_blk_idx =\n",
    "      out_wc_blk_idx - mul24(out_width_blk_idx, out_chan_blk);\n",
    "\n",
    "```\n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "Let `out_wc_blk_idx = wo * co`, then given work group launch params:\n",
    "</div>\n",
    "\n",
    "$wo \\in [0,Wo/2)$\n",
    "\n",
    "$co \\in [0,Co/4)$\n",
    "    \n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "and, \n",
    "\n",
    "`out pixel width index = (wo*co) / (Co/4)`\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-warning\">\n",
    "Each work group will process two widths - idx. transformation: [0,Wo/2,step=1) -> [0,Wo,step=2), lane width = 2. Lanes are unrolled.\n",
    "\n",
    "Similarly, each work group will process 4 channels - idx. transformation: [0,Co/2,step=1) -> [0,Co,step=4), lane width = 4. Lanes are vectorized. \n",
    "</div>\n",
    "\n",
    "```c++\n",
    "  const int out_width_idx = out_width_blk_idx << 1;  \n",
    "  const int out_chan_idx = out_chan_blk_idx << 2;\n",
    "```\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "\n",
    "Linear output batch and height index calculation:\n",
    "</div>\n",
    "\n",
    "```c++\n",
    "  // out batch index = (n*ho) / Ho\n",
    "  const int batch_idx = out_hb_idx / out_height;\n",
    "  // out pixel height index = (n*ho) % Ho\n",
    "  const int out_height_idx = out_hb_idx - mul24(batch_idx, out_height);\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-warning\">\n",
    "Coordinate transformation from outputs index space to input index space.\n",
    "For 1x1 convolution the output index space is just a scaled version of the input index space.\n",
    "</div>\n",
    "\n",
    "```c++\n",
    "\n",
    "  // hi = ho * stride_h\n",
    "  const int in_height_idx = mul24(out_height_idx, stride_h);\n",
    "  // wi = wo * stride_w\n",
    "  const int in_width_idx = mul24(out_width_idx, stride_w);\n",
    "  // Cin = 240, stride size to next set of input channels = Cin * stride_w\n",
    "  const int strided_chan = mul24(in_chan, stride_w);\n",
    "    \n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-warning\">\n",
    "Register tiled vector accumulators for vector stores to the output along `co`\n",
    "</div>    \n",
    "\n",
    "```c++\n",
    "#ifdef BIAS\n",
    "  DATA_TYPE4 out0 = CONVERT4(vload4(0, bias + out_chan_idx));\n",
    "  DATA_TYPE4 out1 = out0;\n",
    "#else\n",
    "  DATA_TYPE4 out0 = 0;\n",
    "  DATA_TYPE4 out1 = 0;\n",
    "#endif\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-warning\">\n",
    "Offset into input memory from bound outerloops: \n",
    "</div>    \n",
    "\n",
    "```c++\n",
    "  // (((n * Hi + hi) * Wi + wi) * Ci \n",
    "  int in_offset = mul24(mad24(mad24(batch_idx, in_height, in_height_idx),\n",
    "      in_width, in_width_idx), in_chan);\n",
    "``` \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-warning\">\n",
    "\n",
    "Convert output channel index space (already vectorized into 4 lanes) into filter index space\n",
    "To get input filters corresponding to each output channel index we stride through the filter by filter input channels (the common dimension of the matmul). Outer loop over output channels is bound to out_chan_blk_idx. Inner loop over input channels Ci is unrolled 4 times.\n",
    "</div>    \n",
    "   \n",
    "```c++\n",
    "  int filter_offset = mul24(out_chan_blk_idx, filter_in_chan) << 2;\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-warning\"> \n",
    "Local register tiles (vectors) for filter and input\n",
    "</div>    \n",
    "    \n",
    "```c++\n",
    "  // Local register tiles for filter and input\n",
    "  DATA_TYPE4 in0, in1;\n",
    "  DATA_TYPE4 w0, w1, w2, w3;\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-warning\">\n",
    "    \n",
    "Perform the outer product accumulation along Ci for 4 separates lanes of Co in the filter and two lanes of spatial dimensions in the input. Comments added for clarity:\n",
    "</div>\n",
    "    \n",
    "```c++\n",
    "  // Iteratate over common dimension Ci and accumulate into register vectors\n",
    "  for (int in_chan_idx = 0; in_chan_idx < in_chan; in_chan_idx += 4) {\n",
    "    // Unrolled filter access along Ci of four vector loads of packed axis co%4\n",
    "    // * ---------- Co ---------->\n",
    "    // | w0.x w0.y w0.z w0.w ...\n",
    "    // | w1.x w1.y w1.z w1.w\n",
    "    // | ...\n",
    "    //Ci w3.x w3.y w3.z w3.w \n",
    "    // |\n",
    "    // |  \n",
    "    // |\n",
    "    // ▼\n",
    "    w0 = CONVERT4(vload4(0, filter + filter_offset));\n",
    "    w1 = CONVERT4(vload4(0, filter + filter_offset + 4));\n",
    "    w2 = CONVERT4(vload4(0, filter + filter_offset + 8));\n",
    "    w3 = CONVERT4(vload4(0, filter + filter_offset + 12));\n",
    "\n",
    "    // Unrolled two width index accesses, as well as \n",
    "    // vectorizing the load on the input channel (in_chan_idx)\n",
    "    in0 = CONVERT4(vload4(0, padded_input + in_offset));\n",
    "    in1 = CONVERT4(vload4(0, padded_input + in_offset + strided_chan));\n",
    "\n",
    "    // Broadcast each component of spatial vector and\n",
    "    // multiply by the elements of the packed output channel. \n",
    "    // Then accumulate into the out register vector for each \n",
    "    // input channel (along the commond dimension of the matmul)l\n",
    "    out0 = mad((DATA_TYPE4)(in0.x), w0, out0);\n",
    "    out0 = mad((DATA_TYPE4)(in0.y), w1, out0);\n",
    "    out0 = mad((DATA_TYPE4)(in0.z), w2, out0);\n",
    "    out0 = mad((DATA_TYPE4)(in0.w), w3, out0);\n",
    "\n",
    "    out1 = mad((DATA_TYPE4)(in1.x), w0, out1);\n",
    "    out1 = mad((DATA_TYPE4)(in1.y), w1, out1);\n",
    "    out1 = mad((DATA_TYPE4)(in1.z), w2, out1);\n",
    "    out1 = mad((DATA_TYPE4)(in1.w), w3, out1);\n",
    "\n",
    "    filter_offset += 16;\n",
    "    in_offset += 4;\n",
    "  }\n",
    "\n",
    "  int out_offset = mad24(mad24(mad24(batch_idx, out_height, out_height_idx),\n",
    "      out_width, out_width_idx), out_chan, out_chan_idx);\n",
    "\n",
    "// Handle extra output channels that may result from rounding up Co/4.\n",
    "#define WRITE_OUTPUT(i) \\\n",
    "  if (out_chan_idx + 4 > out_chan) {           \\\n",
    "    const int diff = out_chan - out_chan_idx;  \\\n",
    "    switch(diff) {                             \\\n",
    "      case 3:                                  \\\n",
    "        output[out_offset + 2] = CONVERT_TO(out##i.z, OUT_DATA_TYPE);     \\\n",
    "      case 2:                                  \\\n",
    "        output[out_offset + 1] = CONVERT_TO(out##i.y, OUT_DATA_TYPE);     \\\n",
    "      case 1:                                  \\\n",
    "        output[out_offset] = CONVERT_TO(out##i.x, OUT_DATA_TYPE);         \\\n",
    "    }                                          \\\n",
    "    CHECK_OUT_OF_RANGE_FOR_BUFFER(out_offset + diff - 1); \\\n",
    "  } else {                                     \\\n",
    "    VSTORE4(CONVERT_TO(out##i, OUT_DATA_TYPE4), output, out_offset);   \\\n",
    "  }\n",
    "\n",
    "  WRITE_OUTPUT(0);\n",
    "  if (out_width_idx + 1 >= out_width) return;\n",
    "  out_offset += out_chan;\n",
    "  WRITE_OUTPUT(1);\n",
    "#undef WRITE_OUTPUT\n",
    "\n",
    "}\n",
    "\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
 }
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Annotations for XiaoMi/mace's Conv2d (1x1) buffer-based OpenCL kernel\n",
	"---"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"\n",
	"High level observations:\n",
	"* No use of local memory, possibly to avoid barrier / memory fence synchronization\n",
	"* Use of 24 bit multiplication for index space transformations\n",
	"* Vector loads and stores\n",
	"* Basic fusions of activation and bias addition\n",
	"* Performs matrix multiplication via 2x4 (spatial x Co) outerproduct\n",
	"* Unrolls inner loop on filters input channels by a factor of 4\n",
	"* Use of multiply-accumulate intrinsics on vector types\n",
	"* Private memory register tiling used for accumulation.\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Reference: https://github.com/XiaoMi/mace/blob/master/mace/ops/opencl/cl/conv_2d_1x1_buffer.cl\n",
	"\n",
	"```c++\n",
	"#include <common.h>\n",
	"\n",
	"__kernel void conv2d(BUFFER_OUT_OF_RANGE_PARAMS\n",
	" GLOBAL_WORK_GROUP_SIZE_DIM2\n",
	" __global IN_DATA_TYPE *padded_input,\n",
	" __global IN_DATA_TYPE *filter,\n",
	"#ifdef BIAS\n",
	" __global IN_DATA_TYPE *bias,\n",
	"#endif\n",
	" __private const int in_height,\n",
	" __private const int in_width,\n",
	" __private const int in_chan,\n",
	" __private const int filter_in_chan,\n",
	" __private const int out_height,\n",
	" __private const int out_width,\n",
	" __private const int out_chan,\n",
	" __private const int stride_h,\n",
	" __private const int stride_w,\n",
	" __private const float relux_max_limit,\n",
	" __private const float leakyrelu_coefficient,\n",
	" __global OUT_DATA_TYPE *output) {\n",
	"\n",
	"\n",
	"```"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"<div class=\"alert alert-block alert-warning\">\n",
	"From below access pattern I infer the tensor layouts to be: \n",
	"\n",
	"NHWCi x CoCico4 -> NHWCo\n",
	"\n",
	"Outer loops over output pixels are bound to OpenCl global group indices:\n",
	"</div>\n",
	"\n",
	"```c++\n",
	" const int out_wc_blk_idx = get_global_id(0); // in [0, Wo/2 * Co/4)\n",
	" const int out_hb_idx = get_global_id(1); // in [0, N * Ho)\n",
	"\n",
	"```\n",
	"<div class=\"alert alert-block alert-warning\">\n",
	"OpenCL local_id abstraction is not utilized as hardware compute units are fully expressed via global work items.\n",
	"</div>\n",
	"\n",
	"```c++\n",
	"\n",
	"#ifndef NON_UNIFORM_WORK_GROUP\n",
	" if (out_wc_blk_idx >= global_size_dim0 \|\|\n",
	" out_hb_idx >= global_size_dim1) {\n",
	" return;\n",
	" }\n",
	"#endif\n",
	"\n",
	"```"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"<div class=\"alert alert-block alert-warning\">\n",
	"Output pixel coordinate calculation:\n",
	"\n",
	"\n",
	"This is `Co/4 = roundup(float(out_chan)/4.0))`\n",
	"</div>\n",
	"\n",
	"```c++\n",
	" const int out_chan_blk = (out_chan + 3) >> 2; \n",
	"\n",
	" const int out_width_blk_idx = out_wc_blk_idx / out_chan_blk;\n",
	" // out chan block index = (wo*co) % (Co/4)\n",
	" const int out_chan_blk_idx =\n",
	" out_wc_blk_idx - mul24(out_width_blk_idx, out_chan_blk);\n",
	"\n",
	"```\n",
	"\n",
	"<div class=\"alert alert-block alert-warning\">\n",
	"Let `out_wc_blk_idx = wo * co`, then given work group launch params:\n",
	"</div>\n",
	"\n",
	"$wo \\in [0,Wo/2)$\n",
	"\n",
	"$co \\in [0,Co/4)$\n",
	" \n",
	"<div class=\"alert alert-block alert-warning\">\n",
	"and, \n",
	"\n",
	"`out pixel width index = (wo*co) / (Co/4)`\n",
	"</div>\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"<div class=\"alert alert-block alert-warning\">\n",
	"Each work group will process two widths - idx. transformation: [0,Wo/2,step=1) -> [0,Wo,step=2), lane width = 2. Lanes are unrolled.\n",
	"\n",
	"Similarly, each work group will process 4 channels - idx. transformation: [0,Co/2,step=1) -> [0,Co,step=4), lane width = 4. Lanes are vectorized. \n",
	"</div>\n",
	"\n",
	"```c++\n",
	" const int out_width_idx = out_width_blk_idx << 1; \n",
	" const int out_chan_idx = out_chan_blk_idx << 2;\n",
	"```\n",
	"<div class=\"alert alert-block alert-warning\">\n",
	"\n",
	"Linear output batch and height index calculation:\n",
	"</div>\n",
	"\n",
	"```c++\n",
	" // out batch index = (n*ho) / Ho\n",
	" const int batch_idx = out_hb_idx / out_height;\n",
	" // out pixel height index = (n*ho) % Ho\n",
	" const int out_height_idx = out_hb_idx - mul24(batch_idx, out_height);\n",
	"```\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"<div class=\"alert alert-block alert-warning\">\n",
	"Coordinate transformation from outputs index space to input index space.\n",
	"For 1x1 convolution the output index space is just a scaled version of the input index space.\n",
	"</div>\n",
	"\n",
	"```c++\n",
	"\n",
	" // hi = ho * stride_h\n",
	" const int in_height_idx = mul24(out_height_idx, stride_h);\n",
	" // wi = wo * stride_w\n",
	" const int in_width_idx = mul24(out_width_idx, stride_w);\n",
	" // Cin = 240, stride size to next set of input channels = Cin * stride_w\n",
	" const int strided_chan = mul24(in_chan, stride_w);\n",
	" \n",
	"```\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"<div class=\"alert alert-block alert-warning\">\n",
	"Register tiled vector accumulators for vector stores to the output along `co`\n",
	"</div> \n",
	"\n",
	"```c++\n",
	"#ifdef BIAS\n",
	" DATA_TYPE4 out0 = CONVERT4(vload4(0, bias + out_chan_idx));\n",
	" DATA_TYPE4 out1 = out0;\n",
	"#else\n",
	" DATA_TYPE4 out0 = 0;\n",
	" DATA_TYPE4 out1 = 0;\n",
	"#endif\n",
	"```\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"<div class=\"alert alert-block alert-warning\">\n",
	"Offset into input memory from bound outerloops: \n",
	"</div> \n",
	"\n",
	"```c++\n",
	" // (((n * Hi + hi) * Wi + wi) * Ci \n",
	" int in_offset = mul24(mad24(mad24(batch_idx, in_height, in_height_idx),\n",
	" in_width, in_width_idx), in_chan);\n",
	"``` \n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"<div class=\"alert alert-block alert-warning\">\n",
	"\n",
	"Convert output channel index space (already vectorized into 4 lanes) into filter index space\n",
	"To get input filters corresponding to each output channel index we stride through the filter by filter input channels (the common dimension of the matmul). Outer loop over output channels is bound to out_chan_blk_idx. Inner loop over input channels Ci is unrolled 4 times.\n",
	"</div> \n",
	" \n",
	"```c++\n",
	" int filter_offset = mul24(out_chan_blk_idx, filter_in_chan) << 2;\n",
	"```\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"<div class=\"alert alert-block alert-warning\"> \n",
	"Local register tiles (vectors) for filter and input\n",
	"</div> \n",
	" \n",
	"```c++\n",
	" // Local register tiles for filter and input\n",
	" DATA_TYPE4 in0, in1;\n",
	" DATA_TYPE4 w0, w1, w2, w3;\n",
	"```\n"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"<div class=\"alert alert-block alert-warning\">\n",
	" \n",
	"Perform the outer product accumulation along Ci for 4 separates lanes of Co in the filter and two lanes of spatial dimensions in the input. Comments added for clarity:\n",
	"</div>\n",
	" \n",
	"```c++\n",
	" // Iteratate over common dimension Ci and accumulate into register vectors\n",
	" for (int in_chan_idx = 0; in_chan_idx < in_chan; in_chan_idx += 4) {\n",
	" // Unrolled filter access along Ci of four vector loads of packed axis co%4\n",
	" // * ---------- Co ---------->\n",
	" // \| w0.x w0.y w0.z w0.w ...\n",
	" // \| w1.x w1.y w1.z w1.w\n",
	" // \| ...\n",
	" //Ci w3.x w3.y w3.z w3.w \n",
	" // \|\n",
	" // \| \n",
	" // \|\n",
	" // ▼\n",
	" w0 = CONVERT4(vload4(0, filter + filter_offset));\n",
	" w1 = CONVERT4(vload4(0, filter + filter_offset + 4));\n",
	" w2 = CONVERT4(vload4(0, filter + filter_offset + 8));\n",
	" w3 = CONVERT4(vload4(0, filter + filter_offset + 12));\n",
	"\n",
	" // Unrolled two width index accesses, as well as \n",
	" // vectorizing the load on the input channel (in_chan_idx)\n",
	" in0 = CONVERT4(vload4(0, padded_input + in_offset));\n",
	" in1 = CONVERT4(vload4(0, padded_input + in_offset + strided_chan));\n",
	"\n",
	" // Broadcast each component of spatial vector and\n",
	" // multiply by the elements of the packed output channel. \n",
	" // Then accumulate into the out register vector for each \n",
	" // input channel (along the commond dimension of the matmul)l\n",
	" out0 = mad((DATA_TYPE4)(in0.x), w0, out0);\n",
	" out0 = mad((DATA_TYPE4)(in0.y), w1, out0);\n",
	" out0 = mad((DATA_TYPE4)(in0.z), w2, out0);\n",
	" out0 = mad((DATA_TYPE4)(in0.w), w3, out0);\n",
	"\n",
	" out1 = mad((DATA_TYPE4)(in1.x), w0, out1);\n",
	" out1 = mad((DATA_TYPE4)(in1.y), w1, out1);\n",
	" out1 = mad((DATA_TYPE4)(in1.z), w2, out1);\n",
	" out1 = mad((DATA_TYPE4)(in1.w), w3, out1);\n",
	"\n",
	" filter_offset += 16;\n",
	" in_offset += 4;\n",
	" }\n",
	"\n",
	" int out_offset = mad24(mad24(mad24(batch_idx, out_height, out_height_idx),\n",
	" out_width, out_width_idx), out_chan, out_chan_idx);\n",
	"\n",
	"// Handle extra output channels that may result from rounding up Co/4.\n",
	"#define WRITE_OUTPUT(i) \\\n",
	" if (out_chan_idx + 4 > out_chan) { \\\n",
	" const int diff = out_chan - out_chan_idx; \\\n",
	" switch(diff) { \\\n",
	" case 3: \\\n",
	" output[out_offset + 2] = CONVERT_TO(out##i.z, OUT_DATA_TYPE); \\\n",
	" case 2: \\\n",
	" output[out_offset + 1] = CONVERT_TO(out##i.y, OUT_DATA_TYPE); \\\n",
	" case 1: \\\n",
	" output[out_offset] = CONVERT_TO(out##i.x, OUT_DATA_TYPE); \\\n",
	" } \\\n",
	" CHECK_OUT_OF_RANGE_FOR_BUFFER(out_offset + diff - 1); \\\n",
	" } else { \\\n",
	" VSTORE4(CONVERT_TO(out##i, OUT_DATA_TYPE4), output, out_offset); \\\n",
	" }\n",
	"\n",
	" WRITE_OUTPUT(0);\n",
	" if (out_width_idx + 1 >= out_width) return;\n",
	" out_offset += out_chan;\n",
	" WRITE_OUTPUT(1);\n",
	"#undef WRITE_OUTPUT\n",
	"\n",
	"}\n",
	"\n",
	"```\n"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": []
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.6.9"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 4
	}