Prototyping spherical harmonic transforms
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Spherical harmonic transforms from scratch.\n", | |
"\n", | |
"SHTns is fast, but our use case deviates a bit: we might have a lot of transform to do at once, our lmax is not very high and we may want to use OpenCL when on other platforms. Having the algorithm under control would be useful for us." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## DFT\n", | |
"\n", | |
"for the, not FFT rather, DFT, on the GPU, it could be fast enough to simply do it the O(N^2) way: Each thread computes its exp term and then increments the corresponding sum. The main issue is that the integration in time is a reduction. It would be nice to avoid the parallel reduction, but that would mean a grid of (nxfm, nlat), (lmax,) threads. \n", | |
"\n", | |
"\n", | |
"Two advantages of doing it ourselves:\n", | |
"\n", | |
"\n", | |
"- don't bother computer terms over lmax, while FFT computes them all\n", | |
"- no planning, no API, no library to link, directly port to OpenCL/OpenGL or other\n", | |
"- can choose data layout: separate real and imaginary parts completely, lower bandwidth on LT side\n", | |
"\n", | |
"Point 2 is non-trivial: with CUDA-only we eliminate a bunch of users from having an implementation as fast as their hardware allows." | |
] | |
}, | |
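{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"A minimal numba-cuda sketch of the variant that avoids the cross-thread reduction (illustrative assumptions: a (T, nseries) data layout, float32 accumulators, an untuned launch configuration). Each thread owns one (m, series) pair, generates its cos/sin terms on the fly, and reduces over time in-thread; as noted further below, this kind of kernel turned out not to be faster in practice:\n", | |
"\n", | |
"```python\n", | |
"import math\n", | |
"from numba import cuda, float32\n", | |
"\n", | |
"@cuda.jit\n", | |
"def naive_dft(x, out_re, out_im):\n", | |
"    # x: (T, nseries); out_re/out_im: (lmax, nseries); one thread per (m, series)\n", | |
"    m, j = cuda.grid(2)\n", | |
"    if m < out_re.shape[0] and j < x.shape[1]:\n", | |
"        T = x.shape[0]\n", | |
"        re = float32(0.0)\n", | |
"        im = float32(0.0)\n", | |
"        for t in range(T):\n", | |
"            arg = 2 * math.pi * m * t / T\n", | |
"            re += x[t, j] * math.cos(arg) # cos term\n", | |
"            im -= x[t, j] * math.sin(arg) # -i sin term\n", | |
"        out_re[m, j] = re\n", | |
"        out_im[m, j] = im\n", | |
"```\n", | |
"\n", | |
"launched e.g. as `naive_dft[(2, 28), (32, 32)](x, out_re, out_im)` to cover lmax=50 rows and 888 series." | |
] | |
}, | |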
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Definition\n", | |
"\n", | |
"Frequently resources describe spatial and frequency indices with $n$ and $k$ respectively, but in the SHT, $\\theta$ and $m$ are the domains,\n", | |
"\n", | |
"$$\n", | |
"X_m = \\sum_{t=0}^{T-1} x_n \\left[ \\cos (\\frac{2\\pi}{T}mt) - i \\sin (\\frac{2\\pi}{T}mt) \\right]\n", | |
"$$\n", | |
"\n", | |
"In our problem domain, we'd have a small number of transforms to do, at least 6, two hemispheres, and nlat transforms per hemisphere. For testing, let's say the SHT grid is 148, 74, matching the nfs example." | |
] | |
}, | |
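{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"A throwaway sanity check of this definition against numpy's FFT (`T` and `m` here are arbitrary):\n", | |
"\n", | |
"```python\n", | |
"import numpy as np\n", | |
"T, m = 8, 3\n", | |
"sig = np.random.randn(T)\n", | |
"t = np.arange(T)\n", | |
"X_m = np.sum(sig * (np.cos(2*np.pi/T*m*t) - 1j*np.sin(2*np.pi/T*m*t)))\n", | |
"assert np.allclose(X_m, np.fft.fft(sig)[m])\n", | |
"```" | |
] | |
}, | |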
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[64, 128]" | |
] | |
}, | |
"execution_count": 1, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# import os; os.environ['NUMBA_ENABLE_CUDASIM'] = '1'\n", | |
"import numpy as np\n", | |
"import numba as nb\n", | |
"import shtns\n", | |
"\n", | |
"nlon, nlat = 128, 64\n", | |
"lmax = 50\n", | |
"sht = shtns.sht(lmax)\n", | |
"sht.set_grid(nlat=nlat, nphi=nlon)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now we can write the above DFT as matrix multipllication:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"C = np.cos(2*np.pi/nlon*np.c_[:lmax]*np.r_[:nlon])\n", | |
"S = -np.sin(2*np.pi/nlon*np.c_[:lmax]*np.r_[:nlon])\n", | |
"FT = np.vstack([C,S]).astype('f')\n", | |
"out = np.zeros((lmax*2, 6 * 2 * 74), 'f')\n", | |
"x = np.zeros((nlon, 6 * 2 * 74), 'f')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"4.17 ms ± 89.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit np.dot(FT, x, out)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"and compare it to an FFT" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"480 µs ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit np.fft.fft(x, axis=0)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"So the FFT is faster than a matmul DFT, but not by much. For `lmax=50` it's a 3x faster, for `lmax=20`, 10%." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"If we push this to the GPU," | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import cupy" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"gFT, gx, gout = cupy.array(FT), cupy.array(x), cupy.array(out)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"The slowest run took 7.24 times longer than the fastest. This could mean that an intermediate result is being cached.\n", | |
"16.7 µs ± 17.4 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit cupy.dot(gFT, gx, out=gout)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"96.1 µs ± 48 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%timeit cupy.fft.fft(gx, axis=0)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"~~A custom kernel might be faster, where the reduction happens in-thread over time, and the sin/cos are computed on the fly,~~ was not faster at all.\n", | |
"\n", | |
"The last thing to benchmark is a clBlas version, maybe a futhark one too." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import pyopencl\n", | |
"import pyopencl.array\n", | |
"import pyopencl_blas\n", | |
"pyopencl_blas.setup()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"contexts = []\n", | |
"for platform in pyopencl.get_platforms():\n", | |
" for device in platform.get_devices():\n", | |
" contexts.append(pyopencl.Context([device]))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"class gemm:\n", | |
" def __init__(self, context, FT, x, out):\n", | |
" print(context)\n", | |
" self.queue = pyopencl.CommandQueue(context)\n", | |
" self.cFT = pyopencl.array.to_device(self.queue, FT)\n", | |
" self.cx = pyopencl.array.to_device(self.queue, x)\n", | |
" self.cout = pyopencl.array.to_device(self.queue, out)\n", | |
" def __call__(self):\n", | |
" pyopencl_blas.gemm(self.queue, self.cFT, self.cx, self.cout,\n", | |
" ).wait()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"<pyopencl.Context at 0x23565e78 on <pyopencl.Device '11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz' on 'Intel(R) CPU Runtime for OpenCL(TM) Applications' at 0x23302888>>\n", | |
"1.1 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n", | |
"<pyopencl.Context at 0x23d53c00 on <pyopencl.Device 'pthread-11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz' on 'Portable Computing Language' at 0x233018c0>>\n", | |
"4.07 ms ± 25.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n", | |
"<pyopencl.Context at 0x1b848a0 on <pyopencl.Device 'Intel(R) Iris(R) Xe Graphics [0x9a49]' on 'Intel(R) OpenCL HD Graphics' at 0x235386d0>>\n", | |
"346 µs ± 63.4 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n", | |
"<pyopencl.Context at 0x23d4c480 on <pyopencl.Device 'GeForce GTX 1650 Ti with Max-Q Design' on 'NVIDIA CUDA' at 0x2355cea0>>\n", | |
"91 µs ± 754 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"for context in contexts:\n", | |
" g = gemm(context, FT, x, out)\n", | |
" %timeit g()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Lastly just for fun let's test the Futhark version of these operations," | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"import \"lib/github.com/diku-dk/linalg/linalg\"\n", | |
"\n", | |
"module la = mk_linalg f32\n", | |
"\n", | |
"-- ==\n", | |
"-- random input {[100][128]f32 [128][888]f32}\n", | |
"-- random input {[300][512]f32 [512][3072]f32}\n", | |
"entry main [m][T][n] (FT: [m][T]f32) (data: [T][n]f32): [m][n]f32 =\n", | |
" la.matmul FT data\n", | |
"import \"lib/github.com/diku-dk/fft/stockham-radix-2\"\n", | |
"\n", | |
"module fft = mk_fft f32\n", | |
"\n", | |
"-- ==\n", | |
"-- random input {[888][128]f32} \n", | |
"-- random input {[3072][512]f32} \n", | |
"entry main [n][m] (x: [n][m]f32): [][](f32,f32) = \n", | |
" map fft.fft_re x\n" | |
] | |
} | |
], | |
"source": [ | |
"!cat matmul.fut\n", | |
"!cat dft.fut" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import shtfs\n", | |
"import importlib\n", | |
"shtfs = importlib.reload(shtfs)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"class gemm2:\n", | |
" def __init__(self, context, FT, x, out):\n", | |
" print(context)\n", | |
" self.queue = pyopencl.CommandQueue(context)\n", | |
" self.shtfs = shtfs.shtfs(self.queue)\n", | |
" self.cFT = pyopencl.array.to_device(self.queue, FT)\n", | |
" self.cxT = pyopencl.array.to_device(self.queue, x.T.copy())\n", | |
" self.cx = pyopencl.array.to_device(self.queue, x)\n", | |
" self.cout = pyopencl.array.to_device(self.queue, out)\n", | |
" def dft(self):\n", | |
" self.shtfs.dft1(self.cxT)\n", | |
" def mm(self):\n", | |
" self.shtfs.matmul(self.cFT, self.cx)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"((100, 128), (128, 888))" | |
] | |
}, | |
"execution_count": 15, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"FT.shape, x.shape" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"<pyopencl.Context at 0x23565e78 on <pyopencl.Device '11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz' on 'Intel(R) CPU Runtime for OpenCL(TM) Applications' at 0x23302888>>\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"/usr/lib/python3/dist-packages/pyopencl/__init__.py:233: CompilerWarning: Non-empty compiler output encountered. Set the environment variable PYOPENCL_COMPILER_OUTPUT=1 to see more.\n", | |
" warn(\"Non-empty compiler output encountered. Set the \"\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"1.18 ms ± 33.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n", | |
"815 µs ± 26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n", | |
"<pyopencl.Context at 0x23d53c00 on <pyopencl.Device 'pthread-11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz' on 'Portable Computing Language' at 0x233018c0>>\n", | |
"7.85 ms ± 666 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n", | |
"4.92 ms ± 61.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n", | |
"<pyopencl.Context at 0x1b848a0 on <pyopencl.Device 'Intel(R) Iris(R) Xe Graphics [0x9a49]' on 'Intel(R) OpenCL HD Graphics' at 0x235386d0>>\n", | |
"615 µs ± 51.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n", | |
"679 µs ± 35.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n", | |
"<pyopencl.Context at 0x23d4c480 on <pyopencl.Device 'GeForce GTX 1650 Ti with Max-Q Design' on 'NVIDIA CUDA' at 0x2355cea0>>\n", | |
"250 µs ± 14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n", | |
"202 µs ± 5.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"for context in contexts:\n", | |
" g = gemm2(context, FT, x, out)\n", | |
" %timeit g.dft()\n", | |
" %timeit g.mm()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"With some autotuning the numbers go down to 167us for the dft and 177us for the matmul. The CUDA backend is 5-15% faster. Interesting how the Xe is only about 2x off. \n", | |
"\n", | |
"It's close enough that we could roll with it, especially if it means churning out the code faster ++ C, multicore, CUDA & OpenCL. The autotuning and benchmarking is a plus." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## LT\n", | |
"\n", | |
"\n", | |
"https://en.wikipedia.org/wiki/Associated_Legendre_polynomials\n", | |
"\n", | |
"The first few have explicit form from P00 up to P44, then the recurrence formula that allow advancing are\n", | |
"\n", | |
"```\n", | |
"(l - m + 1) P( m,l+1, x) = (2 l + 1) x P(m,l,x) - (l + m) P(m,l-1,x)\n", | |
"sqrt(1-x^2) P(m+1, l, x) = (l - m) x P(m, l, x) - (l + m) P(m, l-1, x)\n", | |
"```\n", | |
"\n", | |
"To break the serial aspect and evaluate partrs of the sequence in parallel, we could use a few breakpoints. The above recurrence formula only need `P(m,l,x)` and `P(m,l-1,x)`. E.g. we compute lmax=70, then a halfway point might be around l20,m15, so from wolfram alpha,\n", | |
"\n", | |
"```\n", | |
"p(15,20,x) = -14776206365113318125/8 (1 - x^2)^(15/2) (1443 x^5 - 370 x^3 + 15 x)\n", | |
"p(15,19,x) = -2110886623587616875/8 (1 - x^2)^(15/2) (1295 x^4 - 210 x^2 + 3)\n", | |
"```\n", | |
"\n", | |
"where these might need to be evaluated in float64 before continuing in float.\n", | |
"then the first section would need start at P(0,1,x) and need P(0,0,x) and P(0,1,x) which are on wikipedia page.\n", | |
"\n", | |
"This would be reduction over the domain of $x$, which is really $sin(\\phi)$ with $\\phi$ from $0$ to $\\pi$. $sin(\\phi)$ is symmetric over $\\pi/2$, so coefficients can be reused, e.g. computing `P(.) (x(phi) + x(phi+pi/2))`. Reordering would help so reads from both halfs can be immediately used, but means that nlats are ordered $[0,pi/2], [pi, pi/2)$.\n", | |
"\n", | |
"Coefficients are reused across nlon and fields, but since those have O(n) memory access on larger strides, it may not be possible to share, rather mainly having a big kernel doing processing the whole dataset at once would be important. the grid would be `(nfield, nlon)x(nlat/2,nstart)` or `(nfield, nlon, nlat1)x(nlat2,nstart)` with `nlat1*nlat2=nlat/2` and `nlat1` vs `nlat2` is tunable for the hardware." | |
] | |
}, | |
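{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"As a sanity check, the first few explicit forms (from the Wikipedia page) against `scipy.special.lpmv`, which uses the same Condon-Shortley phase:\n", | |
"\n", | |
"```python\n", | |
"import numpy as np\n", | |
"from scipy.special import lpmv\n", | |
"\n", | |
"x = 0.3\n", | |
"explicit = {  # (m, l) -> P^m_l(x)\n", | |
"    (0, 0): 1.0,\n", | |
"    (0, 1): x,\n", | |
"    (1, 1): -np.sqrt(1 - x**2),\n", | |
"    (0, 2): 0.5 * (3*x**2 - 1),\n", | |
"    (1, 2): -3 * x * np.sqrt(1 - x**2),\n", | |
"    (2, 2): 3 * (1 - x**2),\n", | |
"}\n", | |
"for (m, l), v in explicit.items():\n", | |
"    assert np.allclose(v, lpmv(m, l, x))\n", | |
"```\n", | |
"\n", | |
"And a quick sketch of the breakpoint idea: seed the l-recurrence mid-sequence with two adjacent values (seeded from `lpmv` here for brevity, standing in for the float64 Wolfram Alpha forms above) and check that continuing the recurrence reproduces the later terms:\n", | |
"\n", | |
"```python\n", | |
"m, x = 15, 0.4\n", | |
"p_prev, p_curr = lpmv(m, 19, x), lpmv(m, 20, x)  # P(15,19,x), P(15,20,x)\n", | |
"for l in range(20, 40):\n", | |
"    # (l - m + 1) P(m,l+1,x) = (2 l + 1) x P(m,l,x) - (l + m) P(m,l-1,x)\n", | |
"    p_prev, p_curr = p_curr, ((2*l + 1)*x*p_curr - (l + m)*p_prev) / (l - m + 1)\n", | |
"assert np.allclose(p_curr, lpmv(m, 40, x))\n", | |
"```" | |
] | |
}, | |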
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The first step here would be to work out in Python the loop," | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This trickier than just a single loop. For each `l`, we need to the `range(m,l)` and then increment `l`, etc." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"x = 0.5\n", | |
"P = {}\n", | |
"P[(0,0)] = 1.0 # P^m_l == P[(m,l)]\n", | |
"P[(0,1)] = x\n", | |
"\n", | |
"m, l = 1, 1\n", | |
"P[(m, l)] = ((l - m + 1) * x * P[(m - 1, l)] - (l + m - 1) * P[(m - 1, l-1)]) / np.sqrt(1-x**2) \n", | |
"\n", | |
"m, l = 0, 2\n", | |
"P[(m,l)] = ((2 * (l-1) + 1) * x * P[(m,l-1)] - (l + m - 1) * P[(m,l-2)]) / (l - m)\n", | |
"\n", | |
"m = 1\n", | |
"P[(m, l)] = ((l - m + 1) * x * P[(m - 1, l)] - (l + m - 1) * P[(m - 1, l-1)]) / np.sqrt(1-x**2) \n", | |
"m = 2\n", | |
"P[(m, l)] = ((l - m + 1) * x * P[(m - 1, l)] - (l + m - 1) * P[(m - 1, l-1)]) / np.sqrt(1-x**2) " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"x = 0.5\n", | |
"P = {}\n", | |
"P[(0,0)] = 1.0 # P^m_l == P[(m,l)]\n", | |
"P[(0,1)] = x\n", | |
"m, l = 0, 1\n", | |
"while l <= lmax:\n", | |
" while m < l:\n", | |
" m += 1\n", | |
" P[(m, l)] = ((l - m + 1) * x * P[(m - 1, l)] - (l + m - 1) * P[(m - 1, l-1)]) / np.sqrt(1-x**2)\n", | |
" m = 0\n", | |
" l += 1\n", | |
" if l > lmax:\n", | |
" break\n", | |
" P[(m,l)] = ((2 * (l-1) + 1) * x * P[(m,l-1)] - (l + m - 1) * P[(m,l-2)]) / (l - m)\n", | |
"\n", | |
"from scipy.special import lpmv\n", | |
"for (m,l),v in P.items():\n", | |
" assert np.allclose(v,lpmv(m,l,x))\n", | |
"assert len(P) == sht.m.size" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"So the control flow is good, the values are good, just need to translate this into something a little lower level, usable with a bunch of registers instead of random access arrays.\n", | |
"\n", | |
"A few observations:\n", | |
"\n", | |
"- each m loop needs P for one previous m and two previous l\n", | |
"- l update needs one previous m and two previous values of l\n", | |
"- but m loop only updates one coefficient\n", | |
"- and l loop starts with m=0\n", | |
"\n", | |
"so we can use just an array of l=0" | |
] | |
}, | |
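{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The lower-level translation can use a rolling pair of values per recurrence; a minimal sketch for the m=0 chain only, reusing `x`, `lmax` and `lpmv` from above:\n", | |
"\n", | |
"```python\n", | |
"# keep only the last two values of the m=0 l-recurrence in 'registers'\n", | |
"p_a, p_b = 1.0, x  # P(0,0), P(0,1)\n", | |
"for l in range(2, lmax + 1):\n", | |
"    p_b, p_a = ((2*l - 1) * x * p_b - (l - 1) * p_a) / l, p_b\n", | |
"assert np.allclose(p_b, lpmv(0, lmax, x))  # p_b now holds P(0,lmax,x)\n", | |
"```" | |
] | |
}, | |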
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"SHTns uses recurrence relations which look simpler, notably parts which aren't dependent on x and could be precomputed. Let's try anyway:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"x = 0.5\n", | |
"rs1mx = 1.0 / np.sqrt(1 - x**2)\n", | |
"\n", | |
"# compute all l for m=0\n", | |
"m = 0\n", | |
"p0l = np.zeros((lmax+1,))\n", | |
"p0l[0] = 1.0\n", | |
"p0l[1] = x\n", | |
"for l in range(2,p0l.size):\n", | |
" p0l[l] = ((2 * (l - 1) + 1) * x * p0l[l - 1] - (l + m - 1) * p0l[l - 2]) / (l - m)\n", | |
" \n", | |
"assert np.allclose(p0l, lpmv(0,np.r_[:lmax+1], x))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"With all the \"starting points\" computed, we can then generate the sequence of m for each l\n", | |
"\n", | |
"**or** just advance one m and then continue the range of l\n", | |
"\n", | |
"**or** use the following elration where l doesn't change:\n", | |
"$$\n", | |
"2mx P(m,l,x) = -\\sqrt(1-x^2)(P(m+1,l,x) + (l + m)(l-m+1)P(m-1,l,x))\n", | |
"$$" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"$$\n", | |
"P(m,l,x) = - 2(m-1)x P(m-1,l,x)/\\sqrt(1-x^2) -(l + m-1)(l-(m-1)+1)P(m-2,l,x))\n", | |
"$$\n", | |
"\n", | |
"where the $P(1,l,x)$ can be done with the 1st recurrence relation for m." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"l = 15\n", | |
"pml = np.zeros((l+1,))\n", | |
"pml[0] = p0l[l]\n", | |
"pml[1] = (l*x*p0l[l] - l*p0l[l-1])*rs1mx\n", | |
"for m in range(2,pml.size):\n", | |
" pml[m] = -2*(m-1)*x*pml[m-1]*rs1mx - (l+m-1)*(l-(m-1)+1)*pml[m-2] # ok\n", | |
"assert np.allclose(pml, lpmv(np.r_[:pml.size],l,x))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Putting it together," | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(1323, 1326)" | |
] | |
}, | |
"execution_count": 23, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"x = 0.5\n", | |
"rs1mx = 1.0 / np.sqrt(1 - x**2)\n", | |
"P = []\n", | |
"p0l = np.zeros((lmax+1,))\n", | |
"p0l[0] = 1.0; # TODO missing p00, p01 and p11\n", | |
"p0l[1] = x\n", | |
"for l in range(2,p0l.size):\n", | |
" p0l[l] = ((2 * (l - 1) + 1) * x * p0l[l - 1] - (l - 1) * p0l[l - 2]) / l\n", | |
" pml = np.zeros((l+1,))\n", | |
" pml[0] = p0l[l]\n", | |
" pml[1] = (l*x*p0l[l] - l*p0l[l-1])*rs1mx\n", | |
" for m in range(2,pml.size):\n", | |
" pml[m] = -2*(m-1)*x*pml[m-1]*rs1mx - (l+m-1)*(l-(m-1)+1)*pml[m-2]\n", | |
" P.append(pml)\n", | |
"P = np.hstack(P)\n", | |
"P.size, sht.m.size" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 26, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"1326.0" | |
] | |
}, | |
"execution_count": 26, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"(lmax+1)*(lmax + 2)/2" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Only two values are ever required to advance, but separately for l and m." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 27, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from math import sqrt\n", | |
"@nb.jit\n", | |
"def ps(lmax, x):\n", | |
" rs1mx = nb.float32(1.0 / sqrt(1.0 - x**2))\n", | |
" p0l_a = nb.float32(1.0) # p00\n", | |
" p0l_b = x # p01\n", | |
" acc = nb.float64(0.0)\n", | |
" for l in range(2, lmax + 1):\n", | |
" p0l_b, p0l_a = ((2 * (l - 1) + 1) * x * p0l_b - (l - 1) * p0l_a) / l, p0l_b # p02\n", | |
" pml_a = p0l_b # p22\n", | |
" pml_b = (l*x*p0l_b - l*p0l_a)*rs1mx # p32\n", | |
" for m in range(2, l+1):\n", | |
" pml_b, pml_a = -2*(m-1)*x*pml_b*rs1mx - (l+m-1)*(l-(m-1)+1)*pml_a, pml_b # p42\n", | |
" acc += pml_b" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 28, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"111 ns ± 13.3 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"ps(50,x)\n", | |
"%timeit ps(50,x)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 29, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from math import sqrt\n", | |
"from numba import cuda\n", | |
"@cuda.jit\n", | |
"def ps(lmax, x):\n", | |
" rs1mx = nb.float32(1.0 / sqrt(1.0 - x**2))\n", | |
" p0l_a = nb.float32(1.0) # p00\n", | |
" p0l_b = x # p01\n", | |
" acc = nb.float64(0.0)\n", | |
" for l in range(2, lmax + 1):\n", | |
" p0l_b, p0l_a = ((2 * (l - 1) + 1) * x * p0l_b - (l - 1) * p0l_a) / l, p0l_b # p02\n", | |
" pml_a = p0l_b # p22\n", | |
" pml_b = (l*x*p0l_b - l*p0l_a)*rs1mx # p32\n", | |
" for m in range(2, l+1):\n", | |
" pml_b, pml_a = -2*(m-1)*x*pml_b*rs1mx - (l+m-1)*(l-(m-1)+1)*pml_a, pml_b # p42\n", | |
" acc += pml_b" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 30, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"50 µs ± 3.49 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"ps[1,32](50,x)\n", | |
"%timeit ps[1,32](50,x)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This is getting close to the overhead of kernel launch itself. \n", | |
"\n", | |
"So there's little overhead to generating the coefficients on the fly. The main trick now is to optimize data movement for the reduction." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We could also parallelize threads over l: compute all p0l first, or even precompute them for all values of x (which is `sin(np.r_[:pi/2:1j*nlat/2])`: this would be `lmax * nlat/2` coefficients, or `50 * 16=800` which is cheap. Then each thread block would be simpler:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 31, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"@cuda.jit\n", | |
"def ps2(p0l, x, data):\n", | |
" m = blockIdx.y\n", | |
" l = threadIdx.x # up to lmax\n", | |
" lat = threadIdx.y # up to nlat/2\n", | |
" acc = nb.float64(0.0)\n", | |
" pml_a = p0l[l,lat]\n", | |
" acc += pml_a * (data[m, lat] + data[m, lat+32])\n", | |
" pml_b = (l*x*p0l[l, lat] - l*p0l[l-1, lat])*rs1mx\n", | |
" if (l > 0):\n", | |
" acc += pml_b * (data[m, lat] + data[m, lat+32])\n", | |
" for m in range(2, l + 1):\n", | |
" if (m <= l):\n", | |
" pml_b, pml_a = -2*(m-1)*x*pml_b*rs1mx - (l+m-1)*(l-(m-1)+1)*pml_a, pml_b # p42\n", | |
" acc += pml_b * (data[m, lat] + data[m, lat+32])\n", | |
" # reduce over lat (assuming nlat=64)\n", | |
" acc += cuda.shfl_down_sync(lat<32, acc, 32)\n", | |
" acc += cuda.shfl_down_sync(lat<16, acc, 16)\n", | |
" acc += cuda.shfl_down_sync(lat<8, acc, 8)\n", | |
" acc += cuda.shfl_down_sync(lat<4, acc, 8)\n", | |
" acc += cuda.shfl_down_sync(lat<2, acc, 2)\n", | |
" acc += cuda.shfl_down_sync(lat<1, acc, 1)\n", | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We need to sum over lats here. \n", | |
"\n", | |
"We could also consider fusing this with the DFT pass, by processing blocks of lon x lat in shared memory." | |
] | |
}, | |
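{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"For reference, a host-side sketch of the `p0l` table the kernel above assumes (the latitude grid here is an illustrative choice, reusing `nlat` and `lmax` from earlier):\n", | |
"\n", | |
"```python\n", | |
"import numpy as np\n", | |
"from scipy.special import lpmv\n", | |
"\n", | |
"xs = np.sin(np.r_[0:np.pi/2:1j*(nlat//2)])  # one x per latitude in [0, pi/2]\n", | |
"p0l = np.zeros((lmax + 1, xs.size))\n", | |
"p0l[0] = 1.0\n", | |
"p0l[1] = xs\n", | |
"for l in range(2, lmax + 1):\n", | |
"    # m=0 recurrence: l P(0,l,x) = (2l-1) x P(0,l-1,x) - (l-1) P(0,l-2,x)\n", | |
"    p0l[l] = ((2*l - 1) * xs * p0l[l-1] - (l - 1) * p0l[l-2]) / l\n", | |
"\n", | |
"assert np.allclose(p0l, lpmv(0, np.c_[:lmax + 1], xs))\n", | |
"```" | |
] | |
}, | |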
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"When using multiple dimensions of the thread block, I wasn't sure about the ordering of the threads. This is important since we always need the read/writes ot be coalesced. This is a quick check:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 41, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"@cuda.jit\n", | |
"def threadblock():\n", | |
" i = cuda.threadIdx.x\n", | |
" j = cuda.threadIdx.y\n", | |
" n = cuda.blockDim.x\n", | |
" m = cuda.blockDim.y\n", | |
" print(j*m + i)\n", | |
"threadblock[1,(4,4)]()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Prints the sequence 0,1,2,3,4...14,15 (in the terminal), so is thread block are column major." | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "nfs", | |
"language": "python", | |
"name": "nfs" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.8.6" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 4 | |
} |