@kaushikcfd
Last active January 22, 2018 19:21

Timings on a machine with higher double-precision (DP) capability

Kernel            Porter    Quail
Mass              0.002     0.003
Laplace           0.012     0.008
Hyperelasticity   0.11      0.052

Porter: Nvidia Titan X. Quail: Nvidia K40c.

Got CSE (common subexpression elimination) working

Kernel            No CSE    With CSE
Mass              0.003     0.0028
Laplace           0.008     0.003
Hyperelasticity   0.052     0.0039

Interpretation:

  • CSE gives a substantial improvement. The effect is largest in compute-intensive kernels such as hyperelasticity, where the time drops by about an order of magnitude (0.052 to 0.0039, roughly 13x).
  • For hyperelasticity the FLOP count went down from 48936*nel to 130*nel, where nel is the number of elements.
  • The time to assemble a kernel also went down significantly because of the CSEs.
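To illustrate why CSE can cut the operation count so sharply, here is a minimal sketch in plain Python. It is not the loopy implementation: the tuple-based expression format, the function names, and the 2x2 example are all assumptions made for illustration. It counts operations when every subexpression is recomputed and again when each distinct subexpression is computed only once.

```python
# Hypothetical expression format (not loopy's): a leaf is a variable name,
# an inner node is a tuple (op, left, right).

def count_ops(expr):
    """Count binary operations when every subexpression is recomputed."""
    if isinstance(expr, str):
        return 0
    _, left, right = expr
    return 1 + count_ops(left) + count_ops(right)


def count_ops_with_cse(exprs):
    """Count operations when each distinct subexpression is computed once."""
    seen = set()

    def visit(expr):
        # Leaves cost nothing; an already-seen subexpression is reused.
        if isinstance(expr, str) or expr in seen:
            return
        seen.add(expr)
        visit(expr[1])
        visit(expr[2])

    for expr in exprs:
        visit(expr)
    return len(seen)


# Illustrative example: four quantities that all divide by the same
# determinant a*d - b*c, as happens when inverting a 2x2 matrix.
det = ("-", ("*", "a", "d"), ("*", "b", "c"))
exprs = [("/", "d", det), ("/", "b", det), ("/", "c", det), ("/", "a", det)]

naive_ops = sum(count_ops(e) for e in exprs)
cse_ops = count_ops_with_cse(exprs)
print(f"without CSE: {naive_ops} ops, with CSE: {cse_ops} ops")
```

In this toy case the determinant is evaluated four times without CSE and once with it. A kernel with many more shared subexpressions would see a correspondingly larger reduction.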

Kernel results for CPU

Kernel            Loopy     PyOP2     MatFree
Mass              0.029     0.0013    0.0021
Laplace           0.020     0.0013    0.0023
Hyperelasticity   0.041     0.0034    0.0088

The hardware has 2x 8 Xeon cores

Interpretation: Loopy is roughly 12x to 22x slower than PyOP2 on these kernels, so a lot of work remains on the CPU side.
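The gap can be quantified directly from the CPU table. The short sketch below copies the timings into a dictionary and computes how many times slower Loopy is than each baseline. The dictionary layout and the `slowdown` helper are illustrative names, not part of any of the libraries.

```python
# CPU timings copied from the table above (units as reported there).
timings = {
    "Mass": {"Loopy": 0.029, "PyOP2": 0.0013, "MatFree": 0.0021},
    "Laplace": {"Loopy": 0.020, "PyOP2": 0.0013, "MatFree": 0.0023},
    "Hyperelasticity": {"Loopy": 0.041, "PyOP2": 0.0034, "MatFree": 0.0088},
}


def slowdown(kernel, baseline="PyOP2"):
    """Return how many times slower Loopy is than the given baseline."""
    row = timings[kernel]
    return row["Loopy"] / row[baseline]


for kernel in timings:
    print(
        f"{kernel}: {slowdown(kernel):.1f}x vs PyOP2, "
        f"{slowdown(kernel, 'MatFree'):.1f}x vs MatFree"
    )
```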

Perf Numbers with Barvinok wrappers

  • No change in the bandwidth numbers.
  • A hand calculation of the lower bound on bandwidth gives roughly 30 GB/s.
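A hand calculation of this kind can be sketched as follows, assuming the lower bound comes from counting each input and output value exactly once (the footprint) and dividing by the measured kernel time. The function name and all input numbers are hypothetical and chosen only to show the arithmetic. They are not the values behind the figure quoted above.

```python
def footprint_bandwidth_gb_per_s(values_read, values_written, seconds,
                                 bytes_per_value=8):
    """Lower bound on achieved bandwidth in GB/s.

    Assumes every value is moved exactly once (no re-reads), so the real
    memory traffic can only be equal or higher.
    """
    total_bytes = (values_read + values_written) * bytes_per_value
    return total_bytes / seconds / 1e9


# Hypothetical inputs: 2,000,000 doubles read, 500,000 doubles written,
# kernel time of 2 ms.
estimate = footprint_bandwidth_gb_per_s(2_000_000, 500_000, 0.002)
print(f"footprint bandwidth: {estimate:.1f} GB/s")
```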

Currently working on:

  • Finding a way to schedule the indexed CSEs (we call this the Island problem).
  • Making operators.py compatible with the newer kernel._ir.
  • Minor issues with the Rayleigh-Bénard kernel.
  • Making loopy compute the footprint bandwidth.