| Kernel | Porter | Quail |
|---|---|---|
| Mass | 0.002 | 0.003 |
| Laplace | 0.012 | 0.008 |
| Hyperelasticity | 0.11 | 0.052 |
Porter: Nvidia Titan X
Quail: Nvidia K40c
| Kernel | NO CSE | WITH CSE |
|---|---|---|
| Mass | 0.003 | 0.0028 |
| Laplace | 0.008 | 0.003 |
| Hyperelasticity | 0.052 | 0.0039 |
Interpretation:
- Using common subexpression elimination (CSE) gives a substantial improvement. The effect is most visible in compute-intensive kernels such as hyperelasticity, where the timing improves by an order of magnitude.
- For hyperelasticity, the FLOP count went down from `48936*nel` to `130*nel` (`nel` being the number of elements used).
- The time to assemble a kernel also went down significantly because of the CSEs.
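
To make the size of the improvement concrete, the ratios can be worked out directly from the numbers reported above. This is only an arithmetic sketch: the timings and FLOP counts are copied from these notes, while the names `timings`, `speedup`, and `flop_reduction` are introduced here for illustration.

```python
# Timings copied from the NO CSE / WITH CSE table (units as reported there).
timings = {
    "Mass": (0.003, 0.0028),
    "Laplace": (0.008, 0.003),
    "Hyperelasticity": (0.052, 0.0039),
}


def speedup(before, after):
    """Ratio of the timing without CSE to the timing with CSE."""
    return before / after


speedups = {name: speedup(*pair) for name, pair in timings.items()}

# Per-element FLOP counts for hyperelasticity quoted in the notes
# (48936*nel before CSE, 130*nel after); nel cancels in the ratio.
flop_reduction = 48936 / 130

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x faster with CSE")
print(f"Hyperelasticity FLOP reduction: {flop_reduction:.0f}x")
```

For hyperelasticity the timing improves by about 13x while the FLOP count drops by about 376x, so the runtime gain is considerably smaller than the reduction in arithmetic.
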
| Kernel | Loopy | PyOP2 | MatFree |
|---|---|---|---|
| Mass | 0.029 | 0.0013 | 0.0021 |
| Laplace | 0.020 | 0.0013 | 0.0023 |
| Hyperelasticity | 0.041 | 0.0034 | 0.0088 |
The hardware has 2x 8 Xeon cores
Interpretation: The Loopy timings are roughly 12 to 22 times the PyOP2 timings for these kernels, so a lot of work remains on the CPU side.
- No change in the bandwidth numbers.
- A hand calculation of the lower bound on the bandwidth gives roughly 30 GB/s.
- Finding a way to schedule the indexed CSEs (we call this the Island problem).
- Making `operators.py` compatible with the newer `kernel._ir`.
- Minor issues with the `Rayleigh Bernard` kernel.
- Making `loopy` compute the `footprint` bandwidth.
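
The hand calculation of a bandwidth lower bound mentioned above can be sketched as follows. The idea is that a kernel must at least read its inputs and write its outputs once, so dividing that minimal memory footprint by the measured runtime gives a lower bound on the achieved bandwidth. Everything in this sketch is an assumption made for illustration: the function name `footprint_bandwidth_gbs`, the element count, the per-element data sizes, and the runtime are hypothetical and are not the values behind the 30 GB/s figure. It also does not use loopy's own API.

```python
def footprint_bandwidth_gbs(n_elements, bytes_per_element, seconds):
    """Lower bound on achieved bandwidth in GB/s.

    Assumes every element's data crosses the memory bus exactly once,
    which is the minimal (footprint) traffic.
    """
    total_bytes = n_elements * bytes_per_element
    return total_bytes / seconds / 1e9


# Hypothetical example: each element reads 3 doubles and writes 3 doubles.
nel = 1_000_000                    # assumed number of elements
bytes_per_element = (3 + 3) * 8    # 48 bytes of footprint per element
runtime = 0.002                    # assumed measured runtime in seconds

bw = footprint_bandwidth_gbs(nel, bytes_per_element, runtime)
print(f"Footprint bandwidth: {bw:.1f} GB/s")
```

Because real traffic can exceed the footprint (for example through repeated loads of shared data), the actual bandwidth is at least this value.
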