kaushikcfd/work_summary_22jan.md

Last active January 22, 2018 19:21



Timings on a machine with higher double-precision (DP) capability

| Kernel | Porter | Quail |
|---|---|---|
| Mass | 0.002 | 0.003 |
| Laplace | 0.012 | 0.008 |
| Hyperelasticity | 0.11 | 0.052 |

Porter: Nvidia Titan X; Quail: Nvidia K40c

Got common subexpression elimination (CSE) working

| Kernel | NO CSE | WITH CSE |
|---|---|---|
| Mass | 0.003 | 0.0028 |
| Laplace | 0.008 | 0.003 |
| Hyperelasticity | 0.052 | 0.0039 |
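As a rough illustration of why CSE helps, here is a minimal sketch of generic hash-consing CSE. It is written for this note and is not the project's or Loopy's actual CSE pass. Expressions are nested tuples, every binary operation is assumed to cost one FLOP, and the sketch compares the operation count of the naive expression tree with the number of distinct temporaries left after elimination. The names `count_ops_tree` and `eliminate_common_subexpressions` are hypothetical.

```python
from typing import Union

# An expression is either a leaf (a variable name) or a tuple (op, lhs, rhs).
Expr = Union[str, tuple]


def count_ops_tree(expr: Expr) -> int:
    """FLOPs when every subexpression is recomputed (no CSE)."""
    if isinstance(expr, str):
        return 0
    _, lhs, rhs = expr
    return 1 + count_ops_tree(lhs) + count_ops_tree(rhs)


def eliminate_common_subexpressions(expr: Expr):
    """Hash-cons the tree so each distinct subexpression gets one temporary.

    Returns (result_name, assignments), where assignments is a list of
    (temp_name, op, lhs_name, rhs_name) in a valid evaluation order.
    """
    seen: dict = {}
    assignments: list = []

    def visit(e: Expr) -> str:
        if isinstance(e, str):
            return e
        op, lhs, rhs = e
        # Children are visited first, so their temporaries precede ours.
        key = (op, visit(lhs), visit(rhs))
        if key not in seen:
            name = f"_cse{len(seen)}"
            seen[key] = name
            assignments.append((name, *key))
        return seen[key]

    return visit(expr), assignments


# Toy example: (a*b + c) * (a*b + c), where a*b and a*b + c each appear twice.
inner = ("+", ("*", "a", "b"), "c")
expr = ("*", inner, inner)

result, assignments = eliminate_common_subexpressions(expr)
print("FLOPs without CSE:", count_ops_tree(expr))  # 5
print("FLOPs with CSE:", len(assignments))         # 3
for name, op, lhs, rhs in assignments:
    print(f"{name} = {lhs} {op} {rhs}")
```

The saving grows with how often subexpressions repeat, which is why a large generated kernel like hyperelasticity benefits far more than the mass kernel.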

Interpretation:

Using CSE gives a substantial improvement. The effect is most visible in compute-intensive kernels such as hyperelasticity, where the time drops by an order of magnitude (0.052 to 0.0039).
For hyperelasticity, the FLOP count went down from 48936*nel to 130*nel, where nel is the number of elements.
The time to assemble the kernel also dropped significantly because of the CSEs.
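The speedups implied by the table, and the per-element FLOP reduction quoted for hyperelasticity, can be checked with a few lines of arithmetic. The numbers below are copied from this summary; the variable names are only illustrative.

```python
# Timings copied from the NO CSE / WITH CSE table (same units as the table).
timings = {
    "Mass": (0.003, 0.0028),
    "Laplace": (0.008, 0.003),
    "Hyperelasticity": (0.052, 0.0039),
}

# Speedup = time without CSE / time with CSE.
speedups = {name: no_cse / with_cse for name, (no_cse, with_cse) in timings.items()}
for name, s in speedups.items():
    print(f"{name}: {s:.1f}x faster with CSE")
# Mass: 1.1x, Laplace: 2.7x, Hyperelasticity: 13.3x

# FLOPs per element for hyperelasticity, before and after CSE.
flop_reduction = 48936 / 130
print(f"Hyperelasticity FLOP reduction: {flop_reduction:.1f}x")  # 376.4x
```

The roughly 376x drop in FLOPs yields only a roughly 13x drop in time. One possible reading, which these numbers alone do not confirm, is that after CSE the kernel is limited by something other than arithmetic, such as memory traffic.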

Kernel timings on CPU

| Kernel | Loopy | PyOP2 | MatFree |
|---|---|---|---|
| Mass | 0.029 | 0.0013 | 0.0021 |
| Laplace | 0.020 | 0.0013 | 0.0023 |
| Hyperelasticity | 0.041 | 0.0034 | 0.0088 |

The hardware has 2 x 8 Xeon cores (two 8-core Xeon processors).

Interpretation: Loopy is roughly 12x to 22x slower than PyOP2 on these kernels, so a lot of work remains on the CPU side.
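The size of the CPU gap can be read off the table by dividing the Loopy time by each reference implementation's time. The numbers are copied from the table above; the dictionary layout is only illustrative.

```python
# CPU timings copied from the table above (same units as the table).
cpu = {
    "Mass": {"Loopy": 0.029, "PyOP2": 0.0013, "MatFree": 0.0021},
    "Laplace": {"Loopy": 0.020, "PyOP2": 0.0013, "MatFree": 0.0023},
    "Hyperelasticity": {"Loopy": 0.041, "PyOP2": 0.0034, "MatFree": 0.0088},
}

# Slowdown of Loopy relative to each reference: values > 1 mean Loopy is slower.
slowdown = {
    kernel: {ref: t["Loopy"] / t[ref] for ref in ("PyOP2", "MatFree")}
    for kernel, t in cpu.items()
}

for kernel, ratios in slowdown.items():
    print(f"{kernel}: {ratios['PyOP2']:.1f}x slower than PyOP2, "
          f"{ratios['MatFree']:.1f}x slower than MatFree")
# Mass: 22.3x / 13.8x, Laplace: 15.4x / 8.7x, Hyperelasticity: 12.1x / 4.7x
```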

Perf Numbers with Barvinok wrappers

No change in the bandwidth numbers.
A hand calculation of the lower bound on bandwidth gives a figure of around 30 GB/s.
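A minimal sketch of one way such a lower bound can be computed, assuming "lower bound" means compulsory (cold-cache) traffic: the bytes of distinct array entries the kernel must touch, divided by the measured time. Counting those distinct entries is what a Barvinok-style polyhedral count does symbolically. This sketch simply enumerates them by brute force on a small domain. `footprint_bytes`, `bandwidth_lower_bound`, and the toy 1D mesh are hypothetical and not taken from the project.

```python
import itertools
from typing import Callable, Sequence, Tuple


def footprint_bytes(domain: Sequence[range],
                    index_map: Callable[..., Tuple[int, ...]],
                    itemsize: int) -> int:
    """Bytes occupied by the distinct array entries touched over a loop domain.

    Enumerates every iteration point by brute force; entries accessed more
    than once are counted only once.
    """
    touched = {index_map(*point) for point in itertools.product(*domain)}
    return len(touched) * itemsize


def bandwidth_lower_bound(total_footprint_bytes: int, seconds: float) -> float:
    """Assuming a cold cache, every touched byte crosses the memory bus at
    least once, so footprint / time bounds the achieved bandwidth from below."""
    return total_footprint_bytes / seconds


# Hypothetical toy: 1D mesh where element e reads float64 coordinates of
# nodes e and e + 1, so neighbouring elements share a node.
nel = 4
fp = footprint_bytes([range(nel), range(2)], lambda e, i: (e + i,), itemsize=8)
print(fp)  # 40: 5 distinct nodes * 8 bytes, versus 64 bytes of raw accesses
```

Summing the footprints of all arrays a kernel reads and writes, then passing the total and the measured kernel time to `bandwidth_lower_bound`, gives a figure comparable to the hand-calculated one.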

Currently working on:

- Finding a way to schedule the indexed CSEs (what we call the "island problem").
- Making operators.py compatible with the newer kernel._ir.
- Fixing minor issues with the Rayleigh-Bénard kernel.
- Making Loopy compute the footprint bandwidth.
