@allanmac
Last active June 30, 2018 16:12
A strategy for converting a float3 SoA into AoS without using shared memory.
===============================================================================================
Load three arrays (x, y and z) in SoA order, repack them and store them in AoS order.
Strategy: for each row (rowNum = 0, 1, 2 for x, y, z), each warp permutes its load lane with:
  (rowNum + (laneId() * 3)) & 31
This will convert SoA into AoS but with x/y/z staggered across rows of registers.
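
As a minimal host-side sketch (plain C++ standing in for CUDA, so it can be checked without a GPU), the formula can be read as the destination lane for the value that lane laneId holds from row rowNum. Because 3 * 11 = 33 = 32 + 1, multiplying by 11 inverts the multiplication by 3 modulo 32, which gives the lane a gather (a permuted load index or a warp shuffle) would read from. The names dest_lane, src_lane and mapping_round_trips are hypothetical, and reading the formula as a destination lane is an assumption based on the diagrams in this note.

```cpp
#include <cstdint>

// Hypothetical helper: lane that should end up holding the value that `lane`
// holds from row `row` (0 = x, 1 = y, 2 = z). This is the formula from the text.
std::uint32_t dest_lane(std::uint32_t row, std::uint32_t lane) {
    return (row + lane * 3u) & 31u;
}

// Hypothetical inverse: the lane (or element index within the warp's 32-wide
// chunk) to read from so that `lane` receives its value for row `row`.
// Unsigned wrap-around is harmless here because 32 divides 2^32.
std::uint32_t src_lane(std::uint32_t row, std::uint32_t lane) {
    return ((lane - row) * 11u) & 31u;
}

// Checks that src_lane undoes dest_lane for every row and lane of a warp.
bool mapping_round_trips() {
    for (std::uint32_t row = 0; row < 3; ++row)
        for (std::uint32_t lane = 0; lane < 32; ++lane)
            if (src_lane(row, dest_lane(row, lane)) != lane) return false;
    return true;
}
```

In a kernel, one plausible use would be a shuffle such as __shfl_sync(0xffffffff, x, src_lane(0, laneId)), or indexing the global load with src_lane directly; either keeps each row's accesses inside the same 32-float chunk.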
===============================================================================================
0-31:
 0  -  -  3  -  -  6  -  -  9  -  - 12  -  - 15  -  - 18  -  - 21  -  - 24  -  - 27  -  - 30  -
 -  1  -  -  4  -  -  7  -  - 10  -  - 13  -  - 16  -  - 19  -  - 22  -  - 25  -  - 28  -  - 31
 -  -  2  -  -  5  -  -  8  -  - 11  -  - 14  -  - 17  -  - 20  -  - 23  -  - 26  -  - 29  -  -
0-63:
 0 33  -  3 36  -  6 39  -  9 42  - 12 45  - 15 48  - 18 51  - 21 54  - 24 57  - 27 60  - 30 63
 -  1 34  -  4 37  -  7 40  - 10 43  - 13 46  - 16 49  - 19 52  - 22 55  - 25 58  - 28 61  - 31
32  -  2 35  -  5 38  -  8 41  - 11 44  - 14 47  - 17 50  - 20 53  - 23 56  - 26 59  - 29 62  -
0-95:
 0 33 66  3 36 69  6 39 72  9 42 75 12 45 78 15 48 81 18 51 84 21 54 87 24 57 90 27 60 93 30 63
64  1 34 67  4 37 70  7 40 73 10 43 76 13 46 79 16 49 82 19 52 85 22 55 88 25 58 91 28 61 94 31
32 65  2 35 68  5 38 71  8 41 74 11 44 77 14 47 80 17 50 83 20 53 86 23 56 89 26 59 92 29 62 95
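
The tables above can be reproduced with a short host-side sketch. It assumes the numbers are flat AoS output indices (3 * element + component), that columns are lanes, and that the three rows are the x, y and z registers after the lane permutation. The name staggered_index is hypothetical.

```cpp
#include <cstdint>

// Flat AoS index held by `lane` in register row `row` (0 = x, 1 = y, 2 = z)
// after the lane permutation, under the reading of the diagrams stated above.
std::uint32_t staggered_index(std::uint32_t row, std::uint32_t lane) {
    // Element whose component `row` lands in this lane: solve
    // (row + 3 * element) & 31 == lane, using 11 as the inverse of 3 mod 32.
    const std::uint32_t element = ((lane - row) * 11u) & 31u;
    // Position of that component in the stream x0 y0 z0 x1 y1 z1 ...
    return 3u * element + row;
}
```

Every value returned for a given lane is congruent to that lane modulo 32, so each value already sits in the lane that will store it; only the register row is wrong, which is what the exchanges in the next section correct.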
===============================================================================================
Permutation vector for each lane, by laneId % 3
(entry i is the output row currently held in register r<i>):
mod3=0  mod3=1  mod3=2
------  ------  ------
   0       1       2
   2       0       1
   1       2       0
if (laneIsMod0) xchg(r1,r2);
if (laneIsMod1) xchg(r0,r1);
if (laneIsMod2) xchg(r0,r2);
At this point the 3 x 32 float rows are in float3 order.
Or, instead of exchanging registers, just use two SELP ops (p ? a : b) per stored row to select which register to store out to device or host mem.
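
The whole sequence for one warp can be simulated on the host as a rough sketch: a permuted gather, then either the three conditional exchanges or the select-based alternative, then three row-wise stores. This is plain C++ standing in for a CUDA kernel. The names Row and soa_to_aos_warp and the use_select flag are hypothetical, and the gather index uses 11 as the inverse of 3 modulo 32, which is an assumption about how the lane permutation would be realized.

```cpp
#include <array>
#include <cstdint>
#include <utility>

using Row = std::array<float, 32>;  // one register viewed across the 32 lanes of a warp

// Simulates one warp converting 32 x/y/z values into 96 floats in
// x0 y0 z0 x1 y1 z1 ... order. Each loop iteration plays the role of one lane.
std::array<float, 96> soa_to_aos_warp(const Row& x, const Row& y, const Row& z,
                                      bool use_select = false) {
    std::array<float, 96> out{};
    for (std::uint32_t lane = 0; lane < 32; ++lane) {
        // Permuted gather: fetch the element whose component belongs in this lane.
        float r0 = x[(lane * 11u) & 31u];
        float r1 = y[((lane - 1u) * 11u) & 31u];
        float r2 = z[((lane - 2u) * 11u) & 31u];

        const bool m0 = (lane % 3u) == 0;  // laneIsMod0
        const bool m1 = (lane % 3u) == 1;  // laneIsMod1
        float o0, o1, o2;
        if (use_select) {
            // Two selects per stored row, leaving r0..r2 untouched.
            o0 = m0 ? r0 : (m1 ? r1 : r2);
            o1 = m0 ? r2 : (m1 ? r0 : r1);
            o2 = m0 ? r1 : (m1 ? r2 : r0);
        } else {
            // The three conditional exchanges from the text.
            if (m0)      std::swap(r1, r2);
            else if (m1) std::swap(r0, r1);
            else         std::swap(r0, r2);
            o0 = r0; o1 = r1; o2 = r2;
        }

        // Row-wise store: row k of the warp goes to out[32 * k + lane].
        out[lane]       = o0;
        out[32u + lane] = o1;
        out[64u + lane] = o2;
    }
    return out;
}
```

Both paths write the same 96 values, so the choice between exchanges and selects only affects how many registers move before the store.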
===============================================================================================
If there is no need to expose the float3 you can simplify any future
load/store by packing a 3x32 float "block" into a 2x64 + 1x32 form.
If this is acceptable, then another variant of the above permutation
strategy would only permute the x/y rows and leave z intact.
All of this could've been avoided if the original source of the SoA
arrays had interleaved the x/y rows, followed by the z row.
For example:
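typedef union
{
  struct {          // plain x/y/z view
    float x;
    float y;
    float z;
  };
  struct {          // packed view: one 64-bit xy pair plus one 32-bit z
    float2 xy;      // float2 is CUDA's built-in vector type (8-byte aligned)
    float  z;
  } block;
} bfloat3;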
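
One possible reading of the "2x64 + 1x32" block form, sketched on the host: each group of 32 elements is stored as 32 interleaved (x, y) pairs followed by 32 z values, so a lane can fetch its element with one 64-bit load and one 32-bit load and no permutation. The function name pack_blocks and the exact block layout are assumptions, not something the note spells out.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical packing of SoA arrays into 96-float blocks:
//   [x0 y0 x1 y1 ... x31 y31][z0 z1 ... z31]
// Assumes x, y and z have the same length and that it is a multiple of 32.
std::vector<float> pack_blocks(const std::vector<float>& x,
                               const std::vector<float>& y,
                               const std::vector<float>& z) {
    const std::size_t n = x.size();
    std::vector<float> out(3 * n);
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t base = (i / 32) * 96;  // start of this element's block
        const std::size_t lane = i % 32;         // position inside the block
        out[base + 2 * lane]     = x[i];         // xy pairs fill the first 64 floats
        out[base + 2 * lane + 1] = y[i];
        out[base + 64 + lane]    = z[i];         // z row fills the last 32 floats
    }
    return out;
}
```

With this layout, lane i of a warp would read the pair at float offset base + 2 * i and the z value at base + 64 + i, and both accesses are contiguous across the warp.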