allanmac/7976450
Last active June 30, 2018 16:12
A strategy for converting a float3 SoA into AoS without using shared memory.
===============================================================================================
Load three arrays (x, y and z) in SoA order, repack them and store them in AoS order.
Strategy: each warp permutes its load lane with:
  (rowNum + (laneId() * 3)) & 31
where rowNum is 0, 1 or 2 for the x, y or z row. Element i of a row has AoS index
rowNum + 3*i, so it belongs in lane (rowNum + 3*i) & 31.
This converts SoA into AoS, but with x/y/z staggered across the three rows of registers.
===============================================================================================
Entries are AoS indices; columns are lanes 0..31, rows are the x, y and z registers.
0-31:
 0  -  -  3  -  -  6  -  -  9  -  - 12  -  - 15  -  - 18  -  - 21  -  - 24  -  - 27  -  - 30  -
 -  1  -  -  4  -  -  7  -  - 10  -  - 13  -  - 16  -  - 19  -  - 22  -  - 25  -  - 28  -  - 31
 -  -  2  -  -  5  -  -  8  -  - 11  -  - 14  -  - 17  -  - 20  -  - 23  -  - 26  -  - 29  -  -
0-63:
 0 33  -  3 36  -  6 39  -  9 42  - 12 45  - 15 48  - 18 51  - 21 54  - 24 57  - 27 60  - 30 63
 -  1 34  -  4 37  -  7 40  - 10 43  - 13 46  - 16 49  - 19 52  - 22 55  - 25 58  - 28 61  - 31
32  -  2 35  -  5 38  -  8 41  - 11 44  - 14 47  - 17 50  - 20 53  - 23 56  - 26 59  - 29 62  -
0-95:
 0 33 66  3 36 69  6 39 72  9 42 75 12 45 78 15 48 81 18 51 84 21 54 87 24 57 90 27 60 93 30 63
64  1 34 67  4 37 70  7 40 73 10 43 76 13 46 79 16 49 82 19 52 85 22 55 88 25 58 91 28 61 94 31
32 65  2 35 68  5 38 71  8 41 74 11 44 77 14 47 80 17 50 83 20 53 86 23 56 89 26 59 92 29 62 95
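The lane mapping can be checked on the host. What follows is a minimal C++ sketch, not the gist's kernel: `destLane`, `staggeredLayout` and `Layout` are hypothetical names, and the 32 lanes of a warp are modeled as array columns. It fills a 3 x 32 table with the AoS index each lane would hold after the permuted load, which should reproduce the last diagram.

```cpp
#include <array>
#include <cassert>

constexpr int kWarpSize = 32;  // lanes per warp
constexpr int kRows = 3;       // row 0 = x, row 1 = y, row 2 = z

using Layout = std::array<std::array<int, kWarpSize>, kRows>;

// Lane that should hold element `elem` of SoA row `rowNum`.
constexpr int destLane(int rowNum, int elem) {
    return (rowNum + elem * 3) & (kWarpSize - 1);
}

// Builds the staggered table: layout[row][lane] is the AoS index (0..95)
// of the value that lane holds in that register row.
Layout staggeredLayout() {
    Layout layout;
    for (auto& row : layout) row.fill(-1);  // -1 marks "not written yet"

    for (int row = 0; row < kRows; ++row)
        for (int elem = 0; elem < kWarpSize; ++elem)
            layout[row][destLane(row, elem)] = row + elem * 3;

    // 3 is odd, so elem -> destLane is one-to-one and every slot is filled.
    for (const auto& row : layout)
        for (int idx : row) assert(idx >= 0);
    return layout;
}
```

For instance, `staggeredLayout()[0][1]` is 33, matching the second entry of the first row in the final diagram.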
===============================================================================================
Permutation vector for each lane, chosen by laneId() % 3
(the entry in row i is the register that belongs in output row i):
mod3=0  mod3=1  mod3=2
------  ------  ------
  0       1       2
  2       0       1
  1       2       0
This is one exchange per lane, where laneIsModN means laneId() % 3 == N:
  if (laneIsMod0) xchg(r1,r2);
  if (laneIsMod1) xchg(r0,r1);
  if (laneIsMod2) xchg(r0,r2);
At this point the 3 x 32 float rows are in float3 order.
Alternatively, skip the exchange and use two SELP ops (p ? a : b) to select which register to store out to device or host memory.
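The whole repack can be simulated on the host to confirm that the exchange really yields float3 order. This is a hedged C++ sketch, not the original kernel: `soaToAos`, `srcReg`, `selectForRow` and `Row` are hypothetical names, registers are modeled as a 3 x 32 array, and the staggered placement that a GPU would do with permuted loads is done here with plain indexing. The select variant is written as two ternaries; whether a compiler turns those into SELP instructions is not guaranteed.

```cpp
#include <array>
#include <cassert>
#include <utility>

constexpr int kLanes = 32;
using Row = std::array<float, kLanes>;

// Source register (0..2) that belongs in output row `outRow` for `lane`.
// Encodes the table: mod3=0 -> {0,2,1}, mod3=1 -> {1,0,2}, mod3=2 -> {2,1,0}.
int srcReg(int outRow, int lane) {
    return (2 * outRow + lane % 3) % 3;
}

// Select variant: two (p ? a : b) selects pick the register for one output row.
float selectForRow(int outRow, int lane, float r0, float r1, float r2) {
    const int s = srcReg(outRow, lane);
    const float t = (s == 1) ? r1 : r0;
    return (s == 2) ? r2 : t;
}

// Exchange variant: returns the 96 floats in store order (row 0, row 1, row 2).
std::array<float, 3 * kLanes> soaToAos(const Row& x, const Row& y, const Row& z) {
    const Row* src[3] = {&x, &y, &z};
    float r[3][kLanes] = {};  // r[row][lane] models register `row` of `lane`

    // Step 1: staggered placement, element i of a row lands in lane (row + 3*i) & 31.
    for (int row = 0; row < 3; ++row)
        for (int i = 0; i < kLanes; ++i)
            r[row][(row + i * 3) & (kLanes - 1)] = (*src[row])[i];

    // Step 2: one exchange per lane, chosen by lane % 3.
    for (int lane = 0; lane < kLanes; ++lane) {
        const float a0 = r[0][lane], a1 = r[1][lane], a2 = r[2][lane];
        switch (lane % 3) {
            case 0: std::swap(r[1][lane], r[2][lane]); break;
            case 1: std::swap(r[0][lane], r[1][lane]); break;
            case 2: std::swap(r[0][lane], r[2][lane]); break;
        }
        // The select variant must pick the same value for every output row.
        for (int k = 0; k < 3; ++k)
            assert(r[k][lane] == selectForRow(k, lane, a0, a1, a2));
    }

    // Step 3: store each register row contiguously.
    std::array<float, 3 * kLanes> out{};
    for (int row = 0; row < 3; ++row)
        for (int lane = 0; lane < kLanes; ++lane)
            out[row * kLanes + lane] = r[row][lane];
    return out;
}
```

Calling `soaToAos(x, y, z)` on three 32-element rows should give `out[3*i] == x[i]`, `out[3*i + 1] == y[i]` and `out[3*i + 2] == z[i]` for every `i`.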
===============================================================================================
If there is no need to expose the float3, you can simplify any future
load/store by packing a 3x32 float "block" into a 2x64 + 1x32 form.
If this is acceptable, then another variant of the above permutation
strategy would only permute the x/y rows and leave z intact.
All of this could have been avoided if the original source of the SoA
arrays had interleaved the x/y rows, followed by the z row.
For example:
typedef union
{
  struct {          // plain x/y/z view
    float x;
    float y;
    float z;
  };
  struct {          // packed view: x/y pair followed by z
    float2 xy;
    float  z;
  } block;
} bfloat3;
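One possible reading of the packed block is 32 interleaved x/y pairs followed by a row of 32 z values. The source does not spell out the exact memory layout, so the sketch below is an assumption: `Float2` is a host stand-in for CUDA's `float2`, and `Float3`, `Block32`, `packBlock` and `element` are hypothetical names used only for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kBlock = 32;  // one warp's worth of elements

struct Float2 { float x, y; };      // assumed stand-in for CUDA's float2
struct Float3 { float x, y, z; };   // assumed stand-in for CUDA's float3

// Assumed block layout: x/y interleaved, then the z row left intact.
struct Block32 {
    std::array<Float2, kBlock> xy;
    std::array<float, kBlock> z;
};

// Packs three SoA rows into one block; only x and y get interleaved.
Block32 packBlock(const std::array<float, kBlock>& x,
                  const std::array<float, kBlock>& y,
                  const std::array<float, kBlock>& z) {
    Block32 b{};
    for (std::size_t i = 0; i < kBlock; ++i) {
        b.xy[i] = Float2{x[i], y[i]};
        b.z[i] = z[i];
    }
    return b;
}

// Reassembles element i as a float3-like value when one is needed.
Float3 element(const Block32& b, std::size_t i) {
    assert(i < kBlock);
    return Float3{b.xy[i].x, b.xy[i].y, b.z[i]};
}
```

With this layout a consumer reads the xy part as one 64-bit value per element and the z part as one 32-bit value per element, so no three-way repack is needed.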