The __byte_perm
function in CUDA is a powerful intrinsic designed for byte-level manipulation of 32-bit integers, commonly used in GPU-accelerated applications like cryptographic operations, data formatting, and Bitcoin script construction. It allows developers to select and rearrange bytes from two 32-bit inputs, x
and y
, into a new 32-bit output using a selector s
. This guide provides a detailed explanation of how __byte_perm
works, with a focus on constructing selector values, and includes practical examples to clarify its usage and address common pitfalls, particularly in the context of your Bitcoin P2SH script construction.
__device__ unsigned int __byte_perm(unsigned int x, unsigned int y, unsigned int s);
- Parameters:
x
: A 32-bit unsigned integer providing the first four bytes of the source.y
: A 32-bit unsigned integer providing the next four bytes of the source.s
: A 32-bit selector that determines which bytes are chosen for the output.
- Return: A 32-bit unsigned integer composed of four bytes selected from the 8-byte source based on
s
.
The __byte_perm
function combines x
and y
into an 8-byte source, from which four bytes are selected to form the result. The selector s
specifies which bytes to pick and their order in the output. Here’s a step-by-step breakdown:
The 8-byte source is formed by concatenating x
and y
:
- Bytes 0–3: From
x
, where byte 0 is the least significant byte (LSB, bits 7:0) and byte 3 is the most significant byte (MSB, bits 31:24). - Bytes 4–7: From
y
, where byte 4 is the LSB (bits 7:0) and byte 7 is the MSB (bits 31:24).
For example, if x = 0x12345678
and y = 0x9ABCDEF0
, the source bytes are:
Index | Byte | Source |
---|---|---|
0 | 0x78 | x[7:0] |
1 | 0x56 | x[15:8] |
2 | 0x34 | x[23:16] |
3 | 0x12 | x[31:24] |
4 | 0xF0 | y[7:0] |
5 | 0xDE | y[15:8] |
6 | 0xBC | y[23:16] |
7 | 0x9A | y[31:24] |
The selector s
is a 32-bit integer, but only the lower 16 bits are used, divided into four 3-bit fields:
- Bits 2:0: Select the byte for result[7:0] (LSB).
- Bits 6:4: Select the byte for result[15:8].
- Bits 10:8: Select the byte for result[23:16].
- Bits 14:12: Select the byte for result[31:24] (MSB).
Each 3-bit field (values 0–7) specifies an index into the 8-byte source. The result is constructed as:
result[7:0] = source[(s >> 0) & 0x7]
result[15:8] = source[(s >> 4) & 0x7]
result[23:16] = source[(s >> 8) & 0x7]
result[31:24] = source[(s >> 12) & 0x7]
In your code, selectors are often written in binary with 3-bit groups separated by apostrophes (e.g., 0b011'001'000'100
). This represents a 12-bit or 16-bit value, where each group corresponds to a field:
- Leftmost group: Bits 14:12 (for result[31:24]).
- Next group: Bits 10:8 (for result[23:16]).
- Next group: Bits 6:4 (for result[15:8]).
- Rightmost group: Bits 2:0 (for result[7:0]).
For example, s = 0b011'001'000'100
translates to 0x3104
in hex, with fields 3, 1, 0, 4.
The result is a 32-bit integer stored in little-endian memory (e.g., [b0, b1, b2, b3]
). When printed with your logic:
printf("%02x %02x %02x %02x ", (word >> 24) & 0xFF, (word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF);
It displays bytes in big-endian order: MSB (bits 31:24) first, LSB (bits 7:0) last. For result = 0x76a914c5
, it prints 76 a9 14 c5
.
To achieve a specific output, follow these steps:
- Identify Desired Bytes: Determine which bytes from the source are needed for each position in the result.
- Assign Indices: Map each desired byte to its source index (0–7).
- Build Selector: Place each index in the appropriate 3-bit field of
s
:s = (index3 << 12) | (index2 << 8) | (index1 << 4) | index0
index0
for result[7:0],index1
for result[15:8],index2
for result[23:16],index3
for result[31:24].
- Write in Binary: Optionally, express
s
as a binary literal with 3-bit groups (e.g.,0b011'001'000'100
).
Below are examples to illustrate __byte_perm
usage, including cases from your code and additional scenarios to highlight common patterns and pitfalls.
Goal: Output all bytes of x = 0x12345678
.
- Inputs:
x = 0x12345678
(bytes[78, 56, 34, 12]
),y = 0x9ABCDEF0
(bytes[F0, DE, BC, 9A]
). - Source:
[78, 56, 34, 12, F0, DE, BC, 9A]
. - Desired Result:
[78, 56, 34, 12]
(hex0x12345678
). - Selectors: Choose bytes 0, 1, 2, 3.
result[7:0] = source[0] = 78
→ index 0.result[15:8] = source[1] = 56
→ index 1.result[23:16] = source[2] = 34
→ index 2.result[31:24] = source[3] = 12
→ index 3.
- Selector:
s = (3 << 12) | (2 << 8) | (1 << 4) | 0 = 0x3210
. - Binary:
0b0011'0010'0001'0000
. - Code:
unsigned int result = __byte_perm(0x12345678, 0x9ABCDEF0, 0x3210); // result = 0x12345678, prints 12 34 56 78
Goal: Output all bytes of y = 0x9ABCDEF0
.
- Inputs: Same as above.
- Desired Result:
[F0, DE, BC, 9A]
(hex0x9ABCDEF0
). - Selectors: Choose bytes 4, 5, 6, 7.
- Selector:
s = (7 << 12) | (6 << 8) | (5 << 4) | 4 = 0x7654
. - Binary:
0b0111'0110'0101'0100
. - Code:
unsigned int result = __byte_perm(0x12345678, 0x9ABCDEF0, 0x7654); // result = 0x9ABCDEF0, prints 9A BC DE F0
Goal: Output 76 a9 14 c5
with x = 0x7600a914
, y = h[0] = 0x069f34c5
.
- Inputs:
x = 0x7600a914
→ bytes[14, a9, 00, 76]
.y = 0x069f34c5
→ bytes[c5, 34, 9f, 06]
.- Source:
[14, a9, 00, 76, c5, 34, 9f, 06]
.
- Desired Result:
[c5, 14, a9, 76]
(hex0x76a914c5
, prints76 a9 14 c5
). - Selectors:
result[7:0] = c5 = source[4]
→ index 4.result[15:8] = 14 = source[0]
→ index 0.result[23:16] = a9 = source[1]
→ index 1.result[31:24] = 76 = source[3]
→ index 3.
- Selector:
s = (3 << 12) | (1 << 8) | (0 << 4) | 4 = 0x3104
. - Binary:
0b0011'0001'0000'0100
or0b011'001'000'100
. - Code:
unsigned int result = __byte_perm(0x7600a914, 0x069f34c5, 0x3104); // result = 0x76a914c5, prints 76 a9 14 c5
Goal: Reproduce your working example with x = 0x0014a976
.
- Inputs:
x = 0x0014a976
→ bytes[76, a9, 14, 00]
.y = 0x069f34c5
→ bytes[c5, 34, 9f, 06]
.- Source:
[76, a9, 14, 00, c5, 34, 9f, 06]
.
- Desired Result:
[c5, 14, a9, 76]
(prints76 a9 14 c5
). - Selector:
s = 0b0000'0001'0010'0100 = 0x0124
.- Bits 2:0:
100
= 4 → source[4] =c5
. - Bits 6:4:
010
= 2 → source[2] =14
. - Bits 10:8:
001
= 1 → source[1] =a9
. - Bits 14:12:
000
= 0 → source[0] =76
.
- Bits 2:0:
- Code:
unsigned int result = __byte_perm(0x0014a976, 0x069f34c5, 0x0124); // result = 0x76a914c5, prints 76 a9 14 c5
Goal: Understand why s = 0b011'001'000'111
failed.
- Inputs: Same as Example 3.
- Selector:
s = 0b011'001'000'111 = 0x3107
.- Bits 2:0:
111
= 7 → source[7] =06
. - Bits 6:4:
000
= 0 → source[0] =14
. - Bits 10:8:
001
= 1 → source[1] =a9
. - Bits 14:12:
011
= 3 → source[3] =76
.
- Bits 2:0:
- Result:
[06, 14, a9, 76]
(hex0x76a91406
, prints76 a9 14 06
). - Issue: You reported
14 9f c5 06
, suggesting a possible mismatch inh[0]
(e.g.,h[0] = 0x14c59f06
) or a printing error. The selector picked byte 7 (06
) instead of byte 4 (c5
).
In your code, __byte_perm
is used to construct a Bitcoin P2SH-P2WPKH script, where scriptBytes[0]
needs to start with specific opcodes (76 a9 14
) followed by a hash byte (c5
). The working example:
scriptBytes[0] = __byte_perm(h[0], 0x14, 0x5401);
Produces 00 14 c5 34
, showing how __byte_perm
interleaves bytes from h[0]
and a constant. For your case, adjusting the selector to 0x3104
with x = 0x7600a914
aligns the bytes correctly.
- Incorrect Selector Fields:
- Problem: Using wrong indices (e.g.,
0x3107
selecting byte 7 instead of 4). - Solution: Double-check the source byte indices and align with desired output positions.
- Problem: Using wrong indices (e.g.,
- Byte Order Confusion:
- Problem: Misinterpreting
x
ory
byte positions (e.g., assuming big-endian). - Solution: Always list bytes as LSB to MSB for
x
andy
.
- Problem: Misinterpreting
- Printing Logic:
- Problem: Printing in big-endian order can confuse memory order expectations.
- Solution: Verify result in memory (little-endian) matches print order (big-endian).
- Hash Byte Misalignment:
- Problem: Assuming
h[0]
bytes are in a different order. - Solution: Confirm
h[0]
’s byte order (e.g.,0x069f34c5
→[c5, 34, 9f, 06]
).
- Problem: Assuming
- Use Hex for Selectors: Hex values (e.g.,
0x3104
) are less error-prone than binary literals. - Verify Byte Order: Always confirm whether the desired output refers to memory layout or print order. For big-endian printing, reverse the byte order in memory.
- Test with Known Values: Use simple inputs to validate selector logic.
- Document Selectors: Comment the intended byte selections (e.g.,
// selects [76, a9, 14, c5]
). - Test Selectors Incrementally: Use temporary print statements to verify each
__byte_perm
output. - Account for Endianness: When working with little-endian
uint32_t
inputs and big-endian printing, calculate selectors based on the memory layout needed for the print order.
The __byte_perm
function is a versatile tool for byte manipulation in CUDA, but its selector mechanism requires careful construction. By understanding the 8-byte source and 3-bit field selectors, you can achieve precise byte arrangements. The examples provided, especially tailored to your Bitcoin script use case, should help you avoid common errors and use __byte_perm
effectively in your CUDA programs.