Skip to content

Instantly share code, notes, and snippets.

@kenorb
Last active June 1, 2025 20:07
Show Gist options
  • Save kenorb/93ec021081151c173286b0cb773151f8 to your computer and use it in GitHub Desktop.
Save kenorb/93ec021081151c173286b0cb773151f8 to your computer and use it in GitHub Desktop.
Comprehensive Learning Guide on CUDA's `__byte_perm` function

Comprehensive Learning Guide on CUDA's __byte_perm Function

Introduction

The __byte_perm function in CUDA is a powerful intrinsic designed for byte-level manipulation of 32-bit integers, commonly used in GPU-accelerated applications like cryptographic operations, data formatting, and Bitcoin script construction. It allows developers to select and rearrange bytes from two 32-bit inputs, x and y, into a new 32-bit output using a selector s. This guide provides a detailed explanation of how __byte_perm works, with a focus on constructing selector values, and includes practical examples to clarify its usage and address common pitfalls, particularly in the context of your Bitcoin P2SH script construction.

Function Signature

__device__ unsigned int __byte_perm(unsigned int x, unsigned int y, unsigned int s);
  • Parameters:
    • x: A 32-bit unsigned integer providing the first four bytes of the source.
    • y: A 32-bit unsigned integer providing the next four bytes of the source.
    • s: A 32-bit selector that determines which bytes are chosen for the output.
  • Return: A 32-bit unsigned integer composed of four bytes selected from the 8-byte source based on s.

How __byte_perm Works

The __byte_perm function combines x and y into an 8-byte source, from which four bytes are selected to form the result. The selector s specifies which bytes to pick and their order in the output. Here’s a step-by-step breakdown:

Source Byte Formation

The 8-byte source is formed by concatenating x and y:

  • Bytes 0–3: From x, where byte 0 is the least significant byte (LSB, bits 7:0) and byte 3 is the most significant byte (MSB, bits 31:24).
  • Bytes 4–7: From y, where byte 4 is the LSB (bits 7:0) and byte 7 is the MSB (bits 31:24).

For example, if x = 0x12345678 and y = 0x9ABCDEF0, the source bytes are:

Index Byte Source
0 0x78 x[7:0]
1 0x56 x[15:8]
2 0x34 x[23:16]
3 0x12 x[31:24]
4 0xF0 y[7:0]
5 0xDE y[15:8]
6 0xBC y[23:16]
7 0x9A y[31:24]

Selector Mechanics

The selector s is a 32-bit integer, but only the lower 16 bits are used, divided into four 3-bit fields:

  • Bits 2:0: Select the byte for result[7:0] (LSB).
  • Bits 6:4: Select the byte for result[15:8].
  • Bits 10:8: Select the byte for result[23:16].
  • Bits 14:12: Select the byte for result[31:24] (MSB).

Each 3-bit field (values 0–7) specifies an index into the 8-byte source. The result is constructed as:

  • result[7:0] = source[(s >> 0) & 0x7]
  • result[15:8] = source[(s >> 4) & 0x7]
  • result[23:16] = source[(s >> 8) & 0x7]
  • result[31:24] = source[(s >> 12) & 0x7]

Binary Notation for Selectors

In your code, selectors are often written in binary with 3-bit groups separated by apostrophes (e.g., 0b011'001'000'100). This represents a 12-bit or 16-bit value, where each group corresponds to a field:

  • Leftmost group: Bits 14:12 (for result[31:24]).
  • Next group: Bits 10:8 (for result[23:16]).
  • Next group: Bits 6:4 (for result[15:8]).
  • Rightmost group: Bits 2:0 (for result[7:0]).

For example, s = 0b011'001'000'100 translates to 0x3104 in hex, with fields 3, 1, 0, 4.

Output and Printing

The result is a 32-bit integer stored in little-endian memory (e.g., [b0, b1, b2, b3]). When printed with your logic:

printf("%02x %02x %02x %02x ", (word >> 24) & 0xFF, (word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF);

It displays bytes in big-endian order: MSB (bits 31:24) first, LSB (bits 7:0) last. For result = 0x76a914c5, it prints 76 a9 14 c5.

Constructing the Selector

To achieve a specific output, follow these steps:

  1. Identify Desired Bytes: Determine which bytes from the source are needed for each position in the result.
  2. Assign Indices: Map each desired byte to its source index (0–7).
  3. Build Selector: Place each index in the appropriate 3-bit field of s:
    • s = (index3 << 12) | (index2 << 8) | (index1 << 4) | index0
    • index0 for result[7:0], index1 for result[15:8], index2 for result[23:16], index3 for result[31:24].
  4. Write in Binary: Optionally, express s as a binary literal with 3-bit groups (e.g., 0b011'001'000'100).

Examples

Below are examples to illustrate __byte_perm usage, including cases from your code and additional scenarios to highlight common patterns and pitfalls.

Example 1: Selecting All Bytes from x

Goal: Output all bytes of x = 0x12345678.

  • Inputs: x = 0x12345678 (bytes [78, 56, 34, 12]), y = 0x9ABCDEF0 (bytes [F0, DE, BC, 9A]).
  • Source: [78, 56, 34, 12, F0, DE, BC, 9A].
  • Desired Result: [78, 56, 34, 12] (hex 0x12345678).
  • Selectors: Choose bytes 0, 1, 2, 3.
    • result[7:0] = source[0] = 78 → index 0.
    • result[15:8] = source[1] = 56 → index 1.
    • result[23:16] = source[2] = 34 → index 2.
    • result[31:24] = source[3] = 12 → index 3.
  • Selector: s = (3 << 12) | (2 << 8) | (1 << 4) | 0 = 0x3210.
  • Binary: 0b0011'0010'0001'0000.
  • Code:
    unsigned int result = __byte_perm(0x12345678, 0x9ABCDEF0, 0x3210);
    // result = 0x12345678, prints 12 34 56 78

Example 2: Selecting All Bytes from y

Goal: Output all bytes of y = 0x9ABCDEF0.

  • Inputs: Same as above.
  • Desired Result: [F0, DE, BC, 9A] (hex 0x9ABCDEF0).
  • Selectors: Choose bytes 4, 5, 6, 7.
  • Selector: s = (7 << 12) | (6 << 8) | (5 << 4) | 4 = 0x7654.
  • Binary: 0b0111'0110'0101'0100.
  • Code:
    unsigned int result = __byte_perm(0x12345678, 0x9ABCDEF0, 0x7654);
    // result = 0x9ABCDEF0, prints 9A BC DE F0

Example 3: Your Desired Output

Goal: Output 76 a9 14 c5 with x = 0x7600a914, y = h[0] = 0x069f34c5.

  • Inputs:
    • x = 0x7600a914 → bytes [14, a9, 00, 76].
    • y = 0x069f34c5 → bytes [c5, 34, 9f, 06].
    • Source: [14, a9, 00, 76, c5, 34, 9f, 06].
  • Desired Result: [c5, 14, a9, 76] (hex 0x76a914c5, prints 76 a9 14 c5).
  • Selectors:
    • result[7:0] = c5 = source[4] → index 4.
    • result[15:8] = 14 = source[0] → index 0.
    • result[23:16] = a9 = source[1] → index 1.
    • result[31:24] = 76 = source[3] → index 3.
  • Selector: s = (3 << 12) | (1 << 8) | (0 << 4) | 4 = 0x3104.
  • Binary: 0b0011'0001'0000'0100 or 0b011'001'000'100.
  • Code:
    unsigned int result = __byte_perm(0x7600a914, 0x069f34c5, 0x3104);
    // result = 0x76a914c5, prints 76 a9 14 c5

Example 4: Your Working Case

Goal: Reproduce your working example with x = 0x0014a976.

  • Inputs:
    • x = 0x0014a976 → bytes [76, a9, 14, 00].
    • y = 0x069f34c5 → bytes [c5, 34, 9f, 06].
    • Source: [76, a9, 14, 00, c5, 34, 9f, 06].
  • Desired Result: [c5, 14, a9, 76] (prints 76 a9 14 c5).
  • Selector: s = 0b0000'0001'0010'0100 = 0x0124.
    • Bits 2:0: 100 = 4 → source[4] = c5.
    • Bits 6:4: 010 = 2 → source[2] = 14.
    • Bits 10:8: 001 = 1 → source[1] = a9.
    • Bits 14:12: 000 = 0 → source[0] = 76.
  • Code:
    unsigned int result = __byte_perm(0x0014a976, 0x069f34c5, 0x0124);
    // result = 0x76a914c5, prints 76 a9 14 c5

Example 5: Common Pitfall (Incorrect Selector)

Goal: Understand why s = 0b011'001'000'111 failed.

  • Inputs: Same as Example 3.
  • Selector: s = 0b011'001'000'111 = 0x3107.
    • Bits 2:0: 111 = 7 → source[7] = 06.
    • Bits 6:4: 000 = 0 → source[0] = 14.
    • Bits 10:8: 001 = 1 → source[1] = a9.
    • Bits 14:12: 011 = 3 → source[3] = 76.
  • Result: [06, 14, a9, 76] (hex 0x76a91406, prints 76 a9 14 06).
  • Issue: You reported 14 9f c5 06, suggesting a possible mismatch in h[0] (e.g., h[0] = 0x14c59f06) or a printing error. The selector picked byte 7 (06) instead of byte 4 (c5).

Practical Application: Bitcoin Script Construction

In your code, __byte_perm is used to construct a Bitcoin P2SH-P2WPKH script, where scriptBytes[0] needs to start with specific opcodes (76 a9 14) followed by a hash byte (c5). The working example:

scriptBytes[0] = __byte_perm(h[0], 0x14, 0x5401);

Produces 00 14 c5 34, showing how __byte_perm interleaves bytes from h[0] and a constant. For your case, adjusting the selector to 0x3104 with x = 0x7600a914 aligns the bytes correctly.

Common Pitfalls and Solutions

  1. Incorrect Selector Fields:
    • Problem: Using wrong indices (e.g., 0x3107 selecting byte 7 instead of 4).
    • Solution: Double-check the source byte indices and align with desired output positions.
  2. Byte Order Confusion:
    • Problem: Misinterpreting x or y byte positions (e.g., assuming big-endian).
    • Solution: Always list bytes as LSB to MSB for x and y.
  3. Printing Logic:
    • Problem: Printing in big-endian order can confuse memory order expectations.
    • Solution: Verify result in memory (little-endian) matches print order (big-endian).
  4. Hash Byte Misalignment:
    • Problem: Assuming h[0] bytes are in a different order.
    • Solution: Confirm h[0]’s byte order (e.g., 0x069f34c5[c5, 34, 9f, 06]).

Best Practices

  • Use Hex for Selectors: Hex values (e.g., 0x3104) are less error-prone than binary literals.
  • Verify Byte Order: Always confirm whether the desired output refers to memory layout or print order. For big-endian printing, reverse the byte order in memory.
  • Test with Known Values: Use simple inputs to validate selector logic.
  • Document Selectors: Comment the intended byte selections (e.g., // selects [76, a9, 14, c5]).
  • Test Selectors Incrementally: Use temporary print statements to verify each __byte_perm output.
  • Account for Endianness: When working with little-endian uint32_t inputs and big-endian printing, calculate selectors based on the memory layout needed for the print order.

Conclusion

The __byte_perm function is a versatile tool for byte manipulation in CUDA, but its selector mechanism requires careful construction. By understanding the 8-byte source and 3-bit field selectors, you can achieve precise byte arrangements. The examples provided, especially tailored to your Bitcoin script use case, should help you avoid common errors and use __byte_perm effectively in your CUDA programs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment