@mbohun
Last active September 18, 2025 18:20
aarch64 big.LITTLE cpu-pinning

big.LITTLE cpu-pinning

ROCKPro64

  • ROCKPro64 by PINE64
    • Rockchip RK3399
  • big.LITTLE architecture:
    • Dual Cortex-A72 0xd08
    • Quad Cortex-A53 0xd03

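NOTE: "CPU part" identifies the core type: 0xd03 is the Cortex-A53 and 0xd08 is the Cortex-A72. In the listing below, CPUs 0-3 are therefore the A53 cores and CPUs 4-5 are the A72 cores.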

mbohun@rockpro64a:~$ cat /proc/cpuinfo | grep -E "processor|model name|CPU part"
processor       : 0
CPU part        : 0xd03
processor       : 1
CPU part        : 0xd03
processor       : 2
CPU part        : 0xd03
processor       : 3
CPU part        : 0xd03
processor       : 4
CPU part        : 0xd08
processor       : 5
CPU part        : 0xd08
mbohun@rockpro64a:~$
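The same classification can be done programmatically. The following is a minimal sketch, assuming the `/proc/cpuinfo` layout shown above ("processor" lines followed by "CPU part" lines); `parse_cpu_parts` and the two `PART_` constants are hypothetical names introduced for illustration, not part of any library.

```c
#include <stdlib.h>
#include <string.h>

/* Part IDs as they appear in the listing above. */
#define PART_CORTEX_A53 0xd03u
#define PART_CORTEX_A72 0xd08u

/*
 * Scan /proc/cpuinfo-style text and store the "CPU part" value of each
 * "processor" entry in parts[cpu]. Returns the highest CPU index seen
 * plus one. Entries for CPUs without a "CPU part" line are left
 * untouched, so the caller should zero-initialize parts[].
 * (Hypothetical helper, written for this example.)
 */
int parse_cpu_parts(const char *text, unsigned parts[], int max_cpus) {
    int cpu = -1, count = 0;
    const char *line = text;
    while (line && *line) {
        const char *colon = strchr(line, ':');
        const char *eol = strchr(line, '\n');
        /* Only look at lines that contain a ':' before the newline. */
        if (colon && (!eol || colon < eol)) {
            if (strncmp(line, "processor", 9) == 0) {
                cpu = (int)strtol(colon + 1, NULL, 10);
            } else if (strncmp(line, "CPU part", 8) == 0 &&
                       cpu >= 0 && cpu < max_cpus) {
                /* Base 16 accepts the leading "0x" prefix. */
                parts[cpu] = (unsigned)strtoul(colon + 1, NULL, 16);
                if (cpu + 1 > count) count = cpu + 1;
            }
        }
        line = eol ? eol + 1 : NULL;
    }
    return count;
}
```

To use it, read `/proc/cpuinfo` into a buffer (for example with `fread`), call `parse_cpu_parts`, and treat every index whose entry equals `PART_CORTEX_A72` as a big core.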

Quad Cortex-A53

mbohun@rockpro64a:~$ cpupower frequency-info
analyzing CPU 1:
driver: cpufreq-dt
CPUs which run at the same hardware frequency: 0 1 2 3
CPUs which need to have their frequency coordinated by software: 0 1 2 3
  maximum transition latency: 40.0 us
  hardware limits: 408 MHz - 1.42 GHz
  available frequency steps:  408 MHz, 600 MHz, 816 MHz, 1.01 GHz, 1.20 GHz, 1.42 GHz
  available cpufreq governors: performance schedutil
  current policy: frequency should be within 408 MHz and 1.42 GHz.
                  The governor "schedutil" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 1.01 GHz (asserted by call to kernel)
mbohun@rockpro64a:~$

Dual Cortex-A72

mbohun@rockpro64a:~$ cpupower --cpu 4 frequency-info
analyzing CPU 4:
driver: cpufreq-dt
  CPUs which run at the same hardware frequency: 4 5
  CPUs which need to have their frequency coordinated by software: 4 5
  maximum transition latency: 465 us
  hardware limits: 408 MHz - 1.80 GHz
  available frequency steps:  408 MHz, 600 MHz, 816 MHz, 1.01 GHz, 1.20 GHz, 1.42 GHz, 1.61 GHz, 1.80 GHz
  available cpufreq governors: performance schedutil
  current policy: frequency should be within 408 MHz and 1.80 GHz.
                  The governor "schedutil" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 816 MHz (asserted by call to kernel)
mbohun@rockpro64a:~$
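The hardware limits reported by `cpupower` are also exposed through the cpufreq sysfs interface, so a program can tell the clusters apart by their maximum frequency. A minimal sketch, assuming the standard `cpuinfo_max_freq` file is present for each CPU (values are in kHz); `parse_khz` and `read_cpu_max_khz` are hypothetical helper names.

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse a decimal kHz value as found in sysfs. Returns -1 on bad input. */
long parse_khz(const char *text) {
    char *end;
    long khz = strtol(text, &end, 10);
    return (end == text || khz <= 0) ? -1 : khz;
}

/*
 * Read the maximum frequency of one CPU in kHz.
 * Returns -1 if the CPU or its cpufreq directory does not exist.
 */
long read_cpu_max_khz(int cpu) {
    char path[128], buf[32];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/cpuinfo_max_freq", cpu);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    char *ok = fgets(buf, sizeof buf, f);
    fclose(f);
    return ok ? parse_khz(buf) : -1;
}
```

On the ROCKPro64, calling `read_cpu_max_khz` for CPUs 0 to 5 and picking the CPUs with the highest value should select the A72 cluster, since its hardware limit (1.80 GHz) is above that of the A53 cluster (1.42 GHz).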

Orange Pi 5 Pro

  • Orange Pi 5 Pro by Orange Pi
    • Rockchip RK3588S
  • big.LITTLE architecture:
    • quad-core A55 TODO: hex code
    • quad-core A76 TODO: hex code

mbohun commented Sep 17, 2025

2. Use SIMD (NEON) Intrinsics:
The A72 has a capable NEON SIMD unit that operates on 128-bit registers, so a single instruction can process, for example, four 32-bit floats. If your code involves heavy number crunching (image processing, linear algebra, audio/video encoding), NEON intrinsics can give a substantial speedup by processing multiple data points per instruction.

Example of a simple NEON intrinsic adding four floats at once:

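#include <arm_neon.h> // The key header for NEON

// Computes result[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const float* a, const float* b, float* result, int n) {
    int i = 0;
    // Vector loop: runs only while a full group of 4 floats remains
    for (; i + 4 <= n; i += 4) {
        // Load 4 floats from arrays a and b
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        // Add them together
        float32x4_t vresult = vaddq_f32(va, vb);
        // Store the result
        vst1q_f32(result + i, vresult);
    }
    // Scalar tail: handles the last n % 4 elements without reading past the end
    for (; i < n; i++) {
        result[i] = a[i] + b[i];
    }
}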

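With -O3 -mcpu=cortex-a72 the compiler can often auto-vectorize simple loops like this one on its own. Hand-written intrinsics are worth the effort mainly for loops the compiler fails to vectorize, or when you need exact control over the generated instructions.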
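For comparison, here is a plain C version of the same loop, written so that the auto-vectorizer has an easy job. This is a sketch, not a benchmark: `add_arrays_scalar` is a hypothetical name, and whether the loop is actually vectorized depends on the compiler and flags (GCC can report its decisions with `-fopt-info-vec`).

```c
#include <stddef.h>

/*
 * Plain C element-wise add. The restrict qualifiers promise that the
 * three arrays do not overlap, so the compiler does not need runtime
 * overlap checks before vectorizing the loop.
 */
void add_arrays_scalar(const float *restrict a, const float *restrict b,
                       float *restrict result, size_t n) {
    for (size_t i = 0; i < n; i++) {
        result[i] = a[i] + b[i];
    }
}
```

This version needs no tail handling and stays portable to non-ARM targets, which is a good reason to try it first and reach for intrinsics only if profiling shows the compiler's output is not good enough.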

3. Cache Awareness:
The A72 cluster has a larger cache system than the A53 cluster (on the RK3399, 1 MB of shared L2 versus 512 KB). Write cache-friendly code:

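  • Use the smallest data type that gives the precision you need, so more elements fit in each cache line.
  • Access memory in sequential, predictable patterns (avoid random jumping).
  • Lay out your data with the 64-byte cache line in mind: prefer Structures of Arrays (SoA) when a loop touches one field of many records, and Arrays of Structures (AoS) when all fields of a record are used together.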
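The layout point can be illustrated with a small sketch. It assumes a loop that reads only one field (`x`) out of four; the struct names, `MAX_POINTS`, and both functions are hypothetical and exist only for this example.

```c
#include <stddef.h>

#define MAX_POINTS 1024 /* hypothetical capacity, chosen for illustration */

/* AoS: one 16-byte record per point; four records share a 64-byte line. */
struct point_aos { float x, y, z, w; };

/* SoA: each field is contiguous; sixteen x values share a 64-byte line. */
struct points_soa {
    float x[MAX_POINTS], y[MAX_POINTS], z[MAX_POINTS], w[MAX_POINTS];
};

/* Sums x only: three quarters of every fetched line is unused y, z, w. */
float sum_x_aos(const struct point_aos *p, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) sum += p[i].x;
    return sum;
}

/* Sums x only: every byte fetched is used. Requires n <= MAX_POINTS. */
float sum_x_soa(const struct points_soa *p, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) sum += p->x[i];
    return sum;
}
```

Both functions return the same result; the difference is in how much memory traffic they generate. If the loop used all four fields of each point, the AoS layout would be the better fit.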

Summary and Recommended Workflow

  1. Profile: First, identify the hot spots in your code. Don't optimize blindly.
  2. Compile: Use the correct compiler flags (-O3 -mcpu=cortex-a72). This often gives the biggest gain for the least effort.
  3. Pin: Use taskset or sched_setaffinity to restrict your optimized application to the A72 cores (CPUs 4 and 5 on the ROCKPro64, e.g. taskset -c 4,5 ./app). This prevents the OS scheduler from migrating it to a slower A53 core.
  4. Optimize Further: For critical loops, consider using NEON intrinsics to leverage SIMD parallelism.
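The pinning step can also be done from inside the program with `sched_setaffinity`. A minimal sketch for Linux, assuming the A72 cores are CPUs 4 and 5 as in the ROCKPro64 listing above; `build_cpu_set` and `pin_to_cpus` are hypothetical helper names.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* needed for cpu_set_t, CPU_SET and sched_setaffinity */
#endif
#include <sched.h>
#include <stddef.h>

/* Build an affinity mask from a list of CPU numbers. */
void build_cpu_set(const int *cpus, size_t n, cpu_set_t *set) {
    CPU_ZERO(set);
    for (size_t i = 0; i < n; i++) {
        if (cpus[i] >= 0 && cpus[i] < CPU_SETSIZE) {
            CPU_SET(cpus[i], set);
        }
    }
}

/*
 * Pin the calling thread to the given CPUs (pid 0 means "this thread").
 * Returns 0 on success, -1 on error with errno set.
 */
int pin_to_cpus(const int *cpus, size_t n) {
    cpu_set_t set;
    build_cpu_set(cpus, n, &set);
    return sched_setaffinity(0, sizeof set, &set);
}
```

A typical use is to call `pin_to_cpus((const int[]){4, 5}, 2)` early in `main` and check the return value; threads created afterwards inherit the mask. The CPU numbers are board-specific, so they should come from `/proc/cpuinfo` or configuration rather than being hard-coded in portable code.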
