First impressions after a night of perf & conformance testing

BCM2712 brief

Setting the expectations

BCM2712 is a 16nm part. That dictates its Power-Performance-Area (PPA) parameters & capabilities, as well as its pricing and availability.
CPU complex: 4x Cortex-A76 cores. Base factory clock 1.5GHz, peak factory clock 2.4GHz. Those clocks are dictated by the litho process and the form factor, i.e. the dissipation budget, and in both those categories rpi5 is a budget product. When passively cooled (via a heatsink) BCM2712 can run at full load for a dozen minutes before throttling down. The rpi5 stock active cooler (not tested) is reported to provide all-around unthrottled performance.
CPU L3/LLC: 2MB. That's not much, but it's ok. Advanced SoCs have 4MB; cutting-edge SoCs like the Apple M-series have 8-48MB. The important thing is that BCM2712 has an L3, and it shows.
CPU uarch notes: CA76 is a close sibling of Neoverse N1, the uarch of Graviton2 and Ampere Altra. A good deal of the arm-based cloud out there effectively shares its uarch with BCM2712. That means compilers can target this uarch adequately. Case in point: clang-13 codegen for CA76 is nice.
GPU complex: VideoCore VII, as exposed by the v3d driver stack, is not the most powerful mobile GPU out there. Its performance is in the ballpark of the budget PowerVR GX6250 found in MT8173. But v3d has a conformance advantage: solid GLES 3.1 and VK 1.2 out of the DRI/Mesa open-source box. As a result, one can visit shadertoy.com on the rpi5 and expect to see practically all workloads rendering properly, albeit slowly, without having to turn a single knob beyond installing the stock packages.
RAM: 32-bit LPDDR4X at 4267MT/s translates to a theoretical 17GB/s. In practice one can expect 10-12GB/s of RAM bandwidth from a single CPU core. RAM availability of 8GB is good to have: many parallel workloads aim for the 2GB/core rule of thumb; 1GB/core is still usable. Less than that is for niche tasks.

Figures of interest

Cache hierarchy performance: sequential write 76GB/s to L1D and 45GB/s to L2D (private per core), via Data Cache Zero by Virtual Address (dc zva); that's approx. one L1D cacheline cleared every two CPU cycles.
CA76 (armv8.2 ISA) asimd FMA asymptotic performance on dense matrices, per core: 2x simd FMA pipes x 4 lanes x 2 ops (multiply + add) = 16 flops/clock/core; measured performance on compiled C++ code: 15.33 flops/clock/core.

blu/rpi5_impressions.md


blu commented Feb 15, 2024
