During the January 13th, 2023 HPC Huddle (now hosted by hpc.social), the topic of #HPC development and workloads on Apple silicon came up briefly.
Thinking on it, once #Asahi Linux has GPU compute support squared away, I can see a world where devices like the Mac Studio with M1 Ultra are augmented by Thunderbolt 4 networking cards. Even if it is for PR, vendors like Oracle, amongst others, have demonstrated a willingness to build weird and wonderful clusters as a “because we can.” It is far from ideal, but we have done worse to get less. Beyond Oracle and the Pi cluster, the US DOD/Air Force ran a PS3 cluster for years. https://phys.org/news/2010-12-air-playstation-3s-supercomputer.html
A few baselines before I go on:
- We are going to go through this thought experiment from the point of view of a small laboratory/bootstrap cluster that can only use a single 48U, 42” deep rack.
- The current state of Metal is such that interop with other languages/frameworks is not happening without major work, so macOS is out and Linux is in.
- I assume GPU compute on M1 is working, and that the Neural Engine is also usable.
- For networking, link aggregation is necessary. Using all 6 Thunderbolt 4 controllers, adapted to 50 Gb/s SFP56, gets us to “acceptable” throughput per machine. Because of limitations around Thunderbolt 4, each NIC is limited to ~32 Gb/s (see the aggregate-bandwidth sketch after this list).
- Thunderbolt 4 is working on the Mac under Linux, and the chosen NIC is compatible.
- You use the built-in 10GBASE-T Ethernet port as a BMC of sorts.
- The M1 has both E-cores (Icestorm) and P-cores (Firestorm). The E-cores are basically unusable without messing with the scheduler, so for the sake of simplifying the setup we will treat them as auxiliary cores, the same way Fugaku's A64FX has a secondary cluster of CPU cores meant to deal with IO, scheduling, etc.
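A minimal bandwidth sketch under those networking assumptions (6 Thunderbolt 4 ports per Mac, ~32 Gb/s usable each):

```python
# Aggregate per-node network throughput, assuming 6 Thunderbolt 4 ports per Mac Studio,
# each limited to ~32 Gb/s of usable bandwidth behind a 50 Gb/s SFP56 NIC.
TB4_PORTS_PER_MAC = 6       # 4 rear + 2 front on the M1 Ultra Studio
USABLE_GBPS_PER_PORT = 32   # Thunderbolt 4 tunnelling limit, not the NIC's 50 Gb/s line rate

aggregate_gbps = TB4_PORTS_PER_MAC * USABLE_GBPS_PER_PORT
print(f"~{aggregate_gbps} Gb/s (~{aggregate_gbps / 8:.0f} GB/s) aggregate per machine")
```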
If set up horizontally, you can fit 2 Mac Studios side by side in 3U, and with a row of Macs every 8 inches of depth that comes to 10 Macs per 3U. Better is mounting them vertically, allowing 4 Macs per row in 5U; pushed all the way to the back, we are at 20 Macs per 5U.
Since we need access to power and the Thunderbolt ports on the front and the back, let us build in 1U below for the 2 front Thunderbolt ports, and 3U above for power and the 4 rear Thunderbolt ports.
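A quick sketch of that shelf math, assuming the 8-inch row pitch in a 42-inch-deep rack:

```python
# Shelf density for a 42-inch-deep rack with a row of Macs every 8 inches of depth.
RACK_DEPTH_IN = 42
ROW_PITCH_IN = 8

rows = RACK_DEPTH_IN // ROW_PITCH_IN   # 5 rows front to back
horizontal = rows * 2                  # 2 Macs side by side, lying flat -> 10 per 3U
vertical = rows * 4                    # 4 Macs per row, standing on edge -> 20 per 5U
print(f"{rows} rows: {horizontal} Macs per 3U horizontal, {vertical} Macs per 5U vertical")
```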
With all of these network interfaces, a 1U, 40-port QSFP56 switch seems right. Each port breaks out to 4× SFP56.
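A port-budget check for one pod of 20 Macs against a single 40-port switch, using the figures above:

```python
# SFP56 link budget for one pod: 20 Macs x 6 links each, fed by one 40-port QSFP56
# switch where every QSFP56 port breaks out to 4x SFP56.
MACS_PER_POD = 20
LINKS_PER_MAC = 6
QSFP56_PORTS = 40
BREAKOUT = 4

needed = MACS_PER_POD * LINKS_PER_MAC    # 120 SFP56 endpoints
available = QSFP56_PORTS * BREAKOUT      # 160 SFP56 breakout lanes
print(f"{needed} links needed, {available} available, {available - needed} to spare")
```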
Unfortunately, the market has not seen fit to develop SFP56-to-Thunderbolt 4 adapters (I wonder why 😉). The best workaround is a PCIe NIC in a Thunderbolt 4-to-PCIe adapter. It is very expensive and involves disassembling 360 Thunderbolt 4 housings; hope you have strong hands and a charged screwdriver. These, once stripped, will fit into the extra 4U we allowed for in the Mac Studio enclosure.
Outside of power, we now have all the components needed, and they fit in 10U. We will call it the “Apple Pod.”
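Stacking up the rack units, a pod comes out like this:

```python
# Rack-unit budget for one "Apple Pod", as laid out above.
pod_units = {
    "front Thunderbolt access": 1,   # 1U below the shelf for the 2 front ports
    "Mac Studio shelf": 5,           # 20 Macs mounted vertically
    "power + rear Thunderbolt": 3,   # 3U above for power and the 4 rear ports
    "QSFP56 switch": 1,
}
print(f"Pod height: {sum(pod_units.values())}U")  # 10U -> three pods use 30U of the 48U rack
```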
Since we have 48U, the assumption is 3 “Apple Pods”, leaving 18U for power and other needs. First, we need to figure out performance and power requirements, which leads nicely to:
CFD testing (NASA's USM3D) puts each Mac at ~180 GFLOPS using the CPU performance cores.
GPU throughput is ~20 FP32 TFLOPS per Mac, roughly twice that in FP16.
Assuming the Neural Engine receives support, ~40 TFLOPS of FP16 per Mac.
128 GB of unified memory per machine, shared between the CPU and GPU.
For our 3-Apple-Pod cluster:
- 1.2 PFLOPS of FP32
- 4.8 PFLOPS of FP16
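Those totals fall straight out of the per-Mac figures above; a hedged tally (the GPU and Neural Engine numbers assume Linux support that does not exist yet):

```python
# Cluster throughput from the per-Mac estimates above (all of them speculative under Linux).
MACS = 60
CPU_TFLOPS = 0.18          # ~180 GFLOPS, USM3D CFD on the performance cores
GPU_FP32_TFLOPS = 20
GPU_FP16_TFLOPS = 40       # roughly 2x FP32
ANE_FP16_TFLOPS = 40       # Neural Engine, if it ever becomes usable from Linux

print(f"CPU:  {MACS * CPU_TFLOPS:.1f} TFLOPS")                                   # ~10.8
print(f"FP32: {MACS * GPU_FP32_TFLOPS / 1000:.1f} PFLOPS")                       # 1.2
print(f"FP16: {MACS * (GPU_FP16_TFLOPS + ANE_FP16_TFLOPS) / 1000:.1f} PFLOPS")   # 4.8
```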
Between the 60 Macs, 360 sets of adapters, and 3 switches, I calculated ~24.3 kW.
To deal with that, I chose Eaton 9PX6K UPSs. Each is 5.4 kW/6 kVA in 3U, and five of them leave us a ~10% power buffer.
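The UPS math works out as follows (five units, per the component list below):

```python
# Power budget: estimated load vs. five Eaton 9PX6K UPS units.
ESTIMATED_LOAD_KW = 24.3    # 60 Macs + 360 NIC/adapter sets + 3 switches
UPS_KW = 5.4                # 6 kVA / 5.4 kW per unit, 3U each
UPS_COUNT = 5

capacity_kw = UPS_COUNT * UPS_KW                  # 27 kW
headroom = 1 - ESTIMATED_LOAD_KW / capacity_kw    # ~10%
print(f"{capacity_kw:.0f} kW of capacity, {headroom:.0%} headroom, {UPS_COUNT * 3}U of UPSs")
```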
Finally, the remaining 3U is for PDUs to split out power to each Apple Pod.
Component selection:
- Maxed-out Mac Studios (M1 Ultra) * 60
- MQM8700-HS2F 40-port 200G QSFP56 switches * 3
- QSFP56-to-SFP56 breakout direct attach copper cables * 15
- Sonnet Thunderbolt 4-to-PCIe adapters * 360
- Mellanox ConnectX-6 PCIe SFP56 network interfaces * 360
- Custom Mac Studio rack shelf enclosures * 3
- Eaton 9PX6K UPSs * 5
A few assumptions, largely in line with the market:
- 10% off Apple EDU pricing, each Mac then costing $7,200.
- You can buy at near street price.
- You are not adding 60 PiKVMs.
- You buy +1 of everything for hot swap.
- Total taxes are 10%.
Total after taxes: $1.12M
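As a rough sanity check on that figure, here is the Mac portion alone under the assumptions above (the networking, adapter, and enclosure line items live in the linked spreadsheet):

```python
# Mac portion of the budget only; all other line items are in the linked spreadsheet.
MAC_EDU_PRICE = 7200    # maxed-out Studio after the ~10% EDU discount
MACS = 60 + 1           # +1 hot spare
TAX = 0.10

mac_cost = MACS * MAC_EDU_PRICE * (1 + TAX)
print(f"Macs alone: ${mac_cost:,.0f} of the ~$1.12M total")   # ~$483,120
```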
- You can definitely get the thunderbolt/SFP56 parts for cheaper.
- I did not include some sort of large storage array, nor the cost for cables/optics to get data to and from this rack.
- The GPU portions rely on GPU compute coming to Linux
- Redundancy is not really a thing here.
- No ECC beyond what LPDDR5 already provides.
- ARM on Linux with unsupported hardware will be “fun”
At the very least, it is an interesting thought exercise.
**Edit:** followed up with a similar piece on Orin AGX: https://gist.github.com/FCLC/7d75d12e4c368c13e400fda1475da673
Here’s a spreadsheet with all the juicy data: https://www.icloud.com/numbers/05fF59zXuDyHbYG2L4Fzuz8Lg#Cursed_cluster
CFD benchmarks here: http://hrtapps.com/blogs/20220427/