On every machine in the cluster install openmpi and mlx-lm:
conda install conda-forge::openmpi
pip install -U mlx-lm
Next download the pipeline parallel run script. Download it to the same path on every machine:
curl -O https://raw.githubusercontent.com/ml-explore/mlx-examples/refs/heads/main/llms/mlx_lm/examples/pipeline_generate.py
Make a hosts.json file on the machine you plan to launch the generation from. For two machines it should look like this:
[
{"ssh": "hostname1"},
{"ssh": "hostname2"}
]
Also make sure you can ssh hostname from every machine to every other machine. Check out the MLX documentation for more information on setting up and testing MPI.
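One way to check the ssh requirement is a small script that reads hosts.json and tries a non-interactive login to each entry. This is only a sketch: the helper names (load_hosts, ssh_check_command, check_hosts) are made up for illustration and are not part of MLX, and it assumes key-based ssh is already set up. It only tests connections from the machine it runs on, so run it on every machine in the cluster.

```python
import json
import subprocess
import sys


def load_hosts(path):
    """Return the "ssh" hostnames listed in a hosts.json file."""
    with open(path) as f:
        return [entry["ssh"] for entry in json.load(f)]


def ssh_check_command(host):
    """Build an ssh command that runs `true` on the remote host.

    BatchMode=yes makes ssh fail instead of prompting for a password,
    and ConnectTimeout=5 gives up after five seconds.
    """
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"]


def check_hosts(path):
    """Map each hostname to True if the ssh command exited with status 0."""
    results = {}
    for host in load_hosts(path):
        proc = subprocess.run(ssh_check_command(host), capture_output=True)
        results[host] = proc.returncode == 0
    return results


# Hypothetical usage: python check_ssh.py path/to/hosts.json
if __name__ == "__main__" and len(sys.argv) == 2:
    for host, ok in check_hosts(sys.argv[1]).items():
        print(f"{host}: {'ok' if ok else 'FAILED'}")
```

A FAILED entry usually means the ssh key or hostname for that machine still needs to be set up before mlx.launch can reach it.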
Set the wired limit on the machines to use more memory. For example on a 192GB M2 Ultra set this:
sudo sysctl iogpu.wired_limit_mb=180000
Run the generation with a command like the following:
mlx.launch \
--hostfile path/to/hosts.json \
--backend mpi \
path/to/pipeline_generate.py \
--prompt "What number is larger 6.9 or 6.11?" \
--max-tokens 128 \
--model mlx-community/DeepSeek-R1-4bit
For DeepSeek R1 quantized to 3-bit you need 350 GB of RAM in aggregate across the cluster of machines, e.g. two 192 GB M2 Ultras. To run the model quantized to 4-bit you need 450 GB of RAM in aggregate, or three 192 GB M2 Ultras.
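The machine counts above follow from dividing the aggregate requirement by what each machine can offer. As a rough sketch, assuming each machine's usable budget is the 180000 MB wired limit set earlier and that 1 GB is 1000 MB (both simplifications; machines_needed is a made-up helper name):

```python
import math

# Assumption: the per-machine budget is the wired limit from the sysctl
# command above (192 GB M2 Ultra), with 1 GB treated as 1000 MB.
WIRED_LIMIT_MB = 180_000


def machines_needed(model_gb, wired_limit_mb=WIRED_LIMIT_MB):
    """Smallest number of identical machines whose wired limits cover model_gb."""
    return math.ceil(model_gb * 1000 / wired_limit_mb)


print(machines_needed(350))  # 3-bit DeepSeek R1 -> 2
print(machines_needed(450))  # 4-bit DeepSeek R1 -> 3
```

This ignores any per-machine overhead beyond the wired limit, so treat the result as a lower bound rather than a guarantee.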


Is there a limit on how much RAM can be made available to the GPU? I'm wondering how much I can squeeze out for the GPU alone on the upcoming 512 GB Mac Studio. I assume that if I leave only something like 16 GB for the OS on each of 2 machines and get 496x2 GB of VRAM for DeepSeek R1, I can run the full version with fp16 on core attention and fp8 on the rest of the params?
Also, can MLX use the bandwidth of multiple TB5 connections? Since the Mac Studio comes with multiple TB5 ports, it would be nice if we could use all of them.