@awni
Last active October 6, 2025 13:50
Run DeepSeek R1 or V3 with MLX Distributed

Setup

On every machine in the cluster install openmpi and mlx-lm:

conda install conda-forge::openmpi
pip install -U mlx-lm

Next, download the pipeline-parallel run script. Save it to the same path on every machine:

curl -O https://raw.githubusercontent.com/ml-explore/mlx-examples/refs/heads/main/llms/mlx_lm/examples/pipeline_generate.py

Make a hosts.json file on the machine from which you plan to launch the generation. For two machines it should look like this:

[
  {"ssh": "hostname1"},
  {"ssh": "hostname2"}
]
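
For more than a couple of machines, the file can be generated rather than typed by hand. The following is a minimal sketch, assuming each entry needs only the "ssh" key shown above; the helper name write_hostfile and the hostnames are placeholders, not part of MLX:

```python
import json
from pathlib import Path


def write_hostfile(hostnames, path="hosts.json"):
    """Write one {"ssh": <hostname>} entry per machine, in the order given."""
    entries = [{"ssh": name} for name in hostnames]
    Path(path).write_text(json.dumps(entries, indent=2) + "\n")
    return entries


if __name__ == "__main__":
    # Placeholder hostnames; replace with the machines in your cluster.
    write_hostfile(["hostname1", "hostname2"])
```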

Also make sure you can run ssh hostname from every machine to every other machine. Check out the MLX documentation for more information on setting up and testing MPI.
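
One way to enumerate the every-machine-to-every-other-machine checks is to list all ordered pairs of hosts. The sketch below only builds the commands and does not run them. It assumes the launch machine can reach each source host and that a nested ssh from that host is an acceptable test; ssh_check_commands is a hypothetical helper name.

```python
from itertools import permutations


def ssh_check_commands(hostnames):
    """Return one command per ordered (source, target) pair of distinct hosts.

    Each command logs in to `source` and, from there, runs `ssh target true`.
    BatchMode=yes makes ssh fail instead of prompting for a password.
    """
    return [
        ["ssh", "-o", "BatchMode=yes", src,
         "ssh", "-o", "BatchMode=yes", dst, "true"]
        for src, dst in permutations(hostnames, 2)
    ]


# Placeholder hostnames; print the commands for inspection.
for cmd in ssh_check_commands(["hostname1", "hostname2"]):
    print(" ".join(cmd))
```

Each list can be passed to subprocess.run(cmd) to execute it; a nonzero return code means that pair still needs its keys or host entries set up.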

Set the wired limit on each machine so that it can use more memory. For example, on a 192 GB M2 Ultra, set this:

sudo sysctl iogpu.wired_limit_mb=180000
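
To adapt the value to a machine with a different amount of RAM, it helps to see what the example leaves free. The sketch below takes 1 GB as 1024 MB, which is an assumption. The helper wired_limit_mb and the 16 GB reserve are illustrative choices, not recommendations from MLX.

```python
def wired_limit_mb(total_ram_gb, reserve_gb):
    """Wired limit in MB that leaves `reserve_gb` for the rest of the system."""
    return (total_ram_gb - reserve_gb) * 1024


# The value used above, as a fraction of a 192 GB machine.
fraction = 180000 / (192 * 1024)
print(f"{fraction:.1%}")

# Hypothetical 16 GB reserve on the same machine.
print(wired_limit_mb(192, 16))
```

On a 192 GB machine, the example value of 180000 works out to a little under 92% of total RAM.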

Run

Run the generation with a command like the following:

mlx.launch \
  --hostfile path/to/hosts.json \
  --backend mpi \
  path/to/pipeline_generate.py \
  --prompt "What number is larger 6.9 or 6.11?" \
  --max-tokens 128 \
  --model mlx-community/DeepSeek-R1-4bit
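
When the command is launched from a script, building it as an argument list sidesteps shell quoting and line-continuation problems. This is a minimal sketch that uses only the flags shown above; build_launch_command is a hypothetical helper, and the paths are the same placeholders as in the command.

```python
import shlex


def build_launch_command(hostfile, script, prompt, model, max_tokens=128):
    """Assemble the mlx.launch invocation shown above as an argument list."""
    return [
        "mlx.launch",
        "--hostfile", hostfile,
        "--backend", "mpi",
        script,
        "--prompt", prompt,
        "--max-tokens", str(max_tokens),
        "--model", model,
    ]


cmd = build_launch_command(
    hostfile="path/to/hosts.json",
    script="path/to/pipeline_generate.py",
    prompt="What number is larger 6.9 or 6.11?",
    model="mlx-community/DeepSeek-R1-4bit",
)
# Print a single shell-safe line, useful for logging.
print(shlex.join(cmd))
# To actually launch: subprocess.run(cmd, check=True)
```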

For DeepSeek R1 quantized to 3-bit, you need 350 GB of RAM in aggregate across the cluster of machines, e.g. two 192 GB M2 Ultras. To run the model quantized to 4-bit, you need 450 GB of RAM in aggregate, e.g. three 192 GB M2 Ultras.
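
The machine counts follow from dividing the aggregate requirement by the RAM per machine and rounding up. The sketch below is a rough check that ignores the wired limit and any memory the operating system keeps for itself; machines_needed is a hypothetical helper.

```python
import math


def machines_needed(required_gb, per_machine_gb=192):
    """Smallest machine count whose combined RAM covers the requirement."""
    return math.ceil(required_gb / per_machine_gb)


print(machines_needed(350))  # 3-bit: 2 machines (384 GB total)
print(machines_needed(450))  # 4-bit: 3 machines (576 GB total)
```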

awni commented Sep 18, 2025

Fix is incoming here: ml-explore/mlx-lm#483
