
@awni
Last active October 6, 2025 13:50
Run DeepSeek R1 or V3 with MLX Distributed

Setup

On every machine in the cluster, install OpenMPI and mlx-lm:

conda install conda-forge::openmpi
pip install -U mlx-lm
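
To quickly verify the installs, the following optional check (a sketch, not part of mlx-lm) confirms the packages are importable and that an MPI launcher is on the PATH; run it on every machine:

# sanity_check.py -- optional helper, not part of mlx-lm
from importlib.metadata import version
import shutil
print("mlx:", version("mlx"))        # installed as a dependency of mlx-lm
print("mlx-lm:", version("mlx-lm"))
# mpirun comes from the openmpi package; the MPI backend relies on it
print("mpirun:", shutil.which("mpirun") or "not found -- check the openmpi install")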

Next, download the pipeline-parallel generation script to the same path on every machine:

curl -O https://raw.githubusercontent.com/ml-explore/mlx-examples/refs/heads/main/llms/mlx_lm/examples/pipeline_generate.py

Make a hosts.json file on the machine from which you plan to launch the generation. For two machines it should look like this:

[
  {"ssh": "hostname1"},
  {"ssh": "hostname2"}
]

Also make sure you can ssh hostname from every machine to every other machine without being prompted for a password. Check out the MLX documentation for more information on setting up and testing MPI.
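
If you want to script that check from the launch machine, here is a minimal sketch (check_ssh.py is a hypothetical helper, not part of mlx-lm). It reads hosts.json and tries a non-interactive ssh to each host; note it only checks reachability from the current machine, while the requirement above is every machine to every other machine:

# check_ssh.py -- hypothetical helper; only checks reachability from this machine
import json
import subprocess
with open("hosts.json") as f:
    hosts = [h["ssh"] for h in json.load(f)]
for host in hosts:
    # BatchMode=yes fails fast instead of prompting for a password
    ok = subprocess.run(["ssh", "-o", "BatchMode=yes", host, "true"],
                        capture_output=True).returncode == 0
    print(f"{host}: {'ok' if ok else 'FAILED (check SSH keys / known_hosts)'}")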

Increase the wired memory limit on each machine so the GPU can use more of the unified memory. For example, on a 192 GB M2 Ultra set:

sudo sysctl iogpu.wired_limit_mb=180000
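
That 180000 MB is roughly 90% of the 192 GB of unified memory, leaving headroom for the OS. To compute a comparable value on a different machine, here is a sketch; the 0.90 headroom factor is an assumption, not an official recommendation:

# suggest_wired_limit.py -- sketch; the 0.90 headroom factor is an assumption
import subprocess
mem_bytes = int(subprocess.run(["sysctl", "-n", "hw.memsize"],
                               capture_output=True, text=True).stdout)
suggested_mb = int(mem_bytes / (1024 * 1024) * 0.90)
print(f"total RAM: {mem_bytes // 2**30} GB")
print(f"try: sudo sysctl iogpu.wired_limit_mb={suggested_mb}")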

Run

Run the generation with a command like the following:

mlx.launch \
  --hostfile path/to/hosts.json \
  --backend mpi \
  path/to/pipeline_generate.py \
  --prompt "What number is larger 6.9 or 6.11?" \
  --max-tokens 128 \
  --model mlx-community/DeepSeek-R1-4bit
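
Before downloading a multi-hundred-gigabyte model, it can be worth launching a tiny script the same way to confirm the distributed group comes up on every host. A minimal sketch (check_cluster.py is a hypothetical name), assuming the mlx.core.distributed API:

# check_cluster.py -- hypothetical helper, launched with mlx.launch like pipeline_generate.py
import mlx.core as mx
group = mx.distributed.init()
print(f"rank {group.rank()} of {group.size()} is up")
# a tiny collective: summing 1.0 across ranks should equal the number of hosts
total = mx.distributed.all_sum(mx.array(1.0))
mx.eval(total)
if group.rank() == 0:
    print(f"all_sum across ranks -> {total.item()}")

Launch it with the same hostfile and backend, e.g. mlx.launch --hostfile path/to/hosts.json --backend mpi path/to/check_cluster.py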

For DeepSeek R1 quantized to 3-bit you need 350 GB of RAM in aggregate across the cluster, e.g. two 192 GB M2 Ultras. To run the model quantized to 4-bit you need 450 GB of RAM in aggregate, e.g. three 192 GB M2 Ultras.
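
As a back-of-the-envelope check of those numbers: DeepSeek R1/V3 has roughly 671B parameters, so the weights alone take about 671e9 × bits / 8 bytes, and the aggregate figures above add headroom for quantization metadata, the KV cache, and activations. A rough sketch (the parameter count is the published figure; the headroom framing is an assumption):

# rough weight-memory estimate; ignores KV cache, activations, and quant metadata
params = 671e9  # approximate total parameter count for DeepSeek R1 / V3
for bits in (3, 4):
    print(f"{bits}-bit weights alone: ~{params * bits / 8 / 1e9:.0f} GB")
# 3-bit -> ~252 GB, 4-bit -> ~336 GB; the 350 GB / 450 GB figures above leave headroom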

@georgiedekker

You can try one of the smaller DSV2 models like mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit

Thank you so much Awni. I was trying to convince Claude Code to get this working with Qwen models, but it just would not work properly; I guess it had too much memory of that approach to accept your guidance. I finally got it to work and asked Claude to do a beginner-friendly writeup. Please review/comment, as I think this would be a nice example that many people would be able to run.
https://github.com/georgiedekker/mlx_distributed_ring_inference

@georgiedekker

@awni I made another attempt at speeding things up. It seems to work; please review, and feel free to use any of it as an example in the mlx repo. https://github.com/georgiedekker/mlx_distributed_ring_inference_v2

@maxims-eject

Hi @awni,

I have been using pipeline generate with MPI/ring distributed (two/three nodes, Thunderbolt network) since May, but starting from version "mlx-lm==0.27.0" I am seeing this crash (GPU timeout):

libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)

This is a very deterministic issue on my system in multiple scenarios (MPI servers, single text generation, ...): basically all distributed pipeline generation seems to stop working from mlx-lm v0.27.0, while I still have no issue if I stay on previous versions, e.g. python -m pip install -U "mlx-lm==0.26.4". The nodes I am running are M4, M2 Pro, and a mix of them.

For debugging, I was able to reproduce the issue using pipeline_generate.py from the examples directory and the following command. Note that I am using a smaller model due to hardware RAM sizes, but this used to be supported by pipeline_generate as part of the deepseek_v2 family:

clear && mlx.launch \
    --hostfile hosts_2.json \
    --backend ring \
    pipeline_generate.py \
    --prompt "What number is larger 6.9 or 6.11?" \
    --max-tokens 512 \
    --model mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit

Below are the two logs produced in my conda env with the old version of mlx-lm (up to 0.26.4) and the new versions (either 0.27.0 or 0.27.1):

python -m pip install -U "mlx-lm==0.26.4":
[screenshot of the log output]

python -m pip install -U "mlx-lm==0.27.0":
[screenshot of the log output]

@awni (Author)

commented Sep 18, 2025

A fix is incoming here: ml-explore/mlx-lm#483
