@awni
Last active October 6, 2025 13:50
Run DeepSeek R1 or V3 with MLX Distributed

Setup

On every machine in the cluster, install OpenMPI and mlx-lm:

conda install conda-forge::openmpi
pip install -U mlx-lm
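
As a quick sanity check (not part of the original instructions), you can confirm on each machine that the MPI launcher and mlx-lm are both visible:

# verify OpenMPI is on the PATH
mpirun --version

# verify mlx-lm is importable
python -c "import mlx_lm; print('mlx-lm ok')"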

Next, download the pipeline-parallel run script to the same path on every machine:

curl -O https://raw.githubusercontent.com/ml-explore/mlx-examples/refs/heads/main/llms/mlx_lm/examples/pipeline_generate.py

Make a hosts.json file on the machine you plan to launch the generation from. For two machines it should look like this:

[
  {"ssh": "hostname1"},
  {"ssh": "hostname2"}
]
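
For a larger cluster you add more entries. As a hypothetical example (hostnames are placeholders), a three-machine hosts.json can be written straight from the shell:

cat > hosts.json <<'EOF'
[
  {"ssh": "hostname1"},
  {"ssh": "hostname2"},
  {"ssh": "hostname3"}
]
EOF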

Also make sure you can ssh <hostname> from every machine to every other machine. Check out the MLX documentation for more information on setting up and testing MPI.
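
A minimal check, reusing the placeholder hostnames from hosts.json above, is to confirm non-interactive SSH from the launching machine to each host (repeat the same loop on every machine to cover every direction):

for h in hostname1 hostname2; do
  ssh -o BatchMode=yes "$h" hostname || echo "passwordless ssh to $h failed"
done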

Set the wired limit on the machines so the GPU can use more memory. For example, on a 192 GB M2 Ultra set this:

sudo sysctl iogpu.wired_limit_mb=180000
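
The setting does not persist across reboots, so reapply it after a restart. To read the current value, or to pick a limit relative to a machine's physical RAM, something like the following works (the 90% headroom factor is just an illustrative choice, not a recommendation from this gist):

# read the current wired limit (in MB)
sysctl iogpu.wired_limit_mb

# total physical RAM scaled to ~90%, as a rough starting point (in MB)
sysctl -n hw.memsize | awk '{printf "%d\n", $1 / 1048576 * 0.9}'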

Run

Run the generation with a command like the following:

mlx.launch \
  --hostfile path/to/hosts.json \
  --backend mpi \
  path/to/pipeline_generate.py \
  --prompt "What number is larger 6.9 or 6.11?" \
  --max-tokens 128 \
  --model mlx-community/DeepSeek-R1-4bit
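
Before downloading a several-hundred-gigabyte model, it can be worth pushing a tiny script through the same launcher to confirm the cluster is wired up. This is a minimal sketch: rank_check.py is a hypothetical name, and, as with pipeline_generate.py, it must be copied to the same path on every machine before launching.

cat > rank_check.py <<'EOF'
import mlx.core as mx

# initialize the distributed group and report this process's rank
world = mx.distributed.init()
print(f"rank {world.rank()} of {world.size()}")
EOF

mlx.launch --hostfile path/to/hosts.json --backend mpi rank_check.py

If every host prints its rank, MPI and the host file are set up correctly.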

For DeepSeek R1 quantized to 3-bit you need 350 GB of RAM in aggregate across the cluster of machines, e.g. two 192 GB M2 Ultras. To run the model quantized to 4-bit you need 450 GB of RAM in aggregate, e.g. three 192 GB M2 Ultras.

@jiyzhang

404 for the URL
https://raw.githubusercontent.com/ml-explore/mlx-examples/refs/heads/main/llms/mlx_lm/examples/pipeline_generate.py

The run script can now be downloaded at
https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/examples/pipeline_generate.py
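
(If you prefer to fetch it with curl as in the gist above, the corresponding raw URL should be the following, derived from the repo path in the link above:)

curl -O https://raw.githubusercontent.com/ml-explore/mlx-lm/main/mlx_lm/examples/pipeline_generate.py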

@Basten7

Basten7 commented Apr 11, 2025

Good News

pipeline_generate.py works very well with another DeepSeek model, "DeepSeek-V2.5-1210-3bit".

mlx.launch --hosts mac1,mac2 --backend mpi "pipeline_generate.py" --max-tokens 12800 --model mlx-community/DeepSeek-V2.5-1210-3bit --prompt "Generate a python script"

==========
Prompt: 21 tokens, 85.378 tokens-per-sec
Generation: 776 tokens, 17.794 tokens-per-sec
Peak memory: 55.234 GB

mlx.launch --hosts mac1,mac2 --backend mpi "pipeline_generate.py" --max-tokens 12800 --model mlx-community/DeepSeek-V2.5-1210-4bit --prompt "Generate a python script"

==========
Prompt: 21 tokens, 80.473 tokens-per-sec
Generation: 901 tokens, 17.410 tokens-per-sec
Peak memory: 70.257 GB

Less good News

1) When I run mlx_distributed_deepseek.py, I get an error message: the except statement is broken in "distributed_run.py".

The fix is to edit around line 175 of that file: find "except e:" and replace it with "except Exception as e:".

2) And when I run this command: mlx.distributed_config --verbose --hosts
I get this error message:

/miniconda3/envs/mlxmpi/lib/python3.11/site-packages/mlx/distributed_run.py", line 507, in prepare_tb_ring
connected_to = items[0]["domain_uuid_key"]
~~~~~~~~^^^^^^^^^^^^^^^^^^^
KeyError: 'domain_uuid_key'

@zengqingfu1442

Does MLX support the GGUF format?

@georgiedekker

Also works for DeepSeek V2, but not for other models.

Would it be possible to make a smaller model support this distributed option, just to get it working on less expensive hardware? I have a few base-model M4 Mac minis, and before I fork out for the M3 Ultras I'd love to see that it actually works. I've tried many ways to achieve it with gRPC, tensor parallelism, etc., but most of the time things got corrupted in the KV cache synchronization.

@awni
Author

awni commented Aug 8, 2025

You can try one of the smaller DSV2 models like mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit

@georgiedekker

You can try one of the smaller DSV2 models like mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit

Thank you so much Awni. I was trying to convince Claude Code to make it work with Qwen models, but it just would not get it to work properly; I guess it had too much memory of that approach to accept your guidance. I finally got it to work, and I asked Claude to do a beginner-friendly writeup. Please review/comment, as I think this would be a nice example that many people would be able to run.
https://github.com/georgiedekker/mlx_distributed_ring_inference

@georgiedekker

@awni made another attempt at speeding things up. It seems to work; please review, and feel free to use any of it as an example in the MLX repo. https://github.com/georgiedekker/mlx_distributed_ring_inference_v2

@maxims-eject

Hi @awni,

I have been using pipeline_generate with distributed inference (ring backend, two/three nodes, Thunderbolt network) since May, but starting from version "mlx-lm==0.27.0" I am seeing this crash (GPU timeout):

libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)

This is a very deterministic issue on my system across multiple scenarios (MPI servers, single text generation, ...): basically all pipelined distributed generation seems to stop working from mlx-lm v0.27.0, while I still have no issues if I stay on earlier versions, e.g. python -m pip install -U "mlx-lm==0.26.4". The nodes I am running are M4, M2 Pro, and mixes of them.

For debugging, I was able to reproduce the issue using the pipeline_generate.py from the examples directory and the following command. Note I am using the smaller model due to hardware RAM sizes, but this used to be supported by pipeline_generate as part of the DeepSeek-V2 family:

clear && mlx.launch \
    --hostfile hosts_2.json \
    --backend ring \
    pipeline_generate.py \
    --prompt "What number is larger 6.9 or 6.11?" \
    --max-tokens 512 \
    --model mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit

Below are the two logs produced in my conda env with the old version of mlx-lm (up to 0.26.4) and the new versions (either 0.27.0 or 0.27.1):

python -m pip install -U "mlx-lm==0.26.4":
[screenshot of log output]

python -m pip install -U "mlx-lm==0.27.0":
[screenshot of log output]

@awni
Author

awni commented Sep 18, 2025

Fix is incoming here: ml-explore/mlx-lm#483
