
@awni
Last active May 27, 2025 14:42
Run DeepSeek R1 or V3 with MLX Distributed

Setup

On every machine in the cluster, install OpenMPI and mlx-lm:

conda install conda-forge::openmpi
pip install -U mlx-lm
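
To confirm both installs succeeded, you can run the following on each machine (standard tools, not part of this gist):

mpirun --version
pip show mlx-lm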

Next, download the pipeline-parallel run script to the same path on every machine:

curl -O https://raw.githubusercontent.com/ml-explore/mlx-examples/refs/heads/main/llms/mlx_lm/examples/pipeline_generate.py

Make a hosts.json file on the machine from which you plan to launch the generation. For two machines it should look like this:

[
  {"ssh": "hostname1"},
  {"ssh": "hostname2"}
]

Also make sure you can ssh from every machine to every other machine by hostname. Check out the MLX documentation for more information on setting up and testing MPI.
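
Before loading a large model, it can help to sanity check that every rank can communicate. Here is a minimal sketch (not part of this gist) using MLX's distributed API; save it as test_ring.py at the same path on every machine:

import mlx.core as mx

# Initialize the distributed group and do a trivial all-reduce to make
# sure every rank can talk to the others.
group = mx.distributed.init()
total = mx.distributed.all_sum(mx.ones(1))
print(f"Rank {group.rank()} of {group.size()}: all_sum = {total.item()}")

Launch it the same way as the generation script, e.g. mlx.launch --hostfile path/to/hosts.json --backend mpi test_ring.py. Every rank should report the number of machines in the cluster as the sum.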

Raise the wired memory limit on each machine so more of its RAM can be used by the GPU. For example, on a 192 GB M2 Ultra set:

sudo sysctl iogpu.wired_limit_mb=180000
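
The 180000 MB in the example is roughly 90% of the machine's 192 GB, leaving the rest for the OS. A sketch (not from the gist) that computes an equivalent value from the machine's physical memory:

RAM_MB=$(($(sysctl -n hw.memsize) / 1048576))            # total physical RAM in MB (hw.memsize is in bytes)
sudo sysctl iogpu.wired_limit_mb=$((RAM_MB * 90 / 100))  # wire about 90% of it for the GPU

The setting does not persist across reboots, so re-apply it after restarting.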

Run

Run the generation with a command like the following:

mlx.launch \
  --hostfile path/to/hosts.json \
  --backend mpi \
  path/to/pipeline_generate.py \
  --prompt "What number is larger 6.9 or 6.11?" \
  --max-tokens 128 \
  --model mlx-community/DeepSeek-R1-4bit

For DeepSeek R1 quantized to 3-bit you need 350 GB of RAM in aggregate across the cluster of machines, e.g. two 192 GB M2 Ultras. To run the model quantized to 4-bit you need 450 GB of aggregate RAM, e.g. three 192 GB M2 Ultras.
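
As a rough back-of-envelope (an estimate, not from the gist), the weights alone take parameters × bits / 8 bytes, and the remaining budget covers the KV cache, activations, and runtime overhead:

# DeepSeek R1 / V3 have roughly 671B total parameters
params = 671e9
for bits in (3, 4):
    print(f"{bits}-bit weights ≈ {params * bits / 8 / 1e9:.0f} GB")
# ≈ 252 GB (3-bit) and ≈ 336 GB (4-bit) of weights alone, which is why the
# aggregate recommendations above are 350 GB and 450 GB, with headroom left over.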

@jiyzhang

The URL returns a 404:
https://raw.githubusercontent.com/ml-explore/mlx-examples/refs/heads/main/llms/mlx_lm/examples/pipeline_generate.py

The run script can be downloaded at
https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/examples/pipeline_generate.py

@Basten7

Basten7 commented Apr 11, 2025

Good News

pipeline_generate.py works very well with another DeepSeek model, DeepSeek-V2.5-1210-3bit:

mlx.launch --hosts mac1,mac2 --backend mpi "pipeline_generate.py" --max-tokens 12800 --model mlx-community/DeepSeek-V2.5-1210-3bit --prompt "Generate a python script"

==========
Prompt: 21 tokens, 85.378 tokens-per-sec
Generation: 776 tokens, 17.794 tokens-per-sec
Peak memory: 55.234 GB

mlx.launch --hosts mac1,mac2 --backend mpi "pipeline_generate.py" --max-tokens 12800 --model mlx-community/DeepSeek-V2.5-1210-4bit --prompt "Generate a python script"

==========
Prompt: 21 tokens, 80.473 tokens-per-sec
Generation: 901 tokens, 17.410 tokens-per-sec
Peak memory: 70.257 GB

Less good News

1) When I run mlx_distributed_deepseek.py I get an error: the except statement in "distributed_run.py" is broken.

Edit around line 175. Find in the file:
except e:
and replace with:
except Exception as e:

2) And when I run this command: mlx.distributed_config --verbose --hosts
I get this error:

/miniconda3/envs/mlxmpi/lib/python3.11/site-packages/mlx/distributed_run.py", line 507, in prepare_tb_ring
connected_to = items[0]["domain_uuid_key"]
~~~~~~~~^^^^^^^^^^^^^^^^^^^
KeyError: 'domain_uuid_key'

@zengqingfu1442

Does MLX support the GGUF format?

@RaylenFarnor

RaylenFarnor commented May 5, 2025

To set up a machine cluster for MLX generation, install OpenMPI and mlx-lm on each machine using conda and pip. Download the pipeline script and place it at the same path on all machines. Create a hosts.json file listing the machines, ensuring SSH access between them. Set memory limits with sudo sysctl.
