yeahdongcn’s gists

SGLang already runs on macOS via PyTorch's MPS (Metal Performance Shaders) backend — you can launch a server, send requests, and get responses. But performance on Apple Silicon has been underwhelming. In this post, we describe how we integrated a native MLX execution path into SGLang that delivers up to 5.3× higher throughput while using significantly less memory.

The Problem: PyTorch MPS Overhead

When SGLang runs on macOS with PyTorch MPS, every operation — matrix multiplications, attention, normalization — goes through PyTorch's MPS backend, which translates PyTorch ops into Metal Performance Shaders. This translation layer adds substantial overhead:

Op dispatch overhead: Each PyTorch operation is individually dispatched to MPS, missing optimization opportunities that come from fusing operations together.
Memory duplication: PyTorch loads model weights into MPS memory and allocates a large KV cache, leaving less room for actual inference workloads.
No fused kernels: Op

Investigating performance variance in MLX inference on a MacBook Pro M1

TL;DR

We observed up to 30% variance in decode throughput across benchmark runs of the same code on the same hardware — enough to turn a "no regression" into an apparent 14% slowdown, or inflate a 48% improvement to 80%. Using apple-smi to monitor GPU temperature, power draw, and memory pressure, we identified three root causes: thermal throttling, memory pressure from unified memory, and DVFS (Dynamic Voltage and Frequency Scaling). This post documents the methodology and provides guidelines for producing reliable benchmarks on Apple Silicon.

The Problem

While developing a hybrid KV cache for SGLang's MLX backend on Apple Silicon, we encountered a frustrating situation: the same benchmark, run minutes apart, produced wildly different results.

R0CKSTAR yeahdongcn

The Problem: PyTorch MPS Overhead

TL;DR

The Problem