---
layout: post
title: "vLLM Now Supports AMD GPUs"
author: "vLLM team and EmbeddedLLM"
---
TL;DR:
- With help from the EmbeddedLLM team, vLLM can now run on top of ROCm-enabled AMD GPUs.
- On the previous-generation GPU (MI250), it achieves 87% of the throughput of the A100 80GB for the Llama-2 70B model.
- On the latest-generation GPU (MI300X), it outperforms the H100 by 40% for the Llama-2 70B model and by 60% for the Bloom 176B model.
- vLLM is committed to supporting AMD GPUs and is actively working with AMD to bring the best performance to the community.
We are thrilled to announce that AMD GPU support is now integrated into vLLM, an open-source inference engine for efficient LLM serving. The initial integration is the result of a collaboration between the vLLM team and EmbeddedLLM, a specialist in the AMD ROCm platform, with support from AMD. With this integration, vLLM's broad model support and its ever-expanding suite of inference optimizations will be accessible to an even broader audience running on AMD GPUs.
The initial support landed in vLLM v0.2.4. It adds support for ROCm 5.0 and has been tested on the AMD Instinct MI250 GPU. The MI250 achieved 87% of the throughput of the NVIDIA A100 80GB GPU when serving the Llama-2 70B model. Specifically, when running Llama-2 70B on two MI250 GPUs (each with two Graphics Compute Dies) for the ShareGPT workload, we recorded a throughput of 1.87 requests per second, translating to 895.26 tokens per second. In comparison, the baseline with two A100 (80GB) GPUs reached a throughput of 2.15 requests per second, or 1029.45 tokens per second, under the same conditions.
TODO: Add bar plot of the performance number.
It is difficult to make an apples-to-apples comparison given that the two GPUs have different memory sizes and architectures. We note the following:
- When comparing throughput, we used the ShareGPT workload with Llama-2 70B in float16. We ran two MI250s with a tensor parallel size of 4 against two A100s with a tensor parallel size of 2. The throughput of the MI250 is 87% of the A100 on this particular workload. (A sketch of how such a run can be launched with vLLM follows this list.)
- A single MI250 has 128GB of HBM2e memory, while an A100 has 80GB. A single MI250 has 2 Graphics Compute Dies, which is equivalent to 2 "GPUs" on the same board. Under this setting, we also compared two MI250s with a tensor parallel size of 4 but with the serving process duplicated, against four A100s with a tensor parallel size of 4. The MI250 setup achieved 4.03 requests per second, translating to 1928.22 tokens per second. In comparison, the baseline with four A100 (80GB) GPUs reached a throughput of 2.93 requests per second, or 1399.77 tokens per second, as of vLLM v0.2.4, and 4.34 requests per second, or 2074.42 tokens per second, as of vLLM v0.3.0. The throughput of the MI250 is 92.9% of the A100 on this particular workload.
- Notably, the ROCm backend currently lacks an equivalent of CUDA Graph, a feature that vLLM relies on for performance optimization. We are actively working with AMD to bridge this gap.
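For readers who want to run a similar comparison themselves, below is a minimal sketch of measuring offline throughput with vLLM's Python API. The model name, prompt list, and sampling settings are illustrative placeholders, and the published numbers above came from serving benchmarks on the ShareGPT dataset rather than from this exact snippet.

```python
# Rough offline throughput measurement with vLLM's Python API (illustrative only).
import time

from vllm import LLM, SamplingParams

# tensor_parallel_size=4 mirrors the four-GCD / four-GPU configurations above.
llm = LLM(model="meta-llama/Llama-2-70b-hf", tensor_parallel_size=4, dtype="float16")

prompts = ["Summarize the plot of Hamlet."] * 64  # stand-in for ShareGPT requests
params = SamplingParams(temperature=0.8, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Counts generated tokens only; the figures above may be defined differently.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{len(prompts) / elapsed:.2f} req/s, {generated / elapsed:.2f} tok/s")
```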
The integration of vLLM with ROCm has demonstrated that it's possible to maintain a single codebase for different GPU platforms. This is largely thanks to PyTorch's compatibility with ROCm, which automatically translates operations to ROCm equivalents, allowing the same code to run on both AMD and NVIDIA GPUs.
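As a concrete illustration of this compatibility, ordinary PyTorch code written for CUDA runs unchanged on a ROCm build: the "cuda" device string maps to AMD GPUs, and `torch.version.hip` can be checked on the rare occasions an AMD-specific path is needed. A minimal sketch:

```python
import torch

# On a ROCm build of PyTorch, the "cuda" device string targets AMD GPUs,
# so the same code runs on both vendors without modification.
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(1024, 1024, device=device, dtype=torch.float16)
y = x @ x.t()  # dispatched to rocBLAS on ROCm, cuBLAS on NVIDIA

# torch.version.hip is a version string on ROCm builds and None on CUDA builds,
# which is useful when a platform-specific code path is unavoidable.
if torch.version.hip is not None:
    print("Running on ROCm", torch.version.hip)
else:
    print("Running on CUDA", torch.version.cuda)
```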
One of the challenges faced during the porting process was xFormers' reliance on NVIDIA-specific libraries like CUTLASS, which are used by Flash Attention. To overcome this, we used AMD's fork of Flash Attention, which is built on the Composable Kernel library. This ROCm fork of Flash Attention supports advanced features like multi-query and grouped-query attention and was integrated into vLLM without issues.
For the PyTorch custom operations (Custom Ops) within vLLM, a conditional compilation mechanism was implemented to choose between CUDA and ROCm code paths at build time, depending on the target hardware. This mechanism is necessary because certain CUDA API functions are not supported on ROCm. It also allows better-optimized code to be selected for AMD platforms. Furthermore, code written in inline assembly specifically for NVIDIA GPUs has been ported to AMD as well. For example:
```cpp
// Vector fused multiply-add.
inline __device__ uint32_t fma(uint32_t a, uint32_t b, uint32_t c) {
  uint32_t d;
#ifndef USE_ROCM
  asm volatile("fma.rn.f16x2 %0, %1, %2, %3;\n" : "=r"(d) : "r"(a), "r"(b), "r"(c));
#else
  asm volatile("v_pk_fma_f16 %0, %1, %2, %3;\n" : "=v"(d) : "v"(a), "v"(b), "v"(c));
#endif
  return d;
}
```
Thanks to PyTorch's automatic hipify tool, which acts as a CUDA-to-HIP transpiler, users and maintainers can work from a single CUDA codebase. This tool played a pivotal role in creating a unified codebase, converting the majority of CUDA-specific code into its HIP equivalent and thereby enabling the PyTorch Custom Ops to function on both CUDA and ROCm platforms.
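To give a feel for how this works in practice, the sketch below uses PyTorch's C++ extension loader, which on a ROCm build hipifies CUDA sources automatically before compiling them. The source path is a hypothetical placeholder, not a real vLLM file; vLLM's own build invokes hipify through its regular build system rather than through this call.

```python
# Hypothetical example: JIT-compiling a CUDA source with PyTorch's extension loader.
# On a ROCm build, the loader runs hipify over the .cu file first
# (e.g. cudaMalloc -> hipMalloc, cudaStream_t -> hipStream_t),
# so the same source serves both CUDA and ROCm platforms.
from torch.utils.cpp_extension import load

my_ops = load(
    name="my_ops",
    sources=["csrc/my_kernel.cu"],  # placeholder path
    verbose=True,
)
```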
"AMD is deeply committed to advancing state-of-the-art open AI technologies. We look forward to continuing our engagement with the vLLM community and contributing to its growth and development, including the advanced ROCm features for MI300X GPUs, such as hipGraph, Triton Flash Attention, and FP8 support. We are delighted to bring exceptional AI features and performance to our customers through deep collaboration with AI innovators like Ion Stoica and the open AI software community." - Peng Sun, Ph.D., Director of Software Development, AIG (AI Group), AMD
With the anticipated full support for AMD GPUs in Triton 3.0, the future looks promising for the development of more Triton language kernels, which will be directly compatible with a variety of GPUs. This will further enhance the capabilities of vLLM.
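To give a flavor of such kernels, below is a minimal Triton vector-add kernel. It is not a vLLM kernel; it simply illustrates that a single Triton source can target both NVIDIA and AMD GPUs once the corresponding backend is available.

```python
# Minimal Triton kernel: element-wise addition of two tensors.
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```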
The AMD Instinct MI300X has gained significant attention since its unveiling at AMD's Advancing AI Event. It is an impressive device, featuring 192GB of HBM3 and a remarkable memory bandwidth of 5.3 TB/s. In a direct comparison with a setup of eight NVIDIA H100 GPUs, eight MI300X GPUs deliver a 40% increase in performance on the Llama 2 70B benchmark and an impressive 60% increase on the Bloom 176B benchmark.
In subsequent benchmarks, the MI300X running vLLM surpasses the NVIDIA H100 optimized with TensorRT-LLM, showing a 1.3x improvement in latency. The MI300X retains this advantage even when running in FP16 against the H100's FP8 path in TensorRT-LLM, still holding a 5% performance edge.
TODO: Add AMD keynote figure
Starting with vLLM 0.2.4, we support model inference and serving on AMD GPUs through ROCm. The current ROCm version supports FP16 and BF16 data types, SqueezeLLM quantization, W4A16 GPTQ quantization, and CUDA/HIP Graphs. Additionally, we have verified that new models such as Phi-2 and Mixtral are functioning correctly.
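To make that feature list concrete, here is a hedged sketch of how these options are selected through vLLM's Python API; the model repositories are placeholders, and the snippet assumes the quantized checkpoints already contain the corresponding weights (each LLM would normally be loaded in its own process).

```python
from vllm import LLM

# BF16 inference; the dtype argument also accepts "float16".
llm_bf16 = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1",
               tensor_parallel_size=4, dtype="bfloat16")

# W4A16 GPTQ-quantized weights (placeholder repository name).
llm_gptq = LLM(model="org/llama-2-70b-gptq", quantization="gptq")

# SqueezeLLM-quantized weights (placeholder path).
llm_sq = LLM(model="path/to/llama-2-70b-squeezellm", quantization="squeezellm")
```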
To get started with vLLM on AMD, we recommend using our Docker image with vLLM pre-installed, which simplifies the setup process. For those who prefer a more hands-on approach, building from source is also an option, with or without Docker. Please refer to our AMD installation guide to get started:
EmbeddedLLM is a Singapore-based company that specializes in helping clients build Large Language Model (LLM) platforms and services on the AMD ROCm platform. They offer comprehensive support, training, and consulting services to ensure seamless integration and optimization of LLM technologies for businesses seeking to harness the power of AI.
The AMD ROCm™ Platform is an open-source stack for GPU computation that allows developers to customize and tailor their GPU software while collaborating with a community of other developers. It includes drivers, development tools, and APIs enabling GPU programming from low-level kernel to end-user applications, and supports programming models such as OpenMP and OpenCL. The goal of ROCm is to help users maximize their GPU hardware investment and develop, test, and deploy GPU accelerated applications in a free, open-source, integrated, and secure software ecosystem.