Finally used OpenAI's Deep Research for the first time
This report outlines a model-independent framework for fine-tuning large AI models on a cluster of AMD Ryzen AI Max+ 395 nodes. The design supports a minimum of two nodes and scales to much larger deployments. We focus on optimizing fine-tuning efficiency using the XDNA 2 neural processing unit (NPU) in these chips, while keeping the setup accessible to developers of open-source AI models. Key areas include architecture and low-level optimizations, model splitting strategies, network and data throughput tuning, alternative computation models, and continuous benchmarking for improvements.
XDNA 2 NPU vs CPU/GPU: AMD’s XDNA 2 NPU (built into Ryzen AI Max chips) is a specialized spatial dataflow engine optimized for AI workloads. It consists of a 2D array of compute tiles with a flexible interconnect and on-chip SRAM buffers (RDNA 3.5, XDNA 2 engine, and Thoughts - AMD deep-dives Zen 5 architecture — Ryzen 9000 and AI 300 benchmarks, RDNA 3.5 GPU, XDNA 2, and more - Page 5 | Tom's Hardware). Unlike general-purpose CPUs or even GPUs, the NPU’s dataflow architecture can run matrix-heavy operations (like transformer attention and feed-forward layers) with high throughput and deterministic latency. AMD reports the second-generation XDNA 2 delivers up to 50 TOPS of AI performance and is up to 35× more power-efficient than CPU execution for AI models (RDNA 3.5, XDNA 2 engine, and Thoughts - AMD deep-dives Zen 5 architecture — Ryzen 9000 and AI 300 benchmarks, RDNA 3.5 GPU, XDNA 2, and more - Page 5 | Tom's Hardware). This power efficiency is a key advantage – the NPU can sustain AI computations at a fraction of the wattage of a CPU or discrete GPU, making it ideal for long-running training tasks and background inference (RDNA 3.5, XDNA 2 engine, and Thoughts - AMD deep-dives Zen 5 architecture — Ryzen 9000 and AI 300 benchmarks, RDNA 3.5 GPU, XDNA 2, and more - Page 5 | Tom's Hardware). In fact, the flagship Ryzen AI Max+ 395 APU (16 Zen 5 cores, 40 RDNA 3.5 GPU CUs, and XDNA 2 NPU) delivers a combined 126 TOPS of compute at low precision and was demonstrated to run a 70 B parameter Llama 3.1 model (quantized to 4-bit) more than 2× faster than an NVIDIA RTX 4090 24 GB GPU – all within a 45–120 W laptop power envelope (AMD Unveils Its Fastest Edge AI Chips Yet: The Ryzen AI Max and Ryzen AI Max+ Strix Halo Families - Hackster.io). These results highlight that, for certain quantized models, the integrated NPU+GPU combo can rival or exceed high-end GPUs in throughput. It’s worth noting, however, that NPUs primarily excel in power efficiency; their raw speed-up over GPUs may be limited by memory bandwidth in large-model scenarios, so careful workload allocation is needed to exploit their strengths (AI PCs Aren't Good at AI: The CPU Beats the NPU | Hacker News).
Assembly-Level and ISA Optimizations: To maximize token processing efficiency, low-level optimizations on both the CPU and NPU are crucial. The Zen 5 cores in Ryzen AI Max chips support advanced SIMD instructions (e.g. AVX-512/VNNI) tailored for AI math (Zen5's AVX512 Teardown and More - Hacker News). These vector instructions accelerate int8 and bf16 matrix multiply-accumulate operations, which can speed up parts of the training loop (like embedding lookups or smaller attention heads) on the CPU side. Developers can leverage libraries that use these instruction sets (for example, oneDNN or PyTorch with CPU acceleration) to get efficient token handling in parallel to the NPU. On the NPU side, AMD’s XDNA architecture provides a custom VLIW-like instruction set via the Vitis AI compiler. This enables highly efficient inner loops for tensor ops, though coding at this level is complex. In practice, AMD’s Ryzen AI software stack abstracts the NPU behind an ONNX Runtime execution provider (EP) (AMD Ryzen AI). Models can be converted to ONNX and sections of the graph (e.g. matrix multiplies, convolutions) will execute on the NPU, while unsupported operations fall back to CPU or the integrated GPU. This setup ensures that each compute task runs on the most suitable hardware in real-time, albeit at the cost of some framework overhead. Over time, we expect more direct support in ML frameworks – AMD is actively improving ROCm (their GPU compute stack) and tools like Hugging Face Optimum are adding support to load models on the Ryzen NPU with minimal code changes (AMD Ryzen AI). In summary, combining CPU-side vectorization, GPU compute, and NPU offload can yield significant speed-ups. Existing GPU clustering libraries can potentially be adapted – for example, using SYCL or HIP to target the integrated GPU (RDNA cores) and collaborating with AMD’s Vitis tools to target the NPU – but this is an emerging area. The cluster framework should remain flexible, using standard interfaces (like ONNX or PyTorch extensions) to dispatch heavy tensor ops to the NPU when available, and otherwise use CPU/GPU. By carefully aligning data formats and quantization between these units (e.g. using int8 on both NPU and CPU with VNNI), we can avoid costly conversions and squeeze out extra efficiency at the assembly level.
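As a concrete illustration of this dispatch model, the sketch below loads a quantized ONNX model through ONNX Runtime and requests the Vitis AI execution provider with CPU fallback. It assumes the Ryzen AI SDK's ONNX Runtime build is installed; the model file name and the input name are placeholders, not part of any shipped example.

```python
# Minimal sketch: offload supported subgraphs of an ONNX model to the Ryzen AI NPU
# via the Vitis AI execution provider, with CPU fallback for unsupported operators.
# "model_int8.onnx" and the "input_ids" input name are illustrative placeholders.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model_int8.onnx",
    providers=["VitisAIExecutionProvider", "CPUExecutionProvider"],
)

dummy_input = np.ones((1, 128), dtype=np.int64)   # shape depends on the exported model
outputs = session.run(None, {"input_ids": dummy_input})
print(session.get_providers())                    # confirm which providers were applied
```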
Memory and Dataflow Considerations: The XDNA 2 NPU’s design helps mitigate memory bottlenecks that often limit speed on large models. It features terabytes-per-second of internal fabric bandwidth and extensive on-chip SRAM, operating without traditional caches (RDNA 3.5, XDNA 2 engine, and Thoughts - AMD deep-dives Zen 5 architecture — Ryzen 9000 and AI 300 benchmarks, RDNA 3.5 GPU, XDNA 2, and more - Page 5 | Tom's Hardware). This means frequently used weights or activations can be kept on-chip, and data can be multicast to many compute tiles with minimal off-chip traffic (RDNA 3.5, XDNA 2 engine, and Thoughts - AMD deep-dives Zen 5 architecture — Ryzen 9000 and AI 300 benchmarks, RDNA 3.5 GPU, XDNA 2, and more - Page 5 | Tom's Hardware). For example, if multiple attention heads need the same input embedding, the NPU can broadcast that data across tile columns instead of each tile fetching it separately. This deterministic, high-bandwidth dataflow is ideal for transformer-style models where the same weights (the layer kernels) are applied across many tokens. In contrast, a CPU or standard GPU must stream these weights from cache or VRAM repeatedly, wasting energy and time. To leverage this, the cluster software should feed the NPU with appropriately sized workloads that fit in its on-chip memory whenever possible – for instance, batching tokens so the working set of one transformer layer stays in NPU SRAM during computation. Furthermore, kernel fusion on the NPU can combine operations (such as a GEMM followed by activation) into one run, reducing extra reads/writes. This is analogous to fused CUDA kernels on GPUs that eliminate kernel launch overhead (Making Transformer inference faster on GPUs - performance - PyTorch Developer Mailing List). The principle is the same: by merging multiple operations into a single NPU program, we reduce scheduling latency and make better use of the NPU’s parallelism. In summary, maximizing NPU usage involves structuring workloads to fit its dataflow model and using its custom instructions (via compilers) to run long sequences of operations in one go. Meanwhile, the CPU (with AVX-512/VNNI) can handle anything from tokenization to lighter ML layers in parallel, ensuring all parts of the SoC contribute to the pipeline.
Adapting GPU Cluster Tools to NPU: Many existing multi-GPU training frameworks (PyTorch Distributed, Horovod, etc.) assume CUDA or ROCm devices. While AMD’s integrated GPU in the 395 APU can be used via ROCm, the NPU requires a different approach. One strategy is to treat the NPU similarly to an accelerator like an FPGA: compile model kernels for it and offload via a runtime. AMD’s use of Vitis AI (from Xilinx) suggests that models can be partitioned such that the heavy matrix multiplications run on the NPU, while the rest runs on CPU/GPU (AMD Ryzen AI). This partitioning could be extended across nodes – for example, each node’s NPU handles a shard of the model’s layers, and at synchronization points the CPUs exchange data between nodes. Using an abstraction like SYCL might eventually allow NPUs to be programmed in a unified way, since SYCL can target various backends (GPU, CPU, accelerators) (CUDA, ROCm, oneAPI? — Running Code on a GPU, Any GPU). In practice, until such support matures, building a cluster with these nodes will likely rely on a hybrid approach: using PyTorch with ROCm for the integrated GPUs (as if they were small accelerators) and calling out to AMD’s NPU runtime for specific layers. The cluster scheduler can allocate each model layer or operation to “device:GPU” or “device:NPU” as appropriate. Because this is model-independent, developers can fine-tune any PyTorch/TensorFlow model – the framework will use the GPU/NPU devices present. Over time, as AMD potentially releases ROCm drivers for their NPUs or deeper integration with frameworks, the cluster can seamlessly take advantage of those improvements. The key is that our design is heterogeneous at heart: it recognizes the strengths of CPU, GPU, and NPU, and uses each in a complementary way. With careful low-level optimization (e.g., pinning memory to large pages, aligning data for SIMD, pre-fetching weights into NPU SRAM), we can unlock the full potential of Ryzen AI Max nodes in a distributed fine-tuning setting.
FSDP – Fully Sharded Data Parallel: For training or fine-tuning very large models (think hundreds of billions of parameters) across multiple nodes, sharding the model’s parameters is essential. Fully Sharded Data Parallel (FSDP) is a technique where each node (or each device) holds only a fraction of the model parameters, instead of a full copy (Fully Sharded Data Parallel: faster AI training with fewer GPUs Engineering at Meta -). During forward and backward passes, devices exchange the necessary parameter chunks so that each can compute its portion of the network. Meta’s implementation of FSDP shards an AI model’s weights across data-parallel workers and overlaps communication with computation to hide latency (Fully Sharded Data Parallel: faster AI training with fewer GPUs Engineering at Meta -) (Fully Sharded Data Parallel: faster AI training with fewer GPUs Engineering at Meta -). The advantage is significant memory savings: with FSDP (especially in its ZeRO-3 form), redundant memory usage is eliminated, allowing, for example, a 450 B parameter model to be divided across, say, 8 or 16 GPUs/NPUs without any single device needing to load all 450 B weights. This method was designed to scale to models that can’t fit on one GPU by partitioning weights, gradients, and optimizer states across the cluster (Zero Redundancy Optimizer - DeepSpeed). In our AMD cluster context, FSDP can be used to split the giant model across the NPUs/GPUs of multiple 395 nodes. Each node’s NPU might store only 1/N of the model’s layers or parameters, and they synchronize via all-reduce or scatter-gather operations. The “fully sharded” approach ensures memory and compute load are balanced. One key benefit is that FSDP works at the framework level (PyTorch has native FSDP support): it’s model-agnostic, so whether you fine-tune a 70 B LLaMA or a 400 B GPT variant, the mechanism is the same. Developers benefit from the conceptual simplicity that each worker still runs a piece of every batch (like standard data parallel) but without the memory overhead of full model replicas (Fully Sharded Data Parallel: faster AI training with fewer GPUs Engineering at Meta -). For our cluster, enabling FSDP means even with just two nodes to start, one can fine-tune models larger than a single node’s memory would allow. As you scale to 4, 8, 16+ nodes, the same approach extends, simply partitioning the model states more finely. FSDP is also compatible with CPU offloading – parts of the model or optimizer states can reside in CPU RAM and stream to the NPU/GPU when needed (Fully Sharded Data Parallel: faster AI training with fewer GPUs Engineering at Meta -), which is useful given our nodes have powerful CPUs and potentially large RAM (e.g. 64–128 GB per node) to supplement the on-die accelerators.
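A minimal PyTorch sketch of this wrapping is shown below. It assumes a torchrun-style launch with a working process group; the Gloo backend and CPU tensors are stand-ins for whatever device backend ultimately fronts the NPU/GPU on these nodes, and FSDP's cpu_offload option maps onto the offloading discussed above.

```python
# Sketch of wrapping a model in Fully Sharded Data Parallel (PyTorch native FSDP).
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # Launched e.g. via `torchrun --nnodes=2 --nproc_per_node=1 train.py`.
    dist.init_process_group(backend="gloo")   # placeholder backend for this sketch
    layer = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16)
    model = torch.nn.TransformerEncoder(layer, num_layers=24)

    # Each rank keeps only a shard of parameters, gradients, and optimizer state;
    # full parameters are gathered on the fly for each layer's forward/backward.
    sharded = FSDP(model)
    optim = torch.optim.AdamW(sharded.parameters(), lr=1e-5)

    x = torch.randn(128, 8, 1024)             # (seq, batch, d_model) dummy micro-batch
    loss = sharded(x).mean()
    loss.backward()
    optim.step()

if __name__ == "__main__":
    main()
```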
Tensor Parallelism: While FSDP shards model parameters, tensor model parallelism (TP) splits the computations within each layer across devices. This is useful when a single layer (like a giant dense matrix multiply in a 450 B model) is too large or too slow for one NPU/GPU. With tensor parallelism, multiple devices hold different parts of that layer’s weight matrix and collectively compute the output. For example, if you have a fully-connected layer with shape [N, M], two devices could each store half the columns of the weight (size [N, M/2]) and each compute half of the output features, combining their results at the end. This reduces per-device memory and multiplies throughput at the cost of an extra all-reduce or concat operation. In practice, libraries like Megatron-LM use tensor slicing to spread transformer feed-forward and attention projections across GPUs. In an AMD cluster, we could enable tensor parallelism across the NPUs of two or more nodes – effectively treating the collection of NPUs as one giant accelerator. This can be combined with FSDP for maximum effect. For instance, one popular strategy is 2D parallelism: use tensor parallelism within a node (across the NPU and integrated GPU, or across multiple NPUs if the node had them) and use FSDP across nodes (2D Parallelism (Tensor Parallelism + FSDP) — lightning 2.4.0 documentation). This hybrid approach leverages the memory saving of sharding while still accelerating each layer’s math with multiple devices in parallel (2D Parallelism (Tensor Parallelism + FSDP) — lightning 2.4.0 documentation). The result is better scaling: memory is less likely to be the bottleneck, and adding more nodes can increase effective compute on each layer. In concrete terms, our cluster could split a 450 B model such that each node handles a subset of layers (via FSDP) and for the largest layers, both nodes’ NPUs cooperate (via TP) to compute them. PyTorch Lightning and DeepSpeed provide utilities to set up such parallelism patterns with minimal changes to model code.
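The sketch below shows the core idea for a column-sliced linear layer: each rank holds a slice of the weight and an all-gather reassembles the output. It assumes an already-initialized process group; a production implementation would use Megatron-style layers or torch's tensor-parallel utilities with an autograd-aware all-gather rather than this bare-bones version.

```python
# Column-parallel linear layer sketch: each rank stores [out_features/world, in_features].
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world = dist.get_world_size()
        assert out_features % world == 0, "output features must divide evenly across ranks"
        self.local_out = out_features // world
        self.weight = torch.nn.Parameter(torch.randn(self.local_out, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local = x @ self.weight.t()                                  # [..., local_out]
        shards = [torch.empty_like(local) for _ in range(dist.get_world_size())]
        # For training, an autograd-aware all_gather (torch.distributed.nn) would be
        # used instead so gradients flow back to every rank's shard.
        dist.all_gather(shards, local)
        return torch.cat(shards, dim=-1)                             # [..., out_features]
```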
Pipeline (Layer-wise) Parallelism: Very large models can also be split by layers across nodes. In pipeline parallelism, each node (or group of devices) is assigned a consecutive set of layers in the neural network. For example, Node 1 might hold layers 1–12 of a transformer, Node 2 holds layers 13–24, and so on. During training or inference, mini-batches are broken into micro-batches that are passed through the pipeline: Node 1 processes micro-batch 1 and passes its output activations to Node 2 while simultaneously starting on micro-batch 2, and Node 2 does the same in a staggered fashion. This layer-wise batching keeps all nodes busy in parallel once the pipeline is filled, greatly increasing hardware utilization. The trade-off is a one-time pipeline latency (the first batch has to traverse all nodes sequentially) and extra memory to store intermediate activations in transit. For our fine-tuning cluster, pipeline parallelism can be a powerful approach, especially when combined with the integrated high-speed fabric of these APUs. Each 395 node has fast PCIe and potentially Infinity Fabric links; with a high-speed network between nodes, sending activation tensors from one node’s NPU to the next node’s NPU can be done with low latency. By distributing different layers to different nodes, we also naturally fit very deep models into the cluster memory. This approach is model-independent (any sequential model can be pipelined) and works even starting with just two nodes – you’d split the model roughly in half. As you scale to more nodes, you split into more stages. Modern frameworks (like DeepSpeed’s pipeline engine or PyTorch’s Pipe modules) automate splitting and micro-batch scheduling. An added optimization is interleaved pipelining where multiple micro-batches are in flight and some stages may even host multiple layers to balance compute times ([PDF] Efficient Large-Scale Language Model Training on GPU Clusters ...). In summary, pipeline parallelism (a form of layer-wise distribution) complements FSDP/TP by addressing depth scaling. It’s particularly useful for inference serving: one can dedicate nodes to specific portions of the model and achieve streaming throughput once the pipeline is saturated, effectively batching across layers.
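Conceptually, the forward pass of a two-stage pipeline over point-to-point sends looks like the sketch below. Shapes, layer assignment, and the blocking send/recv calls are illustrative only; a real deployment would use DeepSpeed's pipeline engine or PyTorch's pipelining utilities, which also handle backward passes and micro-batch scheduling.

```python
# Two-stage pipeline sketch: rank 0 runs the first half of the layers and streams each
# micro-batch's activations to rank 1, which runs the second half. Assumes a process
# group initialized by the launcher (e.g., torchrun across two nodes).
import torch
import torch.distributed as dist

def forward_pipeline(stage_layers: torch.nn.Module, rank: int, num_microbatches: int,
                     shape=(4, 128, 1024)):
    for _ in range(num_microbatches):
        if rank == 0:
            x = torch.randn(*shape)                # stand-in for the next micro-batch
            dist.send(stage_layers(x), dst=1)      # hand activations to the next stage
        else:
            x = torch.empty(*shape)
            dist.recv(x, src=0)                    # wait for upstream activations
            y = stage_layers(x)                    # stage 1 works on micro-batch i while
                                                   # stage 0 already starts micro-batch i+1
```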
Load Balancing and ZeRO-Offload: When splitting models, not all layers or parameters are equal in size or compute cost. Techniques like ZeRO-3 (Zero Redundancy Optimizer) help manage this by partitioning not just model weights, but also optimizer states and gradients across devices (Zero Redundancy Optimizer - DeepSpeed). This removes duplication and allows dynamic balancing. For instance, if one node has more available memory (say one node in the cluster also has a discrete GPU installed), ZeRO can allocate slightly more of the model’s shards to that node. Another aspect is offloading: ZeRO can offload optimizer state updates to CPU memory without slowing down training significantly, especially useful in fine-tuning where batch sizes are small. In a Ryzen AI cluster, one could pin the large static weights in NPU/GPU memory (or even NVMe swap using memory-mapped techniques if needed), and use the system RAM (and those 16 CPU cores) to handle optimizer computations. This frees up the NPU to strictly do forward-backward matrix crunching. The ZeRO-3 stage essentially gives us memory elasticity – we aren’t constrained by each node’s VRAM or NPU SRAM, as the cluster’s aggregate memory (including CPUs) is utilized. When fine-tuning a 450 B model, ZeRO-3 would shard the optimizer states (which can be 2-3× the model size) across nodes, making it feasible even when each accelerator can devote only a modest slice of unified memory (16 GB, for example) to model shards. All these parallelism strategies – FSDP, TP, pipeline, ZeRO – are complementary and often used together in training at scale (2D Parallelism (Tensor Parallelism + FSDP) — lightning 2.4.0 documentation). Our framework should allow pluggable combinations based on model needs. For example, developers might start with FSDP+ZeRO (which is simple to enable in PyTorch) for a 70 B model on two nodes; for a 450 B model on, say, 8 nodes, they might use 2-way tensor parallel per node and 4-way pipeline across nodes, with ZeRO handling optimizer sharding. We will provide configuration recipes for these scenarios, so the approach remains general and model-agnostic. The goal is that a broad user base – even those without distributed systems expertise – can fine-tune large open-source models by simply toggling these high-level settings.
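For reference, a ZeRO-3 configuration with CPU offload might look like the sketch below, written as the Python dict handed to deepspeed.initialize(). The batch sizes and offload targets are illustrative starting points rather than tuned recommendations for this hardware.

```python
# Sketch of a DeepSpeed ZeRO-3 configuration with optimizer/parameter offload to CPU RAM.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                               # shard params, grads, optimizer state
        "offload_optimizer": {"device": "cpu"},   # keep Adam state in system RAM
        "offload_param": {"device": "cpu"},       # stream parameter shards in as needed
        "overlap_comm": True,                     # overlap collectives with compute
    },
}

# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)
```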
Inference Scaling with Ray Serve and Triton: After fine-tuning, serving the model efficiently is the next challenge. Two promising tools for distributed inference are Ray Serve and NVIDIA Triton Inference Server (which can also run on CPUs/ROCm). Ray Serve is a scalable model serving library that can orchestrate multiple nodes and devices behind a unified API (Low-latency Generative AI Model Serving with Ray, NVIDIA Triton Inference Server, and NVIDIA TensorRT-LLM). It allows you to deploy the fine-tuned model as a microservice, automatically batch incoming requests and route them to different workers. In a multi-node AMD cluster, Ray Serve can coordinate so that each request utilizes all the nodes if needed (for example, if the model is sharded or pipelined across nodes), or dispatch different requests to different nodes if the model is replicated for throughput. Ray’s advantage is easy multi-node scaling – it can manage a cluster of heterogeneous resources and provides features like auto-scaling and load balancing out of the box (Low-latency Generative AI Model Serving with Ray, NVIDIA Triton Inference Server, and NVIDIA TensorRT-LLM). We could integrate Ray Serve with our training setup so that the same shards used for FSDP training are kept in memory and used for inference, amortizing the cost of loading the model. On the other hand, Triton Inference Server is an optimized serving runtime that supports multiple frameworks and can maximize throughput on GPUs (and potentially NPUs via ONNX). Triton excels at squeezing out every bit of performance by using tensor cores, scheduling GPU batches, and supporting features like concurrent model execution. We can run Triton on each node (since it supports AMD GPUs via ROCm and CPUs via ONNX Runtime) and use Ray to distribute queries to the Triton instances on each node. This hybrid approach leverages Triton’s optimized backend (for things like accelerated kernels, dynamic batching) and Ray’s distributed computing ability (Triton Inference Server Ray Serve Deployment - NVIDIA Docs Hub). Notably, Triton has a backend for FasterTransformer and TensorRT, which are highly optimized for transformer models, including fused kernels and int4/int8 support. Our cluster could use Triton with a custom backend for the XDNA NPU (possibly through the Vitis AI runtime) – this would require writing a Triton backend that interfaces with AMD’s NPU execution provider. While somewhat advanced, it means end-users could send a query to the cluster and Triton would handle splitting the input across NPUs, collecting outputs, and returning the result, all in an optimized C++ runtime. In summary, inference scaling in our framework will use Ray Serve as the high-level orchestrator to easily add more nodes or instances, and Triton or similar optimized inference servers at the low-level to maximize hardware utilization per node. This ensures that the fine-tuned models are not just trained efficiently but also served to end-users with low latency and high throughput, making the cluster useful for real-world applications (chatbots, AI services, etc.) beyond just research.
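A minimal Ray Serve sketch of this pattern is shown below. The placeholder model class stands in for whatever per-node backend actually executes the fine-tuned model (Triton, ONNX Runtime on the NPU, etc.), and the replica count is illustrative.

```python
# Sketch of a Ray Serve deployment fronting per-node inference workers.
from ray import serve

class _PlaceholderModel:
    """Stand-in for the real sharded model attached to this node's NPU/GPU."""
    def generate(self, prompt: str) -> str:
        return f"(generated continuation of: {prompt})"

@serve.deployment(num_replicas=2)        # e.g., one replica per node
class LLMService:
    def __init__(self):
        self.model = _PlaceholderModel()

    async def __call__(self, request):
        prompt = (await request.json())["prompt"]
        return {"text": self.model.generate(prompt)}

app = LLMService.bind()
# serve.run(app)   # exposes an HTTP endpoint; Ray load-balances across replicas
```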
High-Speed Networking – InfiniBand vs Ethernet: In a multi-node fine-tuning setup, the choice of interconnect can dramatically affect performance. Large models require frequent gradient exchanges, weight synchronization, or activation passing between nodes. Traditionally, HPC and AI clusters use InfiniBand (IB) for its low latency and RDMA capabilities. InfiniBand was designed to overcome Ethernet’s historical limitations (like high latency and packet loss), offering very low message latency (on the order of microseconds) and efficient CPU offload. However, modern Ethernet has narrowed the gap significantly. With technologies like RoCE (RDMA over Converged Ethernet), QoS for lossless delivery, and 100–400 GbE speeds, a tuned Ethernet network can approach IB-level performance in many AI workloads (The Battle of AI Networking: Ethernet vs InfiniBand - WWT). In fact, tests have shown that with similar bandwidth (e.g. both at 400 Gb/s), the difference in AI training throughput between InfiniBand and a well-configured Ethernet fabric can be under 1% (The Battle of AI Networking: Ethernet vs InfiniBand - WWT). The implication is that for our cluster, while InfiniBand would be ideal (especially for scaling to a large number of nodes where consistent latency matters), a high-speed Ethernet solution (25 GbE as a minimum, but preferably 100 GbE or more) with proper settings could suffice. InfiniBand still has the edge in out-of-the-box latency and collective communication efficiency – for instance, NVIDIA’s NCCL library is highly optimized for IB and can do things like in-network reductions on Mellanox switches. If ultimate scaling to, say, 16+ nodes is planned, using InfiniBand (NDR 400 Gb/s or similar) will minimize communication overhead. Ethernet, on the other hand, offers flexibility and cost advantages, and many cloud-like deployments use Ethernet for AI clusters by leveraging advanced switches. Our recommendation is to equip the cluster with at least 100 Gb/s interconnect between nodes. This could be 100 GbE with RoCE enabled, or HDR100 InfiniBand (both deliver ~100 Gb/s user bandwidth). With two nodes, even 25 GbE can manage (as the communication volume is lower), but for larger deployments, 100 Gb+ ensures gradients from a 450 B model (which can be many gigabytes per step) don’t create a bottleneck. Additionally, we will use techniques like gradient compression or quantization during communication to further reduce bandwidth needs. For example, compressing gradients to 16-bit or 8-bit before all-reduce can cut traffic in half or more, at minimal impact to convergence. In summary, a low-latency network fabric is key to cluster efficiency. InfiniBand provides specialized hardware acceleration for this, while Ethernet requires careful tuning (PFC, ECN, etc.) to avoid congestion. Both can work – we lean towards commodity Ethernet for accessibility, noting that it “can push data with the same bandwidth, latency and reliability as InfiniBand” with proper tweaks (The Battle of AI Networking: Ethernet vs InfiniBand - WWT), but if the option is available, InfiniBand will give extra headroom for scaling with less engineering effort.
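As one concrete example of the compression idea, PyTorch's built-in DDP communication hooks can cast gradients to fp16 (or bf16) just for the all-reduce, roughly halving the bytes on the wire. The sketch below assumes the process group has already been initialized by the launcher; the wrapped module is a stand-in for the real model.

```python
# Sketch: compress DDP gradients to fp16 on the wire using the built-in comm hook.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

# Assumes dist.init_process_group(...) has already been called by the launcher.
model = DDP(torch.nn.Linear(4096, 4096))     # placeholder for the fine-tuned module
# Gradients are cast to fp16 only for the all-reduce, then restored afterwards;
# default_hooks.bf16_compress_hook is the analogous bf16 option.
model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```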
Reducing Data Transfer via Quantization: One major way to optimize both computation and communication is efficient quantization of model data types. By using lower precision (such as 8-bit or 4-bit integers) for model weights and activations, we drastically cut memory usage and bandwidth. For instance, quantizing a model from 16-bit to 4-bit reduces memory by 4×, which means 4× fewer bytes to move between nodes or to read from memory. The AMD Ryzen AI Max platform shines here – the XDNA 2 NPU is designed to excel at low-precision arithmetic. Its 50 TOPS rating is for “minimum precision” operations, likely int8 or even int4 accumulation (AMD Unveils Its Fastest Edge AI Chips Yet: The Ryzen AI Max and Ryzen AI Max+ Strix Halo Families - Hackster.io). AMD’s CES demo explicitly used a 4-bit quantized 70 B model (Llama 70B Q4) to achieve the huge speedup over the 4090 GPU (AMD Unveils Its Fastest Edge AI Chips Yet: The Ryzen AI Max and Ryzen AI Max+ Strix Halo Families - Hackster.io). In our cluster, we can use 4-bit weight quantization both for inference and fine-tuning (with techniques like QLoRA). During fine-tuning, QLoRA keeps a 4-bit base model in GPU/NPU memory and learns low-rank 16-bit weight updates, allowing adaptation of large models on smaller hardware. This approach would be perfect for our nodes – the NPU can handle the int4 matrix multiplies for the forward pass, while the CPU/GPU applies the small high-precision weight updates. The benefit is a massive reduction in memory and compute load per token. For inference, we can quantize weights to 4-bit and even activations to 8-bit (using techniques to maintain accuracy). Projects like Llama.cpp and MLC-LLM have proven that 4-bit quantized LLMs can run with minimal loss in output quality while greatly boosting speed and memory efficiency (MLC | Scalable Language Model Inference on Multiple NVIDIA and AMD GPUs). In a multi-node scenario, quantization also means network messages are smaller: instead of sending 32-bit floats for a layer’s activations to the next node in a pipeline, we could send 8-bit integers – a 4× reduction in bandwidth. Our framework will include support for int8/int4 training and inference. This includes using libraries like Transformers with 8-bit (via bitsandbytes) or the ONNX Runtime with integer models on the NPU. We will carefully choose quantization schemes that the hardware accelerates – for example, vector quantization that aligns with AVX512 VNNI on CPU or the NPU’s supported data types. By doing so, we reduce the “computational load” on each node, effectively allowing each 50 TOPS NPU to behave like a much larger accelerator because each operation is smaller. Empirical results have shown nearly linear speed-ups in multi-GPU inference when using 4-bit models (MLC | Scalable Language Model Inference on Multiple NVIDIA and AMD GPUs). For example, 4-bit Llama2-70B can reach ~30 tokens/sec on 2 GPUs (MLC | Scalable Language Model Inference on Multiple NVIDIA and AMD GPUs) – in our case, 2 NPUs might similarly achieve high throughput, given their strength in low-precision math. The cluster will incorporate tooling to automatically convert models to lower precision and to use calibration techniques to maintain accuracy. Ultimately, by trading off a bit of precision, we gain disproportionately in speed, memory, and network efficiency – a crucial trade-off to democratize running 100B+ models on modest hardware setups.
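To make the QLoRA-style recipe concrete, the sketch below shows the standard Hugging Face transformers + peft + bitsandbytes pattern: a 4-bit NF4 base model with small trainable LoRA adapters. The model name is a placeholder, and on this hardware the 4-bit kernels would need a backend that targets the NPU/iGPU rather than the CUDA path bitsandbytes assumes today.

```python
# Sketch of a QLoRA-style setup: 4-bit quantized base weights plus 16-bit LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# "open-llm-70b" is a placeholder model identifier.
base = AutoModelForCausalLM.from_pretrained("open-llm-70b", quantization_config=bnb)

lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora)   # only the low-rank adapters receive gradients
```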
Optimizing Communication Patterns: Beyond raw bandwidth, how and when data is exchanged between nodes matters. We will implement optimized communication strategies to minimize latency impact. One such idea is overlapping computation with communication – for example, in FSDP or pipeline parallel, as soon as one layer’s outputs are produced, we begin transmitting them to the next node while the current node moves on to the next chunk of work. This pipelining of communication ensures network transfer happens in parallel with NPU computation, hiding the latency. We can use libraries like NCCL (optimized for GPUs, but possibly usable with our network for CPU transfers) or custom MPI with RDMA to schedule non-blocking sends/receives of tensors. Another strategy is gradient accumulation and compression: instead of sending every micro-batch’s gradients, accumulate a few and send one averaged (or quantized) message. This reduces message frequency and allows better utilization of network bandwidth (larger, fewer packets). In inference serving, if using pipeline parallel, we’ll employ micro-batch batching – grouping multiple inference requests and sending them through the pipeline concurrently, so that the cost of a multi-node inference is amortized over several queries. This is akin to throughput mode where you sacrifice a bit of per-query latency to drastically increase total throughput, which is acceptable in many applications. We’ll also look at topology-aware placement: if more than two nodes, maybe arrange model parallel partitions on nodes that are directly connected (to avoid extra switch hops). AMD’s clustering of their CPUs (Infinity Fabric) might allow some special optimizations if nodes are in the same chassis or have a direct fabric link. Additionally, for the integrated GPU and NPU on the same chip, we leverage the fact that they share system memory – moving data between the NPU and GPU on one node is effectively zero-copy (no PCIe transfer needed, since it’s one APU), which is an architectural win. This means on a given node, the CPU, GPU, and NPU can all share a common memory pool for the model. We will ensure our software uses unified memory or managed memory so that, for instance, if the GPU needs an activation computed by the NPU, it can just read it via a pointer rather than an explicit copy. This is an advantage of AMD’s APU approach that not even multi-GPU servers have (they often rely on NVLink or PCIe for GPU-GPU communication). By exploiting this, each node can internally split work between GPU and NPU with minimal overhead, and the only network traffic is truly between different nodes, which we optimize as discussed. To summarize, we treat network optimization as a first-class concern: using the fastest interconnect available, reducing the amount of data sent through quantization, overlapping comms with compute, and using smart scheduling to avoid idle times. With these measures, even a cluster of modest nodes can achieve near-linear scaling on large models – meaning if one node can process X tokens per second, two nodes should get close to 2X, four nodes ~4X, and so on, up to the point where communication costs begin to dominate.
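The basic overlap pattern with torch.distributed looks like the sketch below: start a non-blocking all-reduce, keep computing, and only wait on the handle when the reduced tensor is actually needed. It assumes the process group has been initialized by the launcher; heavy_local_compute() is a placeholder for the next chunk of NPU/CPU work.

```python
# Sketch of overlapping communication with computation via an async all-reduce.
import torch
import torch.distributed as dist

def heavy_local_compute() -> torch.Tensor:
    # Stand-in for real work (next layer's forward, next micro-batch, etc.).
    return torch.randn(2048, 2048) @ torch.randn(2048, 2048)

grad_prev = torch.randn(4096, 4096)                    # a gradient that just finished
handle = dist.all_reduce(grad_prev, async_op=True)     # starts the transfer, returns at once

next_chunk = heavy_local_compute()                     # compute proceeds during the transfer

handle.wait()                                          # block only when the result is needed
grad_prev /= dist.get_world_size()                     # finish averaging after the wait
```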
Layer-Wise Parallel Execution: In section 2 we discussed splitting layers across nodes (pipeline). Here we emphasize the performance aspect: by distributing model layers in parallel across nodes, we effectively batch across layers. For example, suppose each node has 2 NPUs (or 1 NPU and 1 GPU) and we split the model into two halves. While Node 1’s NPU is computing layer 1 on batch A, Node 2’s NPU can simultaneously compute layer 2 on batch B. When Node 1 finishes, it sends batch A’s output to Node 2, and Node 2 sends batch B’s output to Node 1 (if going in reverse for backprop, or if more pipeline stages exist). This kind of asynchronous parallelism keeps all NPUs busy. The trick is choosing an optimal micro-batch size and number of pipeline stages such that the pipeline is always full. Our cluster control software will likely incorporate a pipeline scheduler (like the PopART scheduler in Graphcore’s stack or DeepSpeed’s pipeline module) to automate this. In inference, this is analogous to an assembly line: multiple token evaluations are happening in different stages of the model concurrently. The outcome is improved throughput, especially when each stage’s compute time is roughly equal (we may need to group layers to balance). In practice, if one layer group is heavier, we could allocate more NPUs to it (if a node has an idle GPU, for instance, it could assist the NPU for that stage). This concept of layer-wise parallel execution ensures that adding more nodes (and hence more pipeline stages) yields a proportional increase in the number of tokens processed per second, up to the limits of the model’s inherent sequential dependencies. It’s an important part of being easily scalable for larger deployments – you don’t want diminishing returns after adding a few nodes. By careful design (equalizing work per stage, overlapping communication), we aim for close to linear scaling in the number of nodes, at least until dozens of nodes. This means our cluster framework can grow from 2 nodes (small lab setup) to, say, 32 nodes (serious research model training) with predictable performance gains, making it useful for a broad range of users.
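A simple way to do this grouping is a greedy partition over per-layer cost estimates, as sketched below. The costs would come from profiling each layer on the actual hardware; the numbers in the example are made up.

```python
# Greedy layer-to-stage partitioner: group layers into pipeline stages so that each
# stage's estimated cost (e.g., profiled milliseconds per layer) is roughly equal.
def partition_layers(costs: list[float], num_stages: int) -> list[list[int]]:
    target = sum(costs) / num_stages
    stages, current, acc = [], [], 0.0
    for i, c in enumerate(costs):
        current.append(i)
        acc += c
        # Close the stage once it reaches the per-stage budget, keeping room for the rest.
        if acc >= target and len(stages) < num_stages - 1:
            stages.append(current)
            current, acc = [], 0.0
    stages.append(current)
    return stages

# Example: 12 layers, layer 5 twice as heavy, split across 3 nodes.
print(partition_layers([1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1], 3))
# -> [[0, 1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11]]
```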
Speculative Decoding with Distributed Resources: In addition to traditional parallelism, we can explore algorithmic innovations like speculative decoding to speed up generative inference. Speculative decoding involves using a smaller “draft” model to predict several tokens ahead, and then having the large model confirm or correct those predictions in a single batch, rather than generating tokens one by one. This method can drastically improve throughput, often 2–3× faster, without quality loss (Speculative Decoding — Make LLM Inference Faster - Medium). In a multi-node setup, we can dedicate one node (or one NPU) to run a fast draft model (for example, a distilled version of our main model) that generates k candidate tokens, and have the rest of the cluster run the large model to validate those tokens in parallel. This is a form of task parallelism on top of model parallelism – different nodes doing different tasks concurrently. Recent research on Distributed Speculative Inference (DSI) has shown that orchestrating a draft model and a target model in parallel can guarantee speedups over standard autoregressive decoding (Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference | OpenReview) (Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference | OpenReview). Essentially, while the main model is busy processing the current token, a draft model speculates the next few tokens; the main model can then jump directly to evaluating those in one go, skipping ahead. By splitting these roles across nodes, we minimize idle time – the draft node is always a few tokens ahead. If the speculation is correct, we saved time; if not, we fall back gracefully. The cluster can even run multiple speculative branches – for instance, Node A and Node B each hypothesize different continuations from a given state, and Node C (the main model) evaluates both in parallel and selects the better outcome. This tree-based decoding could utilize the cluster’s extra compute to explore multiple possibilities concurrently, thus reducing latency of finding a high-quality sequence. While speculative decoding is primarily an inference technique, it underscores the flexibility of our framework: by combining model parallelism with novel parallel algorithms, we push token generation throughput beyond what pure hardware scaling would give. We plan to integrate options for speculative decoding in the serving stack (likely via a modified scheduler in Ray Serve or in the model code itself). The user could toggle a “fast generation” mode where a smaller model (possibly 6 B or 13 B parameter size) runs on one NPU to guide the big model (say 70 B) running on the rest. This is especially useful for long text generation, where the cluster can maintain a high token per second rate even as it generates thousands of tokens. As research like DSI suggests, 1.3–1.9× speed-ups over standard decoding are achievable even in single-node setups (Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference | OpenReview), so distributed speculative decoding could yield additive improvements on top of our existing parallelism – truly improving tokens generated per second in a way users will notice.
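The control flow of a single speculative step is sketched below with placeholder callables for the draft and target models. In our setting, draft_next_tokens would run on one node's NPU and target_argmax would be the sharded large model; real implementations batch the k verifications into one target forward pass rather than looping.

```python
# Minimal speculative-decoding sketch over abstract token sequences.
def speculative_step(prefix, draft_next_tokens, target_argmax, k=4):
    """One speculative step: the draft proposes k tokens, the target verifies them."""
    proposal = draft_next_tokens(prefix, k)                 # k cheap guesses
    accepted = []
    for tok in proposal:
        # The sequential loop here just shows the accept/correct logic; in practice
        # all k verifications are evaluated in a single large-model batch.
        verified = target_argmax(prefix + accepted)
        if verified == tok:
            accepted.append(tok)                            # guess confirmed, keep going
        else:
            accepted.append(verified)                       # take the target's token, stop
            break
    return prefix + accepted

# Toy usage with stand-in "models" over integer tokens:
draft = lambda ctx, k: [len(ctx) + i for i in range(k)]     # guesses the next integers
target = lambda ctx: len(ctx)                               # "true" next token = length
print(speculative_step([0, 1, 2], draft, target))           # -> [0, 1, 2, 3, 4, 5, 6]
```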
Fusion Kernels and Operator Optimizations: To reduce computational overhead, we will employ fusion kernels – combining multiple neural network operations into one efficient kernel wherever possible. This reduces the number of separate compute launches and memory passes. In GPU terms, this avoids being kernel launch bound at small batch sizes (Making Transformer inference faster on GPUs - performance - PyTorch Developer Mailing List). For our AMD cluster, fusion is equally beneficial. For example, in a transformer, instead of computing linear projection -> add bias -> activation -> dropout as four steps, a fused kernel can do the projection, add bias, and apply activation in one pass over the data. This not only saves time but also keeps data localized in faster registers or SRAM. AMD’s NPU is very amenable to this since its programming model (via Vitis) allows defining custom dataflow graphs on the hardware. We can instruct the NPU to perform a whole sequence of ops on each token or batch, rather than loading and storing after each op. On the CPU side, we can use libraries that fuse layer normalization and matrix multiplies (oneDNN does this for x86, for instance) to use cache more efficiently. NVIDIA’s Transformer Engine uses fused kernels extensively (along with mixed precision) to achieve high efficiency (Making Transformer inference faster on GPUs - performance), and we will adopt similar tactics. A concrete example is FlashAttention, a fused attention kernel that computes the attention softmax and weighted sum in one pass, reducing memory traffic and enabling use of high-speed SRAM for intermediate results. Incorporating FlashAttention or an AMD-optimized equivalent will significantly speed up the attention layers, which are often memory-bound. Similarly, fused optimizers (update weights + apply decay in one kernel) can speed up backprop. We will monitor profiling results: if we see, say, layer norm taking a significant fraction of time, we’ll attempt to fuse it with neighboring ops. Luckily, projects like Hugging Face’s Accelerate and DeepSpeed already provide many fused ops for popular models (e.g., DeepSpeed has fused Adam optimizer, fused transformer block implementations). We will ensure those are compatible with ROCm/AMD or implement our own where needed. The effect of fusion is more pronounced in distributed scenarios because often each micro-batch per device is small (to keep latency low or because of memory constraints). In those cases, launching many tiny ops would leave the NPU/CPU underutilized. By fusing, we amortize overhead and approach the ideal compute utilization. One user-friendly aspect is that this can be largely hidden under the hood – developers don’t need to change their model code; the cluster’s runtime or libraries will just execute a more optimized kernel. We can cite that frameworks which applied full transformer fusion have achieved up to ~4× speedups in small-batch inference (Making Transformer inference faster on GPUs - performance - PyTorch Developer Mailing List) (Making Transformer inference faster on GPUs - performance - PyTorch Developer Mailing List). Our cluster will leverage these best practices so that even at batch size 1 (as is common in interactive LLM usage), the hardware is kept busy doing actual math rather than waiting on memory or runtime. In summary, kernel fusion and operator optimization reduce overhead per token generated or per sample trained, which directly translates to faster fine-tuning runs and snappier inference serving.
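As an illustration of leaning on framework-level fusion rather than hand-written kernels, the sketch below uses PyTorch's fused scaled_dot_product_attention and wraps the surrounding pointwise ops in torch.compile. Whether the fused paths are actually available on a given CPU/GPU/NPU backend depends on that backend, so treat this as indicative rather than a guaranteed speedup.

```python
# Sketch: fused attention plus compiler-fused pointwise ops.
import torch
import torch.nn.functional as F

def attention_block(q, k, v, w_out, b_out):
    attn = F.scaled_dot_product_attention(q, k, v)   # dispatches to a fused attention kernel
    return F.gelu(attn @ w_out + b_out)              # matmul + bias + GELU: fusion candidates

fused = torch.compile(attention_block)               # let the compiler fuse what it can

q = k = v = torch.randn(1, 8, 128, 64)               # [batch, heads, seq, head_dim]
w_out, b_out = torch.randn(64, 64), torch.randn(64)
out = fused(q, k, v, w_out, b_out)
```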
Tensor Slicing Strategies: “Tensor slicing” refers to splitting a single tensor across devices – essentially what we discussed with tensor model parallelism. But beyond standard TP, we can explore slicing in unusual ways to optimize the use of NPUs. For instance, one approach is to slice by tensor depth and apply different precisions: perhaps half of the NPU tiles operate on a tensor in 8-bit mode and the other half in 16-bit to preserve some accuracy, combining results at the end. Another idea is to use the CPU or GPU to handle a slice of a layer that involves a different data type or sparse pattern. Imagine a scenario where 90% of the weights are dense (best for NPU) and 10% are sparse or require higher precision (maybe a small subset of model like an embedding for rare tokens) – one could run the dense part on NPU and the rest on CPU in parallel, then merge. While these are advanced techniques, they highlight that the cluster’s heterogeneity can be leveraged at the tensor-operation level, not just layer or model level. The framework could eventually automatically decide to route portions of a computation to the most appropriate hardware. For now, this might be manual: e.g., using PyTorch’s TensorParallel utilities to assign certain columns of a weight matrix to one device vs another. But as the software matures, one could imagine an AI compiler that, given the model graph and devices, partitions each operation for optimal performance. TVM (used in MLC-LLM) is a step in this direction – it can compile a model by splitting ops across multiple GPUs for speed (MLC | Scalable Language Model Inference on Multiple NVIDIA and AMD GPUs) (MLC | Scalable Language Model Inference on Multiple NVIDIA and AMD GPUs). We will keep our framework aligned with such tools so that as they gain ability to target AMD NPUs or multi-device, we can integrate those improvements.
Exploring New Paradigms: The cluster will also allow experimenting with alternative computation paradigms like mixture-of-experts (MoE) and asynchronous training. MoE models activate only a subset of weights per token (routing each token to a few “expert” sub-networks). In a multi-node environment, each node (or each NPU) could host different experts, and tokens are routed over the network to whichever node has the expert needed for that token. This can drastically reduce per-token computation by not using all weights for all tokens. While MoEs introduce complexity (load balancing the experts, dynamic communication), our high-speed network and specialized hardware might handle it efficiently, enabling models that scale to trillions of parameters by distributing experts. Another paradigm is asynchronous pipeline or decoupled architectures where generation of tokens and training can intermix. For example, one could fine-tune a model on Node 1 while Node 2 serves inference requests with the previous version of the model, then they swap when Node 1 finishes an epoch – a form of continuous deployment. The cluster’s design around accessibility means we can provide high-level APIs for these experimental setups so advanced users can push the boundaries of model training and serving. Essentially, by not being tied to one specific model or training algorithm, the framework remains open to incorporate cutting-edge research ideas (like speculative decoding, DSI, MoE) that promise better efficiency. Our guiding principle is to optimize token generation/training time from all angles: hardware utilization, algorithmic shortcuts, and parallelism – ensuring the cluster gets more useful work done per cycle than a naive setup would.
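For the MoE case, the routing logic itself is small; the sketch below shows a top-2 gating network whose expert indices would, in a distributed deployment, determine which node or NPU each token is shipped to. The dimensions are illustrative.

```python
# Sketch of a top-k mixture-of-experts router.
import torch

class TopKRouter(torch.nn.Module):
    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.gate = torch.nn.Linear(d_model, num_experts)
        self.k = k

    def forward(self, tokens: torch.Tensor):
        scores = torch.softmax(self.gate(tokens), dim=-1)        # [tokens, experts]
        weights, expert_ids = torch.topk(scores, self.k, dim=-1)
        # expert_ids says which expert (and therefore which node) each token goes to;
        # the normalized weights are used to mix the experts' outputs back together.
        return expert_ids, weights / weights.sum(dim=-1, keepdim=True)

router = TopKRouter(d_model=1024, num_experts=8)
ids, w = router(torch.randn(16, 1024))
print(ids.shape, w.shape)   # 16 tokens, 2 chosen experts each
```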
Adapting MLC-LLM and Llama.cpp: To maximize efficiency on AMD hardware, we will draw inspiration from projects that optimize LLMs for resource-constrained environments. MLC-LLM (Machine Learning Compilation for LLMs) has demonstrated the ability to compile models for multiple GPUs and backends with near-optimal performance (MLC | Scalable Language Model Inference on Multiple NVIDIA and AMD GPUs) (MLC | Scalable Language Model Inference on Multiple NVIDIA and AMD GPUs). They achieved 34.5 tokens/s on Llama2-70B with 2 NVIDIA 4090s and 29.9 tokens/s with 2 AMD 7900 XTX GPUs by compiling to each platform’s strengths (MLC | Scalable Language Model Inference on Multiple NVIDIA and AMD GPUs). We intend to integrate a compilation step (optionally) in our workflow: after fine-tuning a model, we can use an MLC-style pipeline to generate optimized code for the specific cluster configuration (e.g., using TVM with ROCm support for the integrated GPU and leveraging any OpenCL hooks for the NPU). This could yield a specialized runtime for the model, improving inference throughput. The fact that MLC can scale beyond two GPUs with consistent gains (MLC | Scalable Language Model Inference on Multiple NVIDIA and AMD GPUs) is encouraging – it means our cluster should scale as more nodes are added, without hitting software bottlenecks. Llama.cpp, on the other hand, has pioneered ultra-efficient CPU inference by using quantization and optimized memory access (mmap paging of model weights). We can adopt similar techniques on our cluster’s CPU side. For instance, Llama.cpp keeps the model weights memory-mapped to avoid loading everything into RAM at once; in a multi-node scenario, one could mmap from a network file system or use memory servers, though our cluster likely has enough RAM per node. More directly, Llama.cpp’s core kernel (essentially a packed matrix multiply using AVX2) could be replaced or augmented with calls to our NPU. A future direction is to integrate the Llama.cpp runtime with AMD’s NPU so that when running on a Ryzen AI machine, it offloads the heavy linear algebra to the NPU via Vitis or ONNX EP. This would let users run Llama.cpp models (which are typically quantized) using the full power of the chip rather than just the CPU. We foresee incorporating this into our framework – for example, providing a mode where the fine-tuned model is exported to GGML format (used by Llama.cpp) and then executed cluster-wide with a custom backend that utilizes NPUs. This could make serving very lightweight: each node could run an instance that waits for a prompt, does minimal coordination with others, and generates tokens using a highly optimized loop. The key benefit is tapping into the community-driven optimizations already present in tools like Llama.cpp and MLC. These projects have shown that with clever low-level work, even consumer hardware can approach the performance of specialized setups. By aligning our framework with their methods (quantization, caching strategies, etc.), we both improve performance and ensure a broad user base is comfortable – many developers already use Llama.cpp or have heard of MLC, so providing an option to use a “clustered Llama.cpp” approach could be very appealing.
Continuous Benchmarking: As we deploy the cluster, we will set up a suite of benchmarks to guide iterative improvements. This will include standard tasks like training throughput on GPT-3-sized models (tokens per second), inference latency for a single token, and end-to-end evaluation on tasks (like how long to generate a 1000-token text). We’ll compare these metrics against known baselines (e.g., an 8×A100 GPU cluster, or a Cloud TPU pod) to identify gaps. Since our hardware is unique (Ryzen AI Max APUs), we’ll likely discover certain operations are slower or faster than expected. For instance, maybe the NPU excels at int8 matmul but struggles with large softmax dimensions. If a benchmark shows the softmax is a bottleneck, we might offload that to the CPU (which can use vector instructions) or implement a fused softmax kernel on the NPU. We will also measure utilization: how busy each component is during training. Ideally, we want the NPUs at high utilization most of the time. If we find, say, the NPU is often waiting on data from the CPU, that suggests a need for better overlap or faster data feeding (perhaps prefetching the next batch to NPU SRAM while the current batch is processing). Our framework will include profiling tools (integration with PyTorch Profiler or AMD’s performance APIs) so developers can see a breakdown per iteration. Over time, these insights will drive iterative tuning: adjusting kernel launch sizes, increasing/decreasing batch sizes, tweaking parallelism strategy. One particular area to watch is the dynamic workload allocation between NPU and CPU. We have the capability to switch ops between devices in real-time – for example, if the NPU is saturated, let the CPU handle some smaller ops concurrently. A dynamic scheduler (perhaps built on Syne Tune or Ray scheduling) could monitor the queue of operations and assign them to NPU or CPU (or GPU) depending on current load and estimated execution time. We will experiment with this: initially with a simple heuristic (e.g., if an operation is a matrix multiply above a certain size, run on NPU; if it’s element-wise or very small, run on CPU to avoid an NPU context switch). As we benchmark, we might discover the thresholds where this pays off. The goal is adaptive utilization – no part of the hardware sits idle if there’s any work it could usefully do. This is especially useful during sequence generation: while the NPU is computing the next token’s output, the CPU could simultaneously be computing the logits of the previous token or running beam search logic, etc. A concrete example: in text generation with beam search, one could have the NPU evaluate the model on the top-K beams in parallel while the CPU sorts and prunes beams between steps. By overlapping these, overall latency drops.
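The benchmarking harness can start very simple, as in the sketch below: time a step function for a fixed number of iterations and report tokens per second. Here step_fn and tokens_per_step are placeholders standing in for the real training or generation step.

```python
# Minimal tokens-per-second benchmark harness for continuous regression tracking.
import time

def benchmark(step_fn, tokens_per_step: int, warmup: int = 3, iters: int = 20) -> float:
    for _ in range(warmup):
        step_fn()                          # let caches and compilers settle
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    elapsed = time.perf_counter() - start
    tps = tokens_per_step * iters / elapsed
    print(f"{tps:,.1f} tokens/s over {iters} iterations")
    return tps

# Toy usage: a dummy step that just burns a little CPU time.
benchmark(lambda: sum(i * i for i in range(100_000)), tokens_per_step=2048)
```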
Leveraging Established Best Practices: We will continually incorporate best practices from established AI hardware providers into our cluster. For example, NVIDIA’s guidelines for multi-GPU training emphasize overlapping communication with computation and using collective communication algorithms optimized for the topology – we’ll do the same, taking advantage of any NCCL replacements for our environment. Google’s TPU pods use a ring-based all-reduce and have very optimized input pipelines – similarly, we will ensure our data loading (perhaps using Ray or Dask across nodes) doesn’t bottleneck training. We’ll also look at experiences from Meta’s Research (which developed FSDP, etc.) and Microsoft’s DeepSpeed team (which championed ZeRO and many optimizer tricks). By grounding our approach in such documentation and prior art, we avoid reinventing wheels and focus on adapting them to AMD’s hardware. AMD’s own documentation on ROCm and performance tuning will guide us in tuning the integrated GPU (for instance, ensuring we use the best GEMM libraries on RDNA, possibly tuning MIOpen for any convolution operations, etc.). The result should be a framework that is holistic in optimization: from the silicon level (making use of XDNA features) to the algorithm level (speculative decoding, etc.). We will document these optimizations so developers can understand what’s happening under the hood, and even opt in/out certain features if needed (for example, turning off speculative decoding if it complicates results, or switching between FP16 vs INT8 training).
Accessibility and Ease of Use: While we employ all these advanced techniques, a core goal is maintaining accessibility. We will package the cluster software in containers with sane defaults. A developer with two Ryzen AI 395 machines should be able to run one script that sets up the environment (installing ROCm, the Ryzen AI SDK (AMD Ryzen AI), etc.), and launches a distributed training job without needing to manually configure NCCL or MPI. High-level APIs or config files can hide the complexity of FSDP, TP, pipeline decisions – providing presets like “auto_split” that uses heuristics based on model size. We’ll also integrate with popular libraries: for example, Hugging Face Transformers integration so that Trainer can utilize the cluster (they already support DeepSpeed, etc., so we can make our cluster appear as a sort of DeepSpeed backend to the Trainer). By doing so, the broad user base of PyTorch/Transformers can fine-tune models on our AMD cluster by largely reusing their existing training scripts – just adding some configuration for distributed run. Moreover, since AMD is positioning Ryzen AI for wide use (including Windows PCs), we might even consider making parts of the solution cross-platform. While our primary focus is likely Linux for cluster management, the underlying ideas (like using ONNX EP for NPU) are cross-platform, meaning a developer on a Ryzen AI laptop could try out fine-tuning on their single machine (with the same software stack) before scaling out to multiple nodes. This continuity helps adoption and learning.
In conclusion, this comprehensive framework combines cutting-edge hardware (AMD XDNA NPUs) with state-of-the-art distributed training techniques and algorithmic innovations. By grounding our approach in documented best practices and real hardware capabilities, we’ve outlined a path to optimize fine-tuning efficiency on AMD Ryzen AI Max 395-based clusters. The result will be a highly scalable, power-efficient, and user-friendly platform for fine-tuning and serving large AI models – lowering the barrier for researchers and developers to experiment with big models on their own AMD-powered clusters.
Sources:
- AMD Ryzen AI Max+ 395 architecture and performance – Tom's Hardware (2024) (RDNA 3.5, XDNA 2 engine, and Thoughts - AMD deep-dives Zen 5 architecture — Ryzen 9000 and AI 300 benchmarks, RDNA 3.5 GPU, XDNA 2, and more - Page 5 | Tom's Hardware)
- Power efficiency of XDNA 2 NPU vs CPU – Tom's Hardware (RDNA 3.5, XDNA 2 engine, and Thoughts - AMD deep-dives Zen 5 architecture — Ryzen 9000 and AI 300 benchmarks, RDNA 3.5 GPU, XDNA 2, and more - Page 5 | Tom's Hardware)
- AMD claims on Llama 70B 4-bit vs RTX 4090 – Hackster CES 2025 News (AMD Unveils Its Fastest Edge AI Chips Yet: The Ryzen AI Max and Ryzen AI Max+ Strix Halo Families - Hackster.io)
- AMD XDNA 2 NPU dataflow design – Tom's Hardware (RDNA 3.5, XDNA 2 engine, and Thoughts - AMD deep-dives Zen 5 architecture — Ryzen 9000 and AI 300 benchmarks, RDNA 3.5 GPU, XDNA 2, and more - Page 5 | Tom's Hardware)
- Zen 5 AVX-512 VNNI instruction support – Hacker News (Zen5's AVX512 Teardown and More - Hacker News)
- Ryzen AI software stack (Vitis AI and ONNX Runtime EP) – Hugging Face Optimum Docs (AMD Ryzen AI)
- Fully Sharded Data Parallel (FSDP) overview – Meta AI (Ott et al., 2021) (Fully Sharded Data Parallel: faster AI training with fewer GPUs - Engineering at Meta)
- ZeRO (Zero Redundancy Optimizer) memory partitioning – DeepSpeed (Zero Redundancy Optimizer - DeepSpeed)
- 2D Parallelism (Tensor + FSDP) for large models – Lightning AI Docs (2D Parallelism (Tensor Parallelism + FSDP) — lightning 2.4.0 documentation)
- Ray Serve multi-node inference capabilities – Anyscale Blog (2024) (Low-latency Generative AI Model Serving with Ray, NVIDIA Triton Inference Server, and NVIDIA TensorRT-LLM)
- Ethernet vs InfiniBand in AI clusters – WWT Research (The Battle of AI Networking: Ethernet vs InfiniBand - WWT)
- 4-bit quantization boosting multi-GPU throughput – MLC AI Blog (2023) (MLC | Scalable Language Model Inference on Multiple NVIDIA and AMD GPUs)
- Kernel fusion to reduce launch latency – PyTorch Dev Discussion (Making Transformer inference faster on GPUs - performance - PyTorch Developer Mailing List)
- Distributed speculative inference (DSI) speedups – Timor et al., 2025 (ICLR) (Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference | OpenReview)
- NVIDIA Transformer Engine (fused ops, FP8) – via PyTorch Dev Discussion (Making Transformer inference faster on GPUs - performance)
- MLC multi-GPU scaling (NVIDIA vs AMD) – MLC AI Blog (MLC | Scalable Language Model Inference on Multiple NVIDIA and AMD GPUs)
- Hugging Face Optimum – Ryzen AI setup – Hugging Face Docs (AMD Ryzen AI)