@technillogue
Highlights from "The Llama 3 Herd of Models" paper

"My basic perception is that's it's insane they went bf16/dense -- feels like a 4x algorithmic hit" Well, dense makes inference cheaper and more memory-efficient at the cost of less compute-efficient training, right?

• Managing complexity. We make design choices that seek to maximize our ability to scale the model development process. For example, we opt for a standard dense Transformer model architecture (Vaswani et al., 2017) with minor adaptations, rather than for a mixture-of-experts model (Shazeer et al., 2017) to maximize training stability. Similarly, we adopt a relatively simple post-training procedure based on supervised finetuning (SFT), rejection sampling (RS), and direct preference optimization (DPO; Rafailov et al. (2023)) as opposed to more complex reinforcement learning algorithms (Ouyang et al., 2022; Schulman et al., 2017) that tend to be less stable and harder to scale

Like, Arctic runs on the same 8xH100 node but gets an MMLU of 67 instead of Llama's 87, at a TPS that is not that much higher.

This is kind of fun:

Compute. Llama 3 405B is trained on up to 16K H100 GPUs, each running at 700W TDP with 80GB HBM3, using Meta's Grand Teton AI server platform (Matt Bowman, 2022). Each server is equipped with eight GPUs and two CPUs. Within a server, the eight GPUs are connected via NVLink. Training jobs are scheduled using MAST (Choudhury et al., 2024), Meta's global-scale training scheduler.

Storage. Tectonic (Pan et al., 2021), Meta's general-purpose distributed file system, is used to build a storage fabric (Battey and Gupta, 2024) for Llama 3 pre-training. It offers 240 PB of storage out of 7,500 servers equipped with SSDs, and supports a sustainable throughput of 2 TB/s and a peak throughput of 7 TB/s. A major challenge is supporting the highly bursty checkpoint writes that saturate the storage fabric for short durations. Checkpointing saves each GPU's model state, ranging from 1 MB to 4 GB per GPU, for recovery and debugging. We aim to minimize GPU pause time during checkpointing and increase checkpoint frequency to reduce the amount of lost work after a recovery.

What's interesting about this is that 7 TB/s is not that high; you could get 1 TB/s on a single node. This isn't a shock in terms of vertical scaling; the impressive thing is that it works across 16,000 GPUs.
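Back-of-the-envelope on why those checkpoint bursts saturate the fabric: the per-GPU state sizes and throughput figures are from the quote, the arithmetic and the assumption of a full 4 GB per GPU are mine.

```python
# Rough checkpoint-burst arithmetic using the figures quoted above.
gpus = 16_000
per_gpu_state_gb = 4          # upper end of the quoted 1 MB - 4 GB range
peak_tbps = 7                 # peak storage throughput, TB/s
sustained_tbps = 2            # sustained storage throughput, TB/s

checkpoint_tb = gpus * per_gpu_state_gb / 1000   # ~64 TB per full checkpoint
print(f"full checkpoint: ~{checkpoint_tb:.0f} TB")
print(f"time at peak:      ~{checkpoint_tb / peak_tbps:.0f} s")
print(f"time at sustained: ~{checkpoint_tb / sustained_tbps:.0f} s")
# A ~64 TB write finishing in tens of seconds is exactly the kind of short,
# fabric-saturating burst the quoted passage describes.
```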

Network. Llama 3 405B used RDMA over Converged Ethernet (RoCE) fabric based on the Arista 7800 and Minipack2 Open Compute Project (OCP) rack switches. Smaller models in the Llama 3 family were trained using Nvidia Quantum2 Infiniband fabric. Both RoCE and Infiniband clusters leverage 400 Gbps interconnects between GPUs. Despite the underlying network technology differences between these clusters, we tune both of them to provide equivalent performance for these large training workloads. We elaborate further on our RoCE network since we fully own its design.

• Network topology. Our RoCE-based AI cluster comprises 24K GPUs connected by a three-layer Clos network (Lee et al., 2024). At the bottom layer, each rack hosts 16 GPUs split between two servers and connected by a single Minipack2 top-of-the-rack (ToR) switch. In the middle layer, 192 such racks are connected by Cluster Switches to form a pod of 3,072 GPUs with full bisection bandwidth, ensuring no oversubscription. At the top layer, eight such pods within the same datacenter building are connected via Aggregation Switches to form a cluster of 24K GPUs. However, network connectivity at the aggregation layer does not maintain full bisection bandwidth and instead has an oversubscription ratio of 1:7. Our model parallelism methods (see Section 3.3.2) and training job scheduler (Choudhury et al., 2024) are all optimized to be aware of network topology, aiming to minimize network communication across pods.

One wonders what they're using the other 8k GPUs for. Experiments?
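For what it's worth, the quoted topology numbers do account for the gap (their figures, my arithmetic):

```python
# Cluster topology arithmetic from the quoted numbers.
gpus_per_rack = 16
racks_per_pod = 192
pods = 8

gpus_per_pod = gpus_per_rack * racks_per_pod      # 3,072
cluster_gpus = gpus_per_pod * pods                # 24,576
training_gpus = 16_000                            # "up to 16K H100 GPUs"
print(gpus_per_pod, cluster_gpus, cluster_gpus - training_gpus)
# -> 3072 24576 8576  (roughly the "other 8K" GPUs left over for whatever else)
```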

Load balancing. LLM training produces fat network flows that are hard to load balance across all available network paths using traditional methods such as Equal-Cost Multi-Path (ECMP) routing. To address this challenge, we employ two techniques. First, our collective library creates 16 network flows between two GPUs, instead of just one, thereby reducing the traffic per flow and providing more flows for load balancing. Second, our Enhanced-ECMP (E-ECMP) protocol effectively balances these 16 flows across different network paths by hashing on additional fields in the RoCE header of packets
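A toy illustration of why 16 flows per GPU pair help, assuming a plain hash-each-flow-to-a-path model; this is not Meta's E-ECMP implementation, just the intuition that more, smaller flows smooth out per-path load.

```python
# Toy ECMP-style hashing: compare 1 fat flow per GPU pair vs 16 thinner flows.
import random
from collections import Counter

def max_over_mean(num_pairs: int, flows_per_pair: int, num_paths: int) -> float:
    """Hash each flow onto a random path and report max/mean path load."""
    load = Counter()
    per_flow_bytes = 1.0 / flows_per_pair     # same total traffic per GPU pair
    for _ in range(num_pairs * flows_per_pair):
        load[random.randrange(num_paths)] += per_flow_bytes
    loads = [load[p] for p in range(num_paths)]
    return max(loads) / (sum(loads) / num_paths)

random.seed(0)
print(" 1 flow/pair :", round(max_over_mean(num_pairs=64, flows_per_pair=1, num_paths=16), 2))
print("16 flows/pair:", round(max_over_mean(num_pairs=64, flows_per_pair=16, num_paths=16), 2))
# The hot-path overload (max/mean) drops noticeably with 16 flows per pair.
```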

They achieve a model FLOPs utilization (MFU) of 38-43%, which is reasonable for a model this size.
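Rough sanity check of what ~40% MFU implies, using the standard 6·N·D approximation for training FLOPs and the published H100 bf16 dense peak of ~989 TFLOPS; the arithmetic is mine, not the paper's.

```python
# What 38-43% MFU implies for wall-clock time, roughly.
params = 405e9            # model parameters
tokens = 15.6e12          # pre-training tokens (paper)
gpus = 16_000
h100_peak_flops = 989e12  # bf16 dense peak per GPU
mfu = 0.41                # middle of the reported 38-43%

total_flops = 6 * params * tokens                  # ~3.8e25 FLOPs
effective_flops_per_s = gpus * h100_peak_flops * mfu
days = total_flops / effective_flops_per_s / 86_400
print(f"~{total_flops:.2e} FLOPs, ~{days:.0f} days of wall-clock pre-training")
# Lands on the order of a couple of months, which is at least consistent with
# the 54-day snapshot later in the paper covering only part of pre-training.
```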

It seems like they trained a way bigger model on the same 15T tokens as the 8B/70B models, which almost makes sense on the basis that those models have too many tokens for their size.

our flagship model outperforms smaller models trained using the same procedure. While our scaling laws suggest our flagship model is an approximately compute-optimal size for our training budget, we also train our smaller models for much longer than is compute-optimal. The resulting models perform better than compute-optimal models at the same inference budget.
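A rough tokens-per-parameter comparison, using the common ~20 tokens/param Chinchilla heuristic rather than their in-house scaling laws, shows the shape of the trade-off:

```python
# Tokens per parameter for each model size, against the ~20 tokens/param
# rule of thumb (a heuristic, not the paper's own scaling law).
tokens = 15e12
for params in (8e9, 70e9, 405e9):
    ratio = tokens / params
    print(f"{params/1e9:>5.0f}B: {ratio:>6.0f} tokens/param "
          f"(~{ratio/20:.0f}x the 20 tokens/param heuristic)")
#   8B: ~1875 tokens/param (~94x)
#  70B:  ~214 tokens/param (~11x)
# 405B:   ~37 tokens/param (~2x) -- the flagship is close-ish to compute-optimal,
# while the small models are deliberately over-trained to be cheaper at inference.
```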

Generally interesting


To address these issues, we modify our pipeline schedule as shown in Figure 6, which allows setting N flexibly—in this case N = 5, which can run an arbitrary number of micro-batches in each batch. This allows us to run: (1) fewer micro-batches than the number of stages when we have batch size limit at large scale; or (2) more micro-batches to hide point-to-point communication, finding a sweet spot between depth-first schedule (DFS) and breadth-first schedule (BFS) for the best communication and memory efficiency. To balance the pipeline, we reduce one Transformer layer each from the first and the last stages, respectively. This means that the first model chunk on the first stage has only the embedding, and the last model chunk on the last stage has only output projection and loss calculation. To reduce pipeline bubbles, we use an interleaved schedule (Narayanan et al., 2021) with V pipeline stages on one pipeline rank. The overall pipeline bubble ratio is (PP − 1)/(V ∗ M). Further, we adopt asynchronous point-to-point communication in PP, which considerably speeds up training, especially in cases when the document mask introduces extra computation imbalance. We enable TORCH_NCCL_AVOID_RECORD_STREAMS to reduce memory usage from asynchronous point-to-point communication. Finally, to reduce memory cost, based on detailed memory allocation profiling, we proactively deallocate tensors that will not be used for future computation, including the input and output tensors of each pipeline stage. With these optimizations, we could pre-train Llama 3 on sequences of 8K tokens without activation checkpointing.
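Plugging illustrative numbers (mine, not theirs) into the quoted bubble-ratio expression (PP − 1)/(V ∗ M):

```python
# Pipeline bubble ratio from the quoted schedule: (PP - 1) / (V * M), where PP
# is the number of pipeline ranks, V the interleaved (virtual) stages per rank,
# and M the micro-batches per batch. Example values are illustrative only.
def bubble_ratio(pp: int, v: int, m: int) -> float:
    return (pp - 1) / (v * m)

for v, m in [(1, 16), (2, 16), (2, 64)]:
    print(f"PP=16, V={v}, M={m}: bubble ratio ~{bubble_ratio(16, v, m):.2f}")
# Interleaving (larger V) and more micro-batches (larger M) both shrink the
# idle bubble at the start and end of each pipeline pass.
```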

The off-hand "we develop a performance-projection tool" really kills me a little with the implied discretionary human capital. Like, you did all this shit to get this far, presumably at incredible capex and opex, you have everyone hand-holding the cluster, and you still have people to spare for sims.

Network-aware parallelism configuration. The order of parallelism dimensions, [TP, CP, PP, DP], is optimized for network communication. The innermost parallelism requires the highest network bandwidth and lowest latency, and hence is usually constrained to within the same server. The outermost parallelism may spread across a multi-hop network and should tolerate higher network latency. Therefore, based on the requirements for network bandwidth and latency, we place parallelism dimensions in the order of [TP, CP, PP, DP]. DP (i.e., FSDP) is the outermost parallelism because it can tolerate longer network latency by asynchronously prefetching sharded model weights and reducing gradients. Identifying the optimal parallelism configuration with minimal communication overhead while avoiding GPU memory overflow is challenging. We develop a memory consumption estimator and a performance-projection tool which helped us explore various parallelism configurations and project overall training performance and identify memory gaps effectively
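For flavor, a heavily simplified sketch of what a memory-consumption estimator over (TP, CP, PP, DP) configurations could look like; every constant and modelling choice below is my own guess, not theirs, and it only illustrates the kind of search the paper alludes to.

```python
# Toy search over parallelism configurations: flag those whose rough per-GPU
# memory fits in 80 GB of HBM. Grossly simplified, assumption-laden numbers.
from itertools import product

GPUS = 16_384
PARAMS = 405e9
STATE_BYTES_PER_PARAM = 16    # bf16 weights + fp32 grads/optimizer, very rough
ACT_BYTES_UNSHARDED = 100e9   # made-up stand-in for activation memory at 8K seq
HBM_GB = 80

def per_gpu_gb(tp: int, cp: int, pp: int, dp: int) -> float:
    # Parameter/optimizer state shards across TP, PP, and (via FSDP) DP;
    # the crude activation term shrinks with TP and CP (sequence sharding).
    state = PARAMS * STATE_BYTES_PER_PARAM / (tp * pp * dp)
    activations = ACT_BYTES_UNSHARDED / (tp * cp)
    return (state + activations) / 1e9

fits = []
for tp, cp, pp in product([1, 2, 4, 8], [1, 2], [4, 8, 16]):
    if GPUS % (tp * cp * pp):
        continue
    dp = GPUS // (tp * cp * pp)
    gb = per_gpu_gb(tp, cp, pp, dp)
    if gb < HBM_GB:
        fits.append(((tp, cp, pp, dp), round(gb, 1)))

for (tp, cp, pp, dp), gb in sorted(fits, key=lambda x: x[1])[:5]:
    print(f"TP={tp} CP={cp} PP={pp} DP={dp}: ~{gb} GB/GPU")
```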

Although the MFU is reasonable, it's somewhat shocking that they're not using fp8. This is especially surprising considering that inference is likely mostly going to be done in fp8 (with their special considerations for not quantizing self-attention or the first and last layers), so why not train in fp8? This is partially addressed:

Numerical stability. By comparing training loss between different parallelism setups, we fixed several numerical issues that impact training stability. To ensure training convergence, we use FP32 gradient accumulation during backward computation over multiple micro-batches and also reduce-scatter gradients in FP32 across data parallel workers in FSDP. For intermediate tensors, e.g., vision encoder outputs, that are used multiple times in the forward computation, the backward gradients are also accumulated in FP32.
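A minimal sketch of the fp32-accumulation idea in plain PyTorch (not Meta's FSDP plumbing): compute in bf16, but keep the running gradient sum in fp32 so small per-micro-batch contributions aren't swallowed by bf16's short mantissa.

```python
# Accumulate bf16 gradients into an fp32 buffer across micro-batches.
import torch

model = torch.nn.Linear(1024, 1024, dtype=torch.bfloat16)
fp32_grad = torch.zeros(model.weight.shape, dtype=torch.float32)

micro_batches = [torch.randn(32, 1024, dtype=torch.bfloat16) for _ in range(4)]
for x in micro_batches:
    model.zero_grad()
    loss = model(x).float().pow(2).mean()   # toy loss
    loss.backward()
    fp32_grad += model.weight.grad.float()  # accumulate in fp32, not bf16

fp32_grad /= len(micro_batches)
# In the real setup, the reduce-scatter of gradients across data-parallel
# workers would also run on fp32 tensors, per the quoted passage.
```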

Similarly to the offhand performance-projection tool, this gives a sense of nausea and disquiet similar to beholding a gigantic waterpark in the middle of the desert.

The complexity and potential failure scenarios of 16K GPU training surpass those of much larger CPU clusters that we have operated. Moreover, the synchronous nature of training makes it less fault-tolerant—a single GPU failure may require a restart of the entire job. Despite these challenges, for Llama 3, we achieved higher than 90% effective training time while supporting automated cluster maintenance, such as firmware and Linux kernel upgrades (Vigraham and Leonhardi, 2024), which resulted in at least one training interruption daily. The effective training time measures the time spent on useful training over the elapsed time. During a 54-day snapshot period of pre-training, we experienced a total of 466 job interruptions.
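Quick arithmetic on those reliability figures (their numbers, my division):

```python
# What 466 interruptions over a 54-day snapshot and >90% effective time imply.
interruptions = 466
snapshot_days = 54
effective = 0.90

print(f"~{interruptions / snapshot_days:.1f} interruptions/day")
print(f"<={(1 - effective) * 24:.1f} hours/day lost at >=90% effective time")
# ~8.6 job interruptions per day, yet under ~2.4 hours/day of lost time,
# which says a lot about how fast detection and restart must be.
```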

Other commentators have remarked on this and compared it to S3 noticing a nontrivial number of bit flips over the course of a day.

One interesting observation is the impact of environmental factors on training performance at scale. For Llama 3 405B , we noted a diurnal 1-2% throughput variation based on time-of-day. This fluctuation is the result of higher mid-day temperatures impacting GPU dynamic voltage and frequency scaling

This seems surprisingly arbitrary / hard to reproduce exactly. Similarly, "we average models obtained from experiments using various versions of data or hyperparameters" — like, you have the reported hyperparameters but also some other ones involved that you don't mention? [emphasis mine]

Our preference data annotation process is similar to Llama 2. We deploy multiple models for annotation after each round and sample two responses from two different models for each user prompt. These models can be trained with different data mixes and alignment recipes, allowing for different capability strength (e.g., code expertise) and increased data diversity.

Ah yeah. the careful ablations. Can't believe I forgot to use careful ablations in my training runs when I curate data mixes for one step of one of my seven post-training capabilities. That's on me. Maybe I can go get some at the careful ablations store?

• Long context code reasoning: We parse Python files to identify import statements and determine their dependencies. From here, we select the most commonly depended-upon files, specifically those referenced by at least five other files. We remove one of these key files from a repository and prompt the model to identify which files depended on the missing file and to generate the necessary missing code. We further categorize these synthetically generated samples based on the sequence length (16K, 32K, 64K and 128K) to enable more fine-grained targeting of input lengths. Through careful ablations, we observe that mixing 0.1% of synthetically generated long-context data with the original short-context data optimizes the performance across both short-context and long-context benchmarks.
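A rough sketch of what the first step of that pipeline could look like (naive module-to-file resolution, prompt generation omitted; the details are my assumptions, not their tooling):

```python
# Parse a repo's Python files, count which modules are imported by how many
# other files, and keep those imported by at least five.
import ast
import pathlib
from collections import Counter

def most_depended_upon(repo: pathlib.Path, min_dependents: int = 5):
    dependents = Counter()
    for path in repo.rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue
        imported = set()
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                imported.update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                imported.add(node.module)
        dependents.update(imported)   # one count per importing file
    return [mod for mod, n in dependents.items() if n >= min_dependents]

# e.g. most_depended_upon(pathlib.Path("some_repo/")) -> candidate "key files"
# to delete before asking the model to reconstruct them.
```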

Mogging on MoE

Small models. Developments in smaller models have paralleled those in large models. Models with fewer parameters can dramatically improve inference cost and simplify deployment (Mehta et al., 2024; Team et al., 2024). The smaller Llama 3 models achieve this by training far beyond the point of compute optimal training, effectively trading training compute for inference efficiency. An alternative path is to distill larger models into smaller ones, as in Phi (Abdin et al., 2024).

Architectures. While Llama 3 makes minimal architectural modifications compared to Llama 2, other recent foundation models have explored other designs. Most notably, mixture of experts architectures (Shazeer et al., 2017; Lewis et al., 2021; Fedus et al., 2022; Zhou et al., 2022) can be used as an efficient way to increase the capacity of a model, such as in Mixtral (Jiang et al., 2024) and Arctic (Snowflake, 2024). Llama 3 outperforms these models, suggesting that dense architectures are not the limiting factor, but there remain numerous trade-offs in terms of training and inference efficiency, and model stability at scale.

Also a little insane, but if you're going to go to all these lengths then you may as well keep kosher. Really, more upsetting is that they don't mention the organizational decisions that led to pulling off this level of complexity in the first place.

Developing a flagship foundation model such as Llama 3 involves overcoming a plethora of deep technical problems but also requires clever organizational decisions. For example, to ensure Llama 3 is not accidentally overfitted on commonly used benchmarks, our pre-training data was procured and processed by a separate team that was strongly incentivized to prevent contamination of that pre-training data with external benchmarks. As another example, we ensure that our human evaluations remain trustworthy by allowing only a small set of researchers who do not contribute to model development to perform and access these evaluations. While such organizational decisions are rarely discussed in technical papers, we found them to be pivotal to the successful development of the Llama 3 family of models.

The acknowledgements list 219 core contributors (who worked on the project for more than 2/3 of its runtime) and 311 additional contributors (more than 1/5), against a train time of ~3 months and who knows how much non-training work. I don't know, I've yet to work outside a startup, but that doesn't sound like the norm for bigtech?

Finally, one place where Meta has not distinguished itself is brevity and concision in writing, so we skimmed sections 5 [results], 7 [data], and 8 [speech] to get to the meat in 3 [pre-training], 4 [post-training], and 6 [inference]. The intro, related work, and conclusion are also informative.
