Smashing the Tariffs for Fun and Profit: How DeepSeek v3 Outsmarted the AI Ban

1. CUDA and PTX Optimizations

DeepSeek-V3’s engineers optimized GPU performance at a low level by tailoring kernels and memory access patterns to NVIDIA’s hardware. A key strategy was warp specialization: they partitioned a subset of GPU threads (warps) specifically for communication tasks, allowing compute to overlap with data transfers (DeepSeek-V3 Technical Report). In practice, only ~20 of each GPU’s Streaming Multiprocessors (SMs) were reserved to handle all cross-node communications – enough to saturate both InfiniBand (IB) and NVLink bandwidth – while the remaining SMs focused purely on computation (DeepSeek-V3 Technical Report). This isolation ensured that communication overhead did not stall the main training loops.

To implement this, custom PTX instructions were used to finely control memory operations. The team auto-tuned the communication chunk size at the PTX level, meaning data was broken into optimally sized pieces for transfer. This reduced overuse of L2 cache and minimized interference with other SMs (DeepSeek-V3 Technical Report). In essence, by sending data in well-sized batches and using low-level memory fence and load instructions, they prevented cache thrashing and contention between communication and compute threads. The result was nearly full compute-communication overlap, eliminating the typical waiting time when shuttling token activations between GPUs (GitHub - deepseek-ai/DeepSeek-V3).

Memory management on GPU was also optimized. For example, DeepSeek-V3 recomputed certain values on the fly during backpropagation (such as RMSNorm outputs and MLA up-projection results) instead of storing them, to save memory (DeepSeek-V3 Technical Report). Less critical data (like the Exponential Moving Average of weights) was kept in CPU memory and updated asynchronously, freeing GPU RAM (DeepSeek-V3 Technical Report). These choices reduced memory pressure and allowed using larger batch sizes or sequence lengths without running out of GPU memory.
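
For intuition, here is a minimal sketch of the recompute-instead-of-store idea, expressed with PyTorch’s generic activation checkpointing. DeepSeek-V3 implements the equivalent inside its own kernels (for RMSNorm and the MLA up-projections), so the class below is an illustration of the technique, not their code.

# Illustrative only: recompute RMSNorm during backward instead of caching its output
import torch
from torch.utils.checkpoint import checkpoint

class RecomputedRMSNorm(torch.nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))
        self.eps = eps

    def _norm(self, x):
        # RMSNorm: scale features by the reciprocal root-mean-square, then a learned gain
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

    def forward(self, x):
        # checkpoint() discards intermediates and re-runs _norm in the backward pass,
        # trading a little extra compute for a smaller activation footprint
        return checkpoint(self._norm, x, use_reentrant=False)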

Code Example – Warp-Specialized Communication Kernel (Pseudo-PTX):

// Simplified pseudo-code illustrating warp specialization for communication
const int warp_id = threadIdx.y;        // Identify warp in the block
if (warp_id < num_ib_send_warps) {
    // This subset of warps handles InfiniBand sending
    perform_IB_send(token_chunk);
} else if (warp_id < num_ib_send_warps + num_nvlink_fwd_warps) {
    // Next subset handles forwarding via NVLink within node
    forward_via_NVLink(token_chunk);
} else if (warp_id < num_comm_warps) {
    // Remaining comm warps handle NVLink receive & IB receive
    perform_NVLink_receive(token_chunk);
    accumulate_IB_receive(token_chunk);
} else {
    // Other warps (and other SMs) handle regular compute (FFN, attention, etc.)
    compute_transformer_layer(data_chunk);
}
// (Communication chunk sizes and warp allocations are auto-tuned for optimal L2 cache use)

In this pseudo-code, different warps within a CUDA thread block are assigned to different communication subtasks (InfiniBand send, NVLink forward/receive, etc.). The actual implementation uses PTX-level synchronization and optimized memory moves (e.g. ldmatrix and cp.async in PTX) to ensure data transfers overlap with compute. This concurrent scheduling allows DeepSeek-V3 to hide almost all communication latency behind computation, achieving high hardware utilization. Empirically, the team reported that only 20 SMs (out of the entire GPU) sufficed to fully utilize IB (≈50 GB/s) and NVLink (≈160 GB/s) bandwidth, with no slowdown to the transformer computations on the other SMs (DeepSeek-V3 Technical Report) (DeepSeek-V3 Technical Report). This low-level efficiency was crucial in making training scalable across the 2048 GPUs in the cluster.

2. Mixture-of-Experts (MoE) Architecture

DeepSeek-V3 uses a Mixture-of-Experts architecture, dubbed DeepSeekMoE, to scale up the model’s capacity cost-effectively (DeepSeek-V3 Technical Report) (DeepSeek-V3 Technical Report). In each MoE layer, instead of a single large feed-forward network (FFN), there are many smaller expert networks. DeepSeek-V3’s design includes 256 routed experts per MoE layer (finer-grained than prior MoE models) plus 1 shared expert that is always active (DeepSeek-V3 Technical Report). Each expert is essentially a smaller FFN (with an intermediate hidden size of 2048 in this model, much smaller than a dense FFN in an equally large model) (DeepSeek-V3 Technical Report). At inference/training time, only a sparse subset of these experts is used for any given token – specifically, each token activates the shared expert and its top-8 routed experts (i.e. $k=8$) based on a gating function (DeepSeek-V3 Technical Report). This means although the model has 671 billion parameters in total, only ~37 billion are active per token prediction (DeepSeek-V3 Technical Report), dramatically cutting computation. The MoE architecture thus expands capacity (for diversity of knowledge across experts) without linearly increasing the compute required per token.

Expert Gating Mechanism: DeepSeek-V3 uses a learned gating network to decide which experts to use for each input token. For a token with hidden state $\mathbf{h}_i$, the gate computes a token-to-expert affinity score for each expert. Formally, each expert $e$ has a trainable centroid vector $\mathbf{c}_e$; the affinity is the dot product passed through a sigmoid activation, $a_{i,e} = \sigma(\mathbf{h}_i \cdot \mathbf{c}_e)$ (DeepSeek-V3 Technical Report). The gating algorithm then selects the top-$K$ experts with the highest $a_{i,e}$ (with $K=8$ for DeepSeek-V3). Let $E_i$ be the set of selected expert indices for token $i$. The final gating value $g_{i,e}$ for each chosen expert is the affinity normalized over the top-$K$ set (so that the contributions sum to 1) (DeepSeek-V3 Technical Report). In other words, if $S_i = \sum_{e\in E_i} a_{i,e}$, then for each expert $e\in E_i$, $g_{i,e} = a_{i,e}/S_i$. These gating values determine how much each expert’s output will influence the token’s final output. The token’s FFN output is then a weighted combination of the shared expert output and the selected experts’ outputs:

$$ y_i = E_{\text{shared}}(\mathbf{h}_i) + \sum_{e \in E_i} g_{i,e}\, E_e(\mathbf{h}_i), $$

where $E_{\text{shared}}(\cdot)$ is the shared expert (always active) and $E_e(\cdot)$ is the FFN function for expert $e$. This sparse aggregation means most experts stay “dormant” for a given token, saving compute. Importantly, DeepSeek-V3 uses sigmoid normalization for affinities (instead of softmax used in some MoE variants), which they found effective for stable training (DeepSeek-V3 Technical Report).

Code Example – Gating and Sparse Expert Forward Pass:

# Pseudocode for MoE gating and expert dispatch for one token
affinities = sigmoid(h_i @ expert_centroids.T)           # token-to-expert affinity for each expert
selected_aff, topk_indices = affinities.topk(k=8)        # top-8 affinities and their expert indices
gating_values = selected_aff / selected_aff.sum()        # normalize affinities of selected experts

# Compute outputs of selected experts (plus shared expert)
output = shared_expert.forward(h_i)  # output from always-active shared expert
for idx, g in zip(topk_indices, gating_values):
    output += g * experts[idx].forward(h_i)            # weighted sum of top-8 expert outputs

This pseudocode shows the token’s hidden state h_i being routed to the shared expert and a handful of top experts. Only those experts’ forward passes are computed, making the layer computation sparse. The gating values g weight each expert’s contribution. In DeepSeek-V3’s actual implementation, this routing is done in batch: each GPU handles a subset of experts and tokens are exchanged between GPUs in an all-to-all fashion so that each token’s data reaches its assigned experts. The earlier CUDA optimizations (warp-specialized communication) ensure this routing is efficient.
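
The same routing can also be written in batched form, grouping every token assigned to an expert into one forward pass – conceptually what each GPU does before and after the all-to-all exchange. The sketch below is plain PyTorch with assumed names (moe_layer, expert_centroids) and omits the cross-GPU communication entirely:

# Conceptual batched dispatch sketch (not DeepSeek-V3's code)
import torch

def moe_layer(h, shared_expert, experts, expert_centroids, k=8):
    # h: (T, d) hidden states; experts: list of per-expert FFN modules
    aff = torch.sigmoid(h @ expert_centroids.T)          # (T, E) token-expert affinities
    gate, idx = aff.topk(k, dim=-1)                      # top-k experts per token
    gate = gate / gate.sum(dim=-1, keepdim=True)         # normalize gating weights
    out = shared_expert(h)                               # shared expert is always active
    for e, expert in enumerate(experts):
        tok, slot = (idx == e).nonzero(as_tuple=True)    # tokens routed to expert e
        if tok.numel() > 0:
            out[tok] += gate[tok, slot].unsqueeze(-1) * expert(h[tok])
    return out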

Load Balancing without Auxiliary Loss: A notorious challenge in MoE models is preventing certain experts from becoming “overloaded” (attracting too many tokens) while others are rarely used. Prior MoE systems (e.g. Google’s GShard or Switch Transformer) often added an auxiliary loss term to encourage balanced expert utilization (penalizing high variance in load) (DeepSeek-V3 Technical Report). DeepSeek-V3 innovates by achieving balance without an auxiliary loss, avoiding the trade-off where such losses can hurt model quality (DeepSeek-V3 Technical Report). Instead, it introduces a dynamic bias term in the gating function for each expert (DeepSeek-V3 Technical Report). During training, after each training step, it measures each expert’s load (how many tokens were routed to it). If an expert was overloaded relative to others, the gating bias $b_e$ for that expert is decreased slightly; if underloaded, $b_e$ is increased (DeepSeek-V3 Technical Report). In formula form, one can imagine updating:

$$ b_e \leftarrow b_e - \eta\,\mathbf{1}[\text{overloaded}(e)] + \eta\,\mathbf{1}[\text{underloaded}(e)], $$

with a small learning rate $\eta$ (called bias update speed) (DeepSeek-V3 Technical Report). This bias $b_e$ is added to the affinity score for expert $e$ only for the purpose of the top-$K$ selection, not for the gating value itself (DeepSeek-V3 Technical Report). Thus, if an expert was overused, its bias will drop, making it slightly less likely to be selected next time (and vice versa for underused experts). Over thousands of steps, this has the effect of equalizing the load across experts without directly injecting a balancing term into the loss. DeepSeek-V3 reports that this auxiliary-loss-free strategy maintains excellent balance and even yields better model quality than using a fixed auxiliary loss term (DeepSeek-V3 Technical Report) (DeepSeek-V3 Technical Report). They still include a tiny sequence-level load loss as a safety measure (to prevent one single sequence from sending all its tokens to the same expert), but this is set with an extremely small weight (e.g. $10^{-4}$) just to avoid pathological cases (DeepSeek-V3 Technical Report). In practice, the dynamic gating bias was sufficient to keep the load evenly distributed, so no tokens had to be dropped due to expert overflow (some MoE implementations drop excess tokens when an expert’s capacity is exceeded, but DeepSeek-V3 did not need to drop any) (DeepSeek-V3 Technical Report).
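
A minimal sketch of this balancing step is shown below; the mean-load threshold and the exact update rule are a plausible reading of the report’s description, not its published code.

import torch

def update_gating_bias(bias, expert_load, step_size=1e-3):
    # bias: (E,) per-expert gating bias; expert_load: (E,) tokens routed to each
    # expert during the last step. Experts above the mean load are nudged down,
    # experts below it are nudged up.
    mean_load = expert_load.float().mean()
    bias = bias - step_size * (expert_load > mean_load).float()
    bias = bias + step_size * (expert_load < mean_load).float()
    return bias

# During routing, the bias shifts only the selection scores, not the gating weights:
#   _, idx = (affinities + bias).topk(k)       # biased top-k selection
#   gate = affinities.gather(-1, idx)          # gating values from raw affinities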

Expert Parallelism and Efficiency: The MoE design greatly improves computational efficiency. With 8 experts active per token out of 256, the model achieves sparsity of 97% in the expert layers – i.e. 97% of expert parameters are inactive for each token, saving compute. Meanwhile, the one shared expert provides a dense “backbone” so that every token always has at least one expert processing it (this helps model stability and generalization). The effective per-token FLOPs are much lower than a dense model of comparable size. To handle the experts distributed across many GPUs, DeepSeek-V3 employs an optimized all-to-all communication that sends tokens to the GPUs hosting their selected experts, using the strategies from Section 1 (limiting each token to 4 nodes max to cap communication cost) (DeepSeek-V3 Technical Report) (DeepSeek-V3 Technical Report). Thanks to those optimizations, they achieve near-linear scaling of the MoE layers across the 2048 GPUs. In summary, the DeepSeekMoE architecture allows the model to scale to 671B parameters total, but with only 37B parameters activated per token on average (DeepSeek-V3 Technical Report), striking an excellent balance between model capacity and throughput.

3. Multi-Head Latent Attention (MLA)

Another major innovation in DeepSeek-V3 is its Multi-Head Latent Attention (MLA) mechanism, which addresses the memory and speed bottlenecks of attention over long sequences. In a standard Transformer, each new token’s attention requires computing dot-products with all past keys and then weighting all past values. To avoid recomputing keys and values for past tokens repeatedly, modern LLMs use a KV cache: they store the key and value vectors for every past token so that on each new token, the model can just fetch those and do the attention calculation (DeepSeek V3: Inside the Open-Source AI Model Rivaling GPT-4 - Zilliz blog). However, this cache can become huge – for example, a model with hidden size $d=7168$ and 128 heads storing keys and values for a context of 2048 tokens uses many gigabytes of memory. For very long contexts (e.g. 32k or 128k tokens), the KV cache becomes a serious memory bottleneck.
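
A rough back-of-the-envelope calculation illustrates the problem, assuming BF16 storage and (as a simplification) keys and values as wide as the model dimension:

# Rough KV-cache estimate for a dense 61-layer model with hidden size 7168,
# assuming BF16 (2 bytes) and K + V stored at full model width per layer.
layers, d_model, bytes_per_val = 61, 7168, 2
per_token = 2 * layers * d_model * bytes_per_val          # K and V, every layer
for ctx in (2_048, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {per_token * ctx / 2**30:6.1f} GiB")
# roughly 3.3 GiB at 2K tokens, 53 GiB at 32K, and 213 GiB at 128K per sequence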

MLA tackles this by compressing the keys and values into a lower-dimensional latent space. The core idea is a learned low-rank projection: keys and values often carry redundant information that can be jointly compressed. Concretely, for each attention layer, let $d_h$ be the per-head dimension (for DeepSeek-V3, $d_h=128$ per head) and $d_c$ be a smaller latent dimension (the paper calls this the KV compression dimension; in DeepSeek-V3, $d_c=512$, and this single latent is shared across all 128 heads rather than allocated per head – the exact numbers are discussed shortly) (DeepSeek-V3 Technical Report). For each token $i$, MLA performs:

  1. Joint projection of Key and Value: Form the concatenation of the key and value vectors (for a given head or set of heads) and multiply by a down-projection matrix $W_{\downarrow}$. If $\mathbf{K}_i$ and $\mathbf{V}_i$ are the original key and value, then $$\mathbf{z}_i = W_{\downarrow}\,[\mathbf{K}_i; \mathbf{V}_i] \in \mathbb{R}^{d_c},$$ where $[\mathbf{K}_i; \mathbf{V}_i]$ denotes the concatenation of the two vectors (DeepSeek-V3 Technical Report). This produces a compressed latent vector $\mathbf{z}_i$ that captures the information of both $\mathbf{K}_i$ and $\mathbf{V}_i$ in a lower-dimensional space.

  2. Up-projection to Key and Value: To use this latent in attention, two up-projection matrices map $\mathbf{z}_i$ back to the original key and value spaces (or their approximations). That is, $$\tilde{\mathbf{K}}_i = W_{\uparrow}^K \mathbf{z}_i, \qquad \tilde{\mathbf{V}}_i = W_{\uparrow}^V \mathbf{z}_i,$$ with $W_{\uparrow}^K \in \mathbb{R}^{d_h \times d_c}$ and $W_{\uparrow}^V \in \mathbb{R}^{d_h \times d_c}$ (DeepSeek-V3 Technical Report). $\tilde{\mathbf{K}}_i$ and $\tilde{\mathbf{V}}_i$ are the reconstructed key and value for token $i$. If the compression is effective, $\tilde{\mathbf{K}}_i$ will be close to the original $\mathbf{K}_i$ in direction, and similarly for $\tilde{\mathbf{V}}_i$.

At this point, if we only did the above, we could cache $\mathbf{z}_i$ for each token instead of full $\mathbf{K}_i$ and $\mathbf{V}_i$. That alone would shrink memory usage by roughly a factor of $\frac{2d_h}{d_c}$. For example, if $2d_h=256$ and $d_c=64$, that’s a 4x reduction. DeepSeek-V3 actually achieves even more compression by sharing this latent across multiple heads (i.e. the 512 latent dimension is for all 128 heads combined in some way, as implied by the hyperparameters) – effectively rank-reducing the entire multi-head attention. As reported, MLA brings KV memory down to only 5–13% of what standard multi-head attention (MHA) would use for the same context (DeepSeek’s Multi-Head Latent Attention Method - Horasis).
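
A minimal single-head sketch of the caching side of MLA follows; the dimensions and weight names are illustrative choices, not DeepSeek-V3’s:

import torch

d_h, d_c = 128, 32                      # per-head dim and latent dim (illustrative)
W_down = torch.randn(d_c, 2 * d_h)      # joint down-projection for [K; V]
W_up_K = torch.randn(d_h, d_c)          # up-projection back to key space
W_up_V = torch.randn(d_h, d_c)          # up-projection back to value space

def compress_kv(k, v):
    # cache this (d_c,) latent instead of the (2*d_h,) concatenated key/value
    return W_down @ torch.cat([k, v])

def reconstruct_kv(z):
    # rebuild approximate key/value from the cached latent when attending
    return W_up_K @ z, W_up_V @ z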

However, a challenge arises with Rotary Positional Embeddings (RoPE), which DeepSeek-V3 uses to encode token positions in the keys and queries. RoPE (Su et al., 2021) multiplies each key and query vector by a position-dependent rotation matrix, intertwining sinusoidal functions to inject positional phase information. In normal attention, you can apply RoPE to keys and queries directly since you store full keys for each position. But in MLA, since keys are stored in compressed form, we need to ensure the positional information isn’t lost. DeepSeek-V3 solves this by a clever decoupling of positional components (DeepSeek-V3 Technical Report) (DeepSeek V3: Inside the Open-Source AI Model Rivaling GPT-4 - Zilliz blog):

  1. Decoupling and applying RoPE: Before caching, the key is split into two parts – one part that will carry the RoPE positional modulation and one part that remains in latent form. In the paper’s formulation, a matrix $D$ is applied to the reconstructed key: $$\hat{\mathbf{K}}_i = D\,\tilde{\mathbf{K}}_i,$$ to produce a decoupled key vector (DeepSeek-V3 Technical Report). This decoupled key $\hat{\mathbf{K}}_i$ is then passed through the RoPE operation: $$\mathbf{K}^{(\text{rope})}_i = \text{RoPE}(\hat{\mathbf{K}}_i,\ \text{position}=i).$$ Here $\text{RoPE}(\cdot)$ applies the complex-valued rotation to the key coordinates corresponding to different frequency bands of the positional embedding. Crucially, in DeepSeek-V3’s implementation, the input to the RoPE step for keys comes from the original input embedding rather than the compressed latent (DeepSeek V3: Inside the Open-Source AI Model Rivaling GPT-4 - Zilliz blog). In simpler terms, they don’t rely on the compressed $\tilde{\mathbf{K}}_i$ alone to inject position – they incorporate direct information from the uncompressed representation for the positional part. The “decoupling” matrix $D$ likely selects or transforms the key in such a way that the positional components (e.g. the sinusoids) are isolated and can be accurately rotated.

    Practically, what this means is that only two small vectors per token need to be stored for caching: the compressed latent $\mathbf{z}_i$ and the decoupled position-aware key $\hat{\mathbf{K}}_i$ (or an equivalently small positional identifier). These are the “blue-boxed vectors” the paper refers to as being cached (DeepSeek-V3 Technical Report). The value vector doesn’t need an extra positional part (since positional info only affects attention weights via keys and queries), so $\tilde{\mathbf{V}}_i$ can be recreated on the fly from $\mathbf{z}_i$ when needed. By caching $\mathbf{z}_i$ (say 64 dims) instead of both $\mathbf{K}_i$ and $\mathbf{V}_i$ (256 dims total) – plus perhaps a 64-dim $\hat{\mathbf{K}}_i$ – the memory footprint is massively reduced (on the order of 25% or less of the original, consistent with the reported 5–13% range once all heads and overhead are accounted for) (DeepSeek’s Multi-Head Latent Attention Method - Horasis).

  2. Compressing Queries: MLA also compresses the query vectors, though queries are not cached (they exist only for the current token during inference). Compressing queries is useful during training to reduce activation memory and compute in backprop. They introduce a query compression dimension $d'_c$ (in DeepSeek-V3, $d'_c=1536$ for all heads combined) (DeepSeek-V3 Technical Report). The query compression works similarly: $$\mathbf{z}^q_i = W_{\downarrow}^q\,\mathbf{Q}_i, \qquad \tilde{\mathbf{Q}}_i = W_{\uparrow}^q\,\mathbf{z}^q_i, \qquad \hat{\mathbf{Q}}_i = R\,\tilde{\mathbf{Q}}_i, \qquad \mathbf{Q}^{(\text{rope})}_i = \text{RoPE}(\hat{\mathbf{Q}}_i,\ \text{position}=i).$$ Here $W_{\downarrow}^q$ and $W_{\uparrow}^q$ are the query down/up projection matrices, and $R$ is analogous to $D$ (a decoupling transform for the query) (DeepSeek-V3 Technical Report). The query is thus split into a content part $\tilde{\mathbf{Q}}_i$ and a positional part $\mathbf{Q}^{(\text{rope})}_i$. Interestingly, for queries, both parts can come from the compressed query (since the query is computed fresh for each new token, there is no need to mix in any original signal beyond what $R$ provides).

  3. Reconstructing attention outputs: After these steps, the attention mechanism proceeds with a mix of compressed and reconstructed vectors. The final query used for attention is effectively $$\mathbf{Q}^*_i = [\,\tilde{\mathbf{Q}}_i;\ \mathbf{Q}^{(\text{rope})}_i\,],$$ and similarly each past key has $$\mathbf{K}^*_j = [\,\tilde{\mathbf{K}}_j;\ \mathbf{K}^{(\text{rope})}_j\,].$$ In DeepSeek-V3’s architecture, the concatenation of the content and RoPE parts per head forms the full query/key actually used for attention (DeepSeek-V3 Technical Report). For example, each head has a content dimension of $d_h = 128$ reconstructed through the latent path, plus $d_h^R = 64$ “decoupled (RoPE) dimensions” carried alongside it (DeepSeek-V3 Technical Report). This way, each head’s $\mathbf{Q}^*$ and $\mathbf{K}^*$ are $(d_h + d_h^R)$-dimensional, and attention can operate in the usual way: $$\alpha_{i,j} = \mathrm{softmax}_j\!\Big(\frac{\mathbf{Q}^*_i \cdot \mathbf{K}^*_j}{\sqrt{d_h + d_h^R}}\Big), \qquad \text{AttnOutput}_i = \sum_{j \le i} \alpha_{i,j}\,\tilde{\mathbf{V}}_j, \qquad \mathbf{h}'_i = W_{\text{out}}\,\text{AttnOutput}_i,$$ where the sum runs over all past tokens $j$ up to $i$ (for autoregressive decoding) and $W_{\text{out}}$ is the output projection of the attention layer (DeepSeek-V3 Technical Report). Notice that $\tilde{\mathbf{V}}_j$ (the value reconstructed from the latent $\mathbf{z}_j$) is used in the weighted sum. This means at inference time, when a new token $i$ comes in, the model will: retrieve each cached $(\mathbf{z}_j, \hat{\mathbf{K}}_j)$ for $j < i$, rebuild $\tilde{\mathbf{K}}_j$ and $\tilde{\mathbf{V}}_j$ via $W_{\uparrow}^K, W_{\uparrow}^V$, apply RoPE to $\hat{\mathbf{K}}_j$ to get $\mathbf{K}^{(\text{rope})}_j$, form $\mathbf{K}^*_j$, and then do the dot-products with $\mathbf{Q}^*_i$. Although this sounds like extra work, the key point is that $\mathbf{z}_j$ and $\hat{\mathbf{K}}_j$ are much smaller to store and retrieve than the full $\mathbf{K}_j, \mathbf{V}_j$. The small overhead of reconstructing the full vectors on the fly is outweighed by the memory and bandwidth savings of moving around 5–10× less data per token. In fact, MLA can even speed up generation, because memory bandwidth is often the bottleneck in serving large models.
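
Putting the pieces together, a hedged single-head sketch of the decode step might look like the following (names, shapes, and the externally supplied RoPE-rotated inputs are assumptions for illustration):

import torch

def mla_decode_step(q_content, q_rope, z_cache, k_rope_cache,
                    W_up_K, W_up_V, W_out, d_h, d_rope):
    # z_cache: (T, d_c) cached latents; k_rope_cache: (T, d_rope) position-carrying keys
    K_content = z_cache @ W_up_K.T                     # (T, d_h) reconstructed keys
    V = z_cache @ W_up_V.T                             # (T, d_h) reconstructed values
    q_full = torch.cat([q_content, q_rope])            # content + RoPE query parts
    K_full = torch.cat([K_content, k_rope_cache], -1)  # content + RoPE key parts
    scores = K_full @ q_full / (d_h + d_rope) ** 0.5   # scaled dot-product scores
    attn = torch.softmax(scores, dim=-1)               # weights over past tokens
    return W_out @ (attn @ V)                          # output projection of the head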

To illustrate the memory advantage: the DeepSeek-V3 report notes that MLA reduces KV cache memory to 5–13% of what standard multi-head attention would require (DeepSeek’s Multi-Head Latent Attention Method - Horasis). This was validated up to 128K context lengths in their tests (the model can handle very long prompts efficiently) (DeepSeek-V3 Technical Report). Compared to alternative approaches like Multi-Query Attention (MQA) (which shares one key/value across all heads) or grouped-query attention, MLA provides a better accuracy-memory trade-off. It preserves per-head diversity (each head still has its own keys/values, just compressed) and the paper’s analysis shows MLA’s modeling capacity is on par with or even better than full multi-head attention (DeepSeek-V3 Explained 1: Multi-head Latent Attention | Towards Data Science). By choosing appropriate compression ranks (they set the joint KV latent to 512 for 128 heads, and query latent to 1536 for 128 heads (DeepSeek-V3 Technical Report)), they ensure minimal drop in attention expressiveness. In fact, the authors highlight that MLA slightly improved performance in some benchmarks, likely because it acts as a form of regularization and allowed them to use more heads (128 heads instead of perhaps 64) to compensate (DeepSeek-V3 Explained 1: Multi-head Latent Attention | Towards Data Science).

In summary, Multi-Head Latent Attention compresses keys and values into a smaller latent vector and only stores that, plus a small RoPE-adjustment vector, in the cache. When combined with efficient implementation, this yields huge memory savings for long contexts and speeds up inference (less data to shuffle around) (DeepSeek V3: Inside the Open-Source AI Model Rivaling GPT-4 - Zilliz blog) (DeepSeek V3: Inside the Open-Source AI Model Rivaling GPT-4 - Zilliz blog), all while maintaining accuracy comparable to standard attention (DeepSeek-V3 Technical Report). MLA is one of the cornerstones that enabled DeepSeek-V3 to handle large contexts and deploy on limited hardware without running into memory limits.

4. Optimization Analysis (Efficiency Strategies)

Beyond MoE and MLA, DeepSeek-V3 employed numerous additional optimizations to maximize training throughput and minimize costs:

  • FP8 Precision Training: DeepSeek-V3 is one of the first ultra-large models trained with 8-bit floating point precision for most operations. The team designed a mixed-precision framework using FP8 for matrix multiplies, while keeping certain sensitive parts in higher precision (DeepSeek-V3 Technical Report) (DeepSeek-V3 Technical Report). In their setup, all the heavy GEMM (General Matrix-Multiply) operations in forward and backward (including the feed-forward layers, attention projection layers, etc.) run in FP8, accumulating results in FP32 or BF16 outputs (DeepSeek-V3 Technical Report). By using NVIDIA Hopper GPUs (H800) which support FP8 Tensor Core operations, they effectively doubled the compute throughput for those layers compared to the typical BF16 precision (DeepSeek-V3 Technical Report). This is a massive speedup: matrix multiplies dominate training time, so running them at 2× speed translates to nearly 2× faster training, if the reduced precision doesn’t hurt convergence.

    The challenge with FP8 (which uses a format like E4M3 or E5M2, i.e. a 4- or 5-bit exponent) is the limited dynamic range – large outlier values can cause overflow or underflow, and tiny differences can be lost. DeepSeek-V3 mitigated this with fine-grained quantization strategies (DeepSeek-V3 Technical Report). Instead of using one scale factor for an entire matrix or tensor, they group elements into small tiles or blocks and compute a separate scale for each: tile-wise grouping of 1×128 elements for activations and block-wise grouping of 128×128 elements for weights (DeepSeek-V3 Technical Report). For example, an activation matrix is divided into 1×128 tiles and an FP8 scale is computed per tile (a simplified sketch of this per-tile quantization appears after this bullet list). This per-group scaling captures outliers better: even if one part of the tensor has very large values, it won’t force the entire tensor’s scale to be huge – only that tile’s scale adjusts. The technique is commonly used to handle activation outliers in low-precision training (it is akin to percentile clipping or per-channel scaling). The overhead of the extra scaling multiplies is kept low by their next trick: higher-precision accumulation.

    DeepSeek-V3 implements an “increased-precision accumulation process” in the GPU kernels (DeepSeek-V3 Technical Report). Essentially, while using Tensor Cores for FP8 matrix multiply-add (MMA) operations, they promote partial results to FP32 on CUDA cores for accumulation (DeepSeek-V3 Technical Report) (DeepSeek-V3 Technical Report). In other words, after multiplying FP8 inputs, the intermediate sum is converted to FP32 and accumulated in a standard register, rather than accumulating in FP8 or FP16. This avoids precision loss from summing many small FP8 values. Thakkar et al. (2023) is cited for this approach, which requires moving data between Tensor Cores and regular CUDA cores mid-operation (DeepSeek-V3 Technical Report). DeepSeek’s team acknowledges that doing this frequently incurs some overhead (data movement between cores) (DeepSeek-V3 Technical Report), but it significantly improves numerical accuracy. The fine-grained FP8 quantization plus FP32 accumulation ensured that training remained stable: they report that an FP8-trained model reached the same loss as a BF16-trained model within 0.25% relative error (DeepSeek-V3 Technical Report) – essentially negligible difference – over a 1 trillion token pilot run. This is a remarkable result: it validates that FP8 is feasible even for 100B+ scale models (DeepSeek-V3 Technical Report).

    Not every operation was done in FP8. DeepSeek-V3 keeps a few types of computations in higher precision (BF16 or FP32) where needed (DeepSeek-V3 Technical Report): for example, the embedding layers and softmax in attention, layer normalization, and the MoE gating softmax all remain in BF16/FP32 (DeepSeek-V3 Technical Report). These are either low-cost operations (so using higher precision has minimal speed impact) or particularly sensitive to quantization (softmax and normalization can suffer from precision issues). By “selectively retaining” full precision in those parts, they ensure no loss spikes or divergence occurs (DeepSeek-V3 Technical Report). Another place they used higher precision is in the optimizer states: while activations and weights might be stored in FP8 during computation, the AdamW optimizer accumulators (moments) were stored in BF16 (DeepSeek-V3 Technical Report). They also used BF16 for weight updates. This way, the long-term state of the model isn’t affected by 8-bit noise accumulation.

    Furthermore, they stored activations in FP8 whenever possible to save memory (DeepSeek-V3 Technical Report). For instance, after the forward pass, the activations needed for backward (like outputs of linear layers) were kept in FP8 instead of BF16. Because they had the FP8 GEMM kernels for weight gradients, they could feed those 8-bit activations directly into backward passes. This contributed to memory savings and bandwidth reduction (less data to read/write from GPU memory). Combined with the MoE and MLA savings, this allowed very large batch sizes and sequence lengths even on 80GB GPUs.

  • Custom GPU Kernel Tuning: DeepSeek’s team did not rely on stock deep learning frameworks alone; they actively co-designed kernels and algorithms to squeeze out performance. The earlier example of warp-specialized communication is one such kernel optimization at the system level. They also mention support for things like transposed GEMM operations in their pipeline (DeepSeek-V3 Technical Report). During backpropagation through attention, one often needs to multiply by transposed weight matrices. The standard libraries might not have FP8 kernels for all transpose cases, so DeepSeek likely wrote or tuned those. In their report, they even suggest future hardware changes to handle transposed matrix reads more efficiently for FP8 (to avoid an extra read-dequantize-transpose-requantize cycle) (DeepSeek-V3 Technical Report). This indicates they developed a workflow to handle these transposes under the hood, possibly by fusing operations or using the new Hopper Tensor Memory Accelerator (TMA) features in creative ways.

    Another custom optimization is the DualPipe scheduling. DualPipe refers to how they overlap pipeline-parallel workloads. DeepSeek-V3’s model is split into stages across GPUs (61 transformer layers will not fit on a single GPU); the report describes 16-way pipeline parallelism, i.e. 16 pipeline stages. DualPipe means that while one micro-batch of data is flowing forward through the pipeline, another micro-batch’s gradients are flowing backward – effectively utilizing the pipeline in both directions concurrently. This substantially reduces the “pipeline bubble” overhead. By carefully timing the forward and backward passes, they keep every pipeline stage busy almost all the time. The result was near-linear scaling efficiency even with pipeline parallelism. In their GitHub, they note “through co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, nearly achieving full computation-communication overlap” (GitHub - deepseek-ai/DeepSeek-V3). This co-design likely encompasses DualPipe (algorithmic pipeline schedule), custom NCCL communication tweaks (framework), and taking advantage of NVSwitch/NVLink topology (hardware).

    All-to-all communication for the MoE experts was heavily optimized. Normally, sending tokens to 256 experts across nodes could overwhelm network bandwidth. DeepSeek-V3 limited each token to at most 4 nodes (since it picks 8 experts and spreads them over at most 4 machines) (DeepSeek-V3 Technical Report). They also “aggregate IB traffic” such that a GPU sends one combined message to each remote node for all experts on that node (DeepSeek-V3 Technical Report). On receiving nodes, tokens are immediately forwarded via NVLink to the specific GPU hosting the expert, without waiting (this prevents IB from becoming sequential) (DeepSeek-V3 Technical Report). By overlapping IB and NVLink transfer, they hide latency. The chunking of communication – breaking the transfer into smaller pieces – was adjusted so that while one chunk is being processed by the compute kernel, the next chunk is transferring, etc. (DeepSeek-V3 Technical Report). The custom PTX mentioned earlier ensured that these chunks didn’t evict too much cache. In effect, the communication for MoE became async and pipeline-driven, just like the compute. This is a highly non-trivial engineering feat, turning what could be a barrier (sending data for 37B parameters worth of activations across 2048 GPUs each step) into a mostly overlapped task. They state that under this strategy, MoE communication only occupied the designated warps and did not bottleneck the overall training loop (DeepSeek-V3 Technical Report).

    Memory-wise, apart from FP8 compression and MLA, they did a couple of clever things: they shared the embedding and output layers of the model thanks to their Multi-Token Prediction (MTP) setup (DeepSeek-V3 Technical Report). MTP is a training objective where each token predicts multiple future tokens, which can be implemented with a small additional head. By co-locating and sharing parameters for the word embeddings and this output head on the same GPU, they saved memory (the embedding table is huge, but used in both places, so they store one copy) (DeepSeek-V3 Technical Report). They also offloaded some state to CPU (e.g., they kept a copy of model weights on CPU for doing an Exponential Moving Average to monitor training progress) (DeepSeek-V3 Technical Report). All gradient accumulation was done in 16-bit, and checkpoints were kept carefully to avoid duplicates. These all contribute to reducing the memory per GPU, which allows either bigger batches or larger model shards per GPU.
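
As referenced in the FP8 discussion above, here is a simplified sketch of per-tile activation quantization. The 1×128 tile shape and E4M3 range follow the report, but the PyTorch float8 conversion below is only a stand-in for the actual Tensor Core path:

import torch

FP8_MAX = 448.0   # maximum representable magnitude in the E4M3 format

def quantize_tiles_1x128(x):
    # x: (rows, cols) activation matrix with cols divisible by 128.
    # One scale per 1x128 tile, so a single outlier only inflates its own tile.
    tiles = x.reshape(x.shape[0], -1, 128)
    scale = (tiles.abs().amax(dim=-1, keepdim=True) / FP8_MAX).clamp(min=1e-12)
    q = (tiles / scale).to(torch.float8_e4m3fn)         # quantized FP8 tiles
    return q, scale                                     # per-tile scales stay in FP32

def dequantize(q, scale):
    # promote back to FP32, mirroring the higher-precision accumulation described above
    return q.to(torch.float32) * scale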

Benchmarking these optimizations: In ablation studies, DeepSeek-AI showed that FP8 training did not degrade performance notably (DeepSeek-V3 Technical Report), and the training remained remarkably stable – they encountered no irreversible loss spikes and never had to restart training from a checkpoint due to divergence (GitHub - deepseek-ai/DeepSeek-V3). This is impressive given the combination of techniques (MoE, FP8, large batch schedules) which can be unstable individually. The training of 14.8 trillion tokens completed without incident, suggesting their careful design (e.g. keeping certain operations in high precision, using RMSNorm which is more stable than LayerNorm, etc.) paid off.

In terms of speed, although exact throughput numbers aren’t given in the text, we can infer that each of the 2048 H800 GPUs in the cluster was kept busy. FP8 effectively doubled math throughput; overlapping communication hid most latency. The result is that DeepSeek-V3 could be trained in a reasonable timeframe (see the next section), whereas a naive implementation might have taken significantly longer or required more GPUs.

5. Economic Analysis

DeepSeek-V3 demonstrates that a GPT-4 class model can be trained at a fraction of the cost through clever engineering. According to reports, DeepSeek-AI trained the 671B-parameter model on a cluster of 2048 NVIDIA H800 GPUs over approximately 2 months (DeepSeek’s Multi-Head Latent Attention Method - Horasis). The total compute used was about 2.788 million GPU-hours on H800 (Hopper architecture) GPUs (DeepSeek-V3 Technical Report). In monetary terms, if one assumes roughly $2 per hour for an 80GB GPU, the pre-training cost comes out to about $5.6 million (DeepSeek-V3 Technical Report). Indeed, the authors explicitly calculate this: “2.788M H800 GPU hours… amount to only $5.576M” for the entire training run (DeepSeek-V3 Technical Report). Even adding expenses for fine-tuning and reinforcement learning phases, the total was likely around $6M or a bit more. This aligns with a quote by Andrej Karpathy highlighting “an open weights release of a frontier-grade LLM trained on a joke of a budget (2048 GPUs for 2 months, $6M)” (DeepSeek’s Multi-Head Latent Attention Method - Horasis).
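
Spelled out, the arithmetic behind those figures is simply:

$$\frac{2.788\times 10^{6}\ \text{GPU-hours}}{2{,}048\ \text{GPUs}\times 24\ \text{h/day}} \approx 57\ \text{days} \approx 2\ \text{months}, \qquad 2.788\times 10^{6}\ \text{GPU-hours} \times \$2/\text{GPU-hour} \approx \$5.58\text{M}.$$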

To put this in perspective, training costs for models like OpenAI’s GPT-4 have been estimated around $100M (unofficially) (DeepSeek V3: Inside the Open-Source AI Model Rivaling GPT-4 - Zilliz blog). While OpenAI hasn’t confirmed exact figures, Wired and other outlets reported sums in the tens of millions for GPT-4’s training. The DeepSeek-V3 budget is roughly an order of magnitude lower. Similarly, Meta’s LLaMA-2 70B model (much smaller) was rumored to cost a few million to train; DeepSeek-V3 at 671B (sparsely activated) coming in near $5-6M is exceptionally cost-effective. DeepSeek R1 (a reinforcement learning tuned model) also claims GPT-4 level reasoning at a fraction of the cost (DeepSeek V3: Inside the Open-Source AI Model Rivaling GPT-4 - Zilliz blog), emphasizing the company’s focus on efficiency.

How did DeepSeek achieve such cost reductions? It’s a combination of algorithmic innovation and scaling strategy:

  • Sparsity for Compute Efficiency: The MoE architecture is the prime factor. By activating only ~5% of the model’s parameters for each input, DeepSeek-V3 drastically cuts the FLOPs required per token compared to a dense model of similar size. In effect, it’s getting “671B-scale” model diversity at “37B-scale” compute cost per token (DeepSeek-V3 Technical Report). This is like having a huge ensemble of networks (experts) but only using a handful for each query. Thus, to achieve a certain loss or accuracy, DeepSeek-V3 doesn’t need to spend as much compute as a dense 671B model would. The team actually compared DeepSeek-V3 to dense models: it outperforms dense models of 70B and even competes with ones of 400B+ (LLaMA-3.1 405B) in many benchmarks (DeepSeek-V3 Technical Report), but with far less compute used during training. This means better “quality per FLOP”, which directly translates to lower cost for a given performance target.

  • Maximizing Hardware Utilization: The engineering optimizations (overlap, parallelism, etc.) ensured that the expensive GPU hardware was never idle. A significant portion of large-model training cost is actually GPU time wasted due to communication or load imbalance. DeepSeek’s near-100% utilization (by overlapping comm/compute and using pipeline and data parallelism effectively) means they got the most out of each of those 2048 GPUs. For example, if naive training only achieved 50% utilization due to waiting on data, you’d need twice as many GPU-hours (and thus twice the cost) to finish the job. DeepSeek’s co-design prevented such inefficiencies (GitHub - deepseek-ai/DeepSeek-V3). The per-step time was minimized, so running on 2048 GPUs for 60 days yielded a huge 14.8 trillion tokens processed, which is an excellent throughput.

  • Hardware Choices and Scalability: They used NVIDIA H800 GPUs, an export-compliant variant of the H100 with reduced interconnect bandwidth. DeepSeek mitigated that limitation algorithmically, and H800s were likely more accessible and cost-effective for them to procure. They also leveraged NVSwitch and NVLink within each 8-GPU node effectively, and InfiniBand across nodes. By limiting each token’s cross-node communication (max 4 nodes) (DeepSeek-V3 Technical Report), they kept the IB traffic manageable, which means they didn’t need an excessively expensive network infrastructure beyond the 2048-GPU cluster (no need for every GPU to talk to every other GPU at full speed). In essence, they struck a balance between model parallelism and data parallelism that fit the hardware well. The routed experts of each MoE layer are distributed across 64 GPUs spanning 8 nodes (64-way expert parallelism) (DeepSeek-V3 Technical Report), so each GPU hosts several experts – this mapping allowed scaling to 2048 GPUs with high efficiency without a blowup in communication cost.

  • Faster Convergence per Token: Another subtle point is their Multi-Token Prediction (MTP) training objective (each token predicts the next two tokens on average) (DeepSeek-V3 Technical Report). This could potentially accelerate training by learning from additional targets per sequence, and it also enables speculative decoding for faster inference (not directly a training cost thing, but an efficiency). If MTP helps convergence, they might reach a given perplexity in fewer token steps or with more parallel prediction per step, again saving cost.

  • No costly resets: The stability of training (no loss spikes, no restarts needed) (GitHub - deepseek-ai/DeepSeek-V3) also saved time. Some large runs encounter instabilities that require reverting to an earlier checkpoint, effectively wasting some compute. DeepSeek-V3’s smooth training meant every GPU hour spent was an hour of useful learning.

  • Leverage open source and community: While not a direct training cost factor, it’s worth noting DeepSeek open-sourced their model. This means they didn’t need to invest heavily in long secrecy or multiple redundant experiments in private; they could incorporate the latest techniques (like borrowing ideas from Bloom, LLaMA, etc.) and perhaps even got community feedback during development (DeepSeek-V2 was earlier, and V3 builds on it). Open-sourcing also spreads out the burden of evaluation and improvement, which economically can be seen as getting “free R&D” from the community post-release.

In summary, DeepSeek-V3’s training efficiency is a result of architectural choices that reduce required computation (MoE, MLA, MTP) and low-level optimizations that fully utilize hardware (FP8, kernel fusion, overlapping, etc.). The trade-offs they navigated include the added complexity of implementation and potential risk of newer techniques, but those clearly paid off. The model reached frontier performance (comparable to GPT-4 in many benchmarks as claimed) (DeepSeek V3: Inside the Open-Source AI Model Rivaling GPT-4 - Zilliz blog) (DeepSeek V3: Inside the Open-Source AI Model Rivaling GPT-4 - Zilliz blog) at a cost an order of magnitude lower than what was believed such models require. By investing in engineering, DeepSeek avoided the brute-force expense many others incurred. This democratizes high-end AI research – showing that with the right innovations, a team with a $5M budget and a modest GPU cluster can produce a model in the league of those made by tech giants. Indeed, DeepSeek’s achievement was to “make it look easy” with an open-source release of a model trained on “a joke of a budget” (DeepSeek’s Multi-Head Latent Attention Method - Horasis), illustrating that thoughtful optimizations can dramatically bend the cost curve of large-scale AI.

Overall, DeepSeek-V3’s development showcases a synergy of algorithmic advances (sparse MoE, latent attention, low-precision training) and systems engineering (custom CUDA kernels, pipeline parallelism, memory management) to push the boundaries of efficient large-model training. It serves as a blueprint for how future models might be trained more economically, bringing us one step closer to accessible AGI-level systems (DeepSeek V3: Inside the Open-Source AI Model Rivaling GPT-4 - Zilliz blog), and highlights the economic advantage of innovation in AI development.

Sources: The analysis above is based on the DeepSeek-V3 technical report (DeepSeek-V3 Technical Report) (DeepSeek-V3 Technical Report), the official GitHub documentation (GitHub - deepseek-ai/DeepSeek-V3) (GitHub - deepseek-ai/DeepSeek-V3), and insights from industry discussions and articles (DeepSeek’s Multi-Head Latent Attention Method - Horasis) (DeepSeek V3: Inside the Open-Source AI Model Rivaling GPT-4 - Zilliz blog). Each specific claim is backed by citations inline.
