URL: https://arxiv.org/pdf/2509.25149
Can you give me a very clear explanation of the core assertions, implications, and mechanics elucidated in this paper?
[THINKING]
This paper is about training Large Language Models (LLMs) using a new 4-bit floating point format called NVFP4. Let me break down the core assertions, implications, and mechanics.
Core Assertions:
- NVFP4 (a 4-bit floating point format) can be used to train very large language models successfully
- They trained a 12B parameter model on 10T tokens - the longest documented 4-bit training run
- NVFP4 achieves comparable results to FP8 training in terms of loss and downstream task accuracy
- NVFP4 is better than MXFP4 (another 4-bit format)
Implications:
- 2-3x speedup in compute compared to FP8
- Half the memory usage compared to FP8
- Makes training more efficient and cost-effective
Mechanics: The paper introduces several key technical innovations:
- NVFP4 format itself - uses smaller blocks (16 vs 32 elements), more precise scaling (E4M3 vs UE8M0), and two-level scaling
- Training methodology including:
- Mixed precision (keeping some layers in higher precision)
- Random Hadamard transforms to handle outliers
- 2D block scaling for weights
- Stochastic rounding for gradients
Let me provide a clear, structured explanation.
[ANSWER]
Core Assertions:
- FP4 training is viable at scale: The paper demonstrates that 4-bit floating-point precision can successfully train billion-parameter models over trillion-token horizons without significant accuracy loss.
- NVFP4 achieves near-parity with FP8: A 12B-parameter model trained on 10T tokens using NVFP4 matches FP8 baseline performance (e.g., 62.58% vs. 62.62% on MMLU-Pro).
- NVFP4 outperforms MXFP4: MXFP4 needs roughly 36% more training tokens to reach the loss NVFP4 achieves, making NVFP4 the superior 4-bit format.
- Multiple techniques are essential: Success requires a combination of methodologies - no single technique alone is sufficient at scale.
Performance Benefits:
- 2-3× faster computation than FP8 on NVIDIA Blackwell GPUs
- 50% reduction in memory usage compared to FP8
- Enables training larger models or using longer contexts with same hardware
Economic Impact:
- Reduces training time, compute costs, and energy consumption
- Makes frontier model training more accessible
- Potential billions in cost savings for large-scale training runs
Block Structure:
- Uses 16-element blocks (vs. 32 for MXFP4) → captures local dynamic range better
- Each element stored as E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit)
- Represents: ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}
Two-Level Scaling:
- Block-level (local): E4M3 scale factor per 16-element block
  - More precise than MXFP4's power-of-two (UE8M0) scaling
  - Preserves ~6.25% of values at near-FP8 precision
- Tensor-level (global): FP32 scale for the entire tensor
  - Ensures block scales fit in the E4M3 range
Why this matters: NVFP4 avoids wasting FP4 samples and preserves more dynamic range than MXFP4.
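To see the storage trade-off concretely, here is a back-of-envelope sketch (my own arithmetic, not taken from the paper): amortizing one 8-bit scale factor over each block, NVFP4's smaller blocks cost slightly more bits per element than MXFP4, and the single FP32 tensor scale adds essentially nothing.

```python
# Back-of-envelope storage cost per element (my own illustration, not from the paper).

def bits_per_element(block_size: int, scale_bits: int, element_bits: int = 4) -> float:
    """Average bits per tensor element, amortizing one shared block scale over the block."""
    return element_bits + scale_bits / block_size

nvfp4 = bits_per_element(block_size=16, scale_bits=8)  # E4M3 (FP8) scale per 16 elements
mxfp4 = bits_per_element(block_size=32, scale_bits=8)  # UE8M0 scale per 32 elements

print(f"NVFP4: {nvfp4:.2f} bits/element")  # 4.50
print(f"MXFP4: {mxfp4:.2f} bits/element")  # 4.25
```

That extra quarter bit per element is the price NVFP4 pays for scale factors that track local dynamic range much more accurately.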
The paper introduces four critical techniques:
1. Mixed-precision layers:
- Keep ~15% of layers (mostly final layers) in BF16/MXFP8
- Final layers are more sensitive to quantization
- Critical for preventing divergence
2. Random Hadamard transforms (RHT):
- Applied to weight-gradient (Wgrad) inputs only
- Uses 16×16 Hadamard matrices with random sign vectors
- Purpose: redistributes outliers into an approximately Gaussian distribution
- Makes values easier to represent in the FP4 range
- Not applied to the weights themselves (that would break the 2D-scaling benefits)
3. 2D block scaling for weights:
- Weights scaled in 16×16 blocks (covering both input and output channels)
- Activations/gradients use standard 1×16 blocks
- Critical insight: ensures a consistent quantized representation in the forward AND backward passes
- Prevents chain-rule violations (the backward pass differentiates the same function as the forward pass)
4. Stochastic rounding:
- Applied to gradient tensors only
- Prevents systematic bias from deterministic rounding
- Rounds probabilistically between the two nearest representable values
- Essential for convergence; deterministic rounding causes divergence
Forward Pass:
Input (BF16) → Quantize to NVFP4 → GEMM (FP4) → Output (BF16/FP32)
Backward Pass (Wgrad):
Gradient (BF16) → Hadamard Transform → Quantize (stochastic rounding) → GEMM (FP4) → Weight Gradient (BF16)
Weight Updates:
- Weights stored in FP32 (optimizer precision)
- Quantized to NVFP4 with 2D scaling for each GEMM
- Ensures same quantization in forward/backward via 2D blocks
The ablation studies (Figure 4) show that removing ANY component degrades performance:
- No stochastic rounding: ~1% relative loss increase
- No RHT: ~0.5% relative loss increase
- No 2D scaling: ~0.75% relative loss increase
- Fewer high-precision layers: Convergence instability
Key insights:
- Scale matters: Techniques not needed at 1B scale become essential at 12B scale over 10T tokens
- Forward pass is the bottleneck: Most accuracy loss comes from quantizing forward-pass tensors (and can be recovered by switching to BF16 in the final training phase)
- Weights are sensitive: Weight quantization inconsistency (forward vs. backward) hurts accuracy more than activation inconsistency
- Format design matters: NVFP4's numerical advantages over MXFP4 translate to real training-efficiency gains (MXFP4 needs ~36% more tokens to reach the same loss)
This represents a significant algorithmic advancement enabling practical 4-bit LLM training at unprecedented scale.
Can you explain the value of this in basic terms? Like you're talking to a CEO. So what? What's the bottom line here?
[THINKING]
The user wants me to explain this paper's value in simple, business-oriented terms to a CEO. I need to focus on:
- What problem this solves
- What the business impact is
- Why it matters
- Bottom line numbers
Let me strip away the technical jargon and focus on dollars, speed, and competitive advantage.
[ANSWER]
NVIDIA just figured out how to train AI models 2-3× faster while using half the memory, with virtually no loss in quality.
Training frontier AI models is brutally expensive:
- A single large model training run costs $50-200+ million in compute
- Takes months to complete
- Uses tens of thousands of GPUs
- Consumes megawatts of power
Every percentage point of efficiency improvement translates to millions of dollars saved and weeks of time reclaimed.
They trained a 12-billion parameter model on 10 trillion tokens using 4-bit precision (NVFP4) instead of the industry-standard 8-bit (FP8):
Results:
- ✓ Same model quality (62.58% vs. 62.62% accuracy)
- ✓ 2-3× faster computation
- ✓ 50% less memory required
- ✓ First successful multi-trillion-token training run at 4-bit precision
Let's use a concrete example:
Traditional FP8 Training:
- 25,000 GPUs × 90 days
- ~$100M in compute costs
- ~50 megawatts continuous power
With NVFP4 (conservative 2× speedup):
- 25,000 GPUs × 45 days, OR
- 12,500 GPUs × 90 days
- ~$50M in compute costs (50% reduction)
- ~25 megawatts (50% reduction in power)
With the same budget, you could:
- Train 2× as many model variants (more experiments = better models)
- Train on 2× more data (better model quality)
- Reach market 2× faster (competitive advantage)
- Faster iteration = more shots on goal to find breakthrough models
- Compress 6-month training runs into 3 months
- Ship improved models to customers twice as fast
- Train competitive models with half the capital expenditure
- Better unit economics for AI products
- More sustainable path to profitability
- Same infrastructure can train larger models or process more data
- Infrastructure investments go 2× further
- Can afford to train models competitors cannot
- 50% reduction in energy consumption per training run
- Better ESG story
- Reduced exposure to energy cost volatility
This matters now because:
- Model size is still growing: GPT-4 → GPT-5 scale models cost 5-10× more to train
- Competition is intensifying: OpenAI, Anthropic, Google, Meta all racing to build better models
- Hardware refresh cycle: New NVIDIA Blackwell GPUs have native FP4 support (GB300 offers a 6× speedup over BF16)
Implementation complexity: Requires:
- Careful engineering (multiple specialized techniques working together)
- New training recipes (can't just flip a switch)
- Hardware support (needs latest NVIDIA Blackwell GPUs)
But NVIDIA is providing the code and methodology, so the barrier to adoption is manageable.
1. Cloud AI Providers (Microsoft, Google, Amazon):
- Better margins on AI training services
- Can offer lower prices or higher capacity
2. AI Model Builders (OpenAI, Anthropic, Meta, etc.):
- Train more models, faster, cheaper
- Competitive advantage in model quality
3. NVIDIA:
- Drives Blackwell GPU adoption
- Deepens moat in AI infrastructure
4. Enterprises Building Custom Models:
- Makes fine-tuning and domain-specific models more affordable
- Democratizes access to large-scale training
In 2-3 years, this becomes table stakes. Everyone training large models will use 4-bit precision. The question is:
- Do you adopt early and get 12-18 months of cost/speed advantage?
- Or wait and let competitors train 2× as many models as you in the same timeframe?
In AI, speed and scale compound. The team that can run twice as many experiments finds breakthrough architectures first, ships better products faster, and captures market share that's hard to reclaim.
If you're training models at scale:
- Evaluate immediately: Pilot NVFP4 on next training run
- Plan hardware refresh: Blackwell GPUs unlock the full benefit
- Train your team: New techniques require expertise
Expected ROI: 40-50% reduction in training costs within 12 months, with potential for a 2× improvement in time-to-market.
TL;DR: This is like going from dial-up to broadband for AI training. Same destination, half the time, half the cost. First movers win.
Can you give me an analogy or metaphor that will help explain this to a broad audience?
[THINKING]
The user wants an analogy or metaphor to explain NVFP4 training to a broad audience. I need something that:
- Captures the essence of using lower precision to get faster/cheaper results
- Makes sense to non-technical people
- Illustrates why it's tricky but valuable
- Shows the trade-offs and why the methodology matters
Let me think of some options:
- Photography analogy - resolution vs file size
- Shipping/logistics - precision vs speed
- Manufacturing - tolerances
- Communication/language - vocabulary size
- Money/currency - denominations
- Painting - detail vs speed
- Maps - resolution vs usability
I think a photography/camera analogy could work well, or maybe a manufacturing one. Let me try a few and pick the best.
Actually, I think a "language/vocabulary" or "precision in communication" analogy might work really well because it directly relates to how much detail you need to convey information accurately.
Or maybe a construction/blueprints analogy - building something with different levels of measurement precision.
Let me go with multiple analogies to cover different aspects, starting with the strongest one.
[ANSWER]
Imagine you're building a custom dining table:
Traditional Method (FP8 - 8-bit precision): You use a ruler marked in 1/8-inch increments. Very precise, but each measurement takes time to read carefully, and you fill up your notebook with detailed measurements.
New Method (NVFP4 - 4-bit precision): You use a ruler marked in 1/2-inch increments. Half as many markings, twice as fast to read, half as much to write down.
The Problem: If you build the entire table with 1/2-inch precision, the legs might wobble and joints might not fit perfectly.
The Solution (NVFP4 methodology):
- Use 1/2-inch precision for rough cuts and the main structure (fast, good enough)
- Switch to 1/8-inch precision for critical joints and the final assembly (precise where it matters)
- Use special techniques to make sure errors don't compound
The Result: You build the table in half the time, with half the measurements written down, and it still looks and functions perfectly.
FP8 training = Taking photos at 12 megapixels
- High quality, large file sizes, takes longer to process and store
NVFP4 training = Taking photos at 3 megapixels with a really smart camera
- Smaller files, much faster to process and store
- But uses AI to figure out which parts need detail (faces, text) and which don't (sky, background)
- Final photo looks just as good for most purposes
The trick: The camera is smart about where to preserve detail and where to compress aggressively.
FP8 training = Shipping packages with tracking updates every mile
- Very precise location data
- Uses more bandwidth and processing
- More expensive
NVFP4 training = Shipping with tracking updates every 5 miles
- Less frequent updates (lower precision)
- Much cheaper and faster
- Package still arrives at the exact same place, same time
The key insight: You don't need to know the truck's location every single mile to deliver the package successfully. Strategic checkpoints are enough.
FP8 = Paying for everything in pennies
- Can represent any price exactly: $42.67
- Requires counting lots of coins
- Heavy to carry around
NVFP4 = Paying for most things in quarters, pennies only for exact change
- Faster transactions (fewer coins to count)
- Half the weight in your pocket
- Still get correct change at the end
Why it works: Most transactions don't need penny-level precision at every step, only at the final total.
FP8 = Gourmet recipe with precise measurements
- "Add 2.375 teaspoons of salt"
- "Simmer at 203.5Β°F for 42.3 minutes"
- Very precise, but tedious and time-consuming
NVFP4 = Practical recipe with smart shortcuts
- "Add about 2.5 teaspoons" for most ingredients
- "Add exactly 2.375 teaspoons" for the critical baking powder (where precision matters)
- "Simmer until reduced by half" (goal-oriented rather than precise)
Result: Dish tastes identical, but you cut cooking time in half.
Here's how I'd explain the whole paper to a non-technical person:
"Training AI is like building thousands of complex machines simultaneously, and you need to measure and adjust millions of tiny components billions of times.
Traditionally, we use very precise measuring tools (8-bit floating point) - like a ruler marked in 1/8-inch increments. This works great, but it's slow and requires writing down huge amounts of data.
NVIDIA figured out how to use simpler measuring tools (4-bit floating point) - like a ruler with only 1/2-inch markings - for most of the work. This is twice as fast and requires half as much record-keeping.
But here's the catch: if you use less precise measurements everywhere, you'll build a wobbly machine that doesn't work properly.
So they developed a clever system:
- Use simple measurements for most components (where 1/2-inch precision is fine)
- Switch to precise measurements for critical joints (where perfect fit matters)
- Add a special pre-check (Hadamard transforms) that shakes out defects before they compound
- Use probabilistic rounding to prevent systematic errors from accumulating
The result? They built a massive AI model - 12 billion parameters, trained on 10 trillion pieces of data - that works just as well as one built with precise tools, but in half the time and half the cost.
Think of it like this: if building an AI model normally costs $100 million and takes 3 months, this technique could get you there for $50 million in 6 weeks - with the same quality."
"NVFP4 is like discovering you can build a skyscraper twice as fast and for half the cost by using simpler tools for most of the work, while still being super precise where it really matters."
"NVIDIA found a way to train AI models 2Γ faster using 'lower resolution math' - like taking 3MP photos that look as good as 12MP because the camera knows where detail matters. Could cut AI training costs in half. π§΅"
The furniture-building analogy tends to resonate most because everyone understands the difference between rough cuts and final assembly, and why you'd want different levels of precision for each.
Can you provide a specific example that best illustrates the key points elucidated in this paper?
[THINKING]
The user wants a specific, concrete example that illustrates the key points from the paper. I should create an example that:
- Shows the actual mechanics of NVFP4
- Demonstrates why the methodology components are necessary
- Uses real numbers/scenarios from the paper
- Makes the technical concepts concrete and understandable
I think the best approach would be to walk through a specific scenario - maybe showing:
- What happens during a single layer computation
- How quantization works with specific numbers
- Why each technique is needed (with examples of what goes wrong without them)
Let me create a detailed, specific example that ties together all the key concepts.
[ANSWER]
Let me walk you through exactly what happens when training a single neural network layer using NVFP4, showing why each technique matters.
We're training the 12B parameter model. Let's focus on one linear layer that transforms 5,120 inputs into 5,120 outputs during forward pass.
The computation: Y = W × X
- W (weights): 5,120 × 5,120 matrix = 26.2 million parameters
- X (activations): batch of input data
- Y (output): result we'll use in next layer
In traditional training, W and X are stored in BF16 (16 bits each). With NVFP4, we store them in 4 bits.
Let's say one block of 16 weight values looks like this:
[6.2, -4.1, 2.7, -0.9, 3.3, 1.1, -2.2, 0.5,
4.8, -1.7, 3.9, -0.3, 2.1, -5.6, 1.8, -3.4]
E2M1 (the FP4 format) can only store: ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}
So 6.2 doesn't fit, 2.7 doesn't fit, -0.9 doesn't fit... most values don't fit!
Step 1 - Global Tensor Scale (FP32):
- Find the max absolute value across the entire weight matrix: it's 6.2 within this block, but say it's 10.8 across all 26.2M weights
- Compute the global scale: s_enc = (6 × 448) / 10.8 = 248.89
- This remaps the entire tensor so the biggest value becomes representable
Step 2 - Local Block Scale (E4M3 in FP8):
- Find the max in this block: 6.2
- After global scaling: 6.2 × 248.89 / 10.8 = 143.0 (still too big for the FP4 max of 6!)
- Compute the block scale: s_dec,block = 143.0 / 6 = 23.83
- Store it as E4M3: 24 (the nearest value FP8 can actually represent)
Step 3 - Quantize Each Value:
Original → After global scale → After block scale → FP4
6.2  →  143.0 →  5.96 →  6
-4.1 →  -94.7 → -3.95 → -4
2.7  →   62.3 →  2.60 →  3 (2.5 is not representable; the nearest E2M1 value is 3)
-0.9 →  -20.8 → -0.87 → -1
3.3  →   76.2 →  3.18 →  3
...
During computation, the Tensor Core automatically reconstructs each value from its code and scales:
6 × 24 × (10.8 / 248.89) ≈ 6.25 ≈ 6.2 (about 1% reconstruction error)
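If it helps to see the whole round trip in code, here is a small numpy sketch of the two-level scaling applied to the block above. It is my own simplification, not the paper's implementation: the block scale is kept in full precision instead of being rounded onto the E4M3 grid, so the intermediate scale values will not line up digit for digit with the 23.83/24 above, but the resulting 4-bit codes do match the table.

```python
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable FP4 magnitudes

def round_to_e2m1(x):
    """Round each (already scaled) value to the nearest signed E2M1 value."""
    idx = np.argmin(np.abs(np.abs(x)[:, None] - E2M1[None, :]), axis=1)
    return np.sign(x) * E2M1[idx]

def quantize_block(block, tensor_amax):
    """Two-level scaling sketch: FP32 tensor scale, per-block scale, E2M1 codes."""
    tensor_scale = tensor_amax / (448.0 * 6.0)     # FP32 scale chosen so block scales fit the E4M3 range
    scaled = block / tensor_scale                  # bring the tensor into the E4M3 x E2M1 range
    block_scale = np.max(np.abs(scaled)) / 6.0     # per-16-element scale (stored as E4M3 in hardware)
    codes = round_to_e2m1(scaled / block_scale)    # the 4-bit codes
    reconstructed = codes * block_scale * tensor_scale
    return codes, reconstructed

block = np.array([6.2, -4.1, 2.7, -0.9, 3.3, 1.1, -2.2, 0.5,
                  4.8, -1.7, 3.9, -0.3, 2.1, -5.6, 1.8, -3.4])
codes, recon = quantize_block(block, tensor_amax=10.8)
print(codes)                          # [ 6. -4.  3. -1.  3. ...] -- matches the table above
print(np.max(np.abs(recon - block)))  # worst-case reconstruction error for this block
```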
What makes NVFP4 better than MXFP4:
- MXFP4 uses 32-element blocks → each scale must cover a wider dynamic range → worse fit
- MXFP4 uses power-of-two (UE8M0) scales (e.g., 16, 32), so the ideal block scale of 23.83 gets forced up to 32 → much coarser scaling
- With a scale of 32, the block's largest value encodes as 143 / 32 ≈ 4.5 → quantizes to 4; reconstruction gives 4 × 32 × (10.8 / 248.89) ≈ 5.6, roughly a 10% error on that value, versus about 1% with the E4M3 scale of 24
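A tiny sketch of the scale-rounding difference (my own illustration; whether the power-of-two scale is rounded up or down is a design choice, and rounding up is used here to avoid clipping):

```python
import math

def ue8m0_scale(ideal_scale: float) -> float:
    """UE8M0 (MXFP4-style) scale: snap to a power of two, rounding up here to avoid clipping."""
    return 2.0 ** math.ceil(math.log2(ideal_scale))

ideal = 143.0 / 6.0          # ideal block scale from the example (~23.83)
print(ue8m0_scale(ideal))    # 32.0 -> the largest value now encodes as 143/32 ~= 4.5 instead of 6
print(24.0)                  # an E4M3 scale can sit right next to the ideal 23.83
```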
During training (say, 5.2 trillion tokens in), the gradient for one weight block looks like this:
[-0.001, 0.002, -0.001, 0.003, -0.002, 0.001, -0.001, 0.002,
-0.001, 0.002, 84.7, -0.001, 0.003, -0.002, 0.001, -0.002]
Notice: one massive outlier (84.7), while everything else is tiny (~0.001-0.003).
Without the Hadamard transform:
- Block max = 84.7
- Scale factor: 84.7 / 6 = 14.12
- Small values after scaling: 0.001 / 14.12 = 0.00007
- This underflows to 0 in FP4!
- Result: we lose almost all of the gradient information → the model stops learning
With the Random Hadamard Transform (illustrative numbers):
Apply a random rotation matrix that spreads the outlier across the block:
H × gradient_block =
[-2.1, 3.8, -1.9, 4.2, -2.8, 3.1, -2.3, 3.6,
 -2.0, 3.7, 5.8, -1.8, 4.5, -2.6, 2.9, -3.2]
Now:
- Block max = 5.8 (much smaller!)
- Scale factor: 5.8 / 6 = 0.97
- A transformed value of 1.8 after scaling: 1.8 / 0.97 = 1.86 → quantizes to 2 ✓
- Result: gradient information is preserved → the model keeps learning
The math: the dot product undoes the transform automatically, because H is orthogonal:
(H × gradient)^T × (H × activation) = gradient^T × (H^T × H) × activation = gradient^T × activation ✓
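Here is a short numpy sketch (my own, not the paper's code) of both properties: a random-sign Hadamard rotation spreads a single outlier across the block, shrinking the maximum that sets the block scale by roughly √16 = 4×, and because the same orthogonal transform hits both GEMM operands, it cancels in the dot product.

```python
import numpy as np

def random_hadamard(n: int, seed: int = 0) -> np.ndarray:
    """Orthonormal n x n Hadamard matrix (n a power of two) with a random sign per column."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])           # Sylvester construction
    signs = np.random.default_rng(seed).choice([-1.0, 1.0], size=n)
    return (H / np.sqrt(n)) * signs

H = random_hadamard(16)

g = np.full(16, 0.002)
g[10] = 84.7                                      # gradient block with one huge outlier
x = np.random.default_rng(1).normal(size=16)      # an activation block

print(np.max(np.abs(g)), np.max(np.abs(H @ g)))   # 84.7 vs. roughly 84.7/sqrt(16) ~= 21: the outlier is spread out
print(np.allclose(g @ x, (H @ g) @ (H @ x)))      # True: H is orthogonal, so the transform cancels in the dot product
```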
Forward pass (computing Y = W Γ X):
Weight matrix W (transposed for forward):
[w1 w2 w3 w4]
[w5 w6 w7 w8]
[w9 w10 w11 w12]
[w13 w14 w15 w16]
Quantize along rows (input channels):
Block 1: [w1, w2, w3, w4] → scale_row1 = 2.3
Block 2: [w5, w6, w7, w8] → scale_row2 = 1.8
...
Backward pass (computing the activation gradient, ∇X = W^T × ∇Y):
Same weight matrix (now accessed by columns for W^T × ∇Y):
[w1 w2 w3 w4]
[w5 w6 w7 w8]
[w9 w10 w11 w12]
[w13 w14 w15 w16]
If we quantize along columns (output channels):
Block 1: [w1, w5, w9, w13] → scale_col1 = 3.1 (DIFFERENT!)
Block 2: [w2, w6, w10, w14] → scale_col2 = 2.7 (DIFFERENT!)
...
The Problem:
- Forward pass: w1 scaled by 2.3 → quantized to some FP4 value, say 3
- Backward pass: w1 scaled by 3.1 → quantized to a different FP4 value, say 2
- Chain-rule violation: the gradient doesn't match the function used in the forward pass!
Group weights into 16×16 squares:
[w1 w2 w3 w4 | w17 w18 w19 w20]
[w5 w6 w7 w8 | w21 w22 w23 w24]
[w9 w10 w11 w12 | w25 w26 w27 w28]
[w13 w14 w15 w16 | w29 w30 w31 w32]
-----------------------------------------
[w33 w34 w35 w36 | w49 w50 w51 w52]
...
Block scale computed over the entire 16×16 square = 2.5
Result:
- Forward pass: w1 scaled by 2.5
- Backward pass: w1 scaled by 2.5 ✓
- Same quantized value in both passes → chain rule preserved!
From the paper (Figure 4), removing 2D scaling:
- Training loss relative error increases by ~0.75%
- At 10T tokens, that's the difference between a model that works and one that's noticeably degraded
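A minimal sketch of the consistency argument (my own illustration, not the paper's code): with 1D scaling, the scale attached to a given weight depends on whether you block along rows or columns, while a 16×16 (2D) tile yields the same scale no matter which way the GEMM reads the matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 32))   # toy weight matrix
BLOCK = 16

def row_block_scale(W, i, j):
    """1D scaling along rows (forward-pass layout): scale of the block containing W[i, j]."""
    block = W[i, (j // BLOCK) * BLOCK:(j // BLOCK + 1) * BLOCK]
    return np.max(np.abs(block)) / 6.0

def col_block_scale(W, i, j):
    """1D scaling along columns (backward-pass layout): scale of the block containing W[i, j]."""
    block = W[(i // BLOCK) * BLOCK:(i // BLOCK + 1) * BLOCK, j]
    return np.max(np.abs(block)) / 6.0

def block2d_scale(W, i, j):
    """2D scaling: one scale per 16x16 tile, identical however the GEMM reads W."""
    r, c = (i // BLOCK) * BLOCK, (j // BLOCK) * BLOCK
    return np.max(np.abs(W[r:r + BLOCK, c:c + BLOCK])) / 6.0

i, j = 3, 7
print(row_block_scale(W, i, j), col_block_scale(W, i, j))  # generally different -> W quantizes differently per pass
print(block2d_scale(W, i, j), block2d_scale(W.T, j, i))    # identical -> same quantized W in forward and backward
```

The design choice this illustrates: paying for a slightly coarser scale granularity on weights (one scale per 256 values instead of per 16) buys an identical quantized weight tensor in both passes.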
Consider accumulating small gradient updates over 1,000 microbatches. Suppose a gradient value of 0.38 lands at 0.9 in FP4 units after block scaling.
With Round-to-Nearest (Deterministic):
- The nearest representable FP4 values are 0.5 and 1.0
- 0.9 rounds to 1.0 every single time
- After 1,000 steps: accumulated value = 1.0 × 1000 = 1000 (in scaled units)
- True accumulated value: 0.9 × 1000 = 900
- Systematic bias of roughly 11%, always in the same direction ✗
With Stochastic Rounding:
- The value 0.9 sits between 0.5 and 1.0: distance 0.1 to 1.0, distance 0.4 to 0.5
- Round to 1.0 with probability 0.4 / (0.4 + 0.1) = 80%
- Round to 0.5 with probability 20%
- Expected value per step: 0.8 × 1.0 + 0.2 × 0.5 = 0.9 - exactly the true value
- After 1,000 steps: ≈ 900 in expectation, with only small random variance and no systematic bias ✓
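A quick numpy sketch of that bias argument (my own, reusing the 0.9 example): round-to-nearest lands on 1.0 every time, while stochastic rounding is correct on average.

```python
import numpy as np

def stochastic_round(x: float, lo: float, hi: float, rng) -> float:
    """Round x to lo or hi with probability proportional to proximity (unbiased in expectation)."""
    p_hi = (x - lo) / (hi - lo)
    return hi if rng.random() < p_hi else lo

rng = np.random.default_rng(0)
x, lo, hi, steps = 0.9, 0.5, 1.0, 1000

rtn_sum = steps * 1.0                                                # round-to-nearest: 0.9 -> 1.0 every step
sr_sum = sum(stochastic_round(x, lo, hi, rng) for _ in range(steps))

print(rtn_sum, sr_sum, steps * x)   # ~1000 vs. ~900 vs. 900: SR tracks the true sum, RTN is biased high
```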
Without stochastic rounding on gradients (Figure 4):
- Training diverges completely at 12B scale
- Model never converges to reasonable loss
- Makes the difference between success and total failure
Last linear layer (the "output head") converts 5,120 hidden dimensions to 51,200 vocabulary tokens.
Typical weight distribution:
99% of weights: between -0.01 and +0.01
1% of weights: between -10.0 and +10.0 (output layer has wider distribution)
With FP4 quantization:
- The block scale must accommodate 10.0 → scale = 10.0 / 6 = 1.67
- Small values: 0.01 / 1.67 = 0.006 → underflows to 0 ✗
- We lose 99% of the weights!
Real experiment (Appendix E.2):
- Quantize all layers, including the last 8 blocks → training diverges
- Keep the last 8 blocks in BF16 → training converges ✓
- Keep the last 4 blocks in BF16 → training converges ✓
- Keep only the last 1 block in BF16 → training diverges
Solution: Keep ~15% of layers (mostly at the end) in BF16.
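As a rough sketch of what that choice might look like in a training config (purely illustrative: the helper name, the 48-block depth, and the tail size are my own assumptions, guided only by the ~15% figure and the Appendix E.2 summary above):

```python
def fp4_layer_mask(num_blocks: int, num_bf16_tail_blocks: int = 8):
    """Return True for transformer blocks whose linear layers get quantized to NVFP4.

    The last `num_bf16_tail_blocks` blocks (the most quantization-sensitive ones,
    per the experiments summarized above) stay in BF16.  Hypothetical helper, not
    from the paper's released code.
    """
    return [i < num_blocks - num_bf16_tail_blocks for i in range(num_blocks)]

mask = fp4_layer_mask(num_blocks=48, num_bf16_tail_blocks=8)
print(sum(mask), "of", len(mask), "blocks in NVFP4")  # 40 of 48 -> the final 8 stay in BF16
```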
Here's what happens in one training iteration of a single layer:
- Load weights W from FP32 (optimizer storage)
- Apply 2D scaling (16×16 blocks) and quantize to NVFP4
- Load activations X, apply 1D scaling (1×16 blocks), quantize to NVFP4
- Tensor Core computes: Y = W_fp4 × X_fp4 → output in FP32
- Pass Y to the next layer
- Receive the gradient ∇Y from the next layer
- Compute activation gradients (Dgrad):
  - Quantize ∇Y to NVFP4 (1D scaling, round-to-nearest)
  - Use the same W quantization as the forward pass (2D scaling ensures consistency)
  - ∇X = W_fp4^T × ∇Y_fp4
- Compute weight gradients (Wgrad):
  - Apply the 16×16 Hadamard transform to ∇Y → spreads outliers
  - Quantize the transformed ∇Y to NVFP4 with stochastic rounding
  - Apply the Hadamard transform to X
  - Quantize the transformed X to NVFP4
  - ∇W = X_transformed_fp4^T × ∇Y_transformed_fp4
  - The Hadamards cancel in the dot product → correct gradient ✓
- Update weights:
  - Accumulate ∇W in FP32
  - Pass to the optimizer (Adam)
  - Update W in FP32
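To tie the walkthrough together, here is a self-contained numpy sketch that emulates the numerics of one layer step. It is my own construction, not the paper's kernels: it fake-quantizes in float instead of packing real FP4 bits, uses a single 16×16 tile so block boundaries are trivial, and omits the per-tensor FP32 scale for brevity.

```python
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable FP4 magnitudes

def round_e2m1(x):
    """Round-to-nearest onto the signed E2M1 grid."""
    idx = np.argmin(np.abs(np.abs(x)[..., None] - E2M1), axis=-1)
    return np.sign(x) * E2M1[idx]

def sround_e2m1(x, rng):
    """Stochastic rounding onto the signed E2M1 grid (unbiased in expectation)."""
    mag = np.clip(np.abs(x), 0.0, 6.0)
    hi = np.searchsorted(E2M1, mag, side="left").clip(1, len(E2M1) - 1)
    lo_v, hi_v = E2M1[hi - 1], E2M1[hi]
    p_hi = (mag - lo_v) / (hi_v - lo_v)
    return np.sign(x) * np.where(rng.random(x.shape) < p_hi, hi_v, lo_v)

def fake_quant(x, axis, rounding="nearest", rng=None):
    """Emulate FP4 numerics: scale so the block max maps to 6, round to E2M1, scale back.
    axis=None -> one scale for the whole 16x16 tile (2D weight scaling);
    an integer axis -> one scale per 16-element block along that axis (1D scaling)."""
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / 6.0
    q = sround_e2m1(x / scale, rng) if rounding == "stochastic" else round_e2m1(x / scale)
    return q * scale

def hadamard16():
    """Orthonormal 16x16 Hadamard matrix (Sylvester construction)."""
    H = np.array([[1.0]])
    while H.shape[0] < 16:
        H = np.block([[H, H], [H, -H]])
    return H / 4.0

rng = np.random.default_rng(0)
H = hadamard16()
W = rng.normal(size=(16, 16))    # one 16x16 weight tile (FP32 master copy)
X = rng.normal(size=(16, 16))    # activations: 16 input features x 16 tokens
dY = rng.normal(size=(16, 16))   # incoming output gradient: 16 output features x 16 tokens

# Forward: 2D-scaled weights, activations scaled along the reduction (feature) axis
Wq = fake_quant(W, axis=None)
Y = Wq @ fake_quant(X, axis=0)

# Dgrad: reuse the SAME 2D-quantized weights so the backward pass matches the forward pass
dX = Wq.T @ fake_quant(dY, axis=0)

# Wgrad: Hadamard along the reduction (token) axis, stochastic rounding on the gradient
dW = fake_quant(dY @ H.T, axis=1, rounding="stochastic", rng=rng) @ fake_quant(X @ H.T, axis=1).T

exact = dY @ X.T
print(np.linalg.norm(dW - exact) / np.linalg.norm(exact))  # relative quantization error of the weight gradient
```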
Memory impact:
- Weights: 26.2M parameters × 4 bits ≈ 13 MB (vs. ~52 MB in FP16)
- Activations: similar 4× reduction
- Total for the 12B model: ~6 GB of weights (vs. ~24 GB in FP16)
Speed impact:
- Tensor Cores on GB300: 6× faster FP4 math vs. BF16
- Memory bandwidth: 2× better (less data to move)
- Effective speedup: ~3-4× (accounting for overhead)
For the actual 12B parameter model trained on 10T tokens:
Without these techniques:
- Training diverges (loss explodes to infinity)
- Model is unusable
With all techniques:
- Loss: 1.26 (FP8) vs 1.28 (NVFP4) - only 1.5% worse
- MMLU-Pro: 62.62% (FP8) vs 62.58% (NVFP4) - effectively identical
- Training time: ~50% faster (projected from Tensor Core throughput)
- Memory usage: ~50% lower
Illustrative cost for a large training run (using the same assumptions as earlier):
- FP8: $100M, 90 days, 25,000 GPUs
- NVFP4: $50-65M, 45-60 days, 25,000 GPUs
That's $35-50M saved and 1-1.5 months faster to market.
This concrete example shows exactly why each technique is necessary and how they work together to enable practical 4-bit training at scale.