noisy clipping variance schedules
https://openreview.net/pdf?id=tyEyYT267x
Data-Driven Noise Schedules
Traditional Diffusion Models:
- Use fixed noise schedules (linear, cosine, etc.)
- Same schedule for all tokens/positions
- Hand-designed based on intuition
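For reference, here's a minimal sketch of two such fixed schedules (illustrative only, not code from the BD3-LM repo; the cosine formula is the standard one from Nichol & Dhariwal):

import math
import torch

def linear_beta_schedule(T, beta_min=1e-4, beta_max=0.02):
    # Hand-designed: per-step noise grows linearly with the timestep index.
    return torch.linspace(beta_min, beta_max, T)

def cosine_alpha_bar(T, s=0.008):
    # Cosine schedule: cumulative signal level alpha_bar follows a squared
    # cosine, normalized so alpha_bar(0) = 1. Fixed ahead of time, data-agnostic.
    t = torch.arange(T + 1) / T
    f = torch.cos((t + s) / (1 + s) * math.pi / 2) ** 2
    return f / f[0]

Both are chosen before training and never look at the data - that's the limitation the data-driven schedules below address.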
BD3-LM Innovation: Data-Driven Schedules
From the paper and codebase, they propose:
1. Variance-Reducing Schedules
Instead of using a fixed schedule, they estimate gradient variance during training and adjust the noise schedule to minimize it. This is what sampling_eps_min and sampling_eps_max control.
'noise': {
    'type': 'loglinear',          # Base schedule type
    't_min': 0.001,
},
'training': {
    'sampling_eps_min': 1e-3,     # Minimum sampling noise
    'sampling_eps_max': 1.0,      # Maximum sampling noise
    'antithetic_sampling': True,  # Variance reduction technique
}
2. Importance Sampling
They can optionally use importance sampling to focus training on difficult timesteps:
'importance_sampling': False,  # Set True to focus training on difficult timesteps
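Roughly, importance sampling over timesteps looks like the sketch below. This is illustrative only - the proposal density and the power knob are made up, not BD3-LM's actual estimator - but it shows how reweighting keeps the loss unbiased:

import torch

def sample_timesteps_importance(n, eps=1e-3, power=2.0):
    # Illustrative only: draw t from a proposal q(t) that concentrates on
    # later (harder) timesteps, then return importance weights p(t) / q(t)
    # so the expected loss over t is unchanged.
    u = torch.rand(n)
    x = u ** (1.0 / power)                    # inverse-CDF sample, q(x) = power * x**(power - 1)
    t = eps + (1 - eps) * x
    q = power * x ** (power - 1) / (1 - eps)  # proposal density over t
    p = 1.0 / (1 - eps)                       # default uniform density over [eps, 1]
    return t, p / q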
3. Log-Linear Schedule
The loglinear schedule is the one they propose; it works better for discrete text than standard schedules:
- Standard (linear): noise is added at a uniform rate across time
- Log-linear: More noise early (easy denoising), less noise late (hard denoising)
- Matches the difficulty curve of language generation
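A log-linear schedule can be written roughly like this (a sketch following the style of related masked-diffusion codebases; BD3-LM's exact parametrization may differ):

import torch

def loglinear_total_noise(t, eps=1e-3):
    # Cumulative noise sigma(t) = -log(1 - (1 - eps) * t); the induced mask
    # probability 1 - exp(-sigma(t)) = (1 - eps) * t grows linearly in t.
    return -torch.log1p(-(1 - eps) * t)

def loglinear_rate(t, eps=1e-3):
    # d sigma / dt, used to weight the per-timestep loss.
    return (1 - eps) / (1 - (1 - eps) * t)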
4. Why It Matters
Problem with fixed schedules:
- Some tokens are easy to predict (common words like "the", "a")
- Others are hard (rare words, proper nouns)
- Fixed schedule wastes compute on easy tokens
Data-driven solution:
- Automatically learns which noise levels are informative
- Focuses training on challenging timesteps
- Reduces training variance → faster convergence
5. The Missing Parameters We Added
When we added these:
'sampling_eps_min': 1e-3,
'sampling_eps_max': 1.0,
We enabled the adaptive noise schedule that:
- Clips extreme noise values
- Prevents numerical instability
- Allows the model to learn the optimal noise distribution
This is why training failed without them - the model couldn't properly sample from the noise schedule, leading to
poor learning.
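Conceptually, the two values act as bounds on the noise level sampled at each training step. This is a hedged sketch of that idea, not the repo's actual sampling code:

import torch

def sample_noise_level(batch_size, eps_min=1e-3, eps_max=1.0, device='cpu'):
    # Sketch: instead of sampling over the full (0, 1) range, restrict the
    # per-sequence noise level to [eps_min, eps_max]. The extremes (almost no
    # masking / full masking) are the main source of gradient variance and
    # numerical trouble.
    u = torch.rand(batch_size, device=device)
    return eps_min + (eps_max - eps_min) * u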
Novel Aspect: Variance Estimation
In the paper, they introduce estimators of gradient variance to guide the noise schedule. This is particularly important for discrete diffusion, where:
- Categorical distributions are harder than continuous Gaussians
- Token-level noise has higher variance
- Need careful schedule to avoid mode collapse
The antithetic_sampling: True setting is part of this - it's a variance reduction technique where they sample pairs
of noise levels that are negatively correlated, reducing variance while maintaining the correct expectation.
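The textbook form of antithetic sampling looks like the sketch below (the actual codebase may use a stratified variant, but the variance-reduction idea is the same):

import torch

def antithetic_timesteps(batch_size, eps=1e-3, device='cpu'):
    # Draw half the batch uniformly, then pair each sample u with 1 - u.
    # The pairs are negatively correlated, so their average has lower
    # variance while the expectation is unchanged. Assumes an even batch size.
    half = batch_size // 2
    u = torch.rand(half, device=device)
    u = torch.cat([u, 1.0 - u])
    return eps + (1 - eps) * u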
This is much more sophisticated than standard diffusion schedules, and explains why these parameters are absolutely
critical for proper training!