noisy clipping variance schedules
Reference: https://openreview.net/pdf?id=tyEyYT267x

Data-Driven Noise Schedules

Traditional diffusion models:
- Use fixed noise schedules (linear, cosine, etc.)
- Apply the same schedule to every token/position
- Are hand-designed based on intuition

BD3-LM Innovation: Data-Driven Schedules

From the paper and codebase, they propose:

1. Variance-Reducing Schedules

Instead of a fixed noise schedule, they estimate gradient variance during training and adjust the noise schedule to minimize it. This is what sampling_eps_min and sampling_eps_max control (a usage sketch follows the config):
'noise': {
    'type': 'loglinear',           # Base schedule type
    't_min': 0.001,
},
'training': {
    'sampling_eps_min': 1e-3,      # Minimum sampling noise
    'sampling_eps_max': 1.0,       # Maximum sampling noise
    'antithetic_sampling': True,   # Variance reduction technique
}
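A minimal sketch of how sampling_eps_min / sampling_eps_max plausibly enter the training loop: the diffusion time t is drawn per example and kept inside [eps_min, eps_max) so the loss never hits the degenerate zero-noise endpoint. The function name and the uniform rescaling are illustrative assumptions, not the exact BD3-LM code.

import torch

def sample_noise_levels(batch_size, eps_min=1e-3, eps_max=1.0):
    """Draw per-example diffusion times t restricted to [eps_min, eps_max).

    Illustrative sketch: rescaling a uniform draw into the clipped range
    keeps t away from 0, where the diffusion loss terms become unstable.
    """
    u = torch.rand(batch_size)                 # uniform in [0, 1)
    return eps_min + (eps_max - eps_min) * u   # rescaled into [eps_min, eps_max)

t = sample_noise_levels(8)   # eight noise levels, all >= 1e-3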
2. Importance Sampling

They can optionally use importance sampling to focus training on difficult timesteps:

'importance_sampling': False,    # Can enable for harder examples
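As a generic illustration of the idea (not the paper's exact estimator): draw t from a non-uniform proposal that favors certain timesteps, then divide the per-example loss by the proposal density so the training objective stays unbiased. The Beta proposal below is a placeholder assumption.

import torch

def importance_sample_t(batch_size, eps_min=1e-3):
    """Importance-sample diffusion times t (sketch).

    The Beta(0.5, 1) proposal concentrates samples near t = 0; dividing the
    loss by its density q(t) keeps E[loss] identical to uniform sampling.
    """
    proposal = torch.distributions.Beta(0.5, 1.0)
    t = proposal.sample((batch_size,)).clamp(min=eps_min)
    weights = 1.0 / proposal.log_prob(t).exp()   # importance weights 1 / q(t)
    return t, weights

# The per-example diffusion loss would be multiplied by `weights`
# before averaging over the batch.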
3. Log-Linear Schedule

The loglinear schedule is the one they propose, and it works better for discrete text than standard schedules (a sketch follows this list):
- Standard (linear): uniform noise across time
- Log-linear: more noise early (easy denoising), less noise late (hard denoising)
- Matches the difficulty curve of language generation
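A sketch of one common log-linear parameterization from related masked-diffusion codebases (an assumption here, not necessarily the exact BD3-LM definition): total noise sigma(t) = -log(1 - (1 - eps) * t), which makes the induced corruption probability 1 - exp(-sigma(t)) = (1 - eps) * t linear in t.

import torch

def loglinear_total_noise(t, eps=1e-3):
    """Log-linear noise schedule (sketch): sigma(t) = -log(1 - (1 - eps) * t).

    sigma grows slowly for small t and steeply toward -log(eps) as t -> 1,
    while the corruption probability 1 - exp(-sigma(t)) = (1 - eps) * t
    stays linear in t.
    """
    return -torch.log1p(-(1 - eps) * t)

t = torch.linspace(0.0, 1.0, steps=5)
sigma = loglinear_total_noise(t)
mask_prob = 1.0 - torch.exp(-sigma)   # equals (1 - eps) * t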
4. Why It Matters

Problem with fixed schedules:
- Some tokens are easy to predict (common words like "the", "a")
- Others are hard (rare words, proper nouns)
- A fixed schedule wastes compute on the easy tokens

Data-driven solution:
- Automatically learns which noise levels are informative
- Focuses training on challenging timesteps
- Reduces training variance → faster convergence
5. The Missing Parameters We Added

When we added these:

'sampling_eps_min': 1e-3,
'sampling_eps_max': 1.0,

we enabled the adaptive noise schedule that:
- Clips extreme noise values
- Prevents numerical instability
- Allows the model to learn the optimal noise distribution

This is why training failed without them: the model couldn't properly sample from the noise schedule, which led to poor learning. The quick check below shows the kind of blow-up the clipping prevents.
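A quick numeric illustration of that instability, assuming the log-linear schedule sketched above and a 1/t-style loss weight as in related masked-diffusion objectives (both assumptions): without clamping, t = 0 makes the weight infinite, while clamping to sampling_eps_min keeps everything finite.

import torch

eps_min, eps_max = 1e-3, 1.0

t_raw = torch.tensor([0.0, 1e-6, 0.5, 1.0])   # unclipped draws
t = t_raw.clamp(min=eps_min, max=eps_max)     # clipped draws

weight_raw = 1.0 / t_raw   # contains inf at t = 0 -> NaN gradients downstream
weight = 1.0 / t           # bounded by 1 / eps_min = 1000.0

sigma = -torch.log1p(-(1 - eps_min) * t)   # log-linear total noise, finite everywhere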
Novel Aspect: Variance Estimation

From their paper, they introduce estimators of gradient variance to guide the noise schedule. This is particularly important for discrete diffusion, where:
- Categorical distributions are harder to handle than continuous Gaussians
- Token-level noise has higher variance
- A careful schedule is needed to avoid mode collapse

The antithetic_sampling: True setting is part of this: it is a variance-reduction technique in which pairs of noise levels are sampled with negative correlation, reducing variance while keeping the expectation correct. A minimal sketch is below.
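This is the textbook antithetic-variates construction (the codebase may implement it differently, e.g. with evenly spaced offsets): pair each uniform draw u with 1 - u, so each half of the batch is still uniform but the pairs are negatively correlated.

import torch

def antithetic_sample_t(batch_size, eps_min=1e-3, eps_max=1.0):
    """Antithetic sampling of diffusion times (sketch).

    Pairing u with 1 - u leaves each marginal uniform (so the expected loss
    is unchanged), while the negative correlation within pairs lowers the
    variance of the batch-averaged loss estimate.
    """
    half = batch_size // 2
    u = torch.rand(half)
    u = torch.cat([u, 1.0 - u])                  # antithetic pairs
    if u.numel() < batch_size:                   # odd batch size: one extra draw
        u = torch.cat([u, torch.rand(batch_size - u.numel())])
    return eps_min + (eps_max - eps_min) * u

t = antithetic_sample_t(8)   # four negatively correlated pairs in [1e-3, 1.0)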
This is much more sophisticated than standard diffusion schedules, and it explains why these parameters are absolutely critical for proper training!