Guided diffusion sampling typically uses two forward passes per step:
- One caption-conditional forward pass, to compute
E[flow | noisy image, noise level, caption]
- One unconditional forward pass, to compute
E[flow | noisy image, noise level]
The two predictions are then linearly combined, typically by extrapolating from the unconditional prediction past the conditional one, to form a single guided (superconditioned) flow prediction.
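The linear combination can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the names `combine_flows`, `guided_flow`, `toy_model`, and `guidance_scale` are hypothetical, the unconditional pass is represented by passing `caption=None`, and the parameterization `uncond + scale * (cond - uncond)` is one common convention (some implementations define the scale differently). Plain Python lists stand in for image tensors.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical model signature: (noisy_image, noise_level, caption) -> flow.
# Passing caption=None stands in for the unconditional forward pass.
FlowModel = Callable[[Sequence[float], float, Optional[str]], List[float]]


def combine_flows(
    cond: Sequence[float], uncond: Sequence[float], guidance_scale: float
) -> List[float]:
    """Linearly combine the two predictions, element by element.

    guidance_scale = 0 returns the unconditional flow, 1 returns the
    conditional flow, and values above 1 extrapolate past the conditional
    flow (the "superconditioned" regime).
    """
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


def guided_flow(
    model: FlowModel,
    noisy_image: Sequence[float],
    noise_level: float,
    caption: str,
    guidance_scale: float,
) -> List[float]:
    """Run the two forward passes for one sampling step and combine them."""
    cond = model(noisy_image, noise_level, caption)  # E[flow | image, level, caption]
    uncond = model(noisy_image, noise_level, None)   # E[flow | image, level]
    return combine_flows(cond, uncond, guidance_scale)


def toy_model(
    noisy_image: Sequence[float], noise_level: float, caption: Optional[str]
) -> List[float]:
    """Stand-in for a real network: the caption just shifts the output by 1."""
    shift = 0.0 if caption is None else 1.0
    return [-x * noise_level + shift for x in noisy_image]


if __name__ == "__main__":
    print(guided_flow(toy_model, [1.0, 2.0], 0.5, "a cat", guidance_scale=3.0))
```

In a real sampler the two passes are often batched into a single call on a doubled batch, but the arithmetic of the combination stays the same.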