Diffusion Models
Also Known As
Denoising diffusion probabilistic models (DDPMs); score-based generative models.
TL;DR
Train a network to remove gradually added Gaussian noise; generate by starting from pure noise and denoising iteratively, optionally steered by text or image conditioning.
Explanation
A diffusion model is trained by progressively adding Gaussian noise to real samples across many timesteps until nothing but noise remains — then the network learns to predict and remove the noise at each step. At inference, you start from pure noise and iteratively apply the learned denoiser, optionally conditioned on text embeddings (via cross-attention to a CLIP or T5 encoder) to steer the output. Two ingredients made diffusion practical at scale: running the process in a compressed latent space via a VAE (the basis of Latent Diffusion / Stable Diffusion), and classifier-free guidance for controllable conditioning strength. Inference cost scales with the number of denoising steps — DDIM, DPM-Solver and consistency-model distillations reduce step counts from ~50 to as few as 1-4. Diffusion now dominates image, video and 3D generation and is expanding into text and audio.
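The forward noising and reverse denoising loops can be sketched in a few lines. This is a toy illustration, not a real implementation: the 1-D "sample", the linear beta schedule values, and the placeholder `toy_denoiser` (which stands in for a trained U-Net or transformer predicting the added noise) are all assumptions made for the sketch.

```python
import numpy as np

T = 1000                            # training-time noise schedule length
betas = np.linspace(1e-4, 0.02, T)  # assumed linear variance schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)     # cumulative product, shrinks toward 0

def forward_noise(x0, t, rng):
    """q(x_t | x_0): jump to any noise level t in closed form."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps                  # noisy sample and the noise target

def toy_denoiser(xt, t):
    """Placeholder for the learned network eps_theta(x_t, t);
    predicts zero noise so the sketch runs end to end."""
    return np.zeros_like(xt)

def reverse_step(xt, t, rng):
    """One DDPM ancestral sampling step toward x_{t-1}."""
    eps_pred = toy_denoiser(xt, t)
    mean = (xt - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) \
           / np.sqrt(alphas[t])
    if t > 0:                       # no noise added at the final step
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(xt.shape)
    return mean

rng = np.random.default_rng(0)
x0 = np.ones(8)                     # pretend "real sample"
xt, eps = forward_noise(x0, t=T - 1, rng=rng)  # nearly pure noise
x = rng.standard_normal(8)          # inference: start from pure noise
for t in reversed(range(T)):
    x = reverse_step(x, t, rng)
```

Note how training and inference share only the schedule: the forward pass jumps straight to timestep `t` in one step, while sampling walks the full chain backwards — which is exactly why faster samplers that skip steps are possible.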
Common Misconception
Why It Matters
Common Mistakes
- Confusing training steps with inference steps — a model may be trained with a 1000-step noise schedule yet sampled in 20 steps via a faster sampler.
- Cranking guidance scale to 20+ — very high CFG produces over-saturated, burned-out images; 7-10 is typically the sweet spot.
- Assuming latent space = pixel space — Stable Diffusion operates in a 4-channel 64×64 latent that the VAE decodes to 512×512 pixels; a mask drawn in pixel space must be downscaled by 8× before it can be applied in latent space.
- Using the wrong sampler for the step budget — Euler-a is robust for low steps, DPM-Solver++ excels at 10-30 steps, DDIM is deterministic and needed for reproducibility.
- Training a LoRA on a base model and expecting it to work on a different model family — LoRAs are tied to the base model's weights and architecture.
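Two of the mistakes above are easy to show concretely. The sketch below assumes `eps_uncond` and `eps_cond` are the model's noise predictions without and with the text conditioning (in practice obtained from one batched forward pass); `pixel_mask_to_latent` is a hypothetical helper illustrating the 8× pixel-to-latent downscale for a 512×512 image.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale=7.5):
    """Classifier-free guidance: extrapolate away from the
    unconditional prediction. scale=1.0 disables guidance;
    20+ is where over-saturated, burned-out images appear."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def pixel_mask_to_latent(mask_512):
    """Downscale a 512x512 pixel-space mask to the 64x64 latent
    grid via a coarse max-pool, so any touched 8x8 pixel block
    marks its latent cell."""
    assert mask_512.shape == (512, 512)
    return mask_512.reshape(64, 8, 64, 8).max(axis=(1, 3))
```

A guidance scale of 1.0 reduces `cfg_combine` to the conditional prediction alone, which makes clear why the scale is a strength knob rather than an on/off switch.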
Avoid When
- Pure text generation — autoregressive transformers still dominate for language.
- Real-time sub-100ms generation — even distilled diffusion is usually slower than GANs for tight latency budgets.
When To Use
- Generating images, video or 3D where diversity and high fidelity matter more than inference latency.
- Any task where you need controllable generation via text or image conditioning — the field is richest here.