Mixture of Experts (MoE)
Also Known As
TL;DR
An MoE layer routes each token to a small subset of expert sub-networks, so per-token compute tracks active parameters while memory must hold the total parameter count.
Explanation
Mixture of Experts is a sparse architecture pattern. Instead of every parameter participating in every forward pass (as in dense models), an MoE layer contains many parallel feed-forward sub-networks called experts, plus a gating network that decides which experts handle each token. Typical configurations activate only 2 of 8 experts (Mixtral 8x7B), 2 of 16, or similarly small fractions per token.

Effects:
- Total parameter count scales without proportionally scaling FLOPs per token.
- The model can specialize different experts to different domains (code, math, languages).
- Inference compute is determined by active params (~13B for Mixtral 8x7B), but VRAM/RAM must hold all experts (~47B params total).

Training is harder: gating networks need to load-balance experts and avoid routing collapse, where most tokens go to a few experts. Production MoE models include Mixtral, DeepSeek-V3, Grok, and reportedly GPT-4. For developers, the practical implication is mostly model selection and infrastructure: an MoE model rated 'comparable to a 70B dense model' may need similar VRAM but run faster per token.
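To make the routing step concrete, here is a minimal top-2 gating sketch in plain PHP. It is illustrative only: the function name, logit values, and expert count are assumptions, not any real inference framework's API.

// Minimal top-k gating sketch (illustrative; not a real framework API).
// $gateLogits holds one routing score per expert for a single token.
function routeToken(array $gateLogits, int $topK = 2): array
{
    // Softmax over expert scores to get routing probabilities
    $max = max($gateLogits);
    $exp = array_map(fn($x) => exp($x - $max), $gateLogits);
    $sum = array_sum($exp);
    $probs = array_map(fn($x) => $x / $sum, $exp);

    // Keep only the top-k experts and renormalize their weights
    arsort($probs);
    $selected = array_slice($probs, 0, $topK, true);
    $norm = array_sum($selected);
    return array_map(fn($p) => $p / $norm, $selected); // [expertIndex => weight]
}

// 8 experts exist in memory, but only 2 run for this token:
// per-token compute scales with top-k, memory with all 8 experts.
$weights = routeToken([0.1, 2.3, -0.4, 1.7, 0.0, -1.2, 0.5, 0.9]);
// Roughly [1 => 0.65, 3 => 0.35]: the token's output is the weighted sum
// of expert 1's and expert 3's outputs on that token's hidden state.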
Common Misconception
That an MoE model's hardware cost is set by its active parameter count. Only per-token compute tracks active parameters; VRAM/RAM is driven by total parameters, since all experts must be loaded.
Why It Matters
MoE is now common among leading open and proprietary models (Mixtral, DeepSeek-V3, Grok, reportedly GPT-4), so model selection and infrastructure sizing increasingly hinge on the active-vs-total parameter distinction.
Common Mistakes
- Comparing MoE to dense models by 'active' parameters only — total parameters drive memory cost (see the sizing sketch after this list).
- Assuming MoE always beats dense at the same total params — it depends on workload, batch size, and how well experts specialize.
- Confusing MoE with model ensembling — ensembling runs multiple full models in parallel; MoE routes within a single forward pass.
- Expecting expert specialization to be human-interpretable — gating decisions are learned, not labelled, and don't necessarily map to clean domain categories.
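A back-of-the-envelope sizing sketch for the first mistake above: weight memory scales with total parameters, while per-token compute scales with active parameters. The numbers are approximate and ignore KV cache and runtime overhead.

// Rough weight-memory estimate: billions of params * bytes per param ~= GB.
function weightMemoryGB(float $totalParamsBillions, float $bytesPerParam): float
{
    return $totalParamsBillions * $bytesPerParam;
}

// Mixtral 8x7B: ~47B total params, ~13B active per token.
$fp16Total  = weightMemoryGB(47, 2.0); // ~94 GB of weights: what must actually fit in memory
$int4Total  = weightMemoryGB(47, 0.5); // ~24 GB of weights at 4-bit quantization
$fp16Active = weightMemoryGB(13, 2.0); // ~26 GB: what an 'active params only' estimate wrongly suggests

echo "fp16: {$fp16Total} GB, 4-bit: {$int4Total} GB, active-only (misleading): {$fp16Active} GB\n";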
Avoid When
- Workload has tight VRAM constraints — a same-active-param dense model may fit where an MoE doesn't.
- Latency at very small batch sizes is critical — MoE expert routing can hurt worst-case latency.
When To Use
- Selecting an open-source LLM where compute efficiency (capability per FLOP) matters more than memory footprint.
- Explaining infra sizing for MoE deployments to ops teams expecting dense-model behaviour.
- Evaluating model trade-offs: throughput-per-token vs total memory footprint.
Code Examples
// ❌ Sizing infrastructure based on active parameters of an MoE model
// Mixtral 8x7B has 13B active params — assuming a 13B-class GPU is enough
$config = [
    'model' => 'mistralai/Mixtral-8x7B-Instruct-v0.1',
    'gpu'   => 'A10G', // 24 GB — insufficient, model needs to load all 47B params
    'quant' => 'fp16'
];
// Will OOM at load time — VRAM is gated by total params, not active.
// ✅ Size infrastructure by total parameters; pick quantization to fit
$config = [
    'model' => 'mistralai/Mixtral-8x7B-Instruct-v0.1',
    'gpu'   => '2x A100-80GB', // ~94 GB of fp16 weights for 47B params, plus KV cache headroom
    'quant' => 'fp16'
];
// Alternative: 4-bit quant on a smaller GPU
$configQuant = [
    'model' => 'mistralai/Mixtral-8x7B-Instruct-v0.1',
    'gpu'   => 'A6000', // 48 GB: ~24 GB of 4-bit weights leaves KV cache headroom, at some quality cost
    'quant' => 'gptq-4bit'
];