Thundering Herd Problem
debt(d9/e5/b5/t7)
Closest to 'silent in production until users hit it' (d9). The detection_hints field states automated detection is 'no', and the code pattern (cache->get.*null.*db->) requires manual review to identify. There is no linter or static tool listed that catches this; it typically surfaces only when a cache expires in production and the database is hammered, often causing an outage.
Closest to 'touches multiple files / significant refactor in one component' (e5). The quick_fix lists multiple strategies: distributed mutex locking, stale-while-revalidate, TTL jitter, and proactive cache warming. These are not single-line patches — they require coordinated changes to caching logic, potentially across multiple cache call sites, worker startup configurations, and infrastructure settings. Not quite a cross-cutting architectural rework (e7), but more than a simple parameter swap.
Closest to 'persistent productivity tax' (b5). The problem applies across web, cli, and queue-worker contexts, meaning any caching or resource-contention pattern in the codebase must account for it. Once mitigation strategies (jitter, mutexes, stale serving) are in place they impose an ongoing cognitive and maintenance tax on future developers working with caches or shared resources, but they don't fully define the system's architecture.
Closest to 'serious trap — contradicts how a similar concept works elsewhere' (t7). The misconception field directly states that developers assume thundering herd only affects caches, when it actually affects any situation where many processes simultaneously target the same resource (queue workers, connection pools, server restarts). This narrow mental model is a serious trap: fixes applied only to cache layers leave other stampede vectors untreated, and the 'obvious' mitigations like long TTLs actually just delay the problem rather than solve it.
TL;DR
Explanation
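A cache stampede is the most common thundering herd variant: a cache entry expires, hundreds of concurrent requests all miss, and all of them query the DB simultaneously. The DB is overloaded, every query slows down or times out, and the requests fail.
Solutions:
(1) Cache locking: take a mutex on cache miss so that only one request regenerates the value.
(2) Stale-while-revalidate: keep serving the stale value while one request regenerates it.
(3) Probabilistic early expiry: each request regenerates the value with a small probability that rises as expiry approaches, so refreshes are spread out instead of coinciding at expiry.
(4) Cache warming: refresh the entry proactively before it expires.
Another variant affects queue workers: all workers wake on the same queue event, only one gets the job, and the rest burn CPU for nothing. Fix: for database-backed queues, claim jobs with SELECT ... FOR UPDATE SKIP LOCKED so each worker skips rows another worker has already locked; otherwise randomise polling intervals.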
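Solution (3) can be sketched as a small decision function. This is a minimal illustration of one common formulation, not a definitive implementation: the name shouldRefreshEarly, its parameters, and the idea of storing the last regeneration time alongside the entry are all assumptions made for this example.

```php
<?php
// Hypothetical helper: decide whether this request should regenerate a cache
// entry before it actually expires.
//   $now            current Unix time in seconds
//   $expiresAt      Unix time at which the entry expires
//   $computeSeconds how long the last regeneration took (assumed to be stored with the entry)
//   $beta           tuning knob: > 1 refreshes earlier, < 1 refreshes later
//   $rand           uniform random number in (0, 1]; injectable so the function is testable
function shouldRefreshEarly(
    float $now,
    float $expiresAt,
    float $computeSeconds,
    float $beta = 1.0,
    ?float $rand = null
): bool {
    // mt_rand() / mt_getrandmax() lies in [0, 1]; clamp away from 0 so log() stays finite.
    $rand = $rand ?? max(mt_rand() / mt_getrandmax(), 1e-12);

    // log($rand) is <= 0, so $headStart is a non-negative random number of seconds.
    // Slow regenerations (large $computeSeconds) get a larger head start on average.
    $headStart = -$computeSeconds * $beta * log($rand);

    return $now + $headStart >= $expiresAt;
}
```

A caller would regenerate the value when this returns true and otherwise serve the cached copy. Because each request draws its own random number, refreshes are unlikely to coincide.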
Common Misconception
Why It Matters
Common Mistakes
- Long TTL to avoid expiry — just delays the stampede.
- Not locking cache regeneration — every request races to regenerate.
- Simultaneous queue worker start — all workers poll at the same time.
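Jitter addresses two of the mistakes above. A randomised TTL stops entries written together from expiring together, and a randomised polling delay stops workers started together from polling in lockstep. The helper below is a minimal sketch; the name withJitter and the 20% default spread are assumptions for illustration.

```php
<?php
// Hypothetical helper: return $base moved by a random amount of up to
// +/- $spread (0.2 means 20%). Expects $base >= 0 and $spread >= 0.
function withJitter(int $base, float $spread = 0.2): int
{
    $delta = (int) round($base * $spread);

    // random_int() is inclusive on both ends.
    return $base + random_int(-$delta, $delta);
}

// Possible uses, with $cache standing in for the cache client from the examples below:
//   $cache->set('expensive_query', $value, withJitter(3600)); // TTL between 2880 and 4320 s
//   usleep(withJitter(500000));                               // poll roughly every 0.4 to 0.6 s
```

Jitter only spreads the load out. A single very hot key can still stampede when it expires, so it complements locking rather than replacing it.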
Code Examples
// Cache stampede:
$value = $cache->get('expensive_query');
if ($value === null) {
    $value = $db->runExpensiveQuery(); // All 500 concurrent requests hit this
    $cache->set('expensive_query', $value, 3600);
}
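// Mutex on cache miss (assumes $redis is a phpredis client):
$value = $cache->get('expensive_query');
if ($value === null) {
    // Unique token so we only ever release a lock we still own.
    $token = bin2hex(random_bytes(16));
    $lock = $redis->set('lock:expensive_query', $token, ['NX', 'EX' => 10]);
    if ($lock) {
        try {
            $value = $db->runExpensiveQuery();
            $cache->set('expensive_query', $value, 3600);
        } finally {
            // Compare-and-delete: if our lock expired and another request took it, leave theirs alone.
            $release = 'if redis.call("get", KEYS[1]) == ARGV[1] then return redis.call("del", KEYS[1]) else return 0 end';
            $redis->eval($release, ['lock:expensive_query', $token], 1);
        }
    } else {
        // Another request is regenerating. Wait briefly, then re-check the cache;
        // $fallbackValue (stale or default data) is used if the entry is still missing.
        usleep(100000); // 100 ms
        $value = $cache->get('expensive_query') ?? $fallbackValue;
    }
}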
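
Solution (2) from the Explanation, stale-while-revalidate, can be sketched in the same spirit. The class below is an in-memory stand-in so the example is self-contained; the class name, method signature, and the explicit $now parameter are assumptions for illustration. A real deployment would keep the entries and the lock in a shared store such as Redis.

```php
<?php
// In-memory sketch of stale-while-revalidate. Entries are never evicted here;
// they just stop being "fresh" once $now passes freshUntil.
final class StaleWhileRevalidateCache
{
    /** @var array<string, array{value: mixed, freshUntil: float}> */
    private array $entries = [];

    /** @var array<string, bool> keys that are currently being regenerated */
    private array $locks = [];

    public function get(string $key, callable $regenerate, float $ttl, float $now)
    {
        $entry = $this->entries[$key] ?? null;

        // Fresh hit: nothing to do.
        if ($entry !== null && $now < $entry['freshUntil']) {
            return $entry['value'];
        }

        // Stale, and someone else is already regenerating: serve the stale value.
        if ($entry !== null && isset($this->locks[$key])) {
            return $entry['value'];
        }

        // Stale or missing, and nobody is regenerating: this caller does the work.
        // A cold miss (no stale value to serve) still needs the mutex pattern above.
        $this->locks[$key] = true;
        try {
            $value = $regenerate();
            $this->entries[$key] = ['value' => $value, 'freshUntil' => $now + $ttl];
            return $value;
        } finally {
            unset($this->locks[$key]);
        }
    }
}
```

The trade-off is bounded staleness: while a regeneration is in flight, other callers get the old value immediately instead of queueing behind the database.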