Bloom Filter
debt(d9/e5/b5/t7)
Closest to 'silent in production until users hit it' (d9). The detection_hints indicate 'automated: no' with no tooling support. Misuse — such as missing a fallback for false positives, wrong sizing, or attempting deletion — produces no compile error, no linter warning, and no runtime exception. The system silently returns incorrect membership answers under load, only surfacing when users encounter data inconsistencies or performance anomalies in production.
Closest to 'touches multiple files / significant refactor in one component' (e5). Correcting misuse isn't a one-liner: wrong bit-array sizing requires recalculating parameters and rebuilding the filter; missing fallback checks require adding a second lookup path at every call site; supporting deletion requires swapping to a Counting Bloom Filter implementation. These changes touch the filter construction, all call sites, and potentially the caching/storage layer, but are contained within one subsystem rather than being codebase-wide.
Closest to 'persistent productivity tax' (b5). Bloom filters apply across web, cli, and queue-worker contexts per applies_to. Once adopted as a caching gate, every team member must understand the probabilistic semantics, the mandatory fallback pattern, and sizing tradeoffs. This shapes cache-miss handling patterns across multiple work streams, but doesn't define the entire system's architecture.
Closest to 'serious trap — contradicts how a similar concept works elsewhere' (t7). The misconception field states developers believe Bloom filters can replace a database set, but they cannot — false positives require a fallback, and elements cannot be removed from a standard filter. This directly contradicts the mental model of a normal set or hash-table membership check, where lookups are definitive and deletion is trivially supported. The 'obvious' usage pattern (drop-in set replacement) is always incomplete.
Also Known As
TL;DR
Explanation
A Bloom filter uses multiple hash functions to set bits in a bit array. To test membership, all hash positions are checked — if any is 0, the element is definitely not in the set. If all are 1, the element is probably in the set (possible false positive). Bloom filters never produce false negatives. They are ideal for cache miss pre-screening, username availability checks, and preventing unnecessary database lookups.
Diagram
flowchart LR
ADD[Add item X] --> HASH[Hash with k functions<br/>h1 h2 h3]
HASH --> SET[Set bits in bit array<br/>positions 2, 7, 14]
CHECK[Check item Y] --> HASH2[Hash with same k functions]
HASH2 --> BITS{All bits set?}
BITS -->|yes| MAYBE[Probably in set<br/>false positive possible]
BITS -->|no| DEFNO[Definitely NOT in set<br/>no false negatives]
INFO[Use case: cache negative lookups<br/>block known-bad passwords<br/>deduplicate crawl URLs]
style DEFNO fill:#238636,color:#fff
style MAYBE fill:#d29922,color:#fff
style INFO fill:#1f6feb,color:#fff
Common Misconception
Why It Matters
Common Mistakes
- Not accounting for false positive rate when sizing the bit array — too small a filter has high false positive rates.
- Trying to delete elements from a standard Bloom filter — use a Counting Bloom Filter for deletion support.
- Using a Bloom filter without a fallback — false positives require a definitive check against the source.
- Not understanding that Bloom filters are write-only for membership — you cannot enumerate members.
Code Examples
// Without bloom filter — every cache miss hits the DB:
function getUser(int $id): ?User {
$cached = $cache->get("user:$id");
if (!$cached) {
return $db->find($id); // DB hit even for non-existent IDs
}
return $cached;
}
// Bloom filter pre-screens — DB only queried if probably exists:
function getUser(int $id): ?User {
if (!$bloomFilter->mightContain("user:$id")) return null; // Definite miss
$cached = $cache->get("user:$id");
if (!$cached) {
$user = $db->find($id); // Only if bloom says 'probably exists'
if ($user) $cache->set("user:$id", $user);
return $user;
}
return $cached;
}