Entity Resolution and Deduplication
debt(d9/e7/b7/t7)
Closest to 'silent in production until users hit it' (d9). detection_hints.automated is no, and the code_pattern hints (DISTINCT|GROUP BY|drop_duplicates) only catch naive exact-match dedup, not the actual failure mode of false merges or missed fuzzy duplicates, which silently corrupt analytics and only surface when fragmented or conflated identities are noticed downstream.
Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix is not a one-liner: it requires adding a blocking step, field-level similarity scoring with measured thresholds, human review routing, and a reversible merge/audit trail — a coordinated pipeline change touching multiple components and data stores.
Closest to 'strong gravitational pull' (b7). applies_to spans library, queue-worker, node, and web contexts; the canonicalization and merge logic becomes load-bearing for any downstream system relying on entity identity, and irreversible/transitive merge decisions shape every future change to the data model.
Closest to 'serious trap' (t7). The misconception is that deduplication just means dropping byte-for-byte identical rows, when the hard cases are non-identical records that refer to the same entity. The obvious exact-equality approach is wrong, and applying transitive merges blindly collapses unrelated entities, which contradicts the intuitive mental model.
Also Known As
TL;DR
Explanation
Entity resolution (also called record linkage or deduplication) is the discipline of deciding which records, across one or many datasets, refer to the same real-world thing, such as a person, company, product, or place, and then linking or merging them into one canonical entity. It is distinct from schema unification (aligning field structures) and from data cleaning (fixing malformed values within a record). Entity resolution operates one level up: even after every record is individually clean and stored in a common schema, you may still have "Robert Smith, 12 Oak St" and "Bob Smith, 12 Oak Street" that are the same person, and the job is to recognize that.
A resolution pipeline typically has three matching stages followed by a clustering and merge step. Blocking (or indexing) groups candidate records into manageable buckets so you avoid the quadratic cost of comparing every pair; common blocking keys are phonetic codes, postal codes, or n-gram prefixes. Comparison computes similarity between candidate pairs using field-level metrics such as Jaro-Winkler for names, token overlap for addresses, or date proximity. Classification then decides match / no-match / possible-match from those similarity vectors, using thresholded rules, a probabilistic model like Fellegi-Sunter, or a trained classifier. Matched records are finally clustered (often with transitive closure or graph connected components) and merged into a single canonical record, using survivorship rules that pick the best value per field.
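The classification stage can be sketched in the Fellegi-Sunter style: each field contributes a log-likelihood weight depending on whether it agrees, and two thresholds split the total into match / possible-match / no-match. This is a minimal sketch, not a full implementation; the field names, the `m` and `u` probabilities, and the threshold values are all assumptions chosen for illustration, and in practice they would be estimated from data.

```python
import math

# Hypothetical per-field probabilities (assumed, not measured):
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
FIELD_PARAMS = {
    "surname": {"m": 0.95, "u": 0.01},
    "postcode": {"m": 0.90, "u": 0.05},
    "birth_year": {"m": 0.98, "u": 0.02},
}


def match_weight(agreement):
    """Sum per-field log2 likelihood ratios for an agreement vector."""
    total = 0.0
    for field, agrees in agreement.items():
        m = FIELD_PARAMS[field]["m"]
        u = FIELD_PARAMS[field]["u"]
        if agrees:
            total += math.log2(m / u)              # agreement adds evidence
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement subtracts it
    return total


def classify(weight, lower=0.0, upper=8.0):
    """Two illustrative thresholds give three outcomes."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "no-match"
    return "possible-match"  # borderline: route to human review


pair = {"surname": True, "postcode": True, "birth_year": False}
print(classify(match_weight(pair)))
```

The middle band is what feeds a review queue: widening it sends more pairs to humans, narrowing it automates more decisions at the cost of more errors.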
The hard parts are precision-recall trade-offs and transitivity. Set thresholds too loose and you merge distinct people into one identity, leaking data or corrupting analytics; too strict and you scatter one customer across many accounts, inflating counts and breaking personalization. Transitive merges compound this: if A matches B and B matches C, you may inadvertently merge A and C even though they are unrelated. Production systems address these with human review queues for borderline pairs, audit trails that make merges reversible, and incremental resolution so new records are matched against existing clusters rather than re-running the whole batch. Done well, entity resolution turns fragmented data into a trustworthy single view of each entity; done carelessly it silently fabricates or destroys identity.
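The transitivity hazard can be made concrete with a small sketch. It clusters matched pairs with union-find (connected components) and then flags any cluster containing a pair whose direct similarity is below a floor. The scores, the 0.8 match threshold, the 0.6 cohesion floor, and the helper names `cluster` and `weak_clusters` are hypothetical, chosen only to show the failure mode.

```python
from itertools import combinations


def cluster(ids, pairs):
    """Group ids into connected components using union-find."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def weak_clusters(clusters, scores, floor=0.6):
    """Return clusters containing at least one pair scoring below the floor."""
    flagged = []
    for members in clusters:
        for a, b in combinations(members, 2):
            s = scores.get((a, b), scores.get((b, a), 0.0))
            if s < floor:
                flagged.append(members)
                break
    return flagged


# Made-up pairwise scores: A~B and B~C are strong, A~C is weak.
scores = {("A", "B"): 0.85, ("B", "C"): 0.84, ("A", "C"): 0.40}
matched = [pair for pair, s in scores.items() if s >= 0.8]

clusters = cluster(["A", "B", "C", "D"], matched)
print(clusters)                         # A, B, and C end up in one cluster
print(weak_clusters(clusters, scores))  # that cluster is flagged for review
```

Flagging rather than silently splitting keeps the decision with a reviewer; stricter alternatives, such as requiring every pair in a cluster to clear the floor before merging, trade recall for safety.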
Common Misconception
Why It Matters
Common Mistakes
- Relying on exact equality or a single field to detect duplicates, missing records that differ by spelling, formatting, or typos.
- Skipping a blocking step and attempting all-pairs comparison, which scales quadratically and fails on large datasets.
- Applying transitive merges blindly so weak A-B and B-C links collapse unrelated A and C into one entity.
- Setting match thresholds without measuring precision and recall on a labeled sample, so you cannot quantify false merges.
- Making merges irreversible with no audit trail, so an incorrect merge cannot be undone or investigated.
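Measuring a threshold against a labeled sample, as the list above recommends, takes only a few lines. The sketch below assumes a hand-labeled list of `(score, is_true_match)` pairs; the values are invented for illustration and are not results from any real dataset.

```python
def precision_recall(scored_pairs, threshold):
    """Compute precision and recall for a list of (score, is_true_match)."""
    tp = sum(1 for s, y in scored_pairs if s >= threshold and y)
    fp = sum(1 for s, y in scored_pairs if s >= threshold and not y)
    fn = sum(1 for s, y in scored_pairs if s < threshold and y)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Invented labeled sample: similarity score plus a reviewer's verdict.
labeled = [
    (0.95, True), (0.88, True), (0.81, False), (0.79, True),
    (0.70, False), (0.65, True), (0.40, False), (0.30, False),
]

for t in (0.6, 0.8, 0.9):
    p, r = precision_recall(labeled, t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

On this sample, raising the threshold from 0.6 to 0.9 lifts precision to 1.0 while recall falls from 1.0 to 0.25, which is the false-merge versus missed-duplicate trade-off expressed in numbers.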
Avoid When
- Records already carry a reliable shared identifier (such as a verified national ID or stable foreign key) that makes fuzzy matching unnecessary.
- The dataset is small and static enough that manual review beats building a matching pipeline.
- Incorrect merges carry unacceptable risk and you lack any review or rollback mechanism.
- The real problem is malformed values within single records, which is data cleaning rather than entity resolution.
When To Use
- Building a single customer or product view by merging records from multiple source systems.
- Deduplicating CRM, mailing, or master-data records that lack a shared reliable key.
- Linking datasets across organizations where the same entities appear with inconsistent formatting.
- Maintaining a knowledge graph or master data store that must keep one canonical node per real-world entity.
Code Examples
# Naive dedup: only catches rows whose name and address are byte-identical
def dedupe(records):
    seen = {}
    for r in records:
        # exact-match key misses 'Bob' vs 'Robert', '12 Oak St' vs '12 Oak Street'
        key = (r["name"], r["address"])
        # keep the first record seen for each key
        seen.setdefault(key, r)
    # keying on name alone would be no better: it would conflate two different John Smiths
    return list(seen.values())

rows = [
    {"name": "Robert Smith", "address": "12 Oak St"},
    {"name": "Bob Smith", "address": "12 Oak Street"},  # same person, kept as duplicate
]
print(len(dedupe(rows)))  # -> 2, the duplicate survives

# Better: block on postcode, score field-level similarity, then apply a threshold
from itertools import combinations
from difflib import SequenceMatcher

def sim(a, b):
    # case-insensitive character-level similarity in [0, 1]
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def block_key(r):
    # blocking: only compare records sharing a postcode to avoid O(n^2) all-pairs comparison
    return r["postcode"]

def resolve(records, threshold=0.75):
    # threshold is illustrative; tune it against a labeled sample
    blocks = {}
    for r in records:
        blocks.setdefault(block_key(r), []).append(r)
    matches = []
    for bucket in blocks.values():
        for a, b in combinations(bucket, 2):
            # weighted field-level score: name counts slightly more than address
            score = 0.6 * sim(a["name"], b["name"]) + 0.4 * sim(a["address"], b["address"])
            if score >= threshold:
                matches.append((a["id"], b["id"], round(score, 3)))
    return matches  # feed borderline scores to a review queue, keep an audit trail

rows = [
    {"id": 1, "name": "Robert Smith", "address": "12 Oak St", "postcode": "OX1"},
    {"id": 2, "name": "Bob Smith", "address": "12 Oak Street", "postcode": "OX1"},
]
print(resolve(rows))  # -> [(1, 2, 0.784)] candidate match surfaced for merge
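
Once a candidate pair is accepted, the merge itself should be reversible. The following is a minimal in-memory sketch of an audit trail that snapshots both records before merging so the merge can be undone. The `MergeLog` class, its method names, and the "keep the longer value" survivorship rule are assumptions for illustration; a production system would persist the log and handle chains of merges involving the same record.

```python
import copy
from datetime import datetime, timezone


class MergeLog:
    """Hypothetical in-memory record store with a reversible merge log."""

    def __init__(self, records):
        self.records = {r["id"]: dict(r) for r in records}
        self.audit = []

    def merge(self, survivor_id, loser_id, score, decided_by):
        survivor = self.records[survivor_id]
        loser = self.records[loser_id]
        # Snapshot both records before changing anything.
        self.audit.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "survivor_before": copy.deepcopy(survivor),
            "loser": copy.deepcopy(loser),
            "score": score,
            "decided_by": decided_by,
        })
        # Assumed survivorship rule: keep the longer value per field.
        for field, value in loser.items():
            if field != "id" and len(str(value)) > len(str(survivor.get(field, ""))):
                survivor[field] = value
        del self.records[loser_id]
        return len(self.audit) - 1  # position in the log, used as a merge id

    def undo(self, merge_id):
        # Only safe while no later merge has touched the same records.
        entry = self.audit[merge_id]
        for snapshot in (entry["survivor_before"], entry["loser"]):
            self.records[snapshot["id"]] = copy.deepcopy(snapshot)
        self.audit.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "undo_of": merge_id,
        })


log = MergeLog([
    {"id": 1, "name": "Robert Smith", "address": "12 Oak St"},
    {"id": 2, "name": "Bob Smith", "address": "12 Oak Street"},
])
merge_id = log.merge(1, 2, score=0.8, decided_by="reviewer")  # illustrative score
print(log.records[1]["address"])  # merged record keeps the longer address
log.undo(merge_id)
print(sorted(log.records))        # both original records are back
```

Recording the undo as its own log entry, rather than deleting the original merge entry, keeps the history available for later investigation.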