Entity Resolution and Deduplication

Knowledge Engineering · Advanced

debt(d9/e7/b7/t7)

d9 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'silent in production until users hit it' (d9). detection_hints.automated is no, and the code_pattern hints (DISTINCT|GROUP BY|drop_duplicates) only catch naive exact-match dedup, not the actual failure mode of false merges or missed fuzzy duplicates, which silently corrupt analytics and only surface when fragmented or conflated identities are noticed downstream.

e7 Effort Remediation debt — work required to fix once spotted

Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix is not a one-liner: it requires adding a blocking step, field-level similarity scoring with measured thresholds, human review routing, and a reversible merge/audit trail — a coordinated pipeline change touching multiple components and data stores.

b7 Burden Structural debt — long-term weight of choosing wrong

Closest to 'strong gravitational pull' (b7). applies_to spans library, queue-worker, node, and web contexts; the canonicalization and merge logic becomes load-bearing for any downstream system relying on entity identity, and irreversible/transitive merge decisions shape every future change to the data model.

t7 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'serious trap' (t7). The misconception is that deduplication just means dropping byte-for-byte identical rows, when the hard cases are non-identical records referring to the same entity; the obvious exact-equality approach is wrong, and transitive merging blindly collapses unrelated entities — contradicting the intuitive mental model.

About DEBT scoring → scored by claude-opus-4-8 · 2026-06-18 · reviewed by human

Also Known As

record linkage, deduplication, data matching, identity resolution

TL;DR

The process of identifying records that refer to the same real-world entity and merging them into a single canonical representation.

Explanation

Entity resolution (also called record linkage or deduplication) is the discipline of deciding which records, across one or many datasets, refer to the same real-world thing - a person, company, product, or place - and then linking or merging them into one canonical entity. It is distinct from schema unification (aligning field structures) and from data cleaning (fixing malformed values within a record). Entity resolution operates one level up: even after every record is individually clean and stored in a common schema, you may still have "Robert Smith, 12 Oak St" and "Bob Smith, 12 Oak Street" that are the same person, and the job is to recognize that.

A resolution pipeline typically has three core stages, followed by clustering and merging. Blocking (or indexing) groups candidate records into manageable buckets so you avoid the quadratic cost of comparing every pair; common blocking keys are phonetic codes, postal codes, or n-gram prefixes. Comparison computes similarity between candidate pairs using field-level metrics such as Jaro-Winkler for names, token overlap for addresses, or date proximity. Classification then decides match / no-match / possible-match from those similarity vectors, using thresholded rules, a probabilistic model like Fellegi-Sunter, or a trained classifier. Matched records are finally clustered (often with transitive closure or graph connected components) and merged into a single canonical record, using survivorship rules that pick the best value per field.
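The final clustering and merge step can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names `cluster` and `survivor` are hypothetical, and the survivorship rule (keep the longest non-empty value per field) is an assumption chosen only because it is easy to follow. Real systems usually rank values by source trust or recency.

```python
from collections import defaultdict

def cluster(ids, matched_pairs):
    """Group record ids into connected components using union-find."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in matched_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smaller id becomes the root

    groups = defaultdict(list)
    for i in parent:
        groups[find(i)].append(i)
    return sorted(sorted(g) for g in groups.values())

def survivor(records):
    """Assumed survivorship rule: keep the longest non-empty value per field."""
    fields = sorted({f for r in records for f in r if f != "id"})
    merged = {f: max((r.get(f) or "" for r in records), key=len) for f in fields}
    merged["source_ids"] = sorted(r["id"] for r in records)  # keeps lineage
    return merged

rows = {
    1: {"id": 1, "name": "Robert Smith", "address": "12 Oak St"},
    2: {"id": 2, "name": "Bob Smith", "address": "12 Oak Street"},
    3: {"id": 3, "name": "Alice Jones", "address": "4 Elm Rd"},
}
clusters = cluster(rows.keys(), [(1, 2)])
golden = [survivor([rows[i] for i in c]) for c in clusters]
print(clusters)
print(golden[0])
```

Recording `source_ids` on the merged record is one simple way to keep the lineage needed to split a cluster again later.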

The hard parts are precision-recall trade-offs and transitivity. Set thresholds too loose and you merge distinct people into one identity, leaking data or corrupting analytics; too strict and you scatter one customer across many accounts, inflating counts and breaking personalization. Transitive merges compound this: if A matches B and B matches C, you may inadvertently merge A and C even though they are unrelated. Production systems address these with human review queues for borderline pairs, audit trails that make merges reversible, and incremental resolution so new records are matched against existing clusters rather than re-running the whole batch. Done well, entity resolution turns fragmented data into a trustworthy single view of each entity; done carelessly it silently fabricates or destroys identity.
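One way to picture the review-queue idea is a two-threshold decision: pairs above an upper threshold merge automatically, pairs between the two thresholds go to a human, and the rest are dropped. The sketch below assumes similarity scores in the range 0 to 1; the names `classify` and `route` and the threshold values 0.70 and 0.90 are placeholders, not recommended settings.

```python
def classify(score, lower=0.70, upper=0.90):
    """Three-way decision; both thresholds are placeholders to tune on labeled data."""
    if score >= upper:
        return "match"
    if score >= lower:
        return "possible"
    return "non-match"

def route(scored_pairs, lower=0.70, upper=0.90):
    """Split scored pairs into automatic merges and a human review queue."""
    auto_merge, review_queue = [], []
    for a, b, score in scored_pairs:
        decision = classify(score, lower, upper)
        if decision == "match":
            auto_merge.append((a, b))
        elif decision == "possible":
            review_queue.append((a, b, score))
        # non-matches are dropped; nothing is merged for them
    return auto_merge, review_queue

pairs = [(1, 2, 0.95), (2, 3, 0.78), (4, 5, 0.40)]
auto_merge, review_queue = route(pairs)
print(auto_merge)     # pairs safe to merge without review
print(review_queue)   # borderline pairs held for a human decision
```

Feeding only the automatic matches into clustering also limits transitive chaining: in this example the weak link between records 2 and 3 waits for review instead of pulling record 3 into the cluster that contains record 1.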

Common Misconception

✗ Deduplication just means dropping rows that are byte-for-byte identical. In reality the hard cases are non-identical records that still refer to the same entity, which requires similarity scoring and probabilistic matching rather than an exact-equality check.

Why It Matters

Unresolved duplicates inflate metrics, fragment customer history, and break downstream systems, while over-aggressive merging conflates distinct identities and can leak one person's data to another. Both failure modes silently corrupt analytics and decisions built on the data.

Common Mistakes

Relying on exact equality or a single field to detect duplicates, missing records that differ by spelling, formatting, or typos.
Skipping a blocking step and attempting all-pairs comparison, which scales quadratically and fails on large datasets.
Applying transitive merges blindly so weak A-B and B-C links collapse unrelated A and C into one entity.
Setting match thresholds without measuring precision and recall on a labeled sample, so you cannot quantify false merges.
Making merges irreversible with no audit trail, so an incorrect merge cannot be undone or investigated.
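The threshold mistake above can be made concrete with a small measurement harness. This is a sketch under assumptions: `precision_recall` is a hypothetical helper, the scored pairs and labels are made-up illustration data, and a real evaluation would use a much larger labeled sample.

```python
def precision_recall(scored_pairs, true_matches, threshold):
    """Compare pairs predicted at a threshold against hand-labeled true matches."""
    predicted = {frozenset((a, b)) for a, b, s in scored_pairs if s >= threshold}
    truth = {frozenset(p) for p in true_matches}
    tp = len(predicted & truth)  # correctly predicted matches
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / len(truth) if truth else 1.0
    return precision, recall

# Illustrative data only: (id_a, id_b, similarity score)
scored = [(1, 2, 0.91), (3, 4, 0.84), (5, 6, 0.79), (7, 8, 0.62)]
labeled_matches = [(1, 2), (5, 6), (7, 8)]  # (3, 4) was labeled a non-match

for t in (0.6, 0.8, 0.9):
    p, r = precision_recall(scored, labeled_matches, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

On this toy sample the loosest threshold finds every true match but admits one false merge, while the strictest threshold makes no false merges but misses two of the three true matches. Seeing that trade-off in numbers is what an unmeasured threshold cannot give you.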

Avoid When

Records already carry a reliable shared identifier (such as a verified national ID or stable foreign key) that makes fuzzy matching unnecessary.
The dataset is small and static enough that manual review beats building a matching pipeline.
Incorrect merges carry unacceptable risk and you lack any review or rollback mechanism.
The real problem is malformed values within single records, which is data cleaning rather than entity resolution.

When To Use

Building a single customer or product view by merging records from multiple source systems.
Deduplicating CRM, mailing, or master-data records that lack a shared reliable key.
Linking datasets across organizations where the same entities appear with inconsistent formatting.
Maintaining a knowledge graph or master data store that must keep one canonical node per real-world entity.

Code Examples

✗ Vulnerable

# Naive dedup: only collapses rows whose name and address are byte-identical
def dedupe(records):
    seen = {}
    for r in records:
        # exact-match key misses 'Bob' vs 'Robert', '12 Oak St' vs '12 Oak Street'
        key = (r["name"], r["address"])
        seen.setdefault(key, r)
    # and a name-only merge would conflate two different John Smiths
    return list(seen.values())

rows = [
    {"name": "Robert Smith", "address": "12 Oak St"},
    {"name": "Bob Smith", "address": "12 Oak Street"},  # same person, kept as duplicate
]
print(len(dedupe(rows)))  # -> 2, the duplicate survives

✓ Fixed

from itertools import combinations
from difflib import SequenceMatcher

def sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def block_key(r):
    # blocking: only compare records sharing a postcode to avoid O(n^2)
    return r["postcode"]

def resolve(records, threshold=0.75):
    # threshold is illustrative only; tune it against a labeled sample
    blocks = {}
    for r in records:
        blocks.setdefault(block_key(r), []).append(r)
    matches = []
    for bucket in blocks.values():
        for a, b in combinations(bucket, 2):
            # weighted field-level similarity: name counts for more than address
            score = 0.6 * sim(a["name"], b["name"]) + 0.4 * sim(a["address"], b["address"])
            if score >= threshold:
                matches.append((a["id"], b["id"], round(score, 3)))
    return matches  # feed borderline scores to a review queue, keep an audit trail

rows = [
    {"id": 1, "name": "Robert Smith", "address": "12 Oak St", "postcode": "OX1"},
    {"id": 2, "name": "Bob Smith", "address": "12 Oak Street", "postcode": "OX1"},
]
print(resolve(rows))  # -> [(1, 2, 0.784)] candidate match surfaced for merge
