TF-IDF

Search · Intermediate

debt(d7/e5/b5/t7)

d7 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'only careful code review or runtime testing' (d7). No tools are listed in detection_hints, and misuse of TF-IDF (e.g. cross-index score comparison, missing length normalisation, stop words in index) produces subtly degraded relevance that only manifests in production or under careful query-result review. There is no static analysis or linter that catches semantic misuse of a ranking algorithm.

e5 Effort Remediation debt — work required to fix once spotted

Closest to 'touches multiple files / significant refactor in one component' (e5). The quick_fix points to switching to MATCH...AGAINST or ts_rank and configuring stop words in the index, but fixing misuse (e.g. adding length normalisation, log-normalised TF, removing stop words, rebuilding the index) requires changes to index configuration, query logic, and potentially re-indexing the entire corpus — touching multiple layers of the search pipeline within one component.

b5 Burden Structural debt — long-term weight of choosing wrong

Closest to 'persistent productivity tax' (b5). TF-IDF as a ranking strategy shapes how indexes are built, how stop words are managed, and how relevance is understood across the team. Common mistakes (no length normalisation, stop words in index, no synonym handling) persist as silent quality debt and slow down search improvement work, but the burden is scoped to the search/indexing component rather than the entire codebase.

t7 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'serious trap' (t7). The canonical misconception is that TF-IDF scores are absolute and comparable across indexes, which contradicts how most developers expect numeric scores to behave (similar to probabilities or percentages). Combined with common mistakes like cross-collection comparison and raw TF usage, a competent developer unfamiliar with the concept will confidently misuse it in ways that produce wrong results without obvious errors.

About DEBT scoring → scored by claude-sonnet-4-6 · 2026-05-11 · reviewed by human

Also Known As

TF-IDF, term frequency inverse document frequency, TF/IDF, relevance scoring

TL;DR

Term Frequency–Inverse Document Frequency — a relevance scoring formula that ranks documents higher when a query term appears frequently in them but rarely across the whole collection.

Explanation

TF-IDF combines two signals. Term Frequency (TF) measures how often a term appears in a specific document, usually normalised by document length. Inverse Document Frequency (IDF) is the logarithm of the ratio of total documents to documents containing the term. A term that appears in every document (like 'the') has IDF near zero and contributes little to ranking. A rare, specific term has high IDF and strongly differentiates relevant documents. The combined score TF × IDF ranks highest the documents where the query term is both frequent and distinctive. TF-IDF was the dominant relevance formula before BM25, which refines it with term-frequency saturation and better length normalisation. MySQL's natural-language full-text search ranks with a TF-IDF variant. PostgreSQL's ts_rank is related but narrower: it ranks on term frequency, optional length normalisation, and lexeme weights, and does not use corpus-wide statistics such as IDF. Understanding TF-IDF explains why search results change when your document collection grows: IDF values shift as the corpus changes.
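The formula described above can be sketched in a few lines of PHP. This is a minimal illustration, not how any database implements it: the function names `tokenize` and `tfidfScores` are hypothetical, the tokenizer is deliberately naive, TF is the count divided by document length, and IDF is the unsmoothed natural log of N / df.

```php
<?php
/**
 * Split text into lowercase word tokens (a deliberately naive tokenizer).
 *
 * @return string[]
 */
function tokenize(string $text): array
{
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

/**
 * Score every document in $corpus against $query with TF x IDF.
 *
 * @param string[] $corpus
 * @return float[] scores keyed like $corpus
 */
function tfidfScores(array $corpus, string $query): array
{
    $docs = array_map('tokenize', $corpus);
    $n = count($docs);
    $scores = array_fill_keys(array_keys($corpus), 0.0);

    foreach (array_unique(tokenize($query)) as $term) {
        // Document frequency: how many documents contain the term at all.
        $df = count(array_filter($docs, fn(array $d): bool => in_array($term, $d, true)));
        if ($df === 0) {
            continue; // term absent from the corpus, contributes nothing
        }
        $idf = log($n / $df); // 0.0 when every document contains the term

        foreach ($docs as $key => $tokens) {
            if ($tokens === []) {
                continue; // avoid dividing by zero for empty documents
            }
            $count = count(array_keys($tokens, $term, true));
            $tf = $count / count($tokens); // length-normalised term frequency
            $scores[$key] += $tf * $idf;
        }
    }
    return $scores;
}

$corpus = [
    'the cat sat on the mat',
    'the dog chased the cat',
    'the quantum computer is fast',
];
$scores = tfidfScores($corpus, 'the quantum cat');

foreach ($scores as $i => $score) {
    printf("doc %d: %.4f\n", $i, $score);
}
```

In this toy corpus 'the' appears in every document, so its IDF is zero and it adds nothing to any score, while the single document containing 'quantum' ranks first.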

Common Misconception

✗ TF-IDF scores are absolute and can be compared across different search indexes. TF-IDF scores are relative to the document collection they were computed on — a score of 0.8 in one index means nothing compared to 0.8 in another. IDF is computed from the full corpus, so adding or removing documents changes all scores. When comparing relevance across collections, normalise scores or use precision/recall metrics instead.
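To make the point concrete, the sketch below computes the IDF of the same term in two small, made-up collections. The helper name `idf` and both corpora are hypothetical, and the unsmoothed log(N / df) form is an assumption; real engines use smoothed variants.

```php
<?php
/**
 * Unsmoothed IDF: log(N / df). Returns 0.0 when the term is absent.
 *
 * @param string[] $corpus
 */
function idf(string $term, array $corpus): float
{
    $df = 0;
    foreach ($corpus as $doc) {
        if (in_array($term, explode(' ', strtolower($doc)), true)) {
            $df++;
        }
    }
    return $df === 0 ? 0.0 : log(count($corpus) / $df);
}

// Two illustrative indexes that both contain the document 'apple pie recipe'.
$recipes = ['apple pie recipe', 'apple tart recipe', 'apple crumble recipe'];
$mixed   = ['apple pie recipe', 'laptop review', 'train timetable', 'tax guide'];

$idfRecipes = idf('apple', $recipes); // every document has the term
$idfMixed   = idf('apple', $mixed);   // only one document in four has it

printf("apple in recipes index: %.3f\n", $idfRecipes); // 0.000
printf("apple in mixed index:   %.3f\n", $idfMixed);   // 1.386
```

The document 'apple pie recipe' is identical in both indexes, yet a query for 'apple' gives it a score of zero in one and a positive score in the other, which is why raw scores from different indexes should not be compared directly.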

Why It Matters

TF-IDF is the conceptual foundation for understanding why search results are ranked the way they are. When a PHP developer builds a search feature and notices that common words produce irrelevant results while specific terms work well, that is TF-IDF behaving correctly: common words have low IDF. When rankings shift as the document collection grows, changing IDF values are a likely cause. Understanding TF-IDF also explains why stop-word lists exist and why synonym expansion at index time can improve recall, although overly broad synonym sets can cost precision.

Common Mistakes

Including stop words in the index — words present in nearly every document have IDF near zero and waste index space.
Not normalising by document length — long documents naturally contain more term occurrences and score higher without length normalisation.
Using raw term frequency instead of log-normalised TF — raw counts overweight high-frequency terms disproportionately.
Expecting TF-IDF to handle synonyms — TF-IDF is purely lexical; synonym expansion must be done at index or query time separately.
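The difference between raw and log-normalised term frequency is easiest to see with numbers. The sketch below assumes the common 1 + ln(count) form of log normalisation; `rawTf` and `logTf` are hypothetical helper names.

```php
<?php
/** Raw term frequency: the plain occurrence count. */
function rawTf(int $count): float
{
    return (float) $count;
}

/** Log-normalised TF: 1 + ln(count), or 0.0 when the term is absent. */
function logTf(int $count): float
{
    return $count > 0 ? 1.0 + log($count) : 0.0;
}

foreach ([1, 10, 100] as $count) {
    printf("count=%3d raw=%6.1f log=%.2f\n", $count, rawTf($count), logTf($count));
}
```

With raw counts, a document that repeats a term 100 times scores 100 times higher than one that mentions it once. With the log form the gap is under 6 times, so repetition still counts but cannot dominate the ranking.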

Code Examples

✗ Vulnerable

// ❌ Treating all words equally with simple keyword counting
function score(string $doc, string $query): int {
    $count = 0;
    foreach (explode(' ', strtolower($query)) as $word) {
        $count += substr_count(strtolower($doc), $word);
    }
    return $count;
    // "the", "a", "is" score as highly as rare meaningful terms
}

✓ Fixed

// ✅ Use the database's built-in full-text ranking; don't reimplement it
// MySQL FULLTEXT with IDF weighting (requires a FULLTEXT index on title, content)
// Distinct placeholders: PDO cannot reuse one named parameter when emulated prepares are off
$stmt = $pdo->prepare("
    SELECT *, MATCH(title, content) AGAINST(:q_rank IN NATURAL LANGUAGE MODE) AS relevance
    FROM articles
    WHERE MATCH(title, content) AGAINST(:q_match IN NATURAL LANGUAGE MODE)
    ORDER BY relevance DESC
    LIMIT 20
");
$stmt->execute([':q_rank' => $userQuery, ':q_match' => $userQuery]);

// PostgreSQL ts_rank (term-frequency based; it does not apply a corpus-wide IDF)
$stmt = $pdo->prepare("
    SELECT title, ts_rank(to_tsvector('english', content),
                          plainto_tsquery('english', :q_rank)) AS rank
    FROM articles
    WHERE to_tsvector('english', content) @@ plainto_tsquery('english', :q_match)
    ORDER BY rank DESC
    LIMIT 20
");
$stmt->execute([':q_rank' => $userQuery, ':q_match' => $userQuery]);
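PostgreSQL's ts_rank ignores document length by default, which is the length-normalisation mistake listed earlier. Its optional third argument is a normalisation flag: 1 divides the rank by 1 + the logarithm of the document length, and 2 divides it by the document length. The sketch below passes that flag through PDO. The helper name `searchArticles` and the constant `RANKED_SEARCH_SQL` are hypothetical, and the `articles` table with `title` and `content` columns is assumed from the examples on this page.

```php
<?php
// Hypothetical query: ts_rank with an explicit length-normalisation flag.
const RANKED_SEARCH_SQL = "
    SELECT title,
           ts_rank(to_tsvector('english', content),
                   plainto_tsquery('english', :q_rank),
                   CAST(:norm AS integer)) AS rank
    FROM articles
    WHERE to_tsvector('english', content) @@ plainto_tsquery('english', :q_match)
    ORDER BY rank DESC
    LIMIT 20
";

/**
 * Run a length-normalised full-text search (hypothetical helper).
 *
 * @param int $normalization ts_rank flag: 0 = ignore length,
 *                           1 = divide by 1 + log(length), 2 = divide by length
 * @return array<int, array<string, mixed>> rows with 'title' and 'rank'
 */
function searchArticles(PDO $pdo, string $userQuery, int $normalization = 1): array
{
    $stmt = $pdo->prepare(RANKED_SEARCH_SQL);
    // Distinct placeholders keep the query valid without emulated prepares.
    $stmt->bindValue(':q_rank', $userQuery);
    $stmt->bindValue(':q_match', $userQuery);
    $stmt->bindValue(':norm', $normalization, PDO::PARAM_INT);
    $stmt->execute();
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

A call such as `searchArticles($pdo, 'quantum computing', 2)` would then favour short, focused articles over long ones that merely mention the terms often. Which flag suits a given corpus is a judgment call best checked against real queries.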
