TF-IDF
debt(d7/e5/b5/t7)
Closest to 'only careful code review or runtime testing' (d7). No tools are listed in detection_hints, and misuse of TF-IDF (e.g. cross-index score comparison, missing length normalisation, stop words in index) produces subtly degraded relevance that only manifests in production or under careful query-result review. There is no static analysis or linter that catches semantic misuse of a ranking algorithm.
Closest to 'touches multiple files / significant refactor in one component' (e5). The quick_fix points to switching to MATCH...AGAINST or ts_rank and configuring stop words in the index, but fixing misuse (e.g. adding length normalisation, log-normalised TF, removing stop words, rebuilding the index) requires changes to index configuration, query logic, and potentially re-indexing the entire corpus — touching multiple layers of the search pipeline within one component.
Closest to 'persistent productivity tax' (b5). TF-IDF as a ranking strategy shapes how indexes are built, how stop words are managed, and how relevance is understood across the team. Common mistakes (no length normalisation, stop words in index, no synonym handling) persist as silent quality debt and slow down search improvement work, but the burden is scoped to the search/indexing component rather than the entire codebase.
Closest to 'serious trap' (t7). The canonical misconception is that TF-IDF scores are absolute and comparable across indexes, which contradicts how most developers expect numeric scores to behave (similar to probabilities or percentages). Combined with common mistakes like cross-collection comparison and raw TF usage, a competent developer unfamiliar with the concept will confidently misuse it in ways that produce wrong results without obvious errors.
Also Known As
TL;DR
Explanation
TF-IDF combines two signals. Term Frequency (TF) measures how often a term appears in a specific document, normalised by document length. Inverse Document Frequency (IDF) is the log of the ratio of total documents to documents containing the term. A term that appears in every document (like 'the') has an IDF of zero or close to it and contributes little to ranking. A rare, specific term has high IDF and strongly differentiates relevant documents. The combined score TF × IDF ranks highest the documents where the query term is both frequent and distinctive. TF-IDF was the dominant relevance formula before BM25, which refines it with term-frequency saturation and better length normalisation. MySQL's InnoDB full-text search ranks with a TF-IDF variant. PostgreSQL's ts_rank is frequency-based but, per its documentation, uses no corpus-wide information, so it has no IDF component. Understanding TF-IDF explains why search results change when your document collection grows: IDF values shift as the corpus changes.
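The arithmetic above can be sketched in a few lines of PHP. This is a minimal illustration, assuming a naive tokenizer, the natural log, and the plain formula TF × IDF; the function names (`tokenize`, `termFrequency`, `inverseDocumentFrequency`, `tfIdf`) and the sample corpus are hypothetical, and real engines use smoothed or otherwise adjusted variants.

```php
<?php
declare(strict_types=1);

/** Split text into lowercase word tokens (deliberately naive, illustration only). */
function tokenize(string $text): array
{
    $tokens = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return $tokens === false ? [] : $tokens;
}

/** TF: occurrences of $term divided by the number of tokens in the document. */
function termFrequency(string $term, array $tokens): float
{
    if ($tokens === []) {
        return 0.0;
    }
    $counts = array_count_values($tokens);
    return (float) (($counts[$term] ?? 0) / count($tokens));
}

/** IDF: natural log of (total documents / documents containing $term). */
function inverseDocumentFrequency(string $term, array $corpus): float
{
    $containing = 0;
    foreach ($corpus as $doc) {
        if (in_array($term, tokenize($doc), true)) {
            $containing++;
        }
    }
    // A term found in no document gets 0 here to avoid division by zero.
    return $containing === 0 ? 0.0 : log(count($corpus) / $containing);
}

/** Combined score for one term in one document, relative to one corpus. */
function tfIdf(string $term, string $doc, array $corpus): float
{
    return termFrequency($term, tokenize($doc)) * inverseDocumentFrequency($term, $corpus);
}

$corpus = [
    'the cat sat on the mat',
    'the dog chased the cat',
    'the quantum computer solved the problem',
];

// 'the' appears in every document, so its IDF is log(3/3) = 0.
$idfThe = inverseDocumentFrequency('the', $corpus);

// 'quantum' appears in one of three documents: TF = 1/6, IDF = log(3).
$before = tfIdf('quantum', $corpus[2], $corpus);

// Add three unrelated documents: the same document now scores with IDF = log(6).
$larger = array_merge($corpus, ['the bird sang', 'the fish swam', 'the sun rose']);
$after = tfIdf('quantum', $corpus[2], $larger);

printf("idf(the)=%.4f  before=%.4f  after=%.4f\n", $idfThe, $before, $after);
```

The document text never changed between `$before` and `$after`; only the corpus did. That is why a score from one index, or from the same index at a different point in time, is not directly comparable with another.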
Common Misconception
Why It Matters
Common Mistakes
- Including stop words in the index — words present in nearly every document have IDF near zero and waste index space.
- Not normalising by document length — long documents naturally contain more term occurrences and score higher without length normalisation.
- Using raw term frequency instead of log-normalised TF — raw counts overweight high-frequency terms disproportionately.
- Expecting TF-IDF to handle synonyms — TF-IDF is purely lexical; synonym expansion must be done at index or query time separately.
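Two of the mistakes above, raw term frequency and missing length normalisation, are easy to see with numbers. The following is a rough sketch, assuming the common 1 + ln(count) form of log-normalised TF and a simple count/length normalisation; the function names and sample counts are hypothetical, and production engines typically use more refined schemes such as BM25.

```php
<?php
declare(strict_types=1);

/** Raw TF: the plain occurrence count. */
function rawTf(int $count): float
{
    return (float) $count;
}

/** Log-normalised TF: 1 + ln(count) for count > 0, otherwise 0. */
function logTf(int $count): float
{
    return $count > 0 ? 1.0 + log($count) : 0.0;
}

/** Length-normalised TF: occurrence count divided by document length in tokens. */
function lengthNormalisedTf(int $count, int $docLength): float
{
    return $docLength > 0 ? (float) ($count / $docLength) : 0.0;
}

// With raw counts, 100 occurrences weigh 100 times as much as one occurrence.
$rawRatio = rawTf(100) / rawTf(1);

// With log normalisation the gap shrinks to roughly 5.6 times.
$logRatio = logTf(100) / logTf(1);

// A 5000-token document with 10 hits versus a 100-token document with 3 hits:
// raw counts favour the long document, length normalisation favours the short one.
$longDoc  = lengthNormalisedTf(10, 5000);
$shortDoc = lengthNormalisedTf(3, 100);

printf("raw ratio=%.1f  log ratio=%.2f  long=%.4f  short=%.4f\n",
    $rawRatio, $logRatio, $longDoc, $shortDoc);
```

Which normalisation is appropriate depends on the corpus; the point of the sketch is only that the choice changes the ranking order, not that these particular formulas are the right ones for a given index.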
Code Examples
// ❌ Treating all words equally with simple keyword counting
function score(string $doc, string $query): int {
    $count = 0;
    foreach (explode(' ', strtolower($query)) as $word) {
        $count += substr_count(strtolower($doc), $word);
    }
    // "the", "a", "is" score as highly as rare meaningful terms
    return $count;
}
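// ✅ Use the database's built-in ranking; don't reimplement it
// MySQL FULLTEXT (InnoDB ranks with a TF-IDF variant); needs a FULLTEXT index on (title, content)
// Distinct placeholders: PDO rejects a reused named parameter when emulated prepares are off
$stmt = $pdo->prepare("
    SELECT *, MATCH(title, content) AGAINST(:q_rank IN NATURAL LANGUAGE MODE) AS relevance
    FROM articles
    WHERE MATCH(title, content) AGAINST(:q_match IN NATURAL LANGUAGE MODE)
    ORDER BY relevance DESC
    LIMIT 20
");
$stmt->execute([':q_rank' => $userQuery, ':q_match' => $userQuery]);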
// PostgreSQL ts_rank (frequency-based; uses no corpus-wide statistics, so there is no IDF term)
$stmt = $pdo->prepare("
    SELECT title, ts_rank(to_tsvector('english', content),
                          plainto_tsquery('english', :q_rank)) AS rank
    FROM articles
    WHERE to_tsvector('english', content) @@ plainto_tsquery('english', :q_match)
    ORDER BY rank DESC
");
$stmt->execute([':q_rank' => $userQuery, ':q_match' => $userQuery]);