TF-IDF
Also Known As
TL;DR
Explanation
TF-IDF combines two signals. Term Frequency (TF) is how often a term appears in a specific document, usually normalised by document length. Inverse Document Frequency (IDF) is the logarithm of the ratio of total documents to documents containing the term. A term that appears in every document (like 'the') has an IDF of zero, or near zero in smoothed variants, and contributes little to ranking. A rare, specific term has high IDF and strongly differentiates relevant documents. The combined score, TF × IDF, ranks highest the documents in which the query term is frequent within the document and distinctive across the corpus.
TF-IDF was the dominant relevance formula before BM25, which refines it with term-frequency saturation and better length normalisation. MySQL's full-text search uses a TF-IDF variant internally. PostgreSQL's built-in ts_rank is different: it ranks by the frequency of matching lexemes within each document and does not use corpus-wide IDF. Understanding TF-IDF explains why search results can change when your document collection grows: IDF scores shift as the corpus changes.
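The definitions above can be sketched in a few lines of PHP. This is a minimal illustration, not a production scorer: the function names (tokenize, termFrequency, inverseDocumentFrequency, tfIdf) and the toy corpus are hypothetical, and it assumes the plain, unsmoothed variant with TF = count / document length and IDF = ln(N / df).

```php
<?php
declare(strict_types=1);

// Hypothetical helper: lowercase the text and split on non-alphanumeric characters.
function tokenize(string $text): array
{
    $tokens = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return $tokens === false ? [] : $tokens;
}

// TF: occurrences of the term divided by the number of tokens in the document.
function termFrequency(string $term, array $tokens): float
{
    if ($tokens === []) {
        return 0.0;
    }
    $counts = array_count_values($tokens);
    return ($counts[$term] ?? 0) / count($tokens);
}

// IDF: natural log of (total documents / documents containing the term).
function inverseDocumentFrequency(string $term, array $tokenizedDocs): float
{
    $containing = 0;
    foreach ($tokenizedDocs as $tokens) {
        if (in_array($term, $tokens, true)) {
            $containing++;
        }
    }
    // A term absent from the corpus gets zero weight instead of a division by zero.
    return $containing === 0 ? 0.0 : log(count($tokenizedDocs) / $containing);
}

function tfIdf(string $term, array $tokens, array $tokenizedDocs): float
{
    return termFrequency($term, $tokens) * inverseDocumentFrequency($term, $tokenizedDocs);
}

// Toy corpus (made up for this example).
$corpus = array_map('tokenize', [
    'the cat sat on the mat',
    'the dog chased the cat',
    'the quantum computer factored the number',
]);

$idfThe     = inverseDocumentFrequency('the', $corpus);     // in all 3 documents: ln(3/3) = 0
$idfQuantum = inverseDocumentFrequency('quantum', $corpus); // in 1 of 3 documents: ln(3)
$score      = tfIdf('quantum', $corpus[2], $corpus);        // (1/6) * ln(3)

// Growing the corpus with more documents that mention "quantum" lowers its IDF.
$grown = array_merge($corpus, array_map('tokenize', [
    'quantum physics primer',
    'quantum dots',
]));
$idfQuantumGrown = inverseDocumentFrequency('quantum', $grown); // ln(5/3)
```

In this sketch 'the' contributes nothing to any score, while 'quantum' loses weight once two more documents mention it. That is the corpus-dependent shift described above.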
Common Misconception
Why It Matters
Common Mistakes
- Including stop words in the index: words present in nearly every document have IDF near zero and waste index space.
- Not normalising by document length: long documents naturally contain more term occurrences, so they score higher regardless of relevance.
- Using raw term frequency instead of log-normalised TF: with raw counts, a term repeated many times dominates the score.
- Expecting TF-IDF to handle synonyms: TF-IDF is purely lexical, so synonym expansion must be done separately at index or query time.
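Two of the points above are easy to show in code. The sketch below assumes the common sublinear form 1 + ln(count) for log-normalised TF and a hand-written synonym map for query-time expansion. Both helper names (logTermFrequency, expandQuery) and the synonym data are hypothetical.

```php
<?php
declare(strict_types=1);

// Sublinear TF: 1 + ln(count) for a term that occurs, 0 for a term that does not.
function logTermFrequency(int $rawCount): float
{
    return $rawCount > 0 ? 1.0 + log($rawCount) : 0.0;
}

// Query-time synonym expansion from a simple term => alternatives map.
// The map is an assumption; a real system would typically load a thesaurus.
function expandQuery(array $terms, array $synonyms): array
{
    $expanded = [];
    foreach ($terms as $term) {
        $expanded[] = $term;
        foreach ($synonyms[$term] ?? [] as $alternative) {
            $expanded[] = $alternative;
        }
    }
    // Drop duplicates while keeping first-seen order.
    return array_values(array_unique($expanded));
}

// With raw counts, 100 occurrences weigh 100 times as much as a single occurrence.
$rawRatio = 100 / 1;
// With log-normalised TF the same gap shrinks to 1 + ln(100), roughly 5.6.
$logRatio = logTermFrequency(100) / logTermFrequency(1);

$terms = expandQuery(['car', 'repair'], ['car' => ['automobile', 'vehicle']]);
// $terms is ['car', 'automobile', 'vehicle', 'repair']
```

The expanded term list would then be scored term by term, since TF-IDF itself has no notion that 'car' and 'automobile' are related.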
Code Examples
// ❌ Treating all words equally with simple keyword counting
function score(string $doc, string $query): int {
    $count = 0;
    foreach (explode(' ', strtolower($query)) as $word) {
        $count += substr_count(strtolower($doc), $word);
    }
    // "the", "a", "is" score as highly as rare meaningful terms
    return $count;
}
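// ✅ Use the database's built-in full-text ranking instead of reimplementing it
// MySQL FULLTEXT with IDF weighting (requires a FULLTEXT index on title, content)
// Distinct placeholders are used because PDO only allows reusing a named
// placeholder when emulated prepares are enabled.
$stmt = $pdo->prepare("
    SELECT *, MATCH(title, content) AGAINST(:q1 IN NATURAL LANGUAGE MODE) AS relevance
    FROM articles
    WHERE MATCH(title, content) AGAINST(:q2 IN NATURAL LANGUAGE MODE)
    ORDER BY relevance DESC
    LIMIT 20
");
$stmt->execute([':q1' => $userQuery, ':q2' => $userQuery]);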
// PostgreSQL ts_rank (ranks by frequency of matching lexemes; no corpus-wide IDF)
$stmt = $pdo->prepare("
    SELECT title, ts_rank(to_tsvector('english', content),
                          plainto_tsquery('english', :q1)) AS rank
    FROM articles
    WHERE to_tsvector('english', content) @@ plainto_tsquery('english', :q2)
    ORDER BY rank DESC
");
$stmt->execute([':q1' => $userQuery, ':q2' => $userQuery]);