
TF-IDF

search Intermediate

Also Known As

TF-IDF, term frequency inverse document frequency, TF/IDF, relevance scoring

TL;DR

Term Frequency–Inverse Document Frequency — a relevance scoring formula that ranks documents higher when a query term appears frequently in them but rarely across the whole collection.

Explanation

TF-IDF combines two signals. Term Frequency (TF) is how often a term appears in a specific document, often normalised by document length. Inverse Document Frequency (IDF) is the logarithm of the ratio of total documents to documents containing the term. A term that appears in nearly every document (like 'the') has an IDF near zero and contributes little to ranking, while a rare, specific term has a high IDF and strongly differentiates relevant documents. The combined score TF × IDF therefore favours documents where the query term is both frequent and distinctive. TF-IDF was the dominant relevance formula before BM25, which refines it with term-frequency saturation and better length normalisation. MySQL's InnoDB full-text search ranks with a TF-IDF variant; PostgreSQL's ts_rank weighs term frequency and lexeme weights within each document but uses no corpus-wide IDF. Understanding TF-IDF explains why search results can change when your document collection grows: IDF values shift as the corpus changes.
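The two signals described above can be sketched in plain PHP. This is a minimal illustration, not what any database does internally: the helper names (`tokenize`, `termFrequency`, `inverseDocumentFrequency`, `tfIdf`) and the three-document corpus are assumptions made up for this example, and length-normalised TF times log(N / df) is only one of several common variants.

```php
<?php
declare(strict_types=1);

/** Split text into lowercase word tokens (hypothetical helper). */
function tokenize(string $text): array
{
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

/** TF: occurrences of $term divided by the document's token count. */
function termFrequency(string $term, array $tokens): float
{
    if ($tokens === []) {
        return 0.0;
    }
    $counts = array_count_values($tokens);
    return ($counts[$term] ?? 0) / count($tokens);
}

/** IDF: log(N / df), or 0.0 when no document contains the term. */
function inverseDocumentFrequency(string $term, array $tokenizedDocs): float
{
    $df = 0;
    foreach ($tokenizedDocs as $tokens) {
        if (in_array($term, $tokens, true)) {
            $df++;
        }
    }
    return $df === 0 ? 0.0 : log(count($tokenizedDocs) / $df);
}

/** Combined score for one term in one document of the corpus. */
function tfIdf(string $term, array $tokens, array $tokenizedDocs): float
{
    return termFrequency($term, $tokens) * inverseDocumentFrequency($term, $tokenizedDocs);
}

// Toy corpus, invented for illustration.
$corpus = array_map('tokenize', [
    'the cat sat on the mat',
    'the dog chased the cat',
    'the quick brown fox',
]);

// "the" occurs in every document, so its IDF is log(3/3) = 0.
// "fox" occurs in one document only, so its IDF is log(3/1).
printf("the: %.3f\n", tfIdf('the', $corpus[2], $corpus));
printf("fox: %.3f\n", tfIdf('fox', $corpus[2], $corpus));
```

In the third document both words occur once, yet 'the' scores zero while 'fox' scores about 0.275, because only the IDF factor differs.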

Common Misconception

The misconception is that TF-IDF scores are absolute and can be compared across different search indexes. In reality, TF-IDF scores are relative to the document collection they were computed on, so a score of 0.8 in one index means nothing compared to 0.8 in another. IDF is computed from the full corpus, so adding or removing documents can change the score of every document that contains an affected term. When comparing relevance across collections, normalise scores or use precision/recall metrics instead.
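To illustrate why a score is tied to its corpus, the sketch below computes the IDF of one term against two collections. The `idf` helper and the document counts are made-up values for illustration only, and the formula is the plain log(N / df) variant.

```php
<?php
declare(strict_types=1);

/** Plain IDF variant: log(total documents / documents containing the term). */
function idf(int $totalDocs, int $docsWithTerm): float
{
    if ($totalDocs <= 0 || $docsWithTerm <= 0 || $docsWithTerm > $totalDocs) {
        throw new InvalidArgumentException('Need 0 < docsWithTerm <= totalDocs.');
    }
    return log($totalDocs / $docsWithTerm);
}

// Hypothetical counts: the same term in a PHP-focused index and a general one.
$idfInPhpIndex     = idf(1000, 800);     // the term is common in this collection
$idfInGeneralIndex = idf(1000000, 500);  // the term is rare in this collection

printf("PHP-focused index: %.3f\n", $idfInPhpIndex);
printf("General index:     %.3f\n", $idfInGeneralIndex);
```

With these assumed counts the same term weighs roughly 0.22 in one index and 7.60 in the other, so identical documents would receive very different scores depending on where they are indexed.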

Why It Matters

TF-IDF is the conceptual foundation for understanding why search results are ranked the way they are. When a PHP developer builds a search feature and notices that common words produce irrelevant results while specific terms work well, that is TF-IDF behaving correctly: common words have low IDF. When rankings shift as the document collection grows, changing IDF values are a likely cause. Understanding TF-IDF also explains why stop-word lists exist and why, because matching is purely lexical, synonym expansion is needed to improve recall, usually at some cost to precision.

Common Mistakes

  • Including stop words in the index — words present in nearly every document have IDF near zero and waste index space.
  • Not normalising by document length — long documents naturally contain more term occurrences and score higher without length normalisation.
  • Using raw term frequency instead of log-normalised TF — raw counts overweight high-frequency terms disproportionately.
  • Expecting TF-IDF to handle synonyms — TF-IDF is purely lexical; synonym expansion must be done at index or query time separately.
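Two of the mistakes above, raw counts and missing length normalisation, can be sketched in a few lines. The helper names are hypothetical, and `1 + log(count)` is one common damping choice among several, not a prescribed formula.

```php
<?php
declare(strict_types=1);

/** Raw TF: the plain occurrence count. */
function rawTf(int $count): float
{
    return (float) $count;
}

/** Log-normalised TF: 1 + ln(count), or 0 when the term is absent. */
function logTf(int $count): float
{
    return $count > 0 ? 1.0 + log($count) : 0.0;
}

/** Length-normalised TF: occurrences divided by document length in tokens. */
function lengthNormalisedTf(int $count, int $docLength): float
{
    return $docLength > 0 ? $count / $docLength : 0.0;
}

// A term repeated 100 times gets 100x the raw weight of a single mention,
// but only about 5.6x the log-normalised weight.
printf("raw: %.1f vs %.1f\n", rawTf(1), rawTf(100));
printf("log: %.1f vs %.1f\n", logTf(1), logTf(100));

// 5 hits in a 50-token note outrank 20 hits in a 2000-token manual
// once document length is taken into account.
printf("short: %.3f, long: %.3f\n", lengthNormalisedTf(5, 50), lengthNormalisedTf(20, 2000));
```

The damping and the length division address different problems, which is why ranking formulas such as BM25 apply both kinds of correction.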

Code Examples

✗ Vulnerable
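// ❌ Treating all words equally with simple keyword counting
function score(string $doc, string $query): int {
    $count = 0;
    $haystack = strtolower($doc);
    // PREG_SPLIT_NO_EMPTY skips empty words, which would make substr_count() throw in PHP 8
    foreach (preg_split('/\s+/', strtolower($query), -1, PREG_SPLIT_NO_EMPTY) as $word) {
        // Every match adds 1, so "the", "a", "is" score as highly as rare meaningful terms
        $count += substr_count($haystack, $word);
    }
    return $count;
}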
✓ Fixed
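// ✅ Use the database's built-in ranking; don't reimplement it
// MySQL FULLTEXT (InnoDB ranks with a TF-IDF variant)
// Requires a FULLTEXT index on (title, content).
// The placeholders have distinct names because PDO only allows reusing
// a named placeholder when emulated prepares are enabled.
$stmt = $pdo->prepare("
    SELECT *, MATCH(title, content) AGAINST(:q_rank IN NATURAL LANGUAGE MODE) AS relevance
    FROM articles
    WHERE MATCH(title, content) AGAINST(:q_match IN NATURAL LANGUAGE MODE)
    ORDER BY relevance DESC
    LIMIT 20
");
$stmt->execute([':q_rank' => $userQuery, ':q_match' => $userQuery]);

// PostgreSQL ts_rank (term frequency and lexeme weights; no corpus-wide IDF)
// Here $pdo is assumed to be a PostgreSQL connection.
$stmt = $pdo->prepare("
    SELECT title, ts_rank(to_tsvector('english', content),
                          plainto_tsquery('english', :q_rank)) AS rank
    FROM articles
    WHERE to_tsvector('english', content) @@ plainto_tsquery('english', :q_match)
    ORDER BY rank DESC
");
$stmt->execute([':q_rank' => $userQuery, ':q_match' => $userQuery]);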

Added 23 Mar 2026
DEV INTEL Tools & Severity
🔵 Info ⚙ Fix effort: Medium
⚡ Quick Fix
Use MATCH...AGAINST in MySQL, which ranks with a TF-IDF variant, or ts_rank in PostgreSQL, which ranks by term frequency and lexeme weights without corpus-wide IDF; configure stop words in your index for the query language
