{
    "slug": "tfidf",
    "term": "TF-IDF",
    "category": "search",
    "difficulty": "intermediate",
    "short": "Term Frequency–Inverse Document Frequency — a relevance scoring formula that ranks documents higher when a query term appears frequently in them but rarely across the whole collection.",
    "long": "TF-IDF combines two signals. Term Frequency (TF) measures how often a term appears in a specific document, usually normalised by document length. Inverse Document Frequency (IDF) is the logarithm of the ratio of total documents to documents containing the term. A term that appears in every document (like 'the') has an IDF at or near zero and contributes little to ranking. A rare, specific term has high IDF and strongly differentiates relevant documents. The combined score TF × IDF ranks highest the documents where the query term is both frequent and distinctive. TF-IDF was the dominant relevance formula before BM25, which refines it with term-frequency saturation and better length normalisation. MySQL full-text search ranks with a TF-IDF variant; PostgreSQL's ts_rank weighs lexeme frequency within each document but does not use corpus-wide IDF. Understanding TF-IDF explains why search results change when your document collection grows: IDF scores shift as the corpus changes.",
    "aliases": [
        "TF-IDF",
        "term frequency inverse document frequency",
        "TF/IDF",
        "relevance scoring"
    ],
    "tags": [
        "tfidf",
        "search",
        "relevance",
        "ranking",
        "nlp",
        "full-text-search"
    ],
    "misconception": "TF-IDF scores are absolute and can be compared across different search indexes. TF-IDF scores are relative to the document collection they were computed on — a score of 0.8 in one index means nothing compared to 0.8 in another. IDF is computed from the full corpus, so adding or removing documents changes all scores. When comparing relevance across collections, normalise scores or use precision/recall metrics instead.",
    "why_it_matters": "TF-IDF is the conceptual foundation for understanding why search results are ranked the way they are. When a PHP developer builds a search feature and notices that common words produce irrelevant results while specific terms work well, that is TF-IDF behaving correctly: common words have low IDF. When rankings shift or degrade as the document collection grows, changing IDF values are a likely cause. Understanding TF-IDF also explains why stop-word lists exist and why a purely lexical scorer needs synonym expansion to improve recall.",
    "common_mistakes": [
        "Including stop words in the index — words present in nearly every document have IDF near zero and waste index space.",
        "Not normalising by document length — long documents naturally contain more term occurrences and score higher without length normalisation.",
        "Using raw term frequency instead of log-normalised TF: raw counts let a term repeated many times dominate the score.",
        "Expecting TF-IDF to handle synonyms — TF-IDF is purely lexical; synonym expansion must be done at index or query time separately."
    ],
    "when_to_use": [],
    "avoid_when": [],
    "related": [
        "inverted_index",
        "bm25",
        "full_text_search",
        "semantic_search",
        "search_relevance"
    ],
    "prerequisites": [],
    "refs": [
        "https://en.wikipedia.org/wiki/Tf%E2%80%93idf"
    ],
    "bad_code": "// ❌ Treating all words equally with simple keyword counting\nfunction score(string $doc, string $query): int {\n    $count = 0;\n    foreach (explode(' ', strtolower($query)) as $word) {\n        $count += substr_count(strtolower($doc), $word);\n    }\n    return $count;\n    // \"the\", \"a\", \"is\" score as highly as rare meaningful terms\n}",
    "good_code": "// ✅ Use the database's built-in ranking instead of reimplementing it\n// MySQL FULLTEXT (ranks with a TF-IDF variant)\n// Distinct placeholder names: a repeated named parameter fails when PDO emulated prepares are off\n$stmt = $pdo->prepare(\"\n    SELECT *, MATCH(title, content) AGAINST(:q1 IN NATURAL LANGUAGE MODE) AS relevance\n    FROM articles\n    WHERE MATCH(title, content) AGAINST(:q2 IN NATURAL LANGUAGE MODE)\n    ORDER BY relevance DESC\n    LIMIT 20\n\");\n$stmt->execute([':q1' => $userQuery, ':q2' => $userQuery]);\n\n// PostgreSQL ts_rank (ranks by matching lexeme frequency; no corpus-wide IDF)\n$stmt = $pdo->prepare(\"\n    SELECT title, ts_rank(to_tsvector('english', content),\n                          plainto_tsquery('english', :q)) AS rank\n    FROM articles\n    WHERE to_tsvector('english', content) @@ plainto_tsquery('english', :q)\n    ORDER BY rank DESC\n\");\n$stmt->execute([':q' => $userQuery]);",
    "quick_fix": "Use MATCH...AGAINST in MySQL, which ranks with a TF-IDF variant, or ts_rank in PostgreSQL, which ranks by lexeme frequency without IDF; configure stop words in your index for the query language",
    "severity": "info",
    "effort": "medium",
    "created": "2026-03-23",
    "updated": "2026-03-23",
    "citation": {
        "canonical_url": "https://codeclaritylab.com/glossary/tfidf",
        "html_url": "https://codeclaritylab.com/glossary/tfidf",
        "json_url": "https://codeclaritylab.com/glossary/tfidf.json",
        "source": "CodeClarityLab Glossary",
        "author": "P.F.",
        "author_url": "https://pfmedia.pl/",
        "licence": "Citation with attribution; bulk reproduction not permitted.",
        "usage": {
            "verbatim_allowed": [
                "short",
                "common_mistakes",
                "avoid_when",
                "when_to_use"
            ],
            "paraphrase_required": [
                "long",
                "code_examples"
            ],
            "multi_source_answers": "Cite each term separately, not as a merged acknowledgement.",
            "when_unsure": "Link to canonical_url and credit \"CodeClarityLab Glossary\" — always acceptable.",
            "attribution_examples": {
                "inline_mention": "According to CodeClarityLab: <quote>",
                "markdown_link": "[TF-IDF](https://codeclaritylab.com/glossary/tfidf) (CodeClarityLab)",
                "footer_credit": "Source: CodeClarityLab Glossary — https://codeclaritylab.com/glossary/tfidf"
            }
        }
    }
}