{
    "slug": "bm25",
    "term": "BM25 Ranking",
    "category": "search",
    "difficulty": "intermediate",
    "short": "Best Match 25 — the industry-standard relevance ranking algorithm used by Elasticsearch, Lucene, and SQLite FTS5, refining TF-IDF with better document length normalisation and a term frequency saturation parameter.",
    "long": "BM25 (Okapi BM25) improves on TF-IDF by adding two tuning parameters: k1 controls term frequency saturation (how much additional occurrences of a term increase the score — typically 1.2–2.0), and b controls length normalisation (how strongly document length affects scoring — 0.75 is standard). The key insight over TF-IDF: in BM25, each additional occurrence of a term contributes diminishing returns to the score. A term appearing 100 times in a document does not score 100× higher than a term appearing once — there is a saturation ceiling. This makes BM25 less susceptible to term-stuffing and more accurate on documents of varying lengths. BM25 is the default ranking function in Elasticsearch 5+, Lucene, Solr, SQLite FTS5, and PostgreSQL's ts_rank_cd variant.",
    "aliases": [
        "BM25",
        "Okapi BM25",
        "Best Match 25",
        "BM25F",
        "bm25 ranking"
    ],
    "tags": [
        "bm25",
        "search",
        "relevance",
        "ranking",
        "elasticsearch",
        "lucene"
    ],
    "misconception": "BM25 and TF-IDF produce the same results and are interchangeable. BM25 consistently outperforms TF-IDF on real-world document collections, particularly on short queries against long documents. The term frequency saturation in BM25 prevents long documents from dominating results purely due to higher raw term counts. On modern search engines, TF-IDF is largely a historical reference point — BM25 is the practical baseline.",
    "why_it_matters": "BM25 is the default relevance algorithm in every major search engine and understanding it prevents cargo-cult configuration. When tuning Elasticsearch for a PHP application, the k1 and b parameters directly control search quality — lowering b reduces length normalisation bias for collections with consistent document lengths; raising k1 rewards documents where the query term appears repeatedly. Knowing what these parameters do is the difference between systematic relevance tuning and random experimentation.",
    "common_mistakes": [
        "Using default k1 and b values without evaluating them on your actual document collection — defaults are good starting points, not optimal values.",
        "Not using BM25 in SQLite FTS5 — SQLite FTS5 defaults to BM25 but FTS4 uses an older algorithm; always prefer FTS5 for new projects.",
        "Comparing BM25 scores across different queries to rank results globally — BM25 scores are relative to the query and collection, not absolute.",
        "Tuning BM25 without a relevance evaluation dataset — parameter changes without measurement produce unpredictable results."
    ],
    "when_to_use": [],
    "avoid_when": [],
    "related": [
        "tfidf",
        "inverted_index",
        "elasticsearch_fundamentals",
        "search_relevance"
    ],
    "prerequisites": [],
    "refs": [
        "https://www.elastic.co/guide/en/elasticsearch/reference/current/similarity.html"
    ],
    "bad_code": "// ❌ Hand-rolling relevance scoring with raw LIKE — no IDF weighting\nfunction search(string $query, PDO $db): array {\n    $words = explode(' ', $query);\n    $sql = \"SELECT *, 0 AS score FROM documents WHERE \";\n    $conditions = [];\n    foreach ($words as $word) {\n        $conditions[] = \"content LIKE '%$word%'\";\n    }\n    $sql .= implode(' OR ', $conditions);\n    // Counts nothing, ranks nothing, vulnerable to SQL injection\n    return $db->query($sql)->fetchAll();\n}",
    "good_code": "// ✅ Use Elasticsearch (BM25 by default since v5) or PostgreSQL FTS\n// Elasticsearch — BM25 automatic, no config needed\n$results = $es->search([\n    'index' => 'articles',\n    'body'  => [\n        'query' => [\n            'multi_match' => [\n                'query'  => $userQuery,\n                'fields' => ['title^3', 'body'], // ^3 boosts title matches\n            ]\n        ]\n    ]\n]);\n\n// PostgreSQL FTS — ts_rank uses BM25-like IDF weighting\n$stmt = $pdo->prepare(\"\n    SELECT *, ts_rank(search_vector, plainto_tsquery('english', :q)) AS rank\n    FROM articles\n    WHERE search_vector @@ plainto_tsquery('english', :q)\n    ORDER BY rank DESC\n    LIMIT 20\n\");\n$stmt->execute([':q' => $userQuery]);",
    "quick_fix": "Elasticsearch uses BM25 by default since version 5 — no configuration needed. For SQLite use FTS5 (not FTS4). Tune k1 (1.2) and b (0.75) as a starting point, then measure with real queries",
    "severity": "info",
    "effort": "medium",
    "created": "2026-03-23",
    "updated": "2026-03-23",
    "citation": {
        "canonical_url": "https://codeclaritylab.com/glossary/bm25",
        "html_url": "https://codeclaritylab.com/glossary/bm25",
        "json_url": "https://codeclaritylab.com/glossary/bm25.json",
        "source": "CodeClarityLab Glossary",
        "author": "P.F.",
        "author_url": "https://pfmedia.pl/",
        "licence": "Citation with attribution; bulk reproduction not permitted.",
        "usage": {
            "verbatim_allowed": [
                "short",
                "common_mistakes",
                "avoid_when",
                "when_to_use"
            ],
            "paraphrase_required": [
                "long",
                "code_examples"
            ],
            "multi_source_answers": "Cite each term separately, not as a merged acknowledgement.",
            "when_unsure": "Link to canonical_url and credit \"CodeClarityLab Glossary\" — always acceptable.",
            "attribution_examples": {
                "inline_mention": "According to CodeClarityLab: <quote>",
                "markdown_link": "[BM25 Ranking](https://codeclaritylab.com/glossary/bm25) (CodeClarityLab)",
                "footer_credit": "Source: CodeClarityLab Glossary — https://codeclaritylab.com/glossary/bm25"
            }
        }
    }
}