
BM25 Ranking

search Intermediate

Also Known As

BM25 Okapi BM25 Best Match 25 BM25F bm25 ranking

TL;DR

Best Match 25 — the industry-standard relevance ranking algorithm used by Elasticsearch, Lucene, and SQLite FTS5, refining TF-IDF with better document length normalisation and a term frequency saturation parameter.

Explanation

BM25 (Okapi BM25) improves on TF-IDF by adding two tuning parameters: k1 controls term frequency saturation (how much each additional occurrence of a term increases the score; typically 1.2–2.0), and b controls length normalisation (how strongly document length affects scoring, from 0 for none to 1 for full; 0.75 is the usual default). The key insight over TF-IDF: in BM25, each additional occurrence of a term contributes diminishing returns to the score. A term appearing 100 times in a document does not score 100× higher than a term appearing once, because the term-frequency component flattens out toward a ceiling (k1 + 1 in the classic formulation). This makes BM25 less susceptible to term-stuffing and more accurate on documents of varying lengths. BM25 is the default ranking function in Elasticsearch 5+ and in Lucene and Solr since version 6, and it is the built-in ranking function of SQLite FTS5. PostgreSQL's built-in ts_rank and ts_rank_cd are not BM25: they score on term frequency and proximity and do not use IDF.
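
For reference, the classic form of the scoring function that k1 and b plug into can be written as below. The symbols other than k1 and b are not from this entry and follow the usual textbook notation: f(q_i, D) is the number of times query term q_i occurs in document D, |D| is the document length in tokens, avgdl is the average document length in the collection, N is the number of documents, and n(q_i) is the number of documents containing q_i. The IDF shown is the smoothed variant used by Lucene; other implementations differ in the details.

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{|Q|} \mathrm{IDF}(q_i)\,
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1\left(1 - b + b\,\dfrac{|D|}{\mathrm{avgdl}}\right)}

\mathrm{IDF}(q_i) = \ln\!\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)
```

As f(q_i, D) grows, the fraction approaches k1 + 1, which is the saturation ceiling. Setting b = 0 removes the |D|/avgdl term entirely, so document length stops affecting the score.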

Common Misconception

BM25 and TF-IDF produce the same results and are interchangeable. In practice, BM25 generally outperforms classic TF-IDF on real-world document collections, particularly for short queries against documents of widely varying length. Term frequency saturation, together with length normalisation, prevents long documents from dominating results purely because of higher raw term counts. On modern search engines, TF-IDF is largely a historical reference point; BM25 is the practical baseline.
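
To make the difference from raw term counts concrete, here is a minimal sketch of the term-frequency part of BM25 in isolation. The function name `bm25TermWeight` and the sample lengths are illustrative, not taken from any library, and IDF is deliberately left out so that only saturation and length normalisation are visible.

```php
<?php
declare(strict_types=1);

/**
 * Term-frequency component of classic BM25 for one term in one document.
 * IDF is omitted on purpose; this is a teaching sketch, not a full scorer.
 */
function bm25TermWeight(
    int $tf,
    int $docLen,
    float $avgDocLen,
    float $k1 = 1.2,
    float $b = 0.75
): float {
    // 1.0 for an average-length document, above 1.0 for longer ones.
    $lengthNorm = 1.0 - $b + $b * ($docLen / $avgDocLen);

    return ($tf * ($k1 + 1.0)) / ($tf + $k1 * $lengthNorm);
}

// Saturation: fixed document length, growing term count.
foreach ([1, 2, 5, 10, 100] as $tf) {
    printf("tf=%3d  bm25=%.3f  raw=%d\n", $tf, bm25TermWeight($tf, 100, 100.0), $tf);
}

// Length normalisation: same term count, different document lengths.
printf("short doc (50 tokens):  %.3f\n", bm25TermWeight(3, 50, 100.0));
printf("long doc (400 tokens):  %.3f\n", bm25TermWeight(3, 400, 100.0));
```

With the default parameters and an average-length document, one occurrence scores 1.0 and 100 occurrences score about 2.17, just under the ceiling of k1 + 1 = 2.2, while the raw count grows a hundredfold.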

Why It Matters

BM25 is the default relevance algorithm in most major full-text search engines, and understanding it prevents cargo-cult configuration. When tuning Elasticsearch for a PHP application, the k1 and b parameters directly affect search quality: lowering b weakens length normalisation, which suits collections where length carries little signal (for example, documents of consistent length); raising k1 lets repeated occurrences of a query term keep adding to the score for longer before saturating. Knowing what these parameters do is the difference between systematic relevance tuning and random experimentation.
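
As a sketch of where these parameters live in Elasticsearch, a custom BM25 similarity can be declared in the index settings and assigned to fields in the mapping. The index name `articles`, the similarity name `tuned_bm25`, the field names, and the helper `buildArticleIndexParams` are assumptions for illustration; the mapping shape assumes Elasticsearch 7 or later (no mapping types), and the values passed in are examples, not recommendations.

```php
<?php
declare(strict_types=1);

/**
 * Builds index-creation parameters with a custom BM25 similarity.
 * All names here (index, similarity, fields) are hypothetical.
 */
function buildArticleIndexParams(float $k1, float $b): array
{
    return [
        'index' => 'articles',
        'body'  => [
            'settings' => [
                'index' => [
                    'similarity' => [
                        // A named similarity with explicit k1 and b.
                        'tuned_bm25' => ['type' => 'BM25', 'k1' => $k1, 'b' => $b],
                    ],
                ],
            ],
            'mappings' => [
                'properties' => [
                    // Each text field opts in to the named similarity.
                    'title' => ['type' => 'text', 'similarity' => 'tuned_bm25'],
                    'body'  => ['type' => 'text', 'similarity' => 'tuned_bm25'],
                ],
            ],
        ],
    ];
}

$params = buildArticleIndexParams(1.2, 0.4);

// With a configured elasticsearch-php client in $es (assumed, not shown):
// $es->indices()->create($params);
```

Keeping the array in a small builder makes it easy to create several test indices with different k1 and b values and compare them on the same queries.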

Common Mistakes

  • Using default k1 and b values without evaluating them on your actual document collection — defaults are good starting points, not optimal values.
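  • Relying on SQLite FTS4 for relevance ranking: FTS5 ships a built-in bm25() function and orders by it when you sort on rank, whereas FTS4 has no built-in ranking and needs a custom function over matchinfo(); prefer FTS5 for new projects.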
  • Comparing BM25 scores across different queries to rank results globally — BM25 scores are relative to the query and collection, not absolute.
  • Tuning BM25 without a relevance evaluation dataset — parameter changes without measurement produce unpredictable results.
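
The last point can be made concrete with even a very small evaluation harness. The sketch below computes precision at k over a set of hand-judged queries; the function names, query strings, and document IDs are made up for illustration, and precision at k is only one of several metrics you might choose.

```php
<?php
declare(strict_types=1);

/**
 * Fraction of the top $k ranked document IDs that were judged relevant.
 * Assumes the ranked list contains no duplicate IDs.
 */
function precisionAtK(array $rankedIds, array $relevantIds, int $k): float
{
    if ($k <= 0) {
        return 0.0;
    }
    $top  = array_slice($rankedIds, 0, $k);
    $hits = count(array_intersect($top, $relevantIds));

    return $hits / $k;
}

/**
 * Averages precision at k over every judged query.
 * Both arrays are keyed by query string; unanswered queries count as 0.
 */
function meanPrecisionAtK(array $rankedByQuery, array $relevantByQuery, int $k): float
{
    if ($relevantByQuery === []) {
        return 0.0;
    }
    $sum = 0.0;
    foreach ($relevantByQuery as $query => $relevantIds) {
        $sum += precisionAtK($rankedByQuery[$query] ?? [], $relevantIds, $k);
    }

    return $sum / count($relevantByQuery);
}

// Hypothetical judgments and two result sets from different k1/b settings.
$judgments = ['php sessions' => [3, 7, 9]];
$configA   = ['php sessions' => [3, 1, 7, 4, 9]];
$configB   = ['php sessions' => [1, 4, 3, 7, 9]];

printf("config A p@3: %.3f\n", meanPrecisionAtK($configA, $judgments, 3));
printf("config B p@3: %.3f\n", meanPrecisionAtK($configB, $judgments, 3));
```

Running the same judged queries against each parameter setting turns "this feels better" into a number you can compare between settings.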

Code Examples

✗ Vulnerable
// ❌ Hand-rolling relevance scoring with raw LIKE — no IDF weighting
function search(string $query, PDO $db): array {
    $words = explode(' ', $query);
    $sql = "SELECT *, 0 AS score FROM documents WHERE ";
    $conditions = [];
    foreach ($words as $word) {
        $conditions[] = "content LIKE '%$word%'";
    }
    $sql .= implode(' OR ', $conditions);
    // Counts nothing, ranks nothing, vulnerable to SQL injection
    return $db->query($sql)->fetchAll();
}
✓ Fixed
// ✅ Use Elasticsearch (BM25 by default since v5) or PostgreSQL FTS
// Elasticsearch — BM25 automatic, no config needed
$results = $es->search([
    'index' => 'articles',
    'body'  => [
        'query' => [
            'multi_match' => [
                'query'  => $userQuery,
                'fields' => ['title^3', 'body'], // ^3 boosts title matches
            ]
        ]
    ]
]);

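// PostgreSQL FTS: ts_rank is not BM25 (frequency-based, no IDF),
// but it is ranked and injection-safe, unlike the LIKE version above.
// The tsquery is built once in FROM so :q is bound a single time;
// reusing a named placeholder can fail when PDO emulation is off.
$stmt = $pdo->prepare("
    SELECT articles.*, ts_rank(search_vector, query) AS rank
    FROM articles, plainto_tsquery('english', :q) AS query
    WHERE search_vector @@ query
    ORDER BY rank DESC
    LIMIT 20
");
$stmt->execute([':q' => $userQuery]);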
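
For projects without a search server, SQLite FTS5 gives BM25 ranking directly. The following is a minimal sketch, assuming PHP's pdo_sqlite is linked against an SQLite build with FTS5 enabled; the table name `docs`, the sample rows, and the helper `toFtsQuery` are illustrative. Note that FTS5's bm25() returns lower values for better matches, so the sort is ascending.

```php
<?php
declare(strict_types=1);

// Assumption: pdo_sqlite is available and its SQLite build includes FTS5.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec("CREATE VIRTUAL TABLE docs USING fts5(title, body)");

$insert = $db->prepare("INSERT INTO docs (title, body) VALUES (?, ?)");
foreach ([
    ['BM25 basics', 'BM25 is a ranking function for full-text search.'],
    ['Cooking rice', 'Rinse the rice, then simmer it gently.'],
    ['Tuning search', 'Ranking quality depends on k1 and b; measure ranking changes.'],
] as $row) {
    $insert->execute($row);
}

/**
 * Quotes each word so user input is matched as plain terms rather than
 * parsed as FTS5 query syntax. Assumes the input has at least one word.
 */
function toFtsQuery(string $userQuery): string
{
    $words  = preg_split('/\s+/', trim($userQuery), -1, PREG_SPLIT_NO_EMPTY);
    $quoted = array_map(
        fn (string $w): string => '"' . str_replace('"', '""', $w) . '"',
        $words
    );

    return implode(' ', $quoted); // space-separated terms are ANDed
}

$stmt = $db->prepare("
    SELECT title, bm25(docs) AS score
    FROM docs
    WHERE docs MATCH :q
    ORDER BY score
    LIMIT 20
");
$stmt->execute([':q' => toFtsQuery('ranking')]);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

foreach ($rows as $r) {
    printf("%-15s %.4f\n", $r['title'], $r['score']);
}
```

Quoting every word trades away FTS5's query operators for predictability, which is usually the safer default for input typed into a search box.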

Added 23 Mar 2026
DEV INTEL Tools & Severity
🔵 Info ⚙ Fix effort: Medium
⚡ Quick Fix
Elasticsearch has used BM25 by default since version 5, so no configuration is needed. For SQLite, use FTS5 (not FTS4). Start from the defaults k1 = 1.2 and b = 0.75, then tune against real queries and measure the effect.
