BM25 Ranking
Also Known As
TL;DR
Explanation
BM25 (Okapi BM25) improves on TF-IDF by adding two tuning parameters. k1 controls term frequency saturation, that is, how much each additional occurrence of a term increases the score (typically 1.2–2.0). b controls length normalisation, that is, how strongly a document's length relative to the collection average affects its score (0.75 is the usual default; 0 disables normalisation and 1 applies it fully). The key insight over TF-IDF is that each additional occurrence of a term yields diminishing returns. A term appearing 100 times in a document does not score 100× higher than a term appearing once: its contribution approaches a saturation ceiling. This makes BM25 less susceptible to term-stuffing and more robust across documents of varying lengths. BM25 is the default similarity in Elasticsearch 5+ and in Lucene and Solr since 6.0, and SQLite FTS5 provides it through its built-in bm25() function. PostgreSQL's built-in ts_rank and ts_rank_cd are not BM25 variants: ts_rank scores on term frequency and ts_rank_cd on cover density, and neither uses IDF.
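The saturation behaviour described above can be illustrated with a minimal sketch of a single term's contribution to a document's score. This assumes the classic Okapi form with a smoothed, always-positive IDF; the function name bm25TermScore and the collection statistics are hypothetical, and real engines differ in details such as the exact IDF formula.

```php
<?php
declare(strict_types=1);

/**
 * Contribution of one query term to one document's BM25 score.
 * Hypothetical helper for illustration, not any engine's actual code.
 */
function bm25TermScore(
    int $tf,            // occurrences of the term in the document
    int $docLen,        // document length in tokens
    float $avgDocLen,   // average document length in the collection
    int $docCount,      // number of documents in the collection
    int $docFreq,       // number of documents containing the term
    float $k1 = 1.2,
    float $b = 0.75
): float {
    // Smoothed IDF (assumed variant): rarer terms weigh more, never negative.
    $idf = log(1 + ($docCount - $docFreq + 0.5) / ($docFreq + 0.5));

    // Length normalisation: long documents need more occurrences to score the same.
    $norm = $k1 * (1 - $b + $b * $docLen / $avgDocLen);

    // Saturating term frequency: approaches ($k1 + 1) as $tf grows.
    return $idf * ($tf * ($k1 + 1)) / ($tf + $norm);
}

// Same average-length document, 1 occurrence versus 100 occurrences.
$once    = bm25TermScore(1, 100, 100.0, 1000, 50);
$hundred = bm25TermScore(100, 100, 100.0, 1000, 50);

printf("ratio: %.2f\n", $hundred / $once); // prints "ratio: 2.17", not 100
```

With these assumed numbers, one hundred occurrences score only about 2.17 times a single occurrence, and no number of repetitions can push the ratio past k1 + 1 = 2.2.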
Common Misconception
Why It Matters
Common Mistakes
- Using default k1 and b values without evaluating them on your actual document collection — defaults are good starting points, not optimal values.
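- Using SQLite FTS4 when you need relevance ranking: FTS5 ships a built-in bm25() function and uses it for its default rank ordering, whereas FTS4 has no built-in ranking function and leaves you to build one from matchinfo(); prefer FTS5 for new projects.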
- Comparing BM25 scores across different queries to rank results globally — BM25 scores are relative to the query and collection, not absolute.
- Tuning BM25 without a relevance evaluation dataset — parameter changes without measurement produce unpredictable results.
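To make the last point concrete, here is a minimal sketch of measuring a parameter change against relevance judgments using precision@k. The function names, document IDs, judgments, and both result lists are hypothetical; in practice the runs would come from your search engine under each k1/b setting, and a metric such as nDCG may suit graded judgments better.

```php
<?php
declare(strict_types=1);

/**
 * Precision@k: fraction of the top-k results that are judged relevant.
 * Assumes $rankedIds contains no duplicates.
 */
function precisionAtK(array $rankedIds, array $relevantIds, int $k): float
{
    if ($k <= 0) {
        return 0.0;
    }
    $top  = array_slice($rankedIds, 0, $k);
    $hits = count(array_intersect($top, $relevantIds));
    return $hits / $k;
}

/** Mean precision@k over every judged query. */
function meanPrecisionAtK(array $run, array $judgments, int $k): float
{
    if ($judgments === []) {
        return 0.0;
    }
    $sum = 0.0;
    foreach ($judgments as $query => $relevantIds) {
        $sum += precisionAtK($run[$query] ?? [], $relevantIds, $k);
    }
    return $sum / count($judgments);
}

// Hypothetical judgments: query => IDs a human marked as relevant.
$judgments = [
    'bm25 tuning'    => ['d1', 'd3'],
    'inverted index' => ['d2', 'd5'],
];

// Hypothetical ranked results under two parameter settings.
$runDefault = ['bm25 tuning' => ['d1', 'd7', 'd3'], 'inverted index' => ['d4', 'd2', 'd9']];
$runTuned   = ['bm25 tuning' => ['d1', 'd3', 'd7'], 'inverted index' => ['d2', 'd5', 'd9']];

$scoreDefault = meanPrecisionAtK($runDefault, $judgments, 3);
$scoreTuned   = meanPrecisionAtK($runTuned, $judgments, 3);

printf("default: %.3f, tuned: %.3f\n", $scoreDefault, $scoreTuned);
```

Only keep a parameter change when a measurement like this improves over a reasonably large query set; two queries, as here, are far too few to draw conclusions from.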
Code Examples
// ❌ Hand-rolling relevance scoring with raw LIKE — no IDF weighting
function search(string $query, PDO $db): array {
    $words = explode(' ', $query);
    $sql = "SELECT *, 0 AS score FROM documents WHERE ";
    $conditions = [];
    foreach ($words as $word) {
        $conditions[] = "content LIKE '%$word%'";
    }
    $sql .= implode(' OR ', $conditions);
    // Counts nothing, ranks nothing, vulnerable to SQL injection
    return $db->query($sql)->fetchAll();
}
// ✅ Use an engine with built-in ranking: Elasticsearch (BM25 by default since v5) or PostgreSQL full-text search
// Elasticsearch: BM25 is applied automatically, no config needed
// $es is assumed to be an already-configured Elasticsearch PHP client
$results = $es->search([
    'index' => 'articles',
    'body' => [
        'query' => [
            'multi_match' => [
                'query' => $userQuery,
                'fields' => ['title^3', 'body'], // ^3 boosts title matches
            ],
        ],
    ],
]);
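// PostgreSQL FTS: ts_rank is not BM25. It scores on term frequency (with optional
// length normalisation) and uses no IDF, but it still ranks far better than LIKE.
// The tsquery is built once in FROM so that :q is bound a single time.
$stmt = $pdo->prepare("
    SELECT articles.*, ts_rank(search_vector, query) AS rank
    FROM articles, plainto_tsquery('english', :q) AS query
    WHERE search_vector @@ query
    ORDER BY rank DESC
    LIMIT 20
");
$stmt->execute([':q' => $userQuery]);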
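
For SQLite, which the entry names as shipping BM25 in FTS5, a comparable query might look like the sketch below. It assumes PHP's pdo_sqlite is linked against an SQLite build with FTS5 enabled; the table name docs, its columns, and the sample rows are hypothetical. Note that bm25() returns lower (more negative) values for better matches, so the sort is ascending.

```php
<?php
declare(strict_types=1);

$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical FTS5 table with two indexed columns.
$db->exec("CREATE VIRTUAL TABLE docs USING fts5(title, body)");

$insert = $db->prepare("INSERT INTO docs (title, body) VALUES (?, ?)");
$samples = [
    ['BM25 ranking', 'Okapi BM25 scores documents by term frequency and length.'],
    ['Indexing basics', 'An inverted index maps terms to documents; ranking comes later.'],
    ['Tokenizers', 'Tokenizers split text into terms.'],
    ['Stemming', 'Stemmers reduce words to a common root.'],
    ['Stop words', 'Very common words are often dropped before indexing.'],
];
foreach ($samples as $row) {
    $insert->execute($row);
}

// bm25(docs, 3.0, 1.0) weights title matches three times as heavily as body matches.
// The bound value is parsed as an FTS5 query, so raw user input may need escaping.
$stmt = $db->prepare("
    SELECT title, bm25(docs, 3.0, 1.0) AS score
    FROM docs
    WHERE docs MATCH :q
    ORDER BY score
    LIMIT 20
");
$stmt->execute([':q' => 'ranking']);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

foreach ($rows as $row) {
    printf("%-16s %.4f\n", $row['title'], $row['score']);
}
```

FTS5 fixes k1 and b at 1.2 and 0.75, so the column weights passed to bm25() are the only tuning available here; engines such as Elasticsearch expose both parameters.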