Chapter 20: Bayesian BM25 and Probabilistic Calibration
20.1 Introduction
Chapter 19 established BM25 as the standard ranking function for full-text search. BM25 produces scores that are effective for ranking documents by relevance within a single query. However, when we attempt to use BM25 scores beyond simple ranking — combining them with vector similarity, setting relevance thresholds, or comparing across queries — fundamental limitations emerge.
This chapter introduces Bayesian BM25, a probabilistic framework that transforms unbounded BM25 scores into calibrated probabilities in $[0, 1]$. The transformation preserves ranking quality while enabling principled multi-signal fusion, which forms the foundation for hybrid search (Chapter 24).
20.1.1 The Score Interpretation Problem
BM25 scores suffer from four interpretability limitations:
- Unbounded Range: $s_{\text{BM25}} \in [0, \infty)$, making absolute interpretation impossible. A score of 12.7 carries no intrinsic meaning.
- Query Dependence: Score magnitudes vary with query length and term specificity. A two-term query over common words produces much lower scores than a five-term query over rare terms.
- Corpus Dependence: IDF values depend on collection statistics, preventing cross-corpus comparison.
- Signal Incompatibility: Vector similarity scores live in $[-1, 1]$ (cosine) or $[0, 1]$ (normalized). Direct combination with unbounded BM25 scores leads to one signal dominating the other.
20.1.2 The Signal Dominance Problem
For additive combination of signals with different scales:

$$s_{\text{combined}} = s_{\text{BM25}}(d, q) + s_{\text{vec}}(d, q)$$

the signal with the larger expected magnitude dominates the ranking. If BM25 scores average 8.5 while vector similarities average 0.7, the vector signal contributes less than 10% of the combined score — effectively invisible.
Weighted combination $s = \lambda\, s_{\text{BM25}} + (1 - \lambda)\, s_{\text{vec}}$ introduces a tuning parameter $\lambda$ with no principled basis for selection. Different queries, corpora, and embedding models require different weights.
20.1.3 Why Reciprocal Rank Fusion Falls Short
Reciprocal Rank Fusion (RRF) is a popular approach to combining ranked lists:

$$\text{RRF}(d) = \sum_{i=1}^{n} \frac{1}{k + r_i(d)}$$

where $k = 60$ (by convention) and $r_i(d)$ is the rank of document $d$ in list $i$.
While simple and effective, RRF has theoretical limitations:
- Information loss: Score magnitudes are discarded. A document with BM25 score 15.3 and one with 15.2 receive different ranks but may have nearly identical relevance.
- Non-commutativity with filtering: Filtering before fusion produces different results than filtering after fusion, because rank positions change when documents are removed.
- Arbitrary constant: The parameter $k = 60$ lacks theoretical justification.
- Missing document handling: Documents appearing in only some ranked lists have undefined behavior.
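To make the rank-based mechanics concrete, here is a minimal RRF sketch (the `rrf_fuse` helper and its signature are illustrative, not from the chapter's codebase):

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Fuse ranked lists with RRF(d) = sum_i 1 / (k + r_i(d)), k = 60 by
// convention. Ranks are 1-based positions within each list.
std::unordered_map<std::string, double> rrf_fuse(
    const std::vector<std::vector<std::string>>& ranked_lists, int k = 60) {
    std::unordered_map<std::string, double> fused;
    for (const auto& list : ranked_lists) {
        for (size_t rank = 0; rank < list.size(); ++rank) {
            fused[list[rank]] += 1.0 / (k + static_cast<double>(rank + 1));
        }
    }
    return fused;
}
```

Note that only rank positions enter the sum; the underlying score magnitudes are discarded, which is exactly the information loss described above.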
Bayesian BM25 takes a different approach: rather than operating on ranks, it transforms raw scores into a common probabilistic space where combination follows from Bayes' theorem.
20.2 Sigmoid Likelihood Model
20.2.1 Probabilistic Framework
The goal is to compute $P(R \mid s)$ — the probability that a document is relevant given its BM25 score $s$. Using Bayes' theorem:

$$P(R \mid s) = \frac{P(s \mid R)\, P(R)}{P(s \mid R)\, P(R) + P(s \mid \neg R)\, P(\neg R)}$$

We need to model $P(s \mid R)$ — how likely a given score is for a relevant document. The sigmoid function provides a natural model:

$$P(s \mid R) = \sigma\big(\alpha(s - \beta)\big) = \frac{1}{1 + e^{-\alpha(s - \beta)}}$$
where:
- $\alpha > 0$ controls the sigmoid steepness (score sensitivity)
- $\beta$ is the midpoint (decision boundary)
20.2.2 Symmetric Likelihood Assumption
We assume that the likelihood of observing score $s$ for a non-relevant document is the complement:

$$P(s \mid \neg R) = 1 - P(s \mid R)$$

This symmetry is a consequence of the sigmoid's property $\sigma(-x) = 1 - \sigma(x)$. Under this assumption, the posterior simplifies to:

$$P(R \mid s) = \frac{L(s)\,\pi}{L(s)\,\pi + \big(1 - L(s)\big)(1 - \pi)}$$

where $L(s) = \sigma(\alpha(s - \beta))$ is the sigmoid likelihood and $\pi = P(R)$ is the prior probability of relevance.
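As a sketch, the symmetric-likelihood posterior can be computed directly from the definitions above (function names are illustrative, not from the chapter's codebase):

```cpp
#include <cassert>
#include <cmath>

// Sigmoid likelihood L(s) = sigma(alpha * (s - beta)).
float sigmoid_likelihood(float s, float alpha, float beta) {
    return 1.0f / (1.0f + std::exp(-alpha * (s - beta)));
}

// Posterior P(R|s) under the symmetric likelihood assumption
// P(s|~R) = 1 - P(s|R), with prior pi = P(R).
float posterior(float s, float alpha, float beta, float prior) {
    float L = sigmoid_likelihood(s, alpha, beta);
    return (L * prior) / (L * prior + (1.0f - L) * (1.0f - prior));
}
```

With a neutral prior of 0.5, the posterior reduces to the likelihood itself, which is the simplification used in Section 20.3.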
20.2.3 Monotonicity Preservation
A critical property: the Bayesian transformation preserves the ranking order of BM25 scores.
Theorem (Monotonicity). For fixed prior $\pi \in (0, 1)$, $\alpha > 0$, and scores $s_1 < s_2$:

$$P(R \mid s_1) < P(R \mid s_2)$$

Proof. Let $L_i = \sigma(\alpha(s_i - \beta))$ with $s_1 < s_2$. Since $\alpha > 0$, $L_1 < L_2$, and:

$$\frac{d}{dL}\!\left[\frac{L\pi}{L\pi + (1 - L)(1 - \pi)}\right] = \frac{\pi(1 - \pi)}{\big(L\pi + (1 - L)(1 - \pi)\big)^2} > 0,$$

the posterior is strictly increasing in $L$, so $P(R \mid s_1) < P(R \mid s_2)$.
This means any document ranking produced by BM25 is preserved after Bayesian transformation — no ranking degradation occurs.
20.3 Neutral Prior and Log-Compression
20.3.1 Why a Neutral Prior
An earlier design used a per-document composite prior that combined term frequency and document length into a document-dependent prior $\pi_d$. In practice, this prior double-counted evidence that BM25 already models:
- BM25's term-frequency component already saturates term frequency through the $k_1$ parameter.
- BM25's length normalization (the $b$ parameter) already penalizes unusually long or short documents.
Applying a prior that re-encodes the same features inflates the posterior toward 1.0 for high-frequency terms in medium-length documents, destroying score discrimination. Removing the per-document prior and fixing $\pi = 0.5$ (neutral) eliminates the double-counting. With a neutral prior, the posterior simplifies to the sigmoid likelihood:

$$P(R \mid s) = L(s) = \sigma\big(\alpha(s - \beta)\big)$$

because the Bayesian update with $\pi = 0.5$ yields $\frac{0.5\,L(s)}{0.5\,L(s) + 0.5\,(1 - L(s))} = L(s)$.
20.3.2 Log-Compression of BM25 Scores
Raw BM25 scores grow linearly with IDF. For rare terms, the IDF term can be large (e.g., 8--12), producing raw BM25 scores that push $\sigma(\alpha(s - \beta))$ into the flat region near 1.0. This sigmoid saturation destroys discrimination among high-scoring documents.
Log-compression maps the unbounded BM25 range to a moderate logit range:

$$s' = \ln(1 + s)$$

The posterior becomes:

$$P(R \mid s) = \sigma\big(\alpha(\ln(1 + s) - \beta)\big)$$
Properties of log-compression:
- Monotonicity: $\ln(1 + s)$ is strictly increasing, so document ranking is preserved.
- IDF ordering: $s_1 < s_2 \implies \ln(1 + s_1) < \ln(1 + s_2)$, so the IDF discriminative power is maintained.
- Saturation prevention: Even for BM25 scores of 10--15, $\ln(1 + s) \approx 2.4$--$2.8$, which keeps the sigmoid argument in the discriminative range.
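A small numeric sketch (illustrative values, assuming $\alpha = 1$, $\beta = 1$) shows why the compression matters: two high BM25 scores that saturate the raw sigmoid remain clearly separated after log-compression:

```cpp
#include <cassert>
#include <cmath>

float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Calibrated probability with log-compression applied before the sigmoid.
float calibrated(float bm25, float alpha, float beta) {
    return sigmoid(alpha * (std::log(1.0f + bm25) - beta));
}
```

Feeding raw scores of 10 and 15 into the sigmoid directly yields nearly identical outputs, while the compressed versions stay well apart.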
20.4 Corpus-Level Base Rate Prior
20.4.1 The Overconfidence Problem
The auto-estimation heuristic sets $\beta = \operatorname{median}(s)$, placing the sigmoid midpoint at the corpus median. Because most documents are not relevant to a typical query, this assigns approximately 50% relevance probability to scores that are almost never relevant — producing systematic overconfidence.
The neutral prior (Section 20.3.1) does not correct this because it is fixed at 0.5 and does not encode the corpus-level prevalence of relevance. A separate corpus-level base rate is needed.
20.4.2 Base Rate Definition
The base rate prior captures the global probability that a randomly selected document is relevant to a randomly selected query:

$$b_r = P(R) = \Pr_{d,\, q}\big[\,d \text{ is relevant to } q\,\big]$$

For most corpora, $b_r$ is very small — perhaps 0.01 to 0.05.
20.4.3 Two-Term Posterior Decomposition
With the neutral document prior ($\pi = 0.5$, see Section 20.3.1) and log-compressed BM25 scores (Section 20.3.2), the full posterior decomposes as two additive terms in log-odds space.

Theorem (Two-Term Decomposition). Let $s' = \ln(1 + s)$ be the log-compressed score, $L(s') = \sigma(\alpha(s' - \beta))$ be the sigmoid likelihood, and $b_r$ be the base rate prior. Then:

$$\operatorname{logit}\!\big(P(R \mid s)\big) = \alpha(s' - \beta) + \operatorname{logit}(b_r)$$

or equivalently:

$$P(R \mid s) = \sigma\!\big(\alpha(s' - \beta) + \operatorname{logit}(b_r)\big)$$

where $\operatorname{logit}(p) = \ln\frac{p}{1 - p}$.

Proof. With a neutral prior $\pi = 0.5$, $\operatorname{logit}(\pi) = 0$, so the document prior term vanishes. The single Bayes update combines likelihood $L(s')$ with corpus prior $b_r$:

$$P(R \mid s) = \frac{L(s')\, b_r}{L(s')\, b_r + \big(1 - L(s')\big)(1 - b_r)}$$

Since $\operatorname{logit}\big(P(R \mid s)\big) = \operatorname{logit}\big(L(s')\big) + \operatorname{logit}(b_r) = \alpha(s' - \beta) + \operatorname{logit}(b_r)$, we recover the stated form.
This additive structure in log-odds space is fundamental — it means evidence from different sources combines linearly, which is exactly the property needed for principled multi-signal fusion (Chapter 23).
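The decomposition can be checked numerically; the sketch below applies the Bayes update with base rate $b_r$ and confirms the additive logit form (helper names are illustrative):

```cpp
#include <cassert>
#include <cmath>

float logit(float p) { return std::log(p / (1.0f - p)); }
float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Bayes update of the sigmoid likelihood L with corpus base rate b_r.
float posterior_with_base_rate(float L, float b_r) {
    return (L * b_r) / (L * b_r + (1.0f - L) * (1.0f - b_r));
}
```

The logit of the updated posterior equals the sigmoid argument plus the logit of the base rate, exactly as the theorem states.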
20.4.4 Base Rate Estimation
In the absence of relevance labels, $b_r$ can be estimated from the corpus score distribution:
```
Input:  Corpus C with N documents, BM25 index I
Output: Estimated base rate b_r

m = min(N, 50)
indices = sample_uniform(1..N, m)
for i = 1 to m:
    q_i = first_5_tokens(C[indices[i]])
    S_i = {s(d, q_i) : d in C, s(d, q_i) > 0}
    t_i = percentile(S_i, 95)
    r_i = |{s in S_i : s >= t_i}| / N
b_r = clamp(mean(r_1, ..., r_m), 1e-6, 0.5)
```
The 95th percentile threshold selects documents with unusually high scores relative to each pseudo-query — a proxy for the fraction of highly relevant documents. The upper clamp at 0.5 ensures $\operatorname{logit}(b_r) \le 0$, so the base rate can only shift the posterior downward, correcting the overconfidence.
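The final aggregation step of the pseudocode (averaging the per-query ratios and clamping to [1e-6, 0.5]) can be sketched as follows; the `estimate_base_rate` helper is illustrative, not from the chapter's codebase:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Average the per-pseudo-query high-score fractions r_i and clamp so
// that the resulting base rate can only pull the posterior downward.
float estimate_base_rate(const std::vector<float>& ratios) {
    float sum = 0.0f;
    for (float r : ratios) sum += r;
    float mean = ratios.empty() ? 0.0f : sum / ratios.size();
    return std::clamp(mean, 1e-6f, 0.5f);
}
```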
20.4.5 Calibration Impact
The base rate prior reduces expected calibration error (ECE) by 68--77% without requiring any relevance labels:
| Method | ECE (NFCorpus) | ECE (SciFact) |
|---|---|---|
| Bayesian (auto) | 0.6519 | 0.7989 |
| Bayesian (auto + base rate) | 0.1461 | 0.2577 |
| Platt scaling (supervised) | 0.0186 | 0.0188 |
| Bayesian (batch fit, supervised) | 0.0084 | 0.0069 |
The unsupervised base rate correction provides the largest single improvement available without labeled data.
20.5 WAND/BMW Compatibility
20.5.1 Modified Upper Bounds
WAND (Chapter 25) requires score upper bounds for safe pruning. With the neutral prior ($\pi = 0.5$) and log-compression, the Bayesian upper bound simplifies considerably. Since the prior is fixed, the posterior equals the sigmoid likelihood, and the upper bound depends only on the maximum BM25 score:

$$UB_{\text{Bayes}} = \sigma\big(\alpha(\ln(1 + UB_{\text{BM25}}) - \beta)\big)$$

The log-compression is applied to the BM25 upper bound to match the scorer's compression step (Section 20.3.2). This bound can be precomputed at index time alongside the standard BM25 upper bound, incurring no additional runtime cost.
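A minimal sketch of the precomputed bound, assuming the formula above (the helper name is illustrative):

```cpp
#include <cassert>
#include <cmath>

// Bayesian upper bound derived from a BM25 term upper bound: apply the
// same log-compression and sigmoid as the scorer. Monotonicity of both
// steps guarantees this dominates every achievable Bayesian score.
float bayesian_upper_bound(float bm25_ub, float alpha, float beta) {
    float compressed = std::log(1.0f + bm25_ub);
    return 1.0f / (1.0f + std::exp(-alpha * (compressed - beta)));
}
```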
20.5.2 Block-Max WAND with Bayesian Scoring
BMW uses block-level upper bounds for finer-grained pruning. For Bayesian BM25, the block-max upper bound applies the same sigmoid formula with the block-local maximum BM25 score:

$$UB_{\text{Bayes}}(B) = \sigma\big(\alpha(\ln(1 + UB_{\text{BM25}}(B)) - \beta)\big)$$

Since $UB_{\text{BM25}}(B) \le UB_{\text{BM25}}$ for all blocks $B$, and both $\ln(1 + \cdot)$ and $\sigma$ are monotonically increasing:

$$UB_{\text{Bayes}}(B) \le UB_{\text{Bayes}}$$

so BMW achieves higher skip rates than WAND while maintaining exact top-$k$ results.
20.6 Parameter Learning
20.6.1 Cross-Entropy Loss
When relevance labels are available, the sigmoid parameters $\alpha$ and $\beta$ can be optimized using cross-entropy loss:

$$\mathcal{L}(\alpha, \beta) = -\frac{1}{n}\sum_{i=1}^{n}\big[y_i \ln p_i + (1 - y_i)\ln(1 - p_i)\big]$$

where $y_i \in \{0, 1\}$ is the true relevance label and $p_i = \sigma(\alpha(s_i' - \beta))$ is the predicted probability for the log-compressed score $s_i'$.
20.6.2 Gradient Computation
The gradients simplify to:

$$\frac{\partial \mathcal{L}}{\partial \alpha} = \frac{1}{n}\sum_{i=1}^{n}(p_i - y_i)(s_i' - \beta), \qquad \frac{\partial \mathcal{L}}{\partial \beta} = -\frac{\alpha}{n}\sum_{i=1}^{n}(p_i - y_i)$$

The sigmoid's self-referential derivative $\sigma'(x) = \sigma(x)\big(1 - \sigma(x)\big)$ cancels with the cross-entropy gradient, producing these clean forms.
20.6.3 Training Modes
Two consistency conditions govern the relationship between training and inference:
Balanced training (recommended): Train on a balanced dataset where $P(R) = 0.5$. The learned sigmoid then approximates the normalized likelihood ratio, and the neutral prior ($\pi = 0.5$) applies at inference time without double-counting.
Prior-free inference: If trained on representative (non-balanced) data, use the learned sigmoid directly at inference time, yielding $P(R \mid s) = \sigma(\alpha(s' - \beta))$, where the training-set prevalence is absorbed into the learned $\beta$. The posterior is well-calibrated with respect to the training distribution. A separate corpus-level base rate (Section 20.4) can be applied as an additive correction in log-odds space without retraining.
20.6.4 Implementation
The parameter learning applies the same log-compression as the scorer (Section 20.3.2), ensuring that the learned $\alpha$ and $\beta$ are calibrated against compressed scores:
```cpp
void BayesianBM25Similarity::update_parameters(
    const std::vector<float>& scores,
    const std::vector<float>& relevances) {
  constexpr int32_t num_iterations = 1000;
  constexpr float learning_rate = 0.01f;
  float alpha = 1.0f;
  float beta = scores[scores.size() / 2];  // Middle element; the median if scores are sorted
  for (int32_t iter = 0; iter < num_iterations; ++iter) {
    float alpha_grad = 0.0f;
    float beta_grad = 0.0f;
    for (size_t i = 0; i < scores.size(); ++i) {
      // Log-compression matching BayesianBM25SimScorer::score()
      float compressed = std::log(1.0f + scores[i]);
      float pred = 1.0f / (1.0f + std::exp(-alpha * (compressed - beta)));
      float error = pred - relevances[i];
      alpha_grad += error * (compressed - beta);
      beta_grad += -error * alpha;
    }
    alpha -= learning_rate * alpha_grad / static_cast<float>(scores.size());
    beta -= learning_rate * beta_grad / static_cast<float>(scores.size());
  }
  std::scoped_lock guard{lock_};
  alpha_ = alpha;
  beta_ = beta;
}
```
20.7 Computational Complexity
20.7.1 Scoring Overhead
The additional cost of Bayesian BM25 over standard BM25 is $O(1)$ per document:
| Operation | BM25 | Bayesian BM25 | Notes |
|---|---|---|---|
| Score computation | 2 div, 2 mul, 1 add | +1 log, +1 exp, +2 mul, +1 div | Log-compression + sigmoid |
| IDF computation | 1 log, 2 div, 2 add | Same | Computed once per query |
With the neutral prior, there is no per-document prior computation. The only added cost beyond standard BM25 is the log-compression $\ln(1 + s)$ and the sigmoid evaluation $\sigma(\alpha(s' - \beta))$.
20.7.2 Efficient Computation
With the neutral prior ($\pi = 0.5$), the posterior simplifies to the sigmoid likelihood, avoiding the full Bayesian posterior formula. The computation path is:
- Compute $s' = \ln(1 + s)$ (log-compression)
- Compute $P(R \mid s) = \sigma(\alpha(s' - \beta))$ (sigmoid likelihood = posterior)
When a non-neutral base rate is configured at the similarity level, it is folded into the $\beta$ parameter, preserving the single-sigmoid evaluation path. This avoids the two successive Bayes updates that were previously needed with a document-dependent prior.
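One way to realize this folding, consistent with the two-term decomposition of Section 20.4.3, is a shifted midpoint $\beta' = \beta - \operatorname{logit}(b_r)/\alpha$; the exact mechanism in the implementation may differ, so treat this as a sketch:

```cpp
#include <cassert>
#include <cmath>

float logit(float p) { return std::log(p / (1.0f - p)); }
float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Shifted midpoint so that sigma(alpha * (s' - beta')) equals
// sigma(alpha * (s' - beta) + logit(b_r)) -- one sigmoid, base rate folded in.
float folded_beta(float beta, float alpha, float base_rate) {
    return beta - logit(base_rate) / alpha;
}
```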
20.7.3 Memory Layout
The Bayesian BM25 scorer fits within the same 32-byte inline storage as the standard BM25 scorer: the embedded `BM25SimScorer` plus the two `float` parameters `alpha_` and `beta_` fit in the existing footprint.
No heap allocation is needed, maintaining cache efficiency.
20.8 Implementation
20.8.1 BayesianBM25Similarity Class
```cpp
class BayesianBM25Similarity {
  BM25Similarity bm25_;
  float alpha_ = 1.0f;
  float beta_ = 0.0f;
  float base_rate_ = 0.5f;  // Neutral by default
  mutable threading::spinlock lock_;

 public:
  auto create_scorer(
      const IndexStatsSnapshot& index_stats,
      const std::vector<std::optional<TermStatsSnapshot>>& term_stats) const
      -> SimScorerType {
    auto bm25_scorer = bm25_.create_scorer(index_stats, term_stats);
    return BayesianBM25SimScorer{bm25_scorer, alpha_, beta_};
  }

  auto is_probabilistic() const -> bool { return true; }
};
```
20.8.2 Scorer Implementation
```cpp
class BayesianBM25SimScorer {
  BM25SimScorer bm25_scorer_;
  float alpha_;
  float beta_;

 public:
  auto score(float freq, float norm) const -> float {
    float bm25_score = bm25_scorer_.score(freq, norm);
    // Log-compression prevents sigmoid saturation for high-IDF terms
    float compressed = std::log(1.0f + bm25_score);
    // With neutral prior (0.5), posterior = sigmoid likelihood
    float likelihood = 1.0f / (1.0f + std::exp(-alpha_ * (compressed - beta_)));
    return likelihood;
  }

  auto is_probabilistic() const -> bool { return true; }
};
```
20.8.3 Score Properties
| Property | Standard BM25 | Bayesian BM25 |
|---|---|---|
| Output range | $[0, \infty)$ | $(0, 1)$ |
| Interpretation | Relevance score | Probability |
| Cross-query comparison | Not meaningful | Valid |
| Threshold selection | Difficult | Intuitive (e.g., $P \ge 0.5$) |
| Multi-signal fusion | Requires normalization | Direct combination |
| Monotonicity | Guaranteed | Preserved |
20.9 Multi-Signal Score Combination
20.9.1 Log-Odds Mean Framework
The implementation uses the log-odds mean (Chapter 23) as the unified combination operator for all Boolean modes. Rather than naive probability multiplication (which suffers from conjunction shrinkage) or Noisy-OR (which saturates toward 1.0), the log-odds mean computes the arithmetic mean of logits:

$$p_{\text{comb}} = \sigma\!\left(\frac{1}{n}\sum_{i=1}^{n}\operatorname{logit}(p_i)\right), \qquad \operatorname{logit}(p) = \ln\frac{p}{1 - p}$$

This is equivalent to the geometric mean of odds ratios and has the identity property: when $n = 1$, the output equals the single input.
20.9.2 Conjunction (AND)
For conjunction, the log-odds mean uses confidence scaling to amplify agreement among signals:

$$p_{\text{AND}} = \sigma\!\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\operatorname{logit}(p_i)\right)$$

The $1/\sqrt{n}$ scaling causes the combined confidence to grow when independent signals agree, while the arithmetic mean ($1/n$) used for disjunction prevents saturation. See Chapter 23 for the full derivation.
20.9.3 Disjunction (OR)
For disjunction, the log-odds mean uses full normalization by $n$ (arithmetic mean) to prevent saturation:

$$p_{\text{OR}} = \sigma\!\left(\frac{1}{n}\sum_{i=1}^{n}\operatorname{logit}(p_i)\right)$$

The Noisy-OR formula $1 - \prod_{i}(1 - p_i)$ saturates toward 1.0 when individual probabilities are high, destroying discrimination among results. The arithmetic mean of logits keeps the output in a discriminative range regardless of how many scorers contribute.
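A compact sketch of both combiners; the $1/\sqrt{n}$ conjunction scaling is my reading of the confidence-scaling description, and the helper names are illustrative:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

float logit(float p) { return std::log(p / (1.0f - p)); }
float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// OR: arithmetic mean of logits -- identity for n = 1, no saturation.
float combine_or(const std::vector<float>& probs) {
    float sum = 0.0f;
    for (float p : probs) sum += logit(p);
    return sigmoid(sum / probs.size());
}

// AND: confidence scaling by sqrt(n) -- agreement amplifies confidence.
float combine_and(const std::vector<float>& probs) {
    float sum = 0.0f;
    for (float p : probs) sum += logit(p);
    return sigmoid(sum / std::sqrt(static_cast<float>(probs.size())));
}
```

Two agreeing scorers at 0.8 combine above 0.8 under AND, while OR returns their mean-logit value rather than the near-1.0 output Noisy-OR would produce.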
20.9.4 Numerical Stability
All probability computations clamp values to $[\varepsilon, 1 - \varepsilon]$ with $\varepsilon = 10^{-7}$ before computing logits. This epsilon is chosen as the smallest value where $1 - \varepsilon$ is representable as a float32 value distinct from 1.0. Using a smaller epsilon (e.g., $10^{-9}$) causes $1 - \varepsilon$ to round to 1.0 due to the float32 unit of least precision (ULP) at 1.0 being approximately $1.19 \times 10^{-7}$, which makes $\operatorname{logit}(1 - \varepsilon) = \infty$ and poisons the entire log-odds sum.
20.10 Summary
Bayesian BM25 provides the probabilistic bridge between raw relevance scores and calibrated probabilities. The key concepts covered in this chapter are:
Sigmoid Likelihood Model: Maps unbounded BM25 scores to $(0, 1)$ through a parametric sigmoid $\sigma(\alpha(s - \beta))$, with provably preserved monotonicity.

Log-Compression: Applies $s' = \ln(1 + s)$ before the sigmoid to prevent saturation for high-IDF terms, preserving score discrimination across the entire BM25 range.

Neutral Prior: Uses a fixed prior of $\pi = 0.5$, eliminating the per-document composite prior that double-counted evidence already modeled by BM25's term frequency ($k_1$) and document length ($b$) parameters. The posterior simplifies to the sigmoid likelihood.

Two-Term Posterior Decomposition: The full posterior decomposes additively in log-odds space — $\operatorname{logit}(P(R \mid s)) = \alpha(s' - \beta) + \operatorname{logit}(b_r)$ — enabling each evidence source to contribute independently. This is the central result that connects to the log-odds mean framework (Chapter 23).

Base Rate Prior: A corpus-level calibration mechanism that reduces expected calibration error by 68--77% without relevance labels, correcting the systematic overconfidence of the median-based $\beta$ estimate.

WAND/BMW Compatibility: Upper bounds use the log-compressed sigmoid evaluation $\sigma(\alpha(\ln(1 + UB_{\text{BM25}}) - \beta))$, maintaining exact pruning with precomputable bounds at no additional runtime cost.

Parameter Learning: Cross-entropy optimization with log-compressed scores, ensuring $\alpha$ and $\beta$ are calibrated against the same transformed score space used during inference.
The next chapter covers vector search and HNSW indexes (Chapter 21), followed by vector score calibration (Chapter 22), which applies the same likelihood ratio structure to dense retrieval scores — completing the probabilistic unification of sparse and dense search.
References
- Robertson, S. E., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4), 333-389.
- Cormack, G. V., Clarke, C. L., & Buettcher, S. (2009). Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. SIGIR.
- Platt, J. (1999). Probabilistic Outputs for Support Vector Machines. Advances in Large Margin Classifiers.
- Jeong, J. (2026). Bayesian BM25: A Probabilistic Framework for Hybrid Text and Vector Search. Zenodo preprint.
- Robbins, H. (1956). An Empirical Bayes Approach to Statistics. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability.
- Broder, A. Z., et al. (2003). Efficient Query Evaluation Using a Two-Level Retrieval Process. CIKM.
- Ding, S., & Suel, T. (2011). Faster Top-k Document Retrieval Using Block-Max Indexes. SIGIR.