Chapter 20: Bayesian BM25 and Probabilistic Calibration
20.1 Introduction
Chapter 19 established BM25 as the standard ranking function for full-text search. BM25 produces scores that are effective for ranking documents by relevance within a single query. However, when we attempt to use BM25 scores beyond simple ranking — combining them with vector similarity, setting relevance thresholds, or comparing across queries — fundamental limitations emerge.
This chapter introduces Bayesian BM25, a probabilistic framework that transforms unbounded BM25 scores into calibrated probabilities in $[0, 1]$. The transformation preserves ranking quality while enabling principled multi-signal fusion, which forms the foundation for hybrid search (Chapter 24).
20.1.1 The Score Interpretation Problem
BM25 scores suffer from four interpretability limitations:
- Unbounded Range: $s_{\text{BM25}} \in [0, \infty)$, making absolute interpretation impossible. A score of 12.7 carries no intrinsic meaning.
- Query Dependence: Score magnitudes vary with query length and term specificity. A two-term query over common words produces much lower scores than a five-term query over rare terms.
- Corpus Dependence: IDF values depend on collection statistics, preventing cross-corpus comparison.
- Signal Incompatibility: Vector similarity scores live in $[-1, 1]$ (cosine) or $[0, 1]$ (normalized). Direct combination with unbounded BM25 scores leads to one signal dominating the other.
20.1.2 The Signal Dominance Problem
For additive combination of signals with different scales:

$$s_{\text{combined}} = s_{\text{BM25}}(d, q) + s_{\text{vec}}(d, q)$$

the signal with the larger expected magnitude dominates the ranking. If BM25 scores average 8.5 while vector similarities average 0.7, the vector signal contributes less than 10% of the combined score — effectively invisible.
Weighted combination $s = \lambda\, s_{\text{BM25}} + (1 - \lambda)\, s_{\text{vec}}$ introduces a tuning parameter $\lambda$ with no principled basis for selection. Different queries, corpora, and embedding models require different weights.
20.1.3 Why Reciprocal Rank Fusion Falls Short
Reciprocal Rank Fusion (RRF) is a popular approach to combining ranked lists:

$$\text{RRF}(d) = \sum_{i=1}^{n} \frac{1}{k + r_i(d)}$$

where $k = 60$ (by convention) and $r_i(d)$ is the rank of document $d$ in list $i$.
While simple and effective, RRF has theoretical limitations:
- Information loss: Score magnitudes are discarded. A document with BM25 score 15.3 and one with 15.2 receive different ranks but may have nearly identical relevance.
- Non-commutativity with filtering: Filtering before fusion produces different results than filtering after fusion, because rank positions change when documents are removed.
- Arbitrary constant: The parameter $k = 60$ lacks theoretical justification.
- Missing document handling: Documents appearing in only some ranked lists have undefined behavior.
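To make the rank-based mechanics concrete, here is a minimal RRF sketch (the `rrf_fuse` helper and its signature are illustrative, not from the chapter's codebase):

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Fuse ranked lists with RRF(d) = sum_i 1 / (k + r_i(d)), k = 60 by
// convention. Ranks are 1-based positions within each list.
std::unordered_map<std::string, double> rrf_fuse(
    const std::vector<std::vector<std::string>>& ranked_lists, int k = 60) {
    std::unordered_map<std::string, double> fused;
    for (const auto& list : ranked_lists) {
        for (size_t rank = 0; rank < list.size(); ++rank) {
            fused[list[rank]] += 1.0 / (k + static_cast<double>(rank + 1));
        }
    }
    return fused;
}
```

Note that only rank positions enter the sum; the underlying score magnitudes are discarded, which is exactly the information loss described above.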
Bayesian BM25 takes a different approach: rather than operating on ranks, it transforms raw scores into a common probabilistic space where combination follows from Bayes' theorem.
20.2 Sigmoid Likelihood Model
20.2.1 Probabilistic Framework
The goal is to compute $P(R \mid s)$ — the probability that a document is relevant given its BM25 score $s$. Using Bayes' theorem:

$$P(R \mid s) = \frac{P(s \mid R)\, P(R)}{P(s \mid R)\, P(R) + P(s \mid \neg R)\, P(\neg R)}$$

We need to model $P(s \mid R)$ — how likely a given score is for a relevant document. The sigmoid function provides a natural model:

$$P(s \mid R) = \sigma\big(\alpha(s - \beta)\big) = \frac{1}{1 + e^{-\alpha(s - \beta)}}$$
where:
- $\alpha > 0$ controls the sigmoid steepness (score sensitivity)
- $\beta$ is the midpoint (decision boundary)
20.2.2 Symmetric Likelihood Assumption
We assume that the likelihood of observing score $s$ for a non-relevant document is the complement:

$$P(s \mid \neg R) = 1 - P(s \mid R)$$

This symmetry is a consequence of the sigmoid's property $\sigma(-x) = 1 - \sigma(x)$. Under this assumption, the posterior simplifies to:

$$P(R \mid s) = \frac{L(s)\,\pi}{L(s)\,\pi + \big(1 - L(s)\big)(1 - \pi)}$$

where $L(s) = \sigma(\alpha(s - \beta))$ is the sigmoid likelihood and $\pi = P(R)$ is the prior probability of relevance.
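As a sketch, the symmetric-likelihood posterior can be computed directly from the definitions above (function names are illustrative, not from the chapter's codebase):

```cpp
#include <cassert>
#include <cmath>

// Sigmoid likelihood L(s) = sigma(alpha * (s - beta)).
float sigmoid_likelihood(float s, float alpha, float beta) {
    return 1.0f / (1.0f + std::exp(-alpha * (s - beta)));
}

// Posterior P(R|s) under the symmetric likelihood assumption
// P(s|~R) = 1 - P(s|R), with prior pi = P(R).
float posterior(float s, float alpha, float beta, float prior) {
    float L = sigmoid_likelihood(s, alpha, beta);
    return (L * prior) / (L * prior + (1.0f - L) * (1.0f - prior));
}
```

With a neutral prior of 0.5, the posterior reduces to the likelihood itself, which is the simplification used in Section 20.3.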
20.2.3 Monotonicity Preservation
A critical property: the Bayesian transformation preserves the ranking order of BM25 scores.
Theorem (Monotonicity). For fixed prior $\pi \in (0, 1)$, $\alpha > 0$, and scores $s_1 < s_2$:

$$P(R \mid s_1) < P(R \mid s_2)$$

Proof. Let $L_i = \sigma(\alpha(s_i - \beta))$ with $s_1 < s_2$. Since $\alpha > 0$, $L_1 < L_2$, and:

$$\frac{d}{dL}\!\left[\frac{L\pi}{L\pi + (1 - L)(1 - \pi)}\right] = \frac{\pi(1 - \pi)}{\big(L\pi + (1 - L)(1 - \pi)\big)^2} > 0,$$

the posterior is strictly increasing in $L$, so $P(R \mid s_1) < P(R \mid s_2)$.
This means any document ranking produced by BM25 is preserved after Bayesian transformation — no ranking degradation occurs.
20.3 Neutral Prior and Log-Compression
20.3.1 Why a Neutral Prior
An earlier design used a per-document composite prior that combined term frequency and document length into a document-dependent prior $\pi_d$. In practice, this prior double-counted evidence that BM25 already models:
- BM25's term-frequency component already saturates term frequency through the $k_1$ parameter.
- BM25's length normalization (the $b$ parameter) already penalizes unusually long or short documents.
Applying a prior that re-encodes the same features inflates the posterior toward 1.0 for high-frequency terms in medium-length documents, destroying score discrimination. Removing the per-document prior and fixing $\pi = 0.5$ (neutral) eliminates the double-counting. With a neutral prior, the posterior simplifies to the sigmoid likelihood:

$$P(R \mid s) = L(s) = \sigma\big(\alpha(s - \beta)\big)$$

because the Bayesian update with $\pi = 0.5$ yields $\frac{0.5\,L(s)}{0.5\,L(s) + 0.5\,(1 - L(s))} = L(s)$.
20.3.2 Log-Compression of BM25 Scores
Raw BM25 scores grow linearly with IDF. For rare terms, the IDF term can be large (e.g., 8--12), producing raw BM25 scores that push $\sigma(\alpha(s - \beta))$ into the flat region near 1.0. This sigmoid saturation destroys discrimination among high-scoring documents.
Log-compression maps the unbounded BM25 range to a moderate logit range:

$$s' = \ln(1 + s)$$

The posterior becomes:

$$P(R \mid s) = \sigma\big(\alpha(\ln(1 + s) - \beta)\big)$$
Properties of log-compression:
- Monotonicity: $\ln(1 + s)$ is strictly increasing, so document ranking is preserved.
- IDF ordering: $s_1 < s_2 \implies \ln(1 + s_1) < \ln(1 + s_2)$, so the IDF discriminative power is maintained.
- Saturation prevention: Even for BM25 scores of 10--15, $\ln(1 + s) \approx 2.4$--$2.8$, which keeps the sigmoid argument in the discriminative range.
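A small numeric sketch (illustrative values, assuming $\alpha = 1$, $\beta = 1$) shows why the compression matters: two high BM25 scores that saturate the raw sigmoid remain clearly separated after log-compression:

```cpp
#include <cassert>
#include <cmath>

float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Calibrated probability with log-compression applied before the sigmoid.
float calibrated(float bm25, float alpha, float beta) {
    return sigmoid(alpha * (std::log(1.0f + bm25) - beta));
}
```

Feeding raw scores of 10 and 15 into the sigmoid directly yields nearly identical outputs, while the compressed versions stay well apart.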
20.4 Corpus-Level Base Rate Prior
20.4.1 The Overconfidence Problem
The auto-estimation heuristic sets $\beta = \operatorname{median}(s)$, placing the sigmoid midpoint at the corpus median. Because most documents are not relevant to a typical query, this assigns approximately 50% relevance probability to scores that are almost never relevant — producing systematic overconfidence.
The neutral prior (Section 20.3.1) does not correct this because it is fixed at 0.5 and does not encode the corpus-level prevalence of relevance. A separate corpus-level base rate is needed.
20.4.2 Base Rate Definition
The base rate prior captures the global probability that a randomly selected document is relevant to a randomly selected query:

$$b_r = P(R) = \Pr_{d,\, q}\big[\,d \text{ is relevant to } q\,\big]$$

For most corpora, $b_r$ is very small — perhaps 0.01 to 0.05.
20.4.3 Two-Term Posterior Decomposition
With the neutral document prior ($\pi = 0.5$, see Section 20.3.1) and log-compressed BM25 scores (Section 20.3.2), the full posterior decomposes as two additive terms in log-odds space.

Theorem (Two-Term Decomposition). Let $s' = \ln(1 + s)$ be the log-compressed score, $L(s') = \sigma(\alpha(s' - \beta))$ be the sigmoid likelihood, and $b_r$ be the base rate prior. Then:

$$\operatorname{logit}\!\big(P(R \mid s)\big) = \alpha(s' - \beta) + \operatorname{logit}(b_r)$$

or equivalently:

$$P(R \mid s) = \sigma\!\big(\alpha(s' - \beta) + \operatorname{logit}(b_r)\big)$$

where $\operatorname{logit}(p) = \ln\frac{p}{1 - p}$.

Proof. With a neutral prior $\pi = 0.5$, $\operatorname{logit}(\pi) = 0$, so the document prior term vanishes. The single Bayes update combines likelihood $L(s')$ with corpus prior $b_r$:

$$P(R \mid s) = \frac{L(s')\, b_r}{L(s')\, b_r + \big(1 - L(s')\big)(1 - b_r)}$$

Since $\operatorname{logit}\big(P(R \mid s)\big) = \operatorname{logit}\big(L(s')\big) + \operatorname{logit}(b_r) = \alpha(s' - \beta) + \operatorname{logit}(b_r)$, we recover the stated form.
This additive structure in log-odds space is fundamental — it means evidence from different sources combines linearly, which is exactly the property needed for principled multi-signal fusion (Chapter 23).
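The decomposition can be checked numerically; the sketch below applies the Bayes update with base rate $b_r$ and confirms the additive logit form (helper names are illustrative):

```cpp
#include <cassert>
#include <cmath>

float logit(float p) { return std::log(p / (1.0f - p)); }
float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Bayes update of the sigmoid likelihood L with corpus base rate b_r.
float posterior_with_base_rate(float L, float b_r) {
    return (L * b_r) / (L * b_r + (1.0f - L) * (1.0f - b_r));
}
```

The logit of the updated posterior equals the sigmoid argument plus the logit of the base rate, exactly as the theorem states.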
20.4.4 Base Rate Estimation
In the absence of relevance labels, $b_r$ can be estimated from the corpus score distribution:
```
Input:  Corpus C with N documents, BM25 index I
Output: Estimated base rate b_r

m = min(N, 50)
indices = sample_uniform(1..N, m)
for i = 1 to m:
    q_i = first_5_tokens(C[indices[i]])
    S_i = {s(d, q_i) : d in C, s(d, q_i) > 0}
    t_i = percentile(S_i, 95)
    r_i = |{s in S_i : s >= t_i}| / N
b_r = clamp(mean(r_1, ..., r_m), 1e-6, 0.5)
```
The 95th percentile threshold selects documents with unusually high scores relative to each pseudo-query — a proxy for the fraction of highly relevant documents. The upper clamp at 0.5 ensures $\operatorname{logit}(b_r) \le 0$, so the base rate can only shift the posterior downward, correcting the overconfidence.
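The final aggregation step of the pseudocode (averaging the per-query ratios and clamping to [1e-6, 0.5]) can be sketched as follows; the `estimate_base_rate` helper is illustrative, not from the chapter's codebase:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Average the per-pseudo-query high-score fractions r_i and clamp so
// that the resulting base rate can only pull the posterior downward.
float estimate_base_rate(const std::vector<float>& ratios) {
    float sum = 0.0f;
    for (float r : ratios) sum += r;
    float mean = ratios.empty() ? 0.0f : sum / ratios.size();
    return std::clamp(mean, 1e-6f, 0.5f);
}
```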
20.4.5 Calibration Impact
The base rate prior reduces expected calibration error (ECE) by 68--77% without requiring any relevance labels:
| Method | ECE (NFCorpus) | ECE (SciFact) |
|---|---|---|
| Bayesian (auto) | 0.6519 | 0.7989 |
| Bayesian (auto + base rate) | 0.1461 | 0.2577 |
| Platt scaling (supervised) | 0.0186 | 0.0188 |
| Bayesian (batch fit, supervised) | 0.0084 | 0.0069 |
The unsupervised base rate correction provides the largest single improvement available without labeled data.
20.5 WAND/BMW Compatibility
20.5.1 Modified Upper Bounds
WAND (Chapter 25) requires score upper bounds for safe pruning. With the neutral prior ($\pi = 0.5$) and log-compression, the Bayesian upper bound simplifies considerably. Since the prior is fixed, the posterior equals the sigmoid likelihood, and the upper bound depends only on the maximum BM25 score:

$$UB_{\text{Bayes}} = \sigma\big(\alpha(\ln(1 + UB_{\text{BM25}}) - \beta)\big)$$

The log-compression is applied to the BM25 upper bound to match the scorer's compression step (Section 20.3.2). This bound can be precomputed at index time alongside the standard BM25 upper bound, incurring no additional runtime cost.
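A minimal sketch of the precomputed bound, assuming the formula above (the helper name is illustrative):

```cpp
#include <cassert>
#include <cmath>

// Bayesian upper bound derived from a BM25 term upper bound: apply the
// same log-compression and sigmoid as the scorer. Monotonicity of both
// steps guarantees this dominates every achievable Bayesian score.
float bayesian_upper_bound(float bm25_ub, float alpha, float beta) {
    float compressed = std::log(1.0f + bm25_ub);
    return 1.0f / (1.0f + std::exp(-alpha * (compressed - beta)));
}
```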
20.5.2 Block-Max WAND with Bayesian Scoring
BMW uses block-level upper bounds for finer-grained pruning. For Bayesian BM25, the block-max upper bound applies the same sigmoid formula with the block-local maximum BM25 score:

$$UB_{\text{Bayes}}(B) = \sigma\big(\alpha(\ln(1 + UB_{\text{BM25}}(B)) - \beta)\big)$$

Since $UB_{\text{BM25}}(B) \le UB_{\text{BM25}}$ for all blocks $B$, and both $\ln(1 + \cdot)$ and $\sigma$ are monotonically increasing:

$$UB_{\text{Bayes}}(B) \le UB_{\text{Bayes}}$$

so BMW achieves higher skip rates than WAND while maintaining exact top-$k$ results.
20.6 Parameter Learning
20.6.1 Cross-Entropy Loss
When relevance labels are available, the sigmoid parameters $\alpha$ and $\beta$ can be optimized using cross-entropy loss:

$$\mathcal{L}(\alpha, \beta) = -\frac{1}{n}\sum_{i=1}^{n}\big[y_i \ln p_i + (1 - y_i)\ln(1 - p_i)\big]$$

where $y_i \in \{0, 1\}$ is the true relevance label and $p_i = \sigma(\alpha(s_i' - \beta))$ is the predicted probability for the log-compressed score $s_i'$.
20.6.2 Gradient Computation
The gradients simplify to:

$$\frac{\partial \mathcal{L}}{\partial \alpha} = \frac{1}{n}\sum_{i=1}^{n}(p_i - y_i)(s_i' - \beta), \qquad \frac{\partial \mathcal{L}}{\partial \beta} = -\frac{\alpha}{n}\sum_{i=1}^{n}(p_i - y_i)$$

The sigmoid's self-referential derivative $\sigma'(x) = \sigma(x)\big(1 - \sigma(x)\big)$ cancels with the cross-entropy gradient, producing these clean forms.
20.6.3 Training Modes
Two consistency conditions govern the relationship between training and inference:
Balanced training (recommended): Train on a balanced dataset where $P(R) = 0.5$. The learned sigmoid then approximates the normalized likelihood ratio, and the neutral prior ($\pi = 0.5$) applies at inference time without double-counting.
Prior-free inference: If trained on representative (non-balanced) data, use the learned sigmoid directly at inference time, yielding $P(R \mid s) = \sigma(\alpha(s' - \beta))$, where the training-set prevalence is absorbed into the learned $\beta$. The posterior is well-calibrated with respect to the training distribution. A separate corpus-level base rate (Section 20.4) can be applied as an additive correction in log-odds space without retraining.
20.6.4 Implementation
The parameter learning applies the same log-compression as the scorer (Section 20.3.2), ensuring that the learned $\alpha$ and $\beta$ are calibrated against compressed scores:
```cpp
void BayesianBM25Similarity::update_parameters(
    const std::vector<float>& scores,
    const std::vector<float>& relevances) {
  constexpr int32_t num_iterations = 1000;
  constexpr float learning_rate = 0.01f;
  float alpha = 1.0f;
  float beta = scores[scores.size() / 2];  // Middle element; the median if scores are sorted
  for (int32_t iter = 0; iter < num_iterations; ++iter) {
    float alpha_grad = 0.0f;
    float beta_grad = 0.0f;
    for (size_t i = 0; i < scores.size(); ++i) {
      // Log-compression matching BayesianBM25SimScorer::score()
      float compressed = std::log(1.0f + scores[i]);
      float pred = 1.0f / (1.0f + std::exp(-alpha * (compressed - beta)));
      float error = pred - relevances[i];
      alpha_grad += error * (compressed - beta);
      beta_grad += -error * alpha;
    }
    alpha -= learning_rate * alpha_grad / static_cast<float>(scores.size());
    beta -= learning_rate * beta_grad / static_cast<float>(scores.size());
  }
  std::scoped_lock guard{lock_};
  alpha_ = alpha;
  beta_ = beta;
}
```
20.7 Computational Complexity
20.7.1 Scoring Overhead
The additional cost of Bayesian BM25 over standard BM25 is $O(1)$ per document:
| Operation | BM25 | Bayesian BM25 | Notes |
|---|---|---|---|
| Score computation | 2 div, 2 mul, 1 add | +1 log, +1 exp, +2 mul, +1 div | Log-compression + sigmoid |
| IDF computation | 1 log, 2 div, 2 add | Same | Computed once per query |
With the neutral prior, there is no per-document prior computation. The only added cost beyond standard BM25 is the log-compression $\ln(1 + s)$ and the sigmoid evaluation $\sigma(\alpha(s' - \beta))$.
20.7.2 Efficient Computation
With the neutral prior ($\pi = 0.5$), the posterior simplifies to the sigmoid likelihood, avoiding the full Bayesian posterior formula. The computation path is:
- Compute $s' = \ln(1 + s)$ (log-compression)
- Compute $P(R \mid s) = \sigma(\alpha(s' - \beta))$ (sigmoid likelihood = posterior)
When a non-neutral base rate is configured at the similarity level, it is folded into the $\beta$ parameter, preserving the single-sigmoid evaluation path. This avoids the two successive Bayes updates that were previously needed with a document-dependent prior.
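One way to realize this folding, consistent with the two-term decomposition of Section 20.4.3, is a shifted midpoint $\beta' = \beta - \operatorname{logit}(b_r)/\alpha$; the exact mechanism in the implementation may differ, so treat this as a sketch:

```cpp
#include <cassert>
#include <cmath>

float logit(float p) { return std::log(p / (1.0f - p)); }
float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Shifted midpoint so that sigma(alpha * (s' - beta')) equals
// sigma(alpha * (s' - beta) + logit(b_r)) -- one sigmoid, base rate folded in.
float folded_beta(float beta, float alpha, float base_rate) {
    return beta - logit(base_rate) / alpha;
}
```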
20.7.3 Memory Layout
The Bayesian BM25 scorer fits within the same 32-byte inline storage as the standard BM25 scorer: the embedded `BM25SimScorer` plus the two `float` parameters `alpha_` and `beta_` fit in the existing footprint.
No heap allocation is needed, maintaining cache efficiency.
20.8 Implementation
20.8.1 BayesianBM25Similarity Class
```cpp
class BayesianBM25Similarity {
  BM25Similarity bm25_;
  float alpha_ = 1.0f;
  float beta_ = 0.0f;
  float base_rate_ = 0.5f;  // Neutral by default
  mutable threading::spinlock lock_;

 public:
  auto create_scorer(
      const IndexStatsSnapshot& index_stats,
      const std::vector<std::optional<TermStatsSnapshot>>& term_stats) const
      -> SimScorerType {
    auto bm25_scorer = bm25_.create_scorer(index_stats, term_stats);
    return BayesianBM25SimScorer{bm25_scorer, alpha_, beta_};
  }

  auto is_probabilistic() const -> bool { return true; }
};
```
20.8.2 Scorer Implementation
```cpp
class BayesianBM25SimScorer {
  BM25SimScorer bm25_scorer_;
  float alpha_;
  float beta_;

 public:
  auto score(float freq, float norm) const -> float {
    float bm25_score = bm25_scorer_.score(freq, norm);
    // Log-compression prevents sigmoid saturation for high-IDF terms
    float compressed = std::log(1.0f + bm25_score);
    // With neutral prior (0.5), posterior = sigmoid likelihood
    float likelihood = 1.0f / (1.0f + std::exp(-alpha_ * (compressed - beta_)));
    return likelihood;
  }

  auto is_probabilistic() const -> bool { return true; }
};
```
20.8.3 Score Properties
| Property | Standard BM25 | Bayesian BM25 |
|---|---|---|
| Output range | $[0, \infty)$ | $(0, 1)$ |
| Interpretation | Relevance score | Probability |
| Cross-query comparison | Not meaningful | Valid |
| Threshold selection | Difficult | Intuitive (e.g., $P \ge 0.5$) |
| Multi-signal fusion | Requires normalization | Direct combination |
| Monotonicity | Guaranteed | Preserved |
20.9 Multi-Signal Score Combination
20.9.1 Log-Odds Mean Framework
The implementation uses the log-odds mean (Chapter 23) as the unified combination operator for all Boolean modes. Rather than naive probability multiplication (which suffers from conjunction shrinkage) or Noisy-OR (which saturates toward 1.0), the log-odds mean computes the arithmetic mean of logits:

$$p_{\text{comb}} = \sigma\!\left(\frac{1}{n}\sum_{i=1}^{n}\operatorname{logit}(p_i)\right), \qquad \operatorname{logit}(p) = \ln\frac{p}{1 - p}$$

This is equivalent to the geometric mean of odds ratios and has the identity property: when $n = 1$, the output equals the single input.
20.9.2 Conjunction (AND)
For conjunction, the log-odds mean uses confidence scaling to amplify agreement among signals:

$$p_{\text{AND}} = \sigma\!\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\operatorname{logit}(p_i)\right)$$

The $1/\sqrt{n}$ scaling causes the combined confidence to grow when independent signals agree, while the arithmetic mean ($1/n$) used for disjunction prevents saturation. See Chapter 23 for the full derivation.
20.9.3 Disjunction (OR)
For disjunction, the log-odds mean uses full normalization by $n$ (arithmetic mean) to prevent saturation:

$$p_{\text{OR}} = \sigma\!\left(\frac{1}{n}\sum_{i=1}^{n}\operatorname{logit}(p_i)\right)$$

The Noisy-OR formula $1 - \prod_{i}(1 - p_i)$ saturates toward 1.0 when individual probabilities are high, destroying discrimination among results. The arithmetic mean of logits keeps the output in a discriminative range regardless of how many scorers contribute.
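A compact sketch of both combiners; the $1/\sqrt{n}$ conjunction scaling is my reading of the confidence-scaling description, and the helper names are illustrative:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

float logit(float p) { return std::log(p / (1.0f - p)); }
float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// OR: arithmetic mean of logits -- identity for n = 1, no saturation.
float combine_or(const std::vector<float>& probs) {
    float sum = 0.0f;
    for (float p : probs) sum += logit(p);
    return sigmoid(sum / probs.size());
}

// AND: confidence scaling by sqrt(n) -- agreement amplifies confidence.
float combine_and(const std::vector<float>& probs) {
    float sum = 0.0f;
    for (float p : probs) sum += logit(p);
    return sigmoid(sum / std::sqrt(static_cast<float>(probs.size())));
}
```

Two agreeing scorers at 0.8 combine above 0.8 under AND, while OR returns their mean-logit value rather than the near-1.0 output Noisy-OR would produce.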
20.9.4 Numerical Stability
All probability computations clamp values to $[\varepsilon, 1 - \varepsilon]$ with $\varepsilon = 10^{-7}$ before computing logits. This epsilon is chosen as the smallest value where $1 - \varepsilon$ is representable as a float32 value distinct from 1.0. Using a smaller epsilon (e.g., $10^{-9}$) causes $1 - \varepsilon$ to round to 1.0 due to the float32 unit of least precision (ULP) at 1.0 being approximately $1.19 \times 10^{-7}$, which makes $\operatorname{logit}(1 - \varepsilon) = \infty$ and poisons the entire log-odds sum.
20.10 Summary
Bayesian BM25 provides the probabilistic bridge between raw relevance scores and calibrated probabilities. The key concepts covered in this chapter are:
Sigmoid Likelihood Model: Maps unbounded BM25 scores to $(0, 1)$ through a parametric sigmoid $\sigma(\alpha(s - \beta))$, with provably preserved monotonicity.

Log-Compression: Applies $s' = \ln(1 + s)$ before the sigmoid to prevent saturation for high-IDF terms, preserving score discrimination across the entire BM25 range.

Neutral Prior: Uses a fixed prior of $\pi = 0.5$, eliminating the per-document composite prior that double-counted evidence already modeled by BM25's term frequency ($k_1$) and document length ($b$) parameters. The posterior simplifies to the sigmoid likelihood.

Two-Term Posterior Decomposition: The full posterior decomposes additively in log-odds space — $\operatorname{logit}(P(R \mid s)) = \alpha(s' - \beta) + \operatorname{logit}(b_r)$ — enabling each evidence source to contribute independently. This is the central result that connects to the log-odds mean framework (Chapter 23).

Base Rate Prior: A corpus-level calibration mechanism that reduces expected calibration error by 68--77% without relevance labels, correcting the systematic overconfidence of the median-based $\beta$ estimate.

WAND/BMW Compatibility: Upper bounds use the log-compressed sigmoid evaluation $\sigma(\alpha(\ln(1 + UB_{\text{BM25}}) - \beta))$, maintaining exact pruning with precomputable bounds at no additional runtime cost.

Parameter Learning: Cross-entropy optimization with log-compressed scores, ensuring $\alpha$ and $\beta$ are calibrated against the same transformed score space used during inference.
The next chapter covers vector search and HNSW indexes (Chapter 21), followed by vector score calibration (Chapter 22), which applies the same likelihood ratio structure to dense retrieval scores — completing the probabilistic unification of sparse and dense search.
References
- Robertson, S. E., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4), 333-389.
- Cormack, G. V., Clarke, C. L., & Buettcher, S. (2009). Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. SIGIR.
- Platt, J. (1999). Probabilistic Outputs for Support Vector Machines. Advances in Large Margin Classifiers.
- Jeong, J. (2026). Bayesian BM25: A Probabilistic Framework for Hybrid Text and Vector Search. Zenodo preprint.
- Robbins, H. (1956). An Empirical Bayes Approach to Statistics. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability.
- Broder, A. Z., et al. (2003). Efficient Query Evaluation Using a Two-Level Retrieval Process. CIKM.
- Ding, S., & Suel, T. (2011). Faster Top-k Document Retrieval Using Block-Max Indexes. SIGIR.