Why Sigmoid? The Mathematical Inevitability Behind Bayesian BM25

1. A Question Nobody Asked
When we introduced Bayesian BM25, the response from the community was immediate and practical: "How much does it improve nDCG@10?" After integration into MTEB as a baseline retrieval model and txtai for hybrid search normalization, the results spoke clearly — consistent improvements of +0.8 to +3.0 nDCG@10 across all datasets, with zero configuration.
But there is a question almost nobody asked:
Why sigmoid?
The typical reading of our work is: "They chose a sigmoid to squash BM25 scores into [0, 1]." A design choice. A modeling decision. One option among many — perhaps tanh, perhaps min-max normalization, perhaps Platt scaling.
This reading is wrong. The sigmoid is not a choice. It is the only mathematically valid answer. And understanding why reveals something far deeper than a scoring trick.
2. The 50-Year Gap
In 1977, Stephen Robertson introduced the Probability Ranking Principle (PRP):
"If a retrieval system's response to each request is a ranking of documents in order of decreasing probability of relevance, the overall effectiveness of the system to its user will be the best that is obtainable."
The name of the framework that followed was deliberate: the Probabilistic Relevance Framework. And the scoring function it produced — BM25 — became the most successful ranking function in the history of information retrieval.
But BM25 does not output probabilities.
The output is an unbounded positive real number: $s \in [0, \infty)$. A score of 12.34 tells you nothing about how likely a document is to be relevant. Is it 90%? 50%? The number has no absolute meaning; it depends on query length, corpus statistics, and document length.
For 50 years, every textbook acknowledged this gap. Manning et al. (2008): "BM25 is derived from a probabilistic model, but the scores themselves are not probabilities." And for 50 years, nobody closed it.
The Probability Ranking Principle promised probabilities. It never delivered.
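To see how corpus-dependent the raw number is, here is a toy sketch (a minimal BM25 variant for illustration, not any production implementation; documents and corpora are invented): the same document and the same query score differently depending on what else is in the corpus.

```python
import math

def bm25_score(query, doc, corpus, k1=1.5, b=0.75):
    """Minimal BM25 sketch: the score depends on corpus-level statistics."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # non-negative IDF variant
        tf = doc.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

doc = ["neural", "search", "ranking"]
query = ["search", "ranking"]

# Same document, same query, two different corpora:
corpus_rare = [doc, ["cooking", "recipes"], ["travel", "guide"]]
corpus_common = [doc, ["search", "engines"], ["ranking", "search", "models"]]

s_rare = bm25_score(query, doc, corpus_rare)
s_common = bm25_score(query, doc, corpus_common)
# The raw numbers differ even though the document's relevance did not change.
```

Neither number, on its own, says anything about the probability of relevance; only the ordering within one corpus is meaningful.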
3. Why Not Just Pick Any Mapping?
A natural reaction is: why not just use any function that maps $\mathbb{R} \to [0, 1]$? There are infinitely many such functions: $\tanh$ (rescaled), min-max normalization, Platt scaling, and so on. Why should one be preferred?
Because we are not looking for any mapping. We are looking for $P(R = 1 \mid s)$, the posterior probability of relevance given the observed score. This is a specific quantity with a specific meaning, and Bayes' theorem dictates exactly how it must be computed:
$$P(R = 1 \mid s) = \frac{P(s \mid R = 1)\,P(R = 1)}{P(s \mid R = 1)\,P(R = 1) + P(s \mid R = 0)\,P(R = 0)}$$
So the question becomes: what is the likelihood $P(s \mid R)$?
This is where most approaches go wrong. Platt scaling, for example, fits $P(R = 1 \mid s) = \sigma(As + B)$ by minimizing cross-entropy on labeled data, but it treats $A$ and $B$ as free parameters with no structural meaning. On our benchmarks, Platt scaling achieves nDCG@10 of 0.0229 on NFCorpus and 0.0000 on SciFact: a catastrophic collapse. The model has two parameters and no prior knowledge, so when the score distribution is heavily skewed (as it always is in IR), the decision boundary lands in the wrong place.
We need a principled answer. And that answer comes from asking what kind of random variable relevance is.
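For reference, Platt scaling itself is simple to sketch (a minimal pure-Python illustration, not our benchmark code; the toy scores and labels are invented): it fits $\sigma(As + B)$ to labels by gradient descent on cross-entropy, with $A$ and $B$ entirely unconstrained.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_platt(scores, labels, lr=0.05, epochs=2000):
    """Fit P(R=1|s) = sigmoid(A*s + B) by full-batch gradient descent
    on cross-entropy. A and B are unconstrained free parameters."""
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        gA = gB = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(A * s + B) - y  # gradient of cross-entropy w.r.t. logit
            gA += err * s / n
            gB += err / n
        A -= lr * gA
        B -= lr * gB
    return A, B

# Invented toy data: higher scores are more often relevant.
scores = [0.5, 1.0, 2.0, 3.0, 5.0, 6.0, 8.0, 9.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
A, B = fit_platt(scores, labels)
p_low, p_high = sigmoid(A * 1.0 + B), sigmoid(A * 8.0 + B)
```

With nothing anchoring $A$ and $B$, a heavily skewed score distribution can drag the fitted decision boundary far from where it belongs, which is the failure mode described above.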
4. The Inevitability Argument
Relevance is binary. A document is relevant ($R = 1$) or not ($R = 0$). This is a Bernoulli random variable. And the Bernoulli distribution belongs to the exponential family:
$$P(R = r) = \exp\big(r\theta - \log(1 + e^{\theta})\big), \qquad r \in \{0, 1\}$$
where $\theta = \log\frac{p}{1 - p}$ is the natural (canonical) parameter: the log-odds.
Here is the key fact from statistical theory: every exponential family distribution has a canonical link function that maps the natural parameter to the mean. For the Bernoulli distribution, the mean is $p = P(R = 1)$, and the canonical link is the logit function $\theta = \log\frac{p}{1 - p}$. Its inverse, the function that maps the natural parameter back to the probability, is:
$$p = \sigma(\theta) = \frac{1}{1 + e^{-\theta}}$$
The sigmoid. Not because we chose it, but because the Bernoulli distribution requires it.
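For completeness, here is the derivation written out (standard exponential-family algebra):

```latex
\begin{align*}
P(R = r) &= p^{r}(1 - p)^{1 - r}, \qquad r \in \{0, 1\} \\
         &= \exp\!\big(r \log p + (1 - r)\log(1 - p)\big) \\
         &= \exp\!\Big(r\,\theta + \log(1 - p)\Big),
            \qquad \theta = \log\tfrac{p}{1 - p} \\[4pt]
\theta = \log\tfrac{p}{1 - p}
  \;\Longrightarrow\; e^{\theta}(1 - p) = p
  \;\Longrightarrow\; p = \frac{e^{\theta}}{1 + e^{\theta}}
  = \frac{1}{1 + e^{-\theta}} = \sigma(\theta)
\end{align*}
```

Solving the canonical link for $p$ leaves no freedom at any step: the inverse is the sigmoid and nothing else.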
More precisely, the sigmoid is the unique function satisfying all five of these conditions simultaneously:
- Range: Maps $\mathbb{R} \to (0, 1)$; any real-valued score becomes a valid probability
- Canonical form: Inverse of the Bernoulli canonical link, consistent with exponential family theory
- Self-referential derivative: $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$; the gradient is determined by the output itself
- Evidence symmetry: $\sigma(-x) = 1 - \sigma(x)$; evidence for relevance and evidence against relevance are treated symmetrically
- Maximum entropy: Among all distributions satisfying the mean constraint, the Bernoulli distribution (and hence the sigmoid link) has maximum entropy; it assumes nothing beyond what the evidence tells us
Remove any one of these conditions, and other functions become possible. Keep all five, and sigmoid is the only solution.
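The derivative and symmetry conditions are easy to check numerically; a quick self-contained sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 1.3  # arbitrary test point

# Self-referential derivative: sigma'(x) = sigma(x) * (1 - sigma(x)).
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
analytic = sigmoid(x) * (1 - sigmoid(x))

# Evidence symmetry: sigma(-x) = 1 - sigma(x).
symmetry_gap = abs(sigmoid(-x) - (1 - sigmoid(x)))
```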
This is why Bayesian BM25 uses $\sigma(\alpha(s - \beta))$. The parameter $\alpha$ controls the steepness (how sensitive the probability is to score changes), and $\beta$ controls the midpoint (what score corresponds to 50% relevance). But the functional form, the sigmoid itself, is not a parameter. It is a theorem.
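The two parameters have direct geometric readings; a small sketch, using the $\alpha = 1.5$, $\beta = 1.0$ values from the package example later in this post:

```python
import math

def calibrate(s, alpha=1.5, beta=1.0):
    """sigma(alpha * (s - beta)): alpha sets steepness, beta sets the midpoint."""
    return 1.0 / (1.0 + math.exp(-alpha * (s - beta)))

# beta is the midpoint: a score equal to beta maps to exactly 0.5.
p_mid = calibrate(1.0)

# alpha is the steepness: for the same score, a larger alpha pushes the
# probability further from 0.5.
shallow = calibrate(2.0, alpha=0.5)  # gentle transition
steep = calibrate(2.0, alpha=3.0)    # sharp transition
```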
5. What the Inevitability Explains
This is not merely a theoretical nicety. It explains an empirical fact that would otherwise be mysterious.
Bayesian BM25 was designed for BM25 scores. But when integrated into txtai, we tested it on three completely different scoring mechanisms — without changing a single line of code:
| Scoring Method | Origin | Score Characteristics |
|---|---|---|
| BM25 | Term frequency with saturation and length normalization (1994) | Bounded above by IDF, moderate tails |
| TF-IDF | Raw term frequency × inverse document frequency (1972) | Unbounded, no saturation, long tails |
| SPLADE | BERT-learned sparse term importance weights (2021) | Learned distribution, model-dependent |
These three methods produce scores with different scales, different distributions, and different tail behaviors. Their scores are generated by fundamentally different mechanisms — bag-of-words multiplication, saturated frequency ratios, and neural network forward passes.
Yet the same sigmoid calibration works on all three. No parameter adjustment. No retraining. The same $\sigma(\alpha(s - \beta))$ with default parameters.
Why? Because the sigmoid does not care how the score was generated. It only requires that the score has a monotonic relationship with relevance — higher score means more likely relevant. The inevitability argument tells us this must work: as long as relevance is binary (relevant or not), the correct posterior mapping is sigmoid, regardless of the scoring mechanism.
This is not a coincidence. It is a consequence of the mathematics.
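The claim is easy to sanity-check with synthetic scores (the lists below are invented stand-ins, not real BM25, TF-IDF, or SPLADE outputs): one default transform yields valid, order-preserving probabilities for all three scales.

```python
import math

def calibrate(s, alpha=1.5, beta=1.0):
    """Default sigmoid calibration sigma(alpha * (s - beta))."""
    return 1.0 / (1.0 + math.exp(-alpha * (s - beta)))

# Synthetic stand-ins for three scoring mechanisms (not real BM25 / TF-IDF /
# SPLADE outputs): different scales, different tail behavior.
score_sets = {
    "bm25_like":   [0.3, 1.2, 2.8, 5.1],    # moderate tail
    "tfidf_like":  [0.1, 0.9, 4.0, 17.5],   # long tail, unbounded
    "splade_like": [0.05, 0.4, 1.1, 2.2],   # compressed, learned-style
}

prob_sets = {name: [calibrate(s) for s in scores]
             for name, scores in score_sets.items()}
```

Every output lies in $(0, 1)$, and within each score set the ordering is untouched, which is all the argument requires.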
6. Completing the Circle
With the sigmoid transform, the Probability Ranking Principle is finally fulfilled as Robertson originally intended:
$$s = \text{BM25}(q, d) \;\longmapsto\; P(R = 1 \mid s) = \sigma(\alpha(s - \beta))$$
This arrow has three critical properties:
- Monotonicity preservation: If $s_1 > s_2$, then $\sigma(\alpha(s_1 - \beta)) > \sigma(\alpha(s_2 - \beta))$. The ranking is unchanged; Bayesian BM25 never disagrees with BM25 about document ordering.
- Calibration: The outputs approximate true relevance probabilities. On our benchmarks, the base rate prior alone (unsupervised, no labels) reduces Expected Calibration Error by 68–77%.
- WAND/BMW compatibility: Since the sigmoid is monotonic and bounded, existing top-k pruning algorithms (WAND, Block-Max WAND) remain valid with Bayesian probability upper bounds.
Property 1 means there is no risk — you lose nothing by switching from raw BM25 to Bayesian BM25. Property 2 means the probabilities are meaningful, not just bounded scores. Property 3 means production search systems can adopt this without sacrificing query performance.
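Property 3 can be sketched in a few lines (a simplified illustration of the pruning decision, not a real WAND or Block-Max WAND implementation; the scores are invented): because the sigmoid is strictly increasing, a block's raw-score upper bound maps to a probability upper bound, and the skip decision is identical in both spaces.

```python
import math

def calibrate(s, alpha=1.5, beta=1.0):
    return 1.0 / (1.0 + math.exp(-alpha * (s - beta)))

# Simplified block-max pruning decision: skip a block if its best possible
# score cannot beat the current k-th best candidate.
block_upper_bound = 4.2   # max raw score any document in the block can reach
threshold = 5.0           # raw score of the current k-th best document

# Because sigma(alpha * (s - beta)) is strictly increasing, comparing the
# transformed values gives exactly the same decision.
skip_in_score_space = block_upper_bound < threshold
skip_in_prob_space = calibrate(block_upper_bound) < calibrate(threshold)
```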
7. Why Probabilities Change Everything
Closing the PRP gap is historically significant, but the deeper contribution is what it opens. When scores become probabilities, the entire apparatus of probability theory becomes available:
Principled fusion. In score space, combining BM25 with vector similarity requires arbitrary choices: weighted sums, Reciprocal Rank Fusion (RRF), learned combination weights. In probability space, combination is defined by probability theory itself.
No arbitrary weights. No rank-based heuristics. The mathematics tells you how to combine.
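As one concrete instance, under a conditional-independence assumption between the two evidence sources (a sketch of standard Bayesian fusion, not necessarily the paper's exact rule): the calibrated log-odds add, and the shared prior is subtracted once to avoid double counting.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse(p_lexical, p_vector, prior=0.01):
    """Combine two calibrated relevance probabilities sharing one prior.

    Assuming the two evidence sources are conditionally independent given
    relevance, Bayes' theorem gives:
        logit(posterior) = logit(p_lexical) + logit(p_vector) - logit(prior)
    (the prior is subtracted once because it entered both posteriors).
    """
    return sigmoid(logit(p_lexical) + logit(p_vector) - logit(prior))

# Two individually weak signals that agree reinforce each other:
p_combined = fuse(0.30, 0.40, prior=0.01)
```

No weights were chosen anywhere; the combination rule falls out of Bayes' theorem and the independence assumption.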
Natural decision boundaries. In score space, "what BM25 threshold means relevant?" has no principled answer. In probability space, $P(R = 1 \mid s) = 0.5$ is the natural decision boundary: the point above which relevance is more likely than not. Our benchmarks show that threshold transfer (training on one query set, applying to another) works reliably with Bayesian probabilities, whereas Platt scaling collapses (F1 near zero on SciFact).
Information-theoretic analysis. With calibrated probabilities, cross-entropy, KL divergence, and mutual information are all defined. Search quality can be analyzed with the full toolkit of information theory.
Bayesian updating. Probabilities compose through Bayes' theorem. Prior knowledge (corpus statistics, user history, document metadata) can be incorporated through the prior, and the posterior updates as new evidence arrives.
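A minimal sketch of that updating loop (the log-likelihood ratios below are invented for illustration): each new evidence source adds its log-likelihood ratio to the running log-odds.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

# Start from a corpus-level prior: most documents are irrelevant.
log_odds = logit(0.01)

# Each evidence source contributes a log-likelihood ratio,
# log(P(evidence | relevant) / P(evidence | irrelevant)).
# These values are invented for illustration.
evidence_llrs = [2.3, 1.1, 0.7]  # e.g. lexical match, metadata, user history

posteriors = []
for llr in evidence_llrs:
    log_odds += llr                  # Bayes' rule in log-odds space
    posteriors.append(sigmoid(log_odds))
```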
This is why we describe the contribution as opening a coordinate change — from score space to probability space. The underlying reality (document relevance) does not change, but the mathematical tools available to work with it are transformed completely.
8. Base Rate Prior: 68–77% Calibration Improvement for Free
One practical consequence deserves special attention. Our base rate prior decomposes the Bayesian posterior into three additive terms in log-odds space.
The base rate — the proportion of relevant documents in the corpus — is estimated automatically from the score distribution. No labels required.
| Method | NFCorpus ECE | SciFact ECE |
|---|---|---|
| Bayesian (no base rate) | 0.6519 | 0.7989 |
| Bayesian (base_rate=auto) | 0.1461 (−77.6%) | 0.2577 (−67.7%) |
| Bayesian (base_rate=0.001) | 0.0081 (−98.8%) | 0.0354 (−95.6%) |
| Platt scaling (supervised) | 0.0186 | 0.0188 |
The unsupervised base rate prior (auto) achieves competitive calibration with supervised Platt scaling — and with a known base rate (0.001), it exceeds Platt scaling in calibration quality. This is possible because the base rate captures a structural fact about IR: the vast majority of documents are not relevant to any given query. Encoding this fact into the prior corrects the systematic overestimation that occurs without it.
This is a monotonic transform — it does not change document ranking. The same documents in the same order, but with dramatically better-calibrated probabilities.
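For readers who want to reproduce the metric, Expected Calibration Error is straightforward to compute (the standard binned definition, not our benchmark harness; the toy predictions are invented):

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: bucket predictions by confidence, then take the
    count-weighted mean of |empirical accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        confidence = sum(p for p, _ in bucket) / len(bucket)
        accuracy = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / n * abs(accuracy - confidence)
    return ece

# Invented toy predictions: well-calibrated vs. systematically overconfident.
ece_calibrated = expected_calibration_error([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1])
ece_overconfident = expected_calibration_error([0.99, 0.99, 0.99, 0.99], [1, 0, 1, 0])
```

Overconfidence shows up as a large gap between mean confidence and empirical accuracy within a bin, which is exactly what the base rate prior corrects.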
9. Try It
The reference implementation is available as a Python package:
```shell
pip install bayesian-bm25
```
Three lines to convert BM25 scores to probabilities:
```python
from bayesian_bm25 import BayesianProbabilityTransform

transform = BayesianProbabilityTransform(alpha=1.5, beta=1.0, base_rate=0.01)
probabilities = transform.score_to_probability(scores, tfs, doc_len_ratios)
```
For hybrid search integration, see the companion blog post on building a probabilistic search engine.
10. Conclusion
The sigmoid function in Bayesian BM25 is not an engineering decision. It is the unique mathematical consequence of three facts:
- Relevance is binary (Bernoulli)
- The Bernoulli distribution belongs to the exponential family
- The exponential family's canonical link inverse is the sigmoid
This is why the same transform works on BM25, TF-IDF, and SPLADE without modification. This is why it completes Robertson's Probability Ranking Principle after 50 years. And this is why it opens the door to principled multi-signal fusion, Bayesian updating, and information-theoretic analysis of search quality.
The question was never which function to use. The mathematics had already decided.
References
- Robertson, S. E. (1977). The Probability Ranking Principle in IR. Journal of Documentation, 33(4), 294–304.
- Robertson, S. E., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4), 333–389.
- Jeong, J. (2026). Bayesian BM25: A Probabilistic Framework for Hybrid Text and Vector Search. Zenodo. doi:10.5281/zenodo.18414940
- Jeong, J. (2026). From Bayesian Inference to Neural Computation. Zenodo. doi:10.5281/zenodo.18512411
- Cox, R. T. (1946). Probability, Frequency and Reasonable Expectation. American Journal of Physics, 14(1), 1–13.
- Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.