Vector Scores Are Not Probabilities: Likelihood Ratio Calibration for Hybrid Search

The Score That Means Nothing
A user searches for "transformer architecture" and your system returns two signals: a BM25 score of 14.7 and a cosine similarity of 0.83. Which document is more relevant?
The question is unanswerable. A BM25 score of 14.7 is an unbounded positive real derived from term frequency statistics. A cosine similarity of 0.83 is the cosine of the angle between two embedding vectors. They measure fundamentally different quantities on fundamentally different scales. Combining them requires knowing something neither score provides: the probability that the document is actually relevant.
For BM25, we solved this problem with Bayesian BM25. The sigmoid likelihood transforms BM25 scores into calibrated probabilities using Bayesian inference, completing Robertson's Probability Ranking Principle after 50 years.
But vector scores remained uncalibrated. Practitioners resort to Reciprocal Rank Fusion (RRF), which discards score magnitudes entirely, or ad-hoc normalization like min-max scaling, which creates the illusion of probabilities without their substance.
We present a principled solution: likelihood ratio calibration that derives relevance probabilities from distributional statistics that ANN indexes already compute at zero additional cost.
Why Normalization Fails
The instinct is to map vector scores to [0, 1] through some monotonic function. Min-max scaling, arctangent, or linear rescaling all produce numbers that look like probabilities. They are not.
The fundamental problem is local density blindness. Consider a document at cosine distance 0.3 from the query. In a dense region of embedding space — where many documents cluster nearby — this distance is unremarkable. The same distance in a sparse region signals strong relevance. Any fixed transformation that maps 0.3 to the same "probability" regardless of context is wrong.
This is not a minor imprecision. It is a structural error. The relevance probability of a distance depends on how that distance compares to the local distribution of distances, which varies across queries, across corpus regions, and across embedding models. No global mapping function can capture this.
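The failure is easy to demonstrate. The sketch below (synthetic data, plain Gaussian kernel density estimation; the two distance samples are invented for illustration) evaluates the corpus density of the same cosine distance, 0.3, in a dense and a sparse neighborhood: the identical distance is ordinary in one region and rare in the other, so no fixed mapping can assign it one "probability".

```python
import math

def gaussian_kde(samples, h):
    """Return a 1-D Gaussian kernel density estimator over `samples`."""
    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples) / (
            len(samples) * h * math.sqrt(2 * math.pi))
    return density

# Two synthetic corpus regions: distances from a query to its neighbors.
dense_region  = [0.25, 0.28, 0.30, 0.31, 0.33, 0.35]   # many close neighbors
sparse_region = [0.30, 0.62, 0.70, 0.75, 0.81, 0.90]   # few close neighbors

d = 0.30  # the same observed query-document distance
p_dense  = gaussian_kde(dense_region, h=0.05)(d)
p_sparse = gaussian_kde(sparse_region, h=0.05)(d)

# Under the background density, distance 0.30 is unremarkable in the dense
# region but carries strong evidence in the sparse one.
print(f"p(d=0.30) dense:  {p_dense:.3f}")
print(f"p(d=0.30) sparse: {p_sparse:.3f}")
```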
The Likelihood Ratio Framework
The correct formulation treats vector calibration as Bayesian inference over observed distances. Given a distance $d$ between query and document, Bayes' theorem yields:

$$P(R \mid d) = \frac{p(d \mid R)\, P(R)}{p(d)}$$

where $p(d \mid R)$ is the density of distance $d$ among relevant documents, $p(d)$ is the density over the full corpus, and $P(R)$ is the base-rate prior.
Converting to log-odds (using $p(d) \approx p(d \mid \bar{R})$, which holds when relevant documents are a vanishing fraction of the corpus):

$$\log \frac{P(R \mid d)}{P(\bar{R} \mid d)} \approx \log \frac{p(d \mid R)}{p(d)} + \log \frac{P(R)}{P(\bar{R})}$$

The vector evidence term $\log \frac{p(d \mid R)}{p(d)}$ is a log density ratio: the logarithm of how much more likely this distance is among relevant documents than under the general corpus. This is exactly the same mathematical pattern as IDF in text retrieval:

$$\mathrm{IDF}(t) = \log \frac{N - n_t + 0.5}{n_t + 0.5}$$
IDF is a log likelihood ratio over term occurrence distributions derived from the inverted index. Vector calibration is a log likelihood ratio over distance distributions derived from the ANN index. They are two instances of the same abstract pattern.
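The shared pattern can be made concrete. In this sketch the two densities `p_rel` and `p_corpus` are illustrative Gaussian stand-ins (in practice they would come from index statistics, as the next section describes); the IDF formula is the standard BM25 smoothed variant.

```python
import math
from statistics import NormalDist

def idf(df, n_docs):
    """BM25-style IDF: a log likelihood ratio over term occurrence counts."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))

def vector_evidence(d, p_rel, p_corpus):
    """Log density ratio: how much more likely distance d is among
    relevant documents than under the corpus background."""
    return math.log(p_rel(d) / p_corpus(d))

# Illustrative stand-in densities: relevant documents cluster near distance
# 0.3, while the corpus background sits farther out.
p_rel    = NormalDist(mu=0.3, sigma=0.1).pdf
p_corpus = NormalDist(mu=0.7, sigma=0.2).pdf

print(idf(df=50, n_docs=100_000))             # rare term: positive evidence
print(vector_evidence(0.3, p_rel, p_corpus))  # close distance: positive evidence
print(vector_evidence(0.9, p_rel, p_corpus))  # far distance: negative evidence
```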
The Index Already Knows
The elegant resolution to the question of computational cost: ANN indexes already compute the statistics needed for calibration. They have to, because these statistics are byproducts of operations the index must perform anyway.
During HNSW index construction and search traversal, the following are naturally available:
- Global distance distribution: Mean and variance of edge distances across layers characterize the corpus-wide background distribution $p(d)$.
- Search trajectory distances: The sequence of distances to nodes visited during greedy search provides a local density signal.
- Neighborhood statistics: Distance distributions at the result layer characterize the local region around the query.
- Hop count as density proxy: The number of distance computations during search correlates with local density — more hops indicate sparser regions.
The HNSW index is not merely an algorithmic structure for approximate nearest neighbor search. It is an implicit statistical model that encodes the corpus's distance distribution. Calibration extracts this implicit knowledge and converts it into explicit relevance probabilities.
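To make the "byproduct statistics" concrete, here is a toy best-first search over a single-layer neighbor graph. It is not a real HNSW implementation (no layers, no neighbor-pruning heuristics; the graph construction and `ef` parameter are simplified stand-ins), but it shows that the search trajectory distances and the hop count fall out of traversal at no extra cost.

```python
import heapq
import math
import random

def greedy_search(graph, vectors, query, entry, ef=8):
    """Best-first search over an HNSW-style neighbor graph that also records
    the statistics calibration needs: every distance computed along the way
    (the search trajectory) and the hop count, a proxy for local density."""
    visited = {entry}
    d0 = math.dist(vectors[entry], query)
    candidates = [(d0, entry)]      # min-heap: closest frontier node first
    results = [(-d0, entry)]        # max-heap (negated): worst kept node on top
    trajectory = [d0]               # all distances computed during the search
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0] and len(results) >= ef:
            break                   # frontier is worse than the worst result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = math.dist(vectors[nb], query)
            trajectory.append(dn)
            heapq.heappush(candidates, (dn, nb))
            heapq.heappush(results, (-dn, nb))
            if len(results) > ef:
                heapq.heappop(results)
    top = sorted((-nd, n) for nd, n in results)
    return top, trajectory, len(trajectory)

# Toy corpus: random 2-D points linked to their 4 nearest neighbors.
random.seed(0)
vectors = {i: (random.random(), random.random()) for i in range(20)}
graph = {i: sorted((j for j in range(20) if j != i),
                   key=lambda j, i=i: math.dist(vectors[i], vectors[j]))[:4]
         for i in range(20)}

top, trajectory, hops = greedy_search(graph, vectors, (0.5, 0.5), entry=0)
print(top[0], hops)   # nearest node found, and the density-proxy hop count
```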
Breaking the Circularity
A subtle problem remains: estimating $p(d \mid R)$ requires knowing which documents are relevant, which is the very problem retrieval is trying to solve. This is circular.
The resolution leverages cross-modal conditional independence. If we have an external relevance signal $s$, such as a BM25 score, that is conditionally independent of vector distance given true relevance:

$$p(d, s \mid R) = p(d \mid R)\, p(s \mid R)$$

then we can use it to break the circularity.
BM25-derived relevance probabilities become importance weights in kernel density estimation for the local distribution:

$$\hat{p}(d \mid R) = \frac{\sum_i w_i \, K_h(d - d_i)}{\sum_i w_i}, \qquad w_i = P(R \mid s_i)$$
Documents with higher BM25 relevance probability contribute more to the local density estimate. The assumption is reasonable: BM25 captures lexical match, vector distance captures semantic similarity — different aspects of relevance that are conditionally independent given a document's true relevance status.
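An importance-weighted Gaussian KDE is a few lines. In this sketch the distances and BM25-derived probabilities are invented for illustration; contrasting it with an unweighted KDE shows how the weights shift mass toward the distances of likely-relevant documents.

```python
import math

def weighted_kde(distances, weights, h):
    """Importance-weighted Gaussian KDE: documents with higher BM25
    relevance probability contribute more to the p(d | R) estimate."""
    z = sum(weights)
    def density(x):
        return sum(w * math.exp(-0.5 * ((x - d) / h) ** 2)
                   for d, w in zip(distances, weights)) / (
            z * h * math.sqrt(2 * math.pi))
    return density

# Candidate pool: vector distances with BM25-derived relevance probabilities
# (illustrative values; in practice P(R | s) comes from a calibrated BM25 model).
distances  = [0.22, 0.30, 0.35, 0.55, 0.70, 0.85]
p_rel_bm25 = [0.90, 0.80, 0.60, 0.20, 0.10, 0.05]

p_d_given_R = weighted_kde(distances, p_rel_bm25, h=0.08)
uniform     = weighted_kde(distances, [1.0] * 6, h=0.08)  # plain KDE, for contrast

# Upweighting likely-relevant (close) documents shifts mass toward small d.
print(f"at d=0.25: {p_d_given_R(0.25):.2f} weighted vs {uniform(0.25):.2f} uniform")
```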
When no external signal exists, fallback strategies apply: distance gap detection through kernel density estimation, index-derived density priors from cell populations, or cross-calibration between independent embedding models.
Additive Fusion in Log-Odds Space
With both BM25 and vector scores calibrated as log likelihood ratios, fusion becomes addition:

$$\log \frac{P(R \mid s, d)}{P(\bar{R} \mid s, d)} = \log \frac{P(R)}{P(\bar{R})} + \log \frac{p(s \mid R)}{p(s)} + \log \frac{p(d \mid R)}{p(d)}$$
Each signal contributes its own evidence term, independently calibrated through its native index. No retuning of weights is needed when adding new signals. No ad-hoc normalization. No RRF. Just Bayesian evidence accumulation.
This extends beyond two signals. Any conditionally independent relevance signal — click-through data, knowledge graph proximity, graph centrality, temporal recency — becomes a first-class evidence term that simply adds to the log-odds sum.
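The fusion step itself is a one-liner. The evidence values and the 1% base rate below are illustrative placeholders standing in for independently calibrated signals; the point is that adding a third signal changes nothing about the machinery.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse(prior_log_odds, *evidence):
    """Bayesian evidence accumulation: independently calibrated log
    likelihood ratios simply add in log-odds space."""
    return sigmoid(prior_log_odds + sum(evidence))

# Illustrative values: a rare-relevance prior plus two calibrated signals.
prior   = math.log(0.01 / 0.99)   # P(R) = 1% base rate, as log-odds
bm25_ev = 3.2                     # log p(s | R) / p(s), from the text side
vec_ev  = 2.7                     # log p(d | R) / p(d), from the ANN index

p = fuse(prior, bm25_ev, vec_ev)
print(f"P(R | bm25, vector) = {p:.3f}")

# A third conditionally independent signal needs no reweighting of the others:
click_ev = 1.1
print(f"with clicks: {fuse(prior, bm25_ev, vec_ev, click_ev):.3f}")
```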
The Structural Duality
The deepest insight is not a technique but a unification. IDF in the inverted index and density ratios in the ANN index are not two different ideas. They are two instances of the same mathematical pattern: log likelihood ratios over native index statistics.
| Paradigm | Index | Statistic | Evidence |
|---|---|---|---|
| Text (BM25) | Inverted index | Document frequency $n_t$ | $\log \frac{N - n_t + 0.5}{n_t + 0.5}$ |
| Vector | ANN index (HNSW) | Distance density $p(d)$ | $\log \frac{p(d \mid R)}{p(d)}$ |
Both indexes exist for algorithmic reasons — fast retrieval. Both implicitly encode statistical models of their respective signal spaces. Calibration is the act of making this implicit model explicit.
This duality has a further consequence: the next step in the theoretical progression shows that combining multiple calibrated probability signals through log-odds conjunction produces the exact computation graph of a feedforward neural network — derived entirely from first principles rather than designed.
What This Means in Practice
For systems that need hybrid search — and increasingly, most search systems do — vector score calibration provides:
- Principled fusion that replaces RRF and ad-hoc normalization with Bayesian evidence
- Zero additional cost because calibration statistics come from the index itself
- No hyperparameters beyond bandwidth selection, which follows established rules (Silverman's rule with dimension-aware scaling)
- Compositionality where any number of conditionally independent signals combine additively
- Theoretical completeness that closes the gap between text and vector retrieval, unifying both under the same probabilistic framework
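The one remaining knob, KDE bandwidth, follows a standard rule of thumb. A sketch of Silverman's rule with the usual $n^{-1/(\dim + 4)}$ dimension-aware exponent (the sample distances below are illustrative):

```python
import statistics

def silverman_bandwidth(samples, dim=1):
    """Silverman's rule of thumb for KDE bandwidth, with the standard
    n^(-1/(dim+4)) dimension-aware exponent."""
    n = len(samples)
    std = statistics.stdev(samples)
    q1, _, q3 = statistics.quantiles(samples, n=4)   # quartile cut points
    iqr = q3 - q1
    spread = min(std, iqr / 1.34) if iqr > 0 else std
    return 0.9 * spread * n ** (-1.0 / (dim + 4))

distances = [0.21, 0.25, 0.30, 0.34, 0.41, 0.47, 0.55, 0.63, 0.72, 0.88]
h = silverman_bandwidth(distances)
print(f"bandwidth: {h:.3f}")
```

Note that the exponent flattens as dimension grows, so high-dimensional distance samples get a wider bandwidth than the 1-D default.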
The probability that a document is relevant given its vector similarity is not a number you choose. It is a quantity you derive — from the same index you already built to find it.