Chapter 22: Vector Score Calibration

22.1 Introduction

Chapter 20 established Bayesian BM25 as a framework for transforming lexical scores into calibrated probabilities. Chapter 21 covered HNSW-based vector search, which produces similarity scores — cosine similarity, inner product, or Euclidean distance — that rank documents by semantic proximity.

This chapter addresses a fundamental question: how do we transform vector similarity scores into calibrated relevance probabilities? A cosine similarity of 0.85 does not mean an 85% chance of relevance. The answer requires a likelihood ratio framework that exploits the distributional statistics already computed during ANN index construction and search.

22.1.1 The Vector Score Interpretation Problem

Vector similarity scores suffer from four interpretability limitations:

  1. Not Probabilities: A cosine similarity $s \in [-1, 1]$ is a geometric quantity — the cosine of the angle between two vectors — not a probability of relevance.
  2. Distribution Dependence: Score distributions vary with the embedding model, corpus, and query distribution. A score of 0.7 may be highly discriminative in one corpus and uninformative in another.
  3. Local Density Variation: The same similarity score carries different information in dense versus sparse regions of the embedding space. In a dense cluster, a nearby document may be unremarkable; in a sparse region, the same distance implies strong relevance.
  4. Scale Incompatibility: Direct combination with calibrated lexical scores (e.g., Bayesian BM25 probabilities from Chapter 20) is unprincipled without shared probabilistic semantics.

22.1.2 The Normalization Illusion

Rescaling vector scores to $[0, 1]$ — for instance via $(1 + \cos\theta) / 2$ — creates the appearance of probabilities without their substance. Common normalization methods include:

  • Min-max normalization: $p = \frac{s - s_{\min}}{s_{\max} - s_{\min}}$
  • Arctangent normalization: $p = \frac{2}{\pi} \arctan(\alpha \cdot s)$
  • Linear rescaling: $p = \frac{1 + s}{2}$ for $s \in [-1, 1]$
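
As a concrete sketch, the three transformations can be written as fixed functions of the score alone (the function names here are illustrative):

```python
import math

def min_max_normalize(s, s_min, s_max):
    # Linear map from the observed score range onto [0, 1]
    return (s - s_min) / (s_max - s_min)

def arctan_normalize(s, alpha=1.0):
    # Squashes the score through arctan; alpha controls steepness
    return (2.0 / math.pi) * math.atan(alpha * s)

def linear_rescale(s):
    # Cosine similarity in [-1, 1] mapped linearly onto [0, 1]
    return (1.0 + s) / 2.0
```

Each function consumes only the score itself: no matter which query or corpus produced $s$, the output is the same. This is precisely the inadequacy the theorem below formalizes.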

Theorem (Normalization Inadequacy). All query-independent normalization functions fail to account for the local density structure of the embedding space. For any fixed monotonic transformation $g: \mathbb{R} \to [0, 1]$:

$$g(s_1) = g(s_2) \implies s_1 = s_2$$

but the true relevance probabilities may differ:

$$P(R = 1 \mid s, \text{dense region}) \neq P(R = 1 \mid s, \text{sparse region})$$

even when both documents have the same similarity score.

Proof. A fixed transformation depends only on the score $s$ and is blind to the local density. Two documents equidistant from a query carry different relevance information depending on whether they are in a region with many nearby documents (dense) or few (sparse). No query-independent function can capture this distinction. $\square$

22.1.3 Structural Parallel with Lexical Retrieval

Bayesian BM25 (Chapter 20) calibrates lexical scores using statistics from the inverted index: document frequency, term frequency, and average document length — all computed at index time. The IDF component is itself a log likelihood ratio:

$$\text{IDF}(t) = \log \frac{N - \text{df}(t) + 0.5}{\text{df}(t) + 0.5}$$

This is the log ratio of the probability of not containing term $t$ to the probability of containing it — a density ratio over the term occurrence distribution.

ANN indexes (IVF, HNSW) similarly compute and store distributional statistics during construction and search. If lexical calibration exploits inverted index statistics, vector calibration should exploit ANN index statistics. The mathematical structure — a likelihood ratio over corpus distributions — is identical in both cases.

22.2 Likelihood Ratio Calibration

22.2.1 Distance Orientation Convention

Throughout this chapter, $d$ denotes a distance-like quantity where smaller values indicate greater similarity:

  • For cosine similarity $s \in [-1, 1]$: $d = 1 - s \in [0, 2]$
  • For Euclidean distance: $d$ is used directly
  • For inner product on normalized vectors: $d = 1 - \langle q, x \rangle$
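
These conventions can be collected into a small helper. A minimal sketch; the metric names are illustrative rather than tied to any particular library:

```python
def to_distance(score: float, metric: str) -> float:
    """Map a raw score to the chapter's convention: smaller d = more similar."""
    if metric == "cosine":
        # s in [-1, 1]  ->  d = 1 - s in [0, 2]
        return 1.0 - score
    if metric == "euclidean":
        # Already distance-oriented
        return score
    if metric == "inner_product":
        # Assumes unit-normalized vectors, so <q, x> behaves like cosine
        return 1.0 - score
    raise ValueError(f"unknown metric: {metric}")
```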

22.2.2 The Posterior in Log-Odds Form

Given observed distance $d$ between query and document vectors, the posterior probability of relevance follows from Bayes' theorem:

$$P(R = 1 \mid d) = \frac{f_R(d) \cdot P(R = 1)}{f_R(d) \cdot P(R = 1) + f_G(d) \cdot P(R = 0)}$$

where:

  • $f_R(d)$ is the probability density of distance $d$ among relevant documents (local distribution)
  • $f_G(d)$ is the probability density of distance $d$ in the full corpus (global/background distribution)

Converting to log-odds:

$$\text{logit}\,P(R = 1 \mid d) = \underbrace{\log \frac{f_R(d)}{f_G(d)}}_{\text{vector evidence}} + \underbrace{\text{logit}\,P_0}_{\text{prior}}$$

The vector evidence is the log density ratio — how much more likely the observed distance is under the relevant distribution versus the background distribution.
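
The log-odds form translates directly into code. A minimal sketch, assuming the two densities at the observed distance and the prior are already available:

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def posterior_relevance(f_r: float, f_g: float, p0: float) -> float:
    """P(R=1 | d) = sigmoid(log(f_R(d) / f_G(d)) + logit(P0))."""
    return sigmoid(math.log(f_r / f_g) + logit(p0))
```

When $f_R(d) = f_G(d)$ the evidence term vanishes and the posterior reduces to the prior, as expected.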

22.2.3 Vector Evidence

We define the vector evidence function:

$$\text{ev}_{\text{vec}}(d) = \log \frac{f_R(d)}{f_G(d)}$$

This has a natural interpretation:

  • $\text{ev}_{\text{vec}}(d) > 0$: distance $d$ is more likely for relevant documents — evidence for relevance
  • $\text{ev}_{\text{vec}}(d) = 0$: distance $d$ is equally likely for relevant and non-relevant documents — no evidence
  • $\text{ev}_{\text{vec}}(d) < 0$: distance $d$ is more likely for non-relevant documents — evidence against relevance

The vector evidence is structurally identical to the IDF-based evidence in BM25: both are log likelihood ratios over their respective index distributions. This is not a coincidence — it reflects the Neyman-Pearson lemma, which states that the likelihood ratio is a sufficient statistic for binary classification.

22.2.4 Distribution-Free Formulation

The framework makes no assumptions about the parametric form of $f_R$ or $f_G$. The densities can be Gaussian, uniform, or any other distribution — only their ratio matters. This generality is important because distance distributions vary across embedding models, distance metrics, and corpus characteristics.

22.2.5 Concentration of Measure

In high-dimensional spaces, the background distribution of distances concentrates around a characteristic value. For random unit vectors in $\mathbb{R}^d$, the cosine similarity distribution converges to:

$$f_G(s) \to \mathcal{N}\!\left(0, \frac{1}{d}\right) \quad \text{as } d \to \infty$$

This concentration means $f_G$ becomes increasingly well-characterized with dimensionality — a favorable property for calibration, since the background distribution stabilizes and can be estimated with high confidence from relatively few samples.
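
The concentration effect is easy to check empirically. A quick simulation (dimension and sample size chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 256, 20000

# Pairs of independent random unit vectors in R^dim
a = rng.standard_normal((n, dim))
b = rng.standard_normal((n, dim))
a /= np.linalg.norm(a, axis=1, keepdims=True)
b /= np.linalg.norm(b, axis=1, keepdims=True)
sims = np.sum(a * b, axis=1)  # pairwise cosine similarities

# Concentration: mean near 0, standard deviation near 1/sqrt(dim)
print(sims.mean(), sims.std(), 1 / np.sqrt(dim))
```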

22.3 Breaking Circularity via Cross-Modal Estimation

22.3.1 The Circularity Problem

Estimating the vector evidence requires knowing $f_R(d)$ — the distance distribution among relevant documents. But identifying which documents are relevant is precisely the retrieval problem we are trying to solve. This creates a circularity:

  1. To calibrate vector scores, we need $f_R(d)$
  2. To estimate $f_R(d)$, we need to know which documents are relevant
  3. To know which documents are relevant, we need calibrated scores

Naive approaches — using the top-$k$ results as "relevant" — introduce confirmation bias: the calibration would merely reinforce whatever the uncalibrated scores already believe.

22.3.2 Conditional Independence Assumption

The key insight: if we have an external relevance signal that is conditionally independent of vector distance given true relevance, we can use it to break the circularity.

Assumption (Cross-Modal Conditional Independence). For vector distance $D$ and external signal $W$ (e.g., BM25 score):

$$P(D, W \mid R) = P(D \mid R) \cdot P(W \mid R)$$

This assumes that given a document's true relevance status, knowing its BM25 score provides no additional information about its vector distance (and vice versa). The assumption is reasonable when the signals capture different aspects of relevance — lexical match versus semantic similarity.

Under this assumption, the BM25-derived relevance probability $P(R \mid W)$ can serve as importance weights for estimating $f_R$.

22.3.3 Importance-Weighted Kernel Density Estimation

Given $K$ nearest neighbors with distances $d_1, \ldots, d_K$ and external relevance weights $w_1, \ldots, w_K$ (e.g., Bayesian BM25 probabilities from Chapter 20), the local distribution is estimated by:

$$\hat{f}_R(d) = \frac{1}{\sum_i w_i} \sum_{i=1}^{K} w_i \cdot \mathcal{K}_h(d - d_i)$$

where $\mathcal{K}_h$ is a kernel function with bandwidth $h$. Documents with higher BM25 relevance probability contribute more to the local density estimate — exactly the weighting needed to approximate $f_R$ without knowing the true relevance labels.
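
A minimal NumPy sketch with a Gaussian kernel (the function name and signature are illustrative):

```python
import numpy as np

def weighted_kde(d_eval, distances, weights, h):
    """Importance-weighted Gaussian KDE estimate of f_R at d_eval."""
    d = np.asarray(distances, dtype=float)
    w = np.asarray(weights, dtype=float)
    u = (d_eval - d) / h
    kernel = np.exp(-0.5 * u ** 2) / (h * np.sqrt(2.0 * np.pi))
    # Neighbors with higher external relevance weight contribute more
    return float(np.sum(w * kernel) / np.sum(w))
```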

22.3.4 Bandwidth Selection

The bandwidth $h$ controls the smoothness of the density estimate. Following the weighted Silverman's rule:

$$h^* = \left(\frac{4\hat{\sigma}^5}{3K_{\text{eff}}}\right)^{1/5}$$

where $\hat{\sigma}$ is the weighted standard deviation of the distances and $K_{\text{eff}} = (\sum_i w_i)^2 / \sum_i w_i^2$ is the effective sample size.

For high-dimensional embedding spaces ($d > 100$), the bandwidth should be scaled by a factor of $d^{-1/(d+4)}$ to account for the curse of dimensionality.
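
The rule above, including the effective sample size and the optional high-dimensional correction, can be sketched as:

```python
import numpy as np

def silverman_bandwidth(distances, weights, embed_dim=None):
    """Weighted Silverman's rule: h* = (4*sigma^5 / (3*K_eff))^(1/5).
    embed_dim, if given, applies the d^(-1/(d+4)) high-dimensional scaling."""
    d = np.asarray(distances, dtype=float)
    w = np.asarray(weights, dtype=float)
    mu = np.average(d, weights=w)
    sigma = np.sqrt(np.average((d - mu) ** 2, weights=w))
    k_eff = np.sum(w) ** 2 / np.sum(w ** 2)  # effective sample size
    h = (4.0 * sigma ** 5 / (3.0 * k_eff)) ** 0.2
    if embed_dim is not None:
        h *= embed_dim ** (-1.0 / (embed_dim + 4))
    return float(h)
```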

22.3.5 Estimation Without External Signals

When no external signal is available (pure vector search without a lexical index), three fallback strategies exist:

Distance Gap Detection: Relevant documents tend to cluster at smaller distances, creating a gap between the relevant and non-relevant distance distributions. Finding this gap — via kernel density estimation on the distance distribution and identifying the first local minimum — provides a natural threshold for separating the two populations.
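
Under the assumption that the two populations are separated enough to produce a dip in the density, the gap finder can be sketched as follows (function and parameter names are mine):

```python
import numpy as np

def distance_gap_threshold(distances, bandwidth, grid_size=256):
    """First local minimum of an unweighted Gaussian KDE over the distances.
    Returns None when the density has no interior dip."""
    d = np.sort(np.asarray(distances, dtype=float))
    grid = np.linspace(d[0], d[-1], grid_size)
    u = (grid[:, None] - d[None, :]) / bandwidth
    density = np.exp(-0.5 * u ** 2).sum(axis=1)
    for i in range(1, grid_size - 1):
        if density[i] < density[i - 1] and density[i] <= density[i + 1]:
            return float(grid[i])
    return None
```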

Index-Derived Density Priors: IVF cell populations serve as a proxy for local density. A query falling in a sparsely populated cell has fewer candidate relevant documents, while dense cells may contain many. The cell-aware base rate adjusts the prior:

$$P_{\text{base}}^{(j)} = P_{\text{base}} \cdot \frac{N/C}{n_j}$$

where $n_j$ is the population of cell $j$, $C$ is the total number of cells, and $N$ is the total vector count.
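
As a one-line sketch of this adjustment (a production version would clamp the result into $(0, 1)$):

```python
def cell_aware_prior(p_base, n_j, total_vectors, num_cells):
    """Scale the corpus prior by average cell size over this cell's size."""
    avg_cell_population = total_vectors / num_cells
    return p_base * avg_cell_population / n_j
```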

Multi-Model Cross-Calibration: If two independent embedding models are available, each can serve as the external signal for calibrating the other, breaking the circularity without any lexical signal.

22.4 Parametric Estimation via Gaussian Mixture Models

22.4.1 Two-Component Distance Mixture

As an alternative to kernel density estimation, the distance distribution can be modeled as a two-component Gaussian mixture:

$$f(d) = w_R \cdot \mathcal{N}(d \mid \mu_R, \sigma_R^2) + (1 - w_R) \cdot \mathcal{N}(d \mid \mu_G, \sigma_G^2)$$

where:

  • $(\mu_R, \sigma_R^2)$ are the mean and variance of the relevant document distances (small $\mu_R$)
  • $(\mu_G, \sigma_G^2)$ are the mean and variance of the background distances
  • $w_R$ is the mixing weight (proportion of relevant documents)

22.4.2 EM Algorithm with Informed Initialization

Standard EM for Gaussian mixtures is sensitive to initialization. Using external relevance weights from BM25 provides informed initialization:

```python
import numpy as np

def fit_distance_mixture(distances, weights, max_iterations=100, tol=1e-6):
    """EM for the two-component distance mixture, with informed
    initialization from external relevance weights (e.g., BM25
    probabilities). Returns (mu_R, sigma_R, mu_G, sigma_G, w_R)."""
    d = np.asarray(distances, dtype=float)
    w = np.asarray(weights, dtype=float)

    # Informed initialization from the external relevance weights
    mu_R = np.average(d, weights=w)
    sigma_R = np.sqrt(np.average((d - mu_R) ** 2, weights=w))
    mu_G, sigma_G = d.mean(), d.std()
    w_R = w.mean()

    def normal_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    for _ in range(max_iterations):
        # E-step: posterior responsibility of the relevant component
        num = w_R * normal_pdf(d, mu_R, sigma_R)
        gamma = num / (num + (1.0 - w_R) * normal_pdf(d, mu_G, sigma_G))

        # M-step: update the relevant component only (background held fixed)
        w_R_new = gamma.mean()
        mu_R = np.sum(gamma * d) / np.sum(gamma)
        sigma_R = np.sqrt(np.sum(gamma * (d - mu_R) ** 2) / np.sum(gamma))

        if abs(w_R_new - w_R) < tol:
            w_R = w_R_new
            break
        w_R = w_R_new

    return mu_R, sigma_R, mu_G, sigma_G, w_R
```
The background component $(\mu_G, \sigma_G)$ is held fixed because it represents the corpus-level distance distribution, which is stable and well-characterized by the ANN index statistics.

22.4.3 Nonparametric vs. Parametric Comparison

| Aspect | KDE (Section 22.3) | GMM (Section 22.4) |
|---|---|---|
| Assumptions | Distribution-free | Gaussian components |
| Sample efficiency | Needs more neighbors | Fewer neighbors suffice |
| Boundary handling | Natural | Gaussian tails may leak |
| Computation | $O(K)$ per query point | EM iterations needed |
| Adaptability | Adapts to local density | Fixed parametric form |
| Recommended for | Large $K$, complex distributions | Small $K$, well-separated clusters |

22.5 Index-Aware Statistics Extraction

22.5.1 The Zero Additional Cost Principle

ANN indexes already compute and store distributional statistics during construction and search. Extracting these statistics for calibration requires no additional computation — they are byproducts of operations that must be performed anyway.

22.5.2 IVF Index Statistics

The Inverted File (IVF) index partitions vectors into $C$ cells (Voronoi regions) around centroids. During construction and search, the following statistics are available:

Global statistics (computed at index construction time):

  • Total vector count $N$
  • Number of cells $C$
  • Global distance distribution moments ($\mu_G$, $\sigma_G$)

Local statistics (available during search):

  • Cell population $n_j$ for the query's assigned cell
  • Within-cell distance distribution (distances from cell centroid to members)
  • Query-to-centroid distance

Density proxy:

  • Cell population as relevance density signal: sparse cells (low $n_j$) suggest the query is in an unusual region; dense cells suggest many potentially relevant documents

22.5.3 HNSW Index Statistics

The Hierarchical Navigable Small World graph maintains navigable layers with edges representing neighborhood relationships. During search traversal, the following statistics are available:

Global statistics (computed at construction time):

  • Mean edge distance per layer
  • Edge distance distribution parameters

Local statistics (available during search):

  • Search trajectory distances (distances to nodes visited during greedy search)
  • Neighborhood distances at the result layer
  • Number of distance computations per search

Density proxy:

  • Search trajectory length (number of hops to reach a result): longer trajectories suggest the query is far from the graph's dense regions

22.5.4 Unified Statistics Mapping

Both IVF and HNSW provide the same abstract statistics needed for calibration:

| Calibration Need | IVF Source | HNSW Source |
|---|---|---|
| $f_G$ (global density) | Cross-cell distance distribution | Edge distance distribution |
| Local density estimate | Cell population $n_j$ | Search trajectory length |
| Background moments | Cell centroid distances | Layer-0 edge statistics |
| Anomaly detection | Query-to-centroid distance | Hop count to nearest neighbor |

The mathematical framework is index-agnostic — only the source of the statistics changes.

22.6 Unified Hybrid Search Fusion

22.6.1 The Complete Log-Odds Decomposition

Combining the vector calibration (this chapter) with the Bayesian BM25 calibration (Chapter 20), the full posterior for a document given both lexical and semantic evidence is:

$$\text{logit}\,P(R = 1 \mid s_{\text{bm25}}, d_{\text{vec}}) = \underbrace{\log \frac{\hat{f}_R(d_{\text{vec}})}{f_G(d_{\text{vec}})}}_{\text{calibrated vector evidence}} + \underbrace{\alpha(s_{\text{bm25}} - \beta)}_{\text{calibrated lexical evidence}} + \underbrace{\text{logit}\,P_{\text{base}}}_{\text{corpus prior}}$$

Each term contributes independent Bayesian evidence in log-odds space, where updates are naturally additive. This is the principled alternative to Reciprocal Rank Fusion (RRF) and weighted score combination — every signal is calibrated through the same likelihood ratio structure, each drawing on the statistics of its native index.
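
A sketch of the fused score, assuming the lexical calibration parameters $\alpha$ and $\beta$ come from the Chapter 20 fit and the two densities are evaluated at the observed vector distance:

```python
import math

def fused_relevance(f_r_hat, f_g, s_bm25, alpha, beta, p_base):
    """Combine calibrated vector and lexical evidence in log-odds space."""
    vector_evidence = math.log(f_r_hat / f_g)
    lexical_evidence = alpha * (s_bm25 - beta)
    prior = math.log(p_base / (1.0 - p_base))
    log_odds = vector_evidence + lexical_evidence + prior
    return 1.0 / (1.0 + math.exp(-log_odds))
```

With neutral evidence from both signals, the posterior falls back to the corpus prior; strengthening either signal moves it up independently of the other.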

22.6.2 Structural Unification

The key insight: both sparse (BM25) and dense (vector) retrieval implement the same abstract pattern:

$$\text{evidence} = \log \frac{P(\text{signal} \mid \text{relevant})}{P(\text{signal} \mid \text{non-relevant})}$$

For BM25, the signal is the term occurrence and the likelihood ratio derives from document frequency statistics in the inverted index. For vector search, the signal is the embedding distance and the likelihood ratio derives from distance distribution statistics in the ANN index.

The unification is not merely conceptual — it has a concrete computational consequence. The log-odds decomposition means that adding a new signal type (e.g., a second embedding model, a click-through signal, or a knowledge graph proximity score) requires only:

  1. Calibrating the new signal through its own likelihood ratio
  2. Adding the resulting evidence term to the log-odds sum

No retuning of existing weights is needed, because each evidence term is independently calibrated.

22.6.3 Extension to Multiple Signals

For $n$ conditionally independent signals:

$$\text{logit}\,P(R \mid s_1, \ldots, s_n) = \sum_{i=1}^{n} \log \frac{f_{R,i}(s_i)}{f_{G,i}(s_i)} + \text{logit}\,P_{\text{base}}$$

Each signal contributes its own evidence term, computed from its own index statistics. The connection to neural network structure — the fact that this computation has the form of a feedforward network — is developed in Chapter 23.
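
The additive extension is mechanical: each new calibrated signal contributes one more term to the sum. A minimal sketch:

```python
import math

def multi_signal_posterior(evidence_terms, p_base):
    """Posterior from independently calibrated log-likelihood-ratio terms."""
    log_odds = sum(evidence_terms) + math.log(p_base / (1.0 - p_base))
    return 1.0 / (1.0 + math.exp(-log_odds))
```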

22.7 Summary

Vector score calibration completes the probabilistic bridge from raw similarity scores to calibrated relevance probabilities. The key concepts covered in this chapter are:

Likelihood Ratio Foundation: Vector calibration is formulated as the ratio of local (relevant) to global (background) distance densities, grounded in Bayes' theorem and the Neyman-Pearson lemma.

Normalization Inadequacy: Fixed monotonic transformations (min-max, arctangent, linear rescaling) cannot account for local density variation in the embedding space.

Circularity Resolution: The self-referential problem of estimating the local distribution without knowing relevance labels is broken through cross-modal conditional independence — using BM25 relevance probabilities as importance weights for kernel density estimation.

Index-Aware Statistics: IVF and HNSW indexes already compute the distributional statistics needed for calibration at negligible additional cost — the index is not merely an algorithmic structure but an implicit statistical model.

Unified Log-Odds Fusion: Both sparse (BM25) and dense (vector) signals calibrate through the same likelihood ratio structure, combining additively in log-odds space. This is the principled replacement for ad-hoc fusion methods like RRF and weighted combination.

Structural Duality: IDF in the inverted index and density ratios in the ANN index are instances of the same mathematical pattern — log likelihood ratios over native index statistics.

The next chapter reveals a surprising consequence of this unified framework: when multiple calibrated probability signals are combined through the log-odds conjunction, the resulting computation has the exact structure of a feedforward neural network — derived from first principles rather than designed.

References

  1. Jeong, J. (2026). Vector Scores as Likelihood Ratios: Index-Derived Bayesian Calibration for Hybrid Search. Zenodo preprint.
  2. Jeong, J. (2026). Bayesian BM25: A Probabilistic Framework for Hybrid Text and Vector Search. Zenodo preprint.
  3. Jeong, J. (2026). From Bayesian Inference to Neural Computation. Zenodo preprint.
  4. Karpukhin, V., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP.
  5. Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE TPAMI.
  6. Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall.
  7. Neyman, J., & Pearson, E. S. (1933). On the Problem of the Most Efficient Tests of Statistical Hypotheses. Philosophical Transactions of the Royal Society.

Copyright (c) 2023-2026 Cognica, Inc.