Chapter 22: Vector Score Calibration

22.1 Introduction

Chapter 20 established Bayesian BM25 as a framework for transforming lexical scores into calibrated probabilities. Chapter 21 covered HNSW-based vector search, which produces similarity scores — cosine similarity, inner product, or Euclidean distance — that rank documents by semantic proximity.

This chapter addresses a fundamental question: how do we transform vector similarity scores into calibrated relevance probabilities? A cosine similarity of 0.85 does not mean an 85% chance of relevance. The answer requires a likelihood ratio framework that exploits the distributional statistics already computed during ANN index construction and search.

22.1.1 The Vector Score Interpretation Problem

Vector similarity scores suffer from four interpretability limitations:

  1. Not Probabilities: A cosine similarity $s \in [-1, 1]$ is a geometric quantity — the cosine of the angle between two vectors — not a probability of relevance.
  2. Distribution Dependence: Score distributions vary with the embedding model, corpus, and query distribution. A score of 0.7 may be highly discriminative in one corpus and uninformative in another.
  3. Local Density Variation: The same similarity score carries different information in dense versus sparse regions of the embedding space. In a dense cluster, a nearby document may be unremarkable; in a sparse region, the same distance implies strong relevance.
  4. Scale Incompatibility: Direct combination with calibrated lexical scores (e.g., Bayesian BM25 probabilities from Chapter 20) is unprincipled without shared probabilistic semantics.

22.1.2 The Normalization Illusion

Rescaling vector scores to $[0, 1]$ — for instance via $(1 + \cos\theta) / 2$ — creates the appearance of probabilities without their substance. Common normalization methods include:

  • Min-max normalization: $p = \frac{s - s_{\min}}{s_{\max} - s_{\min}}$
  • Arctangent normalization: $p = \frac{2}{\pi} \arctan(\alpha \cdot s)$
  • Linear rescaling: $p = \frac{1 + s}{2}$ for $s \in [-1, 1]$
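
As a concrete sketch, the three transformations can be written as fixed functions of the score alone (the function names here are illustrative):

```python
import math

def min_max_normalize(s, s_min, s_max):
    # Linear map from the observed score range onto [0, 1]
    return (s - s_min) / (s_max - s_min)

def arctan_normalize(s, alpha=1.0):
    # Squashes the score through arctan; alpha controls steepness
    return (2.0 / math.pi) * math.atan(alpha * s)

def linear_rescale(s):
    # Cosine similarity in [-1, 1] mapped linearly onto [0, 1]
    return (1.0 + s) / 2.0
```

Each function consumes only the score itself: no matter which query or corpus produced $s$, the output is the same. This is precisely the inadequacy the theorem below formalizes.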

Theorem (Normalization Inadequacy). All query-independent normalization functions fail to account for the local density structure of the embedding space. For any fixed monotonic transformation $g: \mathbb{R} \to [0, 1]$:

$$g(s_1) = g(s_2) \implies s_1 = s_2$$

but the true relevance probabilities may differ:

$$P(R = 1 \mid s, \text{dense region}) \neq P(R = 1 \mid s, \text{sparse region})$$

even when both documents have the same similarity score.

Proof. A fixed transformation depends only on the score $s$ and is blind to the local density. Two documents equidistant from a query carry different relevance information depending on whether they are in a region with many nearby documents (dense) or few (sparse). No query-independent function can capture this distinction. $\square$

22.1.3 Structural Parallel with Lexical Retrieval

Bayesian BM25 (Chapter 20) calibrates lexical scores using statistics from the inverted index: document frequency, term frequency, and average document length — all computed at index time. The IDF component is itself a log likelihood ratio:

$$\text{IDF}(t) = \log \frac{N - \text{df}(t) + 0.5}{\text{df}(t) + 0.5}$$

This is the log ratio of the probability of not containing term $t$ to the probability of containing it — a density ratio over the term occurrence distribution.

ANN indexes (IVF, HNSW) similarly compute and store distributional statistics during construction and search. If lexical calibration exploits inverted index statistics, vector calibration should exploit ANN index statistics. The mathematical structure — a likelihood ratio over corpus distributions — is identical in both cases.

22.2 Likelihood Ratio Calibration

22.2.1 Distance Orientation Convention

Throughout this chapter, $d$ denotes a distance-like quantity where smaller values indicate greater similarity:

  • For cosine similarity $s \in [-1, 1]$: $d = 1 - s \in [0, 2]$
  • For Euclidean distance: $d$ is used directly
  • For inner product on normalized vectors: $d = 1 - \langle q, x \rangle$
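
These conventions can be collected into a small helper. A minimal sketch; the metric names are illustrative rather than tied to any particular library:

```python
def to_distance(score: float, metric: str) -> float:
    """Map a raw score to the chapter's convention: smaller d = more similar."""
    if metric == "cosine":
        # s in [-1, 1]  ->  d = 1 - s in [0, 2]
        return 1.0 - score
    if metric == "euclidean":
        # Already distance-oriented
        return score
    if metric == "inner_product":
        # Assumes unit-normalized vectors, so <q, x> behaves like cosine
        return 1.0 - score
    raise ValueError(f"unknown metric: {metric}")
```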

22.2.2 The Posterior in Log-Odds Form

Given observed distance $d$ between query and document vectors, the posterior probability of relevance follows from Bayes' theorem:

$$P(R = 1 \mid d) = \frac{f_R(d) \cdot P(R = 1)}{f_R(d) \cdot P(R = 1) + f_G(d) \cdot P(R = 0)}$$

where:

  • $f_R(d)$ is the probability density of distance $d$ among relevant documents (local distribution)
  • $f_G(d)$ is the probability density of distance $d$ in the full corpus (global/background distribution)

Converting to log-odds:

$$\text{logit}\,P(R = 1 \mid d) = \underbrace{\log \frac{f_R(d)}{f_G(d)}}_{\text{vector evidence}} + \underbrace{\text{logit}\,P_0}_{\text{prior}}$$

The vector evidence is the log density ratio — how much more likely the observed distance is under the relevant distribution versus the background distribution.
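
The log-odds form translates directly into code. A minimal sketch, assuming the two densities at the observed distance and the prior are already available:

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def posterior_relevance(f_r: float, f_g: float, p0: float) -> float:
    """P(R=1 | d) = sigmoid(log(f_R(d) / f_G(d)) + logit(P0))."""
    return sigmoid(math.log(f_r / f_g) + logit(p0))
```

When $f_R(d) = f_G(d)$ the evidence term vanishes and the posterior reduces to the prior, as expected.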

22.2.3 Vector Evidence

We define the vector evidence function:

$$\text{ev}_{\text{vec}}(d) = \log \frac{f_R(d)}{f_G(d)}$$

This has a natural interpretation:

  • $\text{ev}_{\text{vec}}(d) > 0$: distance $d$ is more likely for relevant documents — evidence for relevance
  • $\text{ev}_{\text{vec}}(d) = 0$: distance $d$ is equally likely for relevant and non-relevant documents — no evidence
  • $\text{ev}_{\text{vec}}(d) < 0$: distance $d$ is more likely for non-relevant documents — evidence against relevance

The vector evidence is structurally identical to the IDF-based evidence in BM25: both are log likelihood ratios over their respective index distributions. This is not a coincidence — it reflects the Neyman-Pearson lemma, which states that the likelihood ratio is a sufficient statistic for binary classification.

22.2.4 Distribution-Free Formulation

The framework makes no assumptions about the parametric form of $f_R$ or $f_G$. The densities can be Gaussian, uniform, or any other distribution — only their ratio matters. This generality is important because distance distributions vary across embedding models, distance metrics, and corpus characteristics.

22.2.5 Concentration of Measure

In high-dimensional spaces, the background distribution of distances concentrates around a characteristic value. For random unit vectors in $\mathbb{R}^d$, the cosine similarity distribution converges to:

$$f_G(s) \to \mathcal{N}\!\left(0, \frac{1}{d}\right) \quad \text{as } d \to \infty$$

This concentration means $f_G$ becomes increasingly well-characterized with dimensionality — a favorable property for calibration, since the background distribution stabilizes and can be estimated with high confidence from relatively few samples.
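
The concentration effect is easy to check empirically. A quick simulation (dimension and sample size chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 256, 20000

# Pairs of independent random unit vectors in R^dim
a = rng.standard_normal((n, dim))
b = rng.standard_normal((n, dim))
a /= np.linalg.norm(a, axis=1, keepdims=True)
b /= np.linalg.norm(b, axis=1, keepdims=True)
sims = np.sum(a * b, axis=1)  # pairwise cosine similarities

# Concentration: mean near 0, standard deviation near 1/sqrt(dim)
print(sims.mean(), sims.std(), 1 / np.sqrt(dim))
```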

22.3 Breaking Circularity via Cross-Modal Estimation

22.3.1 The Circularity Problem

Estimating the vector evidence requires knowing $f_R(d)$ — the distance distribution among relevant documents. But identifying which documents are relevant is precisely the retrieval problem we are trying to solve. This creates a circularity:

  1. To calibrate vector scores, we need $f_R(d)$
  2. To estimate $f_R(d)$, we need to know which documents are relevant
  3. To know which documents are relevant, we need calibrated scores

Naive approaches — using the top-$k$ results as "relevant" — introduce confirmation bias: the calibration would merely reinforce whatever the uncalibrated scores already believe.

22.3.2 Conditional Independence Assumption

The key insight: if we have an external relevance signal that is conditionally independent of vector distance given true relevance, we can use it to break the circularity.

Assumption (Cross-Modal Conditional Independence). For vector distance $D$ and external signal $W$ (e.g., BM25 score):

$$P(D, W \mid R) = P(D \mid R) \cdot P(W \mid R)$$

This assumes that given a document's true relevance status, knowing its BM25 score provides no additional information about its vector distance (and vice versa). The assumption is reasonable when the signals capture different aspects of relevance — lexical match versus semantic similarity.

Under this assumption, the BM25-derived relevance probability $P(R \mid W)$ can serve as importance weights for estimating $f_R$.

22.3.3 Importance-Weighted Kernel Density Estimation

Given $K$ nearest neighbors with distances $d_1, \ldots, d_K$ and external relevance weights $w_1, \ldots, w_K$ (e.g., Bayesian BM25 probabilities from Chapter 20), the local distribution is estimated by:

$$\hat{f}_R(d) = \frac{1}{\sum_i w_i} \sum_{i=1}^{K} w_i \cdot \mathcal{K}_h(d - d_i)$$

where $\mathcal{K}_h$ is a kernel function with bandwidth $h$. Documents with higher BM25 relevance probability contribute more to the local density estimate — exactly the weighting needed to approximate $f_R$ without knowing the true relevance labels.
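
A minimal NumPy sketch with a Gaussian kernel (the function name and signature are illustrative):

```python
import numpy as np

def weighted_kde(d_eval, distances, weights, h):
    """Importance-weighted Gaussian KDE estimate of f_R at d_eval."""
    d = np.asarray(distances, dtype=float)
    w = np.asarray(weights, dtype=float)
    u = (d_eval - d) / h
    kernel = np.exp(-0.5 * u ** 2) / (h * np.sqrt(2.0 * np.pi))
    # Neighbors with higher external relevance weight contribute more
    return float(np.sum(w * kernel) / np.sum(w))
```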

22.3.4 Bandwidth Selection

The bandwidth $h$ controls the smoothness of the density estimate. Following the weighted Silverman's rule:

$$h^* = \left(\frac{4\hat{\sigma}^5}{3K_{\text{eff}}}\right)^{1/5}$$

where $\hat{\sigma}$ is the weighted standard deviation of the distances and $K_{\text{eff}} = (\sum_i w_i)^2 / \sum_i w_i^2$ is the effective sample size.

For high-dimensional embedding spaces ($d > 100$), the bandwidth should be scaled by a factor of $d^{-1/(d+4)}$ to account for the curse of dimensionality.
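
The rule above, including the effective sample size and the optional high-dimensional correction, can be sketched as:

```python
import numpy as np

def silverman_bandwidth(distances, weights, embed_dim=None):
    """Weighted Silverman's rule: h* = (4*sigma^5 / (3*K_eff))^(1/5).
    embed_dim, if given, applies the d^(-1/(d+4)) high-dimensional scaling."""
    d = np.asarray(distances, dtype=float)
    w = np.asarray(weights, dtype=float)
    mu = np.average(d, weights=w)
    sigma = np.sqrt(np.average((d - mu) ** 2, weights=w))
    k_eff = np.sum(w) ** 2 / np.sum(w ** 2)  # effective sample size
    h = (4.0 * sigma ** 5 / (3.0 * k_eff)) ** 0.2
    if embed_dim is not None:
        h *= embed_dim ** (-1.0 / (embed_dim + 4))
    return float(h)
```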

22.3.5 Estimation Without External Signals

When no external signal is available (pure vector search without a lexical index), three fallback strategies exist:

Distance Gap Detection: Relevant documents tend to cluster at smaller distances, creating a gap between the relevant and non-relevant distance distributions. Finding this gap — via kernel density estimation on the distance distribution and identifying the first local minimum — provides a natural threshold for separating the two populations.
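
Under the assumption that the two populations are separated enough to produce a dip in the density, the gap finder can be sketched as follows (function and parameter names are mine):

```python
import numpy as np

def distance_gap_threshold(distances, bandwidth, grid_size=256):
    """First local minimum of an unweighted Gaussian KDE over the distances.
    Returns None when the density has no interior dip."""
    d = np.sort(np.asarray(distances, dtype=float))
    grid = np.linspace(d[0], d[-1], grid_size)
    u = (grid[:, None] - d[None, :]) / bandwidth
    density = np.exp(-0.5 * u ** 2).sum(axis=1)
    for i in range(1, grid_size - 1):
        if density[i] < density[i - 1] and density[i] <= density[i + 1]:
            return float(grid[i])
    return None
```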

Index-Derived Density Priors: IVF cell populations serve as a proxy for local density. A query falling in a sparsely populated cell has fewer candidate relevant documents, while dense cells may contain many. The cell-aware base rate adjusts the prior:

$$P_{\text{base}}^{(j)} = P_{\text{base}} \cdot \frac{N/C}{n_j}$$

where $n_j$ is the population of cell $j$, $C$ is the total number of cells, and $N$ is the total vector count.
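
As a one-line sketch of this adjustment (a production version would clamp the result into $(0, 1)$):

```python
def cell_aware_prior(p_base, n_j, total_vectors, num_cells):
    """Scale the corpus prior by average cell size over this cell's size."""
    avg_cell_population = total_vectors / num_cells
    return p_base * avg_cell_population / n_j
```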

Multi-Model Cross-Calibration: If two independent embedding models are available, each can serve as the external signal for calibrating the other, breaking the circularity without any lexical signal.

22.4 Parametric Estimation via Gaussian Mixture Models

22.4.1 Two-Component Distance Mixture

As an alternative to kernel density estimation, the distance distribution can be modeled as a two-component Gaussian mixture:

$$f(d) = w_R \cdot \mathcal{N}(d \mid \mu_R, \sigma_R^2) + (1 - w_R) \cdot \mathcal{N}(d \mid \mu_G, \sigma_G^2)$$

where:

  • $(\mu_R, \sigma_R^2)$ are the mean and variance of the relevant document distances (small $\mu_R$)
  • $(\mu_G, \sigma_G^2)$ are the mean and variance of the background distances
  • $w_R$ is the mixing weight (proportion of relevant documents)

22.4.2 EM Algorithm with Informed Initialization

Standard EM for Gaussian mixtures is sensitive to initialization. Using external relevance weights from BM25 provides informed initialization:

```python
import numpy as np

def fit_distance_mixture(distances, weights, max_iterations=100, tol=1e-6):
    """EM for the two-component distance mixture, with informed
    initialization from external relevance weights (e.g., BM25
    probabilities). Returns (mu_R, sigma_R, mu_G, sigma_G, w_R)."""
    d = np.asarray(distances, dtype=float)
    w = np.asarray(weights, dtype=float)

    # Informed initialization from the external relevance weights
    mu_R = np.average(d, weights=w)
    sigma_R = np.sqrt(np.average((d - mu_R) ** 2, weights=w))
    mu_G, sigma_G = d.mean(), d.std()
    w_R = w.mean()

    def normal_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    for _ in range(max_iterations):
        # E-step: posterior responsibility of the relevant component
        num = w_R * normal_pdf(d, mu_R, sigma_R)
        gamma = num / (num + (1.0 - w_R) * normal_pdf(d, mu_G, sigma_G))

        # M-step: update the relevant component only (background held fixed)
        w_R_new = gamma.mean()
        mu_R = np.sum(gamma * d) / np.sum(gamma)
        sigma_R = np.sqrt(np.sum(gamma * (d - mu_R) ** 2) / np.sum(gamma))

        if abs(w_R_new - w_R) < tol:
            w_R = w_R_new
            break
        w_R = w_R_new

    return mu_R, sigma_R, mu_G, sigma_G, w_R
```
The background component $(\mu_G, \sigma_G)$ is held fixed because it represents the corpus-level distance distribution, which is stable and well-characterized by the ANN index statistics.

22.4.3 Nonparametric vs. Parametric Comparison

| Aspect | KDE (Section 22.3) | GMM (Section 22.4) |
|---|---|---|
| Assumptions | Distribution-free | Gaussian components |
| Sample efficiency | Needs more neighbors | Fewer neighbors suffice |
| Boundary handling | Natural | Gaussian tails may leak |
| Computation | $O(K)$ per query point | EM iterations needed |
| Adaptability | Adapts to local density | Fixed parametric form |
| Recommended for | Large $K$, complex distributions | Small $K$, well-separated clusters |

22.5 Index-Aware Statistics Extraction

22.5.1 The Zero Additional Cost Principle

ANN indexes already compute and store distributional statistics during construction and search. Extracting these statistics for calibration requires no additional computation — they are byproducts of operations that must be performed anyway.

22.5.2 IVF Index Statistics

The Inverted File (IVF) index partitions vectors into $C$ cells (Voronoi regions) around centroids. During construction and search, the following statistics are available:

Global statistics (computed at index construction time):

  • Total vector count $N$
  • Number of cells $C$
  • Global distance distribution moments ($\mu_G$, $\sigma_G$)

Local statistics (available during search):

  • Cell population $n_j$ for the query's assigned cell
  • Within-cell distance distribution (distances from cell centroid to members)
  • Query-to-centroid distance

Density proxy:

  • Cell population as relevance density signal: sparse cells (low $n_j$) suggest the query is in an unusual region; dense cells suggest many potentially relevant documents

22.5.3 HNSW Index Statistics

The Hierarchical Navigable Small World graph maintains navigable layers with edges representing neighborhood relationships. During search traversal, the following statistics are available:

Global statistics (computed at construction time):

  • Mean edge distance per layer
  • Edge distance distribution parameters

Local statistics (available during search):

  • Search trajectory distances (distances to nodes visited during greedy search)
  • Neighborhood distances at the result layer
  • Number of distance computations per search

Density proxy:

  • Search trajectory length (number of hops to reach a result): longer trajectories suggest the query is far from the graph's dense regions

22.5.4 Unified Statistics Mapping

Both IVF and HNSW provide the same abstract statistics needed for calibration:

| Calibration Need | IVF Source | HNSW Source |
|---|---|---|
| $f_G$ (global density) | Cross-cell distance distribution | Edge distance distribution |
| Local density estimate | Cell population $n_j$ | Search trajectory length |
| Background moments | Cell centroid distances | Layer-0 edge statistics |
| Anomaly detection | Query-to-centroid distance | Hop count to nearest neighbor |

The mathematical framework is index-agnostic — only the source of the statistics changes.

22.6 Unified Hybrid Search Fusion

22.6.1 The Complete Log-Odds Decomposition

Combining the vector calibration (this chapter) with the Bayesian BM25 calibration (Chapter 20), the full posterior for a document given both lexical and semantic evidence is:

$$\text{logit}\,P(R = 1 \mid s_{\text{bm25}}, d_{\text{vec}}) = \underbrace{\log \frac{\hat{f}_R(d_{\text{vec}})}{f_G(d_{\text{vec}})}}_{\text{calibrated vector evidence}} + \underbrace{\alpha(s_{\text{bm25}} - \beta)}_{\text{calibrated lexical evidence}} + \underbrace{\text{logit}\,P_{\text{base}}}_{\text{corpus prior}}$$

Each term contributes independent Bayesian evidence in log-odds space, where updates are naturally additive. This is the principled alternative to Reciprocal Rank Fusion (RRF) and weighted score combination — every signal is calibrated through the same likelihood ratio structure, each drawing on the statistics of its native index.
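
A sketch of the fused score, assuming the lexical calibration parameters $\alpha$ and $\beta$ come from the Chapter 20 fit and the two densities are evaluated at the observed vector distance:

```python
import math

def fused_relevance(f_r_hat, f_g, s_bm25, alpha, beta, p_base):
    """Combine calibrated vector and lexical evidence in log-odds space."""
    vector_evidence = math.log(f_r_hat / f_g)
    lexical_evidence = alpha * (s_bm25 - beta)
    prior = math.log(p_base / (1.0 - p_base))
    log_odds = vector_evidence + lexical_evidence + prior
    return 1.0 / (1.0 + math.exp(-log_odds))
```

With neutral evidence from both signals, the posterior falls back to the corpus prior; strengthening either signal moves it up independently of the other.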

22.6.2 Structural Unification

The key insight: both sparse (BM25) and dense (vector) retrieval implement the same abstract pattern:

$$\text{evidence} = \log \frac{P(\text{signal} \mid \text{relevant})}{P(\text{signal} \mid \text{non-relevant})}$$

For BM25, the signal is the term occurrence and the likelihood ratio derives from document frequency statistics in the inverted index. For vector search, the signal is the embedding distance and the likelihood ratio derives from distance distribution statistics in the ANN index.

The unification is not merely conceptual — it has a concrete computational consequence. The log-odds decomposition means that adding a new signal type (e.g., a second embedding model, a click-through signal, or a knowledge graph proximity score) requires only:

  1. Calibrating the new signal through its own likelihood ratio
  2. Adding the resulting evidence term to the log-odds sum

No retuning of existing weights is needed, because each evidence term is independently calibrated.

22.6.3 Extension to Multiple Signals

For $n$ conditionally independent signals:

$$\text{logit}\,P(R \mid s_1, \ldots, s_n) = \sum_{i=1}^{n} \log \frac{f_{R,i}(s_i)}{f_{G,i}(s_i)} + \text{logit}\,P_{\text{base}}$$

Each signal contributes its own evidence term, computed from its own index statistics. The connection to neural network structure — the fact that this computation has the form of a feedforward network — is developed in Chapter 23.
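
The additive extension is mechanical: each new calibrated signal contributes one more term to the sum. A minimal sketch:

```python
import math

def multi_signal_posterior(evidence_terms, p_base):
    """Posterior from independently calibrated log-likelihood-ratio terms."""
    log_odds = sum(evidence_terms) + math.log(p_base / (1.0 - p_base))
    return 1.0 / (1.0 + math.exp(-log_odds))
```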

22.7 Summary

Vector score calibration completes the probabilistic bridge from raw similarity scores to calibrated relevance probabilities. The key concepts covered in this chapter are:

Likelihood Ratio Foundation: Vector calibration is formulated as the ratio of local (relevant) to global (background) distance densities, grounded in Bayes' theorem and the Neyman-Pearson lemma.

Normalization Inadequacy: Fixed monotonic transformations (min-max, arctangent, linear rescaling) cannot account for local density variation in the embedding space.

Circularity Resolution: The self-referential problem of estimating the local distribution without knowing relevance labels is broken through cross-modal conditional independence — using BM25 relevance probabilities as importance weights for kernel density estimation.

Index-Aware Statistics: IVF and HNSW indexes already compute the distributional statistics needed for calibration at negligible additional cost — the index is not merely an algorithmic structure but an implicit statistical model.

Unified Log-Odds Fusion: Both sparse (BM25) and dense (vector) signals calibrate through the same likelihood ratio structure, combining additively in log-odds space. This is the principled replacement for ad-hoc fusion methods like RRF and weighted combination.

Structural Duality: IDF in the inverted index and density ratios in the ANN index are instances of the same mathematical pattern — log likelihood ratios over native index statistics.

The next chapter reveals a surprising consequence of this unified framework: when multiple calibrated probability signals are combined through the log-odds conjunction, the resulting computation has the exact structure of a feedforward neural network — derived from first principles rather than designed.

References

  1. Jeong, J. (2026). Vector Scores as Likelihood Ratios: Index-Derived Bayesian Calibration for Hybrid Search. Zenodo preprint.
  2. Jeong, J. (2026). Bayesian BM25: A Probabilistic Framework for Hybrid Text and Vector Search. Zenodo preprint.
  3. Jeong, J. (2026). From Bayesian Inference to Neural Computation. Zenodo preprint.
  4. Karpukhin, V., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP.
  5. Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE TPAMI.
  6. Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall.
  7. Neyman, J., & Pearson, E. S. (1933). On the Problem of the Most Efficient Tests of Statistical Hypotheses. Philosophical Transactions of the Royal Society.

Copyright (c) 2023-2026 Cognica, Inc.