Chapter 22: Vector Score Calibration
22.1 Introduction
Chapter 20 established Bayesian BM25 as a framework for transforming lexical scores into calibrated probabilities. Chapter 21 covered HNSW-based vector search, which produces similarity scores — cosine similarity, inner product, or Euclidean distance — that rank documents by semantic proximity.
This chapter addresses a fundamental question: how do we transform vector similarity scores into calibrated relevance probabilities? A cosine similarity of 0.85 does not mean an 85% chance of relevance. The answer requires a likelihood ratio framework that exploits the distributional statistics already computed during ANN index construction and search.
22.1.1 The Vector Score Interpretation Problem
Vector similarity scores suffer from four interpretability limitations:
- Not Probabilities: A cosine similarity is a geometric quantity — the cosine of the angle between two vectors — not a probability of relevance.
- Distribution Dependence: Score distributions vary with the embedding model, corpus, and query distribution. A score of 0.7 may be highly discriminative in one corpus and uninformative in another.
- Local Density Variation: The same similarity score carries different information in dense versus sparse regions of the embedding space. In a dense cluster, a nearby document may be unremarkable; in a sparse region, the same distance implies strong relevance.
- Scale Incompatibility: Direct combination with calibrated lexical scores (e.g., Bayesian BM25 probabilities from Chapter 20) is unprincipled without shared probabilistic semantics.
22.1.2 The Normalization Illusion
Rescaling vector scores to $[0, 1]$ — for instance via min-max normalization — creates the appearance of probabilities without their substance. Common normalization methods include:
- Min-max normalization: $\tilde{s} = \frac{s - s_{\min}}{s_{\max} - s_{\min}}$
- Arctangent normalization: $\tilde{s} = \frac{1}{2} + \frac{\arctan(s)}{\pi}$
- Linear rescaling: $\tilde{s} = \frac{s + 1}{2}$ for $s \in [-1, 1]$
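The three rescalings can be written out directly; this is a minimal sketch, with function names chosen for illustration:

```python
import math

def min_max(s, s_min, s_max):
    # Min-max normalization: rescale into [0, 1] using observed extremes.
    return (s - s_min) / (s_max - s_min)

def arctan_norm(s):
    # Arctangent normalization: squash an unbounded score into (0, 1).
    return 0.5 + math.atan(s) / math.pi

def linear_rescale(s):
    # Linear rescaling for cosine similarity s in [-1, 1].
    return (s + 1.0) / 2.0
```

All three are fixed monotone maps: they preserve the ranking exactly, which is precisely why they cannot add any probabilistic information.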
Theorem (Normalization Inadequacy). All query-independent normalization functions fail to account for the local density structure of the embedding space. For any fixed monotonic transformation $f$:

$$s(q, d_1) = s(q, d_2) \implies f(s(q, d_1)) = f(s(q, d_2))$$

but the true relevance probabilities may differ:

$$P(R \mid q, d_1) \neq P(R \mid q, d_2)$$

even when both documents have the same similarity score.
Proof. A fixed transformation $f$ depends only on the score and is blind to the local density. Two documents equidistant from a query carry different relevance information depending on whether they lie in a region with many nearby documents (dense) or few (sparse). No query-independent function can capture this distinction. ∎
22.1.3 Structural Parallel with Lexical Retrieval
Bayesian BM25 (Chapter 20) calibrates lexical scores using statistics from the inverted index: document frequency, term frequency, and average document length — all computed at index time. The IDF component is itself a log likelihood ratio:

$$\mathrm{IDF}(t) = \log \frac{N - n_t + 0.5}{n_t + 0.5}$$

This is the log ratio of the probability of a document not containing term $t$ to the probability of containing it — a density ratio over the term occurrence distribution.
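As a numeric sanity check, a sketch of the standard smoothed BM25 IDF (with corpus size `N` and document frequency `n_t`):

```python
import math

def idf(N, n_t):
    # BM25 IDF with 0.5 smoothing: log ratio of documents NOT containing
    # the term to documents containing it.
    return math.log((N - n_t + 0.5) / (n_t + 0.5))
```

A rare term (10 of 1,000 documents) yields strongly positive evidence; a term occurring in most documents yields negative evidence, exactly as a log likelihood ratio should.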
ANN indexes (IVF, HNSW) similarly compute and store distributional statistics during construction and search. If lexical calibration exploits inverted index statistics, vector calibration should exploit ANN index statistics. The mathematical structure — a likelihood ratio over corpus distributions — is identical in both cases.
22.2 Likelihood Ratio Calibration
22.2.1 Distance Orientation Convention
Throughout this chapter, $d$ denotes a distance-like quantity where smaller values indicate greater similarity:
- For cosine similarity $s \in [-1, 1]$: $d = 1 - s$
- For Euclidean distance: $d$ is used directly
- For inner product $\langle q, x \rangle$ on normalized vectors: $d = 1 - \langle q, x \rangle$
22.2.2 The Posterior in Log-Odds Form
Given observed distance $d$ between query and document vectors, the posterior probability of relevance follows from Bayes' theorem:

$$P(R \mid d) = \frac{p_R(d)\, P(R)}{p_R(d)\, P(R) + p_G(d)\,(1 - P(R))}$$

where:
- $p_R(d)$ is the probability density of distance $d$ among relevant documents (local distribution)
- $p_G(d)$ is the probability density of distance $d$ in the full corpus (global/background distribution), which stands in for the non-relevant density since relevant documents are a vanishing fraction of the corpus

Converting to log-odds:

$$\log \frac{P(R \mid d)}{1 - P(R \mid d)} = \log \frac{P(R)}{1 - P(R)} + \log \frac{p_R(d)}{p_G(d)}$$

The vector evidence is the log density ratio $\log \frac{p_R(d)}{p_G(d)}$ — how much more likely the observed distance is under the relevant distribution versus the background distribution.
22.2.3 Vector Evidence
We define the vector evidence function:

$$e_v(d) = \log \frac{p_R(d)}{p_G(d)}$$

This has a natural interpretation:
- $e_v(d) > 0$: distance $d$ is more likely for relevant documents — evidence for relevance
- $e_v(d) = 0$: distance $d$ is equally likely for relevant and non-relevant documents — no evidence
- $e_v(d) < 0$: distance $d$ is more likely for non-relevant documents — evidence against relevance
The vector evidence is structurally identical to the IDF-based evidence in BM25: both are log likelihood ratios over their respective index distributions. This is not a coincidence — it reflects the Neyman-Pearson lemma, which establishes the likelihood ratio as the optimal test statistic for deciding between two hypotheses (here, relevant versus non-relevant).
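A sketch of the evidence computation, assuming for illustration that both densities have already been estimated — here as Gaussians with hypothetical parameters, not values from the text:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def vector_evidence(d, p_R, p_G):
    # Log density ratio: positive favors relevance, zero is uninformative,
    # negative favors non-relevance.
    return math.log(p_R(d)) - math.log(p_G(d))

# Hypothetical local (relevant) and global (background) distance densities:
# relevant documents cluster at small distances, the background sits farther out.
p_R = lambda d: normal_pdf(d, mu=0.2, sigma=0.05)
p_G = lambda d: normal_pdf(d, mu=0.6, sigma=0.15)
```

With these toy densities, a distance near the relevant cluster produces positive evidence and a large distance produces negative evidence, matching the sign interpretation above.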
22.2.4 Distribution-Free Formulation
The framework makes no assumptions about the parametric form of $p_R(d)$ or $p_G(d)$. The densities can be Gaussian, uniform, or any other distribution — only their ratio matters. This generality is important because distance distributions vary across embedding models, distance metrics, and corpus characteristics.
22.2.5 Concentration of Measure
In high-dimensional spaces, the background distribution of distances concentrates around a characteristic value. For random unit vectors in $\mathbb{R}^n$, the cosine similarity distribution converges to:

$$s \sim \mathcal{N}\!\left(0, \tfrac{1}{n}\right) \quad \text{as } n \to \infty$$

This concentration means $p_G(d)$ becomes increasingly well-characterized with dimensionality — a favorable property for calibration, since the background distribution stabilizes and can be estimated with high confidence from relatively few samples.
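The concentration effect is easy to verify empirically: the standard deviation of cosine similarity between random unit vectors should track $1/\sqrt{n}$. A small simulation (trial counts and seed are arbitrary choices):

```python
import math
import random

def random_unit_vector(n, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_std(n, trials=2000, seed=0):
    # Empirical std of cosine similarity between pairs of random unit vectors.
    rng = random.Random(seed)
    sims = []
    for _ in range(trials):
        u = random_unit_vector(n, rng)
        v = random_unit_vector(n, rng)
        sims.append(sum(a * b for a, b in zip(u, v)))
    mean = sum(sims) / len(sims)
    return math.sqrt(sum((s - mean) ** 2 for s in sims) / len(sims))
```

At $n = 64$ the empirical standard deviation lands near $1/8$; at $n = 256$, near $1/16$ — the background narrows exactly as the dimension grows.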
22.3 Breaking Circularity via Cross-Modal Estimation
22.3.1 The Circularity Problem
Estimating the vector evidence requires knowing $p_R(d)$ — the distance distribution among relevant documents. But identifying which documents are relevant is precisely the retrieval problem we are trying to solve. This creates a circularity:
- To calibrate vector scores, we need $p_R(d)$
- To estimate $p_R(d)$, we need to know which documents are relevant
- To know which documents are relevant, we need calibrated scores
Naive approaches — treating the top-$k$ results as "relevant" — introduce confirmation bias: the calibration would merely reinforce whatever the uncalibrated scores already believe.
22.3.2 Conditional Independence Assumption
The key insight: if we have an external relevance signal that is conditionally independent of vector distance given true relevance, we can use it to break the circularity.
Assumption (Cross-Modal Conditional Independence). For vector distance $d$ and external signal $b$ (e.g., BM25 score):

$$p(d, b \mid R) = p(d \mid R)\, p(b \mid R)$$

This assumes that given a document's true relevance status, knowing its BM25 score provides no additional information about its vector distance (and vice versa). The assumption is reasonable when the signals capture different aspects of relevance — lexical match versus semantic similarity.

Under this assumption, the BM25-derived relevance probabilities $w_i = P(R \mid b_i)$ can serve as importance weights for estimating $p_R(d)$.
22.3.3 Importance-Weighted Kernel Density Estimation
Given $K$ nearest neighbors with distances $d_1, \dots, d_K$ and external relevance weights $w_1, \dots, w_K$ (e.g., Bayesian BM25 probabilities from Chapter 20), the local distribution is estimated by:

$$\hat{p}_R(d) = \frac{\sum_{i=1}^{K} w_i\, K_h(d - d_i)}{\sum_{i=1}^{K} w_i}$$

where $K_h$ is a kernel function with bandwidth $h$. Documents with higher BM25 relevance probability contribute more to the local density estimate — exactly the weighting needed to approximate $p_R(d)$ without knowing the true relevance labels.
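A sketch of the importance-weighted estimator with a Gaussian kernel; the bandwidth is supplied by the caller, and all names are illustrative:

```python
import math

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def weighted_kde(d, distances, weights, h):
    # Importance-weighted KDE: sum of kernels centered on the neighbor
    # distances, each scaled by its external relevance weight, normalized
    # by the total weight.
    num = sum(w * gaussian_kernel((d - d_i) / h) / h
              for d_i, w in zip(distances, weights))
    return num / sum(weights)
```

With high weights on the near neighbors and low weights on the far ones, the estimated density is concentrated at small distances — the shape we expect $p_R$ to have.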
22.3.4 Bandwidth Selection
The bandwidth $h$ controls the smoothness of the density estimate. Following a weighted Silverman's rule:

$$h = 0.9\, \hat{\sigma}_w\, K_{\mathrm{eff}}^{-1/5}$$

where $\hat{\sigma}_w$ is the weighted standard deviation of the distances and $K_{\mathrm{eff}} = \left(\sum_i w_i\right)^2 / \sum_i w_i^2$ is the effective sample size.

For high-dimensional embedding spaces, the bandwidth should additionally be scaled by a dimension-dependent factor to account for the curse of dimensionality.
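The weighted rule of thumb can be sketched as follows, assuming the $0.9\,\hat{\sigma}_w\,K_{\mathrm{eff}}^{-1/5}$ form and the standard $(\sum w)^2 / \sum w^2$ effective sample size:

```python
import math

def effective_sample_size(weights):
    # Kish effective sample size: equals len(weights) for uniform weights,
    # shrinks as the weights become concentrated on few points.
    s = sum(weights)
    return s * s / sum(w * w for w in weights)

def silverman_bandwidth(distances, weights):
    # Weighted Silverman's rule: h = 0.9 * sigma_w * K_eff^(-1/5).
    w_sum = sum(weights)
    mean = sum(w * d for d, w in zip(distances, weights)) / w_sum
    var = sum(w * (d - mean) ** 2 for d, w in zip(distances, weights)) / w_sum
    return 0.9 * math.sqrt(var) * effective_sample_size(weights) ** (-0.2)
```

Using the effective sample size rather than the raw neighbor count is the key adjustment: when BM25 concentrates the weight on a handful of neighbors, the rule widens the bandwidth to reflect the smaller effective sample.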
22.3.5 Estimation Without External Signals
When no external signal is available (pure vector search without a lexical index), three fallback strategies exist:
Distance Gap Detection: Relevant documents tend to cluster at smaller distances, creating a gap between the relevant and non-relevant distance distributions. Finding this gap — via kernel density estimation on the distance distribution and identifying the first local minimum — provides a natural threshold for separating the two populations.
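Distance gap detection can be sketched as scanning a smoothed density of the observed distances for its first local minimum; the grid resolution and bandwidth here are illustrative choices, not values from the text:

```python
import math

def kde(x, points, h):
    # Unweighted Gaussian KDE over the observed distances.
    k = lambda u: math.exp(-0.5 * u * u)
    return sum(k((x - p) / h) for p in points) / (len(points) * h * math.sqrt(2 * math.pi))

def first_gap(distances, h=0.05, steps=200):
    # Scan the smoothed distance density left to right and return the first
    # local minimum: the natural threshold between the relevant cluster
    # (small distances) and the background population.
    lo, hi = min(distances), max(distances)
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    dens = [kde(x, distances, h) for x in xs]
    for i in range(1, steps):
        if dens[i] < dens[i - 1] and dens[i] <= dens[i + 1]:
            return xs[i]
    return None  # no gap found: the distribution is unimodal
```

For a bimodal set of distances the returned threshold falls in the valley between the two clusters; a unimodal distribution yields no threshold, signaling that the fallback is not applicable.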
Index-Derived Density Priors: IVF cell populations serve as a proxy for local density. A query falling in a sparsely populated cell has fewer candidate relevant documents, while dense cells may contain many. The cell-aware base rate adjusts the prior:

$$P_c(R) = P_0(R) \cdot \frac{n_c}{N / C}$$

where $n_c$ is the population of cell $c$, $C$ is the total number of cells, and $N/C$ is the average cell population.
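A sketch of a cell-aware prior adjustment: the base prior is scaled by the cell's population relative to the average cell population, and the cap below 1 is an added safeguard, not something from the text:

```python
def cell_aware_prior(base_prior, cell_population, total_vectors, num_cells):
    # Scale the base rate by how populated the query's cell is relative
    # to the average cell (N / C), capping the result below 1.
    avg_population = total_vectors / num_cells
    adjusted = base_prior * (cell_population / avg_population)
    return min(adjusted, 0.99)
```

A query landing in a cell five times more populated than average sees its prior scaled up fivefold; a near-empty cell pulls the prior toward zero.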
Multi-Model Cross-Calibration: If two independent embedding models are available, each can serve as the external signal for calibrating the other, breaking the circularity without any lexical signal.
22.4 Parametric Estimation via Gaussian Mixture Models
22.4.1 Two-Component Distance Mixture
As an alternative to kernel density estimation, the distance distribution can be modeled as a two-component Gaussian mixture:

$$p(d) = w_R\, \mathcal{N}(d \mid \mu_R, \sigma_R^2) + (1 - w_R)\, \mathcal{N}(d \mid \mu_G, \sigma_G^2)$$

where:
- $\mu_R, \sigma_R^2$ are the mean and variance of the relevant document distances (small $d$)
- $\mu_G, \sigma_G^2$ are the mean and variance of the background distances
- $w_R$ is the mixing weight (proportion of relevant documents)
22.4.2 EM Algorithm with Informed Initialization
Standard EM for Gaussian mixtures is sensitive to initialization. Using external relevance weights from BM25 provides informed initialization:
```
Input:  Distances d_1, ..., d_K, weights w_1, ..., w_K
Output: Parameters (mu_R, sigma_R, mu_G, sigma_G, w_R)

// Informed initialization
mu_R    = weighted_mean(d, w)
sigma_R = weighted_std(d, w)
mu_G    = mean(d)
sigma_G = std(d)
w_R     = mean(w)

// EM iterations
for iter = 1 to max_iterations:
    // E-step: posterior responsibilities
    for i = 1 to K:
        gamma_i = w_R * N(d_i | mu_R, sigma_R) /
                  (w_R * N(d_i | mu_R, sigma_R) +
                   (1 - w_R) * N(d_i | mu_G, sigma_G))
    // M-step: update parameters (background component held fixed)
    w_R     = mean(gamma)
    mu_R    = sum(gamma * d) / sum(gamma)
    sigma_R = sqrt(sum(gamma * (d - mu_R)^2) / sum(gamma))
    if converged: break
```
The background component is held fixed because it represents the corpus-level distance distribution, which is stable and well-characterized by the ANN index statistics.
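The pseudocode translates into a compact runnable sketch in pure Python; the convergence test on the change in `w_R` and its tolerance are implementation choices, not from the text:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def fit_distance_mixture(d, w, max_iterations=50, tol=1e-6):
    # Informed initialization from external (e.g., BM25) relevance weights.
    w_sum = sum(w)
    mu_R = sum(wi * di for di, wi in zip(d, w)) / w_sum
    sigma_R = math.sqrt(sum(wi * (di - mu_R) ** 2 for di, wi in zip(d, w)) / w_sum)
    mu_G = sum(d) / len(d)
    sigma_G = math.sqrt(sum((di - mu_G) ** 2 for di in d) / len(d))
    w_R = w_sum / len(w)
    for _ in range(max_iterations):
        # E-step: responsibility of the relevant component for each distance.
        gamma = []
        for di in d:
            a = w_R * normal_pdf(di, mu_R, sigma_R)
            b = (1 - w_R) * normal_pdf(di, mu_G, sigma_G)
            gamma.append(a / (a + b))
        # M-step: update the relevant component only; the background
        # component stays fixed at its corpus-level estimate.
        g_sum = sum(gamma)
        new_w_R = g_sum / len(d)
        mu_R = sum(g * di for g, di in zip(gamma, d)) / g_sum
        sigma_R = math.sqrt(sum(g * (di - mu_R) ** 2 for g, di in zip(gamma, d)) / g_sum)
        if abs(new_w_R - w_R) < tol:
            w_R = new_w_R
            break
        w_R = new_w_R
    return mu_R, sigma_R, mu_G, sigma_G, w_R
```

On data with a high-weight cluster at small distances and a low-weight background, the fitted relevant component settles on the near cluster while the background parameters stay at their corpus-level values.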
22.4.3 Nonparametric vs. Parametric Comparison
| Aspect | KDE (Section 22.3) | GMM (Section 22.4) |
|---|---|---|
| Assumptions | Distribution-free | Gaussian components |
| Sample efficiency | Needs more neighbors | Fewer neighbors suffice |
| Boundary handling | Natural | Gaussian tails may leak |
| Computation | $O(K)$ per query point | EM iterations needed |
| Adaptability | Local density naturally | Fixed parametric form |
| Recommended for | Large $K$, complex distributions | Small $K$, well-separated clusters |
22.5 Index-Aware Statistics Extraction
22.5.1 The Zero Additional Cost Principle
ANN indexes already compute and store distributional statistics during construction and search. Extracting these statistics for calibration requires no additional computation — they are byproducts of operations that must be performed anyway.
22.5.2 IVF Index Statistics
The Inverted File (IVF) index partitions vectors into cells (Voronoi regions) around centroids. During construction and search, the following statistics are available:
Global statistics (computed at index construction time):
- Total vector count $N$
- Number of cells $C$
- Global distance distribution moments ($\mu_G$, $\sigma_G$)
Local statistics (available during search):
- Cell population $n_c$ for the query's assigned cell
- Within-cell distance distribution (distances from cell centroid to members)
- Query-to-centroid distance
Density proxy:
- Cell population as relevance density signal: sparse cells (low $n_c$) suggest the query is in an unusual region; dense cells suggest many potentially relevant documents
22.5.3 HNSW Index Statistics
The Hierarchical Navigable Small World graph maintains navigable layers with edges representing neighborhood relationships. During search traversal, the following statistics are available:
Global statistics (computed at construction time):
- Mean edge distance per layer
- Edge distance distribution parameters
Local statistics (available during search):
- Search trajectory distances (distances to nodes visited during greedy search)
- Neighborhood distances at the result layer
- Number of distance computations per search
Density proxy:
- Search trajectory length (number of hops to reach a result): longer trajectories suggest the query is far from the graph's dense regions
22.5.4 Unified Statistics Mapping
Both IVF and HNSW provide the same abstract statistics needed for calibration:
| Calibration Need | IVF Source | HNSW Source |
|---|---|---|
| $p_G(d)$ (global density) | Cross-cell distance distribution | Edge distance distribution |
| Local density estimate | Cell population | Search trajectory length |
| Background moments | Cell centroid distances | Layer-0 edge statistics |
| Anomaly detection | Query-to-centroid distance | Hop count to nearest neighbor |
The mathematical framework is index-agnostic — only the source of the statistics changes.
22.6 Unified Hybrid Search Fusion
22.6.1 The Complete Log-Odds Decomposition
Combining the vector calibration (this chapter) with the Bayesian BM25 calibration (Chapter 20), the full posterior for a document given both lexical and semantic evidence is:

$$\log \frac{P(R \mid q, d)}{1 - P(R \mid q, d)} = \log \frac{P(R)}{1 - P(R)} + e_{\mathrm{lex}}(q) + e_v(d)$$

where $e_{\mathrm{lex}}$ is the calibrated lexical evidence and $e_v$ the vector evidence. Each term contributes independent Bayesian evidence in log-odds space, where updates are naturally additive. This is the principled alternative to RRF and weighted combination — every signal is calibrated through the same likelihood ratio structure, each drawing on the statistics of its native index.
22.6.2 Structural Unification
The key insight: both sparse (BM25) and dense (vector) retrieval implement the same abstract pattern:

$$\text{evidence} = \log \frac{p(\text{signal} \mid R)}{p(\text{signal} \mid \neg R)}$$

For BM25, the signal is the term occurrence and the likelihood ratio derives from document frequency statistics in the inverted index. For vector search, the signal is the embedding distance and the likelihood ratio derives from distance distribution statistics in the ANN index.
The unification is not merely conceptual — it has a concrete computational consequence. The log-odds decomposition means that adding a new signal type (e.g., a second embedding model, a click-through signal, or a knowledge graph proximity score) requires only:
- Calibrating the new signal through its own likelihood ratio
- Adding the resulting evidence term to the log-odds sum
No retuning of existing weights is needed, because each evidence term is independently calibrated.
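The additive fusion can be sketched end to end, assuming each signal has already been calibrated to a prior probability or an evidence term; the helper names are illustrative:

```python
import math

def logit(p):
    # Probability -> log-odds.
    return math.log(p / (1 - p))

def sigmoid(x):
    # Log-odds -> probability.
    return 1 / (1 + math.exp(-x))

def fuse(prior, evidence_terms):
    # Additive Bayesian fusion: prior log-odds plus one independently
    # calibrated evidence term per signal, mapped back to a probability.
    return sigmoid(logit(prior) + sum(evidence_terms))
```

For example, `fuse(0.01, [2.5, 1.8])` combines a 1% prior with a lexical and a vector evidence term; adding a third signal — a second embedding model, say — is just one more entry in the list, with no retuning of the others.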
22.6.3 Extension to Multiple Signals
For $m$ conditionally independent signals $x_1, \dots, x_m$:

$$\log \frac{P(R \mid x_1, \dots, x_m)}{1 - P(R \mid x_1, \dots, x_m)} = \log \frac{P(R)}{1 - P(R)} + \sum_{j=1}^{m} \log \frac{p(x_j \mid R)}{p(x_j \mid \neg R)}$$
Each signal contributes its own evidence term, computed from its own index statistics. The connection to neural network structure — the fact that this computation has the form of a feedforward network — is developed in Chapter 23.
22.7 Summary
Vector score calibration completes the probabilistic bridge from raw similarity scores to calibrated relevance probabilities. The key concepts covered in this chapter are:
Likelihood Ratio Foundation: Vector calibration is formulated as the ratio of local (relevant) to global (background) distance densities, grounded in Bayes' theorem and the Neyman-Pearson lemma.
Normalization Inadequacy: Fixed monotonic transformations (min-max, arctangent, linear rescaling) cannot account for local density variation in the embedding space.
Circularity Resolution: The self-referential problem of estimating the local distribution without knowing relevance labels is broken through cross-modal conditional independence — using BM25 relevance probabilities as importance weights for kernel density estimation.
Index-Aware Statistics: IVF and HNSW indexes already compute the distributional statistics needed for calibration at negligible additional cost — the index is not merely an algorithmic structure but an implicit statistical model.
Unified Log-Odds Fusion: Both sparse (BM25) and dense (vector) signals calibrate through the same likelihood ratio structure, combining additively in log-odds space. This is the principled replacement for ad-hoc fusion methods like RRF and weighted combination.
Structural Duality: IDF in the inverted index and density ratios in the ANN index are instances of the same mathematical pattern — log likelihood ratios over native index statistics.
The next chapter reveals a surprising consequence of this unified framework: when multiple calibrated probability signals are combined through the log-odds conjunction, the resulting computation has the exact structure of a feedforward neural network — derived from first principles rather than designed.
References
- Jeong, J. (2026). Vector Scores as Likelihood Ratios: Index-Derived Bayesian Calibration for Hybrid Search. Zenodo preprint.
- Jeong, J. (2026). Bayesian BM25: A Probabilistic Framework for Hybrid Text and Vector Search. Zenodo preprint.
- Jeong, J. (2026). From Bayesian Inference to Neural Computation. Zenodo preprint.
- Karpukhin, V., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP.
- Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE TPAMI.
- Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall.
- Neyman, J., & Pearson, E. S. (1933). On the Problem of the Most Efficient Tests of Statistical Hypotheses. Philosophical Transactions of the Royal Society.