Chapter 19: BM25 Scoring

19.1 Introduction

After text analysis transforms documents into searchable tokens, the next challenge is ranking — determining which documents best match a query. A user searching for "machine learning algorithms" expects results about machine learning to appear before documents that merely mention "algorithms" in passing.

BM25 (Best Match 25) is the most widely used ranking function in modern information retrieval systems. Developed in the 1990s by Stephen Robertson, Steve Walker, and colleagues, building on the probabilistic relevance framework of Robertson and Sparck Jones, BM25 combines term frequency, inverse document frequency, and document length normalization into a principled scoring formula derived from probabilistic relevance models.

This chapter explores Cognica's BM25 implementation in detail. The next chapter introduces Bayesian BM25, which transforms unbounded BM25 scores into calibrated probabilities suitable for multi-signal fusion in hybrid search systems.

19.1.1 The Ranking Problem

Given a query q = \{t_1, t_2, \ldots, t_n\} and a document collection D, the ranking problem is to order documents by relevance:

P(R \mid q, d) \propto \text{score}(q, d)

where R denotes relevance. A good scoring function must:

  1. Reward term matches: Documents containing query terms should score higher
  2. Weight rare terms: Uncommon terms are more informative than common ones
  3. Handle term frequency: Multiple occurrences indicate topicality, but with diminishing returns
  4. Normalize for length: Longer documents naturally contain more terms

19.1.2 From TF-IDF to BM25

The classic TF-IDF formula combines term frequency and inverse document frequency:

\text{TF-IDF}(t, d) = \text{tf}(t, d) \cdot \log\frac{N}{\text{df}(t)}

However, TF-IDF has limitations:

  • Unbounded TF: Score increases linearly with frequency
  • No length normalization: Long documents are favored
  • Ad-hoc combination: No theoretical justification

BM25 addresses these issues through a probabilistic framework, producing a formula with principled parameter choices and saturation behavior.

19.2 The BM25 Formula

19.2.1 Standard BM25

The BM25 scoring function for a single term t in document d is:

\text{BM25}(t, d) = \text{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}

where:

  • f(t, d) = frequency of term t in document d
  • |d| = length of document d (in tokens)
  • \text{avgdl} = average document length in the collection
  • k_1 = term frequency saturation parameter (typically 1.2)
  • b = length normalization parameter (typically 0.75)

For a multi-term query, scores are summed:

\text{BM25}(q, d) = \sum_{t \in q} \text{BM25}(t, d)

19.2.2 Inverse Document Frequency

The IDF component uses the Robertson-Sparck Jones formula:

\text{IDF}(t) = \ln\left(\frac{N - \text{df}(t) + 0.5}{\text{df}(t) + 0.5} + 1\right)

where:

  • N = total number of documents
  • \text{df}(t) = number of documents containing term t

Properties:

  • Rare terms (low \text{df}) get high IDF
  • Common terms (high \text{df}) get low IDF
  • The +0.5 smoothing prevents division by zero
  • The final +1 ensures IDF is always positive

Example: In a collection of 1,000,000 documents:

  • Term appearing in 100 docs: \text{IDF} \approx 9.2
  • Term appearing in 10,000 docs: \text{IDF} \approx 4.6
  • Term appearing in 500,000 docs: \text{IDF} \approx 0.7

19.2.3 Term Frequency Saturation

The term frequency component exhibits saturation — the score increases with frequency but approaches an upper bound:

\text{TF}_{\text{BM25}}(f) = \frac{f \cdot (k_1 + 1)}{f + k_1 \cdot K}

where K = 1 - b + b \cdot \frac{|d|}{\text{avgdl}} is the length normalization factor.

As f \to \infty:

\lim_{f \to \infty} \text{TF}_{\text{BM25}}(f) = k_1 + 1

This saturation prevents a single term from dominating the score through excessive repetition.

19.2.4 Length Normalization

The b parameter controls how document length affects scoring:

  • b = 0: no length normalization (long documents are favored)
  • b = 1: full length normalization (long documents are penalized)
  • b = 0.75: default balance

The normalization factor:

K = 1 - b + b \cdot \frac{|d|}{\text{avgdl}}

  • For |d| = \text{avgdl}: K = 1 (no adjustment)
  • For |d| < \text{avgdl}: K < 1 (boost for short documents)
  • For |d| > \text{avgdl}: K > 1 (penalty for long documents)

19.3 Cognica's BM25 Implementation

19.3.1 Similarity Interface

Cognica uses type erasure for scoring flexibility:

class Similarity {
public:
  auto compute_norm(std::string_view field_name, int64_t term_count) const
      -> float;

  auto compute_idf(int64_t doc_freq, int64_t doc_count) const -> float;

  auto create_scorer(const IndexStatsSnapshot& index_stats,
                     const std::vector<std::optional<TermStatsSnapshot>>& term_stats) const
      -> SimScorerType;

  auto get_bm25_params() const -> std::optional<std::pair<float, float>>;
};

using SimilarityType = te::poly<Similarity, te::local_storage<32>>;

The te::poly type erasure provides polymorphism without virtual function overhead, with 32 bytes of inline storage.

19.3.2 BM25 Similarity Class

class BM25Similarity {
  float k1_ = 1.2f;   // Term frequency saturation
  float b_ = 0.75f;   // Length normalization

public:
  BM25Similarity() = default;
  BM25Similarity(float k1, float b) : k1_(k1), b_(b) {}

  auto compute_norm(std::string_view field_name, int64_t term_count) const
      -> float {
    return static_cast<float>(term_count);
  }

  auto compute_idf(int64_t doc_freq, int64_t doc_count) const -> float {
    return std::log(
      (static_cast<float>(doc_count - doc_freq) + 0.5f) /
      (static_cast<float>(doc_freq) + 0.5f) + 1.0f
    );
  }

  auto create_scorer(const IndexStatsSnapshot& index_stats,
                     const std::vector<std::optional<TermStatsSnapshot>>& term_stats) const
      -> SimScorerType;
};

19.3.3 BM25 Scorer Implementation

The scorer computes BM25 scores for individual term-document pairs:

class BM25SimScorer {
  float k1_;
  float b_;
  float weight_;        // boost * IDF
  float avg_doc_size_;  // Average document length

public:
  BM25SimScorer(float k1, float b, float boost, float idf, float avg_doc_size)
      : k1_(k1), b_(b),
        weight_(boost * idf),
        avg_doc_size_(avg_doc_size) {}

  auto score(float freq, float norm) const -> float {
    // Rewritten for numerical stability and monotonicity
    auto inv_norm = 1.0f / (k1_ * ((1.0f - b_) + (b_ * norm / avg_doc_size_)));
    return weight_ - weight_ / (1.0f + freq * inv_norm);
  }

  auto is_probabilistic() const -> bool { return false; }
};

Numerical Stability: The score formula is rewritten from the standard form to ensure monotonicity despite floating-point precision:

Standard form:

\text{score} = w \cdot \frac{f}{f + K}

Rewritten form:

\text{score} = w - \frac{w}{1 + f \cdot K^{-1}}

where w = \text{boost} \cdot \text{IDF} and K = k_1 \cdot \left((1 - b) + b \cdot \frac{\text{norm}}{\text{avgdl}}\right).

The two forms are algebraically identical, but evaluating the ratio f / (f + K) directly can lose precision when f \gg K; the rewritten form keeps the score monotonically non-decreasing in f under floating-point rounding.

19.3.4 Statistics Collection

Scoring requires collection-level statistics:

struct IndexStatsSnapshot {
  std::string field;
  int64_t total_doc_count;   // Total documents in collection
  int64_t total_doc_size;    // Total size of all documents
  int64_t doc_count;         // Documents containing this field
  int64_t doc_size;          // Size of documents with field
  int64_t sum_term_freq;     // Total tokens in field
  int64_t sum_doc_freq;      // Sum of unique term counts
};

struct TermStatsSnapshot {
  Term term;
  int64_t doc_freq;          // Documents containing term
  int64_t total_term_freq;   // Total term occurrences
};

Average document length is computed as:

\text{avgdl} = \frac{\text{sum\_term\_freq}}{\text{doc\_count}}

19.4 Multi-Term Query Scoring

19.4.1 Conjunction (AND) Queries

For queries requiring all terms to match, scores are combined differently based on scorer type:

Non-Probabilistic (BM25):

\text{score}(q, d) = \sum_{t \in q} \text{BM25}(t, d)

For probabilistic scoring with Bayesian BM25, scores are combined using a log-odds conjunction framework described in Chapter 20.

class ConjunctionScorer {
  std::vector<ScorerType> scorers_;

public:
  auto score() const -> float {
    // Simple sum for BM25
    float total = 0.0f;
    for (const auto& scorer : scorers_) {
      total += scorer.score();
    }
    return total;
  }
};

19.4.2 Disjunction (OR) Queries

For queries where any term match suffices:

Non-Probabilistic (BM25):

\text{score}(q, d) = \sum_{t \in q \cap d} \text{BM25}(t, d)

Only matching terms contribute to the sum.

For probabilistic disjunction with Bayesian BM25, see Chapter 20.

class DisjunctionScorer {
  std::vector<ScorerType> scorers_;

public:
  auto score() const -> float {
    // Sum of matching scores for BM25
    float total = 0.0f;
    for (const auto& scorer : scorers_) {
      if (scorer.matches_current_doc()) {
        total += scorer.score();
      }
    }
    return total;
  }
};

19.5 Score Normalization Utilities

19.5.1 Softmax Normalization

Converts scores to a probability distribution:

P_i = \frac{e^{s_i / T}}{\sum_j e^{s_j / T}}

where T is the temperature parameter.

auto compute_softmax(const std::vector<float>& scores, float temperature = 1.0f)
    -> std::vector<float> {
  // Numerical stability: subtract max
  float max_score = *std::max_element(scores.begin(), scores.end());

  std::vector<float> exp_scores(scores.size());
  float sum = 0.0f;

  for (size_t i = 0; i < scores.size(); ++i) {
    exp_scores[i] = std::exp((scores[i] - max_score) / temperature);
    sum += exp_scores[i];
  }

  for (auto& s : exp_scores) {
    s /= sum;
  }

  return exp_scores;
}

Temperature Effects:

  • T \to 0: winner-take-all (hard max)
  • T = 1: standard softmax
  • T \to \infty: uniform distribution

19.5.2 Min-Max Normalization

Linear scaling to a target range:

\text{norm}(s) = \frac{s - s_{\min}}{s_{\max} - s_{\min}} \cdot (t_{\max} - t_{\min}) + t_{\min}

auto compute_min_max_norm(const std::vector<float>& scores,
                          float target_min = 0.0f,
                          float target_max = 1.0f) -> std::vector<float> {
  float min_score = *std::min_element(scores.begin(), scores.end());
  float max_score = *std::max_element(scores.begin(), scores.end());
  float range = max_score - min_score;

  std::vector<float> normalized(scores.size());
  for (size_t i = 0; i < scores.size(); ++i) {
    if (range > 0) {
      normalized[i] = (scores[i] - min_score) / range *
                      (target_max - target_min) + target_min;
    } else {
      normalized[i] = (target_min + target_max) / 2.0f;
    }
  }

  return normalized;
}

19.5.3 Sigmoid Transform

Maps any score to (0, 1):

\sigma(s) = \frac{1}{1 + e^{-s}}

auto compute_sigmoid_transform(const std::vector<float>& scores)
    -> std::vector<float> {
  std::vector<float> transformed(scores.size());
  for (size_t i = 0; i < scores.size(); ++i) {
    transformed[i] = 1.0f / (1.0f + std::exp(-scores[i]));
  }
  return transformed;
}

19.6 WAND Optimization Support

19.6.1 Upper Bound Computation

WAND (Weak AND) query evaluation requires score upper bounds for early termination (covered in Chapter 25). BM25 provides tight upper bounds:

\text{UB}(t) = \text{boost} \cdot \text{IDF}(t)

This is the maximum possible score under the standard formula as f \to \infty and |d| \to 0:

\lim_{f \to \infty,\ |d| \to 0} \text{BM25}(t, d) = \text{IDF}(t) \cdot (k_1 + 1)

In practice, Cognica's scorer omits the constant (k_1 + 1) factor from the numerator (it scales every score equally and does not affect ranking), so the per-term score saturates at the weight w = \text{boost} \cdot \text{IDF}, which therefore serves as a tight upper bound.

class BM25Similarity {
public:
  auto get_bm25_params() const -> std::optional<std::pair<float, float>> {
    return std::make_pair(k1_, b_);
  }
};

// In WAND evaluator
float upper_bound = weight_;  // boost * IDF

19.6.2 Bayesian BM25 and WAND

Bayesian BM25 (Chapter 20) preserves relative ordering, so WAND optimization remains valid. The monotonic sigmoid transformation ensures that pruning decisions based on BM25 upper bounds remain safe. Chapter 20 provides the detailed analysis of modified upper bounds under Bayesian scoring.

19.7 Performance Characteristics

19.7.1 Computational Complexity

Operation          Complexity   Notes
-----------------  -----------  ----------------------
IDF computation    O(1)         Single log operation
Term score         O(1)         Constant arithmetic
Multi-term query   O(n)         Linear in query terms

19.7.2 Thread Safety

  • Similarity objects: Immutable after construction
  • SimScorer objects: Stateless scoring — safe for concurrent use from multiple threads

19.7.3 Memory Efficiency

Type-erased pointers use inline storage:

using SimilarityType = te::poly<Similarity, te::local_storage<32>>;
using SimScorerType = te::poly<SimScorer, te::local_storage<32>>;

Both BM25Similarity and BayesianBM25Similarity fit within 32 bytes, avoiding heap allocation.

19.8 Practical Considerations

19.8.1 Parameter Tuning

BM25 Parameters:

Parameter   Default   Typical Range   Effect
---------   -------   -------------   --------------------------------------
k_1         1.2       0.5 - 2.0       Higher = more weight on term frequency
b           0.75      0.0 - 1.0       Higher = more length normalization

Tuning Guidelines:

  • Short documents (tweets, titles): lower b (0.3-0.5)
  • Long documents (articles, books): higher b (0.75-0.9)
  • Verbose queries: lower k_1 (0.5-1.0)
  • Keyword queries: higher k_1 (1.2-2.0)

19.8.2 Choosing Between BM25 and Bayesian BM25

For hybrid search, multi-signal fusion, or threshold-based filtering, Bayesian BM25 (Chapter 20) provides calibrated probabilities in [0, 1]. Standard BM25 is sufficient for pure text search or when only relative ranking matters.

19.8.3 Similarity Factory

auto SimilarityFactory::create(std::string_view type) -> SimilarityType {
  if (type == "bayesian-bm25" || type == "BayesianBM25Similarity") {
    return BayesianBM25Similarity{};
  } else if (type == "bm25" || type == "BM25Similarity") {
    return BM25Similarity{};
  } else if (type == "boolean" || type == "BooleanSimilarity") {
    return BooleanSimilarity{};  // Constant score 1.0
  } else if (type == "tf-idf" || type == "TFIDFSimilarity") {
    return TFIDFSimilarity{};
  }

  return BM25Similarity{};  // Default
}

19.9 Summary

BM25 provides the mathematical foundation for ranking documents in full-text search. The key concepts covered in this chapter are:

BM25 Formula: Combines term frequency, inverse document frequency, and document length normalization into a principled scoring function. The k_1 parameter controls TF saturation while b controls length normalization.

IDF Computation: The Robertson-Sparck Jones formula assigns higher weights to rare terms, with smoothing to ensure positive values for all terms.

Term Frequency Saturation: BM25's sublinear TF component prevents excessive repetition from dominating scores, with the saturation rate controlled by k_1.

Numerical Stability: The score formula is rewritten from w \cdot f / (f + K) to w - w / (1 + f \cdot K^{-1}) to avoid precision loss when f \gg K.

WAND Support: BM25 provides tight score upper bounds \text{UB}(t) = \text{boost} \cdot \text{IDF}(t) for early termination in top-k retrieval (Chapter 25).

Multi-Term Scoring: Conjunction queries sum per-term BM25 scores, while disjunction queries sum scores from matching terms.

The next chapter introduces Bayesian BM25, which transforms unbounded BM25 scores into calibrated probabilities for principled multi-signal fusion in hybrid search systems.

References

  1. Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval.
  2. Robertson, S. E., & Walker, S. (1994). Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval. SIGIR.
  3. Sparck Jones, K. (1972). A Statistical Interpretation of Term Specificity and its Application in Retrieval. Journal of Documentation.
  4. Lv, Y., & Zhai, C. (2011). Lower-Bounding Term Frequency Normalization. CIKM.
  5. Trotman, A., Puurula, A., & Burgess, B. (2014). Improvements to BM25 and Language Models Examined. ADCS.

Copyright (c) 2023-2026 Cognica, Inc.