Chapter 19: BM25 Scoring
19.1 Introduction
After text analysis transforms documents into searchable tokens, the next challenge is ranking — determining which documents best match a query. A user searching for "machine learning algorithms" expects results about machine learning to appear before documents that merely mention "algorithms" in passing.
BM25 (Best Match 25) is the most widely used ranking function in modern information retrieval systems. Developed by Stephen Robertson, Steve Walker, and colleagues on the Okapi project in the 1990s, building on Karen Sparck Jones's work on term specificity, BM25 combines term frequency, inverse document frequency, and document length normalization into a principled scoring formula derived from probabilistic relevance models.
This chapter explores Cognica's BM25 implementation in detail. The next chapter introduces Bayesian BM25, which transforms unbounded BM25 scores into calibrated probabilities suitable for multi-signal fusion in hybrid search systems.
19.1.1 The Ranking Problem
Given a query $q$ and a document collection $D = \{d_1, d_2, \ldots, d_N\}$, the ranking problem is to order documents by relevance:

$$d_i \succ_q d_j \iff P(R \mid q, d_i) > P(R \mid q, d_j)$$

where $R$ denotes relevance. A good scoring function must:
- Reward term matches: Documents containing query terms should score higher
- Weight rare terms: Uncommon terms are more informative than common ones
- Handle term frequency: Multiple occurrences indicate topicality, but with diminishing returns
- Normalize for length: Longer documents naturally contain more terms
19.1.2 From TF-IDF to BM25
The classic TF-IDF formula combines term frequency and inverse document frequency:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log \frac{N}{\text{df}(t)}$$
However, TF-IDF has limitations:
- Unbounded TF: Score increases linearly with frequency
- No length normalization: Long documents are favored
- Ad-hoc combination: No theoretical justification
BM25 addresses these issues through a probabilistic framework, producing a formula with principled parameter choices and saturation behavior.
19.2 The BM25 Formula
19.2.1 Standard BM25
The BM25 scoring function for a single term $t$ in document $d$ is:

$$\text{score}(t, d) = \text{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}$$

where:
- $f(t, d)$ = frequency of term $t$ in document $d$
- $|d|$ = length of document $d$ (in tokens)
- $\text{avgdl}$ = average document length in the collection
- $k_1$ = term frequency saturation parameter (typically 1.2)
- $b$ = length normalization parameter (typically 0.75)
For a multi-term query $q$, scores are summed:

$$\text{score}(q, d) = \sum_{t \in q} \text{score}(t, d)$$
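To make the formula concrete, here is an illustrative standalone sketch of single-term and multi-term scoring. These are hypothetical helper functions for exposition, not Cognica's API (the actual implementation appears in Section 19.3):

```cpp
#include <cmath>

// Illustrative single-term BM25 score in textbook form (with the
// (k1 + 1) factor). idf is assumed precomputed per term.
float bm25_term_score(float tf, float doc_len, float avg_doc_len,
                      float idf, float k1 = 1.2f, float b = 0.75f) {
    float K = k1 * ((1.0f - b) + b * doc_len / avg_doc_len);
    return idf * tf * (k1 + 1.0f) / (tf + K);
}

// Multi-term query score: the sum of per-term scores.
float bm25_query_score(const float* tfs, const float* idfs, int n_terms,
                       float doc_len, float avg_doc_len) {
    float total = 0.0f;
    for (int i = 0; i < n_terms; ++i)
        total += bm25_term_score(tfs[i], doc_len, avg_doc_len, idfs[i]);
    return total;
}
```

For a document at average length with a single occurrence of the term, $K = k_1$ and the TF fraction collapses to 1, so the score equals the IDF weight.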
19.2.2 Inverse Document Frequency
The IDF component uses the Robertson-Sparck Jones formula:

$$\text{IDF}(t) = \log\left(\frac{N - n(t) + 0.5}{n(t) + 0.5} + 1\right)$$

where:
- $N$ = total number of documents
- $n(t)$ = number of documents containing term $t$
Properties:
- Rare terms (low $n(t)$) get high IDF
- Common terms (high $n(t)$) get low IDF
- The +0.5 smoothing prevents division by zero
- The final +1 ensures IDF is always positive
Example: In a collection of $N = 1{,}000{,}000$ documents:
- Term appearing in 100 docs: $\text{IDF} \approx 9.21$
- Term appearing in 10,000 docs: $\text{IDF} \approx 4.61$
- Term appearing in 500,000 docs: $\text{IDF} \approx 0.69$
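These values can be reproduced directly. The following sketch mirrors the compute_idf method shown later in Section 19.3.2, repackaged here as a free function for illustration:

```cpp
#include <cmath>
#include <cstdint>

// Robertson-Sparck Jones IDF with +0.5 smoothing and the +1 floor.
float bm25_idf(int64_t doc_freq, int64_t doc_count) {
    return std::log(
        (static_cast<float>(doc_count - doc_freq) + 0.5f) /
        (static_cast<float>(doc_freq) + 0.5f) + 1.0f);
}
```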
19.2.3 Term Frequency Saturation
The term frequency component exhibits saturation — the score increases with frequency but approaches an upper bound:

$$\text{TF}(t, d) = \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + K}$$

where $K = k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)$ is the length normalization factor.

As $f(t, d) \to \infty$:

$$\text{TF}(t, d) \to k_1 + 1$$
This saturation prevents a single term from dominating the score through excessive repetition.
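Saturation is easy to observe numerically. The sketch below isolates the TF component for a document at average length (so $K = k_1$); tf_component is a hypothetical helper for exposition, not part of Cognica:

```cpp
// TF component of BM25 for a document at average length (K = k1).
// Increases with freq but never exceeds k1 + 1.
float tf_component(float freq, float k1 = 1.2f) {
    return freq * (k1 + 1.0f) / (freq + k1);
}
```

Going from 1 to 2 occurrences moves the component from 1.0 to about 1.38, while going from 9 to 10 occurrences adds only about 0.02 — the diminishing returns the text describes.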
19.2.4 Length Normalization
The parameter $b$ controls how document length affects scoring:
- $b = 0$: No length normalization (favors long documents)
- $b = 1$: Full length normalization (penalizes long documents)
- $b = 0.75$: Default balance

The normalization factor $K = k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)$:
- For $|d| = \text{avgdl}$: $K = k_1$ (no adjustment)
- For $|d| < \text{avgdl}$: $K < k_1$ (boost short documents)
- For $|d| > \text{avgdl}$: $K > k_1$ (penalize long documents)
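The behavior of $K$ can be verified with a small helper; length_norm is an illustrative function (not Cognica code) using the default $k_1 = 1.2$ and $b = 0.75$:

```cpp
// Length normalization factor K = k1 * (1 - b + b * dl / avgdl).
float length_norm(float doc_len, float avg_doc_len,
                  float k1 = 1.2f, float b = 0.75f) {
    return k1 * ((1.0f - b) + b * doc_len / avg_doc_len);
}
```

With $b = 0$, the document length term drops out entirely and $K = k_1$ regardless of length.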
19.3 Cognica's BM25 Implementation
19.3.1 Similarity Interface
Cognica uses type erasure for scoring flexibility:
class Similarity {
public:
auto compute_norm(std::string_view field_name, int64_t term_count) const
-> float;
auto compute_idf(int64_t doc_freq, int64_t doc_count) const -> float;
auto create_scorer(const IndexStatsSnapshot& index_stats,
const std::vector<std::optional<TermStatsSnapshot>>& term_stats) const
-> SimScorerType;
auto get_bm25_params() const -> std::optional<std::pair<float, float>>;
};
using SimilarityType = te::poly<Similarity, te::local_storage<32>>;
The te::poly type erasure provides polymorphism without virtual function overhead, with 32 bytes of inline storage.
19.3.2 BM25 Similarity Class
class BM25Similarity {
float k1_ = 1.2f; // Term frequency saturation
float b_ = 0.75f; // Length normalization
public:
BM25Similarity() = default;
BM25Similarity(float k1, float b) : k1_(k1), b_(b) {}
auto compute_norm(std::string_view field_name, int64_t term_count) const
-> float {
return static_cast<float>(term_count);
}
auto compute_idf(int64_t doc_freq, int64_t doc_count) const -> float {
return std::log(
(static_cast<float>(doc_count - doc_freq) + 0.5f) /
(static_cast<float>(doc_freq) + 0.5f) + 1.0f
);
}
auto create_scorer(const IndexStatsSnapshot& index_stats,
const std::vector<std::optional<TermStatsSnapshot>>& term_stats) const
-> SimScorerType;
};
19.3.3 BM25 Scorer Implementation
The scorer computes BM25 scores for individual term-document pairs:
class BM25SimScorer {
float k1_;
float b_;
float weight_; // boost * IDF
float avg_doc_size_; // Average document length
public:
BM25SimScorer(float k1, float b, float boost, float idf, float avg_doc_size)
: k1_(k1), b_(b),
weight_(boost * idf),
avg_doc_size_(avg_doc_size) {}
auto score(float freq, float norm) const -> float {
// Rewritten for numerical stability and monotonicity
auto inv_norm = 1.0f / (k1_ * ((1.0f - b_) + (b_ * norm / avg_doc_size_)));
return weight_ - weight_ / (1.0f + freq * inv_norm);
}
auto is_probabilistic() const -> bool { return false; }
};
Numerical Stability: The score formula is rewritten from the standard form to ensure monotonicity despite floating-point precision:
Standard form (dropping the constant $(k_1 + 1)$ factor, which scales every score equally and therefore does not change ranking):

$$\text{score} = w \cdot \frac{f}{f + K}$$

Rewritten form:

$$\text{score} = w - \frac{w}{1 + f / K}$$

where $w = \text{boost} \times \text{IDF}$ and $K = k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)$.

This rewriting avoids directly evaluating the quotient $f / (f + K)$, which can lose precision when $f \gg K$; the rewritten form keeps the score strictly increasing in $f$ under floating-point rounding.
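A quick numerical check confirms that the two forms agree; score_standard and score_rewritten are hypothetical standalone versions of the scorer's arithmetic:

```cpp
// Standard form: w * f / (f + K).
float score_standard(float w, float f, float K) {
    return w * f / (f + K);
}

// Rewritten form used by the scorer: w - w / (1 + f * (1/K)).
float score_rewritten(float w, float f, float K) {
    float inv_norm = 1.0f / K;
    return w - w / (1.0f + f * inv_norm);
}
```

Both forms stay strictly below the weight $w$, which is what makes $w$ usable as a WAND upper bound later in the chapter.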
19.3.4 Statistics Collection
Scoring requires collection-level statistics:
struct IndexStatsSnapshot {
std::string field;
int64_t total_doc_count; // Total documents in collection
int64_t total_doc_size; // Total size of all documents
int64_t doc_count; // Documents containing this field
int64_t doc_size; // Size of documents with field
int64_t sum_term_freq; // Total tokens in field
int64_t sum_doc_freq; // Sum of unique term counts
};
struct TermStatsSnapshot {
Term term;
int64_t doc_freq; // Documents containing term
int64_t total_term_freq; // Total term occurrences
};
Average document length is computed per field as:

$$\text{avgdl} = \frac{\text{sum\_term\_freq}}{\text{doc\_count}}$$
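As a sketch, the computation is a single division; average_doc_length is an illustrative helper, and the zero-guard is an assumption for empty fields, not Cognica's documented behavior:

```cpp
#include <cstdint>

// Average document length for a field: total tokens / documents with field.
float average_doc_length(int64_t sum_term_freq, int64_t doc_count) {
    if (doc_count <= 0) return 1.0f;  // assumed fallback for empty fields
    return static_cast<float>(sum_term_freq) / static_cast<float>(doc_count);
}
```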
19.4 Multi-Term Query Scoring
19.4.1 Conjunction (AND) Queries
For queries requiring all terms to match, scores are combined differently based on scorer type:
Non-Probabilistic (BM25): per-term scores are summed over all query terms:

$$\text{score}(q, d) = \sum_{t \in q} \text{score}(t, d)$$
For probabilistic scoring with Bayesian BM25, scores are combined using a log-odds conjunction framework described in Chapter 20.
class ConjunctionScorer {
std::vector<ScorerType> scorers_;
public:
auto score() const -> float {
// Simple sum for BM25
float total = 0.0f;
for (const auto& scorer : scorers_) {
total += scorer.score();
}
return total;
}
};
19.4.2 Disjunction (OR) Queries
For queries where any term match suffices:
Non-Probabilistic (BM25): only matching terms contribute to the sum:

$$\text{score}(q, d) = \sum_{t \in q \,:\, f(t, d) > 0} \text{score}(t, d)$$
For probabilistic disjunction with Bayesian BM25, see Chapter 20.
class DisjunctionScorer {
std::vector<ScorerType> scorers_;
public:
auto score() const -> float {
// Sum of matching scores for BM25
float total = 0.0f;
for (const auto& scorer : scorers_) {
if (scorer.matches_current_doc()) {
total += scorer.score();
}
}
return total;
}
};
19.5 Score Normalization Utilities
19.5.1 Softmax Normalization
Converts scores to a probability distribution:
$$P_i = \frac{\exp(s_i / T)}{\sum_j \exp(s_j / T)}$$

where $T$ is the temperature parameter.
auto compute_softmax(const std::vector<float>& scores, float temperature = 1.0f)
-> std::vector<float> {
// Numerical stability: subtract max
float max_score = *std::max_element(scores.begin(), scores.end());
std::vector<float> exp_scores(scores.size());
float sum = 0.0f;
for (size_t i = 0; i < scores.size(); ++i) {
exp_scores[i] = std::exp((scores[i] - max_score) / temperature);
sum += exp_scores[i];
}
for (auto& s : exp_scores) {
s /= sum;
}
return exp_scores;
}
Temperature Effects:
- $T \to 0$: Winner-take-all (hard max)
- $T = 1$: Standard softmax
- $T \to \infty$: Uniform distribution
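These effects can be verified with the compute_softmax function above; here it is repeated as a self-contained unit with the headers it needs:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<float> compute_softmax(const std::vector<float>& scores,
                                   float temperature = 1.0f) {
    // Numerical stability: subtract the maximum before exponentiating.
    float max_score = *std::max_element(scores.begin(), scores.end());
    std::vector<float> out(scores.size());
    float sum = 0.0f;
    for (size_t i = 0; i < scores.size(); ++i) {
        out[i] = std::exp((scores[i] - max_score) / temperature);
        sum += out[i];
    }
    for (auto& s : out) s /= sum;
    return out;
}
```

For scores {3, 1}, a temperature of 0.1 pushes essentially all mass onto the higher score, while a temperature of 100 leaves the distribution nearly uniform.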
19.5.2 Min-Max Normalization
Linear scaling to a target range:
auto compute_min_max_norm(const std::vector<float>& scores,
float target_min = 0.0f,
float target_max = 1.0f) -> std::vector<float> {
float min_score = *std::min_element(scores.begin(), scores.end());
float max_score = *std::max_element(scores.begin(), scores.end());
float range = max_score - min_score;
std::vector<float> normalized(scores.size());
for (size_t i = 0; i < scores.size(); ++i) {
if (range > 0) {
normalized[i] = (scores[i] - min_score) / range *
(target_max - target_min) + target_min;
} else {
normalized[i] = (target_min + target_max) / 2.0f;
}
}
return normalized;
}
19.5.3 Sigmoid Transform
Maps any score to $(0, 1)$:
auto compute_sigmoid_transform(const std::vector<float>& scores)
-> std::vector<float> {
std::vector<float> transformed(scores.size());
for (size_t i = 0; i < scores.size(); ++i) {
transformed[i] = 1.0f / (1.0f + std::exp(-scores[i]));
}
return transformed;
}
19.6 WAND Optimization Support
19.6.1 Upper Bound Computation
WAND (Weak AND) query evaluation requires score upper bounds for early termination (covered in Chapter 25). BM25 provides tight upper bounds:
$$\text{score}(t, d) = w - \frac{w}{1 + f \cdot \text{inv\_norm}} < w$$

This is the maximum possible score, approached when $f(t, d) \to \infty$ and $|d| \to 0$:

$$\lim_{f \to \infty} \left( w - \frac{w}{1 + f / K} \right) = w$$

For practical purposes, the weight $w = \text{boost} \times \text{IDF}$ serves as the upper bound.
class BM25Similarity {
public:
auto get_bm25_params() const -> std::optional<std::pair<float, float>> {
return std::make_pair(k1_, b_);
}
};
// In WAND evaluator
float upper_bound = weight_; // boost * IDF
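The pruning decision itself is a sum-and-compare. The sketch below is an illustrative stand-in for the WAND evaluator covered in Chapter 25, with can_skip as a hypothetical helper:

```cpp
#include <vector>

// WAND-style pruning check: a document can be skipped when the sum of
// the per-term upper bounds (each term's weight w = boost * IDF) cannot
// reach the current top-k score threshold.
bool can_skip(const std::vector<float>& term_upper_bounds, float threshold) {
    float max_possible = 0.0f;
    for (float w : term_upper_bounds) max_possible += w;
    return max_possible <= threshold;
}
```

Because every per-term BM25 score is strictly below its weight, skipping such a document can never drop a true top-$k$ result.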
19.6.2 Bayesian BM25 and WAND
Bayesian BM25 (Chapter 20) preserves relative ordering, so WAND optimization remains valid. The monotonic sigmoid transformation ensures that pruning decisions based on BM25 upper bounds remain safe. Chapter 20 provides the detailed analysis of modified upper bounds under Bayesian scoring.
19.7 Performance Characteristics
19.7.1 Computational Complexity
| Operation | Complexity | Notes |
|---|---|---|
| IDF computation | $O(1)$ | Single log operation |
| Term score | $O(1)$ | Constant arithmetic |
| Multi-term query | $O(\lvert q \rvert)$ | Linear in the number of query terms |
19.7.2 Thread Safety
- Similarity objects: Immutable after construction
- SimScorer objects: Stateless scoring — safe for concurrent use from multiple threads
19.7.3 Memory Efficiency
Type-erased pointers use inline storage:
using SimilarityType = te::poly<Similarity, te::local_storage<32>>;
using SimScorerType = te::poly<SimScorer, te::local_storage<32>>;
Both BM25Similarity and BayesianBM25Similarity fit within 32 bytes, avoiding heap allocation.
19.8 Practical Considerations
19.8.1 Parameter Tuning
BM25 Parameters:
| Parameter | Default | Typical Range | Effect |
|---|---|---|---|
| $k_1$ | 1.2 | 0.5 - 2.0 | Higher = more weight on term frequency |
| $b$ | 0.75 | 0.0 - 1.0 | Higher = more length normalization |
Tuning Guidelines:
- Short documents (tweets, titles): Lower $b$ (0.3 - 0.5)
- Long documents (articles, books): Higher $b$ (0.75 - 0.9)
- Verbose queries: Lower $k_1$ (0.5 - 1.0)
- Keyword queries: Higher $k_1$ (1.2 - 2.0)
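These guidelines can be sanity-checked numerically. The sketch below uses a standalone textbook-form BM25 function (a hypothetical helper, not Cognica code) to show that higher $b$ penalizes a long document more, and higher $k_1$ widens the gap between low and high term frequencies:

```cpp
// Textbook single-term BM25 with explicit k1 and b, for parameter
// experiments only.
float bm25(float tf, float dl, float avgdl, float idf, float k1, float b) {
    float K = k1 * ((1.0f - b) + b * dl / avgdl);
    return idf * tf * (k1 + 1.0f) / (tf + K);
}
```

With $b = 0.9$, a document four times the average length scores noticeably lower than the same document under $b = 0.3$; with $k_1 = 2.0$, a five-fold frequency difference changes the score far more than under $k_1 = 0.5$.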
19.8.2 Choosing Between BM25 and Bayesian BM25
For hybrid search, multi-signal fusion, or threshold-based filtering, Bayesian BM25 (Chapter 20) provides calibrated probabilities in $[0, 1]$. Standard BM25 is sufficient for pure text search or when only relative ranking matters.
19.8.3 Similarity Factory
auto SimilarityFactory::create(std::string_view type) -> SimilarityType {
if (type == "bayesian-bm25" || type == "BayesianBM25Similarity") {
return BayesianBM25Similarity{};
} else if (type == "bm25" || type == "BM25Similarity") {
return BM25Similarity{};
} else if (type == "boolean" || type == "BooleanSimilarity") {
return BooleanSimilarity{}; // Constant score 1.0
} else if (type == "tf-idf" || type == "TFIDFSimilarity") {
return TFIDFSimilarity{};
}
return BM25Similarity{}; // Default
}
19.9 Summary
BM25 provides the mathematical foundation for ranking documents in full-text search. The key concepts covered in this chapter are:
BM25 Formula: Combines term frequency, inverse document frequency, and document length normalization into a principled scoring function. The $k_1$ parameter controls TF saturation while $b$ controls length normalization.
IDF Computation: The Robertson-Sparck Jones formula assigns higher weights to rare terms, with smoothing to ensure positive values for all terms.
Term Frequency Saturation: BM25's sublinear TF component prevents excessive repetition from dominating scores, with the saturation rate controlled by $k_1$.
Numerical Stability: The score formula is rewritten from $w \cdot \frac{f}{f + K}$ to $w - \frac{w}{1 + f/K}$ to avoid precision loss when $f \gg K$.
WAND Support: BM25 provides tight score upper bounds for early termination in top-$k$ retrieval (Chapter 25).
Multi-Term Scoring: Conjunction queries sum per-term BM25 scores, while disjunction queries sum scores from matching terms.
The next chapter introduces Bayesian BM25, which transforms unbounded BM25 scores into calibrated probabilities for principled multi-signal fusion in hybrid search systems.
References
- Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval.
- Robertson, S. E., & Walker, S. (1994). Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval. SIGIR.
- Sparck Jones, K. (1972). A Statistical Interpretation of Term Specificity and its Application in Retrieval. Journal of Documentation.
- Lv, Y., & Zhai, C. (2011). Lower-Bounding Term Frequency Normalization. CIKM.
- Trotman, A., Puurula, A., & Burgess, B. (2014). Improvements to BM25 and Language Models Examined. ADCS.