Chapter 7: Inverted Index Architecture
This chapter explores the inverted index, the fundamental data structure enabling full-text search. We examine how Cognica implements posting lists, the clustered term index optimization that reduces key counts by up to 62,500x, and the text analysis pipeline that transforms documents into searchable terms.
7.1 Information Retrieval Fundamentals
Full-text search differs fundamentally from structured queries. Rather than matching exact values, text search must handle:
- Vocabulary mismatch: Users search "car" but documents contain "automobile"
- Ranking: Multiple documents match; which is most relevant?
- Linguistic variation: "running", "runs", "ran" should match "run"
- Scale: Billions of documents, millions of unique terms
7.1.1 The Term-Document Matrix
The conceptual foundation of text retrieval is the term-document matrix M, where M[t][d] = 1 if term t appears in document d, and M[t][d] = 0 otherwise.
Problem: For T terms and D documents, this matrix has T x D entries - far too large to store.
Observation: The matrix is extremely sparse. A typical document contains 100-1000 unique terms out of millions possible. Sparsity is often 99.99%+.
7.1.2 Inverted Index as Sparse Representation
The inverted index stores only non-zero entries, organized by term:
Each term maps to a posting list - the set of documents containing that term.
Example:
| Term | Posting List |
|---|---|
| "database" | {1, 5, 12, 47, 103, ...} |
| "query" | {1, 12, 89, 156, ...} |
| "optimization" | {5, 47, 89, 201, ...} |
Query Processing:
To find documents containing "database AND query", intersect the two posting lists:
{1, 5, 12, 47, 103, ...} ∩ {1, 12, 89, 156, ...} = {1, 12, ...}
Set intersection over sorted posting lists replaces a scan over matrix rows, achieving dramatic speedup.
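The intersection step can be sketched in a few lines of C++. This is an illustrative standalone function using the standard library, not Cognica's actual implementation:

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

using DocID = uint64_t;

// Intersect two sorted posting lists, as a boolean AND query would.
std::vector<DocID> intersect(const std::vector<DocID>& a,
                             const std::vector<DocID>& b) {
    std::vector<DocID> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}
```

Because both lists are kept sorted by DocID, the intersection runs in linear time in the combined list lengths.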
7.1.3 Posting List Complexity
| Operation | Full Matrix | Inverted Index |
|---|---|---|
| Storage | O(T x D) | O(non-zero postings) |
| Term lookup | O(D) row scan | O(log T) dictionary + O(df) postings |
| AND query | O(D) | O(|P1| + |P2|) |
| OR query | O(D) | O(|P1| + |P2|) |
The inverted index transforms text search from infeasible to practical.
7.2 Posting List Structure
Cognica's posting lists store rich metadata beyond simple document identifiers, enabling advanced ranking and phrase queries.
7.2.1 Posting Entry Components
Each posting entry contains:
struct PostingEntry {
DocID doc_id; // Document identifier
std::string pointer; // External document reference
int32_t term_freq; // Occurrences in document
int32_t term_count; // Total terms in field
float field_norm; // Length normalization factor
Positions positions; // Term positions for phrases
Offsets offsets; // Byte offsets for highlighting
};
Field Descriptions:
| Field | Purpose | Used For |
|---|---|---|
doc_id | Internal identifier | Index lookups |
pointer | External key (e.g., primary key) | Result retrieval |
term_freq | Count in this document | TF-IDF, BM25 scoring |
term_count | Total tokens in field | Length normalization |
field_norm | Pre-computed | BM25 parameter |
positions | Token positions [0, 5, 12, ...] | Phrase queries |
offsets | Character/byte ranges | Snippet highlighting |
7.2.2 Positions for Phrase Queries
Positions enable phrase matching by recording where each term appears:
Document: "The quick brown fox jumps over the lazy dog"
| Term | Positions |
|---|---|
| "the" | [0, 6] |
| "quick" | [1] |
| "brown" | [2] |
| "fox" | [3] |
| "jumps" | [4] |
| "over" | [5] |
| "lazy" | [7] |
| "dog" | [8] |
Phrase Query: "quick brown fox"
Check if positions are consecutive:
- "quick" at position 1
- "brown" at position 2 (= 1 + 1)
- "fox" at position 3 (= 2 + 1)
Match confirmed - all positions are adjacent.
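The adjacency check above can be expressed as a small function over per-term position lists. This is a hypothetical sketch (sorted positions assumed, one list per phrase term in order), not the engine's actual phrase matcher:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Return true if, for some occurrence of the first term at position
// `base`, term i of the phrase occurs at position base + i - i.e. the
// terms appear consecutively. Each inner vector is sorted.
bool phrase_match(const std::vector<std::vector<int32_t>>& positions) {
    if (positions.empty()) return false;
    for (int32_t base : positions[0]) {
        bool ok = true;
        for (size_t i = 1; i < positions.size() && ok; ++i) {
            const auto& p = positions[i];
            ok = std::binary_search(p.begin(), p.end(),
                                    base + static_cast<int32_t>(i));
        }
        if (ok) return true;
    }
    return false;
}
```

For "quick brown fox" with positions {1}, {2}, {3}, the check succeeds at base 1.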
7.2.3 Offsets for Highlighting
Offsets enable precise text highlighting without re-tokenizing:
struct Offset {
uint32_t begin; // Start position (byte or char)
uint32_t end; // End position
uint32_t size; // Length (for validation)
};
Example for "quick" in "The quick brown fox":
| Type | Begin | End |
|---|---|---|
| Character | 4 | 9 |
| Byte (UTF-8) | 4 | 9 |
For multi-byte characters, byte and character offsets differ, requiring both for correct highlighting in UTF-8 text.
7.2.4 Aggregate Posting List
The complete posting list aggregates all entries for a term:
struct Postings {
std::vector<DocID> doc_ids;
std::vector<std::string> pointers;
std::vector<int32_t> term_freqs;
std::vector<int32_t> term_counts;
std::vector<float> field_norms;
std::vector<Positions> positions;
std::vector<Offsets> offsets_bytes;
};
This structure-of-arrays layout optimizes cache utilization when scanning only specific fields (e.g., doc_ids for boolean queries, term_freqs for scoring).
7.3 The Clustered Term Index
Traditional inverted indexes suffer from key explosion in LSM-tree storage. Cognica's clustered term index addresses this through document clustering, achieving 62,500x key reduction.
7.3.1 The Key Explosion Problem
In a traditional inverted index stored in an LSM-tree, each posting entry is a separate key:
key = (field, term, doc_id) -> posting entry
For a corpus with:
- 10 million documents
- 1000 unique terms per document
- 100,000 matching documents per query term
A single 5-term query might require 5 x 100,000 = 500,000 key reads.
Each key requires a separate RocksDB lookup, overwhelming the storage layer.
7.3.2 Clustering Solution
The clustered term index groups documents into clusters of 65,536 (2^16) consecutive DocIDs:
cluster_id = doc_id / 65,536 (equivalently, doc_id >> 16)
All posting entries within a cluster are stored in a single key-value pair:
key = (field, term, cluster_id) -> [entry, entry, ...]
7.3.3 Key Reduction Analysis
Traditional Index: one key per posting entry.
For 100,000 postings per term across 5 query terms: 500,000 keys.
Clustered Index: one key per (term, cluster) pair.
For 100,000 postings spread across 8 clusters per term: 8 keys per term, 40 keys total.
Reduction Factor:
500,000 / 40 = 12,500x
In practice, with locality (documents indexed together have similar doc_ids), postings concentrate in even fewer clusters and the reduction reaches 62,500x.
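The arithmetic behind these figures can be checked directly. The helper names below are illustrative, not part of Cognica's API:

```cpp
#include <cstdint>

constexpr uint64_t kClusterSize = 65536;  // 2^16 documents per cluster

// Keys read by a traditional index: one per posting entry.
uint64_t traditional_keys(uint64_t terms, uint64_t postings_per_term) {
    return terms * postings_per_term;
}

// Keys read by the clustered index: one per (term, cluster) pair.
uint64_t clustered_keys(uint64_t terms, uint64_t clusters_per_term) {
    return terms * clusters_per_term;
}

// Minimum clusters needed to hold a term's postings when doc_ids
// are fully sequential: ceil(postings / 65,536).
uint64_t clusters_spanned(uint64_t postings) {
    return (postings + kClusterSize - 1) / kClusterSize;
}
```

With perfect locality, 100,000 postings span only 2 clusters; the 8-clusters-per-term figure in the example allows for some doc_id scatter.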
7.3.4 Clustered Key Format
The key encoding uses a three-layer structure: the field identifier, then the term bytes, then the cluster ID:
[field | term | cluster_id]
This hierarchical structure enables:
- Prefix iteration: Scan all clusters for a term
- Direct lookup: Jump to specific cluster
- Range scans: Efficient cluster range queries
7.3.5 Clustered Value Encoding
Each cluster value stores the posting entries for every matching document in the cluster, concatenated in DocID order.
Gap Encoding for Offsets:
Offsets within a cluster are stored as deltas from the previous value. For example, begin offsets [100, 250, 400] encode as [100, 150, 150].
Since offsets are sorted, deltas are small positive integers that compress well with variable-length encoding.
7.3.6 Performance Comparison
| Metric | Traditional | Clustered | Improvement |
|---|---|---|---|
| Keys per query | 500,000 | 8 | 62,500x |
| Read latency | 45 ms | 13 ms | 3.5x |
| Write latency | 120 ms | 35 ms | 3.4x |
| Storage overhead | 1.0x | 1.05x | ~5% increase |
The storage overhead comes from cluster metadata, but the dramatic reduction in key operations more than compensates.
7.4 Posting List Encoding
Efficient encoding minimizes storage and I/O while enabling fast decoding.
7.4.1 Variable-Length Integer Encoding
Small integers are common in posting lists. Variable-length encoding uses fewer bytes for smaller values:
| Value Range | Bytes |
|---|---|
| 0 - 127 | 1 |
| 128 - 16,383 | 2 |
| 16,384 - 2,097,151 | 3 |
Encoding Algorithm:
encode_varint(n):
while n >= 128:
emit(n & 0x7F | 0x80) // Low 7 bits + continuation flag
n >>= 7
emit(n) // Final byte (no continuation)
7.4.2 Gap Encoding for Positions
Positions are strictly increasing, so deltas (gaps) are always positive and typically small:
Original positions: [0, 5, 12, 18, 25, 33]
Gap-encoded: [0, 5, 7, 6, 7, 8]
Compression Analysis:
| Encoding | Bytes for example |
|---|---|
| Raw int32 | 24 bytes |
| Gap + varint | 6 bytes |
| Compression | 4x |
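The gap transform and its inverse are a simple prefix-sum pair; a minimal sketch:

```cpp
#include <cstdint>
#include <vector>

// Convert sorted positions to gaps (the first value is kept as-is,
// treated as a delta from 0).
std::vector<int32_t> to_gaps(const std::vector<int32_t>& positions) {
    std::vector<int32_t> gaps;
    int32_t prev = 0;
    for (int32_t p : positions) {
        gaps.push_back(p - prev);
        prev = p;
    }
    return gaps;
}

// Rebuild positions from gaps by prefix-summing.
std::vector<int32_t> from_gaps(const std::vector<int32_t>& gaps) {
    std::vector<int32_t> positions;
    int32_t acc = 0;
    for (int32_t g : gaps) {
        acc += g;
        positions.push_back(acc);
    }
    return positions;
}
```

Applied to the example positions [0, 5, 12, 18, 25, 33], this yields the gaps [0, 5, 7, 6, 7, 8] shown above.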
7.4.3 Offset Encoding
Offsets encode as (begin_delta, length) pairs:
void encode_offsets(const Offsets& offsets) {
encode_varint(offsets.size());
uint32_t prev_end = 0;
for (const auto& offset : offsets) {
encode_varint(offset.begin - prev_end); // Gap from previous end
encode_varint(offset.end - offset.begin); // Length
encode_varint(offset.size); // Redundant but useful
prev_end = offset.end;
}
}
This exploits the fact that tokens are typically adjacent or near-adjacent in text.
7.4.4 Complete Entry Encoding
A full posting entry encodes as a small fixed header (doc_id, term_freq, term_count, field_norm) followed by varint-encoded position gaps and offset deltas.
Typical Compression:
For a posting entry with 3 positions and 3 offsets:
| Component | Raw Size | Encoded Size |
|---|---|---|
| Metadata | 16 bytes | 6 bytes |
| Positions | 12 bytes | 4 bytes |
| Offsets | 36 bytes | 10 bytes |
| Total | 64 bytes | 20 bytes |
Overall compression: 3.2x
7.5 Text Analysis Pipeline
The analysis pipeline transforms raw text into indexable terms through a series of transformations.
7.5.1 Pipeline Architecture
Raw Text -> Character Filters -> Tokenizer -> Token Filters -> Indexed Terms
Each stage is configurable per field, and the stages compose into named analyzers (Section 7.5.6).
7.5.2 Character Filters
Character filters preprocess text before tokenization:
HTML Stripping: "<p>Hello <b>world</b></p>" -> "Hello world"
Unicode Normalization (NFC/NFD): composed and decomposed forms of the same character (e.g., "é" as one code point vs. "e" plus a combining accent) normalize to a single canonical form
Pattern Replacement: regex-based substitution applied before tokenization, e.g., collapsing runs of whitespace or rewriting product codes
7.5.3 Tokenizers
Tokenizers split text into individual tokens:
Standard Tokenizer (Unicode word boundaries):
"The quick-brown fox" -> ["The", "quick", "brown", "fox"]
Whitespace Tokenizer (simple splitting):
"The quick-brown fox" -> ["The", "quick-brown", "fox"]
N-gram Tokenizer (character n-grams):
"fox" with n=2 -> ["fo", "ox"]
Path Hierarchy Tokenizer (file paths):
"/usr/local/bin" -> ["/usr", "/usr/local", "/usr/local/bin"]
ICU Tokenizer (internationalization):
Handles CJK languages, Thai, and other scripts without whitespace word boundaries.
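A toy tokenizer makes the splitting behavior concrete. This ASCII-only approximation splits on any non-alphanumeric character; real standard tokenization follows Unicode word boundaries (UAX #29, typically via ICU) and is far more involved:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Rough sketch of a standard-style tokenizer: emit maximal runs of
// alphanumeric characters, dropping everything else.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            current.push_back(static_cast<char>(c));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}
```

On "The quick-brown fox" this produces the four tokens shown for the standard tokenizer above.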
7.5.4 Token Structure
Each token carries metadata:
struct Token {
TokenType type; // String, numeric, date, vector
std::string value; // Processed token text
std::string original; // Pre-normalization text
int32_t position; // Position in field
Offset offset_chars; // Character boundaries
Offset offset_bytes; // Byte boundaries (UTF-8)
};
The original value enables highlighting even after normalization transforms the text.
7.5.5 Token Filters
Token filters transform individual tokens:
Lowercase Filter:
"Hello" -> "hello"
ASCII Folding (accent removal):
"café" -> "cafe"
Stop Word Filter (remove common words):
"the quick fox" -> "quick fox"
Snowball Stemmer (reduce to root form):
"running" -> "run"
Edge N-gram Filter (prefix indexing):
"search" -> ["s", "se", "sea", "sear", ...]
Double Metaphone (phonetic matching):
"Smith" and "Smyth" -> the same phonetic code
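Filters chain together, each consuming the previous filter's token stream. A minimal sketch of a lowercase filter followed by a stop-word filter (the stop list here is a tiny illustrative subset, not a real analyzer's list):

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

// Apply lowercase, then drop stop words, over a token stream.
std::vector<std::string> filter_tokens(std::vector<std::string> tokens) {
    static const std::unordered_set<std::string> kStopWords = {
        "the", "a", "an", "of", "and"};
    std::vector<std::string> out;
    for (auto& t : tokens) {
        std::transform(t.begin(), t.end(), t.begin(),
                       [](unsigned char c) { return std::tolower(c); });
        if (!kStopWords.count(t)) out.push_back(t);
    }
    return out;
}
```

Order matters: lowercasing must run before the stop-word check so that "The" matches the lowercase stop list.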
7.5.6 Analyzer Composition
Analyzers combine components into reusable pipelines:
Standard Analyzer:
analyzer:
type: standard
char_filters: []
tokenizer: standard
token_filters:
- lowercase
- stop_words
Custom Analyzer:
analyzer:
type: custom
char_filters:
- html_strip
tokenizer: standard
token_filters:
- lowercase
- ascii_folding
- snowball:
language: english
7.5.7 Index vs Query Analysis
Different analyzers can be used at index and query time:
Index Time: Full normalization - "Running quickly" is indexed as ["run", "quick"].
Query Time: The query analyzer applies the same normalization - "runs" is rewritten to "run" before lookup.
The same stemmer ensures "runs" matches documents containing "running".
7.6 Document ID Management
The DocID index maintains bidirectional mappings between external document references and internal numeric identifiers.
7.6.1 DocID Structure
using DocID = uint64_t;
constexpr DocID kInvalidDocID = std::numeric_limits<DocID>::max();
DocIDs are 64-bit unsigned integers, supporting up to 2^64 - 1 documents per index (the maximum value is reserved as kInvalidDocID).
7.6.2 DocID Index Operations
class DocIDIndex {
public:
// Forward mapping: pointer -> DocID
auto get_doc_id(const std::string& pointer) -> std::optional<DocID>;
// Reverse mapping: DocID -> pointer
auto get_pointer(DocID doc_id) -> std::optional<std::string>;
// Allocation
auto allocate(const std::string& pointer) -> DocID;
// Deletion
void remove(DocID doc_id);
void remove(const std::string& pointer);
};
7.6.3 Cluster Assignment
DocIDs are assigned sequentially within each collection, ensuring documents indexed together share clusters: ids 0-65,535 fall in cluster 0, ids 65,536-131,071 in cluster 1, and so on.
This locality maximizes cluster efficiency - a batch insert of 100,000 documents touches only ceil(100,000 / 65,536) = 2 clusters per term.
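Assuming the cluster ID is simply the high bits of the DocID (consistent with clusters of 2^16 documents, though the exact bit layout is an assumption here), assignment reduces to a shift:

```cpp
#include <cstdint>

using DocID = uint64_t;
constexpr DocID kClusterBits = 16;  // clusters of 2^16 = 65,536 docs

// Cluster id is the high bits of the DocID, so sequentially
// allocated ids land in the same cluster.
DocID cluster_id(DocID doc_id) { return doc_id >> kClusterBits; }
```

DocIDs 0 through 65,535 map to cluster 0; a batch of 100,000 sequential ids spans only clusters 0 and 1.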
7.6.4 DocID Reuse
Deleted DocIDs can be reused to prevent unbounded growth:
class DocIDGenerator {
std::queue<DocID> free_list_;
DocID next_id_;
public:
DocID allocate() {
if (!free_list_.empty()) {
auto id = free_list_.front();
free_list_.pop();
return id;
}
return next_id_++;
}
void release(DocID id) {
free_list_.push(id);
}
};
7.7 Index Statistics
Statistics enable cost-based query optimization and relevance scoring.
7.7.1 Collection Statistics
enum class IndexStatsType {
kDocCount, // Total documents in collection
kDocSize, // Total bytes of documents
kFieldCount, // Documents containing field
kFieldSize, // Bytes in field across all docs
kFieldTermCount, // Posting entries for field
kFieldTermSize, // Bytes of postings for field
kTermCount, // Unique terms
kTermSize, // Bytes of term dictionary
kTokenCount, // Total token occurrences
kTokenSize, // Bytes of tokens
};
7.7.2 Term Statistics
struct TermStatistics {
int64_t doc_freq; // Documents containing term
int64_t term_freq; // Total occurrences across collection
};
Document Frequency (df): Used for IDF calculation in TF-IDF and BM25.
Term Frequency (tf): Used for collection-wide statistics.
7.7.3 Statistics Storage
Statistics are stored in a separate namespace with atomic counters:
class ShardedInt64CounterMap {
static constexpr size_t kShardCount = 64;
std::array<std::unordered_map<Key, std::atomic<int64_t>>, kShardCount> shards_;
public:
void increment(const Key& key, int64_t delta) {
auto& shard = shards_[hash(key) % kShardCount];
shard[key].fetch_add(delta, std::memory_order_relaxed);
}
int64_t get(const Key& key) const {
auto& shard = shards_[hash(key) % kShardCount];
auto it = shard.find(key);
return it != shard.end() ? it->second.load() : 0;
}
};
Sharding prevents contention during concurrent indexing.
7.7.4 Average Field Length
BM25 requires the average field length for normalization:
avg_field_len = total_tokens_in_field / docs_containing_field (i.e., kTokenCount / kFieldCount)
Computed incrementally: both counters are updated on every insert and delete, so the average is derived on demand without rescanning the collection.
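The incremental computation can be sketched with two counters, mirroring the kTokenCount / kFieldCount statistics (the mapping to those counters is an assumption; the struct below is illustrative):

```cpp
#include <cstdint>

// Maintain average field length incrementally from two counters.
struct FieldLengthStats {
    int64_t token_count = 0;  // total tokens across all docs (kTokenCount)
    int64_t doc_count = 0;    // docs containing the field (kFieldCount)

    void on_index(int64_t field_len) {
        token_count += field_len;
        doc_count += 1;
    }
    void on_delete(int64_t field_len) {
        token_count -= field_len;
        doc_count -= 1;
    }
    double average() const {
        return doc_count > 0
                   ? static_cast<double>(token_count) / doc_count
                   : 0.0;
    }
};
```

No rescan is ever needed: each document insert or delete adjusts both counters in O(1).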
7.8 Fast Update Architecture
The pending list system batches index updates to reduce write amplification.
7.8.1 Write Amplification Problem
Without batching, each unique term in a document generates a separate RocksDB merge operation against the clustered term index.
For a document with 500 unique terms: 500 merge operations.
7.8.2 Pending List Solution
The pending list accumulates updates in memory and writes them as a few large batched keys; a background cleanup process (Section 7.8.5) later folds them into the clustered index.
7.8.3 Pending Entry Format
struct PendingEntry {
enum Operation { kAdd, kRemove };
Operation op;
std::string field;
std::string term_value;
ClusteredPostingEntry entry;
};
Entries are grouped by (field, term_value) before flushing.
7.8.4 Flush Trigger Conditions
struct FTSPendingConfig {
bool fast_update = true;
size_t batch_size_threshold = 64 * 1024; // 64 KB
size_t pending_list_limit = 16 * 1024 * 1024; // 16 MB
size_t pending_entry_limit = 500'000;
size_t cleanup_batch_size = 100'000;
};
Flush occurs when:
- Batch size exceeds 64 KB
- Total pending data exceeds 16 MB
- Entry count exceeds 500,000
- Explicit flush requested
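The trigger logic reduces to a threshold check against the FTSPendingConfig values. The state struct and function below are a hypothetical sketch of that decision, not the engine's actual code:

```cpp
#include <cstddef>

// Snapshot of pending-list accumulation.
struct PendingState {
    size_t batch_bytes;   // bytes in the current batch
    size_t total_bytes;   // bytes across all pending batches
    size_t entry_count;   // total pending entries
};

// Flush when any threshold from FTSPendingConfig is exceeded.
bool should_flush(const PendingState& s) {
    return s.batch_bytes > 64 * 1024 ||         // batch > 64 KB
           s.total_bytes > 16 * 1024 * 1024 ||  // pending > 16 MB
           s.entry_count > 500000;              // too many entries
}
```

An explicit flush request bypasses this check entirely.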
7.8.5 Background Cleanup
A background process merges pending entries into the main index:
cleanup():
1. Scan pending batches in order
2. Decompress and decode entries
3. Group by (field, term_value)
4. For each group:
- Merge add/remove operations
- Issue single merge to clustered index
5. Delete processed pending keys
6. Update statistics
7.8.6 Write Amplification Reduction
| Scenario | Without Batching | With Batching |
|---|---|---|
| 1000 docs, 500 terms each | 500,000 merges | ~1,000 merges |
| Reduction | 1x | 500x |
The pending list dramatically reduces write amplification for bulk indexing.
7.9 Posting List Iteration
Efficient iteration over posting lists is critical for query performance.
7.9.1 Iterator Interface
class DocIDSetIterator {
public:
// Current position
virtual DocID doc_id() const = 0;
// Validity check
virtual bool is_valid() const = 0;
virtual bool has_next() const = 0;
// Movement
virtual DocID next() = 0;
virtual DocID advance(DocID target) = 0;
// Cost estimation
virtual int64_t cost() const = 0;
};
7.9.2 Advance Operation
The advance(target) operation skips to the first DocID >= target. For a posting list [2, 5, 9, 14, 20], advance(10) lands on 14, skipping 2, 5, and 9 without visiting them.
This enables efficient intersection without scanning all entries:
intersect(P1, P2):
result = []
while P1.valid() and P2.valid():
if P1.doc_id() == P2.doc_id():
result.append(P1.doc_id())
P1.next()
P2.next()
elif P1.doc_id() < P2.doc_id():
P1.advance(P2.doc_id())
else:
P2.advance(P1.doc_id())
return result
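For an in-memory sorted posting list, advance can be implemented with binary search. VectorDISI below is a hypothetical illustration of the DocIDSetIterator contract, not the production iterator:

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

using DocID = uint64_t;
constexpr DocID kNoMoreDocs = std::numeric_limits<DocID>::max();

// Minimal posting-list iterator over a sorted vector; advance()
// binary-searches forward from the current position.
class VectorDISI {
  public:
    explicit VectorDISI(std::vector<DocID> docs) : docs_(std::move(docs)) {}
    bool is_valid() const { return pos_ < docs_.size(); }
    DocID doc_id() const { return is_valid() ? docs_[pos_] : kNoMoreDocs; }
    DocID next() { ++pos_; return doc_id(); }
    DocID advance(DocID target) {
        // Skip to the first doc id >= target.
        pos_ = std::lower_bound(docs_.begin() + pos_, docs_.end(), target) -
               docs_.begin();
        return doc_id();
    }

  private:
    std::vector<DocID> docs_;
    size_t pos_ = 0;
};
```

Searching only from the current position keeps repeated advances over one list at O(log n) each while preserving forward-only iteration.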
7.9.3 Specialized Iterators
Conjunction DISI (AND queries):
Maintains multiple iterators, advancing the minimum:
class ConjunctionDISI : public DocIDSetIterator {
std::vector<std::unique_ptr<DocIDSetIterator>> iterators_;
DocID advance(DocID target) override {
while (true) {
// Advance all iterators to target
for (auto& it : iterators_) {
target = it->advance(target);
}
// Check if all converged
if (all_equal(iterators_)) {
return target;
}
// Set new target to maximum
target = max_doc_id(iterators_);
}
}
};
Phrase Match DISI:
Extends conjunction with position checking:
class PhraseMatchDISI : public DocIDSetIterator {
bool check_phrase_positions() {
// Load positions for all terms at current doc
// Check if positions are consecutive
for (size_t i = 1; i < positions_.size(); i++) {
if (!has_adjacent_position(positions_[i-1], positions_[i], i)) {
return false;
}
}
return true;
}
};
7.9.4 Lazy Position Loading
Positions are loaded only when needed:
class LazyPostingEntry {
bool positions_loaded_ = false;
Positions positions_;
const Positions& get_positions() {
if (!positions_loaded_) {
positions_ = decode_positions(raw_data_);
positions_loaded_ = true;
}
return positions_;
}
};
This optimization avoids decoding positions for documents that fail earlier filters.
7.10 Merge Operator
The clustered term index uses RocksDB merge operators for efficient concurrent updates.
7.10.1 Merge vs Put
Put semantics: put(key, value) overwrites the previous value, forcing each writer into a read-modify-write cycle.
Merge semantics: merge(key, delta) records a delta that RocksDB combines with the existing value lazily, during reads and compaction, so writers never read first.
7.10.2 Clustered Term Index Merge
For posting lists, merge combines entries from multiple documents: merging entries for docs {1, 5} with entries for docs {3, 5, 9} yields {1, 3, 5, 9}.
With deduplication for the same DocID: when both sides carry doc 5, the newer entry replaces the older one rather than appearing twice.
7.10.3 Merge Implementation
class ClusteredTermIndexMergeOperator : public MergeOperator {
bool FullMerge(
const Slice& key,
const Slice* existing_value,
const std::deque<std::string>& operands,
std::string* new_value
) override {
// Start with existing entries
auto entries = existing_value
? decode_cluster(*existing_value)
: ClusterEntries{};
// Apply each operand
for (const auto& operand : operands) {
auto delta = decode_delta(operand);
apply_delta(entries, delta);
}
// Encode result
*new_value = encode_cluster(entries);
return true;
}
};
7.10.4 Partial Merge Optimization
Partial merges combine operands without reading existing value:
bool PartialMerge(
const Slice& key,
const Slice& left_operand,
const Slice& right_operand,
std::string* new_value
) override {
// Combine two deltas without reading base value
auto left = decode_delta(left_operand);
auto right = decode_delta(right_operand);
*new_value = encode_delta(merge_deltas(left, right));
return true;
}
This reduces read amplification during compaction.
7.11 Integration with Document Store
The FTS index integrates with the document store through a unified context.
7.11.1 FTS Context
class FTSContext {
std::unique_ptr<DocIDIndex> doc_id_index_;
std::unique_ptr<TermIndex> term_index_;
std::unique_ptr<ClusteredTermIndex> clustered_term_index_;
std::unique_ptr<FacetDocValuesIndex> facet_index_;
std::unique_ptr<NumericDocValuesIndex> numeric_index_;
std::unique_ptr<QueryCache> query_cache_;
AnalyzerMap analyzer_map_; // Per-field analyzers
SimilarityMap similarity_map_; // Scoring models
FieldOptionsMap field_options_; // Index configuration
};
7.11.2 Document Indexing Flow
1. Resolve the document's pointer to a DocID via the DocID index, allocating a new id if needed
2. Run each searchable field through its configured analyzer
3. Append the resulting posting entries to the pending list
4. Flush batched entries into the clustered term index via merge operators
5. Update collection and term statistics
7.11.3 Field Options
Each field can have custom indexing options:
struct FieldOptions {
std::string analyzer; // Analyzer name
bool index_positions = true; // Store positions
bool index_offsets = true; // Store offsets
bool store_term_vectors = false;// Store per-doc term stats
float boost = 1.0f; // Field-level boost
};
7.11.4 Multi-Field Documents
Documents can have multiple searchable fields:
{
"_id": "doc1",
"title": "Database Systems",
"body": "An introduction to database internals...",
"tags": ["database", "systems"]
}
Each field is indexed separately with its own analyzer:
| Field | Analyzer | Indexed Terms |
|---|---|---|
| title | standard | ["database", "systems"] |
| body | english | ["introduct", "databas", "intern", ...] |
| tags | keyword | ["database", "systems"] |
7.12 Summary
This chapter explored the inverted index architecture that powers Cognica's full-text search capabilities. Key takeaways:
- Inverted indexes transform the sparse term-document matrix into an efficient structure where each term maps to a posting list of containing documents.
- Posting lists store rich metadata including term frequencies, positions, and offsets, enabling both boolean matching and relevance ranking.
- The clustered term index achieves 62,500x key reduction by grouping documents into clusters of 65,536, storing all postings for a cluster in a single key-value pair.
- Gap encoding and variable-length integers compress posting lists by 3-4x while maintaining fast decoding.
- The text analysis pipeline transforms raw text through character filters, tokenizers, and token filters into normalized, searchable terms.
- The pending list architecture batches index updates to reduce write amplification by up to 500x during bulk indexing.
- Merge operators enable efficient concurrent updates without read-before-write overhead.
- Iterator abstraction with advance operations enables efficient query execution through posting list intersection.
The inverted index provides the foundation for text search; the next part of this book explores query processing, where we'll see how SQL queries are parsed, optimized, and executed against this index structure.