Chapter 7: Inverted Index Architecture

This chapter explores the inverted index, the fundamental data structure enabling full-text search. We examine how Cognica implements posting lists, the clustered term index optimization that reduces key counts by up to 62,500x, and the text analysis pipeline that transforms documents into searchable terms.

7.1 Information Retrieval Fundamentals

Full-text search differs fundamentally from structured queries. Rather than matching exact values, text search must handle:

  • Vocabulary mismatch: Users search "car" but documents contain "automobile"
  • Ranking: Multiple documents match; which is most relevant?
  • Linguistic variation: "running", "runs", "ran" should match "run"
  • Scale: Billions of documents, millions of unique terms

7.1.1 The Term-Document Matrix

The conceptual foundation of text retrieval is the term-document matrix:

M_{t \times d} = \begin{bmatrix} m_{1,1} & m_{1,2} & \cdots & m_{1,d} \\ m_{2,1} & m_{2,2} & \cdots & m_{2,d} \\ \vdots & \vdots & \ddots & \vdots \\ m_{t,1} & m_{t,2} & \cdots & m_{t,d} \end{bmatrix}

where m_{i,j} = 1 if term i appears in document j, else 0.

Problem: For t = 1,000,000 terms and d = 10,000,000 documents, this matrix has 10^{13} entries - far too large to store.

Observation: The matrix is extremely sparse. A typical document contains 100-1000 unique terms out of millions possible. Sparsity is often 99.99%+.

7.1.2 Inverted Index as Sparse Representation

The inverted index stores only non-zero entries, organized by term:

\text{InvertedIndex} = \{t_i \mapsto \{d_j \mid m_{i,j} = 1\}\}

Each term maps to a posting list - the set of documents containing that term.

Example:

| Term | Posting List |
|---|---|
| "database" | {1, 5, 12, 47, 103, ...} |
| "query" | {1, 12, 89, 156, ...} |
| "optimization" | {5, 47, 89, 201, ...} |

Query Processing:

To find documents containing "database AND query":

\text{Result} = \text{Postings}(\text{"database"}) \cap \text{Postings}(\text{"query"}) = \{1, 12, \ldots\}

Set intersection replaces matrix multiplication, achieving dramatic speedup.
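A minimal sketch of this intersection over sorted posting lists (function name and types are illustrative, not Cognica's API; doc IDs are assumed sorted ascending):

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

using DocID = uint64_t;

// Intersect two sorted posting lists in O(|a| + |b|) time.
std::vector<DocID> intersect_postings(const std::vector<DocID>& a,
                                      const std::vector<DocID>& b) {
  std::vector<DocID> result;
  std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                        std::back_inserter(result));
  return result;
}
```

Running this on the example posting lists for "database" and "query" yields {1, 12}.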

7.1.3 Posting List Complexity

| Operation | Full Matrix | Inverted Index |
|---|---|---|
| Storage | O(t × d) | O(Σ_i \|P_i\|) |
| Term lookup | O(d) | O(1) dictionary + O(\|P\|) |
| AND query | O(t × d) | O(min(\|P_1\|, \|P_2\|)) |
| OR query | O(t × d) | O(\|P_1\| + \|P_2\|) |

The inverted index transforms text search from infeasible to practical.

7.2 Posting List Structure

Cognica's posting lists store rich metadata beyond simple document identifiers, enabling advanced ranking and phrase queries.

7.2.1 Posting Entry Components

Each posting entry contains:

struct PostingEntry {
  DocID doc_id;           // Document identifier
  std::string pointer;    // External document reference
  int32_t term_freq;      // Occurrences in document
  int32_t term_count;     // Total terms in field
  float field_norm;       // Length normalization factor
  Positions positions;    // Term positions for phrases
  Offsets offsets;        // Byte offsets for highlighting
};

Field Descriptions:

| Field | Purpose | Used For |
|---|---|---|
| doc_id | Internal identifier | Index lookups |
| pointer | External key (e.g., primary key) | Result retrieval |
| term_freq | Count in this document | TF-IDF, BM25 scoring |
| term_count | Total tokens in field | Length normalization |
| field_norm | Pre-computed 1/\sqrt{length} | BM25 b parameter |
| positions | Token positions [0, 5, 12, ...] | Phrase queries |
| offsets | Character/byte ranges | Snippet highlighting |

7.2.2 Positions for Phrase Queries

Positions enable phrase matching by recording where each term appears:

Document: "The quick brown fox jumps over the lazy dog"

| Term | Positions |
|---|---|
| "the" | [0, 6] |
| "quick" | [1] |
| "brown" | [2] |
| "fox" | [3] |
| "jumps" | [4] |
| "over" | [5] |
| "lazy" | [7] |
| "dog" | [8] |

Phrase Query: "quick brown fox"

Check if positions are consecutive:

  • "quick" at position 1
  • "brown" at position 2 (= 1 + 1)
  • "fox" at position 3 (= 2 + 1)

Match confirmed - all positions are adjacent.
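The adjacency check above can be sketched as follows (a minimal version; the name and shape are illustrative, and each term's position list is assumed sorted):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// positions[i] holds the sorted positions of query term i in one document.
// The phrase matches if some start position p exists such that term i
// occurs at p + i for every i.
bool phrase_match(const std::vector<std::vector<int32_t>>& positions) {
  if (positions.empty()) return false;
  for (int32_t start : positions[0]) {
    bool ok = true;
    for (size_t i = 1; i < positions.size(); ++i) {
      const auto& p = positions[i];
      if (!std::binary_search(p.begin(), p.end(),
                              start + static_cast<int32_t>(i))) {
        ok = false;
        break;
      }
    }
    if (ok) return true;
  }
  return false;
}
```

For "quick brown fox" against the document above, the position lists [1], [2], [3] are consecutive, so the phrase matches.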

7.2.3 Offsets for Highlighting

Offsets enable precise text highlighting without re-tokenizing:

struct Offset {
  uint32_t begin;   // Start position (byte or char)
  uint32_t end;     // End position
  uint32_t size;    // Length (for validation)
};

Example for "quick" in "The quick brown fox":

| Type | Begin | End |
|---|---|---|
| Character | 4 | 9 |
| Byte (UTF-8) | 4 | 9 |

For multi-byte characters, byte and character offsets differ, requiring both for correct highlighting in UTF-8 text.
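To see why both offsets are needed, here is a hypothetical helper (not part of Cognica's API) that converts a byte offset into a character offset by counting UTF-8 lead bytes:

```cpp
#include <cstdint>
#include <string>

// Count code points preceding byte_offset by skipping UTF-8
// continuation bytes (those of the form 10xxxxxx).
uint32_t utf8_char_offset(const std::string& text, uint32_t byte_offset) {
  uint32_t chars = 0;
  for (uint32_t i = 0; i < byte_offset && i < text.size(); ++i) {
    if ((static_cast<unsigned char>(text[i]) & 0xC0) != 0x80) ++chars;
  }
  return chars;
}
```

For pure ASCII the two offsets agree; for "café latte", the token "latte" starts at byte 6 but character 5, because "é" occupies two bytes.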

7.2.4 Aggregate Posting List

The complete posting list aggregates all entries for a term:

struct Postings {
  std::vector<DocID> doc_ids;
  std::vector<std::string> pointers;
  std::vector<int32_t> term_freqs;
  std::vector<int32_t> term_counts;
  std::vector<float> field_norms;
  std::vector<Positions> positions;
  std::vector<Offsets> offsets_bytes;
};

This structure-of-arrays layout optimizes cache utilization when scanning only specific fields (e.g., doc_ids for boolean queries, term_freqs for scoring).

7.3 The Clustered Term Index

Traditional inverted indexes suffer from key explosion in LSM-tree storage. Cognica's clustered term index addresses this through document clustering, achieving 62,500x key reduction.

7.3.1 The Key Explosion Problem

In a traditional inverted index stored in an LSM-tree:

\text{Key} = \text{term} \| \text{doc\_id}

For a corpus with:

  • 10 million documents
  • 1000 unique terms per document
  • Average 100,000 matching documents per query term

A single query might require:

\text{Keys Accessed} = \text{query terms} \times \text{docs per term} = 5 \times 100{,}000 = 500{,}000

Each key requires a separate RocksDB lookup, overwhelming the storage layer.

7.3.2 Clustering Solution

The clustered term index groups documents into clusters of 65,536 (2^{16}):

\text{cluster\_id}(\text{doc\_id}) = \lfloor \text{doc\_id} / 65536 \rfloor
\text{offset}(\text{doc\_id}) = \text{doc\_id} \bmod 65536

All posting entries within a cluster are stored in a single key-value pair:

\text{Key} = \text{term} \| \text{cluster\_id}
\text{Value} = \{(\text{offset}_1, \text{entry}_1), (\text{offset}_2, \text{entry}_2), \ldots\}
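Because the cluster size is a power of two, the division and modulo reduce to a shift and a mask; a small sketch (constant names are illustrative):

```cpp
#include <cstdint>

using DocID = uint64_t;

constexpr uint32_t kClusterBits = 16;                    // 2^16 = 65,536 docs per cluster
constexpr DocID kOffsetMask = (1ull << kClusterBits) - 1;

// cluster_id(doc_id) = doc_id / 65536, as a shift.
uint64_t cluster_id(DocID doc_id) { return doc_id >> kClusterBits; }

// offset(doc_id) = doc_id mod 65536, as a mask.
uint32_t cluster_offset(DocID doc_id) {
  return static_cast<uint32_t>(doc_id & kOffsetMask);
}
```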

7.3.3 Key Reduction Analysis

Traditional Index:

\text{Keys} = \sum_{t \in \text{terms}} \lvert \text{Postings}(t) \rvert

For 100,000 postings per term across 5 query terms: 500,000 keys.

Clustered Index:

\text{Keys} = \sum_{t \in \text{terms}} \lceil \lvert \text{Postings}(t) \rvert / 65536 \rceil

For 100,000 postings spread across clusters: ~8 keys per term, 40 keys total.

Reduction Factor:

\text{Reduction} = \frac{500{,}000}{40} = 12{,}500\times

In practice, with locality (documents indexed together have similar doc_ids), reduction reaches 62,500x.

7.3.4 Clustered Key Format

The key encoding uses a three-layer structure:

[Diagram: clustered key format]

This hierarchical structure enables:

  • Prefix iteration: Scan all clusters for a term
  • Direct lookup: Jump to specific cluster
  • Range scans: Efficient cluster range queries

7.3.5 Clustered Value Encoding

Each cluster value contains multiple posting entries:

[Diagram: cluster value layout]

Gap Encoding for Offsets:

Offsets within a cluster are stored as deltas from previous:

\text{encoded}_i = \text{offset}_i - \text{offset}_{i-1}

Since offsets are sorted, deltas are small positive integers that compress well with variable-length encoding.

7.3.6 Performance Comparison

| Metric | Traditional | Clustered | Improvement |
|---|---|---|---|
| Keys per query | 500,000 | 8 | 62,500x |
| Read latency | 45 ms | 13 ms | 3.5x |
| Write latency | 120 ms | 35 ms | 3.4x |
| Storage overhead | 1.0x | 1.05x | ~5% increase |

The storage overhead comes from cluster metadata, but the dramatic reduction in key operations more than compensates.

7.4 Posting List Encoding

Efficient encoding minimizes storage and I/O while enabling fast decoding.

7.4.1 Variable-Length Integer Encoding

Small integers are common in posting lists. Variable-length encoding uses fewer bytes for smaller values:

\text{bytes}(n) = \lceil \log_{128}(n + 1) \rceil

| Value Range | Bytes |
|---|---|
| 0 - 127 | 1 |
| 128 - 16,383 | 2 |
| 16,384 - 2,097,151 | 3 |

Encoding Algorithm:

encode_varint(n):
  while n >= 128:
    emit(n & 0x7F | 0x80)  // Low 7 bits + continuation flag
    n >>= 7
  emit(n)                   // Final byte (no continuation)
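The pseudocode above corresponds to standard LEB128-style varints; a runnable encode/decode pair (names illustrative):

```cpp
#include <cstdint>
#include <string>

// Emit the low 7 bits per byte; the high bit flags continuation.
void encode_varint(uint64_t n, std::string& out) {
  while (n >= 0x80) {
    out.push_back(static_cast<char>((n & 0x7F) | 0x80));
    n >>= 7;
  }
  out.push_back(static_cast<char>(n));
}

// Read bytes until one without the continuation flag, accumulating
// 7 bits at a time in little-endian group order.
uint64_t decode_varint(const std::string& in, size_t& pos) {
  uint64_t value = 0;
  int shift = 0;
  while (true) {
    uint8_t byte = static_cast<uint8_t>(in[pos++]);
    value |= static_cast<uint64_t>(byte & 0x7F) << shift;
    if (!(byte & 0x80)) break;
    shift += 7;
  }
  return value;
}
```

Encoding 127 takes one byte, 128 takes two, and 2,097,151 takes three, matching the table above.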

7.4.2 Gap Encoding for Positions

Positions are strictly increasing, so deltas (gaps) are always positive and typically small:

Original positions: [0, 5, 12, 18, 25, 33]

Gap-encoded: [0, 5, 7, 6, 7, 8]

\text{gap}_i = \text{pos}_i - \text{pos}_{i-1}

Compression Analysis:

| Encoding | Bytes for example |
|---|---|
| Raw int32 | 24 bytes |
| Gap + varint | 6 bytes |
| Compression | 4x |
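The gap transform is a simple round-trip; a sketch (names illustrative):

```cpp
#include <cstdint>
#include <vector>

// Positions are strictly increasing, so every delta is non-negative
// (the first "gap" is the first position itself).
std::vector<int32_t> gap_encode(const std::vector<int32_t>& positions) {
  std::vector<int32_t> gaps;
  int32_t prev = 0;
  for (int32_t pos : positions) {
    gaps.push_back(pos - prev);
    prev = pos;
  }
  return gaps;
}

// Decoding is a running prefix sum over the gaps.
std::vector<int32_t> gap_decode(const std::vector<int32_t>& gaps) {
  std::vector<int32_t> positions;
  int32_t acc = 0;
  for (int32_t g : gaps) {
    acc += g;
    positions.push_back(acc);
  }
  return positions;
}
```

Applying gap_encode to [0, 5, 12, 18, 25, 33] yields [0, 5, 7, 6, 7, 8], as in the example above.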

7.4.3 Offset Encoding

Offsets encode as (begin_delta, length) pairs:

void encode_offsets(const Offsets& offsets) {
  encode_varint(offsets.size());
  uint32_t prev_end = 0;

  for (const auto& offset : offsets) {
    encode_varint(offset.begin - prev_end);  // Gap from previous end
    encode_varint(offset.end - offset.begin); // Length
    encode_varint(offset.size);               // Redundant but useful
    prev_end = offset.end;
  }
}

This exploits the fact that tokens are typically adjacent or near-adjacent in text.

7.4.4 Complete Entry Encoding

A full posting entry encodes as:

[Diagram: encoded posting entry layout]

Typical Compression:

For a posting entry with 3 positions and 3 offsets:

| Component | Raw Size | Encoded Size |
|---|---|---|
| Metadata | 16 bytes | 6 bytes |
| Positions | 12 bytes | 4 bytes |
| Offsets | 36 bytes | 10 bytes |
| Total | 64 bytes | 20 bytes |

Overall compression: 3.2x

7.5 Text Analysis Pipeline

The analysis pipeline transforms raw text into indexable terms through a series of transformations.

7.5.1 Pipeline Architecture

[Diagram: analysis pipeline: character filters, then tokenizer, then token filters]

7.5.2 Character Filters

Character filters preprocess text before tokenization:

HTML Stripping:

"<p>Hello</p>""Hello"\text{"<p>Hello</p>"} \rightarrow \text{"Hello"}

Unicode Normalization (NFC/NFD):

"cafe0˘301""cafe"(combining accent removed)\text{"cafe\u0301"} \rightarrow \text{"cafe"} \quad \text{(combining accent removed)}

Pattern Replacement:

"user@example.com""user AT example DOT com"\text{"user@example.com"} \rightarrow \text{"user AT example DOT com"}

7.5.3 Tokenizers

Tokenizers split text into individual tokens:

Standard Tokenizer (Unicode word boundaries):

"The quick-brown fox" -> ["The", "quick", "brown", "fox"]

Whitespace Tokenizer (simple splitting):

"The quick-brown fox" -> ["The", "quick-brown", "fox"]

N-gram Tokenizer (character n-grams):

"fox" with n=2 -> ["fo", "ox"]

Path Hierarchy Tokenizer (file paths):

"/usr/local/bin" -> ["/usr", "/usr/local", "/usr/local/bin"]

ICU Tokenizer (internationalization):

Handles CJK languages, Thai, and other scripts without whitespace word boundaries.
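As an illustration of what any tokenizer must produce, here is a minimal whitespace tokenizer that records the token positions and byte offsets later stored in posting entries (struct and function names are illustrative, not Cognica's):

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct SimpleToken {
  std::string value;
  int32_t position;     // token index in the field
  uint32_t begin, end;  // byte offsets into the original text
};

// Split on ASCII spaces only; a real whitespace tokenizer would
// handle all Unicode whitespace.
std::vector<SimpleToken> whitespace_tokenize(const std::string& text) {
  std::vector<SimpleToken> tokens;
  size_t i = 0;
  int32_t position = 0;
  while (i < text.size()) {
    while (i < text.size() && text[i] == ' ') ++i;  // skip separators
    size_t begin = i;
    while (i < text.size() && text[i] != ' ') ++i;  // consume token
    if (i > begin) {
      tokens.push_back({text.substr(begin, i - begin), position++,
                        static_cast<uint32_t>(begin),
                        static_cast<uint32_t>(i)});
    }
  }
  return tokens;
}
```

On "The quick-brown fox" this produces three tokens, keeping "quick-brown" intact, exactly as the whitespace tokenizer example above shows.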

7.5.4 Token Structure

Each token carries metadata:

struct Token {
  TokenType type;           // String, numeric, date, vector
  std::string value;        // Processed token text
  std::string original;     // Pre-normalization text
  int32_t position;         // Position in field
  Offset offset_chars;      // Character boundaries
  Offset offset_bytes;      // Byte boundaries (UTF-8)
};

The original value enables highlighting even after normalization transforms the text.

7.5.5 Token Filters

Token filters transform individual tokens:

Lowercase Filter:

"Database""database"\text{"Database"} \rightarrow \text{"database"}

ASCII Folding (accent removal):

"cafe""cafe"\text{"cafe"} \rightarrow \text{"cafe"}

Stop Word Filter (remove common words):

["the", "quick", "fox"]["quick", "fox"]\text{["the", "quick", "fox"]} \rightarrow \text{["quick", "fox"]}

Snowball Stemmer (reduce to root form):

"running""run"\text{"running"} \rightarrow \text{"run"} "databases""databas"\text{"databases"} \rightarrow \text{"databas"}

Edge N-gram Filter (prefix indexing):

"database"["d", "da", "dat", "data", ...]\text{"database"} \rightarrow \text{["d", "da", "dat", "data", ...]}

Double Metaphone (phonetic matching):

"Smith"["SM0", "XMT"]\text{"Smith"} \rightarrow \text{["SM0", "XMT"]}

7.5.6 Analyzer Composition

Analyzers combine components into reusable pipelines:

Standard Analyzer:

analyzer:
  type: standard
  char_filters: []
  tokenizer: standard
  token_filters:
    - lowercase
    - stop_words

Custom Analyzer:

analyzer:
  type: custom
  char_filters:
    - html_strip
  tokenizer: standard
  token_filters:
    - lowercase
    - ascii_folding
    - snowball:
        language: english

7.5.7 Index vs Query Analysis

Different analyzers can be used at index and query time:

Index Time: Full normalization

"Running"index"run"\text{"Running"} \xrightarrow{\text{index}} \text{"run"}

Query Time: Match index normalization

"runs"query"run"\text{"runs"} \xrightarrow{\text{query}} \text{"run"}

The same stemmer ensures "runs" matches documents containing "running".
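To illustrate the parity requirement, here is a deliberately minimal stand-in analyzer (lowercasing only; a real chain would also run stop-word removal and a Snowball stemmer) that must be applied identically at index and query time:

```cpp
#include <cctype>
#include <string>

// Stand-in for an analyzer chain. The point is not the transform
// itself but that the SAME function runs at index and query time,
// so both sides land on the same term.
std::string analyze_term(const std::string& token) {
  std::string out;
  out.reserve(token.size());
  for (unsigned char c : token) {
    out.push_back(static_cast<char>(std::tolower(c)));
  }
  return out;
}
```

If the query side used a different chain (say, no lowercasing), "Running" indexed as "running" would never match a query for "RUNNING".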

7.6 Document ID Management

The DocID index maintains bidirectional mappings between external document references and internal numeric identifiers.

7.6.1 DocID Structure

using DocID = uint64_t;

constexpr DocID kInvalidDocID = std::numeric_limits<DocID>::max();

DocIDs are 64-bit unsigned integers, supporting up to 2^{64} documents per index.

7.6.2 DocID Index Operations

class DocIDIndex {
public:
  // Forward mapping: pointer -> DocID
  auto get_doc_id(const std::string& pointer) -> std::optional<DocID>;

  // Reverse mapping: DocID -> pointer
  auto get_pointer(DocID doc_id) -> std::optional<std::string>;

  // Allocation
  auto allocate(const std::string& pointer) -> DocID;

  // Deletion
  void remove(DocID doc_id);
  void remove(const std::string& pointer);
};

7.6.3 Cluster Assignment

DocIDs are assigned sequentially within each collection, ensuring documents indexed together share clusters:

\text{DocID}_n = \text{DocID}_{n-1} + 1

This locality maximizes cluster efficiency - a batch insert of 100,000 documents uses only:

\text{Clusters} = \lceil 100{,}000 / 65{,}536 \rceil = 2 \text{ clusters}

7.6.4 DocID Reuse

Deleted DocIDs can be reused to prevent unbounded growth:

class DocIDGenerator {
  std::queue<DocID> free_list_;
  DocID next_id_;

public:
  DocID allocate() {
    if (!free_list_.empty()) {
      auto id = free_list_.front();
      free_list_.pop();
      return id;
    }
    return next_id_++;
  }

  void release(DocID id) {
    free_list_.push(id);
  }
};

7.7 Index Statistics

Statistics enable cost-based query optimization and relevance scoring.

7.7.1 Collection Statistics

enum class IndexStatsType {
  kDocCount,       // Total documents in collection
  kDocSize,        // Total bytes of documents
  kFieldCount,     // Documents containing field
  kFieldSize,      // Bytes in field across all docs
  kFieldTermCount, // Posting entries for field
  kFieldTermSize,  // Bytes of postings for field
  kTermCount,      // Unique terms
  kTermSize,       // Bytes of term dictionary
  kTokenCount,     // Total token occurrences
  kTokenSize,      // Bytes of tokens
};

7.7.2 Term Statistics

struct TermStatistics {
  int64_t doc_freq;   // Documents containing term
  int64_t term_freq;  // Total occurrences across collection
};

Document Frequency (df): Used for IDF calculation in TF-IDF and BM25.

\text{IDF}(t) = \log\left(\frac{N - \text{df}(t) + 0.5}{\text{df}(t) + 0.5} + 1\right)

Term Frequency (tf): Used for collection-wide statistics.
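The IDF formula above translates directly to code; a small sketch (function name illustrative):

```cpp
#include <cmath>
#include <cstdint>

// BM25 IDF: log((N - df + 0.5) / (df + 0.5) + 1).
// Rarer terms (smaller df) get a larger weight.
double bm25_idf(int64_t total_docs, int64_t doc_freq) {
  return std::log((total_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0);
}
```

Note the "+1" inside the log keeps the value positive even when a term appears in more than half the documents.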

7.7.3 Statistics Storage

Statistics are stored in a separate namespace with atomic counters:

class ShardedInt64CounterMap {
  static constexpr size_t kShardCount = 64;

  struct Shard {
    mutable std::mutex mutex;  // Guards map structure (insert/rehash)
    std::unordered_map<Key, std::atomic<int64_t>> counters;
  };
  std::array<Shard, kShardCount> shards_;

public:
  void increment(const Key& key, int64_t delta) {
    auto& shard = shards_[hash(key) % kShardCount];
    std::lock_guard<std::mutex> lock(shard.mutex);
    shard.counters[key].fetch_add(delta, std::memory_order_relaxed);
  }

  int64_t get(const Key& key) const {
    const auto& shard = shards_[hash(key) % kShardCount];
    std::lock_guard<std::mutex> lock(shard.mutex);
    auto it = shard.counters.find(key);
    return it != shard.counters.end() ? it->second.load() : 0;
  }
};

Sharding prevents contention during concurrent indexing.

7.7.4 Average Field Length

BM25 requires average field length for normalization:

\text{avgdl} = \frac{\text{total tokens in field}}{\text{documents with field}}

Computed incrementally:

\text{avgdl}_{\text{new}} = \frac{\text{avgdl}_{\text{old}} \times n + \text{new doc length}}{n + 1}
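The incremental update is a one-liner; a sketch (name illustrative):

```cpp
#include <cstdint>

// Incremental mean: avgdl' = (avgdl * n + len) / (n + 1), where n is
// the number of documents already counted.
double update_avgdl(double avgdl, int64_t n, int64_t new_doc_len) {
  return (avgdl * n + new_doc_len) / (n + 1);
}
```

For example, starting from zero documents and adding lengths 10 and 20 yields an average of 15 without ever re-scanning the collection.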

7.8 Fast Update Architecture

The pending list system batches index updates to reduce write amplification.

7.8.1 Write Amplification Problem

Without batching, each term generates a separate RocksDB merge operation:

\text{Merges per document} = \text{unique terms in document}

For a document with 500 unique terms: 500 merge operations.

7.8.2 Pending List Solution

The pending list accumulates updates in memory before flushing:

[Diagram: pending list accumulation and flush to the clustered index]

7.8.3 Pending Entry Format

struct PendingEntry {
  enum Operation { kAdd, kRemove };

  Operation op;
  std::string field;
  std::string term_value;
  ClusteredPostingEntry entry;
};

Entries are grouped by (field, term_value) before flushing.

7.8.4 Flush Trigger Conditions

struct FTSPendingConfig {
  bool fast_update = true;
  size_t batch_size_threshold = 64 * 1024;      // 64 KB
  size_t pending_list_limit = 16 * 1024 * 1024; // 16 MB
  size_t pending_entry_limit = 500'000;
  size_t cleanup_batch_size = 100'000;
};

Flush occurs when:

  • Batch size exceeds 64 KB
  • Total pending data exceeds 16 MB
  • Entry count exceeds 500,000
  • Explicit flush requested

7.8.5 Background Cleanup

A background process merges pending entries into the main index:

cleanup():
  1. Scan pending batches in order
  2. Decompress and decode entries
  3. Group by (field, term_value)
  4. For each group:
     - Merge add/remove operations
     - Issue single merge to clustered index
  5. Delete processed pending keys
  6. Update statistics

7.8.6 Write Amplification Reduction

| Scenario | Without Batching | With Batching |
|---|---|---|
| 1000 docs, 500 terms each | 500,000 merges | ~1,000 merges |
| Reduction | 1x | 500x |

The pending list dramatically reduces write amplification for bulk indexing.

7.9 Posting List Iteration

Efficient iteration over posting lists is critical for query performance.

7.9.1 Iterator Interface

class DocIDSetIterator {
public:
  // Current position
  virtual DocID doc_id() const = 0;

  // Validity check
  virtual bool is_valid() const = 0;
  virtual bool has_next() const = 0;

  // Movement
  virtual DocID next() = 0;
  virtual DocID advance(DocID target) = 0;

  // Cost estimation
  virtual int64_t cost() const = 0;
};

7.9.2 Advance Operation

The advance(target) operation skips to the first DocID >= target:

\text{advance}(P, \text{target}) = \min\{d \in P \mid d \geq \text{target}\}

This enables efficient intersection without scanning all entries:

intersect(P1, P2):
  result = []
  while P1.valid() and P2.valid():
    if P1.doc_id() == P2.doc_id():
      result.append(P1.doc_id())
      P1.next()
      P2.next()
    elif P1.doc_id() < P2.doc_id():
      P1.advance(P2.doc_id())
    else:
      P2.advance(P1.doc_id())
  return result
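The pseudocode above can be made concrete with a vector-backed iterator whose advance uses binary search (class and function names are illustrative, simplified from the interface in 7.9.1):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using DocID = uint64_t;
constexpr DocID kNoMoreDocs = ~0ull;

// Minimal iterator honoring the advance(target) contract:
// skip to the first doc id >= target.
class VectorDISI {
 public:
  explicit VectorDISI(std::vector<DocID> docs) : docs_(std::move(docs)) {}
  bool valid() const { return pos_ < docs_.size(); }
  DocID doc_id() const { return docs_[pos_]; }
  void next() { ++pos_; }
  DocID advance(DocID target) {
    pos_ = std::lower_bound(docs_.begin() + pos_, docs_.end(), target) -
           docs_.begin();
    return valid() ? docs_[pos_] : kNoMoreDocs;
  }

 private:
  std::vector<DocID> docs_;
  size_t pos_ = 0;
};

// Leapfrog intersection: each side skips ahead to the other's doc id.
std::vector<DocID> intersect(VectorDISI p1, VectorDISI p2) {
  std::vector<DocID> result;
  while (p1.valid() && p2.valid()) {
    if (p1.doc_id() == p2.doc_id()) {
      result.push_back(p1.doc_id());
      p1.next();
      p2.next();
    } else if (p1.doc_id() < p2.doc_id()) {
      p1.advance(p2.doc_id());
    } else {
      p2.advance(p1.doc_id());
    }
  }
  return result;
}
```

Because advance binary-searches only the remaining suffix, intersecting a short list with a long one touches far fewer entries than a full scan of either.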

7.9.3 Specialized Iterators

Conjunction DISI (AND queries):

Maintains multiple iterators, advancing the minimum:

class ConjunctionDISI : public DocIDSetIterator {
  std::vector<std::unique_ptr<DocIDSetIterator>> iterators_;

  DocID advance(DocID target) override {
    while (true) {
      // Advance all iterators to target
      for (auto& it : iterators_) {
        target = it->advance(target);
      }
      // Check if all converged
      if (all_equal(iterators_)) {
        return target;
      }
      // Set new target to maximum
      target = max_doc_id(iterators_);
    }
  }
};

Phrase Match DISI:

Extends conjunction with position checking:

class PhraseMatchDISI : public DocIDSetIterator {
  bool check_phrase_positions() {
    // Load positions for all terms at current doc
    // Check if positions are consecutive
    for (size_t i = 1; i < positions_.size(); i++) {
      if (!has_adjacent_position(positions_[i-1], positions_[i], i)) {
        return false;
      }
    }
    return true;
  }
};

7.9.4 Lazy Position Loading

Positions are loaded only when needed:

class LazyPostingEntry {
  bool positions_loaded_ = false;
  Positions positions_;

  const Positions& get_positions() {
    if (!positions_loaded_) {
      positions_ = decode_positions(raw_data_);
      positions_loaded_ = true;
    }
    return positions_;
  }
};

This optimization avoids decoding positions for documents that fail earlier filters.

7.10 Merge Operator

The clustered term index uses RocksDB merge operators for efficient concurrent updates.

7.10.1 Merge vs Put

Put semantics: Overwrites previous value

\text{Put}(k, v_2) \text{ after } \text{Put}(k, v_1) \Rightarrow \text{Read}(k) = v_2

Merge semantics: Combines with previous value

\text{Merge}(k, \Delta_2) \text{ after } \text{Merge}(k, \Delta_1) \Rightarrow \text{Read}(k) = f(v_0, \Delta_1, \Delta_2)

7.10.2 Clustered Term Index Merge

For posting lists, merge combines entries from multiple documents:

\text{Merge}(P_1, P_2) = P_1 \cup P_2

With deduplication for the same DocID:

\text{Merge}(e_1, e_2) = \begin{cases} e_2 & \text{if } e_1.\text{doc\_id} = e_2.\text{doc\_id} \\ \{e_1, e_2\} & \text{otherwise} \end{cases}
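A simplified sketch of this union-with-dedup rule, where a later entry for the same doc_id replaces the earlier one (the entry type here is trimmed down from the full metadata of Section 7.2, and names are illustrative):

```cpp
#include <cstdint>
#include <map>
#include <vector>

using DocID = uint64_t;

struct Entry {
  DocID doc_id;
  int32_t term_freq;
};

// Union two posting lists; on a doc_id collision the delta entry wins,
// matching the merge rule above. std::map keeps the result sorted.
std::vector<Entry> merge_postings(const std::vector<Entry>& base,
                                  const std::vector<Entry>& delta) {
  std::map<DocID, Entry> merged;
  for (const auto& e : base) merged[e.doc_id] = e;
  for (const auto& e : delta) merged[e.doc_id] = e;  // overwrite duplicates
  std::vector<Entry> out;
  for (const auto& kv : merged) out.push_back(kv.second);
  return out;
}
```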

7.10.3 Merge Implementation

class ClusteredTermIndexMergeOperator : public MergeOperator {
  bool FullMerge(
    const Slice& key,
    const Slice* existing_value,
    const std::deque<std::string>& operands,
    std::string* new_value
  ) override {
    // Start with existing entries
    auto entries = existing_value
      ? decode_cluster(*existing_value)
      : ClusterEntries{};

    // Apply each operand
    for (const auto& operand : operands) {
      auto delta = decode_delta(operand);
      apply_delta(entries, delta);
    }

    // Encode result
    *new_value = encode_cluster(entries);
    return true;
  }
};

7.10.4 Partial Merge Optimization

Partial merges combine operands without reading existing value:

bool PartialMerge(
  const Slice& key,
  const Slice& left_operand,
  const Slice& right_operand,
  std::string* new_value
) override {
  // Combine two deltas without reading base value
  auto left = decode_delta(left_operand);
  auto right = decode_delta(right_operand);

  *new_value = encode_delta(merge_deltas(left, right));
  return true;
}

This reduces read amplification during compaction.

7.11 Integration with Document Store

The FTS index integrates with the document store through a unified context.

7.11.1 FTS Context

class FTSContext {
  std::unique_ptr<DocIDIndex> doc_id_index_;
  std::unique_ptr<TermIndex> term_index_;
  std::unique_ptr<ClusteredTermIndex> clustered_term_index_;
  std::unique_ptr<FacetDocValuesIndex> facet_index_;
  std::unique_ptr<NumericDocValuesIndex> numeric_index_;
  std::unique_ptr<QueryCache> query_cache_;

  AnalyzerMap analyzer_map_;       // Per-field analyzers
  SimilarityMap similarity_map_;   // Scoring models
  FieldOptionsMap field_options_;  // Index configuration
};

7.11.2 Document Indexing Flow

[Diagram: document indexing flow]

7.11.3 Field Options

Each field can have custom indexing options:

struct FieldOptions {
  std::string analyzer;           // Analyzer name
  bool index_positions = true;    // Store positions
  bool index_offsets = true;      // Store offsets
  bool store_term_vectors = false;// Store per-doc term stats
  float boost = 1.0f;             // Field-level boost
};

7.11.4 Multi-Field Documents

Documents can have multiple searchable fields:

{
  "_id": "doc1",
  "title": "Database Systems",
  "body": "An introduction to database internals...",
  "tags": ["database", "systems"]
}

Each field is indexed separately with its own analyzer:

| Field | Analyzer | Indexed Terms |
|---|---|---|
| title | standard | ["database", "systems"] |
| body | english | ["introduct", "databas", "intern", ...] |
| tags | keyword | ["database", "systems"] |

7.12 Summary

This chapter explored the inverted index architecture that powers Cognica's full-text search capabilities. Key takeaways:

  1. Inverted indexes transform the sparse term-document matrix into an efficient structure where each term maps to a posting list of containing documents.

  2. Posting lists store rich metadata including term frequencies, positions, and offsets, enabling both boolean matching and relevance ranking.

  3. The clustered term index achieves 62,500x key reduction by grouping documents into clusters of 65,536, storing all postings for a cluster in a single key-value pair.

  4. Gap encoding and variable-length integers compress posting lists by 3-4x while maintaining fast decoding.

  5. The text analysis pipeline transforms raw text through character filters, tokenizers, and token filters into normalized, searchable terms.

  6. The pending list architecture batches index updates to reduce write amplification by up to 500x during bulk indexing.

  7. Merge operators enable efficient concurrent updates without read-before-write overhead.

  8. Iterator abstraction with advance operations enables efficient query execution through posting list intersection.

The inverted index provides the foundation for text search; the next part of this book explores query processing, where we'll see how SQL queries are parsed, optimized, and executed against this index structure.

Copyright (c) 2023-2026 Cognica, Inc.