Chapter 5: LSM-Tree Storage Architecture
This chapter explores the Log-Structured Merge-tree (LSM-tree) storage architecture that forms the foundation of Cognica's persistence layer. We examine the theoretical principles behind LSM-trees, their trade-offs compared to B-tree structures, and the specific implementation choices that optimize for unified query workloads spanning documents, full-text search, and vector operations.
5.1 Storage Engine Fundamentals
Every database system must answer a fundamental question: how should data be organized on persistent storage to optimize both read and write operations? The answer to this question shapes the entire system architecture.
5.1.1 The Read-Write Trade-off
Storage engines face an inherent tension between read and write performance. Consider two extreme strategies:
Write-Optimized (Append-Only Log):
An append-only log achieves constant-time writes by simply appending new records. However, reads require scanning the entire log to find relevant records, resulting in linear time complexity.
Read-Optimized (Sorted Array):
A sorted array enables logarithmic-time lookups via binary search. However, maintaining sorted order requires shifting elements on every insert, yielding linear-time writes.
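The two extremes are easy to demonstrate. A minimal sketch in plain Python (purely illustrative, not a real storage engine):

```python
import bisect

# Write-optimized: append-only log. O(1) writes, O(n) reads.
log = []

def log_write(key, value):
    log.append((key, value))               # constant-time append

def log_read(key):
    for k, v in reversed(log):             # scan newest-first: linear time
        if k == key:
            return v
    return None

# Read-optimized: sorted array. O(log n) reads, O(n) writes.
sorted_keys, sorted_vals = [], []

def sorted_write(key, value):
    i = bisect.bisect_left(sorted_keys, key)   # O(log n) search...
    sorted_keys.insert(i, key)                 # ...but O(n) element shifting
    sorted_vals.insert(i, value)

def sorted_read(key):
    i = bisect.bisect_left(sorted_keys, key)   # binary search: O(log n)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return sorted_vals[i]
    return None

log_write("a", 1); log_write("b", 2); log_write("a", 3)   # log keeps all versions
sorted_write("a", 1); sorted_write("b", 2)
```

Note that the log naturally keeps every version of a key (reads return the newest), a property LSM-trees inherit and later reconcile via compaction.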
5.1.2 B-Trees: The Traditional Solution
B-trees achieve a balance by maintaining sorted data in a tree structure with high fanout: both lookups and updates cost

$$O(\log_B N)$$

node accesses, where $B$ is the branching factor (typically 100-1000). B-trees have dominated database storage for decades due to their balanced read/write performance.
However, B-trees suffer from write amplification - the ratio of bytes written to storage versus bytes written by the application:

$$WA = \frac{\text{bytes written to storage}}{\text{bytes written by application}}$$

Each update may modify multiple tree nodes from leaf to root, with each node requiring a full page write (typically 4-16 KB) even for small changes.
5.1.3 LSM-Trees: Write-Optimized Alternative
Log-Structured Merge-trees (LSM-trees), introduced by O'Neil et al. in 1996, take a different approach: buffer writes in memory, then flush sorted runs to disk, and periodically merge runs to maintain read performance.
Key insight: Sequential I/O is 100-1000x faster than random I/O on both HDDs and SSDs. LSM-trees convert random writes to sequential writes at the cost of additional read overhead.
The amortized write cost per entry is

$$O\!\left(\frac{T \cdot L}{B}\right)$$

where $L$ is the number of levels, $T$ is the size ratio between levels, and $B$ is the block size in entries. With proper tuning, LSM write amplification can be 10-100x lower than B-trees.
5.2 LSM-Tree Architecture
An LSM-tree consists of multiple components organized hierarchically from fast volatile memory to slow persistent storage.
5.2.1 Component Hierarchy
Memtable: An in-memory sorted data structure (typically a skip list or red-black tree) that buffers incoming writes. When the memtable reaches a size threshold, it becomes immutable and a new active memtable is created.
Write-Ahead Log (WAL): A persistent append-only log that records every write before it enters the memtable. The WAL ensures durability - if the system crashes before a memtable is flushed, the WAL can replay the writes during recovery.
Sorted String Table (SSTable): An immutable, sorted file containing key-value pairs. SSTables are organized into levels, with each level containing increasingly larger amounts of data.
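The interplay of these components can be sketched in a few lines. This toy model (illustrative only: a dict stands in for the memtable, sorted lists for SSTables, and the WAL is omitted) shows writes buffering in memory, flushing as immutable sorted runs, and reads consulting the newest data first:

```python
# A toy LSM-tree: the memtable buffers writes; when full it is flushed as
# an immutable sorted run (a stand-in for an SSTable). Reads check the
# memtable first, then runs from newest to oldest.
MEMTABLE_LIMIT = 2   # tiny threshold for demonstration

memtable = {}
sstables = []        # list of sorted runs, oldest first

def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        # flush: sort the buffered entries, then start a fresh memtable
        sstables.append(sorted(memtable.items()))
        memtable.clear()

def get(key):
    if key in memtable:
        return memtable[key]
    for run in reversed(sstables):     # newest run wins
        for k, v in run:               # real SSTables use binary search here
            if k == key:
                return v
    return None

put("a", 1); put("b", 2)    # second put triggers a flush
put("a", 9)                 # newer version shadows the flushed one
```

The shadowed older version of `"a"` lingers in the flushed run until compaction would merge it away - exactly the space amplification analyzed in 5.2.5.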
5.2.2 Size Ratio and Level Capacity
The size ratio $T$ determines how much larger each level is compared to the previous level:

$$\text{size}(L_{i+1}) = T \cdot \text{size}(L_i)$$

For a size ratio $T = 10$ and initial size $S_1 = 256$ MB:

| Level | Size | Typical Value |
|---|---|---|
| $L_1$ | $S_1$ | 256 MB |
| $L_2$ | $S_1 \cdot T$ | 2.5 GB |
| $L_3$ | $S_1 \cdot T^2$ | 25 GB |
| $L_4$ | $S_1 \cdot T^3$ | 250 GB |
| $L_5$ | $S_1 \cdot T^4$ | 2.5 TB |
The total capacity with $L$ levels is:

$$\sum_{i=1}^{L} S_1 \cdot T^{\,i-1} = S_1 \cdot \frac{T^L - 1}{T - 1}$$
5.2.3 Write Amplification Analysis
Write amplification measures how many times data is written to storage over its lifetime. In an LSM-tree, data moves through levels via compaction:
For leveled compaction with size ratio $T$, each byte is rewritten roughly $T$ times per level as that level fills, so

$$WA \approx T \cdot L, \qquad L = \left\lceil \log_T \frac{N}{S_1} \right\rceil$$

where $N$ is total data size. With $T = 10$ and 1 TB of data, $L = 4$ and $WA \approx 10 \times 4 = 40$.

Each byte written by the application results in approximately 40 bytes written to storage over its lifetime.
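A quick numeric check of these figures, using the chapter's values ($T = 10$, $S_1 = 256$ MB, 1 TB of data):

```python
import math

# Level count and write amplification for leveled compaction, using the
# chapter's figures: size ratio T = 10, first level S1 = 256 MB, 1 TB total.
T  = 10
S1 = 256 * 2**20          # 256 MB
N  = 2**40                # 1 TB

levels = math.ceil(math.log(N / S1, T))   # levels needed to hold N bytes
write_amp = T * levels                    # each byte rewritten ~T times per level

print(levels, write_amp)
```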
5.2.4 Read Amplification Analysis
Read amplification measures how many storage locations must be checked to satisfy a point query. In the worst case, a key might exist only in the oldest level, requiring checks at every level:

$$RA_{\text{worst}} = L$$

With bloom filters (false positive rate $\epsilon$), the expected read amplification drops to:

$$RA \approx 1 + L \cdot \epsilon$$

For $L = 7$ levels and $\epsilon = 0.01$:

$$RA \approx 1 + 7 \times 0.01 = 1.07$$
Bloom filters dramatically reduce read amplification by eliminating unnecessary SSTable searches.
5.2.5 Space Amplification
Space amplification measures the ratio of storage used to logical data size:

$$SA = \frac{\text{storage used}}{\text{logical data size}}$$

LSM-trees may temporarily store multiple versions of the same key across levels until compaction merges them. In the worst case for leveled compaction, the upper levels together hold at most a $\frac{1}{T-1}$ fraction of the last level:

$$SA \le 1 + \frac{1}{T - 1}$$

With $T = 10$, space amplification is bounded by roughly 1.1x (about 10% overhead).
5.3 Cognica's RocksDB Integration
Cognica builds its storage layer on RocksDB, a high-performance LSM-tree implementation originally developed at Facebook. This section examines how Cognica configures and extends RocksDB for unified query processing.
5.3.1 Storage Engine Architecture
The storage engine provides a layered abstraction over RocksDB, initializing it with carefully tuned parameters.
Thread Pool Configuration:
High-priority threads handle flushes (converting memtables to SSTables), while low-priority threads handle compaction (merging SSTables across levels).
5.3.2 Database Category Hierarchy
Cognica organizes data into three database categories:
| Category | ID | Purpose |
|---|---|---|
| System | 0 | Metadata, configuration, schema definitions |
| KeyValue | 1 | Application key-value data with TTL support |
| Document | 2 | Collections, documents, indexes |
Each category operates as a logical partition within the same RocksDB instance, distinguished by key prefixes.
This prefix scheme enables:
- Isolation: Different categories never conflict
- Efficient Scans: Prefix iterators scan only relevant data
- Bloom Filters: Prefix-based bloom filters accelerate lookups
5.3.3 Workspace Multi-Tenancy
Cognica supports multi-tenant deployments where each tenant (workspace) has isolated data. The key encoding ensures workspace isolation without requiring separate database instances.
Total prefix length: 14 bytes, enabling efficient prefix extraction for bloom filters.
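As an illustration of this kind of encoding (the exact field order and byte layout in Cognica are not shown here, so this packing is an assumption), a 14-byte prefix can be built from the fields listed in 5.8.2 using fixed-width big-endian integers, so that byte order agrees with lexicographic key comparison:

```python
import struct

# Illustrative 14-byte key prefix: database type (1 B), category ID (1 B),
# collection ID (4 B), index ID (4 B), workspace ID (4 B). The field order
# is an assumption; big-endian packing keeps numeric order consistent with
# bytewise (lexicographic) comparison of the encoded keys.
def make_prefix(db_type, category, collection, index, workspace):
    return struct.pack(">BBIII", db_type, category, collection, index, workspace)

prefix = make_prefix(db_type=0, category=2, collection=42, index=7, workspace=1)
key = prefix + b"doc:12345"     # the user key follows the fixed prefix
```

All keys in the same collection share the same 14-byte prefix, which is exactly what a capped prefix extractor needs for prefix bloom filters and prefix iteration.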
5.4 Write Path Implementation
Understanding the write path is essential for optimizing write-heavy workloads common in document ingestion and full-text indexing.
5.4.1 Write Flow
5.4.2 Write-Ahead Log Configuration
The WAL ensures durability with configurable trade-offs:
| Parameter | Value | Purpose |
|---|---|---|
max_total_wal_size | 4 GB | Total WAL size limit across all column families |
wal_bytes_per_sync | 128 MB | Bytes between fsync calls |
wal_compression | ZSTD | Compression algorithm for WAL entries |
wal_ttl_seconds | 3600 | Automatic cleanup after 1 hour |
recycle_log_file_num | 16 | Reuse log files to reduce allocation overhead |
Durability Modes:
- Synchronous (`sync = true`): Every write waits for fsync. Maximum durability, highest latency.
- Asynchronous (`sync = false`): Writes return immediately. Risk of losing recent writes on crash.
- Group Commit: Multiple writes share a single fsync, amortizing overhead.
Cognica defaults to asynchronous writes with periodic sync (every 128 MB), balancing throughput and durability.
5.4.3 Memtable Configuration
The memtable acts as the write buffer, absorbing writes until flushed:
| Parameter | Value | Impact |
|---|---|---|
write_buffer_size | 256 MB | Size of each memtable |
max_write_buffer_number | 16 | Maximum concurrent memtables |
min_write_buffer_number_to_merge | 1 | Merge threshold before flush |
arena_block_size | 16 MB | Memory allocation granularity |
Total Write Buffer Capacity:

$$256 \text{ MB} \times 16 = 4 \text{ GB}$$

This allows up to 4 GB of writes to be buffered in memory, enabling high write throughput for bulk ingestion.
5.4.4 Memtable Prefix Bloom Filters
Cognica enables prefix bloom filters on memtables to accelerate point queries before data reaches SSTables.

The bloom filter answers the question "might this key exist in this memtable?" with a configurable false positive rate:

$$\epsilon = \left(1 - e^{-kn/m}\right)^{k}$$

where $k$ is the number of hash functions, $n$ is the number of keys, and $m$ is the bloom filter size in bits.
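The formula is easy to evaluate. A small sketch, assuming the common RocksDB-style sizing of about 10 bits per key with the near-optimal hash count $k = (m/n)\ln 2$:

```python
import math

# Expected false positive rate of a Bloom filter with m bits, n keys,
# and k hash functions: eps = (1 - e^(-k*n/m))^k.
def bloom_fpr(m_bits, n_keys, k_hashes):
    return (1 - math.exp(-k_hashes * n_keys / m_bits)) ** k_hashes

bits_per_key = 10            # assumed sizing, typical for RocksDB filters
n = 1_000_000
m = bits_per_key * n
k = round(m / n * math.log(2))   # near-optimal hash count: ~7 for 10 bits/key

print(f"{bloom_fpr(m, n, k):.4f}")   # slightly under 1%
```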
5.5 Read Path Implementation
The read path must check multiple locations, making optimization critical for query performance.
5.5.1 Read Flow
5.5.2 Block Cache Architecture
Cognica employs a multi-level cache hierarchy to minimize disk I/O:
Primary Block Cache (HyperClockCache):
| Parameter | Value | Purpose |
|---|---|---|
cache_capacity | 16 GB | Total cache size |
cache_shard_bits | 4 | 16 shards for concurrent access |
strict_capacity_limit | true | Never exceed capacity |
The HyperClockCache uses a clock-based eviction algorithm optimized for high concurrency.
Index and Filter Caching:
cache_index_and_filter_blocks: true
cache_index_and_filter_blocks_with_high_priority: true
Index and filter blocks receive high priority in the cache because their eviction causes disproportionate performance degradation - every subsequent read must reload them from disk.
5.5.3 Bloom Filter Configuration
Cognica uses Ribbon filters, an advanced alternative to traditional Bloom filters.
Ribbon Filter Advantages:
- 20-30% less space than Bloom filters for same false positive rate
- Faster construction for large key sets
- Cache-friendly query pattern
For false positive rate $\epsilon$, a classic Bloom filter requires roughly $1.44 \log_2(1/\epsilon)$ bits per key, while a Ribbon filter approaches the information-theoretic minimum of $\log_2(1/\epsilon)$ bits; at $\epsilon = 1\%$ this is about 9.6 versus roughly 7 bits per key.
5.5.4 Read Options Optimization
Cognica configures read operations for optimal performance:
| Option | Value | Impact |
|---|---|---|
auto_prefix_mode | true | Use prefix extractors for bloom filters |
verify_checksums | false | Skip checksum verification for speed |
readahead_size | 512 KB | Sequential read buffer |
adaptive_readahead | true | Dynamically adjust based on access pattern |
async_io | true | Enable asynchronous I/O |
Read Ahead Strategy:
For sequential scans, readahead prefetches data before it is requested. Without readahead, every 4 KB block pays the full per-request overhead:

$$t_{\text{scan}} = N_{\text{blocks}} \cdot \left(t_{\text{request}} + \frac{4\text{ KB}}{BW}\right)$$

With readahead, that overhead is amortized across 512 KB requests:

$$t_{\text{scan}} = N_{\text{requests}} \cdot \left(t_{\text{request}} + \frac{512\text{ KB}}{BW}\right)$$

A 512 KB readahead can improve sequential read performance by 10-100x compared to reading individual blocks.
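Plugging assumed figures into this cost model (100 us per-request overhead, 500 MB/s bandwidth - both illustrative, not measured values from Cognica) lands in the claimed range:

```python
# Simple cost model for sequential scans: each I/O request pays a fixed
# overhead plus transfer time. Figures are assumptions for illustration.
REQUEST_S = 100e-6          # per-request overhead (assumed)
BW        = 500e6           # bytes/second (assumed)
BLOCK     = 4 * 1024        # read one block at a time
READAHEAD = 512 * 1024      # read 512 KB at a time

def scan_time(total_bytes, request_size):
    requests = total_bytes / request_size
    return requests * (REQUEST_S + request_size / BW)

total = 100 * 2**20                       # scan 100 MB
speedup = scan_time(total, BLOCK) / scan_time(total, READAHEAD)
print(f"{speedup:.0f}x")
```

With these numbers the model gives roughly a 12x speedup; faster devices with higher fixed overheads push the ratio toward the upper end of the 10-100x range.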
5.6 Compaction Strategies
Compaction is the process of merging SSTables to maintain read performance and reclaim space from deleted or overwritten keys. The choice of compaction strategy significantly impacts system behavior.
5.6.1 Leveled Compaction
Cognica uses leveled compaction, where each level (except L0) contains non-overlapping SSTables.
Properties:
- L0: Overlapping SSTables (direct memtable flushes)
- L1+: Non-overlapping, sorted SSTables
- Size ratio $T$ between levels (10 in Cognica's configuration)
Compaction Trigger:
When level $L_i$ exceeds its size limit:

$$\text{size}(L_i) > \text{max\_bytes\_for\_level\_base} \cdot T^{\,i-1}$$

SSTables from $L_i$ are merged with overlapping SSTables in $L_{i+1}$.
5.6.2 Compression Strategy
Cognica employs a tiered compression strategy optimized for the access patterns at each level:
| Level | Algorithm | Rationale |
|---|---|---|
| 0-4 | LZ4 | Fast compression/decompression for hot data |
| 5-6 | ZSTD | High compression ratio for cold data |
Compression Trade-offs:
For hot data (L0-L4), LZ4's fast decompression minimizes latency:
- LZ4: ~4 GB/s decompression
- ZSTD: ~1 GB/s decompression
For cold data (L5-L6), ZSTD's superior compression ratio reduces storage:
- LZ4: ~2.5x compression
- ZSTD: ~4x compression
Dictionary Compression:
ZSTD supports dictionary compression, where common patterns are pre-computed:
max_dict_bytes: 32_KB
zstd_max_train_bytes: 3_MB
Dictionary compression can improve ratios by 20-50% for structured data like JSON documents.
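The effect is easy to see with the standard library. Cognica uses ZSTD dictionaries, but zlib's preset-dictionary support (`zdict`) demonstrates the same idea: structured records share boilerplate that the dictionary captures, so each record compresses much better on its own:

```python
import zlib

# Dictionary compression illustrated with zlib's preset dictionary.
# The dictionary holds the structural boilerplate shared by records.
dictionary = b'{"user_id": , "event": "", "timestamp": }'
record = b'{"user_id": 12345, "event": "click", "timestamp": 1704067200}'

# Without a dictionary: a short record barely compresses at all.
plain = zlib.compress(record, 9)

# With a dictionary: shared structure becomes cheap back-references.
comp = zlib.compressobj(9, zdict=dictionary)
with_dict = comp.compress(record) + comp.flush()

print(len(plain), len(with_dict))
```

Decompression must supply the same dictionary (`zlib.decompressobj(zdict=...)`), which parallels the note below that encrypted or dictionary-compressed backups need the original auxiliary data to restore.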
5.6.3 Custom Compaction Filter
Cognica implements a custom compaction filter that runs during compaction to:
- Expire TTL Data: Remove key-value pairs past their time-to-live
- Detect Migration: Identify data requiring schema migration
- Validate Structure: Ensure keys match expected category and database type
Filter Decision Logic:
The filter runs during compaction, making TTL expiration essentially "free" - data is cleaned up as part of the normal compaction process without additional I/O.
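The decision logic can be sketched as a pure function over each key-value pair seen during compaction. All names here are illustrative, not Cognica's actual API:

```python
# Sketch of a TTL-expiring compaction filter: each pair visited during
# compaction is kept, removed (TTL expired), or rewritten (schema migration).
# Decision codes and field names are illustrative assumptions.
KEEP, REMOVE, CHANGE = 0, 1, 2

def compaction_filter(key, value, now, migrate):
    expires_at = value.get("ttl")
    if expires_at is not None and expires_at <= now:
        return REMOVE, None                 # expired: dropped for free
    if value.get("version", 1) < 2 and migrate:
        migrated = dict(value, version=2)   # rewrite to the new schema
        return CHANGE, migrated
    return KEEP, value

now = 1_000_000
decision, _ = compaction_filter(b"k1", {"ttl": 999_999}, now, migrate=False)
```

Because compaction already reads and rewrites every pair it touches, returning REMOVE or CHANGE costs no extra I/O beyond the merge itself.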
5.6.4 Compaction Tuning
Cognica's compaction configuration balances write amplification, space amplification, and read performance:
| Parameter | Value | Purpose |
|---|---|---|
target_file_size_base | 64 MB | SSTable size at L1 |
max_bytes_for_level_base | 512 MB | Size limit for L1 |
level0_file_num_compaction_trigger | 4 | L0 files before compaction |
level0_slowdown_writes_trigger | 20 | L0 files before write slowdown |
level0_stop_writes_trigger | 36 | L0 files before write stop |
Write Stall Prevention:
When L0 accumulates too many files, writes must slow down to allow compaction to catch up: at 20 L0 files RocksDB throttles the write rate, and at 36 files it stops writes entirely until compaction reduces the backlog.
5.7 Transaction Support
Cognica provides ACID transactions through RocksDB's TransactionDB, with extensions for distributed consensus.
5.7.1 Transaction Abstraction
The transaction interface supports multiple implementation strategies.
5.7.2 Isolation Levels
Snapshot Isolation:
Each transaction sees a consistent snapshot of the database at its start time.
Snapshot isolation prevents dirty reads and non-repeatable reads but allows write skew anomalies.
Serializable Snapshot Isolation (SSI):
RocksDB supports SSI through conflict detection.
When conflicts are detected, one transaction aborts to maintain serializability.
5.7.3 Write Batch Optimization
For write-heavy workloads, Cognica uses indexed write batches that buffer writes in memory.
Benefits:
- Read-your-writes: Queries see uncommitted changes within the transaction
- Reduced lock contention: No locks until commit
- Atomic commit: All changes apply atomically
Index Structure:
The tsl::htrie_map provides efficient prefix-based lookups over the buffered keys, enabling efficient iteration over key ranges within a transaction's write set.
5.7.4 Savepoints
Savepoints enable partial rollback within a transaction.
Savepoints are implemented as markers in the write batch, enabling efficient partial undo.
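Both ideas combine naturally: the batch keeps an ordered operation log plus a key index for read-your-writes, and a savepoint is just the log length at the time it was set. A minimal sketch (a plain dict stands in for the C++ `tsl::htrie_map`; method names are illustrative):

```python
# Sketch of an indexed write batch with savepoints: writes buffer in an
# ordered log plus a per-key index; a savepoint records the log length,
# and rollback truncates past it and rebuilds the index.
class WriteBatch:
    def __init__(self):
        self.ops = []          # ordered (key, value-or-None) log
        self.index = {}        # newest buffered value per key
        self.savepoints = []

    def put(self, key, value):
        self.ops.append((key, value))
        self.index[key] = value

    def delete(self, key):
        self.ops.append((key, None))       # None marks a buffered delete
        self.index[key] = None

    def get(self, key, db):
        if key in self.index:              # read-your-writes
            return self.index[key]
        return db.get(key)                 # fall through to the database

    def set_savepoint(self):
        self.savepoints.append(len(self.ops))

    def rollback_to_savepoint(self):
        n = self.savepoints.pop()
        self.ops = self.ops[:n]            # cheap truncation of the log
        self.index = {}                    # rebuild index from surviving ops
        for key, value in self.ops:
            self.index[key] = value

batch = WriteBatch()
batch.put("a", 1)
batch.set_savepoint()
batch.put("a", 2)
batch.rollback_to_savepoint()   # "a" reverts to 1
```

Truncating the log makes rollback proportional to the work being undone, which is why savepoint markers in the write batch give efficient partial undo.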
5.8 Custom Extensions
Cognica extends RocksDB with custom components for unified query processing.
5.8.1 Custom Comparators
Key ordering determines SSTable organization and iteration behavior. Cognica provides two comparators:
Ascending Comparator (default):
Descending Comparator:
The descending comparator enables efficient "ORDER BY DESC" queries by storing data in reverse order.
Key Compression Optimization:
The comparators implement FindShortestSeparator() to minimize index block size:
Given keys $a$ and $b$ where $a < b$, find the shortest key $s$ such that $a \le s < b$.

For example, a separator between `"hello"` and `"world"` can be shortened to the single byte `"i"`, since the first bytes already differ and `'h' + 1 = 'i'` still sorts below `'w'`.
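A sketch of this shortening, modeled on the behavior of RocksDB's bytewise comparator (truncate at the first differing byte and bump it when the result still sorts below the limit; this is an illustration, not Cognica's exact implementation):

```python
# Bytewise FindShortestSeparator sketch: given start < limit, return a
# short key s with start <= s < limit, or start itself when no safe
# shortening exists.
def find_shortest_separator(start: bytes, limit: bytes) -> bytes:
    # length of the common prefix
    i = 0
    while i < min(len(start), len(limit)) and start[i] == limit[i]:
        i += 1
    if i >= min(len(start), len(limit)):
        return start          # one key is a prefix of the other: no shortening
    byte = start[i]
    if byte < 0xFF and byte + 1 < limit[i]:
        return start[:i] + bytes([byte + 1])   # short, valid separator
    return start              # conservative: keep the original key

sep = find_shortest_separator(b"hello", b"world")
```

Index blocks store one separator per SSTable block, so shaving separators down to a byte or two meaningfully shrinks the index that must stay cached.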
5.8.2 Prefix Extraction
Prefix extractors enable bloom filters and prefix-based iteration:
Capped Prefix Transform (14 bytes):
The 14-byte prefix captures:
- Database type (1 byte)
- Category ID (1 byte)
- Collection ID (4 bytes)
- Index ID (4 bytes)
- Workspace ID (4 bytes)
This enables efficient filtering: "Find all documents in collection X" requires only prefix-matching bloom filter lookups.
5.8.3 Merge Operators
Merge operators enable atomic read-modify-write operations without read locks:
Counter Merge Operator:
Multiple increments merge during compaction - for example, a stored base value of 5 with pending operands +1, +2, and +3 collapses into a single stored value of 11, without a read-modify-write cycle per increment.
Clustered Term Index Merge Operator:
For full-text search, posting lists must merge efficiently.
The clustered term index stores multiple terms per key, requiring custom merge logic to maintain sorted order and handle deletions.
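The merge-operator contract can be sketched as a function combining an existing value with a queue of pending operands, applied at read time or during compaction. Both operators above fit the same shape (operand encodings here are illustrative assumptions):

```python
# Merge-operator semantics: fold pending operands into the stored value.

def counter_merge(existing, operands):
    # operands are signed deltas; existing may be absent (None)
    total = existing or 0
    for delta in operands:
        total += delta
    return total

def posting_merge(existing, operands):
    # operands are ("add", doc_id) or ("del", doc_id) tuples
    docs = set(existing or [])
    for op, doc_id in operands:
        if op == "add":
            docs.add(doc_id)
        else:
            docs.discard(doc_id)
    return sorted(docs)            # posting lists stay sorted for merging

count = counter_merge(5, [1, 2, -1])
postings = posting_merge([1, 5], [("add", 3), ("del", 5), ("add", 9)])
```

Because the fold is associative, RocksDB is free to apply it lazily at read time or eagerly during compaction, whichever happens first.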
5.9 Backup and Recovery
Cognica provides backup and recovery mechanisms built on RocksDB's backup engine.
5.9.1 Backup Architecture
Incremental Backups:
Because SSTables are immutable, subsequent backups only copy SSTables created since the previous backup.
For append-heavy workloads, incremental backups are dramatically smaller than full backups.
5.9.2 Point-in-Time Recovery
The backup engine maintains multiple backup versions:
| Backup ID | Timestamp | Files | Size |
|---|---|---|---|
| 1 | 2024-01-01 | 100 | 10 GB |
| 2 | 2024-01-02 | 15 | 1.5 GB |
| 3 | 2024-01-03 | 20 | 2 GB |
Recovery can restore the database to any retained backup point.
5.9.3 Encryption Support
Cognica supports encryption at rest through RocksDB's encrypted environment.
Encryption Flow:
Backups preserve encryption, requiring the same key for restoration.
5.10 Performance Characteristics
This section summarizes the performance characteristics of Cognica's LSM-tree storage.
5.10.1 Throughput Bounds
Write Throughput:

With 256 MB memtables, 100 ms flush time, 500 MB/s disk bandwidth, and 40x write amplification, sustained write throughput is bounded by

$$\frac{\text{disk bandwidth}}{WA} = \frac{500 \text{ MB/s}}{40} = 12.5 \text{ MB/s}$$

Read Throughput:

With a 99% cache hit rate, 100 GB/s memory bandwidth, and 500 MB/s disk bandwidth, the average service time per byte is

$$\frac{0.99}{100 \text{ GB/s}} + \frac{0.01}{500 \text{ MB/s}} \approx 30 \text{ ps/byte}$$

for an effective read throughput of roughly 33 GB/s.
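The same back-of-the-envelope arithmetic in code, using the figures above:

```python
# Throughput bounds from the chapter's figures: 500 MB/s disk, 40x write
# amplification, 99% block-cache hit rate, 100 GB/s memory bandwidth.
DISK_BW   = 500e6
WRITE_AMP = 40
HIT_RATE  = 0.99
MEM_BW    = 100e9

# Sustained writes: every application byte costs WRITE_AMP bytes of disk I/O.
write_tput = DISK_BW / WRITE_AMP                     # bytes/second

# Reads: average service time per byte mixes memory and disk costs.
time_per_byte = HIT_RATE / MEM_BW + (1 - HIT_RATE) / DISK_BW
read_tput = 1 / time_per_byte                        # bytes/second

print(f"writes: {write_tput/1e6:.1f} MB/s, reads: {read_tput/1e9:.1f} GB/s")
```

Note how sharply the 1% of disk-bound reads dominates the average: even at 99% hit rate, effective read bandwidth is a third of raw memory bandwidth.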
5.10.2 Latency Distribution
Point query latency depends on data location:
| Location | Latency | Probability |
|---|---|---|
| Block Cache | 1-10 us | 99% (with good caching) |
| Memtable | 10-100 us | Depends on recency |
| L0 SSTables | 100 us - 1 ms | Low (bloom filters) |
| L1+ SSTables | 1-10 ms | Very low (bloom filters) |
P99 Latency:
With a 1% bloom filter false positive rate and 1 ms disk latency, the expected overhead from false positives is only about $0.01 \times 1 \text{ ms} = 10\ \mu\text{s}$ per query; P99 latency is therefore dominated by the occasional genuine disk read and stays near 1 ms rather than one disk read per level.
5.10.3 Space Efficiency
Effective Compression Ratio:
With tiered compression (LZ4 at ~2.5x for hot data, ZSTD at ~4x for cold data), and assuming 30% of data is hot (recent) and 70% is cold (historical), the effective ratio is

$$\frac{1}{0.3 / 2.5 + 0.7 / 4} \approx 3.4\times$$
5.11 Summary
This chapter examined the LSM-tree storage architecture underlying Cognica's persistence layer. Key takeaways:
- LSM-trees optimize for write throughput by converting random writes to sequential I/O through the memtable/SSTable hierarchy.
- Write amplification is the primary cost, with leveled compaction yielding $O(T \cdot L)$ amplification. Cognica tunes this through compression tiers and careful level sizing.
- Read amplification is controlled through bloom filters (Ribbon filters in Cognica), achieving near-optimal single-read performance for point queries.
- The multi-level cache hierarchy (block cache, row cache, OS page cache) minimizes disk I/O for hot data.
- Transaction support through RocksDB's TransactionDB provides ACID guarantees with configurable isolation levels.
- Custom extensions (comparators, merge operators, compaction filters) adapt the generic LSM-tree for unified query processing.
The storage layer provides the foundation upon which Cognica builds document storage, full-text indexes, and vector indexes - topics we explore in the following chapters.