Chapter 5: LSM-Tree Storage Architecture
This chapter explores the Log-Structured Merge-tree (LSM-tree) storage architecture that forms the foundation of Cognica's persistence layer. We examine the theoretical principles behind LSM-trees, their trade-offs compared to B-tree structures, and the specific implementation choices that optimize for unified query workloads spanning documents, full-text search, and vector operations.
5.1 Storage Engine Fundamentals
Every database system must answer a fundamental question: how should data be organized on persistent storage to optimize both read and write operations? The answer to this question shapes the entire system architecture.
5.1.1 The Read-Write Trade-off
Storage engines face an inherent tension between read and write performance. Consider two extreme strategies:
Write-Optimized (Append-Only Log):
An append-only log achieves constant-time writes by simply appending new records. However, reads require scanning the entire log to find relevant records, resulting in linear time complexity.
Read-Optimized (Sorted Array):
A sorted array enables logarithmic-time lookups via binary search. However, maintaining sorted order requires shifting elements on every insert, yielding linear-time writes.
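The two extremes are easy to demonstrate. A minimal sketch in plain Python (purely illustrative, not a real storage engine):

```python
import bisect

# Write-optimized: append-only log. O(1) writes, O(n) reads.
log = []

def log_write(key, value):
    log.append((key, value))               # constant-time append

def log_read(key):
    for k, v in reversed(log):             # scan newest-first: linear time
        if k == key:
            return v
    return None

# Read-optimized: sorted array. O(log n) reads, O(n) writes.
sorted_keys, sorted_vals = [], []

def sorted_write(key, value):
    i = bisect.bisect_left(sorted_keys, key)   # O(log n) search...
    sorted_keys.insert(i, key)                 # ...but O(n) element shifting
    sorted_vals.insert(i, value)

def sorted_read(key):
    i = bisect.bisect_left(sorted_keys, key)   # binary search: O(log n)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return sorted_vals[i]
    return None

log_write("a", 1); log_write("b", 2); log_write("a", 3)   # log keeps all versions
sorted_write("a", 1); sorted_write("b", 2)
```

Note that the log naturally keeps every version of a key (reads return the newest), a property LSM-trees inherit and later reconcile via compaction.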
5.1.2 B-Trees: The Traditional Solution
B-trees achieve a balance by maintaining sorted data in a tree structure with high fanout: both lookups and updates cost

$$O(\log_B N)$$

node accesses, where $B$ is the branching factor (typically 100-1000). B-trees have dominated database storage for decades due to their balanced read/write performance.
However, B-trees suffer from write amplification - the ratio of bytes written to storage versus bytes written by the application:

$$WA = \frac{\text{bytes written to storage}}{\text{bytes written by application}}$$

Each update may modify multiple tree nodes from leaf to root, with each node requiring a full page write (typically 4-16 KB) even for small changes.
5.1.3 LSM-Trees: Write-Optimized Alternative
Log-Structured Merge-trees (LSM-trees), introduced by O'Neil et al. in 1996, take a different approach: buffer writes in memory, then flush sorted runs to disk, and periodically merge runs to maintain read performance.
Key insight: Sequential I/O is 100-1000x faster than random I/O on both HDDs and SSDs. LSM-trees convert random writes to sequential writes at the cost of additional read overhead.
The amortized write cost per entry is

$$O\!\left(\frac{T \cdot L}{B}\right)$$

where $L$ is the number of levels, $T$ is the size ratio between levels, and $B$ is the block size in entries. With proper tuning, LSM write amplification can be 10-100x lower than B-trees.
5.2 LSM-Tree Architecture
An LSM-tree consists of multiple components organized hierarchically from fast volatile memory to slow persistent storage.
5.2.1 Component Hierarchy
Memtable: An in-memory sorted data structure (typically a skip list or red-black tree) that buffers incoming writes. When the memtable reaches a size threshold, it becomes immutable and a new active memtable is created.
Write-Ahead Log (WAL): A persistent append-only log that records every write before it enters the memtable. The WAL ensures durability - if the system crashes before a memtable is flushed, the WAL can replay the writes during recovery.
Sorted String Table (SSTable): An immutable, sorted file containing key-value pairs. SSTables are organized into levels, with each level containing increasingly larger amounts of data.
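The interplay of these components can be sketched in a few lines. This toy model (illustrative only: a dict stands in for the memtable, sorted lists for SSTables, and the WAL is omitted) shows writes buffering in memory, flushing as immutable sorted runs, and reads consulting the newest data first:

```python
# A toy LSM-tree: the memtable buffers writes; when full it is flushed as
# an immutable sorted run (a stand-in for an SSTable). Reads check the
# memtable first, then runs from newest to oldest.
MEMTABLE_LIMIT = 2   # tiny threshold for demonstration

memtable = {}
sstables = []        # list of sorted runs, oldest first

def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        # flush: sort the buffered entries, then start a fresh memtable
        sstables.append(sorted(memtable.items()))
        memtable.clear()

def get(key):
    if key in memtable:
        return memtable[key]
    for run in reversed(sstables):     # newest run wins
        for k, v in run:               # real SSTables use binary search here
            if k == key:
                return v
    return None

put("a", 1); put("b", 2)    # second put triggers a flush
put("a", 9)                 # newer version shadows the flushed one
```

The shadowed older version of `"a"` lingers in the flushed run until compaction would merge it away - exactly the space amplification analyzed in 5.2.5.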
5.2.2 Size Ratio and Level Capacity
The size ratio $T$ determines how much larger each level is compared to the previous level:

$$\text{size}(L_{i+1}) = T \cdot \text{size}(L_i)$$

For a size ratio $T = 10$ and initial size $S_1 = 256$ MB:

| Level | Size | Typical Value |
|---|---|---|
| $L_1$ | $S_1$ | 256 MB |
| $L_2$ | $S_1 \cdot T$ | 2.5 GB |
| $L_3$ | $S_1 \cdot T^2$ | 25 GB |
| $L_4$ | $S_1 \cdot T^3$ | 250 GB |
| $L_5$ | $S_1 \cdot T^4$ | 2.5 TB |
The total capacity with $L$ levels is:

$$\sum_{i=1}^{L} S_1 \cdot T^{\,i-1} = S_1 \cdot \frac{T^L - 1}{T - 1}$$
5.2.3 Write Amplification Analysis
Write amplification measures how many times data is written to storage over its lifetime. In an LSM-tree, data moves through levels via compaction:
For leveled compaction with size ratio $T$, each byte is rewritten roughly $T$ times per level as that level fills, so

$$WA \approx T \cdot L, \qquad L = \left\lceil \log_T \frac{N}{S_1} \right\rceil$$

where $N$ is total data size. With $T = 10$ and 1 TB of data, $L = 4$ and $WA \approx 10 \times 4 = 40$.

Each byte written by the application results in approximately 40 bytes written to storage over its lifetime.
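A quick numeric check of these figures, using the chapter's values ($T = 10$, $S_1 = 256$ MB, 1 TB of data):

```python
import math

# Level count and write amplification for leveled compaction, using the
# chapter's figures: size ratio T = 10, first level S1 = 256 MB, 1 TB total.
T  = 10
S1 = 256 * 2**20          # 256 MB
N  = 2**40                # 1 TB

levels = math.ceil(math.log(N / S1, T))   # levels needed to hold N bytes
write_amp = T * levels                    # each byte rewritten ~T times per level

print(levels, write_amp)
```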
5.2.4 Read Amplification Analysis
Read amplification measures how many storage locations must be checked to satisfy a point query. In the worst case, a key might exist only in the oldest level, requiring checks at every level:

$$RA_{\text{worst}} = L$$

With bloom filters (false positive rate $\epsilon$), the expected read amplification drops to:

$$RA \approx 1 + L \cdot \epsilon$$

For $L = 7$ levels and $\epsilon = 0.01$:

$$RA \approx 1 + 7 \times 0.01 = 1.07$$
Bloom filters dramatically reduce read amplification by eliminating unnecessary SSTable searches.
5.2.5 Space Amplification
Space amplification measures the ratio of storage used to logical data size:

$$SA = \frac{\text{storage used}}{\text{logical data size}}$$

LSM-trees may temporarily store multiple versions of the same key across levels until compaction merges them. In the worst case for leveled compaction, the upper levels together hold at most a $\frac{1}{T-1}$ fraction of the last level:

$$SA \le 1 + \frac{1}{T - 1}$$

With $T = 10$, space amplification is bounded by roughly 1.1x (about 10% overhead).
5.3 Cognica's RocksDB Integration
Cognica builds its storage layer on RocksDB, a high-performance LSM-tree implementation originally developed at Facebook. This section examines how Cognica configures and extends RocksDB for unified query processing.
5.3.1 Storage Engine Architecture
The storage engine provides a layered abstraction over RocksDB, initializing it with carefully tuned parameters.
Thread Pool Configuration:
High-priority threads handle flushes (converting memtables to SSTables), while low-priority threads handle compaction (merging SSTables across levels).
5.3.2 Database Category Hierarchy
Cognica organizes data into three database categories:
| Category | ID | Purpose |
|---|---|---|
| System | 0 | Metadata, configuration, schema definitions |
| KeyValue | 1 | Application key-value data with TTL support |
| Document | 2 | Collections, documents, indexes |
Each category operates as a logical partition within the same RocksDB instance, distinguished by key prefixes.
This prefix scheme enables:
- Isolation: Different categories never conflict
- Efficient Scans: Prefix iterators scan only relevant data
- Bloom Filters: Prefix-based bloom filters accelerate lookups
5.3.3 Workspace Multi-Tenancy
Cognica supports multi-tenant deployments where each tenant (workspace) has isolated data. The key encoding ensures workspace isolation without requiring separate database instances.
Total prefix length: 14 bytes, enabling efficient prefix extraction for bloom filters.
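As an illustration of this kind of encoding (the exact field order and byte layout in Cognica are not shown here, so this packing is an assumption), a 14-byte prefix can be built from the fields listed in 5.8.2 using fixed-width big-endian integers, so that byte order agrees with lexicographic key comparison:

```python
import struct

# Illustrative 14-byte key prefix: database type (1 B), category ID (1 B),
# collection ID (4 B), index ID (4 B), workspace ID (4 B). The field order
# is an assumption; big-endian packing keeps numeric order consistent with
# bytewise (lexicographic) comparison of the encoded keys.
def make_prefix(db_type, category, collection, index, workspace):
    return struct.pack(">BBIII", db_type, category, collection, index, workspace)

prefix = make_prefix(db_type=0, category=2, collection=42, index=7, workspace=1)
key = prefix + b"doc:12345"     # the user key follows the fixed prefix
```

All keys in the same collection share the same 14-byte prefix, which is exactly what a capped prefix extractor needs for prefix bloom filters and prefix iteration.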
5.4 Write Path Implementation
Understanding the write path is essential for optimizing write-heavy workloads common in document ingestion and full-text indexing.
5.4.1 Write Flow
5.4.2 Write-Ahead Log Configuration
The WAL ensures durability with configurable trade-offs:
| Parameter | Value | Purpose |
|---|---|---|
max_total_wal_size | 4 GB | Total WAL size limit across all column families |
wal_bytes_per_sync | 128 MB | Bytes between fsync calls |
wal_compression | ZSTD | Compression algorithm for WAL entries |
wal_ttl_seconds | 3600 | Automatic cleanup after 1 hour |
recycle_log_file_num | 16 | Reuse log files to reduce allocation overhead |
Durability Modes:
- Synchronous (`sync = true`): Every write waits for fsync. Maximum durability, highest latency.
- Asynchronous (`sync = false`): Writes return immediately. Risk of losing recent writes on crash.
- Group Commit: Multiple writes share a single fsync, amortizing overhead.
Cognica defaults to asynchronous writes with periodic sync (every 128 MB), balancing throughput and durability.
5.4.3 Memtable Configuration
The memtable acts as the write buffer, absorbing writes until flushed:
| Parameter | Value | Impact |
|---|---|---|
write_buffer_size | 256 MB | Size of each memtable |
max_write_buffer_number | 16 | Maximum concurrent memtables |
min_write_buffer_number_to_merge | 1 | Merge threshold before flush |
arena_block_size | 16 MB | Memory allocation granularity |
Total Write Buffer Capacity:

$$256 \text{ MB} \times 16 = 4 \text{ GB}$$

This allows up to 4 GB of writes to be buffered in memory, enabling high write throughput for bulk ingestion.
5.4.4 Memtable Prefix Bloom Filters
Cognica enables prefix bloom filters on memtables to accelerate point queries before data reaches SSTables.

The bloom filter answers the question "might this key exist in this memtable?" with a configurable false positive rate:

$$\epsilon = \left(1 - e^{-kn/m}\right)^{k}$$

where $k$ is the number of hash functions, $n$ is the number of keys, and $m$ is the bloom filter size in bits.
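The formula is easy to evaluate. A small sketch, assuming the common RocksDB-style sizing of about 10 bits per key with the near-optimal hash count $k = (m/n)\ln 2$:

```python
import math

# Expected false positive rate of a Bloom filter with m bits, n keys,
# and k hash functions: eps = (1 - e^(-k*n/m))^k.
def bloom_fpr(m_bits, n_keys, k_hashes):
    return (1 - math.exp(-k_hashes * n_keys / m_bits)) ** k_hashes

bits_per_key = 10            # assumed sizing, typical for RocksDB filters
n = 1_000_000
m = bits_per_key * n
k = round(m / n * math.log(2))   # near-optimal hash count: ~7 for 10 bits/key

print(f"{bloom_fpr(m, n, k):.4f}")   # slightly under 1%
```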
5.5 Read Path Implementation
The read path must check multiple locations, making optimization critical for query performance.
5.5.1 Read Flow
5.5.2 Block Cache Architecture
Cognica employs a multi-level cache hierarchy to minimize disk I/O:
Primary Block Cache (HyperClockCache):
| Parameter | Value | Purpose |
|---|---|---|
cache_capacity | 16 GB | Total cache size |
cache_shard_bits | 4 | 16 shards for concurrent access |
strict_capacity_limit | true | Never exceed capacity |
The HyperClockCache uses a clock-based eviction algorithm optimized for high concurrency.
Index and Filter Caching:
cache_index_and_filter_blocks: true
cache_index_and_filter_blocks_with_high_priority: true
Index and filter blocks receive high priority in the cache because their eviction causes disproportionate performance degradation - every subsequent read must reload them from disk.
5.5.3 Bloom Filter Configuration
Cognica uses Ribbon filters, an advanced alternative to traditional Bloom filters.
Ribbon Filter Advantages:
- 20-30% less space than Bloom filters for same false positive rate
- Faster construction for large key sets
- Cache-friendly query pattern
For false positive rate $\epsilon$, a classic Bloom filter requires roughly $1.44 \log_2(1/\epsilon)$ bits per key, while a Ribbon filter approaches the information-theoretic minimum of $\log_2(1/\epsilon)$ bits; at $\epsilon = 1\%$ this is about 9.6 versus roughly 7 bits per key.
5.5.4 Read Options Optimization
Cognica configures read operations for optimal performance:
| Option | Value | Impact |
|---|---|---|
auto_prefix_mode | true | Use prefix extractors for bloom filters |
verify_checksums | false | Skip checksum verification for speed |
readahead_size | 512 KB | Sequential read buffer |
adaptive_readahead | true | Dynamically adjust based on access pattern |
async_io | true | Enable asynchronous I/O |
Read Ahead Strategy:
For sequential scans, readahead prefetches data before it is requested. Without readahead, every 4 KB block pays the full per-request overhead:

$$t_{\text{scan}} = N_{\text{blocks}} \cdot \left(t_{\text{request}} + \frac{4\text{ KB}}{BW}\right)$$

With readahead, that overhead is amortized across 512 KB requests:

$$t_{\text{scan}} = N_{\text{requests}} \cdot \left(t_{\text{request}} + \frac{512\text{ KB}}{BW}\right)$$

A 512 KB readahead can improve sequential read performance by 10-100x compared to reading individual blocks.
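Plugging assumed figures into this cost model (100 us per-request overhead, 500 MB/s bandwidth - both illustrative, not measured values from Cognica) lands in the claimed range:

```python
# Simple cost model for sequential scans: each I/O request pays a fixed
# overhead plus transfer time. Figures are assumptions for illustration.
REQUEST_S = 100e-6          # per-request overhead (assumed)
BW        = 500e6           # bytes/second (assumed)
BLOCK     = 4 * 1024        # read one block at a time
READAHEAD = 512 * 1024      # read 512 KB at a time

def scan_time(total_bytes, request_size):
    requests = total_bytes / request_size
    return requests * (REQUEST_S + request_size / BW)

total = 100 * 2**20                       # scan 100 MB
speedup = scan_time(total, BLOCK) / scan_time(total, READAHEAD)
print(f"{speedup:.0f}x")
```

With these numbers the model gives roughly a 12x speedup; faster devices with higher fixed overheads push the ratio toward the upper end of the 10-100x range.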
5.6 Compaction Strategies
Compaction is the process of merging SSTables to maintain read performance and reclaim space from deleted or overwritten keys. The choice of compaction strategy significantly impacts system behavior.
5.6.1 Leveled Compaction
Cognica uses leveled compaction, where each level (except L0) contains non-overlapping SSTables.
Properties:
- L0: Overlapping SSTables (direct memtable flushes)
- L1+: Non-overlapping, sorted SSTables
- Size ratio $T$ between levels (10 in Cognica's configuration)
Compaction Trigger:
When level $L_i$ exceeds its size limit:

$$\text{size}(L_i) > \text{max\_bytes\_for\_level\_base} \cdot T^{\,i-1}$$

SSTables from $L_i$ are merged with overlapping SSTables in $L_{i+1}$.
5.6.2 Compression Strategy
Cognica employs a tiered compression strategy optimized for the access patterns at each level:
| Level | Algorithm | Rationale |
|---|---|---|
| 0-4 | LZ4 | Fast compression/decompression for hot data |
| 5-6 | ZSTD | High compression ratio for cold data |
Compression Trade-offs:
For hot data (L0-L4), LZ4's fast decompression minimizes latency:
- LZ4: ~4 GB/s decompression
- ZSTD: ~1 GB/s decompression
For cold data (L5-L6), ZSTD's superior compression ratio reduces storage:
- LZ4: ~2.5x compression
- ZSTD: ~4x compression
Dictionary Compression:
ZSTD supports dictionary compression, where common patterns are pre-computed:
max_dict_bytes: 32_KB
zstd_max_train_bytes: 3_MB
Dictionary compression can improve ratios by 20-50% for structured data like JSON documents.
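The effect is easy to see with the standard library. Cognica uses ZSTD dictionaries, but zlib's preset-dictionary support (`zdict`) demonstrates the same idea: structured records share boilerplate that the dictionary captures, so each record compresses much better on its own:

```python
import zlib

# Dictionary compression illustrated with zlib's preset dictionary.
# The dictionary holds the structural boilerplate shared by records.
dictionary = b'{"user_id": , "event": "", "timestamp": }'
record = b'{"user_id": 12345, "event": "click", "timestamp": 1704067200}'

# Without a dictionary: a short record barely compresses at all.
plain = zlib.compress(record, 9)

# With a dictionary: shared structure becomes cheap back-references.
comp = zlib.compressobj(9, zdict=dictionary)
with_dict = comp.compress(record) + comp.flush()

print(len(plain), len(with_dict))
```

Decompression must supply the same dictionary (`zlib.decompressobj(zdict=...)`), which parallels the note below that encrypted or dictionary-compressed backups need the original auxiliary data to restore.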
5.6.3 Custom Compaction Filter
Cognica implements a custom compaction filter that runs during compaction to:
- Expire TTL Data: Remove key-value pairs past their time-to-live
- Detect Migration: Identify data requiring schema migration
- Validate Structure: Ensure keys match expected category and database type
Filter Decision Logic:
The filter runs during compaction, making TTL expiration essentially "free" - data is cleaned up as part of the normal compaction process without additional I/O.
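The decision logic can be sketched as a pure function over each key-value pair seen during compaction. All names here are illustrative, not Cognica's actual API:

```python
# Sketch of a TTL-expiring compaction filter: each pair visited during
# compaction is kept, removed (TTL expired), or rewritten (schema migration).
# Decision codes and field names are illustrative assumptions.
KEEP, REMOVE, CHANGE = 0, 1, 2

def compaction_filter(key, value, now, migrate):
    expires_at = value.get("ttl")
    if expires_at is not None and expires_at <= now:
        return REMOVE, None                 # expired: dropped for free
    if value.get("version", 1) < 2 and migrate:
        migrated = dict(value, version=2)   # rewrite to the new schema
        return CHANGE, migrated
    return KEEP, value

now = 1_000_000
decision, _ = compaction_filter(b"k1", {"ttl": 999_999}, now, migrate=False)
```

Because compaction already reads and rewrites every pair it touches, returning REMOVE or CHANGE costs no extra I/O beyond the merge itself.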
5.6.4 Compaction Tuning
Cognica's compaction configuration balances write amplification, space amplification, and read performance:
| Parameter | Value | Purpose |
|---|---|---|
target_file_size_base | 64 MB | SSTable size at L1 |
max_bytes_for_level_base | 512 MB | Size limit for L1 |
level0_file_num_compaction_trigger | 4 | L0 files before compaction |
level0_slowdown_writes_trigger | 20 | L0 files before write slowdown |
level0_stop_writes_trigger | 36 | L0 files before write stop |
Write Stall Prevention:
When L0 accumulates too many files, writes must slow down to allow compaction to catch up: at 20 L0 files RocksDB throttles the write rate, and at 36 files it stops writes entirely until compaction reduces the backlog.
5.7 Transaction Support
Cognica provides ACID transactions through RocksDB's TransactionDB, with extensions for distributed consensus.
5.7.1 Transaction Abstraction
The transaction interface supports multiple implementation strategies.
5.7.2 Isolation Levels
Snapshot Isolation:
Each transaction sees a consistent snapshot of the database at its start time.
Snapshot isolation prevents dirty reads and non-repeatable reads but allows write skew anomalies.
Serializable Snapshot Isolation (SSI):
RocksDB supports SSI through conflict detection.
When conflicts are detected, one transaction aborts to maintain serializability.
5.7.3 Write Batch Optimization
For write-heavy workloads, Cognica uses indexed write batches that buffer writes in memory.
Benefits:
- Read-your-writes: Queries see uncommitted changes within the transaction
- Reduced lock contention: No locks until commit
- Atomic commit: All changes apply atomically
Index Structure:
The tsl::htrie_map provides efficient prefix-based lookups over the buffered keys, enabling efficient iteration over key ranges within a transaction's write set.
5.7.4 Savepoints
Savepoints enable partial rollback within a transaction.
Savepoints are implemented as markers in the write batch, enabling efficient partial undo.
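Both ideas combine naturally: the batch keeps an ordered operation log plus a key index for read-your-writes, and a savepoint is just the log length at the time it was set. A minimal sketch (a plain dict stands in for the C++ `tsl::htrie_map`; method names are illustrative):

```python
# Sketch of an indexed write batch with savepoints: writes buffer in an
# ordered log plus a per-key index; a savepoint records the log length,
# and rollback truncates past it and rebuilds the index.
class WriteBatch:
    def __init__(self):
        self.ops = []          # ordered (key, value-or-None) log
        self.index = {}        # newest buffered value per key
        self.savepoints = []

    def put(self, key, value):
        self.ops.append((key, value))
        self.index[key] = value

    def delete(self, key):
        self.ops.append((key, None))       # None marks a buffered delete
        self.index[key] = None

    def get(self, key, db):
        if key in self.index:              # read-your-writes
            return self.index[key]
        return db.get(key)                 # fall through to the database

    def set_savepoint(self):
        self.savepoints.append(len(self.ops))

    def rollback_to_savepoint(self):
        n = self.savepoints.pop()
        self.ops = self.ops[:n]            # cheap truncation of the log
        self.index = {}                    # rebuild index from surviving ops
        for key, value in self.ops:
            self.index[key] = value

batch = WriteBatch()
batch.put("a", 1)
batch.set_savepoint()
batch.put("a", 2)
batch.rollback_to_savepoint()   # "a" reverts to 1
```

Truncating the log makes rollback proportional to the work being undone, which is why savepoint markers in the write batch give efficient partial undo.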
5.8 Custom Extensions
Cognica extends RocksDB with custom components for unified query processing.
5.8.1 Custom Comparators
Key ordering determines SSTable organization and iteration behavior. Cognica provides two comparators:
Ascending Comparator (default):
Descending Comparator:
The descending comparator enables efficient "ORDER BY DESC" queries by storing data in reverse order.
Key Compression Optimization:
The comparators implement FindShortestSeparator() to minimize index block size:
Given keys $a$ and $b$ where $a < b$, find the shortest key $s$ such that $a \le s < b$.

For example, a separator between `"hello"` and `"world"` can be shortened to the single byte `"i"`, since the first bytes already differ and `'h' + 1 = 'i'` still sorts below `'w'`.
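A sketch of this shortening, modeled on the behavior of RocksDB's bytewise comparator (truncate at the first differing byte and bump it when the result still sorts below the limit; this is an illustration, not Cognica's exact implementation):

```python
# Bytewise FindShortestSeparator sketch: given start < limit, return a
# short key s with start <= s < limit, or start itself when no safe
# shortening exists.
def find_shortest_separator(start: bytes, limit: bytes) -> bytes:
    # length of the common prefix
    i = 0
    while i < min(len(start), len(limit)) and start[i] == limit[i]:
        i += 1
    if i >= min(len(start), len(limit)):
        return start          # one key is a prefix of the other: no shortening
    byte = start[i]
    if byte < 0xFF and byte + 1 < limit[i]:
        return start[:i] + bytes([byte + 1])   # short, valid separator
    return start              # conservative: keep the original key

sep = find_shortest_separator(b"hello", b"world")
```

Index blocks store one separator per SSTable block, so shaving separators down to a byte or two meaningfully shrinks the index that must stay cached.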
5.8.2 Prefix Extraction
Prefix extractors enable bloom filters and prefix-based iteration:
Capped Prefix Transform (14 bytes):
The 14-byte prefix captures:
- Database type (1 byte)
- Category ID (1 byte)
- Collection ID (4 bytes)
- Index ID (4 bytes)
- Workspace ID (4 bytes)
This enables efficient filtering: "Find all documents in collection X" requires only prefix-matching bloom filter lookups.
5.8.3 Merge Operators
Merge operators enable atomic read-modify-write operations without read locks:
Counter Merge Operator:
Multiple increments merge during compaction - for example, a stored base value of 5 with pending operands +1, +2, and +3 collapses into a single stored value of 11, without a read-modify-write cycle per increment.
Clustered Term Index Merge Operator:
For full-text search, posting lists must merge efficiently.
The clustered term index stores multiple terms per key, requiring custom merge logic to maintain sorted order and handle deletions.
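The merge-operator contract can be sketched as a function combining an existing value with a queue of pending operands, applied at read time or during compaction. Both operators above fit the same shape (operand encodings here are illustrative assumptions):

```python
# Merge-operator semantics: fold pending operands into the stored value.

def counter_merge(existing, operands):
    # operands are signed deltas; existing may be absent (None)
    total = existing or 0
    for delta in operands:
        total += delta
    return total

def posting_merge(existing, operands):
    # operands are ("add", doc_id) or ("del", doc_id) tuples
    docs = set(existing or [])
    for op, doc_id in operands:
        if op == "add":
            docs.add(doc_id)
        else:
            docs.discard(doc_id)
    return sorted(docs)            # posting lists stay sorted for merging

count = counter_merge(5, [1, 2, -1])
postings = posting_merge([1, 5], [("add", 3), ("del", 5), ("add", 9)])
```

Because the fold is associative, RocksDB is free to apply it lazily at read time or eagerly during compaction, whichever happens first.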
5.9 Backup and Recovery
Cognica provides backup and recovery mechanisms built on RocksDB's backup engine.
5.9.1 Backup Architecture
Incremental Backups:
Because SSTables are immutable, subsequent backups only copy SSTables created since the previous backup.
For append-heavy workloads, incremental backups are dramatically smaller than full backups.
5.9.2 Point-in-Time Recovery
The backup engine maintains multiple backup versions:
| Backup ID | Timestamp | Files | Size |
|---|---|---|---|
| 1 | 2024-01-01 | 100 | 10 GB |
| 2 | 2024-01-02 | 15 | 1.5 GB |
| 3 | 2024-01-03 | 20 | 2 GB |
Recovery can restore the database to any retained backup point.
5.9.3 Encryption Support
Cognica supports encryption at rest through RocksDB's encrypted environment.
Encryption Flow:
Backups preserve encryption, requiring the same key for restoration.
5.10 Performance Characteristics
This section summarizes the performance characteristics of Cognica's LSM-tree storage.
5.10.1 Throughput Bounds
Write Throughput:

With 256 MB memtables, 100 ms flush time, 500 MB/s disk bandwidth, and 40x write amplification, sustained write throughput is bounded by

$$\frac{\text{disk bandwidth}}{WA} = \frac{500 \text{ MB/s}}{40} = 12.5 \text{ MB/s}$$

Read Throughput:

With a 99% cache hit rate, 100 GB/s memory bandwidth, and 500 MB/s disk bandwidth, the average service time per byte is

$$\frac{0.99}{100 \text{ GB/s}} + \frac{0.01}{500 \text{ MB/s}} \approx 30 \text{ ps/byte}$$

for an effective read throughput of roughly 33 GB/s.
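The same back-of-the-envelope arithmetic in code, using the figures above:

```python
# Throughput bounds from the chapter's figures: 500 MB/s disk, 40x write
# amplification, 99% block-cache hit rate, 100 GB/s memory bandwidth.
DISK_BW   = 500e6
WRITE_AMP = 40
HIT_RATE  = 0.99
MEM_BW    = 100e9

# Sustained writes: every application byte costs WRITE_AMP bytes of disk I/O.
write_tput = DISK_BW / WRITE_AMP                     # bytes/second

# Reads: average service time per byte mixes memory and disk costs.
time_per_byte = HIT_RATE / MEM_BW + (1 - HIT_RATE) / DISK_BW
read_tput = 1 / time_per_byte                        # bytes/second

print(f"writes: {write_tput/1e6:.1f} MB/s, reads: {read_tput/1e9:.1f} GB/s")
```

Note how sharply the 1% of disk-bound reads dominates the average: even at 99% hit rate, effective read bandwidth is a third of raw memory bandwidth.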
5.10.2 Latency Distribution
Point query latency depends on data location:
| Location | Latency | Probability |
|---|---|---|
| Block Cache | 1-10 us | 99% (with good caching) |
| Memtable | 10-100 us | Depends on recency |
| L0 SSTables | 100 us - 1 ms | Low (bloom filters) |
| L1+ SSTables | 1-10 ms | Very low (bloom filters) |
P99 Latency:
With a 1% bloom filter false positive rate and 1 ms disk latency, the expected overhead from false positives is only about $0.01 \times 1 \text{ ms} = 10\ \mu\text{s}$ per query; P99 latency is therefore dominated by the occasional genuine disk read and stays near 1 ms rather than one disk read per level.
5.10.3 Space Efficiency
Effective Compression Ratio:
With tiered compression (LZ4 at ~2.5x for hot data, ZSTD at ~4x for cold data), and assuming 30% of data is hot (recent) and 70% is cold (historical), the effective ratio is

$$\frac{1}{0.3 / 2.5 + 0.7 / 4} \approx 3.4\times$$
5.11 Summary
This chapter examined the LSM-tree storage architecture underlying Cognica's persistence layer. Key takeaways:
- LSM-trees optimize for write throughput by converting random writes to sequential I/O through the memtable/SSTable hierarchy.
- Write amplification is the primary cost, with leveled compaction yielding $O(T \cdot L)$ amplification. Cognica tunes this through compression tiers and careful level sizing.
- Read amplification is controlled through bloom filters (Ribbon filters in Cognica), achieving near-optimal single-read performance for point queries.
- The multi-level cache hierarchy (block cache, row cache, OS page cache) minimizes disk I/O for hot data.
- Transaction support through RocksDB's TransactionDB provides ACID guarantees with configurable isolation levels.
- Custom extensions (comparators, merge operators, compaction filters) adapt the generic LSM-tree for unified query processing.
The storage layer provides the foundation upon which Cognica builds document storage, full-text indexes, and vector indexes - topics we explore in the following chapters.