Chapter 5: LSM-Tree Storage Architecture

This chapter explores the Log-Structured Merge-tree (LSM-tree) storage architecture that forms the foundation of Cognica's persistence layer. We examine the theoretical principles behind LSM-trees, their trade-offs compared to B-tree structures, and the specific implementation choices that optimize for unified query workloads spanning documents, full-text search, and vector operations.

5.1 Storage Engine Fundamentals

Every database system must answer a fundamental question: how should data be organized on persistent storage to optimize both read and write operations? The answer to this question shapes the entire system architecture.

5.1.1 The Read-Write Trade-off

Storage engines face an inherent tension between read and write performance. Consider two extreme strategies:

Write-Optimized (Append-Only Log):

T_{write} = O(1) \quad \text{(append to end)}
T_{read} = O(n) \quad \text{(scan entire log)}

An append-only log achieves constant-time writes by simply appending new records. However, reads require scanning the entire log to find relevant records, resulting in linear time complexity.

Read-Optimized (Sorted Array):

T_{write} = O(n) \quad \text{(maintain sorted order)}
T_{read} = O(\log n) \quad \text{(binary search)}

A sorted array enables logarithmic-time lookups via binary search. However, maintaining sorted order requires shifting elements on every insert, yielding linear-time writes.
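The two extremes can be sketched in a few lines of Python (illustrative helpers, not part of any real engine):

```python
import bisect

# Write-optimized: append-only log. O(1) writes, O(n) reads.
log = []

def log_put(key, value):
    log.append((key, value))          # constant-time append

def log_get(key):
    for k, v in reversed(log):        # scan the whole log in the worst case
        if k == key:
            return v
    return None

# Read-optimized: sorted array. O(log n) reads, O(n) writes.
sorted_arr = []

def arr_put(key, value):
    bisect.insort(sorted_arr, (key, value))   # shifts elements: O(n)

def arr_get(key):
    i = bisect.bisect_left(sorted_arr, (key,))
    if i < len(sorted_arr) and sorted_arr[i][0] == key:
        return sorted_arr[i][1]
    return None
```

Scanning the log newest-first also gives last-writer-wins semantics for free, which is exactly the property LSM-trees preserve.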

5.1.2 B-Trees: The Traditional Solution

B-trees achieve a balance by maintaining sorted data in a tree structure with high fanout:

T_{write} = O(\log_B n) \qquad T_{read} = O(\log_B n)

where B is the branching factor (typically 100-1000). B-trees have dominated database storage for decades due to their balanced read/write performance.

However, B-trees suffer from write amplification - the ratio of bytes written to storage versus bytes written by the application:

W_{amp}^{\text{B-tree}} = O(\log_B n)

Each update may modify multiple tree nodes from leaf to root, with each node requiring a full page write (typically 4-16 KB) even for small changes.

5.1.3 LSM-Trees: Write-Optimized Alternative

Log-Structured Merge-trees (LSM-trees), introduced by O'Neil et al. in 1996, take a different approach: buffer writes in memory, then flush sorted runs to disk, and periodically merge runs to maintain read performance.

Key insight: Sequential I/O is 100-1000x faster than random I/O on both HDDs and SSDs. LSM-trees convert random writes to sequential writes at the cost of additional read overhead.

W_{amp}^{\text{LSM}} = O\left(\frac{L \cdot T}{B}\right)

where L is the number of levels, T is the size ratio between levels, and B is the block size. With proper tuning, LSM write amplification can be 10-100x lower than B-trees.
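As a concrete illustration of the buffer-then-flush idea, here is a toy LSM in Python. It is a sketch only: real engines keep the memtable sorted, binary-search within runs, and compact runs in the background.

```python
class TinyLSM:
    """Toy LSM: writes go to an in-memory memtable; when it fills, it is
    flushed as an immutable sorted run. Reads check the memtable first,
    then runs from newest to oldest."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.runs = []                 # newest run last
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        run = sorted(self.memtable.items())   # sequential, sorted write
        self.runs.append(run)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):       # newest first wins
            for k, v in run:                  # binary search in practice
                if k == key:
                    return v
        return None
```

Checking runs newest-first is what makes overwrites correct without ever mutating a flushed run.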

5.2 LSM-Tree Architecture

An LSM-tree consists of multiple components organized hierarchically from fast volatile memory to slow persistent storage.

5.2.1 Component Hierarchy


Memtable: An in-memory sorted data structure (typically a skip list or red-black tree) that buffers incoming writes. When the memtable reaches a size threshold, it becomes immutable and a new active memtable is created.

Write-Ahead Log (WAL): A persistent append-only log that records every write before it enters the memtable. The WAL ensures durability - if the system crashes before a memtable is flushed, the WAL can replay the writes during recovery.

Sorted String Table (SSTable): An immutable, sorted file containing key-value pairs. SSTables are organized into levels, with each level containing increasingly larger amounts of data.

5.2.2 Size Ratio and Level Capacity

The size ratio TT determines how much larger each level is compared to the previous level:

\text{Size}(L_i) = T \times \text{Size}(L_{i-1})

For a size ratio T = 10 and initial size S_0:

Level | Size      | Typical Value
L_0   | S_0       | 256 MB
L_1   | T·S_0     | 2.5 GB
L_2   | T^2·S_0   | 25 GB
L_3   | T^3·S_0   | 250 GB
L_4   | T^4·S_0   | 2.5 TB

The total capacity with L levels is:

\text{Total Capacity} = S_0 \cdot \sum_{i=0}^{L-1} T^i = S_0 \cdot \frac{T^L - 1}{T - 1}
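The level sizing is easy to verify numerically; the helper below is an illustrative calculation with sizes in MB:

```python
def level_sizes(s0_mb=256, T=10, levels=5):
    """Per-level capacities under Size(L_i) = T * Size(L_{i-1}),
    plus the closed-form total S_0 * (T^L - 1) / (T - 1)."""
    sizes = [s0_mb * T**i for i in range(levels)]
    total = s0_mb * (T**levels - 1) // (T - 1)
    return sizes, total
```

With the defaults this reproduces the table above: 256 MB at L_0 up to 2,560,000 MB (~2.5 TB) at L_4, for 2,844,416 MB in total.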

5.2.3 Write Amplification Analysis

Write amplification measures how many times data is written to storage over its lifetime. In an LSM-tree, data moves through levels via compaction:

W_{amp} = \frac{\text{Total bytes written to storage}}{\text{Bytes written by application}}

For leveled compaction with size ratio T:

W_{amp} = O(T \cdot L) = O\left(T \cdot \log_T \frac{N}{S_0}\right)

where N is the total data size. With T = 10 and 1 TB of data:

W_{amp} \approx 10 \times 4 = 40

Each byte written by the application results in approximately 40 bytes written to storage over its lifetime.

5.2.4 Read Amplification Analysis

Read amplification measures how many storage locations must be checked to satisfy a point query:

R_{amp} = \text{Number of locations checked per query}

In the worst case, a key might exist only in the oldest level, requiring checks at every level. With bloom filters (false positive rate p), the expected read amplification is:

R_{amp} = 1 + (L - 1) \cdot p

For L = 4 levels and p = 1\%:

R_{amp} = 1 + 3 \times 0.01 = 1.03

Bloom filters dramatically reduce read amplification by eliminating unnecessary SSTable searches.

5.2.5 Space Amplification

Space amplification measures the ratio of storage used to logical data size:

S_{amp} = \frac{\text{Storage space used}}{\text{Logical data size}}

LSM-trees may temporarily store multiple versions of the same key across levels until compaction merges them. In the worst case:

S_{amp} = 1 + \frac{1}{T}

With T = 10, space amplification is bounded by 1.1x (10% overhead).
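The three amplification factors from Sections 5.2.3-5.2.5 combine into one back-of-envelope helper (illustrative rule-of-thumb constants, not measurements):

```python
def lsm_amplification(T=10, levels=4, bloom_fp=0.01):
    """Rule-of-thumb amplification factors for leveled compaction."""
    w_amp = T * levels                   # write: O(T * L)
    r_amp = 1 + (levels - 1) * bloom_fp  # read: 1 + (L - 1) * p
    s_amp = 1 + 1 / T                    # space: 1 + 1/T
    return w_amp, r_amp, s_amp
```

With the chapter's numbers (T = 10, L = 4, p = 1%) this yields 40x write, 1.03x read, and 1.1x space amplification.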

5.3 Cognica's RocksDB Integration

Cognica builds its storage layer on RocksDB, a high-performance LSM-tree implementation originally developed at Facebook. This section examines how Cognica configures and extends RocksDB for unified query processing.

5.3.1 Storage Engine Architecture

The storage engine provides a layered abstraction over RocksDB:


The storage engine initializes RocksDB with carefully tuned parameters:

Thread Pool Configuration:

\text{High Priority Threads} = \max(4, \lfloor \text{cores} / 2 \rfloor)
\text{Low Priority Threads} = \max(2, \lfloor \text{cores} / 4 \rfloor)

High-priority threads handle flushes (converting memtables to SSTables), while low-priority threads handle compaction (merging SSTables across levels).

5.3.2 Database Category Hierarchy

Cognica organizes data into three database categories:

Category | ID | Purpose
System   | 0  | Metadata, configuration, schema definitions
KeyValue | 1  | Application key-value data with TTL support
Document | 2  | Collections, documents, indexes

Each category operates as a logical partition within the same RocksDB instance, distinguished by key prefixes:

\text{Key} = \text{CategoryID}(1) \,\|\, \text{WorkspaceID}(4) \,\|\, \text{CollectionID}(4) \,\|\, \text{UserKey}

This prefix scheme enables:

  • Isolation: Different categories never conflict
  • Efficient Scans: Prefix iterators scan only relevant data
  • Bloom Filters: Prefix-based bloom filters accelerate lookups

5.3.3 Workspace Multi-Tenancy

Cognica supports multi-tenant deployments where each tenant (workspace) has isolated data:


The key encoding ensures workspace isolation without requiring separate database instances:

\text{Prefix} = \text{DB}(1) \,\|\, \text{Category}(1) \,\|\, \text{Workspace}(4) \,\|\, \text{Collection}(4) \,\|\, \text{Index}(4)

Total prefix length: 14 bytes, enabling efficient prefix extraction for bloom filters.
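A sketch of this prefix encoding in Python. Field order and widths follow the formula above; the big-endian layout is an assumption (the exact on-disk format is Cognica-internal), chosen because it keeps numeric order consistent with memcmp-style key comparison:

```python
import struct

def make_prefix(db: int, category: int, workspace: int,
                collection: int, index: int) -> bytes:
    # ">BBIII" = two 1-byte fields plus three big-endian 4-byte fields,
    # 14 bytes total. Big-endian integers sort correctly under the
    # lexicographic byte comparison used for keys.
    return struct.pack(">BBIII", db, category, workspace, collection, index)
```

Because every field is fixed-width, extracting the prefix for a bloom filter is a constant-time slice of the first 14 bytes.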

5.4 Write Path Implementation

Understanding the write path is essential for optimizing write-heavy workloads common in document ingestion and full-text indexing.

5.4.1 Write Flow

[Write flow diagram]

5.4.2 Write-Ahead Log Configuration

The WAL ensures durability with configurable trade-offs:

Parameter            | Value  | Purpose
max_total_wal_size   | 4 GB   | Total WAL size limit across all column families
wal_bytes_per_sync   | 128 MB | Bytes between fsync calls
wal_compression      | ZSTD   | Compression algorithm for WAL entries
wal_ttl_seconds      | 3600   | Automatic cleanup after 1 hour
recycle_log_file_num | 16     | Reuse log files to reduce allocation overhead

Durability Modes:

  1. Synchronous (sync = true): Every write waits for fsync. Maximum durability, highest latency.

  2. Asynchronous (sync = false): Writes return immediately. Risk of losing recent writes on crash.

  3. Group Commit: Multiple writes share a single fsync, amortizing overhead.

Cognica defaults to asynchronous writes with periodic sync (every 128 MB), balancing throughput and durability.

5.4.3 Memtable Configuration

The memtable acts as the write buffer, absorbing writes until flushed:

Parameter                        | Value  | Impact
write_buffer_size                | 256 MB | Size of each memtable
max_write_buffer_number          | 16     | Maximum concurrent memtables
min_write_buffer_number_to_merge | 1      | Merge threshold before flush
arena_block_size                 | 16 MB  | Memory allocation granularity

Total Write Buffer Capacity:

\text{Max Memory} = \text{write\_buffer\_size} \times \text{max\_write\_buffer\_number} = 256\text{ MB} \times 16 = 4\text{ GB}

This allows up to 4 GB of writes to be buffered in memory, enabling high write throughput for bulk ingestion.

5.4.4 Memtable Prefix Bloom Filters

Cognica enables prefix bloom filters on memtables to accelerate point queries before data reaches SSTables:

\text{bloom\_size} = \text{memtable\_size} \times \text{prefix\_bloom\_ratio} = 256\text{ MB} \times 0.2 = 51.2\text{ MB per memtable}

The bloom filter answers the question "might this key exist in this memtable?" with a configurable false positive rate:

p = \left(1 - e^{-kn/m}\right)^k

where k is the number of hash functions, n is the number of keys, and m is the bloom filter size in bits.
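Both the false positive formula and the bits-per-key bound (used again in Section 5.5.3) are easy to evaluate numerically:

```python
import math

def bloom_fp_rate(k: int, n: int, m: int) -> float:
    """p = (1 - e^{-kn/m})^k for k hash functions, n keys, m bits."""
    return (1 - math.exp(-k * n / m)) ** k

def bits_per_key(p: float) -> float:
    """Space needed at the optimal k: -ln(p) / ln(2)^2."""
    return -math.log(p) / math.log(2) ** 2
```

At 10 bits per key with k = 7 hashes the false positive rate comes out just under 1%; inverting, a 1% target needs roughly 9.6 bits per key.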

5.5 Read Path Implementation

The read path must check multiple locations, making optimization critical for query performance.

5.5.1 Read Flow

[Read flow diagram]

5.5.2 Block Cache Architecture

Cognica employs a multi-level cache hierarchy to minimize disk I/O:

Primary Block Cache (HyperClockCache):

Parameter             | Value | Purpose
cache_capacity        | 16 GB | Total cache size
cache_shard_bits      | 4     | 16 shards for concurrent access
strict_capacity_limit | true  | Never exceed capacity

The HyperClockCache uses a clock-based eviction algorithm optimized for high concurrency:

\text{Shards} = 2^{\text{shard\_bits}} = 2^4 = 16
\text{Shard Size} = \frac{\text{Total Capacity}}{\text{Shards}} = \frac{16\text{ GB}}{16} = 1\text{ GB}

Index and Filter Caching:

cache_index_and_filter_blocks: true
cache_index_and_filter_blocks_with_high_priority: true

Index and filter blocks receive high priority in the cache because their eviction causes disproportionate performance degradation - every subsequent read must reload them from disk.

5.5.3 Bloom Filter Configuration

Cognica uses Ribbon filters, an advanced alternative to traditional Bloom filters:

Ribbon Filter Advantages:

  • 20-30% less space than Bloom filters for same false positive rate
  • Faster construction for large key sets
  • Cache-friendly query pattern

Configuration:

\text{Bits per Key} \approx -\frac{\ln(p)}{\ln(2)^2}

For a false positive rate p = 1\%:

\text{Bits per Key} \approx -\frac{\ln(0.01)}{0.48} \approx 9.6 \text{ bits}

5.5.4 Read Options Optimization

Cognica configures read operations for optimal performance:

Option             | Value  | Impact
auto_prefix_mode   | true   | Use prefix extractors for bloom filters
verify_checksums   | false  | Skip checksum verification for speed
readahead_size     | 512 KB | Sequential read buffer
adaptive_readahead | true   | Dynamically adjust based on access pattern
async_io           | true   | Enable asynchronous I/O

Read Ahead Strategy:

For sequential scans, readahead prefetches data before it is requested:

\text{Effective Bandwidth} = \frac{\text{Block Size}}{\text{Seek Time} + \text{Transfer Time}}

With readahead:

\text{Effective Bandwidth} \approx \frac{\text{Readahead Size}}{\text{Seek Time} + \text{Transfer Time}}

A 512 KB readahead can improve sequential read performance by 10-100x compared to reading individual blocks.

5.6 Compaction Strategies

Compaction is the process of merging SSTables to maintain read performance and reclaim space from deleted or overwritten keys. The choice of compaction strategy significantly impacts system behavior.

5.6.1 Leveled Compaction

Cognica uses leveled compaction, where each level (except L0) contains non-overlapping SSTables:

Properties:

  • L0: Overlapping SSTables (direct memtable flushes)
  • L1+: Non-overlapping, sorted SSTables
  • Size ratio T = 10 between levels

Compaction Trigger:

When level L_i exceeds its size limit:

\text{Size}(L_i) > T^i \times \text{Target Base Size}

SSTables from L_i are merged with overlapping SSTables in L_{i+1}.
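The trigger test itself is simple; here is a sketch with sizes in MB. L0 is skipped because in practice it is triggered by file count rather than by bytes:

```python
def needs_compaction(level_sizes_mb, base_size_mb, T=10):
    """Return the first level whose size exceeds its target,
    Size(L_i) > T^i * base, or None if every level is within budget."""
    for i, size in enumerate(level_sizes_mb):
        if i == 0:
            continue  # L0 uses a file-count trigger instead
        if size > (T ** i) * base_size_mb:
            return i
    return None
```

For example, with a 512 MB base target, L2's budget is T^2 × 512 MB = 51.2 GB, so an L2 holding 60 GB is selected for compaction into L3.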


5.6.2 Compression Strategy

Cognica employs a tiered compression strategy optimized for the access patterns at each level:

Level | Algorithm | Rationale
0-4   | LZ4       | Fast compression/decompression for hot data
5-6   | ZSTD      | High compression ratio for cold data

Compression Trade-offs:

\text{Read Latency} = \text{Disk Read Time} + \text{Decompression Time}

For hot data (L0-L4), LZ4's fast decompression minimizes latency:

  • LZ4: ~4 GB/s decompression
  • ZSTD: ~1 GB/s decompression

For cold data (L5-L6), ZSTD's superior compression ratio reduces storage:

  • LZ4: ~2.5x compression
  • ZSTD: ~4x compression

Dictionary Compression:

ZSTD supports dictionary compression, where common patterns are pre-computed:

max_dict_bytes: 32_KB
zstd_max_train_bytes: 3_MB

Dictionary compression can improve ratios by 20-50% for structured data like JSON documents.

5.6.3 Custom Compaction Filter

Cognica implements a custom compaction filter that runs during compaction to:

  1. Expire TTL Data: Remove key-value pairs past their time-to-live
  2. Detect Migration: Identify data requiring schema migration
  3. Validate Structure: Ensure keys match expected category and database type

Filter Decision Logic:

\text{Decision}(key, value) = \begin{cases} \text{Remove} & \text{if } \text{TTL}(value) < \text{now} \\ \text{Remove} & \text{if } \text{tombstone}(value) \\ \text{Keep} & \text{otherwise} \end{cases}

The filter runs during compaction, making TTL expiration essentially "free" - data is cleaned up as part of the normal compaction process without additional I/O.
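A minimal model of this decision function in Python. The tombstone sentinel and the string return values are illustrative choices for this sketch, not RocksDB's CompactionFilter API:

```python
import time

TOMBSTONE = b""   # assumed sentinel for deleted entries in this sketch

def compaction_filter(value: bytes, ttl_deadline, now=None) -> str:
    """Mirror of the decision function above: drop expired or
    tombstoned entries during compaction, keep everything else."""
    now = time.time() if now is None else now
    if ttl_deadline is not None and ttl_deadline < now:
        return "remove"
    if value == TOMBSTONE:
        return "remove"
    return "keep"
```

Because the filter is invoked on records that compaction is already rewriting, dropping an expired entry costs no extra I/O.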

5.6.4 Compaction Tuning

Cognica's compaction configuration balances write amplification, space amplification, and read performance:

Parameter                          | Value  | Purpose
target_file_size_base              | 64 MB  | SSTable size at L1
max_bytes_for_level_base           | 512 MB | Size limit for L1
level0_file_num_compaction_trigger | 4      | L0 files before compaction
level0_slowdown_writes_trigger     | 20     | L0 files before write slowdown
level0_stop_writes_trigger         | 36     | L0 files before write stop

Write Stall Prevention:

When L0 accumulates too many files, writes must slow down to allow compaction to catch up:

\text{Write Rate} = \begin{cases} \text{Full Speed} & \text{if } |L_0| < 4 \\ \text{Throttled} & \text{if } 4 \leq |L_0| < 20 \\ \text{Severely Throttled} & \text{if } 20 \leq |L_0| < 36 \\ \text{Stopped} & \text{if } |L_0| \geq 36 \end{cases}
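The throttle curve can be modeled directly from those thresholds. The slowdown factors below are illustrative; RocksDB computes its actual delayed write rate dynamically:

```python
def write_rate(l0_files: int, full_mbps: float = 500.0) -> float:
    """Piecewise write-rate limit driven by the L0 file count
    (thresholds from the configuration table; factors are made up
    for illustration)."""
    if l0_files < 4:
        return full_mbps          # full speed
    if l0_files < 20:
        return full_mbps / 4      # throttled
    if l0_files < 36:
        return full_mbps / 64     # severely throttled
    return 0.0                    # writes stopped
```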

5.7 Transaction Support

Cognica provides ACID transactions through RocksDB's TransactionDB, with extensions for distributed consensus.

5.7.1 Transaction Abstraction

The transaction interface supports multiple implementation strategies:


5.7.2 Isolation Levels

Snapshot Isolation:

Each transaction sees a consistent snapshot of the database at its start time:

\text{Read}(T, k) = \text{Version}(k, \text{start\_time}(T))

Snapshot isolation prevents dirty reads and non-repeatable reads but allows write skew anomalies.

Serializable Snapshot Isolation (SSI):

RocksDB supports SSI through conflict detection:

\text{Conflict}(T_1, T_2) = \text{ReadSet}(T_1) \cap \text{WriteSet}(T_2) \neq \emptyset

When conflicts are detected, one transaction aborts to maintain serializability.
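The conflict test reduces to a set intersection; a sketch over explicit read and write sets:

```python
def conflicts(read_set_t1, write_set_t2) -> bool:
    """T1 and T2 conflict when T1 read a key that T2 wrote;
    under SSI, one of the two must then abort."""
    return bool(set(read_set_t1) & set(write_set_t2))
```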

5.7.3 Write Batch Optimization

For write-heavy workloads, Cognica uses indexed write batches that buffer writes in memory:

Benefits:

  • Read-your-writes: Queries see uncommitted changes within the transaction
  • Reduced lock contention: No locks until commit
  • Atomic commit: All changes apply atomically

Index Structure:

The tsl::htrie_map provides efficient prefix-based lookups:

T_{lookup} = O(|key|) \quad \text{(key length, not number of entries)}

This enables efficient iteration over key ranges within a transaction's write set.
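A dictionary-backed sketch of an indexed write batch, showing read-your-writes, prefix iteration, and atomic commit. A sorted scan over a plain dict stands in for the tsl::htrie_map of the real implementation:

```python
class IndexedWriteBatch:
    """Buffers writes in memory; reads within the transaction see
    pending writes, and commit applies them atomically."""

    def __init__(self, db: dict):
        self.db = db            # committed state (a plain dict here)
        self.pending = {}       # uncommitted writes

    def put(self, key: str, value):
        self.pending[key] = value

    def get(self, key: str):
        # read-your-writes: pending entries shadow committed ones
        return self.pending.get(key, self.db.get(key))

    def scan_prefix(self, prefix: str):
        return sorted(k for k in self.pending if k.startswith(prefix))

    def commit(self):
        self.db.update(self.pending)   # all changes apply together
        self.pending = {}
```

No locks are taken while the batch accumulates; contention is deferred to commit time, as described above.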

5.7.4 Savepoints

Savepoints enable partial rollback within a transaction:


Savepoints are implemented as markers in the write batch, enabling efficient partial undo.
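Savepoint markers can be modeled as indexes into the ordered write log, so rollback is a truncation (a sketch, not RocksDB's WriteBatch implementation):

```python
class BatchWithSavepoints:
    """Write batch with savepoints: each savepoint records the current
    length of the op log; rollback truncates back to that marker."""

    def __init__(self):
        self.ops = []           # ordered (key, value) writes
        self.savepoints = []    # indexes into self.ops

    def put(self, key, value):
        self.ops.append((key, value))

    def set_savepoint(self):
        self.savepoints.append(len(self.ops))

    def rollback_to_savepoint(self):
        mark = self.savepoints.pop()
        del self.ops[mark:]     # efficient partial undo
```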

5.8 Custom Extensions

Cognica extends RocksDB with custom components for unified query processing.

5.8.1 Custom Comparators

Key ordering determines SSTable organization and iteration behavior. Cognica provides two comparators:

Ascending Comparator (default):

\text{Compare}(a, b) = \begin{cases} -1 & \text{if } a < b \text{ (lexicographically)} \\ 0 & \text{if } a = b \\ 1 & \text{if } a > b \end{cases}

Descending Comparator:

\text{Compare}_{desc}(a, b) = -\text{Compare}_{asc}(a, b)

The descending comparator enables efficient "ORDER BY DESC" queries by storing data in reverse order.

Key Compression Optimization:

The comparators implement FindShortestSeparator() to minimize index block size:

Given keys a and b where a < b, find the shortest s such that a ≤ s < b.

Example: for a = "application" and b = "apply", s = "applj".
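A sketch of the separator computation, mirroring the truncate-and-bump approach; edge cases (shared prefixes, 0xff bytes, descending comparators) are omitted:

```python
def find_shortest_separator(a: bytes, b: bytes) -> bytes:
    """Shortest s with a <= s < b: truncate at the first differing
    byte and bump it when that still stays below b."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    if i < len(a) and i < len(b) and a[i] + 1 < b[i]:
        return a[:i] + bytes([a[i] + 1])
    return a        # fall back to the full start key
```

Index blocks store these separators instead of full keys, which is why the optimization shrinks them.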

5.8.2 Prefix Extraction

Prefix extractors enable bloom filters and prefix-based iteration:

Capped Prefix Transform (14 bytes):

\text{Prefix}(key) = key[0:14]

The 14-byte prefix captures:

  • Database type (1 byte)
  • Category ID (1 byte)
  • Workspace ID (4 bytes)
  • Collection ID (4 bytes)
  • Index ID (4 bytes)

This enables efficient filtering: "Find all documents in collection X" requires only prefix-matching bloom filter lookups.

5.8.3 Merge Operators

Merge operators enable atomic read-modify-write operations without read locks:

Counter Merge Operator:

\text{Merge}_{counter}(v_{old}, \Delta) = v_{old} + \Delta

Multiple increments merge during compaction:

\text{Merge}(\text{Merge}(v, \Delta_1), \Delta_2) = v + \Delta_1 + \Delta_2

Clustered Term Index Merge Operator:

For full-text search, posting lists must merge efficiently:

\text{Merge}_{posting}(P_1, P_2) = P_1 \cup P_2

The clustered term index stores multiple terms per key, requiring custom merge logic to maintain sorted order and handle deletions.
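Both operators can be sketched as folds over a base value and a list of pending operands, which is essentially what compaction does when it collapses merge records:

```python
def merge_counter(base, operands):
    """Counter merge: increments are additive and order-independent."""
    return (base or 0) + sum(operands)

def merge_posting(base, operands):
    """Posting-list merge: union the doc-id lists, keeping the result
    sorted and deduplicated."""
    docs = set(base or [])
    for op in operands:
        docs.update(op)
    return sorted(docs)
```

Associativity is the key requirement: compaction may fold any contiguous subsequence of operands first, so the operator must give the same result regardless of grouping.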

5.9 Backup and Recovery

Cognica provides backup and recovery mechanisms built on RocksDB's backup engine.

5.9.1 Backup Architecture

[Backup architecture diagram]

Incremental Backups:

Subsequent backups only copy new SSTables:

\text{Backup Size}_n = \text{New SSTables since Backup}_{n-1}

For append-heavy workloads, incremental backups are dramatically smaller than full backups.

5.9.2 Point-in-Time Recovery

The backup engine maintains multiple backup versions:

Backup ID | Timestamp  | Files | Size
1         | 2024-01-01 | 100   | 10 GB
2         | 2024-01-02 | 15    | 1.5 GB
3         | 2024-01-03 | 20    | 2 GB

Recovery restores to any backup point:

\text{Restore}(backup\_id) \rightarrow \text{Database state at backup time}

5.9.3 Encryption Support

Cognica supports encryption at rest through RocksDB's encrypted environment:

Encryption Flow:

\text{Ciphertext} = E_{key}(\text{Plaintext}) \qquad \text{Plaintext} = D_{key}(\text{Ciphertext})

Backups preserve encryption, requiring the same key for restoration.

5.10 Performance Characteristics

This section summarizes the performance characteristics of Cognica's LSM-tree storage.

5.10.1 Throughput Bounds

Write Throughput:

\text{Max Write Throughput} = \min\left(\frac{\text{Memtable Size}}{\text{Flush Time}}, \frac{\text{Disk Bandwidth}}{\text{Write Amp}}\right)

With 256 MB memtables, 100 ms flush time, 500 MB/s disk, and 40x write amp:

\text{Max Write} = \min(2.56\text{ GB/s}, 12.5\text{ MB/s}) \approx 12.5\text{ MB/s sustained}

Read Throughput:

\text{Max Read Throughput} = \text{Cache Hit Rate} \times \text{Memory Bandwidth} + (1 - \text{Cache Hit Rate}) \times \text{Disk Bandwidth}

With 99% cache hit rate, 100 GB/s memory, 500 MB/s disk:

\text{Max Read} = 0.99 \times 100 + 0.01 \times 0.5 \approx 99\text{ GB/s}

5.10.2 Latency Distribution

Point query latency depends on data location:

Location     | Latency       | Probability
Block Cache  | 1-10 us       | 99% (with good caching)
Memtable     | 10-100 us     | Depends on recency
L0 SSTables  | 100 us - 1 ms | Low (bloom filters)
L1+ SSTables | 1-10 ms       | Very low (bloom filters)

P99 Latency:

P_{99} \approx \text{Disk Read Latency} \times (1 - \text{Bloom Filter Effectiveness})

With 1% bloom filter false positive rate and 1 ms disk latency:

P_{99} \approx 1\text{ ms} \times 0.01 = 10\text{ us}

5.10.3 Space Efficiency

Effective Compression Ratio:

\text{Compression Ratio} = \frac{\text{Logical Data Size}}{\text{Physical Storage}}

With tiered compression (LZ4 for hot, ZSTD for cold):

\text{Effective Ratio} \approx 0.3 \times 2.5 + 0.7 \times 4.0 = 3.55\text{x}

Assuming 30% of data is hot (recent) and 70% is cold (historical).

5.11 Summary

This chapter examined the LSM-tree storage architecture underlying Cognica's persistence layer. Key takeaways:

  1. LSM-trees optimize for write throughput by converting random writes to sequential I/O through the memtable/SSTable hierarchy.

  2. Write amplification is the primary cost, with leveled compaction yielding O(T \cdot L) amplification. Cognica tunes this through compression tiers and careful level sizing.

  3. Read amplification is controlled through bloom filters (Ribbon filters in Cognica), achieving near-optimal single-read performance for point queries.

  4. The multi-level cache hierarchy (block cache, row cache, OS page cache) minimizes disk I/O for hot data.

  5. Transaction support through RocksDB's TransactionDB provides ACID guarantees with configurable isolation levels.

  6. Custom extensions (comparators, merge operators, compaction filters) adapt the generic LSM-tree for unified query processing.

The storage layer provides the foundation upon which Cognica builds document storage, full-text indexes, and vector indexes - topics we explore in the following chapters.

Copyright (c) 2023-2026 Cognica, Inc.