Chapter 6: Document Storage and Schema Management
This chapter examines Cognica's document storage layer, which provides flexible schema management atop the LSM-tree foundation established in Chapter 5. We explore how JSON documents are encoded for efficient storage, how schemas define structure and constraints, and how indexes accelerate queries across diverse access patterns.
6.1 The Document Model
Document databases emerged from the recognition that many applications work with semi-structured data that doesn't fit neatly into relational tables. Rather than forcing data into rigid schemas, document databases store self-describing records that can vary in structure.
6.1.1 JSON as Universal Data Format
Cognica adopts JSON (JavaScript Object Notation) as its document format. JSON provides:
Simplicity: Human-readable syntax with just six data types:
- Objects (key-value maps)
- Arrays (ordered sequences)
- Strings
- Numbers
- Booleans
- Null
Universality: Native support in every programming language, HTTP APIs, and configuration systems.
Nestability: Documents can contain nested documents and arrays to arbitrary depth.
Example Document:
{
  "_id": "user_12345",
  "name": "Alice Chen",
  "email": "alice@example.com",
  "profile": {
    "bio": "Database enthusiast",
    "location": {
      "city": "San Francisco",
      "country": "USA"
    }
  },
  "tags": ["developer", "researcher"],
  "created_at": "2024-01-15T10:30:00Z"
}
6.1.2 Document vs Relational Trade-offs
The document model trades normalization for locality:
Relational Model:
Data is normalized across multiple tables, eliminating redundancy but requiring joins for reconstruction.
Document Model:
Related data is embedded within a single document, enabling single-read retrieval at the cost of potential redundancy.
Access Pattern Optimization:
| Pattern | Relational | Document |
|---|---|---|
| Read user with profile | 3+ JOINs | 1 read |
| Update user's city | 1 update | Read-modify-write |
| Find users in city | Index scan | Index scan |
| Aggregate across users | Efficient | Efficient |
Documents excel when data is read together more often than updated independently.
6.1.3 RapidJSON Integration
Cognica uses RapidJSON, a high-performance JSON library, as its in-memory document representation:
class Document : public rapidjson::GenericDocument<
    rapidjson::UTF8<>,
    DocumentAllocator
> {
  // Extended with Cognica-specific operations
};
Performance Characteristics:
| Operation | Complexity | Notes |
|---|---|---|
| Parse JSON string | O(n) | Single pass; in-situ parsing possible |
| Access field by name | O(m) | Linear scan over the object's m members |
| Access array element | O(1) | Direct index |
| Iterate all fields | O(m) | Sequential scan |
| Serialize to string | O(n) | Single pass |
RapidJSON's DOM (Document Object Model) representation stores parsed JSON in memory, enabling random access and modification.
6.1.4 Custom Allocator
Cognica employs a custom memory allocator for document operations:
Benefits:
- Pool allocation: Reduces malloc/free overhead
- Arena semantics: Bulk deallocation when document is destroyed
- Cache locality: Related allocations are contiguous
Allocation Strategy:
Small allocations come from the current arena block; large allocations get dedicated blocks.
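The arena strategy can be sketched in a few lines. This is a simplified illustration, not Cognica's actual allocator; the `Arena` class, its block size, and the omission of alignment handling are all choices made for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Minimal arena: small requests are bump-allocated from the current
// block; everything is freed at once when the arena is destroyed.
// Alignment is ignored for brevity.
class Arena {
 public:
  explicit Arena(size_t block_size = 4096) : block_size_(block_size) {}

  void* allocate(size_t size) {
    // Large requests get a dedicated block.
    if (size > block_size_) {
      blocks_.push_back(std::make_unique<char[]>(size));
      return blocks_.back().get();
    }
    // Start a new block if the current one cannot satisfy the request.
    if (blocks_.empty() || offset_ + size > block_size_) {
      blocks_.push_back(std::make_unique<char[]>(block_size_));
      current_ = blocks_.back().get();
      offset_ = 0;
    }
    void* p = current_ + offset_;
    offset_ += size;
    return p;
  }

  size_t block_count() const { return blocks_.size(); }

 private:
  size_t block_size_;
  std::vector<std::unique_ptr<char[]>> blocks_;
  char* current_ = nullptr;
  size_t offset_ = 0;
};
```

Consecutive small allocations land contiguously in one block (cache locality), and destroying the arena releases every block in one sweep (bulk deallocation).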
6.2 Document Encoding
Storing JSON documents directly would be inefficient. Cognica encodes documents into a compact binary format optimized for storage and retrieval.
6.2.1 Type Encoding
Each value is prefixed with a type marker:
| Type | Code | Description |
|---|---|---|
| Object | 0x01 | Nested document |
| Array | 0x02 | Ordered sequence |
| Null | 0x03 | Null value |
| False | 0x04 | Boolean false |
| True | 0x05 | Boolean true |
| Int64 | 0x06 | 64-bit signed integer |
| UInt64 | 0x07 | 64-bit unsigned integer |
| Double | 0x08 | IEEE 754 double |
| String | 0x09 | UTF-8 string |
Type-Length-Value (TLV) Encoding:
[type: 1 byte][length: varint][value: length bytes]
The length uses a variable-length encoding with continuation bits to minimize space for small values.
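Continuation-bit encoding can be illustrated with a LEB128-style varint codec. This is a common scheme used by many storage formats; Cognica's exact wire format may differ:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Unsigned LEB128-style varint: 7 payload bits per byte, with the
// high bit set on every byte except the last.
std::string encode_varint(uint64_t v) {
  std::string out;
  while (v >= 0x80) {
    out.push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out.push_back(static_cast<char>(v));
  return out;
}

uint64_t decode_varint(const std::string& in) {
  uint64_t v = 0;
  int shift = 0;
  for (unsigned char byte : in) {
    v |= static_cast<uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) break;  // high bit clear: last byte
    shift += 7;
  }
  return v;
}
```

Values below 128 take a single byte, so the short strings and small arrays typical of documents pay minimal length overhead.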
6.2.2 Primitive Encoding
Integers:
Signed integers use sign-flip encoding to preserve sort order:
encoded = uint64(value) XOR 0x8000000000000000
This flips the sign bit of the two's complement representation, so that lexicographic comparison of the encoded (big-endian) bytes yields correct numeric ordering.
Floating-Point Numbers:
IEEE 754 doubles require special handling for sortable encoding:
encoded = NOT bits(d)                       if d is negative (sign bit set)
encoded = bits(d) XOR 0x8000000000000000    otherwise
where bits(d) interprets the 64-bit IEEE 754 representation as an unsigned integer.
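Both sortable transformations can be demonstrated in a short sketch. The helper names (`encode_i64`, `encode_f64`) are illustrative, and big-endian serialization is assumed so that byte-wise comparison matches numeric order:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Big-endian serialization so byte-wise comparison matches
// unsigned numeric comparison.
std::string to_big_endian(uint64_t v) {
  std::string out(8, '\0');
  for (int i = 7; i >= 0; --i) {
    out[i] = static_cast<char>(v & 0xFF);
    v >>= 8;
  }
  return out;
}

// Sign-flip encoding: XOR the sign bit so negative values order
// before positive ones under unsigned comparison.
std::string encode_i64(int64_t v) {
  return to_big_endian(static_cast<uint64_t>(v) ^ 0x8000000000000000ULL);
}

// IEEE 754 doubles: flip all bits of negatives (reversing their
// order), flip only the sign bit of non-negatives.
std::string encode_f64(double d) {
  uint64_t bits;
  std::memcpy(&bits, &d, sizeof(bits));
  if (bits & 0x8000000000000000ULL) {
    bits = ~bits;
  } else {
    bits ^= 0x8000000000000000ULL;
  }
  return to_big_endian(bits);
}
```

`std::string` comparison uses unsigned byte semantics, so the assertions below hold exactly when the encoding is order-preserving.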
Strings:
Strings are encoded with a length prefix followed by the UTF-8 bytes:
[length: varint][UTF-8 bytes]
For key comparison, a null-terminated encoding is used instead:
[UTF-8 bytes][0x00]
6.2.3 Composite Encoding
Objects:
Objects encode as sequences of key-value pairs:
[0x01][pair_1][pair_2]...[pair_n]
Each key-value pair is an encoded key string followed by an encoded value:
[key][value]
Arrays:
Arrays encode as sequences of values:
[0x02][value_1][value_2]...[value_n]
6.2.4 Document Layout
Complete documents include a header with metadata before the encoded body:
[timestamp: 8 bytes][ttl: 4 bytes][flags: 1 byte][encoded document]
Header Fields:
| Field | Size | Purpose |
|---|---|---|
| Timestamp | 8 bytes | Creation/modification time |
| TTL | 4 bytes | Time-to-live in seconds (0 = never expires) |
| Flags | 1 byte | Metadata flags (deleted, migrating, etc.) |
Space Efficiency:
Consider encoding the example user document:
| Component | JSON Size | Encoded Size |
|---|---|---|
| Field names | 89 bytes | 89 bytes |
| String values | 78 bytes | 82 bytes |
| Structural overhead | 45 bytes | 15 bytes |
| Total | 212 bytes | 186 bytes |
Binary encoding typically achieves 10-30% size reduction through eliminated whitespace and compact length encoding.
6.3 Schema Definition
While documents can vary in structure, schemas define expectations and constraints that enable optimization and validation.
6.3.1 Schema Structure
A Cognica schema specifies:
collection: users
workspace: default
primary_key:
  fields: [_id]
  unique: true
secondary_keys:
  - name: email_idx
    fields: [email]
    unique: true
    type: secondary_key
  - name: location_idx
    fields: [profile.location.country, profile.location.city]
    type: secondary_key
  - name: content_idx
    fields: [profile.bio]
    type: full_text_search
comment: "User accounts with profile information"
6.3.2 Schema Components
Primary Key:
Every collection has exactly one primary key that uniquely identifies documents. The primary key maps each document to a unique key value. Primary keys can be:
- Single field: _id
- Composite: (tenant_id, user_id)
- Auto-generated: UUID or sequence
Secondary Keys:
Secondary keys create additional access paths from indexed field values to primary keys. Unlike primary keys, secondary keys can map one key value to a set of documents (for non-unique indexes) and support:
- B-tree indexes: For range queries and sorting
- Full-text indexes: For text search
- Clustered indexes: Storing document data with the index
6.3.3 Schema Builder Pattern
Schemas are constructed programmatically using the builder pattern:
auto schema = SchemaBuilder{}
    .set_workspace_id(workspace_id)
    .set_collection_id(collection_id)
    .set_collection_name("users")
    .set_primary_key({"_id"}, PrimaryKeyOptions{.unique = true})
    .add_secondary_key("email_idx", {"email"}, SecondaryKeyOptions{
        .unique = true,
        .type = IndexType::kSecondaryKey
    })
    .add_secondary_key("content_idx", {"profile.bio"}, SecondaryKeyOptions{
        .type = IndexType::kFullTextSearchIndex
    })
    .set_comment("User accounts")
    .build();
The builder validates constraints during construction:
- Primary key must have at least one field
- Secondary key names must be unique
- Field paths must be valid dot notation
6.3.4 Schema Flexibility
Cognica supports schema-on-read semantics: documents can contain fields not defined in the schema. The schema defines:
- Indexed fields: Fields with associated indexes
- Type hints: Expected types for validation
- Constraints: Uniqueness, nullability
Documents may include additional fields that are stored but not indexed. This enables gradual schema evolution without migration.
6.4 Key Encoding
Keys must be encoded to preserve ordering in the LSM-tree while supporting composite keys and nullable fields.
6.4.1 Primary Key Encoding
Primary keys are encoded with a prefix identifying the collection, followed by the encoded key fields:
[prefix: 14 bytes][encoded primary key fields]
Prefix Structure:
| Component | Bytes | Purpose |
|---|---|---|
| Database Type | 1 | Distinguishes document DB from others |
| Category | 1 | Data category (user data = 2) |
| Workspace ID | 4 | Multi-tenant isolation |
| Collection ID | 4 | Collection identification |
| Index ID | 4 | Primary key index (always 0) |
Field Encoding:
For composite primary keys (field_1, field_2, ...), each field is encoded with its type-specific encoding and the results are concatenated:
[encode(field_1)][encode(field_2)]...[encode(field_n)]
This ensures lexicographic order of the concatenated bytes matches logical ordering by (field_1, field_2, ...).
6.4.2 Secondary Key Encoding
Secondary keys include both the secondary key fields and the primary key (to keep entries unique):
[prefix: 14 bytes][encoded secondary key fields][encoded primary key]
Example:
For index location_idx on (country, city) with primary key _id:
Key: [prefix][country][city][_id]
[14 bytes][var][var][var]
This encoding enables:
- Prefix scans: Find all users in a country
- Range scans: Find users in countries A-M
- Exact lookup: Find user with specific country+city+id
6.4.3 Nullable Field Handling
Nullable fields require special encoding to maintain sort order. Each nullable field is prefixed with a presence byte:
[0x00]                   null
[0x01][encoded value]    non-null
With this scheme, null values sort before all non-null values (or after, depending on configuration).
6.4.4 Sort Order Preservation
The encoding must satisfy:
a < b  ⟺  encode(a) <_lex encode(b)
where <_lex is lexicographic (byte-wise) comparison.
Descending Order:
For descending sorts, the encoding is inverted:
encode_desc(x) = complement(encode(x))
where complement flips all bits. This reverses the sort order while preserving the compare-by-bytes property.
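The descending inversion is small enough to sketch directly (the helper name is illustrative):

```cpp
#include <cassert>
#include <string>

// Descending variant: complement every byte of the ascending encoding.
// Byte-wise comparison of the result reverses the original order, and
// applying the complement twice recovers the original key.
std::string invert_for_descending(const std::string& ascending) {
  std::string out = ascending;
  for (char& c : out) {
    c = static_cast<char>(~static_cast<unsigned char>(c));
  }
  return out;
}
```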
6.5 Index Architecture
Indexes are the primary mechanism for accelerating queries. Cognica supports multiple index types optimized for different access patterns.
6.5.1 Index Type Hierarchy
Index
├── Primary Key Index
├── Secondary Key Index
│   └── Clustered Secondary Index
└── Full-Text Search Index
    └── Clustered Full-Text Search Index
6.5.2 Index Types
| Type | Code | Use Case |
|---|---|---|
| Primary Key | 0 | Unique document identification |
| Secondary Key | 1 | Traditional B-tree index |
| Clustered Secondary | 2 | Secondary index with embedded data |
| Full-Text Search | 3 | Text search with posting lists |
| Clustered FTS | 4 | FTS with embedded document data |
Primary Key Index:
The primary key index stores complete documents:
[prefix][encoded primary key] → [encoded document]
Secondary Key Index:
Secondary indexes store only the mapping from secondary key to primary key:
[prefix][encoded secondary key fields][encoded primary key] → (empty)
Lookups require two steps:
1. Find the primary key via the secondary index
2. Fetch the document via the primary key index
Clustered Secondary Index:
Clustered secondaries embed the document data directly:
[prefix][encoded secondary key fields][encoded primary key] → [encoded document]
This eliminates the second lookup at the cost of storage duplication.
6.5.3 Index Descriptor
The IndexDescriptor manages all indexes for a collection:
class IndexDescriptor {
 public:
  // Operations
  auto get_primary_key() const -> const PrimaryKey&;
  auto get_secondary_key(IndexID id) const -> const SecondaryKey*;
  auto find_by_name(std::string_view name) const -> const SecondaryKey*;
  void add_secondary_key(SecondaryKey&& sk);
  void remove_secondary_key(IndexID id);

 private:
  PrimaryKey primary_key_;
  std::vector<SecondaryKey> secondary_keys_;
  mutable std::shared_mutex mutex_;
};
Thread Safety:
The descriptor uses a shared mutex for concurrent access:
- Multiple readers can access concurrently
- Writers acquire exclusive access
- Index additions/removals are atomic
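The reader/writer pattern behind the descriptor can be sketched with `std::shared_mutex`. The `IndexRegistry` class here is a simplified stand-in for the real descriptor, tracking only index names:

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <vector>

// Many concurrent readers, exclusive writers.
class IndexRegistry {
 public:
  void add(const std::string& name) {
    std::unique_lock lock(mutex_);  // exclusive: blocks all readers
    names_.push_back(name);
  }

  bool contains(const std::string& name) const {
    std::shared_lock lock(mutex_);  // shared: readers run concurrently
    for (const auto& n : names_) {
      if (n == name) return true;
    }
    return false;
  }

  size_t size() const {
    std::shared_lock lock(mutex_);
    return names_.size();
  }

 private:
  mutable std::shared_mutex mutex_;
  std::vector<std::string> names_;
};
```

Because index metadata is read on every query but modified only on DDL operations, the shared lock keeps the common path contention-free.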
6.5.4 Index Statistics
Each index tracks usage statistics for query optimization:
struct IndexStatistics {
  std::atomic<int64_t> accessed;  // Query count
  std::atomic<int64_t> added;     // Insert count
  std::atomic<int64_t> updated;   // Update count
  std::atomic<int64_t> deleted;   // Delete count
  std::atomic<int64_t> merged;    // Merge operation count
  TimePoint accessed_at;  // Last query time
  TimePoint added_at;     // Last insert time
  TimePoint updated_at;   // Last update time
  TimePoint deleted_at;   // Last delete time
  TimePoint merged_at;    // Last merge time
};
Statistics inform:
- Index selection: Prefer frequently-used indexes
- Maintenance scheduling: Identify cold indexes for optimization
- Capacity planning: Track growth rates
6.6 Collection Operations
Collections are the primary interface for document manipulation, providing ACID operations through the transaction layer.
6.6.1 Collection Architecture
A collection bundles a primary key reader/writer, one reader/writer per secondary index, and a query planner that selects an index for each operation.
6.6.2 CRUD Operations
Insert:
Status Collection::insert(const Document& doc) {
  // 1. Extract primary key
  auto pk = extract_primary_key(doc);
  // 2. Check uniqueness
  if (pk_reader_->exists(pk)) {
    return Status::AlreadyExists("Duplicate primary key");
  }
  // 3. Encode document
  auto encoded = encode_document(doc);
  // 4. Write to primary index
  pk_writer_->put(pk, encoded);
  // 5. Update secondary indexes
  for (auto& sk_writer : sk_writers_) {
    auto sk = extract_secondary_key(doc, sk_writer->descriptor());
    sk_writer->put(sk, pk);
  }
  return Status::OK();
}
Find:
Cursor Collection::find(const Document& query) {
  // 1. Analyze query
  auto plan = query_planner_.plan(query);
  // 2. Select best index
  auto index = plan.best_index();
  // 3. Create cursor
  if (index.is_primary_key()) {
    return pk_reader_->scan(plan.key_range());
  } else {
    return sk_readers_[index.id()]->scan(plan.key_range());
  }
}
Update:
Status Collection::update(const Document& filter, const Document& updates) {
  // 1. Find matching documents
  auto cursor = find(filter);
  // 2. Apply updates
  while (cursor.valid()) {
    auto doc = cursor.document();
    auto old_doc = doc;  // keep pre-update state for index maintenance
    // 3. Apply update operators
    apply_updates(doc, updates);
    // 4. Rewrite document
    auto pk = extract_primary_key(doc);
    pk_writer_->put(pk, encode_document(doc));
    // 5. Update secondary indexes if affected fields changed
    update_secondary_indexes(old_doc, doc);
    cursor.next();
  }
  return Status::OK();
}
Delete:
Status Collection::remove(const Document& filter) {
  auto cursor = find(filter);
  while (cursor.valid()) {
    auto doc = cursor.document();
    auto pk = extract_primary_key(doc);
    // 1. Delete from primary index
    pk_writer_->del(pk);
    // 2. Delete from secondary indexes (the encoded key embeds the pk)
    for (auto& sk_writer : sk_writers_) {
      auto sk = extract_secondary_key(doc, sk_writer->descriptor());
      sk_writer->del(sk);
    }
    cursor.next();
  }
  return Status::OK();
}
6.6.3 Batch Operations
For bulk inserts, batch operations amortize overhead:
Status Collection::insert_parallel(const std::vector<Document>& docs) {
  // 1. Partition documents across threads
  auto partitions = partition(docs, thread_count_);
  // 2. Process partitions in parallel
  parallel_for(partitions, [this](auto& partition) {
    auto batch = begin_write_batch();
    for (auto& doc : partition) {
      batch.insert(doc);
    }
    batch.commit();
  });
  return Status::OK();
}
Performance Characteristics:
| Operation | Single | Batch (1000 docs) |
|---|---|---|
| Insert | 100 us | 50 ms (50 us/doc) |
| Index update | 50 us | 25 ms (25 us/doc) |
| Total | 150 us | 75 ms |
| Throughput | 6,600/s | 13,300/s |
Batching doubles throughput by amortizing transaction overhead.
6.6.4 Transaction Support
Collections support ACID transactions:
auto txn = collection.begin_transaction();
try {
  txn.insert(doc1);
  txn.update(filter, updates);
  txn.remove(filter2);
  txn.commit();
} catch (...) {
  txn.rollback();
  throw;  // propagate the error after rolling back
}
Isolation Levels:
| Level | Dirty Read | Non-Repeatable | Phantom |
|---|---|---|---|
| Read Uncommitted | Yes | Yes | Yes |
| Read Committed | No | Yes | Yes |
| Repeatable Read | No | No | Yes |
| Serializable | No | No | No |
Cognica defaults to Snapshot Isolation, which prevents dirty reads and non-repeatable reads (and most phantoms) but permits write-skew anomalies.
6.7 Index Reader and Writer
The index reader/writer abstraction separates query and mutation operations.
6.7.1 Index Reader Interface
class IndexReader {
 public:
  virtual ~IndexReader() = default;
  // Point lookup
  virtual auto get(const Slice& key) -> std::optional<Document> = 0;
  // Existence check
  virtual auto exists(const Slice& key) -> bool = 0;
  // Range scan
  virtual auto scan(const KeyRange& range) -> Cursor = 0;
  // Prefix scan
  virtual auto scan_prefix(const Slice& prefix) -> Cursor = 0;
  // Count
  virtual auto count(const KeyRange& range) -> size_t = 0;
};
6.7.2 Index Writer Interface
class IndexWriter {
 public:
  virtual ~IndexWriter() = default;
  // Insert
  virtual auto put(const Slice& key, const Slice& value) -> Status = 0;
  // Delete
  virtual auto del(const Slice& key) -> Status = 0;
  // Batch operations
  virtual auto put_batch(const std::vector<KV>& kvs) -> Status = 0;
  virtual auto del_batch(const std::vector<Slice>& keys) -> Status = 0;
};
6.7.3 Key Codec
The key codec handles encoding and decoding of index keys:
Primary Key Codec:
struct PrimaryKeyIndexKeyCodec {
  static auto encode(
      const PrimaryKey& pk_desc,
      const Slice& pk
  ) -> std::string {
    std::string key;
    // Add 14-byte prefix
    append_prefix(key, pk_desc.guid());
    // Add encoded primary key fields
    key.append(pk.data(), pk.size());
    return key;
  }

  static auto decode(
      const PrimaryKey& pk_desc,
      const Slice& storage_key
  ) -> Slice {
    // Skip 14-byte prefix
    return storage_key.substr(14);
  }
};
Secondary Key Codec:
struct SecondaryKeyIndexKeyCodec {
  static auto encode(
      const PrimaryKey& pk_desc,
      const SecondaryKey& sk_desc,
      const Slice& pk,
      const Document& doc,
      bool nullable
  ) -> std::string {
    std::string key;
    // Add 14-byte prefix with SK index ID
    append_prefix(key, sk_desc.guid());
    // Add encoded secondary key fields
    for (const auto& field : sk_desc.fields()) {
      auto value = doc.find(field);
      encode_field(key, value, nullable);
    }
    // Append primary key for uniqueness
    key.append(pk.data(), pk.size());
    return key;
  }
};
6.7.4 Index Affinity Score
The query optimizer uses affinity scores to select the best index:
score(I, Q) = Σ_{i=0}^{k-1} 1 / (i + 1)
where k is the length of the longest prefix of index I's fields that all appear in the query's field set Q. Earlier index positions carry higher weight, reflecting prefix selectivity.
Scoring Algorithm:
double Index::compute_affinity_score(const FieldNames& query_fields) const {
  double score = 0.0;
  size_t position = 0;
  for (const auto& field : fields_) {
    if (query_fields.contains(field)) {
      // Higher weight for earlier positions (prefix selectivity)
      score += 1.0 / (position + 1);
    } else {
      // Gap in index prefix reduces usefulness
      break;
    }
    position++;
  }
  return score;
}
6.8 Dot Notation and Nested Documents
Cognica supports dot notation for accessing nested fields, enabling queries and indexes on deeply nested data.
6.8.1 Path Syntax
Dot notation uses periods to separate nested field names:
| Path | Meaning |
|---|---|
| name | Top-level field |
| profile.bio | Nested field |
| profile.location.city | Deeply nested field |
| tags[0] | Array element |
| tags[*] | All array elements |
6.8.2 Path Resolution
class DotNotationSupport {
 public:
  // Find nested member
  auto find_member(const Document& doc, std::string_view path)
      -> std::optional<Value>;
  // Add nested member (creating intermediate objects)
  auto add_member(Document& doc, std::string_view path, Value value)
      -> Status;
  // Check existence
  auto has_member(const Document& doc, std::string_view path)
      -> bool;
  // Remove nested member
  auto remove_member(Document& doc, std::string_view path)
      -> Status;
};
Resolution Algorithm:
find_member(doc, "profile.location.city"):
1. Split path: ["profile", "location", "city"]
2. current = doc
3. For each segment:
   - If current is an object and has segment: current = current[segment]
   - Else: return null
4. Return current
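The resolution algorithm can be sketched over a minimal JSON-like value type. Real documents use RapidJSON values; the `Value`/`Object` types here are stand-ins, and this `find_member` only returns string leaves for brevity:

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <string_view>
#include <variant>

// Minimal JSON-like value: either a string leaf or a nested object.
struct Value;
using Object = std::map<std::string, Value, std::less<>>;
struct Value {
  std::variant<std::string, Object> data;
};

// Walk "a.b.c" one segment at a time, descending into nested objects.
std::optional<std::string> find_member(const Object& doc,
                                       std::string_view path) {
  const Object* current = &doc;
  size_t start = 0;
  while (true) {
    size_t dot = path.find('.', start);
    std::string_view segment = path.substr(start, dot - start);
    auto it = current->find(segment);
    if (it == current->end()) return std::nullopt;  // path not present
    if (dot == std::string_view::npos) {
      // Final segment: expect a leaf string.
      if (auto* s = std::get_if<std::string>(&it->second.data)) return *s;
      return std::nullopt;
    }
    // Intermediate segment: must be a nested object.
    auto* obj = std::get_if<Object>(&it->second.data);
    if (!obj) return std::nullopt;
    current = obj;
    start = dot + 1;
  }
}
```

A missing intermediate segment short-circuits to "not found" rather than failing, matching the algorithm's "Else: return null" step.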
6.8.3 Nested Index Creation
Indexes on nested fields work identically to top-level fields:
secondary_keys:
  - name: city_idx
    fields: [profile.location.city]
    type: secondary_key
The index stores the nested value directly, enabling efficient lookups:
SELECT * FROM users WHERE profile.location.city = 'San Francisco'
Uses city_idx for O(log n) lookup rather than O(n) full scan.
6.8.4 Array Handling
Arrays require special handling for indexing:
Multi-Key Index:
For a document with array field:
{"_id": "1", "tags": ["developer", "researcher"]}
A multi-key index creates one entry per array element, each ending with the primary key:
[prefix]["developer"]["1"]
[prefix]["researcher"]["1"]
Query Semantics:
SELECT * FROM users WHERE tags = 'developer'
Matches any document where tags contains "developer".
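The multi-key expansion itself is mechanical and can be sketched as follows. This is a simplified illustration that uses a readable null separator instead of the binary key codec from Section 6.4; the function name is made up for this example:

```cpp
#include <cassert>
#include <string>
#include <vector>

// One index entry per array element, each entry ending with the
// primary key so that duplicate element values stay unique.
std::vector<std::string> multikey_entries(
    const std::vector<std::string>& tags, const std::string& pk) {
  std::vector<std::string> entries;
  for (const auto& tag : tags) {
    entries.push_back(tag + '\0' + pk);  // [element][separator][pk]
  }
  return entries;
}
```

A lookup for "developer" scans the index prefix for that element value; any document whose array contains it has an entry there.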
6.9 Catalog Management
The catalog stores metadata about collections, indexes, and schemas.
6.9.1 Catalog Structure
The catalog is organized hierarchically: each workspace contains collections, and each collection carries its schema, index descriptors, and statistics.
6.9.2 Catalog Operations
| Operation | Description |
|---|---|
| create_collection | Register new collection with schema |
| drop_collection | Remove collection and all data |
| get_collection | Retrieve collection metadata |
| list_collections | Enumerate workspace collections |
| create_index | Add secondary index |
| drop_index | Remove secondary index |
| get_index | Retrieve index metadata |
6.9.3 Schema Versioning
Schemas evolve over time. Cognica tracks schema versions:
Compatible Changes (no migration needed):
- Adding nullable fields
- Adding secondary indexes
- Adding new collections
Incompatible Changes (require migration):
- Changing primary key fields
- Changing field types
- Removing required fields
6.9.4 Metadata Persistence
Catalog metadata is stored in the system database category:
| Type | Purpose |
|---|---|
| 0x01 | Collection schema |
| 0x02 | Index descriptor |
| 0x03 | Statistics |
| 0x04 | Access control |
6.10 Query Context and Projection
Query context carries execution state through the query pipeline.
6.10.1 Query Context Structure
struct QueryContext {
  // Execution mode
  bool is_single_document;
  bool is_streaming;
  // Field projection
  FieldProjectMap projection;
  // Transaction state
  Transaction* transaction;
  Snapshot* snapshot;
  // Statistics
  QueryStatistics stats;
};
6.10.2 Field Projection
Projections limit which fields are returned, reducing I/O and network transfer:
SELECT name, email FROM users WHERE status = 'active'
Projection Encoding:
struct FieldProjectMap {
  enum Mode { kInclude, kExclude };
  Mode mode;
  std::unordered_set<std::string> fields;

  bool should_include(std::string_view field) const {
    bool in_set = fields.count(std::string(field)) > 0;
    return (mode == kInclude) ? in_set : !in_set;
  }
};
Projection Optimization:
For queries touching only indexed fields, the answer can be produced from the index alone (a covering index). Covering queries avoid the primary key lookup entirely.
6.10.3 Query Statistics
Each query collects execution statistics:
struct QueryStatistics {
  size_t documents_scanned;
  size_t documents_returned;
  size_t index_keys_examined;
  size_t bytes_read;
  Duration parse_time;
  Duration plan_time;
  Duration execution_time;
  std::string selected_index;
};
Statistics enable:
- Query debugging: Identify slow queries
- Index tuning: Find missing indexes
- Capacity planning: Predict resource usage
6.11 Summary
This chapter explored Cognica's document storage layer, from JSON representation through binary encoding to index management. Key takeaways:
- JSON documents provide flexible schema with nested structure, encoded efficiently in binary format for storage.
- Key encoding preserves sort order for composite keys, enabling efficient range scans in the LSM-tree.
- Multiple index types (primary, secondary, full-text, clustered) optimize for different access patterns.
- Schema management balances flexibility (schema-on-read) with optimization (indexed fields, constraints).
- Collection operations provide ACID guarantees through the transaction layer, with batch optimization for bulk workloads.
- Dot notation enables seamless access to nested fields, with multi-key indexes for arrays.
- Catalog management tracks metadata with support for schema evolution.
The document layer provides the structured data interface that applications interact with, while the next chapter explores how full-text search indexes enable efficient text queries across document collections.