Chapter 1: Introduction to Unified Data Processing
1.1 The Data Paradigm Fragmentation Problem
Modern applications face an increasingly complex data landscape. A typical enterprise system might need to:
- Store structured business records in a relational database for ACID transactions
- Index product descriptions in a full-text search engine for keyword queries
- Compute similarity between items using vector embeddings in a specialized vector database
- Navigate relationships between entities through a graph database
- Cache frequently accessed data in a key-value store
This approach, known as polyglot persistence, emerged from the recognition that different data models excel at different workloads. A relational database optimized for transactions differs fundamentally from a search engine optimized for text retrieval. Each system brings its own query language, consistency model, operational requirements, and failure modes.
1.1.1 The Operational Burden
Consider a product search feature that must:
- Find products matching text keywords (full-text search)
- Filter by category and price range (relational predicates)
- Rank by semantic similarity to user preferences (vector search)
- Traverse supplier relationships (graph queries)
In a polyglot architecture, this single user-facing feature requires orchestrating four separate database systems. The application must:
- Maintain connections to four different systems
- Translate between four query languages
- Merge results with potentially inconsistent identifiers
- Handle partial failures when any system becomes unavailable
- Manage data synchronization across all systems
Each additional system multiplies operational complexity. DevOps teams must master four different backup strategies, monitoring dashboards, scaling procedures, and failure recovery protocols.
1.1.2 The Consistency Challenge
Beyond operational complexity lies a more fundamental problem: consistency. When a product's price changes in PostgreSQL, how quickly does Elasticsearch reflect the update? What happens if the vector database's embedding becomes stale?
Polyglot architectures typically offer only eventual consistency across systems, with no transactional guarantees spanning multiple databases. Applications must implement complex reconciliation logic, handle temporary inconsistencies gracefully, and design around the possibility that different systems hold conflicting views of the same data.
1.1.3 The Impedance Mismatch
Each data paradigm brings its own conceptual model:
| Paradigm | Data Model | Query Model | Result Model |
|---|---|---|---|
| Relational | Tables, Rows | SQL, Joins | Result Sets |
| Document | JSON/BSON | Query DSL | Documents |
| Full-Text | Terms, Postings | Boolean/BM25 | Scored Hits |
| Vector | Embeddings | k-NN | Distances |
| Graph | Nodes, Edges | Traversal | Paths |
Translating between these models introduces impedance mismatch - the conceptual friction of mapping one paradigm's abstractions onto another's. A document database's nested structure doesn't naturally decompose into relational joins. A graph traversal doesn't directly map to vector similarity. Each translation loses information or introduces complexity.
1.2 The Case for Unification
What if a single system could natively support all these paradigms? Not through adapters or plugins, but through a unified foundation that treats relational predicates, text relevance, vector similarity, and graph traversal as variations of the same underlying algebra?
This is the vision of unified data processing: a single database engine where:
- One storage layer manages all data
- One query language expresses all operations
- One optimizer plans across paradigms
- One transaction model ensures consistency
- One operational model simplifies deployment
1.2.1 The Posting List Insight
The key insight enabling unification comes from information retrieval theory. Consider how a full-text search engine finds documents containing a term such as "database": it consults the term's posting list.
A posting list - the set of document IDs containing a given term - is simply a set. Set operations form a Boolean algebra with well-understood properties:
- Intersection (∩): Documents containing BOTH terms
- Union (∪): Documents containing EITHER term
- Complement (¬): Documents NOT containing a term
Now consider a relational filter: "products where price < 100". This also produces a set of qualifying document IDs.
A vector similarity search for the top-k nearest neighbors? Another set of document IDs.
A graph traversal finding all nodes within two hops? Yet another set.
The profound realization: All these seemingly different operations produce the same thing - sets of document identifiers. They can all be represented as posting lists, combined with the same Boolean operations, and optimized with the same algebraic transformations.
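This realization can be made concrete with a small sketch. The code below is purely illustrative (not Cognica's implementation): it assumes four hypothetical per-paradigm predicate results over a tiny corpus, each represented as a set of document IDs, and combines them with ordinary set intersection.

```python
# Illustrative sketch: four different predicate types each yield a set of
# matching document IDs, which then combine with ordinary set operations.

# Hypothetical per-paradigm results over a tiny corpus of doc IDs 1..8.
relational = {1, 2, 3, 5, 7}   # e.g. price < 100
full_text  = {2, 3, 4, 7, 8}   # e.g. contains "wireless"
vector_knn = {2, 3, 6, 7}      # e.g. top-k nearest to a query embedding
graph_hop  = {1, 2, 3, 7}      # e.g. within two hops of a supplier node

# A conjunction across paradigms is just set intersection on posting lists.
matches = relational & full_text & vector_knn & graph_hop
print(sorted(matches))  # doc IDs satisfying all four predicates: [2, 3, 7]
```

Disjunction (`|`) and negation (set difference against the full corpus) follow the same pattern, which is exactly why the Boolean algebra of sets carries over unchanged.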
1.2.2 Unified Architecture
This insight leads to a unified architecture where posting lists serve as the universal abstraction.
A single query can seamlessly combine:
```sql
SELECT p.name, p.price
FROM products p
WHERE p.category = 'electronics'                  -- Relational
  AND MATCH(p.description) AGAINST ('wireless')   -- Full-text
  AND vector_similarity(p.embedding, ?) > 0.8     -- Vector
  AND EXISTS (                                    -- Graph
    SELECT 1 FROM suppliers s
    WHERE s.id = p.supplier_id
      AND s.rating > 4.0
  )
ORDER BY bm25_score(p.description) DESC
LIMIT 10;
```
The query planner recognizes each predicate's paradigm, retrieves the corresponding posting lists, and merges them using optimized set operations - all within a single transaction, with consistent results, through one query interface.
1.3 Cognica Architecture Overview
Cognica implements this unified vision through carefully designed components that work together to process queries across paradigms.
1.3.1 Design Principles
Principle 1: Posting Lists as Universal Currency
Every index, filter, and search operation ultimately produces posting lists. The system maintains a common representation that flows through all processing stages, enabling uniform optimization and execution.
Principle 2: Algebraic Optimization
Because posting lists form a Boolean algebra, the query optimizer can apply algebraic transformations regardless of the original paradigm:
- Predicate pushdown works for relational filters and text queries
- Join reordering applies to relational joins and graph traversals
- Cost-based selection chooses between index scan and sequential scan
Principle 3: Vectorized Execution
Modern CPUs achieve their highest throughput when processing data in batches. Cognica's execution engine processes posting lists in columnar batches, exploiting SIMD instructions and cache locality.
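The batch-at-a-time model can be sketched as follows. This is a deliberate simplification in Python (the batch size and function names are illustrative); the real engine operates on columnar batches in native code, where the per-batch inner loop maps onto SIMD instructions.

```python
# Simplified sketch of batch-at-a-time (vectorized) execution: instead of
# evaluating a predicate one row at a time, an operator consumes a column
# in fixed-size batches, improving cache locality.

BATCH_SIZE = 1024  # illustrative batch size

def batches(column, size=BATCH_SIZE):
    """Yield a column of values as fixed-size slices."""
    for i in range(0, len(column), size):
        yield column[i:i + size]

def filter_lt(column, threshold):
    """Return row indices where value < threshold, one tight loop per batch."""
    out = []
    base = 0
    for batch in batches(column):
        # A SIMD-capable engine would evaluate the whole batch with
        # vector instructions rather than a scalar loop.
        out.extend(base + j for j, v in enumerate(batch) if v < threshold)
        base += len(batch)
    return out

prices = [120.0, 80.0, 99.5, 250.0, 10.0]
print(filter_lt(prices, 100.0))  # row indices with price < 100: [1, 2, 4]
```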
Principle 4: Tiered Compilation
Frequently executed query patterns compile from bytecode interpretation through JIT compilation to native code, achieving performance competitive with hand-written C++ while maintaining the flexibility of a general-purpose query engine.
1.3.2 System Components
Protocol Layer: Cognica speaks multiple protocols natively. The PostgreSQL wire protocol enables compatibility with existing tools like psql, JDBC drivers, and BI platforms. Arrow Flight SQL enables high-throughput analytical queries.
Query Processing Pipeline: SQL queries parse through libpg_query (PostgreSQL's actual parser), ensuring compatibility with PostgreSQL syntax. The semantic analyzer resolves names, checks types, and expands views. The query planner generates logical plans, and the cost-based optimizer selects physical implementations.
Execution Engine: The Cognica Virtual Machine (CVM) executes queries through a register-based bytecode interpreter. Hot paths automatically compile to native code via copy-and-patch JIT compilation. Vectorized operators process data in columnar batches for maximum throughput.
Storage Layer: RocksDB provides the foundational LSM-tree storage with ACID transactions and MVCC. Specialized indexes layer on top: inverted indexes for full-text search, HNSW graphs for vector similarity, secondary indexes for relational queries.
Distribution Layer: The Raft consensus protocol provides distributed consistency for multi-node deployments. All storage operations replicate through the consensus layer, ensuring durability and enabling horizontal scaling.
1.3.3 Query Execution Flow
To illustrate how these components work together, consider a hybrid query that combines text search with relational filtering:
```sql
SELECT title, author, bm25_score(content) AS score
FROM articles
WHERE MATCH(content) AGAINST ('database internals')
  AND published_date > '2024-01-01'
ORDER BY score DESC
LIMIT 10;
```
Step 1: Parsing
The SQL parser produces an Abstract Syntax Tree (AST) representing the query structure. The MATCH...AGAINST clause parses as a full-text search predicate; the date comparison as a relational predicate.
Step 2: Semantic Analysis
The analyzer resolves articles to a collection, validates that content has a full-text index, confirms published_date is a timestamp type, and verifies bm25_score() is a valid scoring function.
Step 3: Logical Planning
The planner produces a logical plan:
```
Limit(10)
  Sort(score DESC)
    Project(title, author, bm25_score(content) AS score)
      Filter(published_date > '2024-01-01')
        FTSSearch(content, 'database internals')
          Scan(articles)
```
Step 4: Optimization
The optimizer recognizes that the FTS search and date filter can execute independently, then intersect their posting lists:
```
Limit(10)
  Sort(score DESC)
    Project(title, author, score)
      PostingListIntersect
        FTSSearch(content, 'database internals')   -> posting list + scores
        IndexScan(published_date > '2024-01-01')   -> posting list
```
Step 5: Physical Planning
The physical planner selects concrete implementations:
- FTS search uses WAND algorithm for efficient top-k retrieval
- Date filter uses secondary index range scan
- Intersection uses sorted merge with score propagation
- Sort uses in-memory heap for small result sets
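The "sorted merge with score propagation" chosen for the intersection can be sketched with the classic two-pointer merge. This is an illustrative sketch, not the engine's actual operator: it assumes the FTS side yields `(doc_id, score)` pairs sorted by doc ID and the index side yields a sorted list of doc IDs.

```python
def intersect_with_scores(scored, plain):
    """Sorted-merge intersection of two posting lists.

    `scored`: (doc_id, score) pairs sorted by doc_id (e.g. BM25-scored FTS hits).
    `plain`:  sorted doc_ids (e.g. from a secondary-index range scan).
    Scores propagate through the intersection unchanged.
    """
    result = []
    i = j = 0
    while i < len(scored) and j < len(plain):
        doc, score = scored[i]
        if doc == plain[j]:        # present in both lists: keep, with score
            result.append((doc, score))
            i += 1
            j += 1
        elif doc < plain[j]:       # advance whichever side is behind
            i += 1
        else:
            j += 1
    return result

fts = [(3, 2.1), (7, 1.8), (9, 0.9), (12, 3.4)]  # scored full-text hits
idx = [2, 3, 9, 10, 12]                          # date-filter posting list
print(intersect_with_scores(fts, idx))           # [(3, 2.1), (9, 0.9), (12, 3.4)]
```

Because both inputs are sorted by doc ID, the merge runs in a single linear pass over the two lists, which is what makes posting-list intersection cheap in practice.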
Step 6: Code Generation
The CVM compiler generates bytecode implementing the physical plan. Register allocation assigns document IDs, scores, and intermediate results to virtual registers.
Step 7: Execution
The bytecode interpreter executes the plan, fetching posting lists from indexes, computing intersections, scoring documents with BM25, and returning the top 10 results.
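The final top-10 selection corresponds to heap-based top-k over the scored intersection. A minimal sketch using Python's standard library (the engine itself implements this in its execution operators):

```python
import heapq

def top_k(scored_docs, k):
    """Select the k highest-scoring (doc_id, score) pairs with a bounded heap,
    avoiding a full sort of the result set. Illustrative sketch only."""
    return heapq.nlargest(k, scored_docs, key=lambda pair: pair[1])

hits = [(3, 2.1), (9, 0.9), (12, 3.4), (15, 1.2), (20, 2.8)]
print(top_k(hits, 3))  # [(12, 3.4), (20, 2.8), (3, 2.1)]
```

For small k and large result sets, maintaining a k-element heap costs O(n log k) rather than the O(n log n) of a full sort, which is why the physical planner prefers it here.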
Step 8: Result Delivery
Results serialize through the PostgreSQL wire protocol back to the client, appearing exactly as they would from a PostgreSQL database.
1.4 Historical Context
Cognica's unified approach builds on decades of database and information retrieval research.
1.4.1 Evolution of Database Systems
1970s - Relational Model: Edgar Codd's relational model established the mathematical foundation for database systems. Relational algebra provided a formal framework for query optimization, proving that different query expressions could produce identical results.
1980s - Query Optimization: System R and INGRES pioneered cost-based query optimization, demonstrating that declarative queries could compile to efficient execution plans through algebraic transformation.
1990s - Object-Relational: As applications grew complex, object-relational databases attempted to bridge the gap between relational storage and object-oriented programming. This era introduced extensible type systems and user-defined functions.
2000s - NoSQL Movement: Web-scale applications drove the NoSQL revolution. Document stores (MongoDB), key-value stores (Redis), and graph databases (Neo4j) optimized for specific access patterns at the cost of query flexibility.
2010s - NewSQL and Convergence: Systems like CockroachDB and TiDB proved that distributed ACID transactions were achievable. Meanwhile, traditional databases began adding JSON support, full-text search, and other features.
2020s - Unified Systems: The current generation aims to eliminate the polyglot complexity entirely. Rather than adding features piecemeal, systems like Cognica rethink the foundational abstractions to enable native multi-paradigm support.
1.4.2 Information Retrieval Foundations
Full-text search engines developed independently from databases, with their own theoretical foundations:
Boolean Retrieval: The earliest IR systems matched Boolean combinations of terms. Documents either matched a query or didn't - no ranking, just set operations on posting lists.
Vector Space Model: Salton's vector space model represented documents and queries as vectors in term space, enabling similarity-based ranking through cosine similarity.
Probabilistic Models: Robertson's probability ranking principle established that documents should rank by their probability of relevance. This led to BM25, still the dominant text ranking function.
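For reference, the commonly stated form of the BM25 ranking function (treated in depth in Chapter 17) scores a document $D$ against a query $Q$ as:

$$
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
$$

where $f(t, D)$ is the frequency of term $t$ in $D$, $|D|$ is the document length, $\mathrm{avgdl}$ is the average document length in the collection, and $k_1$ and $b$ are tuning parameters (commonly $k_1 \approx 1.2$, $b \approx 0.75$).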
Neural Retrieval: Modern neural models encode documents and queries as dense vectors, enabling semantic similarity beyond lexical matching. This drives the current interest in vector databases.
1.4.3 Prior Unification Attempts
Previous attempts at unification typically followed one of two paths:
Extension Approach: Traditional databases added features incrementally. PostgreSQL added tsvector for full-text search, jsonb for documents, and pgvector for embeddings. While functional, these extensions often feel bolted-on, with limited cross-feature optimization.
Federation Approach: Systems like Presto and Trino federate queries across multiple backends. While providing a unified interface, they cannot optimize across data sources or provide cross-source transactions.
Cognica takes a different path: native unification. Rather than extending a relational database or federating separate systems, it builds from a foundation where posting lists are first-class citizens, enabling deep optimization across paradigms.
1.5 What This Book Covers
This book provides a comprehensive treatment of Cognica's design and implementation, suitable for:
- Graduate students studying database systems, information retrieval, or distributed systems
- Database researchers exploring unified query processing
- Senior engineers building or operating data-intensive applications
- Contributors seeking to understand and extend Cognica
1.5.1 Part I: Foundations (Chapters 1-4)
We establish the mathematical framework for unified query processing:
- Chapter 2 formalizes posting lists as a Boolean algebra and defines the type system spanning documents, vectors, terms, and graphs
- Chapter 3 extends the algebra to incorporate graph structures while preserving algebraic properties
- Chapter 4 develops query optimization theory, including cost models, selectivity estimation, and transformation rules
1.5.2 Part II: Storage Engine (Chapters 5-7)
We examine how data persists and indexes organize:
- Chapter 5 details the LSM-tree storage architecture based on RocksDB
- Chapter 6 explains document storage, schema management, and secondary indexes
- Chapter 7 deep-dives into inverted index architecture, including the innovative clustered term index
1.5.3 Part III: Query Processing (Chapters 8-10)
We trace queries from SQL text to executable plans:
- Chapter 8 covers SQL parsing and semantic analysis
- Chapter 9 explains logical planning and optimization
- Chapter 10 details physical planning and execution strategy selection
1.5.4 Part IV: Execution Engine (Chapters 11-15)
We explore the Cognica Virtual Machine in depth:
- Chapter 11 presents CVM architecture: instruction formats, registers, and dispatch
- Chapter 12 details the compilation pipeline from SQL to bytecode
- Chapter 13 explains vectorized execution for batch processing
- Chapter 14 covers copy-and-patch JIT compilation
- Chapter 15 describes zero-copy JOIN implementation
1.5.5 Part V: Similarity Search and Ranking (Chapters 16-20)
We examine text and vector search capabilities:
- Chapter 16 details the text analysis pipeline
- Chapter 17 explains BM25 scoring and its Bayesian extension for calibrated relevance
- Chapter 18 covers vector search with HNSW indexes
- Chapter 19 describes hybrid search architecture combining text and vectors
- Chapter 20 presents query evaluation strategies including WAND and Block-Max WAND
1.5.6 Part VI: Distributed Systems (Chapters 21-22)
We cover distributed operation:
- Chapter 21 explains the Raft consensus protocol implementation
- Chapter 22 details transaction processing and MVCC
1.5.7 Part VII: System Integration (Chapters 23-25)
We examine external interfaces:
- Chapter 23 covers PostgreSQL wire protocol compatibility
- Chapter 24 details external table integration with Parquet, Arrow, and cloud storage
- Chapter 25 describes the multi-protocol service layer
1.5.8 Part VIII: Advanced Topics (Chapters 26-28)
We conclude with advanced subjects:
- Chapter 26 details memory management strategies
- Chapter 27 covers observability and debugging
- Chapter 28 discusses performance engineering
1.5.9 Appendices
Reference materials include:
- Appendix A: Complete CVM opcode reference
- Appendix B: SQL compatibility matrix
- Appendix C: Configuration reference
- Appendix D: API specifications
1.6 Summary
This chapter introduced the challenge of data paradigm fragmentation and the vision of unified data processing. Key points:
- Polyglot persistence creates operational complexity, consistency challenges, and impedance mismatch between data paradigms
- Posting lists provide a universal abstraction - all query predicates ultimately produce sets of document identifiers that combine through Boolean operations
- Cognica implements unified processing through carefully designed components: a multi-protocol service layer, PostgreSQL-compatible SQL processing, a bytecode virtual machine with JIT compilation, and specialized indexes for text and vector search
- Historical context shows that Cognica builds on decades of database and information retrieval research, taking a different path than the extension or federation approaches
The following chapter formalizes these intuitions mathematically, establishing the algebraic foundations that enable cross-paradigm optimization.