Chapter 1: Introduction to Unified Data Processing
1.1 The Data Paradigm Fragmentation Problem
Modern applications face an increasingly complex data landscape. A typical enterprise system might need to:
- Store structured business records in a relational database for ACID transactions
- Index product descriptions in a full-text search engine for keyword queries
- Compute similarity between items using vector embeddings in a specialized vector database
- Navigate relationships between entities through a graph database
- Cache frequently accessed data in a key-value store
This approach, known as polyglot persistence, emerged from the recognition that different data models excel at different workloads. A relational database optimized for transactions differs fundamentally from a search engine optimized for text retrieval. Each system brings its own query language, consistency model, operational requirements, and failure modes.
1.1.1 The Operational Burden
Consider a product search feature that must:
- Find products matching text keywords (full-text search)
- Filter by category and price range (relational predicates)
- Rank by semantic similarity to user preferences (vector search)
- Traverse supplier relationships (graph queries)
In a polyglot architecture, this single user-facing feature requires orchestrating four separate database systems. The application must:
- Maintain connections to four different systems
- Translate between four query languages
- Merge results with potentially inconsistent identifiers
- Handle partial failures when any system becomes unavailable
- Manage data synchronization across all systems
Each additional system multiplies operational complexity. DevOps teams must master four different backup strategies, monitoring dashboards, scaling procedures, and failure recovery protocols.
1.1.2 The Consistency Challenge
Beyond operational complexity lies a more fundamental problem: consistency. When a product's price changes in PostgreSQL, how quickly does Elasticsearch reflect the update? What happens if the vector database's embedding becomes stale?
Polyglot architectures typically offer only eventual consistency across systems, with no transactional guarantees spanning multiple databases. Applications must implement complex reconciliation logic, handle temporary inconsistencies gracefully, and design around the possibility that different systems hold conflicting views of the same data.
1.1.3 The Impedance Mismatch
Each data paradigm brings its own conceptual model:
| Paradigm | Data Model | Query Model | Result Model |
|---|---|---|---|
| Relational | Tables, Rows | SQL, Joins | Result Sets |
| Document | JSON/BSON | Query DSL | Documents |
| Full-Text | Terms, Postings | Boolean/BM25 | Scored Hits |
| Vector | Embeddings | k-NN | Distances |
| Graph | Nodes, Edges | Traversal | Paths |
Translating between these models introduces impedance mismatch - the conceptual friction of mapping one paradigm's abstractions onto another's. A document database's nested structure doesn't naturally decompose into relational joins. A graph traversal doesn't directly map to vector similarity. Each translation loses information or introduces complexity.
1.2 The Case for Unification
What if a single system could natively support all these paradigms? Not through adapters or plugins, but through a unified foundation that treats relational predicates, text relevance, vector similarity, and graph traversal as variations of the same underlying algebra?
This is the vision of unified data processing: a single database engine where:
- One storage layer manages all data
- One query language expresses all operations
- One optimizer plans across paradigms
- One transaction model ensures consistency
- One operational model simplifies deployment
1.2.1 The Posting List Insight
The key insight enabling unification comes from information retrieval theory. Consider how a full-text search engine finds documents containing a term such as "database": it consults the term's posting list.
A posting list - the set of document IDs containing a given term - is simply a set. Set operations form a Boolean algebra with well-understood properties:
- Intersection (∩): Documents containing BOTH terms
- Union (∪): Documents containing EITHER term
- Complement (¬): Documents NOT containing a term
Now consider a relational filter: "products where price < 100". This also produces a set of qualifying document IDs.
A vector similarity search for the top-k nearest neighbors? Another set of document IDs.
A graph traversal finding all nodes within two hops? Yet another set.
The profound realization: All these seemingly different operations produce the same thing - sets of document identifiers. They can all be represented as posting lists, combined with the same Boolean operations, and optimized with the same algebraic transformations.
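This realization can be made concrete with a small sketch. The code below is purely illustrative (not Cognica's implementation): it assumes four hypothetical per-paradigm predicate results over a tiny corpus, each represented as a set of document IDs, and combines them with ordinary set intersection.

```python
# Illustrative sketch: four different predicate types each yield a set of
# matching document IDs, which then combine with ordinary set operations.

# Hypothetical per-paradigm results over a tiny corpus of doc IDs 1..8.
relational = {1, 2, 3, 5, 7}   # e.g. price < 100
full_text  = {2, 3, 4, 7, 8}   # e.g. contains "wireless"
vector_knn = {2, 3, 6, 7}      # e.g. top-k nearest to a query embedding
graph_hop  = {1, 2, 3, 7}      # e.g. within two hops of a supplier node

# A conjunction across paradigms is just set intersection on posting lists.
matches = relational & full_text & vector_knn & graph_hop
print(sorted(matches))  # doc IDs satisfying all four predicates: [2, 3, 7]
```

Disjunction (`|`) and negation (set difference against the full corpus) follow the same pattern, which is exactly why the Boolean algebra of sets carries over unchanged.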
1.2.2 Unified Architecture
This insight leads to a unified architecture where posting lists serve as the universal abstraction.
A single query can seamlessly combine:
```sql
SELECT p.name, p.price
FROM products p
WHERE p.category = 'electronics'                  -- Relational
  AND MATCH(p.description) AGAINST ('wireless')   -- Full-text
  AND vector_similarity(p.embedding, ?) > 0.8     -- Vector
  AND EXISTS (                                    -- Graph
    SELECT 1 FROM suppliers s
    WHERE s.id = p.supplier_id
      AND s.rating > 4.0
  )
ORDER BY bm25_score(p.description) DESC
LIMIT 10;
```
The query planner recognizes each predicate's paradigm, retrieves the corresponding posting lists, and merges them using optimized set operations - all within a single transaction, with consistent results, through one query interface.
1.3 Cognica Architecture Overview
Cognica implements this unified vision through carefully designed components that work together to process queries across paradigms.
1.3.1 Design Principles
Principle 1: Posting Lists as Universal Currency
Every index, filter, and search operation ultimately produces posting lists. The system maintains a common representation that flows through all processing stages, enabling uniform optimization and execution.
Principle 2: Algebraic Optimization
Because posting lists form a Boolean algebra, the query optimizer can apply algebraic transformations regardless of the original paradigm:
- Predicate pushdown works for relational filters and text queries
- Join reordering applies to relational joins and graph traversals
- Cost-based selection chooses between index scan and sequential scan
Principle 3: Vectorized Execution
Modern CPUs achieve their highest throughput when processing data in batches. Cognica's execution engine processes posting lists in columnar batches, exploiting SIMD instructions and cache locality.
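The batch-at-a-time model can be sketched as follows. This is a deliberate simplification in Python (the batch size and function names are illustrative); the real engine operates on columnar batches in native code, where the per-batch inner loop maps onto SIMD instructions.

```python
# Simplified sketch of batch-at-a-time (vectorized) execution: instead of
# evaluating a predicate one row at a time, an operator consumes a column
# in fixed-size batches, improving cache locality.

BATCH_SIZE = 1024  # illustrative batch size

def batches(column, size=BATCH_SIZE):
    """Yield a column of values as fixed-size slices."""
    for i in range(0, len(column), size):
        yield column[i:i + size]

def filter_lt(column, threshold):
    """Return row indices where value < threshold, one tight loop per batch."""
    out = []
    base = 0
    for batch in batches(column):
        # A SIMD-capable engine would evaluate the whole batch with
        # vector instructions rather than a scalar loop.
        out.extend(base + j for j, v in enumerate(batch) if v < threshold)
        base += len(batch)
    return out

prices = [120.0, 80.0, 99.5, 250.0, 10.0]
print(filter_lt(prices, 100.0))  # row indices with price < 100: [1, 2, 4]
```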
Principle 4: Tiered Compilation
Frequently executed query patterns compile from bytecode interpretation through JIT compilation to native code, achieving performance competitive with hand-written C++ while maintaining the flexibility of a general-purpose query engine.
1.3.2 System Components
Protocol Layer: Cognica speaks multiple protocols natively. The PostgreSQL wire protocol enables compatibility with existing tools like psql, JDBC drivers, and BI platforms. Arrow Flight SQL enables high-throughput analytical queries.
Query Processing Pipeline: SQL queries parse through libpg_query (PostgreSQL's actual parser), ensuring compatibility with PostgreSQL syntax. The semantic analyzer resolves names, checks types, and expands views. The query planner generates logical plans, and the cost-based optimizer selects physical implementations.
Execution Engine: The Cognica Virtual Machine (CVM) executes queries through a register-based bytecode interpreter. Hot paths automatically compile to native code via copy-and-patch JIT compilation. Vectorized operators process data in columnar batches for maximum throughput.
Storage Layer: RocksDB provides the foundational LSM-tree storage with ACID transactions and MVCC. Specialized indexes layer on top: inverted indexes for full-text search, HNSW graphs for vector similarity, secondary indexes for relational queries.
Distribution Layer: The Raft consensus protocol provides distributed consistency for multi-node deployments. All storage operations replicate through the consensus layer, ensuring durability and enabling horizontal scaling.
1.3.3 Query Execution Flow
To illustrate how these components work together, consider a hybrid query that combines text search with relational filtering:
```sql
SELECT title, author, bm25_score(content) AS score
FROM articles
WHERE MATCH(content) AGAINST ('database internals')
  AND published_date > '2024-01-01'
ORDER BY score DESC
LIMIT 10;
```
Step 1: Parsing
The SQL parser produces an Abstract Syntax Tree (AST) representing the query structure. The MATCH...AGAINST clause parses as a full-text search predicate; the date comparison as a relational predicate.
Step 2: Semantic Analysis
The analyzer resolves articles to a collection, validates that content has a full-text index, confirms published_date is a timestamp type, and verifies bm25_score() is a valid scoring function.
Step 3: Logical Planning
The planner produces a logical plan:
```
Limit(10)
  Sort(score DESC)
    Project(title, author, bm25_score(content) AS score)
      Filter(published_date > '2024-01-01')
        FTSSearch(content, 'database internals')
          Scan(articles)
```
Step 4: Optimization
The optimizer recognizes that the FTS search and date filter can execute independently, then intersect their posting lists:
```
Limit(10)
  Sort(score DESC)
    Project(title, author, score)
      PostingListIntersect
        FTSSearch(content, 'database internals')   -> posting list + scores
        IndexScan(published_date > '2024-01-01')   -> posting list
```
Step 5: Physical Planning
The physical planner selects concrete implementations:
- FTS search uses WAND algorithm for efficient top-k retrieval
- Date filter uses secondary index range scan
- Intersection uses sorted merge with score propagation
- Sort uses in-memory heap for small result sets
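The "sorted merge with score propagation" chosen for the intersection can be sketched with the classic two-pointer merge. This is an illustrative sketch, not the engine's actual operator: it assumes the FTS side yields `(doc_id, score)` pairs sorted by doc ID and the index side yields a sorted list of doc IDs.

```python
def intersect_with_scores(scored, plain):
    """Sorted-merge intersection of two posting lists.

    `scored`: (doc_id, score) pairs sorted by doc_id (e.g. BM25-scored FTS hits).
    `plain`:  sorted doc_ids (e.g. from a secondary-index range scan).
    Scores propagate through the intersection unchanged.
    """
    result = []
    i = j = 0
    while i < len(scored) and j < len(plain):
        doc, score = scored[i]
        if doc == plain[j]:        # present in both lists: keep, with score
            result.append((doc, score))
            i += 1
            j += 1
        elif doc < plain[j]:       # advance whichever side is behind
            i += 1
        else:
            j += 1
    return result

fts = [(3, 2.1), (7, 1.8), (9, 0.9), (12, 3.4)]  # scored full-text hits
idx = [2, 3, 9, 10, 12]                          # date-filter posting list
print(intersect_with_scores(fts, idx))           # [(3, 2.1), (9, 0.9), (12, 3.4)]
```

Because both inputs are sorted by doc ID, the merge runs in a single linear pass over the two lists, which is what makes posting-list intersection cheap in practice.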
Step 6: Code Generation
The CVM compiler generates bytecode implementing the physical plan. Register allocation assigns document IDs, scores, and intermediate results to virtual registers.
Step 7: Execution
The bytecode interpreter executes the plan, fetching posting lists from indexes, computing intersections, scoring documents with BM25, and returning the top 10 results.
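The final top-10 selection corresponds to heap-based top-k over the scored intersection. A minimal sketch using Python's standard library (the engine itself implements this in its execution operators):

```python
import heapq

def top_k(scored_docs, k):
    """Select the k highest-scoring (doc_id, score) pairs with a bounded heap,
    avoiding a full sort of the result set. Illustrative sketch only."""
    return heapq.nlargest(k, scored_docs, key=lambda pair: pair[1])

hits = [(3, 2.1), (9, 0.9), (12, 3.4), (15, 1.2), (20, 2.8)]
print(top_k(hits, 3))  # [(12, 3.4), (20, 2.8), (3, 2.1)]
```

For small k and large result sets, maintaining a k-element heap costs O(n log k) rather than the O(n log n) of a full sort, which is why the physical planner prefers it here.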
Step 8: Result Delivery
Results serialize through the PostgreSQL wire protocol back to the client, appearing exactly as they would from a PostgreSQL database.
1.4 Historical Context
Cognica's unified approach builds on decades of database and information retrieval research.
1.4.1 Evolution of Database Systems
1970s - Relational Model: Edgar Codd's relational model established the mathematical foundation for database systems. Relational algebra provided a formal framework for query optimization, proving that different query expressions could produce identical results.
1980s - Query Optimization: System R and INGRES pioneered cost-based query optimization, demonstrating that declarative queries could compile to efficient execution plans through algebraic transformation.
1990s - Object-Relational: As applications grew complex, object-relational databases attempted to bridge the gap between relational storage and object-oriented programming. This era introduced extensible type systems and user-defined functions.
2000s - NoSQL Movement: Web-scale applications drove the NoSQL revolution. Document stores (MongoDB), key-value stores (Redis), and graph databases (Neo4j) optimized for specific access patterns at the cost of query flexibility.
2010s - NewSQL and Convergence: Systems like CockroachDB and TiDB proved that distributed ACID transactions were achievable. Meanwhile, traditional databases began adding JSON support, full-text search, and other features.
2020s - Unified Systems: The current generation aims to eliminate the polyglot complexity entirely. Rather than adding features piecemeal, systems like Cognica rethink the foundational abstractions to enable native multi-paradigm support.
1.4.2 Information Retrieval Foundations
Full-text search engines developed independently from databases, with their own theoretical foundations:
Boolean Retrieval: The earliest IR systems matched Boolean combinations of terms. Documents either matched a query or didn't - no ranking, just set operations on posting lists.
Vector Space Model: Salton's vector space model represented documents and queries as vectors in term space, enabling similarity-based ranking through cosine similarity.
Probabilistic Models: Robertson's probability ranking principle established that documents should rank by their probability of relevance. This led to BM25, still the dominant text ranking function.
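For reference, the commonly stated form of the BM25 ranking function (treated in depth in Chapter 17) scores a document $D$ against a query $Q$ as:

$$
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
$$

where $f(t, D)$ is the frequency of term $t$ in $D$, $|D|$ is the document length, $\mathrm{avgdl}$ is the average document length in the collection, and $k_1$ and $b$ are tuning parameters (commonly $k_1 \approx 1.2$, $b \approx 0.75$).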
Neural Retrieval: Modern neural models encode documents and queries as dense vectors, enabling semantic similarity beyond lexical matching. This drives the current interest in vector databases.
1.4.3 Prior Unification Attempts
Previous attempts at unification typically followed one of two paths:
Extension Approach: Traditional databases added features incrementally. PostgreSQL added tsvector for full-text search, jsonb for documents, and pgvector for embeddings. While functional, these extensions often feel bolted-on, with limited cross-feature optimization.
Federation Approach: Systems like Presto and Trino federate queries across multiple backends. While providing a unified interface, they cannot optimize across data sources or provide cross-source transactions.
Cognica takes a different path: native unification. Rather than extending a relational database or federating separate systems, it builds from a foundation where posting lists are first-class citizens, enabling deep optimization across paradigms.
1.5 What This Book Covers
This book provides a comprehensive treatment of Cognica's design and implementation, suitable for:
- Graduate students studying database systems, information retrieval, or distributed systems
- Database researchers exploring unified query processing
- Senior engineers building or operating data-intensive applications
- Contributors seeking to understand and extend Cognica
1.5.1 Part I: Foundations (Chapters 1-4)
We establish the mathematical framework for unified query processing:
- Chapter 2 formalizes posting lists as a Boolean algebra and defines the type system spanning documents, vectors, terms, and graphs
- Chapter 3 extends the algebra to incorporate graph structures while preserving algebraic properties
- Chapter 4 develops query optimization theory, including cost models, selectivity estimation, and transformation rules
1.5.2 Part II: Storage Engine (Chapters 5-7)
We examine how data persists and indexes organize:
- Chapter 5 details the LSM-tree storage architecture based on RocksDB
- Chapter 6 explains document storage, schema management, and secondary indexes
- Chapter 7 deep-dives into inverted index architecture, including the innovative clustered term index
1.5.3 Part III: Query Processing (Chapters 8-10)
We trace queries from SQL text to executable plans:
- Chapter 8 covers SQL parsing and semantic analysis
- Chapter 9 explains logical planning and optimization
- Chapter 10 details physical planning and execution strategy selection
1.5.4 Part IV: Execution Engine (Chapters 11-15)
We explore the Cognica Virtual Machine in depth:
- Chapter 11 presents CVM architecture: instruction formats, registers, and dispatch
- Chapter 12 details the compilation pipeline from SQL to bytecode
- Chapter 13 explains vectorized execution for batch processing
- Chapter 14 covers copy-and-patch JIT compilation
- Chapter 15 describes zero-copy JOIN implementation
1.5.5 Part V: Similarity Search and Ranking (Chapters 16-20)
We examine text and vector search capabilities:
- Chapter 16 details the text analysis pipeline
- Chapter 17 explains BM25 scoring and its Bayesian extension for calibrated relevance
- Chapter 18 covers vector search with HNSW indexes
- Chapter 19 describes hybrid search architecture combining text and vectors
- Chapter 20 presents query evaluation strategies including WAND and Block-Max WAND
1.5.6 Part VI: Distributed Systems (Chapters 21-22)
We cover distributed operation:
- Chapter 21 explains the Raft consensus protocol implementation
- Chapter 22 details transaction processing and MVCC
1.5.7 Part VII: System Integration (Chapters 23-25)
We examine external interfaces:
- Chapter 23 covers PostgreSQL wire protocol compatibility
- Chapter 24 details external table integration with Parquet, Arrow, and cloud storage
- Chapter 25 describes the multi-protocol service layer
1.5.8 Part VIII: Advanced Topics (Chapters 26-28)
We conclude with advanced subjects:
- Chapter 26 details memory management strategies
- Chapter 27 covers observability and debugging
- Chapter 28 discusses performance engineering
1.5.9 Appendices
Reference materials include:
- Appendix A: Complete CVM opcode reference
- Appendix B: SQL compatibility matrix
- Appendix C: Configuration reference
- Appendix D: API specifications
1.6 Summary
This chapter introduced the challenge of data paradigm fragmentation and the vision of unified data processing. Key points:
- Polyglot persistence creates operational complexity, consistency challenges, and impedance mismatch between data paradigms
- Posting lists provide a universal abstraction - all query predicates ultimately produce sets of document identifiers that combine through Boolean operations
- Cognica implements unified processing through carefully designed components: a multi-protocol service layer, PostgreSQL-compatible SQL processing, a bytecode virtual machine with JIT compilation, and specialized indexes for text and vector search
- Historical context shows that Cognica builds on decades of database and information retrieval research, taking a different path than the extension or federation approaches
The following chapter formalizes these intuitions mathematically, establishing the algebraic foundations that enable cross-paradigm optimization.