Cognica Database Internals

A Unified Data Processing System

A comprehensive academic textbook on database engine architecture, query processing, and distributed systems.

This book provides an in-depth exploration of the Cognica database engine, covering theoretical foundations, implementation details, and practical engineering considerations. From the unified query algebra that bridges relational, document, and full-text search paradigms to the bytecode virtual machine that executes queries, each chapter combines rigorous theory with real-world implementation insights.

Author

Jaepil Jeong (jaepil@cognica.io)

Part I: Foundations

Establishing the theoretical framework for unified data processing

Chapter	Title	Topics
1	Introduction to Unified Data Processing	Data paradigm fragmentation, unified algebra motivation, system architecture overview
2	Mathematical Foundations of Query Algebras	Set theory, Boolean algebra, posting lists, algebraic properties
3	Extending the Algebra to Graph Structures	Graph type system, path expressions, pattern matching, traversal algebra
4	Query Optimization Theory	Cost models, plan enumeration, join ordering, cardinality estimation

Part II: Storage Engine

Building the persistent foundation for data management

Chapter	Title	Topics
5	LSM-Tree Storage Architecture	Memtables, compaction strategies, write amplification, RocksDB integration
6	Document Storage and Schema Management	Document model, schema inference, flexible typing, collection management
7	Inverted Index Architecture	Posting lists, skip lists, term dictionaries, clustered term index

Part III: Query Processing

Transforming SQL into executable plans

Chapter	Title	Topics
8	SQL Parser and Semantic Analysis	libpg_query, AST construction, name resolution, type checking
9	Logical Planning and Optimization	Logical operators, equivalence rules, predicate pushdown, join reordering
10	Physical Planning and Execution Strategies	Physical operators, access path selection, parallel execution, plan caching

Part IV: Graph Query Processing

From property graphs to Cypher-over-SQL

Chapter	Title	Topics
11	Graph Storage and Operations	Property graph model, dual-collection storage, traversal algorithms, adjacency cache
12	Cypher Query Language	Cypher lexer/parser, AST, Cypher-to-SQL rewriting, recursive CTE path translation

Part V: Execution Engine

The heart of query evaluation

Chapter	Title	Topics
13	CVM Architecture	Bytecode VM design, register allocation, opcode dispatch, stack management
14	CVM Compilation Pipeline	IR generation, lowering passes, optimization phases, code generation
15	Vectorized Execution	SIMD operations, batch processing, columnar evaluation, filter pushdown
16	Copy-and-Patch JIT Compilation	Stencil-based JIT, runtime code generation, hot path optimization
17	Zero-Copy JOIN Implementation	Composite rows, reference semantics, memory efficiency, JOIN algorithms

Part VI: Similarity Search and Ranking

Bridging structured and unstructured data retrieval

Chapter	Title	Topics
18	Text Analysis Pipeline	Tokenization, normalization, stemming, stopwords, ICU integration
19	BM25 Scoring	TF-IDF, BM25 formula, IDF computation, term frequency saturation, numerical stability
20	Bayesian BM25 and Probabilistic Calibration	Sigmoid likelihood, composite prior, base rate prior, three-term decomposition, WAND compatibility
21	Vector Search and HNSW Index	Dense embeddings, ANN search, HNSW algorithm, distance metrics, quantization
22	Vector Score Calibration	Likelihood ratio calibration, cross-modal estimation, index-aware statistics, unified fusion
23	From Bayesian Inference to Neural Computation	Conjunction shrinkage, log-odds framework, neural emergence, activation functions, exact pruning
24	Hybrid Search Architecture	Score fusion, log-odds conjunction, multi-stage retrieval, query composition
25	Query Evaluation Strategies (WAND/BMW)	Top-K algorithms, early termination, block-max optimization, neural pruning

Part VII: Distributed Systems

Scaling beyond a single node

Chapter	Title	Topics
26	Raft Consensus Protocol	Leader election, log replication, safety guarantees, membership changes
27	Transaction Processing	ACID properties, isolation levels, MVCC, SSI, deadlock detection

Part VIII: System Integration

Connecting to the outside world

Chapter	Title	Topics
28	PostgreSQL Wire Protocol	Message formats, authentication, extended query protocol, COPY
29	External Table Integration	Foreign data wrappers, Arrow Flight SQL, predicate pushdown, federation
30	Multi-Protocol Service Layer	HTTP REST, Flight SQL, protocol multiplexing

Part IX: Advanced Topics

Engineering for production excellence

Chapter	Title	Topics
31	Memory Management	Arena allocation, memory pools, cache hierarchies, OOM handling
32	Observability and Debugging	Metrics, logging, tracing, profiling, EXPLAIN analysis
33	Performance Engineering	Benchmarking, bottleneck analysis, tuning strategies, capacity planning
34	Context-Isolated Architecture	ServerContext, singleton removal, bridge patterns, multi-tenancy foundations

Appendices

Reference materials for implementation and deployment

Appendix	Title	Contents
A	CVM Opcode Reference	Complete opcode listing, instruction formats, type system, built-in functions
B	SQL Compatibility Reference	Supported statements, data types, operators, functions, PostgreSQL compatibility
C	Configuration Reference	All configuration options with types, defaults, and tuning guidelines
D	API Reference	PostgreSQL protocol, Flight SQL, HTTP REST endpoints

Key Innovations

This book documents several key innovations in the Cognica database engine:

Unified Query Algebra

A mathematical framework that treats posting lists as the universal abstraction, enabling seamless queries across relational, document, full-text search, and graph paradigms.
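
As a rough illustration of the posting-list idea, the Python sketch below treats every predicate, whatever its paradigm, as a sorted list of document IDs and combines those lists with Boolean operators. The function names and sample data are hypothetical and are not taken from the Cognica codebase; the book's algebra is considerably richer than this.

```python
from typing import List


def intersect(a: List[int], b: List[int]) -> List[int]:
    """AND: merge-intersect two sorted doc-ID lists in O(len(a) + len(b))."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a: List[int], b: List[int]) -> List[int]:
    """OR: all doc IDs present in either list, sorted and de-duplicated."""
    return sorted(set(a) | set(b))


def difference(a: List[int], b: List[int]) -> List[int]:
    """AND NOT: doc IDs in a that do not appear in b (order of a is kept)."""
    excluded = set(b)
    return [doc_id for doc_id in a if doc_id not in excluded]


# Hypothetical posting lists, one per predicate, from three different paradigms.
term_database = [1, 3, 5, 8]  # full-text: body contains "database"
year_ge_2020 = [3, 4, 5, 9]   # relational: year >= 2020
tag_draft = [5]               # document: tags array contains "draft"

# body MATCH 'database' AND year >= 2020 AND NOT tags CONTAINS 'draft'
result = difference(intersect(term_database, year_ge_2020), tag_draft)
print(result)
```

Because every predicate yields the same kind of object, the combining operators do not need to know which paradigm produced their inputs.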

Cypher-over-SQL Graph Processing

A parse-time Cypher-to-SQL rewriting engine that translates graph pattern matching into standard SQL subquery pipelines, enabling graph queries to benefit from the full SQL optimizer without a separate graph execution engine.

Clustered Term Index

A novel inverted index organization that reduces key count by 62,500x compared to traditional term-per-key approaches, dramatically improving write performance.

Cognica Virtual Machine (CVM)

A 256+ opcode bytecode interpreter with computed-goto dispatch, achieving near-native performance for interpreted query execution.

Copy-and-Patch JIT

A stencil-based JIT compilation technique that provides a 2-5x speedup for hot query paths with minimal compilation overhead.

Zero-Copy JOINs

Composite row representation that eliminates data copying during JOIN operations, reducing memory pressure and improving cache efficiency.

Bayesian BM25 and Probabilistic Calibration

A three-term posterior decomposition that transforms unbounded BM25 scores into calibrated probabilities, enabling principled multi-signal fusion with 68-77% calibration error reduction in unsupervised settings.
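
To make the idea of turning an unbounded score into a probability concrete, here is a generic Python sketch that passes a BM25 score through a sigmoid and applies Bayes' rule with a prior. It does not reproduce the book's three-term decomposition: the parameter names `alpha`, `beta`, and `prior`, and the two-term form itself, are assumptions made for illustration only.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def calibrate(score: float, alpha: float, beta: float, prior: float) -> float:
    """Map an unbounded score to a probability in (0, 1).

    alpha (steepness), beta (midpoint), and prior are hypothetical
    parameters; the book's actual parameterization may differ.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    likelihood = sigmoid(alpha * (score - beta))
    # Bayes' rule for a binary relevance variable.
    numerator = likelihood * prior
    return numerator / (numerator + (1.0 - likelihood) * (1.0 - prior))


# A score at the midpoint with an uninformative prior maps to 0.5.
print(calibrate(score=2.0, alpha=1.0, beta=2.0, prior=0.5))
```

In log-odds form the same update is additive: the log-odds of the result equal `alpha * (score - beta)` plus the log-odds of the prior, which is what makes further terms easy to add.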

Vector Score Calibration

Likelihood ratio calibration that transforms vector similarity scores into relevance probabilities using distributional statistics extracted from ANN indexes at zero additional cost.

Neural Emergence from Bayesian Inference

The discovery that combining calibrated probability signals through log-odds conjunction analytically produces the structure of a feedforward neural network, reversing the conventional direction of explanation in neural network theory.
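
The structural resemblance can be seen in a small Python sketch. It uses the textbook naive-Bayes combination of conditionally independent signals that share one prior, which is an assumption here and not necessarily the conjunction operator defined in the book. Under that assumption the combined log-odds are the sum of the input log-odds plus a constant, so the combined probability is a sigmoid applied to a weighted sum plus a bias.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def neuron(inputs: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """A single feedforward unit: sigmoid(w . x + b)."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def combine(probs: Sequence[float], prior: float = 0.5) -> float:
    """Naive-Bayes fusion of calibrated probabilities (illustrative assumption).

    logit(combined) = sum(logit(p_i)) - (n - 1) * logit(prior)
    """
    inputs = [logit(p) for p in probs]
    bias = -(len(probs) - 1) * logit(prior)
    # Unit weights: the fusion rule is exactly one neuron over log-odds inputs.
    return neuron(inputs, [1.0] * len(inputs), bias)


# Two agreeing signals of 0.8 reinforce each other: odds 4 * 4 = 16, i.e. 16/17.
print(combine([0.8, 0.8]))
```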

WAND/BMW Query Evaluation

Block-max weighted AND algorithms achieving 50-88% document skip rates for top-K retrieval, with proven exact neural pruning guarantees.
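
The skipping behavior can be sketched with a simplified WAND loop in Python. This version uses one global score upper bound per term instead of per-block maxima, so it illustrates plain WAND and not BMW, and it says nothing about neural pruning. The data layout, function name, and sample postings are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

INF = float("inf")


def wand_top_k(postings: Dict[str, List[Tuple[int, float]]], k: int):
    """Return (top-k [(doc_id, score)], number of fully scored documents).

    postings maps a term to a list of (doc_id, score) sorted by doc_id.
    """
    # Each cursor is [posting list, position, upper bound on the term's score].
    cursors = [[pl, 0, max(s for _, s in pl)] for pl in postings.values() if pl]
    heap: List[Tuple[float, int]] = []  # min-heap of (score, doc_id)
    threshold = -INF                    # k-th best score once the heap is full
    scored = 0

    def cur_doc(c):
        return c[0][c[1]][0] if c[1] < len(c[0]) else INF

    while True:
        cursors.sort(key=cur_doc)
        # Pivot: first doc at which the summed upper bounds can beat the threshold.
        acc, pivot = 0.0, None
        for c in cursors:
            if cur_doc(c) == INF:
                break
            acc += c[2]
            if acc > threshold:
                pivot = cur_doc(c)
                break
        if pivot is None:
            break  # no remaining document can enter the top k
        if cur_doc(cursors[0]) == pivot:
            # Every lagging cursor has reached the pivot: score it fully.
            score = 0.0
            for c in cursors:
                if cur_doc(c) == pivot:
                    score += c[0][c[1]][1]
                    c[1] += 1
            scored += 1
            if len(heap) < k:
                heapq.heappush(heap, (score, pivot))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, pivot))
            if len(heap) == k:
                threshold = heap[0][0]
        else:
            # Documents before the pivot cannot beat the threshold: skip them.
            for c in cursors:
                if cur_doc(c) >= pivot:
                    break
                while c[1] < len(c[0]) and c[0][c[1]][0] < pivot:
                    c[1] += 1
    top = sorted(heap, key=lambda t: (-t[0], t[1]))
    return [(doc, score) for score, doc in top], scored


# Hypothetical postings: 7 distinct documents across 3 terms.
postings = {
    "a": [(1, 1.0), (2, 0.5), (5, 2.0), (9, 0.5)],
    "b": [(2, 1.5), (5, 1.0), (7, 0.5)],
    "c": [(3, 0.25), (5, 0.5), (8, 0.25), (9, 0.25)],
}
top, scored = wand_top_k(postings, k=2)
print(top, scored)
```

A production implementation would replace the linear cursor advance with skip pointers and the per-term bounds with per-block bounds, which is where the block-max variant gets its additional skipping.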

Reading Paths

For Database Developers

Start with Part I for theoretical foundations, then proceed through Parts II-V for core implementation details. Appendix A is essential for CVM development.

For Application Developers

Focus on Chapters 8 (SQL Parser), 12 (Cypher), 28 (PostgreSQL Wire Protocol), and 30 (Multi-Protocol Service Layer). Appendices B and D provide the SQL compatibility and API references.

For Graph Database Developers

Chapter 3 provides the theoretical graph algebra, Chapter 11 covers storage and traversal algorithms, and Chapter 12 details the Cypher-to-SQL rewriting pipeline.

For Operations Engineers

Chapters 31-33 cover production concerns. Appendix C provides comprehensive configuration documentation.

For Researchers

Part I establishes the theoretical framework. Chapters 19-25 cover state-of-the-art information retrieval techniques, including the probabilistic calibration trilogy (Chapters 20, 22, 23) that connects information retrieval to neural computation.

Version Information

Cognica Version: 1.0
Last Updated: March 2026
Total Chapters: 34
Total Appendices: 4