Cognica Database Internals

A Unified Data Processing System

A comprehensive academic textbook on database engine architecture, query processing, and distributed systems.

This book provides an in-depth exploration of the Cognica database engine, covering theoretical foundations, implementation details, and practical engineering considerations. From the unified query algebra that bridges relational, document, and full-text search paradigms to the bytecode virtual machine that executes queries, each chapter combines rigorous theory with real-world implementation insights.


Author

Jaepil Jeong (jaepil@cognica.io)


Part I: Foundations

Establishing the theoretical framework for unified data processing

| Chapter | Title | Topics |
|---|---|---|
| 1 | Introduction to Unified Data Processing | Data paradigm fragmentation, unified algebra motivation, system architecture overview |
| 2 | Mathematical Foundations of Query Algebras | Set theory, Boolean algebra, posting lists, algebraic properties |
| 3 | Extending the Algebra to Graph Structures | Graph type system, path expressions, pattern matching, traversal algebra |
| 4 | Query Optimization Theory | Cost models, plan enumeration, join ordering, cardinality estimation |

Part II: Storage Engine

Building the persistent foundation for data management

| Chapter | Title | Topics |
|---|---|---|
| 5 | LSM-Tree Storage Architecture | Memtables, compaction strategies, write amplification, RocksDB integration |
| 6 | Document Storage and Schema Management | Document model, schema inference, flexible typing, collection management |
| 7 | Inverted Index Architecture | Posting lists, skip lists, term dictionaries, clustered term index |

Part III: Query Processing

Transforming SQL into executable plans

| Chapter | Title | Topics |
|---|---|---|
| 8 | SQL Parser and Semantic Analysis | libpg_query, AST construction, name resolution, type checking |
| 9 | Logical Planning and Optimization | Logical operators, equivalence rules, predicate pushdown, join reordering |
| 10 | Physical Planning and Execution Strategies | Physical operators, access path selection, parallel execution, plan caching |

Part IV: Graph Query Processing

From property graphs to Cypher-over-SQL

| Chapter | Title | Topics |
|---|---|---|
| 11 | Graph Storage and Operations | Property graph model, dual-collection storage, traversal algorithms, adjacency cache |
| 12 | Cypher Query Language | Cypher lexer/parser, AST, Cypher-to-SQL rewriting, recursive CTE path translation |

Part V: Execution Engine

The heart of query evaluation

| Chapter | Title | Topics |
|---|---|---|
| 13 | CVM Architecture | Bytecode VM design, register allocation, opcode dispatch, stack management |
| 14 | CVM Compilation Pipeline | IR generation, lowering passes, optimization phases, code generation |
| 15 | Vectorized Execution | SIMD operations, batch processing, columnar evaluation, filter pushdown |
| 16 | Copy-and-Patch JIT Compilation | Stencil-based JIT, runtime code generation, hot path optimization |
| 17 | Zero-Copy JOIN Implementation | Composite rows, reference semantics, memory efficiency, JOIN algorithms |

Part VI: Similarity Search and Ranking

Bridging structured and unstructured data retrieval

| Chapter | Title | Topics |
|---|---|---|
| 18 | Text Analysis Pipeline | Tokenization, normalization, stemming, stopwords, ICU integration |
| 19 | BM25 Scoring | TF-IDF, BM25 formula, IDF computation, term frequency saturation, numerical stability |
| 20 | Bayesian BM25 and Probabilistic Calibration | Sigmoid likelihood, composite prior, base rate prior, three-term decomposition, WAND compatibility |
| 21 | Vector Search and HNSW Index | Dense embeddings, ANN search, HNSW algorithm, distance metrics, quantization |
| 22 | Vector Score Calibration | Likelihood ratio calibration, cross-modal estimation, index-aware statistics, unified fusion |
| 23 | From Bayesian Inference to Neural Computation | Conjunction shrinkage, log-odds framework, neural emergence, activation functions, exact pruning |
| 24 | Hybrid Search Architecture | Score fusion, log-odds conjunction, multi-stage retrieval, query composition |
| 25 | Query Evaluation Strategies (WAND/BMW) | Top-K algorithms, early termination, block-max optimization, neural pruning |

Part VII: Distributed Systems

Scaling beyond a single node

| Chapter | Title | Topics |
|---|---|---|
| 26 | Raft Consensus Protocol | Leader election, log replication, safety guarantees, membership changes |
| 27 | Transaction Processing | ACID properties, isolation levels, MVCC, SSI, deadlock detection |

Part VIII: System Integration

Connecting to the outside world

| Chapter | Title | Topics |
|---|---|---|
| 28 | PostgreSQL Wire Protocol | Message formats, authentication, extended query protocol, COPY |
| 29 | External Table Integration | Foreign data wrappers, Arrow Flight SQL, predicate pushdown, federation |
| 30 | Multi-Protocol Service Layer | HTTP REST, Flight SQL, protocol multiplexing |

Part IX: Advanced Topics

Engineering for production excellence

| Chapter | Title | Topics |
|---|---|---|
| 31 | Memory Management | Arena allocation, memory pools, cache hierarchies, OOM handling |
| 32 | Observability and Debugging | Metrics, logging, tracing, profiling, EXPLAIN analysis |
| 33 | Performance Engineering | Benchmarking, bottleneck analysis, tuning strategies, capacity planning |
| 34 | Context-Isolated Architecture | ServerContext, singleton removal, bridge patterns, multi-tenancy foundations |

Appendices

Reference materials for implementation and deployment

| Appendix | Title | Contents |
|---|---|---|
| A | CVM Opcode Reference | Complete opcode listing, instruction formats, type system, built-in functions |
| B | SQL Compatibility Reference | Supported statements, data types, operators, functions, PostgreSQL compatibility |
| C | Configuration Reference | All configuration options with types, defaults, and tuning guidelines |
| D | API Reference | PostgreSQL protocol, Flight SQL, HTTP REST endpoints |

Key Innovations

This book documents several key innovations in the Cognica database engine:

Unified Query Algebra

A mathematical framework that treats posting lists as the universal abstraction, enabling seamless queries across relational, document, full-text search, and graph paradigms.
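
To make the idea of posting lists as a shared abstraction concrete, here is a minimal sketch rather than Cognica's implementation. It assumes a posting list is simply a sorted list of document IDs, and the helper names (`field_equals`, `term_match`, `intersect`, `union`) and the sample documents are hypothetical. A relational-style predicate and a full-text term lookup both produce posting lists, so a Boolean AND across paradigms reduces to a sorted-list intersection.

```python
from typing import Any, Dict, List

# Hypothetical in-memory collection keyed by document ID.
DOCS: Dict[int, Dict[str, Any]] = {
    1: {"status": "active", "body": "raft consensus protocol"},
    2: {"status": "archived", "body": "lsm tree compaction"},
    3: {"status": "active", "body": "lsm tree storage"},
}


def field_equals(field: str, value: Any) -> List[int]:
    """Relational/document predicate expressed as a sorted posting list."""
    return sorted(d for d, doc in DOCS.items() if doc.get(field) == value)


def term_match(term: str) -> List[int]:
    """Full-text term lookup expressed as a sorted posting list."""
    return sorted(d for d, doc in DOCS.items() if term in doc["body"].split())


def intersect(a: List[int], b: List[int]) -> List[int]:
    """Boolean AND: linear merge of two sorted posting lists."""
    out: List[int] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a: List[int], b: List[int]) -> List[int]:
    """Boolean OR: sorted union of two posting lists."""
    return sorted(set(a) | set(b))


# status = 'active' AND body contains 'lsm'
result = intersect(field_equals("status", "active"), term_match("lsm"))
```

Because both operands have the same shape, the same `intersect` and `union` operators serve either kind of predicate, which is the property the algebra relies on.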

Cypher-over-SQL Graph Processing

A parse-time Cypher-to-SQL rewriting engine that translates graph pattern matching into standard SQL subquery pipelines, enabling graph queries to benefit from the full SQL optimizer without a separate graph execution engine.
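
As a rough illustration of the rewriting idea, the sketch below turns a single-hop pattern such as `MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN a.name, b.name` into a SQL join and runs it on SQLite. Everything here is an assumption made for the example: the `nodes` and `edges` table layouts, the function name `rewrite_single_hop`, and the join-based SQL shape. The rewriter described in Chapter 12 may emit different SQL (for instance, subquery pipelines or recursive CTEs for variable-length paths).

```python
import sqlite3
from typing import Tuple


def rewrite_single_hop(src_label: str, edge_type: str,
                       dst_label: str) -> Tuple[str, Tuple[str, str, str]]:
    """Rewrite (a:src_label)-[:edge_type]->(b:dst_label) into parameterized SQL.

    Assumes hypothetical tables nodes(id, label, name) and edges(src, dst, type).
    """
    sql = (
        "SELECT a.name, b.name "
        "FROM nodes AS a "
        "JOIN edges AS e ON e.src = a.id "
        "JOIN nodes AS b ON b.id = e.dst "
        "WHERE a.label = ? AND e.type = ? AND b.label = ? "
        "ORDER BY a.name, b.name"
    )
    return sql, (src_label, edge_type, dst_label)


# Small in-memory graph to run the rewritten query against.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, label TEXT, name TEXT);
    CREATE TABLE edges (src INTEGER, dst INTEGER, type TEXT);
    INSERT INTO nodes VALUES (1, 'Person', 'Ada');
    INSERT INTO nodes VALUES (2, 'Person', 'Bob');
    INSERT INTO nodes VALUES (3, 'City', 'Seoul');
    INSERT INTO edges VALUES (1, 2, 'KNOWS');
    INSERT INTO edges VALUES (1, 3, 'LIVES_IN');
    """
)

sql, params = rewrite_single_hop("Person", "KNOWS", "Person")
rows = conn.execute(sql, params).fetchall()
```

Once the pattern is plain SQL, the ordinary SQL planner handles join ordering and predicate pushdown, which is the benefit the rewriting approach aims for.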

Clustered Term Index

A novel inverted index organization that reduces key count by 62,500x compared to traditional term-per-key approaches, dramatically improving write performance.

Cognica Virtual Machine (CVM)

A 256+ opcode bytecode interpreter with computed-goto dispatch, achieving near-native performance for interpreted query execution.

Copy-and-Patch JIT

A stencil-based JIT compilation technique that provides 2-5x speedup for hot query paths with minimal compilation overhead.

Zero-Copy JOINs

Composite row representation that eliminates data copying during JOIN operations, reducing memory pressure and improving cache efficiency.
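
The following sketch shows the general idea of a composite row under simplifying assumptions; it is not the engine's actual data structure. The `CompositeRow` class, the `l.`/`r.` column prefixes, and the `hash_join` helper are hypothetical names. Each joined row holds references to its two input rows, so no field values are copied when the join produces output.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Dict, List

Row = Dict[str, Any]


@dataclass(frozen=True)
class CompositeRow:
    """A joined row that references its inputs instead of copying their fields."""
    left: Row
    right: Row

    def __getitem__(self, qualified: str) -> Any:
        # "l.name" reads from the left input, "r.total" from the right input.
        side, _, column = qualified.partition(".")
        return (self.left if side == "l" else self.right)[column]


def hash_join(left: List[Row], right: List[Row], key: str) -> List[CompositeRow]:
    """Equi-join on `key`; output rows only point at the input rows."""
    index: Dict[Any, List[Row]] = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [CompositeRow(l, r) for l in left for r in index.get(l[key], [])]


users = [{"uid": 1, "name": "Ada"}, {"uid": 2, "name": "Bob"}]
orders = [{"uid": 1, "total": 30}, {"uid": 1, "total": 12}]

joined = hash_join(users, orders, "uid")
```

Both output rows here share the same `users[0]` object, so the per-row cost of the join is two references regardless of how wide the input rows are.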

Bayesian BM25 and Probabilistic Calibration

A three-term posterior decomposition that transforms unbounded BM25 scores into calibrated probabilities, enabling principled multi-signal fusion with 68-77% calibration error reduction in unsupervised settings.

Vector Score Calibration

Likelihood ratio calibration that transforms vector similarity scores into relevance probabilities using distributional statistics extracted from ANN indexes at zero additional cost.

Neural Emergence from Bayesian Inference

The discovery that combining calibrated probability signals through log-odds conjunction analytically produces the structure of a feedforward neural network, reversing the conventional direction of explanation in neural network theory.
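
A small numeric check can illustrate why this structure appears. The sketch assumes the simplest possible setting, which may differ from the book's full formulation (it omits priors and the conjunction shrinkage listed for Chapter 23): each signal is calibrated with a sigmoid of a linear function of its raw score, and signals are fused by summing log-odds, as under conditional independence with a uniform prior. The slopes, offsets, and scores are made-up values. Since `logit(sigmoid(x)) = x`, the fused probability equals a single sigmoid unit applied to a weighted sum of the raw scores.

```python
import math
from typing import Iterable, List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def log_odds_conjunction(probs: Iterable[float]) -> float:
    """Fuse calibrated probabilities by summing their log-odds."""
    return sigmoid(sum(logit(p) for p in probs))


# Hypothetical signals: (raw score, calibration slope, calibration offset).
signals: List[Tuple[float, float, float]] = [
    (2.1, 1.5, -3.0),  # e.g. a lexical score
    (0.8, 4.0, -2.5),  # e.g. a vector similarity
]

# Step 1: calibrate each raw score into a probability.
calibrated = [sigmoid(w * s + b) for s, w, b in signals]

# Step 2: fuse the calibrated probabilities in log-odds space.
fused = log_odds_conjunction(calibrated)

# The same value computed as one neuron: sigmoid(weights . scores + bias).
bias = sum(b for _, _, b in signals)
neuron = sigmoid(sum(w * s for s, w, _ in signals) + bias)
```

In this reading the calibration slopes play the role of weights, the summed offsets play the role of a bias, and the sigmoid is the activation function.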

WAND/BMW Query Evaluation

Block-max weighted AND algorithms achieving 50-88% document skip rates for top-K retrieval, with proven exact neural pruning guarantees.


Reading Paths

For Database Developers

Start with Part I for theoretical foundations, then proceed through Parts II-V for core implementation details. Appendix A is essential for CVM development.

For Application Developers

Focus on Chapters 8 (SQL Parser), 12 (Cypher), 28 (PostgreSQL Protocol), and 30 (Multi-Protocol Layer). Appendices B and D provide API references.

For Graph Database Developers

Chapter 3 provides the theoretical graph algebra, Chapter 11 covers storage and traversal algorithms, and Chapter 12 details the Cypher-to-SQL rewriting pipeline.

For Operations Engineers

Chapters 31-33 cover production concerns. Appendix C provides comprehensive configuration documentation.

For Researchers

Part I establishes the theoretical framework. Chapters 19-25 cover state-of-the-art information retrieval techniques, including the probabilistic calibration trilogy (Chapters 20, 22, 23) that connects information retrieval to neural computation.

Version Information

  • Cognica Version: 1.0
  • Last Updated: March 2026
  • Total Chapters: 34
  • Total Appendices: 4

Copyright 2023-2026 Cognica, Inc. All rights reserved.
