Chapter 2: Mathematical Foundations of Query Algebras
This chapter establishes the rigorous mathematical framework underlying unified query processing. We formalize posting lists as a Boolean algebra, define the type system spanning multiple data paradigms, and develop the operator calculus that enables cross-paradigm optimization.
2.1 Set Theory and Boolean Algebra Review
Before developing the unified algebra, we review the mathematical structures upon which it rests.
2.1.1 Sets and Relations
A set is an unordered collection of distinct elements. We write to indicate element belongs to set , and to denote the cardinality (size) of .
The power set of a set is the set of all subsets of :
For a finite set with elements, .
A relation is a set of ordered pairs from the Cartesian product of sets and . A function is a relation where each element of maps to exactly one element of .
2.1.2 Boolean Algebra Axioms
A Boolean algebra is a set with two binary operations (meet/AND) and (join/OR), a unary operation (complement/NOT), and two distinguished elements (bottom) and (top), satisfying:
Commutativity:
Associativity:
Absorption:
Identity:
Distributivity:
Complementation:
The power set with intersection (), union (), complement, empty set (), and universal set () forms a Boolean algebra. This is the algebra of posting lists.
2.1.3 Lattice Theory Fundamentals
A partially ordered set (poset) is a set with a reflexive, antisymmetric, transitive relation .
A lattice is a poset where every pair of elements has a least upper bound (join, ) and greatest lower bound (meet, ).
A complete lattice has joins and meets for all subsets, not just pairs. The power set ordered by inclusion forms a complete lattice:
- Meet of a family:
- Join of a family:
- Bottom element:
- Top element:
This structure is fundamental to query optimization, where query plans form a lattice ordered by cost, and transformations navigate toward the optimal plan.
2.2 Posting Lists as Universal Abstraction
We now formalize how posting lists serve as the universal abstraction connecting disparate query paradigms.
2.2.1 Document Universe
Let denote the document universe - the set of all documents in the database. Each document has a unique identifier .
In practice, might contain millions or billions of documents. We assume documents are immutable for query purposes; updates create new document versions.
2.2.2 Posting List Definition
A posting list is a function mapping a term (or more generally, a predicate) to the set of documents satisfying that predicate:
where is the term space. For a term , the posting list contains exactly those documents containing term :
Example: For documents:
- : "database systems"
- : "distributed database"
- : "operating systems"
The posting lists are:
2.2.3 Boolean Algebra of Posting Lists
Posting lists form a Boolean algebra under set operations:
Conjunction (AND): Documents matching both predicates
Disjunction (OR): Documents matching either predicate
Negation (NOT): Documents not matching the predicate
These operations satisfy the Boolean algebra axioms:
Commutativity:
Associativity:
Distributivity:
De Morgan's Laws:
2.2.4 Scored Posting Lists
For ranking queries, we extend posting lists to carry scores:
A scored posting list maps terms to sets of (document, score) pairs:
Score combination during Boolean operations requires specification:
AND with scores: Various strategies exist
- Minimum:
- Product: (probabilistic interpretation)
- Sum: (additive combination)
OR with scores:
- Maximum:
- Probabilistic OR:
Chapter 19 develops these score combination strategies in depth.
2.3 Type System Formalization
A unified query system must bridge multiple data paradigms. We formalize the type system that enables this bridging.
2.3.1 Document Space
The document space is the primary universe. Each document is a semi-structured record with:
- A unique identifier:
- A schema:
- Field values:
where is the field name space and is the value space.
The value space is a discriminated union:
2.3.2 Term Space
The term space contains atomic units of text:
where is the set of all strings and is an equivalence relation defined by text normalization (lowercasing, stemming, etc.).
The terms function extracts terms from text:
This function encapsulates the text analysis pipeline (Chapter 16).
2.3.3 Vector Space
The vector space is -dimensional real space:
Documents may have vector representations through embedding functions:
Common metrics on vector space include:
Euclidean (L2) distance:
Cosine distance:
Inner product (for normalized vectors):
2.3.4 Field Space
The field space names document attributes:
Fields support hierarchical naming for nested documents:
Field paths form a tree structure rooted at the document.
2.3.5 Bijective Mappings
The power of the unified type system comes from explicit mappings between spaces:
Document to Posting List: Every document belongs to posting lists for its terms
Term to Posting List: Direct mapping
Vector to Posting List: k-NN search produces a posting list
Relational Predicate to Posting List: Filter produces a posting list
These mappings enable the query planner to treat operations from different paradigms uniformly.
2.4 Operator Calculus
With the type system established, we define the operators that transform posting lists.
2.4.1 Primitive Operators
Term Retrieval (): Retrieve the posting list for a term
Filter (): Apply a predicate to filter a posting list
Projection (): Select specific fields from documents
where is the space of projected documents.
Vector Search (): Find nearest neighbors
Score (): Attach scores to documents
2.4.2 Composition as Monoid
Operators compose to form pipelines. The set of operators forms a monoid under composition:
where:
- is the set of operators on posting lists
- is function composition
- is the identity operator
Associativity:
Identity:
This monoid structure enables pipeline optimization through algebraic manipulation.
2.4.3 Operator Properties
Idempotence: Applying the same filter twice is equivalent to applying it once
Filter Commutativity: Independent filters commute
Filter Conjunction: Sequential filters equivalent to conjunctive filter
Projection Pushdown: Projection through filter (when fields independent)
Selection Pushdown: Filter through intersection
2.4.4 Equivalence Classes
Two operator expressions are equivalent if they produce identical results for all inputs:
Equivalence induces equivalence classes on query plans. The optimizer explores these classes to find minimum-cost representatives.
Theorem (Completeness): The transformation rules in Section 2.4.3 are complete for the class of select-project-join queries over posting lists.
This theorem ensures the optimizer can reach any equivalent plan through rule application.
2.5 Extending to Scored Operations
Relevance ranking requires extending the algebra to handle scores.
2.5.1 Scored Posting Lists
A scored posting list is a set of (document, score) pairs:
We define projection functions:
2.5.2 Score Combination Semantics
When combining scored posting lists, we must specify how scores combine:
Conjunction (AND): For and
Probabilistic:
Additive:
Minimum:
Disjunction (OR): For documents in
Probabilistic (inclusion-exclusion):
Maximum:
Sum:
2.5.3 Score Normalization
For probabilistic combination, scores must be in . Raw relevance scores (e.g., BM25) are unbounded. Chapter 17 develops Bayesian BM25 which provides calibrated probability scores suitable for probabilistic combination.
The normalization function:
must preserve relative ordering:
2.6 Categorical Perspective
For readers familiar with category theory, we briefly sketch the categorical interpretation of the unified algebra.
2.6.1 Category of Posting Lists
Define category :
- Objects: Posting lists
- Morphisms: Set functions preserving document identity
This category is equivalent to the category of finite sets and functions.
2.6.2 Functors Between Paradigms
Each data paradigm defines a functor to :
Text Functor
Vector Functor
Relational Functor
2.6.3 Natural Transformations
Optimization rules are natural transformations between functors:
For example, the transformation from nested-loop join to hash join is a natural transformation in the category of query plans.
2.6.4 Adjunctions
The free/forgetful adjunction captures the relationship between structured queries and their posting list representations:
This adjunction formalizes how queries generate posting lists and how posting list operations induce query transformations.
2.7 Implementation Considerations
The mathematical framework guides efficient implementation.
2.7.1 Posting List Representation
Posting lists are stored as sorted arrays of document IDs:
PostingList = [doc_id_1, doc_id_2, ..., doc_id_n] // sorted
Sorted order enables:
- Binary search for membership:
- Merge intersection:
- Merge union:
2.7.2 Skip Pointers
For large posting lists, skip pointers accelerate intersection:
PostingList with Skips:
[doc_id_1, ..., doc_id_k] -> skip to doc_id_{k*s}
[doc_id_{k*s+1}, ..., doc_id_{2*k*s}] -> skip to doc_id_{2*k*s}
...
Skip distance minimizes worst-case intersection time.
2.7.3 Compression
Document IDs compress through delta encoding:
Original: [100, 105, 110, 200, 250]
Deltas: [100, 5, 5, 90, 50]
Variable-byte encoding further reduces space:
- Small deltas: 1 byte
- Large deltas: 2-4 bytes
2.7.4 Block-Based Processing
Modern CPUs process data fastest in cache-line-sized blocks. Posting lists partition into blocks:
Block 1: [doc_1, ..., doc_128] max_score=0.95
Block 2: [doc_129, ..., doc_256] max_score=0.87
...
Block-level metadata enables pruning in top-k queries (Chapter 20).
2.8 Summary
This chapter established the mathematical foundations for unified query processing:
-
Boolean algebra of sets provides the framework for posting list operations, with well-defined intersection, union, and complement satisfying algebraic axioms
-
Type system spans documents, terms, vectors, and fields with explicit mappings between spaces, enabling unified representation of multi-paradigm data
-
Operator calculus defines primitive operators (term retrieval, filter, projection, vector search, scoring) that compose into query pipelines with algebraic properties enabling optimization
-
Scored extension handles relevance ranking through scored posting lists with various combination semantics (probabilistic, additive, min/max)
-
Categorical perspective reveals the deep structure: functors map paradigms to posting lists, natural transformations capture optimization rules, and adjunctions formalize the query-posting list relationship
The following chapter extends this framework to incorporate graph structures, proving that graph posting lists preserve the algebraic properties while enabling traversal and pattern matching operations.