Chapter 2: Mathematical Foundations of Query Algebras

This chapter establishes the rigorous mathematical framework underlying unified query processing. We formalize posting lists as a Boolean algebra, define the type system spanning multiple data paradigms, and develop the operator calculus that enables cross-paradigm optimization.

2.1 Set Theory and Boolean Algebra Review

Before developing the unified algebra, we review the mathematical structures upon which it rests.

2.1.1 Sets and Relations

A set is an unordered collection of distinct elements. We write $x \in S$ to indicate element $x$ belongs to set $S$ , and $|S|$ to denote the cardinality (size) of $S$ .

The power set $2^S$ of a set $S$ is the set of all subsets of $S$ :

2^S = \{T \mid T \subseteq S\}

For a finite set with $n$ elements, $|2^S| = 2^n$ .

A relation $R \subseteq A \times B$ is a set of ordered pairs from the Cartesian product of sets $A$ and $B$ . A function $f: A \rightarrow B$ is a relation where each element of $A$ maps to exactly one element of $B$ .

2.1.2 Boolean Algebra Axioms

A Boolean algebra is a set $B$ with two binary operations $\land$ (meet/AND) and $\lor$ (join/OR), a unary operation $\neg$ (complement/NOT), and two distinguished elements $0$ (bottom) and $1$ (top), satisfying:

Commutativity:

a \land b = b \land a

a \lor b = b \lor a

Associativity:

(a \land b) \land c = a \land (b \land c)

(a \lor b) \lor c = a \lor (b \lor c)

Absorption:

a \land (a \lor b) = a

a \lor (a \land b) = a

Identity:

a \land 1 = a

a \lor 0 = a

Distributivity:

a \land (b \lor c) = (a \land b) \lor (a \land c)

a \lor (b \land c) = (a \lor b) \land (a \lor c)

Complementation:

a \land \neg a = 0

a \lor \neg a = 1

The power set $2^S$ with intersection ( $\cap$ ), union ( $\cup$ ), complement, empty set ( $\emptyset$ ), and universal set ( $S$ ) forms a Boolean algebra. This is the algebra of posting lists.

2.1.3 Lattice Theory Fundamentals

A partially ordered set (poset) $(L, \leq)$ is a set $L$ with a reflexive, antisymmetric, transitive relation $\leq$ .

A lattice is a poset where every pair of elements has a least upper bound (join, $\lor$ ) and greatest lower bound (meet, $\land$ ).

A complete lattice has joins and meets for all subsets, not just pairs. The power set $2^S$ ordered by inclusion forms a complete lattice:

Meet of a family: $\bigcap_{i \in I} S_i$
Join of a family: $\bigcup_{i \in I} S_i$
Bottom element: $\emptyset$
Top element: $S$

This structure is fundamental to query optimization, where query plans form a lattice ordered by cost, and transformations navigate toward the optimal plan.

2.2 Posting Lists as Universal Abstraction

We now formalize how posting lists serve as the universal abstraction connecting disparate query paradigms.

2.2.1 Document Universe

Let $\mathcal{D}$ denote the document universe - the set of all documents in the database. Each document $d \in \mathcal{D}$ has a unique identifier $\text{id}(d) \in \mathbb{N}$ .

In practice, $\mathcal{D}$ might contain millions or billions of documents. We assume documents are immutable for query purposes; updates create new document versions.

2.2.2 Posting List Definition

A posting list is a function mapping a term (or more generally, a predicate) to the set of documents satisfying that predicate:

P: \mathcal{T} \rightarrow 2^{\mathcal{D}}

where $\mathcal{T}$ is the term space. For a term $t$ , the posting list $P(t)$ contains exactly those documents containing term $t$ :

P(t) = \{d \in \mathcal{D} \mid t \in \text{terms}(d)\}

Example: For documents:

$d_1$ : "database systems"
$d_2$ : "distributed database"
$d_3$ : "operating systems"

The posting lists are:

$P(\text{database}) = \{d_1, d_2\}$
$P(\text{systems}) = \{d_1, d_3\}$
$P(\text{distributed}) = \{d_2\}$
$P(\text{operating}) = \{d_3\}$

2.2.3 Boolean Algebra of Posting Lists

Posting lists form a Boolean algebra under set operations:

Conjunction (AND): Documents matching both predicates

P(t_1) \cap P(t_2) = \{d \in \mathcal{D} \mid t_1 \in \text{terms}(d) \land t_2 \in \text{terms}(d)\}

Disjunction (OR): Documents matching either predicate

P(t_1) \cup P(t_2) = \{d \in \mathcal{D} \mid t_1 \in \text{terms}(d) \lor t_2 \in \text{terms}(d)\}

Negation (NOT): Documents not matching the predicate

\overline{P(t)} = \mathcal{D} \setminus P(t) = \{d \in \mathcal{D} \mid t \notin \text{terms}(d)\}

These operations satisfy the Boolean algebra axioms:

Commutativity:

P_1 \cap P_2 = P_2 \cap P_1

Associativity:

(P_1 \cap P_2) \cap P_3 = P_1 \cap (P_2 \cap P_3)

Distributivity:

P_1 \cap (P_2 \cup P_3) = (P_1 \cap P_2) \cup (P_1 \cap P_3)

De Morgan's Laws:

\overline{P_1 \cap P_2} = \overline{P_1} \cup \overline{P_2}

\overline{P_1 \cup P_2} = \overline{P_1} \cap \overline{P_2}

2.2.4 Scored Posting Lists

For ranking queries, we extend posting lists to carry scores:

P_s: \mathcal{T} \rightarrow 2^{\mathcal{D} \times \mathbb{R}}

A scored posting list maps terms to sets of (document, score) pairs:

P_s(t) = \{(d, s) \mid d \in P(t), s = \text{score}(d, t)\}

Score combination during Boolean operations requires specification:

AND with scores: Various strategies exist

Minimum: $\min(s_1, s_2)$
Product: $s_1 \cdot s_2$ (probabilistic interpretation)
Sum: $s_1 + s_2$ (additive combination)

OR with scores:

Maximum: $\max(s_1, s_2)$
Probabilistic OR: $1 - (1-s_1)(1-s_2)$

Chapter 19 develops these score combination strategies in depth.

2.3 Type System Formalization

A unified query system must bridge multiple data paradigms. We formalize the type system that enables this bridging.

2.3.1 Document Space

The document space $\mathcal{D}$ is the primary universe. Each document $d \in \mathcal{D}$ is a semi-structured record with:

A unique identifier: $\text{id}: \mathcal{D} \rightarrow \mathbb{N}$
A schema: $\text{schema}: \mathcal{D} \rightarrow \Sigma$
Field values: $\text{field}: \mathcal{D} \times \mathcal{F} \rightarrow \mathcal{V}$

where $\mathcal{F}$ is the field name space and $\mathcal{V}$ is the value space.

The value space is a discriminated union:

\mathcal{V} = \text{Null} \mid \text{Bool} \mid \text{Int64} \mid \text{Float64} \mid \text{String} \mid \text{Binary} \mid \text{Array}(\mathcal{V}) \mid \text{Object}(\mathcal{F} \rightarrow \mathcal{V})

2.3.2 Term Space

The term space $\mathcal{T}$ contains atomic units of text:

\mathcal{T} = \Sigma^* / \sim

where $\Sigma^*$ is the set of all strings and $\sim$ is an equivalence relation defined by text normalization (lowercasing, stemming, etc.).

The terms function extracts terms from text:

\text{terms}: \text{String} \rightarrow 2^{\mathcal{T}}

This function encapsulates the text analysis pipeline (Chapter 16).

2.3.3 Vector Space

The vector space $\mathcal{V}_n$ is $n$ -dimensional real space:

\mathcal{V}_n = \mathbb{R}^n

Documents may have vector representations through embedding functions:

\text{embed}: \mathcal{D} \times \mathcal{F} \rightarrow \mathcal{V}_n

Common metrics on vector space include:

Euclidean (L2) distance:

d_{L2}(\vec{u}, \vec{v}) = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}

Cosine distance:

d_{\cos}(\vec{u}, \vec{v}) = 1 - \frac{\vec{u} \cdot \vec{v}}{|\vec{u}| \cdot |\vec{v}|}

Inner product (for normalized vectors):

\text{ip}(\vec{u}, \vec{v}) = \vec{u} \cdot \vec{v} = \sum_{i=1}^{n} u_i \cdot v_i

2.3.4 Field Space

The field space $\mathcal{F}$ names document attributes:

\mathcal{F} = \text{String}

Fields support hierarchical naming for nested documents:

\text{address.city} \in \mathcal{F}

Field paths form a tree structure rooted at the document.

2.3.5 Bijective Mappings

The power of the unified type system comes from explicit mappings between spaces:

Document to Posting List: Every document belongs to posting lists for its terms

\text{PostingFor}: \mathcal{D} \rightarrow 2^{2^{\mathcal{D}}}

\text{PostingFor}(d) = \{P(t) \mid t \in \text{terms}(d)\}

Term to Posting List: Direct mapping

P: \mathcal{T} \rightarrow 2^{\mathcal{D}}

Vector to Posting List: k-NN search produces a posting list

\text{kNN}: \mathcal{V}_n \times \mathbb{N} \rightarrow 2^{\mathcal{D}}

\text{kNN}(\vec{q}, k) = \text{argmin}_{S \subseteq \mathcal{D}, |S|=k} \sum_{d \in S} d(\vec{q}, \text{embed}(d))

Relational Predicate to Posting List: Filter produces a posting list

\sigma_\phi: 2^{\mathcal{D}} \rightarrow 2^{\mathcal{D}}

\sigma_\phi(\mathcal{D}) = \{d \in \mathcal{D} \mid \phi(d) = \text{true}\}

These mappings enable the query planner to treat operations from different paradigms uniformly.

2.4 Operator Calculus

With the type system established, we define the operators that transform posting lists.

2.4.1 Primitive Operators

Term Retrieval ( $\tau$ ): Retrieve the posting list for a term

\tau: \mathcal{T} \rightarrow 2^{\mathcal{D}}

\tau_t = \{d \in \mathcal{D} \mid t \in \text{terms}(d)\}

Filter ( $\sigma$ ): Apply a predicate to filter a posting list

\sigma: (D \rightarrow \text{Bool}) \times 2^{\mathcal{D}} \rightarrow 2^{\mathcal{D}}

\sigma_\phi(P) = \{d \in P \mid \phi(d)\}

Projection ( $\pi$ ): Select specific fields from documents

\pi: 2^{\mathcal{F}} \times 2^{\mathcal{D}} \rightarrow 2^{\mathcal{D}'}

where $\mathcal{D}'$ is the space of projected documents.

Vector Search ( $\nu$ ): Find nearest neighbors

\nu: \mathcal{V}_n \times \mathbb{N} \times \mathbb{R} \rightarrow 2^{\mathcal{D}}

\nu_{\vec{q},k,\epsilon} = \{d \in \mathcal{D} \mid d(\vec{q}, \text{embed}(d)) < \epsilon\} \cap \text{top-}k

Score ( $\rho$ ): Attach scores to documents

\rho: (\mathcal{D} \rightarrow \mathbb{R}) \times 2^{\mathcal{D}} \rightarrow 2^{\mathcal{D} \times \mathbb{R}}

\rho_f(P) = \{(d, f(d)) \mid d \in P\}

2.4.2 Composition as Monoid

Operators compose to form pipelines. The set of operators forms a monoid under composition:

(\text{Op}, \circ, \text{id})

where:

$\text{Op}$ is the set of operators on posting lists
$\circ$ is function composition
$\text{id}$ is the identity operator

Associativity: $(f \circ g) \circ h = f \circ (g \circ h)$

Identity: $f \circ \text{id} = \text{id} \circ f = f$

This monoid structure enables pipeline optimization through algebraic manipulation.

2.4.3 Operator Properties

Idempotence: Applying the same filter twice is equivalent to applying it once

\sigma_\phi \circ \sigma_\phi = \sigma_\phi

Filter Commutativity: Independent filters commute

\sigma_{\phi_1} \circ \sigma_{\phi_2} = \sigma_{\phi_2} \circ \sigma_{\phi_1}

Filter Conjunction: Sequential filters equivalent to conjunctive filter

\sigma_{\phi_1} \circ \sigma_{\phi_2} = \sigma_{\phi_1 \land \phi_2}

Projection Pushdown: Projection through filter (when fields independent)

\pi_A \circ \sigma_\phi = \sigma_\phi \circ \pi_A \quad \text{if } \text{fields}(\phi) \subseteq A

Selection Pushdown: Filter through intersection

\sigma_\phi(P_1 \cap P_2) = \sigma_\phi(P_1) \cap \sigma_\phi(P_2)

2.4.4 Equivalence Classes

Two operator expressions are equivalent if they produce identical results for all inputs:

e_1 \equiv e_2 \iff \forall P \in 2^{\mathcal{D}}: e_1(P) = e_2(P)

Equivalence induces equivalence classes on query plans. The optimizer explores these classes to find minimum-cost representatives.

Theorem (Completeness): The transformation rules in Section 2.4.3 are complete for the class of select-project-join queries over posting lists.

This theorem ensures the optimizer can reach any equivalent plan through rule application.

2.5 Extending to Scored Operations

Relevance ranking requires extending the algebra to handle scores.

2.5.1 Scored Posting Lists

A scored posting list is a set of (document, score) pairs:

P_s \subseteq \mathcal{D} \times \mathbb{R}^+

We define projection functions:

\text{docs}(P_s) = \{d \mid (d, s) \in P_s\}

\text{score}(P_s, d) = s \text{ where } (d, s) \in P_s

2.5.2 Score Combination Semantics

When combining scored posting lists, we must specify how scores combine:

Conjunction (AND): For $(d, s_1) \in P_1$ and $(d, s_2) \in P_2$

Probabilistic:

\text{score}_{\land}(d) = s_1 \cdot s_2

Additive:

\text{score}_{\land}(d) = s_1 + s_2

Minimum:

\text{score}_{\land}(d) = \min(s_1, s_2)

Disjunction (OR): For documents in $P_1 \cup P_2$

Probabilistic (inclusion-exclusion):

\text{score}_{\lor}(d) = 1 - (1 - s_1)(1 - s_2) = s_1 + s_2 - s_1 \cdot s_2

Maximum:

\text{score}_{\lor}(d) = \max(s_1, s_2)

Sum:

\text{score}_{\lor}(d) = s_1 + s_2

2.5.3 Score Normalization

For probabilistic combination, scores must be in $[0, 1]$ . Raw relevance scores (e.g., BM25) are unbounded. Chapter 17 develops Bayesian BM25 which provides calibrated probability scores suitable for probabilistic combination.

The normalization function:

\text{normalize}: \mathbb{R}^+ \rightarrow [0, 1]

must preserve relative ordering:

s_1 > s_2 \implies \text{normalize}(s_1) > \text{normalize}(s_2)

2.6 Categorical Perspective

For readers familiar with category theory, we briefly sketch the categorical interpretation of the unified algebra.

2.6.1 Category of Posting Lists

Define category $\mathbf{Post}$ :

Objects: Posting lists $P \in 2^{\mathcal{D}}$
Morphisms: Set functions preserving document identity

This category is equivalent to the category of finite sets and functions.

2.6.2 Functors Between Paradigms

Each data paradigm defines a functor to $\mathbf{Post}$ :

Text Functor $F_T: \mathbf{Term} \rightarrow \mathbf{Post}$

F_T(t) = \tau_t

Vector Functor $F_V: \mathbf{Vec} \rightarrow \mathbf{Post}$

F_V(\vec{q}, k) = \text{kNN}(\vec{q}, k)

Relational Functor $F_R: \mathbf{Rel} \rightarrow \mathbf{Post}$

F_R(\phi) = \sigma_\phi(\mathcal{D})

2.6.3 Natural Transformations

Optimization rules are natural transformations between functors:

\eta: F \Rightarrow G

For example, the transformation from nested-loop join to hash join is a natural transformation in the category of query plans.

2.6.4 Adjunctions

The free/forgetful adjunction captures the relationship between structured queries and their posting list representations:

\text{Free} \dashv \text{Forget}: \mathbf{Query} \rightleftharpoons \mathbf{Post}

This adjunction formalizes how queries generate posting lists and how posting list operations induce query transformations.

2.7 Implementation Considerations

The mathematical framework guides efficient implementation.

2.7.1 Posting List Representation

Posting lists are stored as sorted arrays of document IDs:

PostingList = [doc_id_1, doc_id_2, ..., doc_id_n]  // sorted

Sorted order enables:

Binary search for membership: $O(\log n)$
Merge intersection: $O(n + m)$
Merge union: $O(n + m)$

2.7.2 Skip Pointers

For large posting lists, skip pointers accelerate intersection:

PostingList with Skips:
  [doc_id_1, ..., doc_id_k] -> skip to doc_id_{k*s}
  [doc_id_{k*s+1}, ..., doc_id_{2*k*s}] -> skip to doc_id_{2*k*s}
  ...

Skip distance $s = \sqrt{n}$ minimizes worst-case intersection time.

2.7.3 Compression

Document IDs compress through delta encoding:

Original: [100, 105, 110, 200, 250]
Deltas:   [100, 5, 5, 90, 50]

Variable-byte encoding further reduces space:

Small deltas: 1 byte
Large deltas: 2-4 bytes

2.7.4 Block-Based Processing

Modern CPUs process data fastest in cache-line-sized blocks. Posting lists partition into blocks:

Block 1: [doc_1, ..., doc_128]  max_score=0.95
Block 2: [doc_129, ..., doc_256]  max_score=0.87
...

Block-level metadata enables pruning in top-k queries (Chapter 20).

2.8 Summary

This chapter established the mathematical foundations for unified query processing:

Boolean algebra of sets provides the framework for posting list operations, with well-defined intersection, union, and complement satisfying algebraic axioms
Type system spans documents, terms, vectors, and fields with explicit mappings between spaces, enabling unified representation of multi-paradigm data
Operator calculus defines primitive operators (term retrieval, filter, projection, vector search, scoring) that compose into query pipelines with algebraic properties enabling optimization
Scored extension handles relevance ranking through scored posting lists with various combination semantics (probabilistic, additive, min/max)
Categorical perspective reveals the deep structure: functors map paradigms to posting lists, natural transformations capture optimization rules, and adjunctions formalize the query-posting list relationship

The following chapter extends this framework to incorporate graph structures, proving that graph posting lists preserve the algebraic properties while enabling traversal and pattern matching operations.