Chapter 2: Mathematical Foundations of Query Algebras

This chapter establishes the rigorous mathematical framework underlying unified query processing. We formalize posting lists as a Boolean algebra, define the type system spanning multiple data paradigms, and develop the operator calculus that enables cross-paradigm optimization.

2.1 Set Theory and Boolean Algebra Review

Before developing the unified algebra, we review the mathematical structures upon which it rests.

2.1.1 Sets and Relations

A set is an unordered collection of distinct elements. We write xSx \in S to indicate element xx belongs to set SS, and S|S| to denote the cardinality (size) of SS.

The power set 2S2^S of a set SS is the set of all subsets of SS:

2S={TTS}2^S = \{T \mid T \subseteq S\}

For a finite set with nn elements, 2S=2n|2^S| = 2^n.

A relation RA×BR \subseteq A \times B is a set of ordered pairs from the Cartesian product of sets AA and BB. A function f:ABf: A \rightarrow B is a relation where each element of AA maps to exactly one element of BB.

2.1.2 Boolean Algebra Axioms

A Boolean algebra is a set BB with two binary operations \land (meet/AND) and \lor (join/OR), a unary operation ¬\neg (complement/NOT), and two distinguished elements 00 (bottom) and 11 (top), satisfying:

Commutativity:

ab=baa \land b = b \land a ab=baa \lor b = b \lor a

Associativity:

(ab)c=a(bc)(a \land b) \land c = a \land (b \land c) (ab)c=a(bc)(a \lor b) \lor c = a \lor (b \lor c)

Absorption:

a(ab)=aa \land (a \lor b) = a a(ab)=aa \lor (a \land b) = a

Identity:

a1=aa \land 1 = a a0=aa \lor 0 = a

Distributivity:

a(bc)=(ab)(ac)a \land (b \lor c) = (a \land b) \lor (a \land c) a(bc)=(ab)(ac)a \lor (b \land c) = (a \lor b) \land (a \lor c)

Complementation:

a¬a=0a \land \neg a = 0 a¬a=1a \lor \neg a = 1

The power set 2S2^S with intersection (\cap), union (\cup), complement, empty set (\emptyset), and universal set (SS) forms a Boolean algebra. This is the algebra of posting lists.

2.1.3 Lattice Theory Fundamentals

A partially ordered set (poset) (L,)(L, \leq) is a set LL with a reflexive, antisymmetric, transitive relation \leq.

A lattice is a poset where every pair of elements has a least upper bound (join, \lor) and greatest lower bound (meet, \land).

A complete lattice has joins and meets for all subsets, not just pairs. The power set 2S2^S ordered by inclusion forms a complete lattice:

  • Meet of a family: iISi\bigcap_{i \in I} S_i
  • Join of a family: iISi\bigcup_{i \in I} S_i
  • Bottom element: \emptyset
  • Top element: SS

This structure is fundamental to query optimization, where query plans form a lattice ordered by cost, and transformations navigate toward the optimal plan.

2.2 Posting Lists as Universal Abstraction

We now formalize how posting lists serve as the universal abstraction connecting disparate query paradigms.

2.2.1 Document Universe

Let D\mathcal{D} denote the document universe - the set of all documents in the database. Each document dDd \in \mathcal{D} has a unique identifier id(d)N\text{id}(d) \in \mathbb{N}.

In practice, D\mathcal{D} might contain millions or billions of documents. We assume documents are immutable for query purposes; updates create new document versions.

2.2.2 Posting List Definition

A posting list is a function mapping a term (or more generally, a predicate) to the set of documents satisfying that predicate:

P:T2DP: \mathcal{T} \rightarrow 2^{\mathcal{D}}

where T\mathcal{T} is the term space. For a term tt, the posting list P(t)P(t) contains exactly those documents containing term tt:

P(t)={dDtterms(d)}P(t) = \{d \in \mathcal{D} \mid t \in \text{terms}(d)\}

Example: For documents:

  • d1d_1: "database systems"
  • d2d_2: "distributed database"
  • d3d_3: "operating systems"

The posting lists are:

  • P(database)={d1,d2}P(\text{database}) = \{d_1, d_2\}
  • P(systems)={d1,d3}P(\text{systems}) = \{d_1, d_3\}
  • P(distributed)={d2}P(\text{distributed}) = \{d_2\}
  • P(operating)={d3}P(\text{operating}) = \{d_3\}

2.2.3 Boolean Algebra of Posting Lists

Posting lists form a Boolean algebra under set operations:

Conjunction (AND): Documents matching both predicates

P(t1)P(t2)={dDt1terms(d)t2terms(d)}P(t_1) \cap P(t_2) = \{d \in \mathcal{D} \mid t_1 \in \text{terms}(d) \land t_2 \in \text{terms}(d)\}

Disjunction (OR): Documents matching either predicate

P(t1)P(t2)={dDt1terms(d)t2terms(d)}P(t_1) \cup P(t_2) = \{d \in \mathcal{D} \mid t_1 \in \text{terms}(d) \lor t_2 \in \text{terms}(d)\}

Negation (NOT): Documents not matching the predicate

P(t)=DP(t)={dDtterms(d)}\overline{P(t)} = \mathcal{D} \setminus P(t) = \{d \in \mathcal{D} \mid t \notin \text{terms}(d)\}

These operations satisfy the Boolean algebra axioms:

Commutativity:

P1P2=P2P1P_1 \cap P_2 = P_2 \cap P_1

Associativity:

(P1P2)P3=P1(P2P3)(P_1 \cap P_2) \cap P_3 = P_1 \cap (P_2 \cap P_3)

Distributivity:

P1(P2P3)=(P1P2)(P1P3)P_1 \cap (P_2 \cup P_3) = (P_1 \cap P_2) \cup (P_1 \cap P_3)

De Morgan's Laws:

P1P2=P1P2\overline{P_1 \cap P_2} = \overline{P_1} \cup \overline{P_2} P1P2=P1P2\overline{P_1 \cup P_2} = \overline{P_1} \cap \overline{P_2}

2.2.4 Scored Posting Lists

For ranking queries, we extend posting lists to carry scores:

Ps:T2D×RP_s: \mathcal{T} \rightarrow 2^{\mathcal{D} \times \mathbb{R}}

A scored posting list maps terms to sets of (document, score) pairs:

Ps(t)={(d,s)dP(t),s=score(d,t)}P_s(t) = \{(d, s) \mid d \in P(t), s = \text{score}(d, t)\}

Score combination during Boolean operations requires specification:

AND with scores: Various strategies exist

  • Minimum: min(s1,s2)\min(s_1, s_2)
  • Product: s1s2s_1 \cdot s_2 (probabilistic interpretation)
  • Sum: s1+s2s_1 + s_2 (additive combination)

OR with scores:

  • Maximum: max(s1,s2)\max(s_1, s_2)
  • Probabilistic OR: 1(1s1)(1s2)1 - (1-s_1)(1-s_2)

Chapter 19 develops these score combination strategies in depth.

2.3 Type System Formalization

A unified query system must bridge multiple data paradigms. We formalize the type system that enables this bridging.

2.3.1 Document Space

The document space D\mathcal{D} is the primary universe. Each document dDd \in \mathcal{D} is a semi-structured record with:

  • A unique identifier: id:DN\text{id}: \mathcal{D} \rightarrow \mathbb{N}
  • A schema: schema:DΣ\text{schema}: \mathcal{D} \rightarrow \Sigma
  • Field values: field:D×FV\text{field}: \mathcal{D} \times \mathcal{F} \rightarrow \mathcal{V}

where F\mathcal{F} is the field name space and V\mathcal{V} is the value space.

The value space is a discriminated union:

V=NullBoolInt64Float64StringBinaryArray(V)Object(FV)\mathcal{V} = \text{Null} \mid \text{Bool} \mid \text{Int64} \mid \text{Float64} \mid \text{String} \mid \text{Binary} \mid \text{Array}(\mathcal{V}) \mid \text{Object}(\mathcal{F} \rightarrow \mathcal{V})

2.3.2 Term Space

The term space T\mathcal{T} contains atomic units of text:

T=Σ/\mathcal{T} = \Sigma^* / \sim

where Σ\Sigma^* is the set of all strings and \sim is an equivalence relation defined by text normalization (lowercasing, stemming, etc.).

The terms function extracts terms from text:

terms:String2T\text{terms}: \text{String} \rightarrow 2^{\mathcal{T}}

This function encapsulates the text analysis pipeline (Chapter 16).

2.3.3 Vector Space

The vector space Vn\mathcal{V}_n is nn-dimensional real space:

Vn=Rn\mathcal{V}_n = \mathbb{R}^n

Documents may have vector representations through embedding functions:

embed:D×FVn\text{embed}: \mathcal{D} \times \mathcal{F} \rightarrow \mathcal{V}_n

Common metrics on vector space include:

Euclidean (L2) distance:

dL2(u,v)=i=1n(uivi)2d_{L2}(\vec{u}, \vec{v}) = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}

Cosine distance:

dcos(u,v)=1uvuvd_{\cos}(\vec{u}, \vec{v}) = 1 - \frac{\vec{u} \cdot \vec{v}}{|\vec{u}| \cdot |\vec{v}|}

Inner product (for normalized vectors):

ip(u,v)=uv=i=1nuivi\text{ip}(\vec{u}, \vec{v}) = \vec{u} \cdot \vec{v} = \sum_{i=1}^{n} u_i \cdot v_i

2.3.4 Field Space

The field space F\mathcal{F} names document attributes:

F=String\mathcal{F} = \text{String}

Fields support hierarchical naming for nested documents:

address.cityF\text{address.city} \in \mathcal{F}

Field paths form a tree structure rooted at the document.

2.3.5 Bijective Mappings

The power of the unified type system comes from explicit mappings between spaces:

Document to Posting List: Every document belongs to posting lists for its terms

PostingFor:D22D\text{PostingFor}: \mathcal{D} \rightarrow 2^{2^{\mathcal{D}}} PostingFor(d)={P(t)tterms(d)}\text{PostingFor}(d) = \{P(t) \mid t \in \text{terms}(d)\}

Term to Posting List: Direct mapping

P:T2DP: \mathcal{T} \rightarrow 2^{\mathcal{D}}

Vector to Posting List: k-NN search produces a posting list

kNN:Vn×N2D\text{kNN}: \mathcal{V}_n \times \mathbb{N} \rightarrow 2^{\mathcal{D}} kNN(q,k)=argminSD,S=kdSd(q,embed(d))\text{kNN}(\vec{q}, k) = \text{argmin}_{S \subseteq \mathcal{D}, |S|=k} \sum_{d \in S} d(\vec{q}, \text{embed}(d))

Relational Predicate to Posting List: Filter produces a posting list

σϕ:2D2D\sigma_\phi: 2^{\mathcal{D}} \rightarrow 2^{\mathcal{D}} σϕ(D)={dDϕ(d)=true}\sigma_\phi(\mathcal{D}) = \{d \in \mathcal{D} \mid \phi(d) = \text{true}\}

These mappings enable the query planner to treat operations from different paradigms uniformly.

2.4 Operator Calculus

With the type system established, we define the operators that transform posting lists.

2.4.1 Primitive Operators

Term Retrieval (τ\tau): Retrieve the posting list for a term

τ:T2D\tau: \mathcal{T} \rightarrow 2^{\mathcal{D}} τt={dDtterms(d)}\tau_t = \{d \in \mathcal{D} \mid t \in \text{terms}(d)\}

Filter (σ\sigma): Apply a predicate to filter a posting list

σ:(DBool)×2D2D\sigma: (D \rightarrow \text{Bool}) \times 2^{\mathcal{D}} \rightarrow 2^{\mathcal{D}} σϕ(P)={dPϕ(d)}\sigma_\phi(P) = \{d \in P \mid \phi(d)\}

Projection (π\pi): Select specific fields from documents

π:2F×2D2D\pi: 2^{\mathcal{F}} \times 2^{\mathcal{D}} \rightarrow 2^{\mathcal{D}'}

where D\mathcal{D}' is the space of projected documents.

Vector Search (ν\nu): Find nearest neighbors

ν:Vn×N×R2D\nu: \mathcal{V}_n \times \mathbb{N} \times \mathbb{R} \rightarrow 2^{\mathcal{D}} νq,k,ϵ={dDd(q,embed(d))<ϵ}top-k\nu_{\vec{q},k,\epsilon} = \{d \in \mathcal{D} \mid d(\vec{q}, \text{embed}(d)) < \epsilon\} \cap \text{top-}k

Score (ρ\rho): Attach scores to documents

ρ:(DR)×2D2D×R\rho: (\mathcal{D} \rightarrow \mathbb{R}) \times 2^{\mathcal{D}} \rightarrow 2^{\mathcal{D} \times \mathbb{R}} ρf(P)={(d,f(d))dP}\rho_f(P) = \{(d, f(d)) \mid d \in P\}

2.4.2 Composition as Monoid

Operators compose to form pipelines. The set of operators forms a monoid under composition:

(Op,,id)(\text{Op}, \circ, \text{id})

where:

  • Op\text{Op} is the set of operators on posting lists
  • \circ is function composition
  • id\text{id} is the identity operator

Associativity: (fg)h=f(gh)(f \circ g) \circ h = f \circ (g \circ h)

Identity: fid=idf=ff \circ \text{id} = \text{id} \circ f = f

This monoid structure enables pipeline optimization through algebraic manipulation.

2.4.3 Operator Properties

Idempotence: Applying the same filter twice is equivalent to applying it once

σϕσϕ=σϕ\sigma_\phi \circ \sigma_\phi = \sigma_\phi

Filter Commutativity: Independent filters commute

σϕ1σϕ2=σϕ2σϕ1\sigma_{\phi_1} \circ \sigma_{\phi_2} = \sigma_{\phi_2} \circ \sigma_{\phi_1}

Filter Conjunction: Sequential filters equivalent to conjunctive filter

σϕ1σϕ2=σϕ1ϕ2\sigma_{\phi_1} \circ \sigma_{\phi_2} = \sigma_{\phi_1 \land \phi_2}

Projection Pushdown: Projection through filter (when fields independent)

πAσϕ=σϕπAif fields(ϕ)A\pi_A \circ \sigma_\phi = \sigma_\phi \circ \pi_A \quad \text{if } \text{fields}(\phi) \subseteq A

Selection Pushdown: Filter through intersection

σϕ(P1P2)=σϕ(P1)σϕ(P2)\sigma_\phi(P_1 \cap P_2) = \sigma_\phi(P_1) \cap \sigma_\phi(P_2)

2.4.4 Equivalence Classes

Two operator expressions are equivalent if they produce identical results for all inputs:

e1e2    P2D:e1(P)=e2(P)e_1 \equiv e_2 \iff \forall P \in 2^{\mathcal{D}}: e_1(P) = e_2(P)

Equivalence induces equivalence classes on query plans. The optimizer explores these classes to find minimum-cost representatives.

Theorem (Completeness): The transformation rules in Section 2.4.3 are complete for the class of select-project-join queries over posting lists.

This theorem ensures the optimizer can reach any equivalent plan through rule application.

2.5 Extending to Scored Operations

Relevance ranking requires extending the algebra to handle scores.

2.5.1 Scored Posting Lists

A scored posting list is a set of (document, score) pairs:

PsD×R+P_s \subseteq \mathcal{D} \times \mathbb{R}^+

We define projection functions:

docs(Ps)={d(d,s)Ps}\text{docs}(P_s) = \{d \mid (d, s) \in P_s\} score(Ps,d)=s where (d,s)Ps\text{score}(P_s, d) = s \text{ where } (d, s) \in P_s

2.5.2 Score Combination Semantics

When combining scored posting lists, we must specify how scores combine:

Conjunction (AND): For (d,s1)P1(d, s_1) \in P_1 and (d,s2)P2(d, s_2) \in P_2

Probabilistic:

score(d)=s1s2\text{score}_{\land}(d) = s_1 \cdot s_2

Additive:

score(d)=s1+s2\text{score}_{\land}(d) = s_1 + s_2

Minimum:

score(d)=min(s1,s2)\text{score}_{\land}(d) = \min(s_1, s_2)

Disjunction (OR): For documents in P1P2P_1 \cup P_2

Probabilistic (inclusion-exclusion):

score(d)=1(1s1)(1s2)=s1+s2s1s2\text{score}_{\lor}(d) = 1 - (1 - s_1)(1 - s_2) = s_1 + s_2 - s_1 \cdot s_2

Maximum:

score(d)=max(s1,s2)\text{score}_{\lor}(d) = \max(s_1, s_2)

Sum:

score(d)=s1+s2\text{score}_{\lor}(d) = s_1 + s_2

2.5.3 Score Normalization

For probabilistic combination, scores must be in [0,1][0, 1]. Raw relevance scores (e.g., BM25) are unbounded. Chapter 17 develops Bayesian BM25 which provides calibrated probability scores suitable for probabilistic combination.

The normalization function:

normalize:R+[0,1]\text{normalize}: \mathbb{R}^+ \rightarrow [0, 1]

must preserve relative ordering:

s1>s2    normalize(s1)>normalize(s2)s_1 > s_2 \implies \text{normalize}(s_1) > \text{normalize}(s_2)

2.6 Categorical Perspective

For readers familiar with category theory, we briefly sketch the categorical interpretation of the unified algebra.

2.6.1 Category of Posting Lists

Define category Post\mathbf{Post}:

  • Objects: Posting lists P2DP \in 2^{\mathcal{D}}
  • Morphisms: Set functions preserving document identity

This category is equivalent to the category of finite sets and functions.

2.6.2 Functors Between Paradigms

Each data paradigm defines a functor to Post\mathbf{Post}:

Text Functor FT:TermPostF_T: \mathbf{Term} \rightarrow \mathbf{Post}

FT(t)=τtF_T(t) = \tau_t

Vector Functor FV:VecPostF_V: \mathbf{Vec} \rightarrow \mathbf{Post}

FV(q,k)=kNN(q,k)F_V(\vec{q}, k) = \text{kNN}(\vec{q}, k)

Relational Functor FR:RelPostF_R: \mathbf{Rel} \rightarrow \mathbf{Post}

FR(ϕ)=σϕ(D)F_R(\phi) = \sigma_\phi(\mathcal{D})

2.6.3 Natural Transformations

Optimization rules are natural transformations between functors:

η:FG\eta: F \Rightarrow G

For example, the transformation from nested-loop join to hash join is a natural transformation in the category of query plans.

2.6.4 Adjunctions

The free/forgetful adjunction captures the relationship between structured queries and their posting list representations:

FreeForget:QueryPost\text{Free} \dashv \text{Forget}: \mathbf{Query} \rightleftharpoons \mathbf{Post}

This adjunction formalizes how queries generate posting lists and how posting list operations induce query transformations.

2.7 Implementation Considerations

The mathematical framework guides efficient implementation.

2.7.1 Posting List Representation

Posting lists are stored as sorted arrays of document IDs:

PostingList = [doc_id_1, doc_id_2, ..., doc_id_n]  // sorted

Sorted order enables:

  • Binary search for membership: O(logn)O(\log n)
  • Merge intersection: O(n+m)O(n + m)
  • Merge union: O(n+m)O(n + m)

2.7.2 Skip Pointers

For large posting lists, skip pointers accelerate intersection:

PostingList with Skips:
  [doc_id_1, ..., doc_id_k] -> skip to doc_id_{k*s}
  [doc_id_{k*s+1}, ..., doc_id_{2*k*s}] -> skip to doc_id_{2*k*s}
  ...

Skip distance s=ns = \sqrt{n} minimizes worst-case intersection time.

2.7.3 Compression

Document IDs compress through delta encoding:

Original: [100, 105, 110, 200, 250]
Deltas:   [100, 5, 5, 90, 50]

Variable-byte encoding further reduces space:

  • Small deltas: 1 byte
  • Large deltas: 2-4 bytes

2.7.4 Block-Based Processing

Modern CPUs process data fastest in cache-line-sized blocks. Posting lists partition into blocks:

Block 1: [doc_1, ..., doc_128]  max_score=0.95
Block 2: [doc_129, ..., doc_256]  max_score=0.87
...

Block-level metadata enables pruning in top-k queries (Chapter 20).

2.8 Summary

This chapter established the mathematical foundations for unified query processing:

  1. Boolean algebra of sets provides the framework for posting list operations, with well-defined intersection, union, and complement satisfying algebraic axioms

  2. Type system spans documents, terms, vectors, and fields with explicit mappings between spaces, enabling unified representation of multi-paradigm data

  3. Operator calculus defines primitive operators (term retrieval, filter, projection, vector search, scoring) that compose into query pipelines with algebraic properties enabling optimization

  4. Scored extension handles relevance ranking through scored posting lists with various combination semantics (probabilistic, additive, min/max)

  5. Categorical perspective reveals the deep structure: functors map paradigms to posting lists, natural transformations capture optimization rules, and adjunctions formalize the query-posting list relationship

The following chapter extends this framework to incorporate graph structures, proving that graph posting lists preserve the algebraic properties while enabling traversal and pattern matching operations.

Copyright (c) 2023-2026 Cognica, Inc.