Chapter 4: Query Optimization Theory
This chapter develops the theoretical foundations for cost-based query optimization in unified systems. We formalize the query space as a complete lattice, establish equivalence-preserving transformations, build cost models spanning multiple paradigms, and develop selectivity estimation techniques. The chapter concludes with a category-theoretic perspective that reveals the deep structure underlying query optimization.
4.1 The Query Optimization Problem
Query optimization is the process of transforming a declarative query into an efficient execution plan. The declarative query specifies what data to retrieve; optimization determines how to retrieve it.
4.1.1 Search Space Explosion
Consider a simple join of five tables:
SELECT * FROM A, B, C, D, E
WHERE A.x = B.x AND B.y = C.y AND C.z = D.z AND D.w = E.w;
The number of possible join orderings is $\frac{(2(n-1))!}{(n-1)!}$ for $n$ relations, counting all ordered binary join trees; for $n = 5$ this is $8!/4! = 1680$.
For each ordering, multiple physical implementations exist (nested loop, hash join, merge join). With index choices, the search space grows exponentially.
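The standard count of ordered binary join trees over $n$ relations, $(2(n-1))!/(n-1)!$, can be checked with a short script (the `join_orderings` helper is our own name, not from the text):

```python
from math import factorial

def join_orderings(n: int) -> int:
    # Ordered binary join trees over n relations: (2(n-1))! / (n-1)!
    # (Catalan count of tree shapes times the n! leaf orderings).
    return factorial(2 * (n - 1)) // factorial(n - 1)

for n in range(2, 7):
    print(n, join_orderings(n))  # 2 -> 2, 3 -> 12, 4 -> 120, 5 -> 1680, 6 -> 30240
```

Even before physical operators and index choices multiply the space further, the logical orderings alone grow super-exponentially.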
4.1.2 Multi-Paradigm Complexity
In a unified system, optimization spans paradigms:
SELECT p.title, bm25_score(p.content) as relevance
FROM papers p
WHERE MATCH(p.content) AGAINST ('distributed systems')
AND vector_similarity(p.embedding, ?) > 0.8
AND p.author_id IN (
SELECT vertex_id FROM graph_traverse(:professor, 'ADVISED', 3)
)
ORDER BY relevance DESC
LIMIT 10;
This query involves:
- Full-text search (MATCH...AGAINST)
- Vector similarity (embedding comparison)
- Graph traversal (advisor relationships)
- Relational filtering (author constraint)
The optimizer must reason about posting list intersections, index selection across paradigms, and cost trade-offs between different execution strategies.
4.1.3 Optimization Objectives
A query optimizer seeks to minimize a cost function:
$C(P) = w_{IO} \cdot C_{IO}(P) + w_{CPU} \cdot C_{CPU}(P) + w_{mem} \cdot C_{mem}(P)$
where:
- $C_{IO}$ is I/O cost (disk reads/writes)
- $C_{CPU}$ is CPU cost (operations performed)
- $C_{mem}$ is memory cost (working memory required)
- $w_{IO}$, $w_{CPU}$, $w_{mem}$ are weights reflecting system characteristics
Different systems weight these differently: OLTP systems prioritize latency (minimize per-query cost), OLAP systems prioritize throughput (minimize cost per row), memory-constrained systems bound $C_{mem}$.
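A minimal sketch of the weighted objective; the weight values below are illustrative assumptions, not calibrated constants:

```python
def plan_cost(c_io: float, c_cpu: float, c_mem: float,
              w_io: float, w_cpu: float, w_mem: float) -> float:
    # C(P) = w_io*C_io + w_cpu*C_cpu + w_mem*C_mem
    return w_io * c_io + w_cpu * c_cpu + w_mem * c_mem

# Hypothetical plan: 1000 page reads, 50k CPU ops, 256 pages of working memory.
# An OLTP-style weighting penalizes I/O heavily; an OLAP-style one weights CPU more.
oltp = plan_cost(1000, 50_000, 256, w_io=1.0, w_cpu=0.001, w_mem=0.01)
olap = plan_cost(1000, 50_000, 256, w_io=0.1, w_cpu=0.01, w_mem=0.01)
print(oltp, olap)  # 1052.56 602.56
```

The same physical plan thus ranks differently under different weightings, which is why the weights are system configuration, not constants.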
4.2 Query Space as Complete Lattice
We formalize the space of query plans as a mathematical structure amenable to optimization algorithms.
4.2.1 Logical Query Plans
A logical query plan is a tree of relational operators: selection ($\sigma$), projection ($\pi$), join ($\bowtie$), union ($\cup$), aggregation ($\gamma$), plus paradigm-specific operators such as FTSSearch and GraphTraverse.
Each operator has:
- Input schema(s)
- Output schema
- Semantic meaning (what it computes)
Example logical plan:
Limit(10)
  Sort(relevance DESC)
    Project(title, relevance)
      Filter(vector_sim > 0.8)
        Join(author_id = vertex_id)
          FTSSearch(content, 'distributed systems')
            Scan(papers)
          GraphTraverse(professor, ADVISED, 3)
4.2.2 Physical Query Plans
A physical query plan specifies concrete implementations:
Each physical operator has:
- Implementation algorithm
- Cost characteristics
- Resource requirements
Example physical plan:
HeapTopK(10, relevance DESC)
  Project(title, relevance)
    HashJoin(author_id = vertex_id)
      PostingListIntersect
        WANDSearch(content, 'distributed systems', k=1000)
        HNSWSearch(embedding, query_vec, k=1000, threshold=0.8)
      BFSTraversal(professor, ADVISED, max_depth=3)
4.2.3 Plan Equivalence
Two plans are equivalent if they produce identical results for all database states: $P_1 \equiv P_2 \iff \forall D: \text{eval}(P_1, D) = \text{eval}(P_2, D)$.
Equivalence induces equivalence classes on the plan space.
4.2.4 Partial Ordering by Cost
Define a partial order $\preceq$ on plans by estimated cost: $P_1 \preceq P_2 \iff \text{cost}(P_1) \leq \text{cost}(P_2)$.
Within an equivalence class, the optimizer seeks the minimum element under $\preceq$.
4.2.5 Lattice Structure
Theorem 4.1 (Query Plan Lattice): The space of query plans for a given query, ordered by cost, forms a complete lattice.
Proof sketch:
- Bottom: The optimal plan (minimum cost)
- Top: The worst plan (maximum cost)
- Meet: Given plans $P_1, P_2$, their meet $P_1 \wedge P_2$ is the cheaper of any common refinement
- Join: The join $P_1 \vee P_2$ is the more expensive plan from which both are reachable by cost-reducing transformations
The lattice structure guarantees that local search algorithms (like dynamic programming) can find global optima under appropriate conditions.
4.2.6 Plan Space Navigation
Optimization algorithms navigate the lattice:
Dynamic Programming (DPccp): Bottom-up construction of optimal subplans, guaranteed to find global optimum for join ordering.
Transformation-Based: Apply equivalence rules to transform plans, hill-climbing toward lower cost.
Randomized: Simulated annealing or genetic algorithms for large search spaces.
4.3 Equivalence-Preserving Transformations
Transformations convert one plan to an equivalent plan, enabling search through the plan space.
4.3.1 Selection Pushdown
Rule: Push selections through projections when possible: $\sigma_p(\pi_A(R)) \equiv \pi_A(\sigma_p(R))$, valid when $p$ references only attributes in $A$.
Benefit: Filtering early reduces intermediate result sizes.
Example:
Before: Project(name, price) -> Filter(price > 100) -> Scan(products)
After: Project(name, price) -> Scan(products) with pushed filter price > 100
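The rule is mechanical enough to sketch as a tiny rewriter. Below, plans are encoded as nested tuples and only the single rule $\sigma_p(\pi_A(R)) \equiv \pi_A(\sigma_p(R))$ is applied; the encoding and the `push_selection` helper are our own illustration, not the book's optimizer:

```python
# Plans as nested tuples:
#   ("Scan", table)
#   ("Filter", predicate, child)
#   ("Project", columns, child)

def push_selection(plan):
    """Rewrite Filter(Project(child)) -> Project(Filter(child)).
    Valid when the predicate references only projected columns, which holds
    trivially here because the filter originally sat above the projection."""
    if plan[0] == "Filter" and plan[2][0] == "Project":
        _, pred, (_tag, cols, child) = plan
        return ("Project", cols, ("Filter", pred, child))
    return plan  # rule does not apply; leave the plan unchanged

before = ("Filter", "price > 100",
          ("Project", ["name", "price"], ("Scan", "products")))
after = push_selection(before)
print(after)
```

A production optimizer applies a whole library of such rules to a fixpoint; this shows the shape of one rule application.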
4.3.2 Selection Split and Merge
Rule: Conjunctive predicates can split or merge: $\sigma_{p_1 \wedge p_2}(R) \equiv \sigma_{p_1}(\sigma_{p_2}(R))$.
Benefit: Enables independent optimization of each predicate.
4.3.3 Selection Commutativity
Rule: Independent selections commute: $\sigma_{p_1}(\sigma_{p_2}(R)) \equiv \sigma_{p_2}(\sigma_{p_1}(R))$.
Benefit: Enables ordering by selectivity (most selective first).
4.3.4 Join Commutativity
Rule: Joins are commutative: $R \bowtie S \equiv S \bowtie R$.
Benefit: Choose build/probe sides for hash join based on size.
4.3.5 Join Associativity
Rule: Joins are associative: $(R \bowtie S) \bowtie T \equiv R \bowtie (S \bowtie T)$.
Benefit: Enables exploring all join orderings.
4.3.6 Selection-Join Exchange
Rule: Push selection through join: $\sigma_p(R \bowtie S) \equiv \sigma_p(R) \bowtie S$ when $p$ references only attributes of $R$.
Benefit: Reduces join input sizes.
4.3.7 Projection Pushdown
Rule: Push projection through join: $\pi_A(R \bowtie S) \equiv \pi_A(\pi_{A_R \cup J}(R) \bowtie \pi_{A_S \cup J}(S))$
where $J$ = join attributes, $A_R = A \cap \text{attrs}(R)$, $A_S = A \cap \text{attrs}(S)$.
Benefit: Reduces data width in intermediate results.
4.3.8 Cross-Paradigm Transformations
Unified systems enable novel transformations:
FTS-Filter Interchange: $\sigma_p(\text{FTS}_q(R)) \equiv \text{FTS}_q(\sigma_p(R))$ — a relational filter may be applied before or after full-text matching.
Vector-Filter Interchange: $\sigma_p(\text{kNN}_k(R))$ versus $\text{kNN}_{k'}(\sigma_p(R))$.
Note: Vector search may require $k' > k$ to account for filtered-out results.
Graph-Relational Interchange: $\sigma_p(\text{Traverse}(v, e, d)) \equiv \text{Traverse}_p(v, e, d)$
when filtering can push into the traversal.
4.4 Cost Model Fundamentals
Cost models estimate execution cost without actually running queries.
4.4.1 I/O Cost Model
Sequential Read Cost: $C_{seq} = N_{pages} \cdot c_{seq}$
where $N_{pages}$ = pages read, $c_{seq}$ = cost per sequential page read.
Random Read Cost: $C_{rand} = N_{pages} \cdot c_{rand}$
where $c_{rand} \gg c_{seq}$ (typically 10-100x for HDD, 2-10x for SSD).
Index Scan Cost: $C_{idx} = c_{lookup} + N_{match} \cdot c_{rand}$
Index lookup plus data page fetches for matching rows.
Sequential Scan Cost: $C_{scan} = N_{pages}(T) \cdot c_{seq}$
4.4.2 CPU Cost Model
Filter Cost: $C_{filter} = N_{rows} \cdot c_{pred}$
where $c_{pred}$ depends on predicate complexity.
Hash Table Build Cost: $C_{build} = N_{build} \cdot c_{hash}$
Hash Probe Cost: $C_{probe} = N_{probe} \cdot c_{hash}$
Sort Cost: $C_{sort} = N \log N \cdot c_{cmp}$
4.4.3 Memory Cost Model
Hash Join Memory: $M_{HJ} = |R_{build}| \cdot (\text{tuple size} + \text{hash overhead})$
If $M_{HJ} > M_{avail}$, spill to disk.
Sort Memory: $M_{sort} = \min(|R| \cdot \text{tuple size}, M_{avail})$
External sort required when input exceeds memory.
4.4.4 Join Cost Models
In the following, $B(\cdot)$ denotes pages and $|\cdot|$ denotes rows.
Nested Loop Join: $C_{NLJ} = B(R) + |R| \cdot B(S)$
Hash Join: $C_{HJ} = B(R) + B(S)$ (one pass, build side fits in memory)
Merge Join (sorted inputs): $C_{MJ} = B(R) + B(S)$
Decision Criterion:
- Use NLJ when $|R|$ is small or the inner side is indexed
- Use HJ when memory is sufficient for the build side
- Use MJ when inputs are pre-sorted or the sort cost is amortized
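The decision criterion can be sketched as a cost comparison over simplified page-based formulas. The 4-reads-per-probe and 3x sort factors below are rough assumptions for illustration, not measured constants:

```python
def pick_join(r_pages, r_rows, s_pages, inner_indexed=False, presorted=False):
    """Return the name of the cheapest join algorithm under a toy cost model."""
    costs = {"hash": r_pages + s_pages}  # one pass; build side fits in memory
    if inner_indexed:
        # ~4 random page reads per outer-row probe (assumed B-tree height)
        costs["indexed_nested_loop"] = r_pages + r_rows * 4
    else:
        # block nested loop: rescan S once per page of R
        costs["nested_loop"] = r_pages + r_pages * s_pages
    # crude 3x penalty when both inputs must first be sorted
    costs["merge"] = (r_pages + s_pages) * (1 if presorted else 3)
    return min(costs, key=costs.get)

print(pick_join(r_pages=100, r_rows=10_000, s_pages=500))               # hash
print(pick_join(r_pages=1, r_rows=10, s_pages=500, inner_indexed=True)) # indexed_nested_loop
```

With a large unsorted probe side the hash join wins; with a tiny outer and an index on the inner, indexed nested loop wins, matching the criterion above.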
4.4.5 Posting List Operation Costs
Intersection Cost: $C_{\cap} = O(|L_1| + |L_2|)$ for a linear merge of sorted posting lists.
With skip pointers, seeks skip past non-matching regions, approaching $O(\min(|L_1|, |L_2|))$ when list sizes are skewed.
Union Cost: $C_{\cup} = O(|L_1| + |L_2|)$
Linear merge of sorted lists.
FTS Scoring Cost: $C_{score} = N_{dt} \cdot c_{score}$
Score computation for each of the $N_{dt}$ document-term pairs.
4.4.6 Vector Search Cost
HNSW Search Cost: $C_{HNSW} = O(ef \cdot \log N \cdot c_{dist})$
where $ef$ = search width, $N$ = index size, $c_{dist}$ = distance computation cost.
Brute Force Cost: $C_{BF} = O(N \cdot d)$
where $d$ = vector dimension.
4.4.7 Graph Traversal Cost
BFS Cost: $C_{BFS} = O\left(\sum_{i=1}^{d} \bar{b}^{\,i}\right)$
where $d$ = depth, $\bar{b}$ = average degree.
With Filter: $C = O\left(\sum_{i=1}^{d} (s \cdot \bar{b})^{i}\right)$
Filter selectivity $s$ reduces the effective degree at each hop.
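A back-of-envelope evaluation of the hop-by-hop expansion model (the `bfs_cost` helper is our own; real traversals also deduplicate revisited vertices, which this ignores):

```python
def bfs_cost(avg_degree, depth, sel=1.0, graph_size=None):
    """Vertices touched by a depth-d expansion from one start vertex:
    1 + sum_{i=1..d} (sel * avg_degree)^i, optionally capped by |V|."""
    total = frontier = 1.0
    for _ in range(depth):
        frontier *= sel * avg_degree   # expected next-frontier size
        total += frontier
    return min(total, graph_size) if graph_size is not None else total

print(bfs_cost(10, 3))           # 1 + 10 + 100 + 1000 = 1111.0
print(bfs_cost(10, 3, sel=0.5))  # 1 + 5 + 25 + 125 = 156.0
```

Halving the effective degree via a per-hop filter shrinks the 3-hop cost by roughly an order of magnitude, which is why pushing predicates into traversals matters.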
4.5 Cardinality Estimation
Accurate cardinality estimation is critical for cost-based optimization.
4.5.1 Base Table Statistics
For each table $T$, maintain:
- $|T|$: Row count
- $V(T, a)$: Distinct values for attribute $a$
- $\min(a)$, $\max(a)$: Value range
- $H(a)$: Value histogram
4.5.2 Single Predicate Selectivity
Equality: $sel(a = v) = \frac{1}{V(T, a)}$
Range: $sel(a < v) = \frac{v - \min(a)}{\max(a) - \min(a)}$
With histogram: Use histogram bucket frequencies for more accurate estimation.
4.5.3 Compound Predicate Selectivity
Independence Assumption: $sel(p_1 \wedge p_2) = sel(p_1) \cdot sel(p_2)$
With Correlation: $sel(p_1 \wedge p_2) = sel(p_1) \cdot sel(p_2) \cdot \rho$
where $\rho$ captures dependency (1 = independent, >1 = positive correlation, <1 = negative correlation).
4.5.4 Join Cardinality
Foreign Key Join: $|R \bowtie S| = |R|$ when $R$ holds a foreign key into $S$.
Each row in $R$ matches exactly one row in $S$.
General Equijoin: $|R \bowtie_{a=b} S| = \frac{|R| \cdot |S|}{\max(V(R, a), V(S, b))}$
Assuming uniform distribution.
With Statistics: Refine the estimate per histogram bucket, $|R \bowtie S| \approx \sum_{b} \frac{f_R(b) \cdot f_S(b)}{\max(V_R(b), V_S(b))}$, summing over aligned buckets $b$.
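The uniform-distribution equijoin estimate is one line of arithmetic; a sketch (helper name is ours):

```python
def equijoin_card(r_rows, s_rows, v_r, v_s):
    """|R join_{a=b} S| ~= |R|*|S| / max(V(R,a), V(S,b)), assuming uniform
    value distributions and containment of the smaller domain in the larger."""
    return r_rows * s_rows / max(v_r, v_s)

# Foreign-key case: 1M orders, 50k customers, join on customer_id.
# Each order matches exactly one customer, so the estimate collapses to |orders|.
print(equijoin_card(r_rows=1_000_000, s_rows=50_000, v_r=50_000, v_s=50_000))  # 1000000.0
```

Note how the general formula subsumes the foreign-key case: when the join key is the primary key of $S$, $V(S, b) = |S|$ and the estimate reduces to $|R|$.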
4.5.5 FTS Cardinality
Term Query: $|q_t| = df(t)$
Document frequency from the inverted index.
Conjunction: $|q_{t_1 \wedge t_2}| = N \cdot \frac{df(t_1)}{N} \cdot \frac{df(t_2)}{N}$
Independence assumption.
With Co-occurrence Statistics: $|q_{t_1 \wedge t_2}| = cooc(t_1, t_2)$
Pre-computed co-occurrence counts.
4.5.6 Vector Search Cardinality
k-NN Search: $|result| = k$
By definition.
Threshold Search: $|result| = N \cdot P(sim(x, q) > \tau)$
Requires a distribution model for similarity scores.
4.5.7 Graph Traversal Cardinality
k-Hop Traversal: $|reach_k(v)| \approx \min(\bar{b}^{\,k}, |V|)$
Exponential growth bounded by graph size.
With Selectivity: $|reach_k(v)| \approx \min((s \cdot \bar{b})^{k}, |V|)$
Filter selectivity $s$ applies at each hop.
4.6 Selectivity Estimation Techniques
4.6.1 Histograms
Equi-width Histogram: Divide value range into equal-width buckets.
Equi-depth Histogram: Each bucket contains equal number of values.
Better for skewed distributions.
Compressed Histogram: Store frequent values separately, histogram for remainder.
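An equi-depth histogram and the range-selectivity estimate it supports can be sketched in a few lines; the two helpers below are illustrative names, not a library API:

```python
import bisect

def build_equidepth(values, n_buckets):
    """Bucket boundaries for an equi-depth histogram: each bucket holds
    roughly len(values)/n_buckets rows."""
    vals = sorted(values)
    return [vals[min(round(i * len(vals) / n_buckets), len(vals) - 1)]
            for i in range(n_buckets + 1)]

def sel_less_than(bounds, v):
    """Estimate sel(a < v): each bucket carries 1/n_buckets of the rows;
    interpolate linearly inside the bucket that contains v."""
    n = len(bounds) - 1
    if v <= bounds[0]:
        return 0.0
    if v >= bounds[-1]:
        return 1.0
    i = bisect.bisect_right(bounds, v) - 1
    width = (bounds[i + 1] - bounds[i]) or 1  # guard duplicate boundaries
    return (i + (v - bounds[i]) / width) / n

bounds = build_equidepth(range(100), 4)
print(bounds, sel_less_than(bounds, 30))
```

Because each bucket covers an equal row count rather than an equal value span, the same code stays accurate on skewed data where an equi-width histogram would not.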
4.6.2 Sampling
Random Sampling: Estimate selectivity from a sample: $\hat{s} = \frac{|\{t \in \text{sample} : p(t)\}|}{|\text{sample}|}$
Confidence Interval: $\hat{s} \pm z_{\alpha/2}\sqrt{\frac{\hat{s}(1 - \hat{s})}{n}}$ for sample size $n$.
4.6.3 Sketches
Count-Min Sketch: Estimate frequency of items.
Space: $O\left(\frac{1}{\epsilon}\log\frac{1}{\delta}\right)$ for $\epsilon$-approximation with probability $1 - \delta$.
HyperLogLog: Estimate cardinality (distinct count).
Space: $O\left(\frac{1}{\epsilon^2}\log\log N\right)$ for relative error $\epsilon$.
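A minimal Count-Min sketch, as a sketch of the idea rather than a production implementation (it uses Python's built-in `hash` with per-row salts; real systems use dedicated hash families):

```python
import random

class CountMin:
    """d rows of w counters; a point query takes the minimum over rows.
    The estimate never underestimates, and overestimates by at most
    eps*N with probability 1-delta for w = ceil(e/eps), d = ceil(ln(1/delta))."""
    def __init__(self, width, depth, seed=42):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _idx(self, item, r):
        return hash((self.salts[r], item)) % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.rows[r][self._idx(item, r)] += count

    def query(self, item):
        return min(self.rows[r][self._idx(item, r)] for r in range(self.depth))

cm = CountMin(width=1024, depth=4)
for term, n in [("db", 100), ("ai", 7)]:
    cm.add(term, n)
print(cm.query("db"))  # >= 100: never underestimates
```

An optimizer can feed such estimates into the selectivity formulas of Section 4.5 without storing exact per-value frequencies.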
4.6.4 Machine Learning Approaches
Query-Driven Learning: Train models on query workload.
Features include: predicate type, column statistics, query structure.
Learned Cardinality Estimation: Neural networks trained on actual cardinalities.
4.7 The DPccp Algorithm
Dynamic Programming for Connected Subgraph Complement Pairs (DPccp) is the standard algorithm for optimal join ordering.
4.7.1 Problem Formulation
Given relations $R_1, \dots, R_n$ and join graph $G = (V, E)$ where vertices are relations and edges are join predicates:
Goal: Find minimum-cost join tree.
4.7.2 Algorithm
Algorithm: DPccp(R_1, ..., R_n, G)
  // Initialize single relations
  for i in 1..n:
    opt[{R_i}] = AccessPath(R_i)
  // Enumerate connected subgraph complement pairs
  for S in subsets({R_1, ..., R_n}) ordered by size:
    for (S_1, S_2) in ccp_pairs(S, G):
      for each join algorithm J:
        cost = C(J, opt[S_1], opt[S_2])
        if cost < opt[S].cost:
          opt[S] = (J, opt[S_1], opt[S_2], cost)
  return opt[{R_1, ..., R_n}]
ccp_pairs(S, G): Enumerate pairs $(S_1, S_2)$ where $S_1 \cup S_2 = S$, $S_1 \cap S_2 = \emptyset$, and both $S_1$ and $S_2$ are connected in $G$.
4.7.3 Complexity
- Subsets: $O(2^n)$
- CCP pairs: from $O(n^3)$ in total for chain join graphs up to $O(3^n)$ for cliques
- Total: $O(3^n)$ in the worst case
For large $n$, heuristics or randomized algorithms become necessary.
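The dynamic-programming core can be sketched compactly. The version below is the simpler DPsub variant (it enumerates all subset splits rather than only connected csg-cmp pairs, and uses sum-of-intermediate-sizes as the cost), so it illustrates the recurrence rather than DPccp's pruned enumeration:

```python
from itertools import combinations

def dp_join_order(card, sel):
    """card: {relation: row count}; sel: {frozenset({a, b}): join selectivity}.
    Returns (cost, join tree) minimizing the sum of intermediate result sizes.
    Missing selectivities default to 1.0 (cross product)."""
    rels = list(card)

    def out_size(s):  # estimated rows after joining the set s
        rows = 1.0
        for r in s:
            rows *= card[r]
        for a, b in combinations(s, 2):
            rows *= sel.get(frozenset({a, b}), 1.0)
        return rows

    best = {frozenset({r}): (0.0, r) for r in rels}
    for size in range(2, len(rels) + 1):
        for s in map(frozenset, combinations(rels, size)):
            for k in range(1, size // 2 + 1):
                for left in map(frozenset, combinations(sorted(s), k)):
                    right = s - left
                    cost = best[left][0] + best[right][0] + out_size(s)
                    if s not in best or cost < best[s][0]:
                        best[s] = (cost, (best[left][1], best[right][1]))
    return best[frozenset(rels)]

cost, tree = dp_join_order(
    {"A": 1000, "B": 10, "C": 100},
    {frozenset({"A", "B"}): 0.01, frozenset({"B", "C"}): 0.1},
)
print(cost, tree)
```

On this toy instance the optimum joins the two connected small relations first and avoids the A-C cross product, exactly the behavior the subset DP is designed to find.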
4.7.4 Extensions for Unified Queries
DPccp extends to multi-paradigm queries by:
- Treating FTS/Vector/Graph operations as "virtual relations"
- Modeling posting list intersections as joins
- Including paradigm-specific cost models
4.8 Category-Theoretic Perspective
For readers with category theory background, we sketch the categorical view of query optimization.
4.8.1 Category of Queries
Define category :
- Objects: Query plans (logical or physical)
- Morphisms: Equivalence-preserving transformations
This forms a groupoid (every morphism is invertible) since transformations preserve equivalence.
4.8.2 Cost Functor
The cost function defines a functor: $C : \mathcal{Q} \to (\mathbb{R}_{\geq 0}, \leq)$
mapping query plans to their estimated costs.
Functoriality: Transformations that preserve equivalence should have predictable cost effects: $C(t(P)) = C(P) + \Delta_t$
where $\Delta_t$ is the cost delta of transformation $t$.
4.8.3 Natural Transformations as Optimizations
An optimization strategy is a natural transformation: $\eta : \mathrm{Id}_{\mathcal{Q}} \Rightarrow \mathrm{Opt}$
where $\mathrm{Opt} : \mathcal{Q} \to \mathcal{Q}$ is the "optimized plan" functor.
Naturality ensures optimization is consistent across equivalent query formulations.
4.8.4 Adjunctions in Query Processing
Free-Forgetful Adjunction: $F \dashv U$
- $F$ (free functor): Compile SQL to logical plan
- $U$ (forgetful functor): Extract SQL semantics from plan
The adjunction captures that plans are "free algebras" over SQL expressions.
Embedding-Projection Adjunction: $E \dashv P$
- $E$ (embed): Logical plan as abstract physical plan
- $P$ (project): Physical plan's logical semantics
This adjunction formalizes the relationship between logical and physical planning.
4.8.5 Monad of Query Execution
Query execution forms a monad $(T, \eta, \mu)$:
with:
- $\eta : P \to T(P)$ (plan becomes executable)
- $\mu : T(T(P)) \to T(P)$ (flatten nested execution)
The monad laws ensure consistent execution semantics.
4.9 Practical Optimization Architecture
4.9.1 Two-Phase Optimization
Phase 1: Logical Optimization
- Apply transformation rules
- Generate alternative logical plans
- Prune obviously suboptimal plans
Phase 2: Physical Optimization
- Select physical operators
- Choose access paths
- Determine join algorithms
4.9.2 Rule-Based Optimization
Apply transformation rules in priority order:
- Predicate simplification: Constant folding, contradiction detection
- Predicate pushdown: Move filters toward data sources
- Projection pruning: Remove unused columns
- Join elimination: Remove unnecessary joins (e.g., to unique key)
- Subquery unnesting: Convert correlated subqueries to joins
4.9.3 Cost-Based Optimization
After rule-based transformations:
- Enumerate join orderings (DPccp or heuristic)
- For each ordering, select physical operators
- Estimate cost of each plan
- Select minimum-cost plan
4.9.4 Adaptive Optimization
Modern optimizers adapt during execution:
Adaptive Join Selection: Switch join algorithm based on runtime cardinalities.
Adaptive Parallelism: Adjust parallelism based on observed throughput.
Re-optimization: Re-plan the query mid-execution if cardinality estimates prove badly wrong.
4.10 Cross-Paradigm Optimization
4.10.1 Unified Cost Model
Cross-paradigm optimization requires a unified cost model: $C_{total} = C_{rel} + C_{FTS} + C_{vec} + C_{graph} + C_{int}$
The integration cost $C_{int}$ captures posting list operations that combine paradigm results.
4.10.2 Paradigm Selection
Given a predicate expressible in multiple paradigms, choose the cheapest: $\text{paradigm}^{*} = \arg\min_i C_i(\text{predicate})$
Example: Find documents where category = 'electronics'
Options:
- Relational: Secondary index scan
- FTS: Term query on category field
- Graph: Label-based vertex filter
Cost comparison: estimate $C_{rel}$, $C_{FTS}$, and $C_{graph}$ for the predicate and select the minimum.
4.10.3 Intersection Ordering
For conjunctive queries across paradigms:
WHERE MATCH(content) AGAINST ('database') -- FTS
AND category = 'tech' -- Relational
AND vector_sim(embedding, ?) > 0.8 -- Vector
Order intersections by:
- Selectivity (most selective first)
- Evaluation cost (cheapest first, tie-breaker)
If FTS selectivity = 0.01, relational selectivity = 0.1, vector selectivity = 0.05:
Order: FTS (0.01) -> Vector (0.05) -> Relational (0.1)
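The ordering rule reduces to a sort key; a sketch with the selectivities from the example above and assumed per-row evaluation costs:

```python
def order_predicates(preds):
    """Order conjunctive predicates by selectivity ascending (most selective
    first), breaking ties by per-row evaluation cost.
    preds: list of (name, selectivity, eval_cost)."""
    return sorted(preds, key=lambda p: (p[1], p[2]))

plan = order_predicates([
    ("relational: category = 'tech'", 0.10, 1.0),
    ("fts: MATCH 'database'",         0.01, 2.0),
    ("vector: sim > 0.8",             0.05, 8.0),
])
print([name for name, _, _ in plan])  # fts first, then vector, then relational
```

Richer models weight selectivity against evaluation cost (e.g. rank by cost/(1 - selectivity)) rather than using cost only as a tie-breaker; the pure-selectivity ordering here matches the text's example.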
4.10.4 Materialization Points
Decide where to materialize intermediate results:
Eager Materialization: Materialize after each paradigm operation.
- Pro: Simple, predictable memory
- Con: May materialize large intermediate results
Lazy Materialization: Defer materialization until necessary.
- Pro: Avoids unnecessary work
- Con: Complex planning, potential repeated evaluation
Hybrid: Materialize based on estimated intermediate sizes.
4.11 Summary
This chapter established the theoretical foundations for query optimization:
- Query space as lattice formalizes the search space with partial ordering by cost, enabling systematic exploration toward optimal plans
- Equivalence-preserving transformations include classical rules (selection pushdown, join reordering) extended with cross-paradigm transformations for unified systems
- Cost models span I/O, CPU, and memory costs for relational, FTS, vector, and graph operations, enabling unified cost estimation
- Selectivity estimation uses histograms, sampling, and sketches for cardinality prediction, critical for accurate cost estimation
- The DPccp algorithm provides optimal join ordering through dynamic programming over connected subgraph complement pairs
- The category-theoretic perspective reveals query optimization as navigation through a category of plans with cost as a functor and optimization strategies as natural transformations
- Cross-paradigm optimization requires unified cost models, paradigm selection, intersection ordering, and materialization decisions
The following chapters (Part II) examine the storage engine that underlies all these operations, starting with LSM-tree architecture in Chapter 5.