Chapter 33: Performance Engineering
Performance engineering encompasses the systematic discipline of designing, implementing, and tuning database systems for optimal throughput and latency. Unlike ad-hoc optimization, performance engineering applies principled techniques—cost models, statistics, and algorithmic analysis—to make informed decisions throughout the query lifecycle. This chapter examines Cognica's performance engineering infrastructure, from cost-based query optimization to JIT compilation decisions.
33.1 Performance Architecture Overview
Cognica's performance architecture spans multiple layers, and its optimization pipeline follows a principled approach:
- Statistics Collection: Gather distribution data for cost estimation
- Cost Modeling: Estimate I/O, CPU, and memory costs for operations
- Plan Selection: Choose optimal execution strategies
- Execution Optimization: Apply runtime optimizations (JIT, vectorization)
- Result Caching: Avoid redundant computation
33.2 Cost-Based Query Optimization
33.2.1 Multi-Dimensional Cost Model
The cost model captures multiple performance dimensions:
struct Cost {
  double cpu_cost = 0.0;     // CPU cycles estimate
  double io_cost = 0.0;      // I/O operations estimate
  double memory_cost = 0.0;  // Memory pressure estimate
  double spill_cost = 0.0;   // Disk spill overhead

  auto total() const -> double {
    return cpu_cost + io_cost * kIOCostWeight
         + memory_cost * kMemoryCostWeight
         + spill_cost * kSpillCostWeight;
  }

  static constexpr double kIOCostWeight = 10.0;
  static constexpr double kMemoryCostWeight = 0.1;
  static constexpr double kSpillCostWeight = 5.0;
};
The weighted total() lets heterogeneous plans be compared on a single scale. The weights reflect typical hardware characteristics:
- I/O Weight (10x): Disk access dominates in-memory computation
- Memory Weight (0.1x): Memory pressure matters but less than I/O
- Spill Weight (5x): Disk spill is expensive but preferable to OOM
33.2.2 Cost Estimation Functions
The cost estimator provides operation-specific estimates:
class CostEstimator {
 public:
  // Sequential scan: linear in row count
  auto estimate_seq_scan_cost(int64_t total_rows, double row_width) const
      -> Cost;

  // Index scan: depends on selectivity and index structure
  auto estimate_index_scan_cost(const Index* index, double selectivity,
                                int64_t total_rows, bool index_only) const
      -> Cost;

  // Sort: O(n log n) with potential spill
  auto estimate_sort_cost(int64_t rows, double row_width,
                          size_t memory_budget) const -> Cost;

  // Hash join: build + probe with potential partition spill
  auto estimate_hash_join_cost(int64_t build_rows, int64_t probe_rows,
                               double row_width, size_t memory_budget) const
      -> Cost;

  // Aggregation: hash table or sorted groups
  auto estimate_aggregate_cost(int64_t input_rows, int64_t estimated_groups,
                               size_t memory_budget) const -> Cost;

 private:
  static constexpr double kCPUTupleProcessing = 0.01;
  static constexpr double kCPUComparison = 0.0001;
  static constexpr double kCPUHashOp = 0.0005;
  static constexpr double kSpillIOCost = 0.01;
};
Sequential Scan Cost:

C_{seq} = N \cdot c_{tuple} + N \cdot W \cdot c_{width}

where N is the row count, W is the average row width, and c_{tuple} and c_{width} are CPU cost constants.

Sort Cost:

C_{sort} = N \log_2 N \cdot c_{cmp} + C_{spill}

where C_{spill} accounts for external sort passes if the data exceeds memory:

C_{spill} = 2 \cdot P \cdot N \cdot W \cdot c_{io}, \quad P = \lceil \log_F R \rceil, \quad R = \lceil N W / M \rceil

where M is the memory budget, F is the merge width, and R is the number of initial sorted runs.
33.2.3 Spill Cost Prediction
The cost estimator predicts when operations will spill:
auto will_spill(int64_t rows, double row_width, size_t memory_budget) const
    -> bool {
  auto estimated_size = static_cast<size_t>(rows * row_width);
  return estimated_size > memory_budget;
}

auto estimate_sort_spill_cost(int64_t rows, double row_width,
                              size_t memory_budget) const -> double {
  auto data_size = rows * row_width;
  auto runs = static_cast<size_t>(std::ceil(data_size / memory_budget));
  auto passes = static_cast<size_t>(
      std::ceil(std::log(runs) / std::log(kMergeWidth)));
  return 2 * passes * data_size * kSpillIOCost * kCompressionFactor;
}
The spill cost model accounts for:
- Number of Runs: Initial sorted partitions that fit in memory
- Merge Passes: \lceil \log_F R \rceil passes over the data, where F is the merge width and R the run count
- I/O Cost: Read + write per pass
- Compression: LZ4 compression reduces I/O volume
33.3 LSM-Tree Cost Model
Cognica uses RocksDB (LSM-tree storage), requiring specialized I/O cost modeling.
33.3.1 LSM-Tree Characteristics
class LSMCostModel {
 public:
  static constexpr double kDefaultPointLookupCost = 1.0;
  static constexpr double kDefaultRangeScanCostPerRow = 0.1;
  static constexpr double kDefaultBloomFPR = 0.01;  // 1%
  static constexpr double kLevelSizeMultiplier = 10.0;

  static auto collect(cognica::rdb::DB* db) -> LSMCostModel;

  auto estimate_point_lookup_cost() const -> double;
  auto estimate_range_scan_cost(int64_t estimated_rows) const -> double;
  auto estimate_index_scan_cost(double selectivity, int64_t total_keys) const
      -> double;

 private:
  int32_t num_levels_ = 0;
  std::vector<int64_t> files_per_level_;
  std::vector<int64_t> bytes_per_level_;
  double bloom_filter_fpr_ = kDefaultBloomFPR;
};
33.3.2 Point Lookup Cost
Point lookups may check multiple LSM levels:

C_{point} = \sum_{i=1}^{L} P(\text{read level } i) \cdot c_{lookup}

With bloom filters, the probability of actually reading a level that does not contain the key is:

P(\text{read level } i) \approx \text{FPR}

so the expected number of data reads is roughly 1 + (L - 1) \cdot \text{FPR}, where FPR is the bloom filter false positive rate (typically 1%).
33.3.3 Range Scan Cost
Range scans merge data across levels:

C_{range} = \sum_{i=1}^{L} N \cdot \frac{S_i}{S} \cdot c_{row} + N \cdot \log_2 L \cdot c_{merge}

where S_i is the size of level i and S is the total data size. The \log_2 L term accounts for the heap-based multi-way merge across levels.
33.4 Cardinality Estimation
Accurate cardinality estimates drive cost-based optimization.
33.4.1 Cardinality Propagation
struct CardinalityEstimate {
  double rows = 0.0;
  double row_width = 512.0;

  auto memory_estimate() const -> size_t {
    return static_cast<size_t>(rows * row_width);
  }

  auto apply_selectivity(double selectivity) const -> CardinalityEstimate {
    return CardinalityEstimate{rows * selectivity, row_width};
  }
};
Each operator transforms cardinality:
- Filter: N_{out} = \sigma \cdot N_{in}, where \sigma is the selectivity
- Sort: N_{out} = N_{in} (rows unchanged)
- Limit: N_{out} = \min(N_{in}, k)
- Group: N_{out} = \min(N_{in}, G), where G is the estimated group count
- Join: N_{out} = N_{left} \cdot N_{right} \cdot \sigma_{join}
33.4.2 Selectivity Estimation
class CardinalityEstimator {
 public:
  auto estimate_filter_selectivity(const Document& predicates) const -> double;
  auto estimate_filter_selectivity(const query::ast::Expression* expr) const
      -> double;
  auto estimate_group_count(const std::vector<std::string>& group_keys) const
      -> int64_t;
  auto estimate_join_selectivity(const std::string& join_key) const -> double;

 private:
  static constexpr double kDefaultEqualitySelectivity = 0.01;
  static constexpr double kDefaultRangeSelectivity = 0.30;
  static constexpr double kDefaultJoinSelectivity = 0.1;
  static constexpr bool kUseDampedAnd = true;
};
Equality Selectivity:

With statistics:

\sigma_{eq} = \frac{1}{\text{NDV}}

where NDV is the number of distinct values of the column.

Without statistics (heuristic):

\sigma_{eq} = 0.01 \quad \text{(1% default)}

Range Selectivity:

With histogram:

\sigma_{range} = \sum_{b \in \text{buckets}} f_b

where f_b is the fraction of bucket b overlapping the range. Without a histogram, the 0.30 default applies.
33.4.3 Correlated Predicate Handling
For conjunctive predicates, the naive independence assumption underestimates the combined selectivity:

\sigma_{and} = \sigma_1 \cdot \sigma_2 \cdots \sigma_n

Cognica instead uses damped AND estimation (kUseDampedAnd), of the exponential-backoff form in which the sorted selectivities receive exponentially decaying exponents:

\sigma_{and} = \sigma_{(1)} \cdot \sigma_{(2)}^{1/2} \cdot \sigma_{(3)}^{1/4} \cdots

This accounts for common correlation patterns where predicates share underlying factors.
33.5 Query Plan Optimization
33.5.1 Optimization Passes
class PlanOptimizer final {
 public:
  auto optimize(LogicalPlanPtr plan)
      -> std::expected<LogicalPlanPtr, db::Status>;

 private:
  // Move filters closer to scans
  auto pushdown_predicates_(LogicalPlanPtr plan) -> LogicalPlanPtr;
  // Transform subqueries to joins
  auto unnest_subqueries_(LogicalPlanPtr plan) -> LogicalPlanPtr;
  // Extract repeated expressions
  auto eliminate_common_subexpressions_(LogicalPlanPtr plan) -> LogicalPlanPtr;
  // Optimize join order based on cost
  auto reorder_joins_(LogicalPlanPtr plan) -> LogicalPlanPtr;
  // Eliminate unused columns
  auto prune_columns_(LogicalPlanPtr plan) -> LogicalPlanPtr;
  // Remove redundant sorts when index provides order
  auto eliminate_redundant_sorts_(LogicalPlanPtr plan)
      -> std::expected<LogicalPlanPtr, db::Status>;
  // Transform OR joins to UNION
  auto expand_join_or_to_union_(LogicalPlanPtr plan) -> LogicalPlanPtr;
};
33.5.2 Predicate Pushdown
Predicate pushdown moves filters below blocking operators such as joins and aggregations, so that fewer rows flow through the expensive operators.
The optimizer checks pushability:
auto can_pushdown_predicate_(const ast::Expr* predicate,
                             const LogicalPlan* node) const -> bool {
  // Check if predicate references only columns from the target node
  return references_only_(predicate, node);
}
33.5.3 Join Reordering
For multi-way joins, the optimizer builds a join graph and selects optimal order:
struct JoinNode {
  std::string identifier;
  LogicalPlanPtr scan;
  int64_t estimated_rows;
};

struct JoinEdge {
  ast::ExprPtr condition;
  ast::JoinType join_type;
};

auto build_optimal_join_tree_(std::vector<JoinNode>& nodes,
                              std::vector<JoinEdge>& edges,
                              std::vector<ast::ExprPtr>& filters)
    -> LogicalPlanPtr;
The greedy algorithm selects, at each step, the join pair that minimizes the estimated intermediate result size:

|R_i \bowtie R_j| = N_i \cdot N_j \cdot \sigma_{ij}

Smaller intermediate results reduce memory pressure and I/O.
33.5.4 Common Subexpression Elimination
CSE identifies repeated expressions:
struct CSECandidate {
  ast::ExprPtr expr;
  size_t hash;
  int32_t occurrence_count;
  std::string synthetic_name;  // __cse_0, __cse_1, ...
};

class ExpressionCollector {
 public:
  void set_function_registry(const functions::ScalarFunctionRegistry* registry);
  void collect_from_plan(const LogicalPlan* plan);
  auto get_common_subexpressions() -> std::vector<CSECandidate>;

 private:
  auto is_cse_candidate_(const ast::Expr* expr) const -> bool {
    // Volatile functions (random, gen_random_uuid) cannot be CSE'd
    if (is_volatile_(expr)) return false;
    // Simple column references don't benefit from CSE
    if (is_simple_column_ref_(expr)) return false;
    return true;
  }
};
CSE transforms:
SELECT x * y + 1, x * y + 2 FROM t;
-- Into:
SELECT __cse_0 + 1, __cse_0 + 2 FROM (SELECT x * y AS __cse_0, ... FROM t);
33.5.5 Redundant Sort Elimination
When ORDER BY matches index order, sorting is unnecessary:
auto can_eliminate_sort_(const LogicalSort* sort,
                         const std::vector<std::string>& index_fields) const
    -> bool {
  // Check if ORDER BY columns match index field prefix
  for (size_t i = 0; i < sort->sort_keys().size(); ++i) {
    if (i >= index_fields.size()) return false;
    auto column = get_order_by_column_name_(sort->sort_keys()[i].expr.get());
    if (column != index_fields[i]) return false;
    // Direction must also match
  }
  return true;
}
33.6 Pipeline Optimization
33.6.1 Pipeline Optimizer
class PipelineOptimizer final {
 public:
  void set_statistics(const Statistics* stats);
  void set_base_row_count(int64_t base_rows);

  auto optimize(ParsedPipeline& pipeline) -> void;
  auto get_cardinality_estimates(const ParsedPipeline& pipeline) const
      -> std::vector<CardinalityEstimate>;

 private:
  auto apply_limit_hint_optimization_(ParsedPipeline& pipeline) -> void;
  auto apply_stage_reordering_(ParsedPipeline& pipeline) -> void;
  auto apply_stage_merging_(ParsedPipeline& pipeline) -> void;
  auto order_filter_predicates_(query::ast::Expression* filter) const -> void;
};
33.6.2 Top-K Optimization
Push LIMIT hints into SORT stages: when a LIMIT k immediately follows a SORT, the pipeline optimizer attaches k as a hint to the sort stage. The sort operator then uses a bounded heap for Top-K selection in O(n \log k) time instead of performing a full O(n \log n) sort.
33.6.3 Predicate Ordering
Order conjunctive predicates by selectivity (most selective first):
auto order_filter_predicates_(query::ast::Expression* filter) const -> void {
  // For AND expressions, reorder operands by estimated cost/selectivity
  // Most selective (lowest selectivity) should be first for short-circuit
}

auto estimate_predicate_cost_(const query::ast::Expression* expr) const
    -> double {
  // Combine selectivity with evaluation cost
  auto selectivity = cardinality_estimator_.estimate_filter_selectivity(expr);
  auto eval_cost = estimate_eval_cost_(expr);
  return selectivity * eval_cost;
}
Evaluating the most selective predicate first maximizes short-circuit benefit.
33.7 Index Selection
33.7.1 Cost-Based Index Selection
struct IndexSelector {
  static auto select(const IndexDescriptor& index_desc, const Document& query,
                     const IndexStatisticsManager* stats_mgr = nullptr,
                     const LSMCostModel* lsm_cost = nullptr)
      -> std::shared_ptr<const Index>;
};
The selector compares:
- Full Scan Cost: C_{full} = N \cdot c_{seq}
- Index Scan Cost: C_{index} = \sigma \cdot N \cdot c_{lookup} + (1 - \text{coverage}) \cdot \sigma \cdot N \cdot c_{fetch}
Index scan wins when:

C_{index} < C_{full}

where coverage indicates index-only scan capability: a covering index answers the query entirely from index entries and avoids base-table fetches.
33.7.2 Selectivity Threshold
For low selectivity queries, full scan may be cheaper:
// Typical crossover point: ~5-10% selectivity
constexpr double kIndexScanThreshold = 0.1;

if (selectivity > kIndexScanThreshold) {
  // Full scan likely cheaper due to sequential I/O
  return nullptr;
}
33.8 JIT Compilation Cost Model
33.8.1 JIT Decision Framework
struct JITDecision {
  bool should_jit = false;

  enum class Reason {
    kNotEligible,    // Cannot be JIT compiled
    kTooFewRows,     // Not enough rows to justify compilation
    kTooSimple,      // No benefit for simple expressions
    kCostEffective,  // JIT is cost effective
    kAlwaysJIT,      // Complex enough to always JIT
    kCacheHit,       // Already compiled in cache
  };
  Reason reason;

  int64_t break_even_rows = 0;
  double expected_speedup = 1.0;
  double compilation_time_us = 0.0;
};
33.8.2 Break-Even Analysis
class CostModel {
 public:
  auto decide(const AnalysisResult& analysis, int64_t estimated_rows) const
      -> JITDecision;

  // Per-row costs (nanoseconds)
  double interpret_cost_per_op = 5.0;  // Interpreter overhead
  double jit_cost_per_op = 0.5;        // Native code execution

  // Compilation costs (nanoseconds)
  double compilation_base_ns = 50000.0;   // 50us base
  double compilation_per_op_ns = 1000.0;  // 1us per operation
};
The break-even row count:

N_{be} = \frac{T_{compile}}{t_{interp} - t_{jit}}

For an expression with k operations:

N_{be} = \frac{c_{base} + k \cdot c_{per\text{-}op}}{k \cdot (t_{interp} - t_{jit})}

For k = 10 operations: N_{be} = (50000 + 10 \cdot 1000) / (10 \cdot (5 - 0.5)) = 60000 / 45 \approx 1333 rows.
33.8.3 JIT Configuration
struct JITConfig {
  static constexpr int32_t kMinOpsForJIT = 3;
  static constexpr int64_t kMinRowsForSimpleJIT = 1000;
  static constexpr int64_t kMinRowsForComplexJIT = 500;
  static constexpr int32_t kAlwaysJITOps = 10;
  static constexpr double kInterpretCostPerOp = 5.0;
  static constexpr double kJITCostPerOp = 0.5;
  static constexpr double kCompilationBaseNs = 50000.0;
  static constexpr double kCompilationPerOpNs = 1000.0;
};
The tiered approach:
- < 3 ops: Never JIT (overhead not justified)
- 3-9 ops: JIT if the estimated row count exceeds the threshold (kMinRowsForSimpleJIT = 1000; kMinRowsForComplexJIT = 500 for costlier expressions)
- >= 10 ops: Always JIT (complexity warrants compilation)
33.9 IR Optimization Passes
33.9.1 Optimization Framework
class OptimizationPass {
 public:
  virtual auto run(std::shared_ptr<ir::IRModule> module) -> OptimizationResult
      = 0;
  virtual auto name() const -> std::string = 0;
  virtual auto enabled_at_level(uint8_t level) const -> bool = 0;
};

struct OptimizerConfig {
  uint8_t optimization_level = 1;  // 0-3
  bool enable_constant_folding = true;
  bool enable_dce = true;
  bool enable_cse = true;
  bool enable_strength_reduction = true;
  size_t max_iterations = 10;
};
33.9.2 Constant Folding
Evaluate constant expressions at compile time:
class ConstantFoldingPass final : public OptimizationPass {
 public:
  auto enabled_at_level(uint8_t level) const -> bool override {
    return level >= 1;  // O1 and above
  }

 private:
  auto fold_binary_(const ir::IRBinaryOperation& node) -> ir::IRNodePtr {
    // 5 + 3 -> 8
    // "hello" || "world" -> "helloworld"
  }
};
33.9.3 Dead Code Elimination
Remove unreachable or unused code:
class DeadCodeEliminationPass final : public OptimizationPass {
 public:
  auto enabled_at_level(uint8_t level) const -> bool override {
    return level >= 1;
  }

 private:
  void mark_live_(const ir::IRNodePtr& node,
                  std::unordered_set<const ir::IRNode*>& live);
  auto eliminate_dead_(const ir::IRNodePtr& node,
                       const std::unordered_set<const ir::IRNode*>& live,
                       size_t& nodes_removed) -> ir::IRNodePtr;
};
33.9.4 Common Subexpression Elimination
class CSEPass final : public OptimizationPass {
 public:
  auto enabled_at_level(uint8_t level) const -> bool override {
    return level >= 2;  // O2 and above
  }

 private:
  auto hash_node_(const ir::IRNodePtr& node) const -> uint64_t;
  auto nodes_equal_(const ir::IRNodePtr& a, const ir::IRNodePtr& b) const
      -> bool;
};
33.9.5 Strength Reduction
Replace expensive operations with cheaper equivalents:
class StrengthReductionPass final : public OptimizationPass {
 public:
  auto enabled_at_level(uint8_t level) const -> bool override {
    return level >= 2;
  }

 private:
  auto reduce_node_(ir::IRNodePtr node) -> ir::IRNodePtr {
    // x * 2 -> x << 1
    // x / 4 -> x >> 2
    // x * 0 -> 0
    // x + 0 -> x
  }
};
33.9.6 Register Allocation
Map virtual registers to physical registers:
class RegisterAllocationPass final : public OptimizationPass {
 public:
  struct Config {
    uint32_t num_general_registers = 16;
    uint32_t num_float_registers = 8;
    bool enable_coalescing = true;
  };

  auto enabled_at_level(uint8_t /*level*/) const -> bool override {
    return true;  // Always required for codegen
  }

 private:
  void compute_liveness_(const ir::IRModule& module);
  void build_interference_graph_();
  void color_graph_();
  void coalesce_registers_();
};
The register allocator uses graph coloring with coalescing to minimize register-to-register moves.
33.10 Query Caching
33.10.1 Result Cache
class SQLQueryCache final {
 public:
  SQLQueryCache(int32_t num_shards, size_t capacity_per_shard,
                uint64_t ttl_seconds, size_t max_result_size);

  auto get(const std::string& sql,
           const catalog::SchemaGenerationTracker* schema_tracker = nullptr)
      const noexcept -> std::optional<executor::ResultSet>;

  void put(const std::string& sql, const executor::ResultSet& result,
           std::unordered_set<std::string> table_dependencies,
           bool depends_on_schema = false,
           const catalog::SchemaGenerationTracker* schema_tracker = nullptr);

  auto invalidate_by_table(const std::string& table_name) -> size_t;

 private:
  struct Shard {
    LRUCache<std::string, std::shared_ptr<CachedEntry>> cache;
    std::unordered_map<std::string, std::unordered_set<std::string>>
        table_to_keys;
    mutable threading::spinlock lock;
  };

  std::vector<std::unique_ptr<Shard>> shards_;
};
Key design decisions:
- Sharding: Multiple shards reduce lock contention
- TTL Expiration: Time-based invalidation for stale data
- Table Dependencies: Eager invalidation on writes
- Schema Generation: Cross-session invalidation for DDL
33.10.2 Plan Cache
struct PlanCacheEntry {
  std::shared_ptr<cvm::BytecodeModule> bytecode;
  std::vector<std::string> output_columns;
  std::unordered_map<uint8_t, db::document::Document> index_queries;
  uint64_t schema_generation = 0;
  std::vector<int32_t> parameter_oids;
  std::unordered_set<std::string> table_dependencies;
  uint64_t hit_count = 0;
};

class StatementPlanCache final {
 public:
  auto get(uint64_t query_hash, const std::vector<int32_t>& parameter_oids,
           uint64_t current_schema_generation) -> PlanCacheEntry*;
  void put(PlanCacheEntry entry);
  auto invalidate_by_table(const std::string& table_name) -> size_t;
};
Plan cache key combines:
- Query Fingerprint: Structural hash ignoring literals
- Parameter OIDs: Type-specific compilation
inline auto combine_hash_with_params(uint64_t query_fingerprint,
                                     const std::vector<int32_t>& parameter_oids)
    -> uint64_t {
  constexpr uint64_t kFNV64Prime = 1099511628211ULL;
  auto combined = query_fingerprint;
  for (auto oid : parameter_oids) {
    combined ^= static_cast<uint64_t>(static_cast<uint32_t>(oid));
    combined *= kFNV64Prime;
  }
  return combined;
}
33.11 CVM Execution Configuration
33.11.1 Execution Modes
enum class CVMExecutionMode : uint8_t {
  kDisabled = 0,        // Traditional Volcano iterator
  kExpressionOnly = 1,  // CVM for expressions only
  kFullQuery = 2,       // CVM for entire query
  kAuto = 3,            // Heuristic selection
};
33.11.2 Configuration Options
struct CVMExecutionOptions {
  CVMExecutionMode mode = CVMExecutionMode::kExpressionOnly;
  bool expression_enabled = true;
  bool full_query_enabled = false;

  size_t cache_max_entries = 4096;
  size_t cache_max_bytes = 16 * 1024 * 1024;  // 16MB

  int32_t min_ops_threshold = 1;
  int64_t min_rows_for_cvm = 100;

  size_t sort_memory_limit = 256 * 1024 * 1024;  // 256MB
  size_t hash_memory_limit = 256 * 1024 * 1024;
  size_t agg_memory_limit = 256 * 1024 * 1024;

  static constexpr size_t kDefaultBatchSize = 1024;
  static constexpr int64_t kMinRowsForParallelScan = 10000;
};
33.11.3 Execution Decision
auto should_use_cvm_for_expression() const -> bool {
  return expression_enabled
      && (mode == CVMExecutionMode::kExpressionOnly
          || mode == CVMExecutionMode::kFullQuery
          || mode == CVMExecutionMode::kAuto);
}

auto should_use_cvm_for_query() const -> bool {
  return full_query_enabled
      && (mode == CVMExecutionMode::kFullQuery
          || mode == CVMExecutionMode::kAuto);
}
33.12 Summary
Cognica's performance engineering infrastructure provides systematic optimization across the query lifecycle:
- Multi-Dimensional Cost Model: Balances CPU, I/O, memory, and spill costs with empirically-tuned weights
- LSM-Tree Aware I/O Estimation: Models read amplification characteristics of RocksDB storage for accurate index selection
- Cardinality Estimation: Propagates row count estimates through operators using statistics-based selectivity
- Query Plan Optimization: Applies predicate pushdown, join reordering, CSE, and redundant sort elimination
- Pipeline Optimization: Top-K hints, predicate ordering, and stage merging reduce execution cost
- JIT Compilation Cost Model: Break-even analysis determines when compilation overhead is justified
- IR Optimization Passes: Constant folding, dead code elimination, CSE, and strength reduction at multiple optimization levels
- Query Caching: Sharded result cache and plan cache with dependency-based invalidation
- CVM Configuration: Tunable execution modes, memory limits, and parallelism settings
The performance engineering discipline emphasizes measurement-driven optimization. Cost models require calibration against real workloads; heuristics need validation against production queries. The infrastructure provides the foundation—accurate statistics, principled cost modeling, and systematic optimization—but effective performance tuning requires continuous measurement and refinement.
The interplay between compile-time optimization (query planning, IR passes) and runtime optimization (JIT compilation, adaptive execution) enables Cognica to handle diverse workloads efficiently. Simple queries execute with minimal overhead; complex analytical queries benefit from sophisticated optimization. This adaptive approach reflects the fundamental challenge of database performance engineering: no single strategy is optimal for all workloads.