Chapter 15: Vectorized Execution
15.1 Introduction
Traditional query execution engines process data one row at a time using the Volcano iterator model. While this approach offers elegant composability and a low memory footprint, it suffers from significant interpretation overhead—each row requires a virtual function call through the operator tree, resulting in poor instruction cache utilization and frequent branch mispredictions.
Vectorized execution addresses these limitations by processing data in batches of rows, amortizing interpretation overhead across hundreds or thousands of values. Combined with columnar data layout and SIMD (Single Instruction, Multiple Data) instructions, vectorized execution can achieve 3-10x performance improvements over row-at-a-time processing for analytical workloads.
This chapter explores Cognica's vectorized execution engine, which extends the CVM bytecode interpreter with batch-oriented opcodes, columnar data structures, and selection vectors for efficient predicate evaluation.
15.1.1 The Cost of Row-at-a-Time Processing
Consider a simple filter operation that selects rows where age > 30. In the Volcano model, each row requires:
- Virtual function call to next() on the child operator
- Extracting the age field from the row
- Comparing against the constant 30
- Conditional branch based on the result
- Virtual function call to return the row (if qualifying)
For a table with $n$ rows, this results in $O(n)$ virtual calls and $O(n)$ conditional branches. Modern CPUs with deep pipelines suffer significant penalties from branch mispredictions, and the virtual call overhead prevents effective instruction caching.
The interpretation overhead ratio can be quantified as:

$$R_{\text{overhead}} = \frac{C_{\text{dispatch}} + C_{\text{branch}}}{C_{\text{op}}}$$

where $C_{\text{dispatch}}$ is the cost of opcode dispatch, $C_{\text{branch}}$ is the branch misprediction penalty, and $C_{\text{op}}$ is the actual operation cost. For simple operations like integer comparison, $C_{\text{op}}$ is small (1-2 cycles), while $C_{\text{branch}}$ can exceed 20 cycles, yielding overhead ratios of 10x or more.
15.1.2 Vectorized Execution Model
Vectorized execution reduces interpretation overhead by processing batches of rows with a single opcode dispatch:

$$R_{\text{overhead}}^{\text{batch}} = \frac{C_{\text{dispatch}} + C_{\text{branch}}}{B \cdot C_{\text{op}}}$$

where $B$ is the batch size. With $B = 1024$, the overhead becomes negligible—the dispatch cost is amortized across 1024 operations.
The key principles of vectorized execution are:
- Batch Processing: Operators consume and produce batches of rows rather than individual tuples
- Columnar Layout: Data within batches is organized by column, enabling SIMD parallelism
- Selection Vectors: Filtering produces index lists rather than materializing data
- Late Materialization: Values are extracted only when needed for output
15.2 Columnar Data Representation
15.2.1 The ColumnData Structure
Cognica represents columnar data using the ColumnData class, which provides type-safe, aligned storage for a single column:
class ColumnData {
CVMType type_; // Column data type
int64_t* data_ptr_; // 64-byte aligned storage
NullBitmap null_bitmap_; // One bit per value
size_t capacity_; // Pre-allocated slots
size_t size_; // Actual value count
};
The design reflects several important considerations:
Type Safety: Each column has a fixed type established at creation. Type checking occurs once during column setup rather than per-value access, eliminating runtime type dispatch overhead.
Memory Alignment: The data pointer is aligned to 64 bytes using std::aligned_alloc(64, alloc_size). This alignment is critical for SIMD performance:
- AVX-512 requires 64-byte alignment for optimal loads
- AVX2 requires 32-byte alignment
- Misaligned access incurs significant penalties (up to 100% slowdown)
Unified Storage: All numeric types use int64_t* as the underlying storage, with type-safe accessors:
auto as_int64() -> int64_t* { return data_ptr_; }
auto as_double() -> double* { return reinterpret_cast<double*>(data_ptr_); }
auto as_bool() -> bool* { return reinterpret_cast<bool*>(data_ptr_); }
15.2.2 The ColumnBatch Structure
Multiple columns are grouped into a ColumnBatch representing a horizontal partition of a table:
class ColumnBatch {
std::vector<ColumnData> columns_; // Column storage
SelectionVector selection_; // Active row indices
size_t row_capacity_; // Maximum rows
size_t row_count_; // Current rows
};
Default Batch Size: Cognica uses a default batch size of 1024 rows, chosen for several reasons:
- Fits comfortably in L1 cache (1024 values * 8 bytes = 8KB per column)
- Multiple of 64 for alignment with null bitmap words
- Balances memory efficiency with processing throughput
Batch Size Selection: The optimal batch size depends on the workload:

$$B_{\text{opt}} = \frac{\text{L1 size}}{k \cdot \text{value size}}$$

where $k$ is the number of columns accessed simultaneously. For a 32KB L1 cache processing 4 columns of 8-byte values, $B_{\text{opt}} = 32768 / (4 \times 8) = 1024$.
15.2.3 Null Handling with Bitmaps
SQL's three-valued logic requires tracking NULL values separately from data values. Cognica uses a compact bitmap representation:
class NullBitmap {
std::vector<uint64_t> words_; // 64 rows per word
auto is_valid(size_t idx) const -> bool {
return (words_[idx / 64] >> (idx % 64)) & 1;
}
void set_null(size_t idx) {
words_[idx / 64] &= ~(1ULL << (idx % 64));
}
void set_valid(size_t idx) {
words_[idx / 64] |= (1ULL << (idx % 64));
}
};
Memory Efficiency: The bitmap uses only 1 bit per value, compared to 8 bits for a separate boolean column. For 1024 rows, the null bitmap requires only 128 bytes (16 words).
Bulk Operations: Combining null bitmaps for binary operations uses efficient bitwise operations:
void NullBitmap::intersect(const NullBitmap& other) {
for (size_t i = 0; i < words_.size(); ++i) {
words_[i] &= other.words_[i]; // Valid only if both valid
}
}
The intersection operation processes 64 null flags per CPU instruction, yielding up to a 64x speedup over per-value null checking.
15.2.4 String Column Handling
String columns require special treatment due to variable-length data:
class ColumnData {
// For string columns:
StringRef* string_refs_; // Lightweight references
std::vector<std::string> string_storage_; // Owned string data
};
StringRef Design: A StringRef is a non-owning reference consisting of a pointer and length:
struct StringRef {
const char* data;
size_t length;
};
This design enables zero-copy string comparisons and projections while maintaining ownership in the string_storage_ vector.
15.3 SIMD Acceleration
15.3.1 Portable SIMD with xsimd
Cognica uses the xsimd library to provide portable SIMD abstractions across CPU architectures:
#include <xsimd/xsimd.hpp>
using arch_type = xsimd::default_arch;
using batch_int64 = xsimd::batch<int64_t, arch_type>;
using batch_float64 = xsimd::batch<double, arch_type>;
At compile time, xsimd automatically selects the best available instruction set:
| Architecture | Instruction Set | Elements per Register |
|---|---|---|
| x86-64 (modern) | AVX-512 | 8 int64/double |
| x86-64 (common) | AVX2 | 4 int64/double |
| x86-64 (legacy) | SSE4.2 | 2 int64/double |
| ARM64 | NEON | 2 int64/double |
15.3.2 SIMD Arithmetic Operations
Vectorized arithmetic processes multiple values per instruction:
void batch_add_int64(int64_t* dst, const int64_t* src1,
const int64_t* src2, size_t count) {
constexpr size_t simd_width = batch_int64::size; // 4 for AVX2
// Process SIMD-width elements at a time
size_t i = 0;
for (; i + simd_width <= count; i += simd_width) {
auto v1 = batch_int64::load_aligned(&src1[i]);
auto v2 = batch_int64::load_aligned(&src2[i]);
auto result = v1 + v2;
result.store_aligned(&dst[i]);
}
// Scalar remainder
for (; i < count; ++i) {
dst[i] = src1[i] + src2[i];
}
}
Performance Analysis: For AVX2 with 4-wide SIMD:
- Scalar loop: 1024 iterations, 1024 add instructions
- SIMD loop: 256 iterations, 256 vpaddq instructions
- Theoretical speedup: 4x (limited by memory bandwidth in practice)
15.3.3 SIMD with Null Propagation
Real-world operations must handle NULL values correctly:
void batch_add_with_nulls(int64_t* dst, const int64_t* src1,
const int64_t* src2, size_t count,
const NullBitmap& null1, const NullBitmap& null2,
NullBitmap& null_out) {
constexpr size_t simd_width = batch_int64::size;
// Process 64 rows at a time (one null bitmap word)
for (size_t word = 0; word < count / 64; ++word) {
// Combine null bitmaps: result valid only if both inputs valid
uint64_t valid_mask = null1.get_word(word) & null2.get_word(word);
null_out.set_word(word, valid_mask);
// SIMD arithmetic within the 64-element block
size_t base = word * 64;
for (size_t j = 0; j + simd_width <= 64; j += simd_width) {
auto v1 = batch_int64::load_aligned(&src1[base + j]);
auto v2 = batch_int64::load_aligned(&src2[base + j]);
auto result = v1 + v2;
result.store_aligned(&dst[base + j]);
}
}
}
The null bitmap combination operates on 64 values per instruction, perfectly aligned with the SIMD processing granularity.
15.3.4 SIMD Comparison Operations
Comparisons in vectorized execution produce selection vectors rather than boolean columns:
void batch_cmp_gt_int64(SelectionVector& sel, const int64_t* src,
int64_t constant, size_t count,
const NullBitmap& nulls) {
constexpr size_t simd_width = batch_int64::size;
auto const_vec = batch_int64::broadcast(constant);
sel.clear();
for (size_t i = 0; i + simd_width <= count; i += simd_width) {
auto values = batch_int64::load_aligned(&src[i]);
auto cmp_mask = values > const_vec; // SIMD comparison
// Extract matching indices
alignas(32) bool results[simd_width];
cmp_mask.store_aligned(results);
for (size_t j = 0; j < simd_width; ++j) {
if (results[j] && nulls.is_valid(i + j)) {
sel.push_back(static_cast<uint16_t>(i + j));
}
}
}
}
15.4 Selection Vectors
15.4.1 Concept and Motivation
A selection vector stores the indices of rows that satisfy a predicate, enabling zero-copy filtering:
class SelectionVector {
std::vector<uint16_t> indices_; // Sorted row indices
auto size() const -> size_t { return indices_.size(); }
auto operator[](size_t i) const -> uint16_t { return indices_[i]; }
};
Why uint16_t? With a maximum batch size of 65,536 rows, 16-bit indices suffice while minimizing memory usage. For the typical 1024-row batch, a selection vector uses at most 2KB.
Zero-Copy Advantage: Traditional filtering materializes qualifying rows into a new buffer:
// Traditional: O(selectivity * n) memory copies
std::vector<Row> filtered;
for (const auto& row : input) {
if (row.age > 30) {
filtered.push_back(row); // Copy!
}
}
Selection vectors avoid copying entirely:
// Selection vector: O(selectivity * n) index stores
SelectionVector sel;
for (size_t i = 0; i < count; ++i) {
if (ages[i] > 30) {
sel.push_back(i); // Just store index
}
}
15.4.2 Selection Vector Operations
Creation from Predicate:
auto SelectionVector::from_predicate(
const int64_t* values, size_t count,
std::function<bool(int64_t)> predicate) -> SelectionVector {
SelectionVector result;
for (size_t i = 0; i < count; ++i) {
if (predicate(values[i])) {
result.indices_.push_back(static_cast<uint16_t>(i));
}
}
return result;
}
All-Selected Initialization:
auto SelectionVector::all_selected(size_t count) -> SelectionVector {
SelectionVector result;
result.indices_.reserve(count);
for (size_t i = 0; i < count; ++i) {
result.indices_.push_back(static_cast<uint16_t>(i));
}
return result;
}
15.4.3 Boolean Operations on Selection Vectors
Compound predicates require combining selection vectors:
AND (Intersection): Using two-pointer merge on sorted indices:
auto selection_and(const SelectionVector& a, const SelectionVector& b)
-> SelectionVector {
SelectionVector result;
size_t i = 0, j = 0;
while (i < a.size() && j < b.size()) {
if (a[i] < b[j]) {
++i;
} else if (a[i] > b[j]) {
++j;
} else {
result.push_back(a[i]);
++i;
++j;
}
}
return result;
}
Complexity: $O(|a| + |b|)$, exploiting the sorted property.
OR (Union): Merge two sorted vectors:
auto selection_or(const SelectionVector& a, const SelectionVector& b)
-> SelectionVector {
SelectionVector result;
size_t i = 0, j = 0;
while (i < a.size() && j < b.size()) {
if (a[i] < b[j]) {
result.push_back(a[i++]);
} else if (a[i] > b[j]) {
result.push_back(b[j++]);
} else {
result.push_back(a[i]);
++i;
++j;
}
}
while (i < a.size()) result.push_back(a[i++]);
while (j < b.size()) result.push_back(b[j++]);
return result;
}
NOT (Complement):
auto selection_not(const SelectionVector& sel, size_t total_count)
-> SelectionVector {
SelectionVector result;
size_t j = 0;
for (size_t i = 0; i < total_count; ++i) {
if (j < sel.size() && sel[j] == i) {
++j; // Skip this index
} else {
result.push_back(static_cast<uint16_t>(i));
}
}
return result;
}
15.4.4 Applying Selection Vectors
When data must finally be materialized (for output or complex operations), the selection vector guides extraction:
Gather Operation: Extract selected values into a compact array:
void gather_int64(int64_t* dst, const int64_t* src,
const SelectionVector& sel) {
for (size_t i = 0; i < sel.size(); ++i) {
dst[i] = src[sel[i]];
}
}
SIMD Gather (AVX2/AVX-512): Modern CPUs provide hardware gather instructions:
void gather_int64_simd(int64_t* dst, const int64_t* src,
const SelectionVector& sel) {
// AVX2 vgatherqd / AVX-512 vpgatherdq
// Note: Gather has limited performance benefit;
// often sequential access is faster for small selections
}
15.5 Vectorized Opcodes
15.5.1 Extended Opcode Space
Cognica's CVM reserves the extended opcode range (prefix 0xFE) for vectorized operations. Batch opcodes occupy codes 0x20 through 0x67:
enum class ExtendedOpcode : uint8_t {
// Batch Scan Operations (0x20-0x27)
BATCH_SCAN_OPEN = 0x20,
BATCH_SCAN_NEXT = 0x21,
BATCH_SCAN_CLOSE = 0x22,
BATCH_EMIT = 0x23,
// Batch Integer Arithmetic (0x28-0x2F)
BATCH_ADD_I64 = 0x28,
BATCH_SUB_I64 = 0x29,
BATCH_MUL_I64 = 0x2A,
BATCH_DIV_I64 = 0x2B,
BATCH_MOD_I64 = 0x2C,
BATCH_NEG_I64 = 0x2D,
BATCH_ABS_I64 = 0x2E,
// Batch Float Arithmetic (0x30-0x37)
BATCH_ADD_F64 = 0x30,
BATCH_SUB_F64 = 0x31,
BATCH_MUL_F64 = 0x32,
BATCH_DIV_F64 = 0x33,
BATCH_NEG_F64 = 0x34,
BATCH_ABS_F64 = 0x35,
BATCH_SQRT_F64 = 0x36,
// Batch Comparisons (0x38-0x47)
BATCH_CMP_EQ_I64 = 0x38,
BATCH_CMP_NE_I64 = 0x39,
BATCH_CMP_LT_I64 = 0x3A,
BATCH_CMP_LE_I64 = 0x3B,
BATCH_CMP_GT_I64 = 0x3C,
BATCH_CMP_GE_I64 = 0x3D,
BATCH_CMP_EQ_F64 = 0x40,
BATCH_CMP_LT_F64 = 0x42,
// ... more comparison opcodes
// Batch Logical Operations (0x48-0x4F)
BATCH_AND = 0x48,
BATCH_OR = 0x49,
BATCH_NOT = 0x4A,
BATCH_IS_NULL = 0x4B,
BATCH_SAVE_SEL = 0x4C,
BATCH_CLEAR_SEL = 0x4D,
// Batch Data Movement (0x50-0x57)
BATCH_SELECT = 0x50,
BATCH_GATHER = 0x51,
BATCH_SCATTER = 0x52,
BATCH_PROJECT = 0x53,
BATCH_FILTER = 0x57,
// Batch Hash Operations (0x58-0x5B)
BATCH_HASH = 0x58,
BATCH_HASH_COMB = 0x59,
BATCH_HASH_PROBE = 0x5A,
BATCH_HASH_BUILD = 0x5B,
// Parallel Operations (0x5C-0x67)
PAR_SCAN_OPEN = 0x5C,
PAR_SCAN_NEXT = 0x5D,
PAR_PARTITION = 0x60,
PAR_MERGE = 0x61,
PAR_BARRIER = 0x64,
};
15.5.2 Batch Scan Operations
BATCH_SCAN_OPEN: Opens a columnar scan iterator:
Encoding: [0xFE] [0x20] [cursor_slot] [collection_idx]
Semantics: Opens collection for batch scanning, stores handle in cursor_slot
BATCH_SCAN_NEXT: Fetches the next batch:
Encoding: [0xFE] [0x21] [cursor_slot] [batch_reg]
Semantics: Reads up to 1024 rows into batch register
Sets status flag: 1 if more data, 0 if exhausted
BATCH_EMIT: Outputs a batch to the result callback:
Encoding: [0xFE] [0x23] [batch_reg]
Semantics: Sends batch to registered output handler
Applies current selection vector before emission
15.5.3 Batch Arithmetic Semantics
Batch arithmetic operations follow element-wise semantics:
BATCH_ADD_I64 Rd, Rs1, Rs2
For i in 0..batch_size:
if selection.contains(i):
Rd.data[i] = Rs1.data[i] + Rs2.data[i]
Rd.null[i] = Rs1.null[i] AND Rs2.null[i]
The operation respects both the selection vector (only computing selected rows) and null bitmaps (propagating nulls).
15.5.4 Batch Comparison Semantics
Unlike scalar comparisons that produce boolean values, batch comparisons update the selection vector:
BATCH_CMP_GT_I64 Rs1, Rs2
new_selection = {}
For i in 0..batch_size:
if current_selection.contains(i):
if Rs1.data[i] > Rs2.data[i] AND !Rs1.null[i] AND !Rs2.null[i]:
new_selection.add(i)
current_selection = new_selection
This design enables predicate chaining without intermediate materialization:
BATCH_CMP_GT_I64 age_col, const_30 ; age > 30
BATCH_CMP_LT_I64 salary_col, const_100000 ; AND salary < 100000
; Selection now contains rows where both predicates hold
15.5.5 Selection Vector Management
BATCH_SAVE_SEL: Saves the current selection for later OR operations:
BATCH_SAVE_SEL batch_reg
; Copies current_selection to batch_reg.selection
BATCH_OR: Performs OR by merging saved and current selections:
BATCH_OR batch_reg
; current_selection = union(current_selection, batch_reg.selection)
Example for age > 30 OR salary < 50000:
BATCH_CMP_GT_I64 age_col, const_30 ; Evaluate first predicate
BATCH_SAVE_SEL temp_batch ; Save selection
BATCH_CLEAR_SEL ; Reset to all rows
BATCH_CMP_LT_I64 salary_col, const_50k ; Evaluate second predicate
BATCH_OR temp_batch ; Merge selections
15.6 The Vectorized Interpreter
15.6.1 VectorizedContext
The vectorized interpreter maintains specialized state:
class VectorizedContext {
// Batch registers (32 slots)
std::vector<std::unique_ptr<ColumnBatch>> batch_registers_;
// Current selection state
SelectionVector current_selection_;
// Cursor slot mapping (batch slot -> VM cursor slot)
std::array<int8_t, kMaxBatchCursorSlots> cursor_slots_;
// Output callback
BatchEmitCallback batch_emit_callback_;
// Hash tables for joins
std::array<std::unique_ptr<VectorizedHashTable>,
kMaxBatchRegisters> hash_tables_;
};
Batch Registers: 32 registers hold ColumnBatch objects, analogous to scalar VM registers but for columnar data.
Cursor Mapping: Batch cursor slots map to underlying VM cursor slots, enabling the vectorized interpreter to leverage existing cursor infrastructure.
15.6.2 Dispatch Loop
The vectorized interpreter uses computed goto dispatch, similar to the scalar interpreter:
void VectorizedInterpreter::dispatch_loop() {
static void* dispatch_table[] = {
[BATCH_SCAN_OPEN] = &&op_batch_scan_open,
[BATCH_SCAN_NEXT] = &&op_batch_scan_next,
[BATCH_ADD_I64] = &&op_batch_add_i64,
[BATCH_CMP_GT_I64] = &&op_batch_cmp_gt_i64,
// ... 48 batch opcodes
};
DISPATCH_NEXT;
op_batch_add_i64:
{
uint8_t dst = fetch_byte();
uint8_t src1 = fetch_byte();
uint8_t src2 = fetch_byte();
auto& dst_batch = batch_registers_[dst];
auto& src1_batch = batch_registers_[src1];
auto& src2_batch = batch_registers_[src2];
batch_add_int64(
dst_batch->column(0).as_int64(),
src1_batch->column(0).as_int64(),
src2_batch->column(0).as_int64(),
dst_batch->row_count(),
src1_batch->column(0).null_bitmap(),
src2_batch->column(0).null_bitmap(),
dst_batch->column(0).null_bitmap()
);
DISPATCH_NEXT;
}
op_batch_cmp_gt_i64:
{
uint8_t src1 = fetch_byte();
uint8_t src2 = fetch_byte();
batch_cmp_gt_int64(
current_selection_,
batch_registers_[src1]->column(0).as_int64(),
batch_registers_[src2]->column(0).as_int64(),
batch_registers_[src1]->row_count(),
batch_registers_[src1]->column(0).null_bitmap(),
batch_registers_[src2]->column(0).null_bitmap()
);
DISPATCH_NEXT;
}
// ... more opcode handlers
}
15.6.3 Hybrid Execution
Some operations cannot be fully vectorized and require falling back to scalar processing. Cognica implements a hybrid execution model:
Batch-to-Scalar Conversion:
void emit_batch_to_scalar(const ColumnBatch& batch,
const SelectionVector& sel,
DocumentCallback callback) {
for (size_t i = 0; i < sel.size(); ++i) {
uint16_t row_idx = sel[i];
Document doc = extract_row(batch, row_idx);
callback(std::move(doc));
}
}
Vectorization Eligibility: The compiler analyzes pipeline stages to determine vectorizability:
| Stage Type | Vectorizable | Reason |
|---|---|---|
| Filter (simple comparison) | Yes | Direct SIMD implementation |
| Project (column selection) | Yes | Zero-copy column reference |
| Sort | Partial | Scan vectorized, sort scalar |
| Group/Aggregate | Partial | Hash build vectorized, aggregate scalar |
| Script (Lua/Python) | No | Requires row-at-a-time callbacks |
| Full-Text Search | No | Complex scoring logic |
15.7 Vectorized Hash Operations
15.7.1 Hash Table Building
Hash joins in vectorized mode build hash tables batch-at-a-time:
void batch_hash_build(VectorizedHashTable& ht,
const ColumnBatch& keys,
const ColumnBatch& payloads,
const SelectionVector& sel) {
// Compute hashes for all keys in batch
std::vector<uint64_t> hashes(sel.size());
batch_hash_int64(hashes.data(),
keys.column(0).as_int64(),
sel);
// Insert into hash table
for (size_t i = 0; i < sel.size(); ++i) {
uint16_t row_idx = sel[i];
uint64_t hash = hashes[i];
ht.insert(hash, row_idx, keys, payloads);
}
}
Hash Function: Cognica uses FNV-1a hashing for its simplicity and good distribution:
void batch_hash_int64(uint64_t* hashes, const int64_t* values,
const SelectionVector& sel) {
constexpr uint64_t FNV_PRIME = 0x100000001b3;
constexpr uint64_t FNV_OFFSET = 0xcbf29ce484222325;
for (size_t i = 0; i < sel.size(); ++i) {
uint64_t h = FNV_OFFSET;
const auto* bytes = reinterpret_cast<const uint8_t*>(
&values[sel[i]]);
for (size_t j = 0; j < sizeof(int64_t); ++j) {
h ^= bytes[j];
h *= FNV_PRIME;
}
hashes[i] = h;
}
}
15.7.2 Hash Table Probing
Probing processes a batch of keys against the hash table:
void batch_hash_probe(SelectionVector& matches,
std::vector<uint32_t>& match_positions,
const VectorizedHashTable& ht,
const ColumnBatch& probe_keys,
const SelectionVector& sel) {
matches.clear();
match_positions.clear();
std::vector<uint64_t> hashes(sel.size());
batch_hash_int64(hashes.data(),
probe_keys.column(0).as_int64(),
sel);
for (size_t i = 0; i < sel.size(); ++i) {
uint16_t row_idx = sel[i];
uint64_t hash = hashes[i];
auto it = ht.find(hash, probe_keys.column(0).as_int64()[row_idx]);
if (it != ht.end()) {
matches.push_back(row_idx);
match_positions.push_back(it->build_row_idx);
}
}
}
15.8 Performance Characteristics
15.8.1 Theoretical Speedup
The vectorized execution speedup depends on several factors:

$$S = \frac{n\,(C_{\text{dispatch}} + C_{\text{work}})}{(n/B)\,C_{\text{dispatch}} + (n/W)\,C_{\text{work}}}$$

where:
- $n$ = number of rows
- $B$ = batch size (1024)
- $C_{\text{dispatch}}$ = dispatch cost per operation
- $C_{\text{work}}$ = computation cost per value
- $W$ = SIMD width (4 for AVX2, 8 for AVX-512)

For large batches and small $C_{\text{work}}$ (simple operations), the dispatch term becomes negligible:

$$S \approx \frac{C_{\text{dispatch}} + C_{\text{work}}}{C_{\text{work}}/W} = W\left(1 + \frac{C_{\text{dispatch}}}{C_{\text{work}}}\right)$$

With $C_{\text{dispatch}} = 20$ cycles (typical for interpreted execution), $C_{\text{work}} = 2$ cycles, and $W = 4$:

$$S \approx 4 \times (1 + 10) = 44$$

In practice, memory bandwidth limits speedup to 3-10x for most workloads.
15.8.2 Selectivity Impact
Predicate selectivity significantly affects vectorized performance:
High Selectivity (few rows pass): Selection vectors remain small, but SIMD comparison still processes all values. Speedup is limited because most computation is "wasted" on non-qualifying rows.
Low Selectivity (most rows pass): Selection vectors are large, and subsequent operations benefit from dense SIMD processing. Speedup approaches the theoretical maximum.
The selectivity-adjusted speedup can be modeled as:

$$S_{\text{adj}} = \left(\frac{1-\sigma}{S_f} + \frac{\sigma}{S_{\max}}\right)^{-1}$$

where $\sigma$ is the selectivity (the fraction of rows passing the filter), $S_f$ is the filter-only speedup, and $S_{\max}$ is the theoretical maximum from Section 15.8.1: rows eliminated by the filter benefit only from $S_f$, while rows that flow through the full vectorized pipeline approach $S_{\max}$.
15.8.3 Memory Bandwidth Considerations
For memory-bound operations, vectorization primarily improves cache utilization. Columnar layout enables sequential access patterns that maximize prefetcher effectiveness:
| Access Pattern | Cache Behavior | Prefetch Benefit |
|---|---|---|
| Row-at-a-time | Random access | Poor |
| Column batch | Sequential | Excellent |
With effective prefetching, memory latency is hidden and processing becomes compute-bound, where SIMD parallelism provides maximum benefit.
15.9 Implementation Example
15.9.1 Vectorized Filter Pipeline
Consider the query:
SELECT name, salary FROM employees WHERE age > 30 AND salary < 100000
Bytecode Generation:
; Initialize
BATCH_SCAN_OPEN 0, employees_collection
loop:
; Fetch next batch
BATCH_SCAN_NEXT 0, BR0 ; BR0 = batch register 0
JZ end ; Exit if no more batches
; Reset selection to all rows
BATCH_CLEAR_SEL
; Extract columns
BATCH_PROJECT BR1, BR0, age_col ; BR1 = age column
BATCH_PROJECT BR2, BR0, salary ; BR2 = salary column
; Load constants
BATCH_CONST_I64 BR3, 30 ; BR3 = constant 30
BATCH_CONST_I64 BR4, 100000 ; BR4 = constant 100000
; Evaluate predicates
BATCH_CMP_GT_I64 BR1, BR3 ; selection = rows where age > 30
BATCH_CMP_LT_I64 BR2, BR4 ; selection &= rows where salary < 100000
; Project output columns
BATCH_PROJECT BR5, BR0, name_col
BATCH_PROJECT BR6, BR0, salary
; Emit results (applies selection automatically)
BATCH_EMIT BR5
BATCH_EMIT BR6
JMP loop
end:
BATCH_SCAN_CLOSE 0
RET
15.9.2 Execution Trace
For a batch of 1024 rows with 300 qualifying:
- BATCH_SCAN_NEXT: Loads 1024 rows into BR0 (columnar format)
- BATCH_CLEAR_SEL: Selection = {0, 1, 2, ..., 1023}
- BATCH_PROJECT: Zero-copy column references
- BATCH_CMP_GT_I64: SIMD comparison, selection = {qualifying age rows}
- BATCH_CMP_LT_I64: SIMD comparison + intersection, selection = {300 rows}
- BATCH_EMIT: Outputs 300 rows using selection vector
Performance Metrics:
- Dispatch overhead: 10 opcodes per batch (vs. 10,240 for row-at-a-time)
- SIMD operations: 512 4-wide vector comparisons (vs. 2,048 scalar)
- Memory copies: 0 (selection vector avoids materialization)
15.10 Summary
Vectorized execution represents a fundamental shift in query processing, replacing the elegant but inefficient row-at-a-time model with batch-oriented, SIMD-accelerated processing. The key concepts covered in this chapter are:
Columnar Data Layout: The ColumnBatch and ColumnData structures organize data by column rather than row, enabling SIMD parallelism and improved cache utilization. Memory alignment to 64 bytes ensures optimal performance on modern CPUs.
Selection Vectors: Instead of materializing filtered rows, selection vectors track qualifying row indices. This zero-copy approach eliminates data movement and enables efficient predicate chaining.
SIMD Acceleration: Using the xsimd library, Cognica achieves portable SIMD execution across x86-64 (SSE4.2/AVX2/AVX-512) and ARM64 (NEON) platforms. Operations process 2-8 values per instruction.
Vectorized Opcodes: The CVM extended opcode space (0x20-0x67) provides 48 batch-oriented instructions covering scans, arithmetic, comparisons, logical operations, and hash operations.
Null Handling: Compact null bitmaps (1 bit per value) enable efficient three-valued logic with bulk AND/OR operations processing 64 null flags per instruction.
Hybrid Execution: When full vectorization is not possible (sorting, complex expressions), the system gracefully falls back to scalar processing while maintaining vectorized scans.
The combination of these techniques yields 3-10x performance improvements for analytical workloads, making vectorized execution essential for modern database systems processing large datasets.
References
- Boncz, P., Zukowski, M., & Nes, N. (2005). MonetDB/X100: Hyper-Pipelining Query Execution. CIDR.
- Polychroniou, O., Raghavan, A., & Ross, K. A. (2015). Rethinking SIMD Vectorization for In-Memory Databases. SIGMOD.
- Kersten, T., Leis, V., Kemper, A., Neumann, T., Pavlo, A., & Boncz, P. (2018). Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask. PVLDB.
- Intel. (2023). Intel Intrinsics Guide. https://software.intel.com/sites/landingpage/IntrinsicsGuide/
- xsimd Documentation. https://xsimd.readthedocs.io/