Chapter 15: Vectorized Execution

15.1 Introduction

Traditional query execution engines process data one row at a time using the Volcano iterator model. While this approach offers elegant composability and a low memory footprint, it suffers from significant interpretation overhead—each row requires a virtual function call through the operator tree, resulting in poor instruction cache utilization and frequent branch mispredictions.

Vectorized execution addresses these limitations by processing data in batches of rows, amortizing interpretation overhead across hundreds or thousands of values. Combined with columnar data layout and SIMD (Single Instruction, Multiple Data) instructions, vectorized execution can achieve 3-10x performance improvements over row-at-a-time processing for analytical workloads.

This chapter explores Cognica's vectorized execution engine, which extends the CVM bytecode interpreter with batch-oriented opcodes, columnar data structures, and selection vectors for efficient predicate evaluation.

15.1.1 The Cost of Row-at-a-Time Processing

Consider a simple filter operation that selects rows where age > 30. In the Volcano model, each row requires:

  1. Virtual function call to next() on the child operator
  2. Extracting the age field from the row
  3. Comparing against the constant 30
  4. Conditional branch based on the result
  5. Virtual function call to return the row (if qualifying)

For a table with n rows, this results in O(n) virtual calls and O(n) conditional branches. Modern CPUs with deep pipelines suffer significant penalties from branch mispredictions, and the virtual call overhead prevents effective instruction caching.

The interpretation overhead ratio can be quantified as:

\text{Overhead} = \frac{T_{\text{interpret}}}{T_{\text{compute}}} = \frac{n \cdot (C_{\text{dispatch}} + C_{\text{branch}})}{n \cdot C_{\text{op}}}

where C_dispatch is the cost of opcode dispatch, C_branch is the branch misprediction penalty, and C_op is the actual operation cost. For simple operations like integer comparison, C_op is small (1-2 cycles), while C_dispatch + C_branch can exceed 20 cycles, yielding overhead ratios of 10x or more.

15.1.2 Vectorized Execution Model

Vectorized execution reduces interpretation overhead by processing batches of rows with a single opcode dispatch:

\text{Overhead}_{\text{vectorized}} = \frac{C_{\text{dispatch}}}{B \cdot C_{\text{op}}}

where B is the batch size. With B = 1024, the overhead becomes negligible: the dispatch cost is amortized across 1024 operations.

The key principles of vectorized execution are:

  1. Batch Processing: Operators consume and produce batches of rows rather than individual tuples
  2. Columnar Layout: Data within batches is organized by column, enabling SIMD parallelism
  3. Selection Vectors: Filtering produces index lists rather than materializing data
  4. Late Materialization: Values are extracted only when needed for output
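The contrast between the two models can be sketched in a few lines. This is an illustrative reduction (the filter, types, and names are simplified, not engine code): the row model pays one indirect call per row, while the batch model pays one call per batch and runs a tight inner loop that produces a selection vector.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-at-a-time: one virtual call and one branch per row.
struct RowFilter {
  virtual ~RowFilter() = default;
  virtual bool matches(int64_t age) const = 0;
};

struct AgeOver30 : RowFilter {
  bool matches(int64_t age) const override { return age > 30; }
};

// Batch-at-a-time: one call per batch; the inner loop emits the
// indices of qualifying rows instead of copying them.
std::vector<uint16_t> filter_batch(const int64_t* ages, size_t count) {
  std::vector<uint16_t> sel;
  sel.reserve(count);
  for (size_t i = 0; i < count; ++i) {
    if (ages[i] > 30) sel.push_back(static_cast<uint16_t>(i));
  }
  return sel;
}
```

The dispatch cost of `filter_batch` is paid once per batch rather than once per row, which is exactly the amortization the overhead formula above captures.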

15.2 Columnar Data Representation

15.2.1 The ColumnData Structure

Cognica represents columnar data using the ColumnData class, which provides type-safe, aligned storage for a single column:

class ColumnData {
  CVMType type_;            // Column data type
  int64_t* data_ptr_;       // 64-byte aligned storage
  NullBitmap null_bitmap_;  // One bit per value
  size_t capacity_;         // Pre-allocated slots
  size_t size_;             // Actual value count
};

The design reflects several important considerations:

Type Safety: Each column has a fixed type established at creation. Type checking occurs once during column setup rather than per-value access, eliminating runtime type dispatch overhead.

Memory Alignment: The data pointer is aligned to 64 bytes using std::aligned_alloc(64, alloc_size). This alignment is critical for SIMD performance:

  • AVX-512 requires 64-byte alignment for optimal loads
  • AVX2 requires 32-byte alignment
  • Misaligned access incurs significant penalties (up to 100% slowdown)

Unified Storage: All numeric types use int64_t* as the underlying storage, with type-safe accessors:

auto as_int64() -> int64_t* { return data_ptr_; }
auto as_double() -> double* { return reinterpret_cast<double*>(data_ptr_); }
auto as_bool() -> bool* { return reinterpret_cast<bool*>(data_ptr_); }
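A minimal stand-alone version of this unified-storage idea shows the aligned allocation and the reinterpreting accessors together. The class name and layout here are hypothetical simplifications; Cognica's actual ColumnData adds the type tag and null bitmap.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

class ColumnBuffer {
 public:
  explicit ColumnBuffer(size_t capacity) {
    // std::aligned_alloc requires the size to be a multiple of the
    // alignment, so round the byte count up to 64.
    size_t bytes = ((capacity * sizeof(int64_t) + 63) / 64) * 64;
    data_ = static_cast<int64_t*>(std::aligned_alloc(64, bytes));
  }
  ~ColumnBuffer() { std::free(data_); }

  ColumnBuffer(const ColumnBuffer&) = delete;
  ColumnBuffer& operator=(const ColumnBuffer&) = delete;

  auto as_int64() -> int64_t* { return data_; }
  auto as_double() -> double* { return reinterpret_cast<double*>(data_); }

 private:
  int64_t* data_;
};
```

Because int64_t and double are both 8 bytes, a single allocation serves every numeric type, and the accessor cast costs nothing at runtime.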

15.2.2 The ColumnBatch Structure

Multiple columns are grouped into a ColumnBatch representing a horizontal partition of a table:

class ColumnBatch {
  std::vector<ColumnData> columns_;   // Column storage
  SelectionVector selection_;         // Active row indices
  size_t row_capacity_;               // Maximum rows
  size_t row_count_;                  // Current rows
};

Default Batch Size: Cognica uses a default batch size of 1024 rows, chosen for several reasons:

  • Fits comfortably in L1 cache (1024 values * 8 bytes = 8KB per column)
  • Multiple of 64 for alignment with null bitmap words
  • Balances memory efficiency with processing throughput

Batch Size Selection: The optimal batch size depends on the workload:

B_{\text{opt}} = \min\left(\frac{L1_{\text{size}}}{C \cdot \text{sizeof}(\text{value})}, B_{\text{max}}\right)

where C is the number of columns accessed simultaneously. For a 32KB L1 cache processing 4 columns of 8-byte values, B_opt = 32768 / (4 × 8) = 1024.
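The formula transcribes directly into code; the parameter values in the check below are the chapter's example numbers:

```cpp
#include <algorithm>
#include <cstddef>

// B_opt = min(L1_size / (C * sizeof(value)), B_max)
size_t optimal_batch_size(size_t l1_bytes, size_t num_columns,
                          size_t value_bytes, size_t b_max) {
  return std::min(l1_bytes / (num_columns * value_bytes), b_max);
}
```

The B_max cap matters for wide caches or single-column scans, where the cache-derived value would otherwise exceed the 16-bit selection-vector index range.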

15.2.3 Null Handling with Bitmaps

SQL's three-valued logic requires tracking NULL values separately from data values. Cognica uses a compact bitmap representation:

class NullBitmap {
  std::vector<uint64_t> words_;  // 64 rows per word

  auto is_valid(size_t idx) const -> bool {
    return (words_[idx / 64] >> (idx % 64)) & 1;
  }

  void set_null(size_t idx) {
    words_[idx / 64] &= ~(1ULL << (idx % 64));
  }

  void set_valid(size_t idx) {
    words_[idx / 64] |= (1ULL << (idx % 64));
  }
};

Memory Efficiency: The bitmap uses only 1 bit per value, compared to 8 bits for a separate boolean column. For 1024 rows, the null bitmap requires only 128 bytes (16 words).

Bulk Operations: Combining null bitmaps for binary operations uses efficient bitwise operations:

void NullBitmap::intersect(const NullBitmap& other) {
  for (size_t i = 0; i < words_.size(); ++i) {
    words_[i] &= other.words_[i];  // Valid only if both valid
  }
}

The intersection operation processes 64 null flags per CPU instruction, yielding a 64× speedup over per-value null checking.
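The same word-at-a-time principle applies to counting valid rows, e.g. for a fast COUNT(col) over a batch. A portable sketch (using Kernighan's bit-clearing loop rather than assuming a popcount intrinsic):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Count valid (non-NULL) rows by scanning whole 64-bit bitmap words.
size_t count_valid(const std::vector<uint64_t>& words) {
  size_t n = 0;
  for (uint64_t w : words) {
    while (w != 0) {
      w &= (w - 1);  // Clear the lowest set bit.
      ++n;
    }
  }
  return n;
}
```

With hardware popcount (POPCNT on x86-64, CNT on ARM64) the inner loop collapses to one instruction per word, so a 1024-row batch needs only 16 instructions.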

15.2.4 String Column Handling

String columns require special treatment due to variable-length data:

class ColumnData {
  // For string columns:
  StringRef* string_refs_;                    // Lightweight references
  std::vector<std::string> string_storage_;   // Owned string data
};

StringRef Design: A StringRef is a non-owning reference consisting of a pointer and length:

struct StringRef {
  const char* data;
  size_t length;
};

This design enables zero-copy string comparisons and projections while maintaining ownership in the string_storage_ vector.
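A self-contained sketch of the zero-copy comparison this enables (StringRef as defined above; the equality helper and make_ref are illustrative, not Cognica API):

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

struct StringRef {
  const char* data;
  size_t length;
};

// Equality compares bytes in place; neither string is copied out of
// the owning storage vector.
bool ref_equals(const StringRef& a, const StringRef& b) {
  return a.length == b.length &&
         (a.length == 0 || std::memcmp(a.data, b.data, a.length) == 0);
}

StringRef make_ref(const std::string& s) {
  return StringRef{s.data(), s.size()};
}
```

Checking lengths first short-circuits most mismatches without touching the character data at all.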

15.3 SIMD Acceleration

15.3.1 Portable SIMD with xsimd

Cognica uses the xsimd library to provide portable SIMD abstractions across CPU architectures:

#include <xsimd/xsimd.hpp>

using arch_type = xsimd::default_arch;
using batch_int64 = xsimd::batch<int64_t, arch_type>;
using batch_float64 = xsimd::batch<double, arch_type>;

At compile time, xsimd automatically selects the best available instruction set:

Architecture       Instruction Set   Elements per Register
x86-64 (modern)    AVX-512           8 int64/double
x86-64 (common)    AVX2              4 int64/double
x86-64 (legacy)    SSE4.2            2 int64/double
ARM64              NEON              2 int64/double

15.3.2 SIMD Arithmetic Operations

Vectorized arithmetic processes multiple values per instruction:

void batch_add_int64(int64_t* dst, const int64_t* src1,
                     const int64_t* src2, size_t count) {
  constexpr size_t simd_width = batch_int64::size;  // 4 for AVX2

  // Process SIMD-width elements at a time
  size_t i = 0;
  for (; i + simd_width <= count; i += simd_width) {
    auto v1 = batch_int64::load_aligned(&src1[i]);
    auto v2 = batch_int64::load_aligned(&src2[i]);
    auto result = v1 + v2;
    result.store_aligned(&dst[i]);
  }

  // Scalar remainder
  for (; i < count; ++i) {
    dst[i] = src1[i] + src2[i];
  }
}

Performance Analysis: For AVX2 with 4-wide SIMD:

  • Scalar loop: 1024 iterations, 1024 add instructions
  • SIMD loop: 256 iterations, 256 vpaddq instructions
  • Theoretical speedup: 4× (limited by memory bandwidth in practice)

15.3.3 SIMD with Null Propagation

Real-world operations must handle NULL values correctly:

void batch_add_with_nulls(int64_t* dst, const int64_t* src1,
                          const int64_t* src2, size_t count,
                          const NullBitmap& null1, const NullBitmap& null2,
                          NullBitmap& null_out) {
  constexpr size_t simd_width = batch_int64::size;

  // Process 64 rows at a time (one null bitmap word)
  for (size_t word = 0; word < count / 64; ++word) {
    // Combine null bitmaps: result valid only if both inputs valid
    uint64_t valid_mask = null1.get_word(word) & null2.get_word(word);
    null_out.set_word(word, valid_mask);

    // SIMD arithmetic within the 64-element block
    size_t base = word * 64;
    for (size_t j = 0; j + simd_width <= 64; j += simd_width) {
      auto v1 = batch_int64::load_aligned(&src1[base + j]);
      auto v2 = batch_int64::load_aligned(&src2[base + j]);
      auto result = v1 + v2;
      result.store_aligned(&dst[base + j]);
    }
  }
}

The null bitmap combination operates on 64 values per instruction, perfectly aligned with the SIMD processing granularity.

15.3.4 SIMD Comparison Operations

Comparisons in vectorized execution produce selection vectors rather than boolean columns:

void batch_cmp_gt_int64(SelectionVector& sel, const int64_t* src,
                        int64_t constant, size_t count,
                        const NullBitmap& nulls) {
  constexpr size_t simd_width = batch_int64::size;
  auto const_vec = batch_int64::broadcast(constant);

  sel.clear();

  // Process full SIMD-width groups
  size_t i = 0;
  for (; i + simd_width <= count; i += simd_width) {
    auto values = batch_int64::load_aligned(&src[i]);
    auto cmp_mask = values > const_vec;  // SIMD comparison

    // Extract matching indices
    alignas(32) bool results[simd_width];
    cmp_mask.store_aligned(results);

    for (size_t j = 0; j < simd_width; ++j) {
      if (results[j] && nulls.is_valid(i + j)) {
        sel.push_back(static_cast<uint16_t>(i + j));
      }
    }
  }

  // Scalar remainder for counts not divisible by the SIMD width
  for (; i < count; ++i) {
    if (src[i] > constant && nulls.is_valid(i)) {
      sel.push_back(static_cast<uint16_t>(i));
    }
  }
}

15.4 Selection Vectors

15.4.1 Concept and Motivation

A selection vector stores the indices of rows that satisfy a predicate, enabling zero-copy filtering:

class SelectionVector {
  std::vector<uint16_t> indices_;  // Sorted row indices

  auto size() const -> size_t { return indices_.size(); }
  auto operator[](size_t i) const -> uint16_t { return indices_[i]; }
};

Why uint16_t? With a maximum batch size of 65,536 rows, 16-bit indices suffice while minimizing memory usage. For the typical 1024-row batch, a selection vector uses at most 2KB.

Zero-Copy Advantage: Traditional filtering materializes qualifying rows into a new buffer:

// Traditional: O(selectivity * n) memory copies
std::vector<Row> filtered;
for (const auto& row : input) {
  if (row.age > 30) {
    filtered.push_back(row);  // Copy!
  }
}

Selection vectors avoid copying entirely:

// Selection vector: O(selectivity * n) index stores
SelectionVector sel;
for (size_t i = 0; i < count; ++i) {
  if (ages[i] > 30) {
    sel.push_back(i);  // Just store index
  }
}

15.4.2 Selection Vector Operations

Creation from Predicate:

auto SelectionVector::from_predicate(
    const int64_t* values, size_t count,
    std::function<bool(int64_t)> predicate) -> SelectionVector {
  SelectionVector result;
  for (size_t i = 0; i < count; ++i) {
    if (predicate(values[i])) {
      result.indices_.push_back(static_cast<uint16_t>(i));
    }
  }
  return result;
}

All-Selected Initialization:

auto SelectionVector::all_selected(size_t count) -> SelectionVector {
  SelectionVector result;
  result.indices_.reserve(count);
  for (size_t i = 0; i < count; ++i) {
    result.indices_.push_back(static_cast<uint16_t>(i));
  }
  return result;
}

15.4.3 Boolean Operations on Selection Vectors

Compound predicates require combining selection vectors:

AND (Intersection): Using two-pointer merge on sorted indices:

auto selection_and(const SelectionVector& a, const SelectionVector& b)
    -> SelectionVector {
  SelectionVector result;
  size_t i = 0, j = 0;

  while (i < a.size() && j < b.size()) {
    if (a[i] < b[j]) {
      ++i;
    } else if (a[i] > b[j]) {
      ++j;
    } else {
      result.push_back(a[i]);
      ++i;
      ++j;
    }
  }
  return result;
}

Complexity: O(|a| + |b|), exploiting the sorted property.

OR (Union): Merge two sorted vectors:

auto selection_or(const SelectionVector& a, const SelectionVector& b)
    -> SelectionVector {
  SelectionVector result;
  size_t i = 0, j = 0;

  while (i < a.size() && j < b.size()) {
    if (a[i] < b[j]) {
      result.push_back(a[i++]);
    } else if (a[i] > b[j]) {
      result.push_back(b[j++]);
    } else {
      result.push_back(a[i]);
      ++i;
      ++j;
    }
  }
  while (i < a.size()) result.push_back(a[i++]);
  while (j < b.size()) result.push_back(b[j++]);

  return result;
}

NOT (Complement):

auto selection_not(const SelectionVector& sel, size_t total_count)
    -> SelectionVector {
  SelectionVector result;
  size_t j = 0;

  for (size_t i = 0; i < total_count; ++i) {
    if (j < sel.size() && sel[j] == i) {
      ++j;  // Skip this index
    } else {
      result.push_back(static_cast<uint16_t>(i));
    }
  }
  return result;
}
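These combinators obey De Morgan's laws, which a query rewriter can rely on when normalizing predicates. An equivalent formulation using the standard library's sorted-range algorithms (a sanity-check sketch, not the engine's implementation):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

using Sel = std::vector<uint16_t>;  // Sorted row indices.

Sel sel_and(const Sel& a, const Sel& b) {
  Sel r;
  std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                        std::back_inserter(r));
  return r;
}

Sel sel_or(const Sel& a, const Sel& b) {
  Sel r;
  std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                 std::back_inserter(r));
  return r;
}

Sel sel_not(const Sel& s, uint16_t total) {
  Sel r;
  size_t j = 0;
  for (uint16_t i = 0; i < total; ++i) {
    if (j < s.size() && s[j] == i) ++j;  // Skip selected index.
    else r.push_back(i);
  }
  return r;
}
```

Because all three preserve sortedness, they compose freely: NOT(a AND b) and NOT(a) OR NOT(b) produce identical index vectors.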

15.4.4 Applying Selection Vectors

When data must finally be materialized (for output or complex operations), the selection vector guides extraction:

Gather Operation: Extract selected values into a compact array:

void gather_int64(int64_t* dst, const int64_t* src,
                  const SelectionVector& sel) {
  for (size_t i = 0; i < sel.size(); ++i) {
    dst[i] = src[sel[i]];
  }
}

SIMD Gather (AVX2/AVX-512): Modern CPUs provide hardware gather instructions:

void gather_int64_simd(int64_t* dst, const int64_t* src,
                       const SelectionVector& sel) {
  // AVX2/AVX-512 provide hardware gather (e.g., vpgatherqq), but its
  // benefit is limited in practice; for small selections the scalar
  // loop above is often faster, so it serves as the fallback here.
  gather_int64(dst, src, sel);
}

15.5 Vectorized Opcodes

15.5.1 Extended Opcode Space

Cognica's CVM reserves the extended opcode range (prefix 0xFE) for vectorized operations. Batch opcodes occupy codes 0x20 through 0x67:

enum class ExtendedOpcode : uint8_t {
  // Batch Scan Operations (0x20-0x27)
  BATCH_SCAN_OPEN   = 0x20,
  BATCH_SCAN_NEXT   = 0x21,
  BATCH_SCAN_CLOSE  = 0x22,
  BATCH_EMIT        = 0x23,

  // Batch Integer Arithmetic (0x28-0x2F)
  BATCH_ADD_I64     = 0x28,
  BATCH_SUB_I64     = 0x29,
  BATCH_MUL_I64     = 0x2A,
  BATCH_DIV_I64     = 0x2B,
  BATCH_MOD_I64     = 0x2C,
  BATCH_NEG_I64     = 0x2D,
  BATCH_ABS_I64     = 0x2E,

  // Batch Float Arithmetic (0x30-0x37)
  BATCH_ADD_F64     = 0x30,
  BATCH_SUB_F64     = 0x31,
  BATCH_MUL_F64     = 0x32,
  BATCH_DIV_F64     = 0x33,
  BATCH_NEG_F64     = 0x34,
  BATCH_ABS_F64     = 0x35,
  BATCH_SQRT_F64    = 0x36,

  // Batch Comparisons (0x38-0x47)
  BATCH_CMP_EQ_I64  = 0x38,
  BATCH_CMP_NE_I64  = 0x39,
  BATCH_CMP_LT_I64  = 0x3A,
  BATCH_CMP_LE_I64  = 0x3B,
  BATCH_CMP_GT_I64  = 0x3C,
  BATCH_CMP_GE_I64  = 0x3D,
  BATCH_CMP_EQ_F64  = 0x40,
  BATCH_CMP_LT_F64  = 0x42,
  // ... more comparison opcodes

  // Batch Logical Operations (0x48-0x4F)
  BATCH_AND         = 0x48,
  BATCH_OR          = 0x49,
  BATCH_NOT         = 0x4A,
  BATCH_IS_NULL     = 0x4B,
  BATCH_SAVE_SEL    = 0x4C,
  BATCH_CLEAR_SEL   = 0x4D,

  // Batch Data Movement (0x50-0x57)
  BATCH_SELECT      = 0x50,
  BATCH_GATHER      = 0x51,
  BATCH_SCATTER     = 0x52,
  BATCH_PROJECT     = 0x53,
  BATCH_FILTER      = 0x57,

  // Batch Hash Operations (0x58-0x5B)
  BATCH_HASH        = 0x58,
  BATCH_HASH_COMB   = 0x59,
  BATCH_HASH_PROBE  = 0x5A,
  BATCH_HASH_BUILD  = 0x5B,

  // Parallel Operations (0x5C-0x67)
  PAR_SCAN_OPEN     = 0x5C,
  PAR_SCAN_NEXT     = 0x5D,
  PAR_PARTITION     = 0x60,
  PAR_MERGE         = 0x61,
  PAR_BARRIER       = 0x64,
};

15.5.2 Batch Scan Operations

BATCH_SCAN_OPEN: Opens a columnar scan iterator:

Encoding: [0xFE] [0x20] [cursor_slot] [collection_idx]
Semantics: Opens collection for batch scanning, stores handle in cursor_slot

BATCH_SCAN_NEXT: Fetches the next batch:

Encoding: [0xFE] [0x21] [cursor_slot] [batch_reg]
Semantics: Reads up to 1024 rows into batch register
           Sets status flag: 1 if more data, 0 if exhausted

BATCH_EMIT: Outputs a batch to the result callback:

Encoding: [0xFE] [0x23] [batch_reg]
Semantics: Sends batch to registered output handler
           Applies current selection vector before emission

15.5.3 Batch Arithmetic Semantics

Batch arithmetic operations follow element-wise semantics:

BATCH_ADD_I64 Rd, Rs1, Rs2

For i in 0..batch_size:
  if selection.contains(i):
    Rd.data[i] = Rs1.data[i] + Rs2.data[i]
    Rd.null[i] = Rs1.null[i] AND Rs2.null[i]

The operation respects both the selection vector (only computing selected rows) and null bitmaps (propagating nulls).
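A scalar rendering of these semantics, with null bitmaps modeled as per-row booleans for brevity (the real kernel works on 64-bit bitmap words as shown in Section 15.3.3):

```cpp
#include <cstdint>
#include <vector>

// BATCH_ADD_I64 semantics: compute only the selected rows and AND
// the validity flags; unselected rows are left untouched.
void batch_add_selected(int64_t* dst,
                        const int64_t* src1, const int64_t* src2,
                        const std::vector<uint16_t>& selection,
                        const std::vector<bool>& valid1,
                        const std::vector<bool>& valid2,
                        std::vector<bool>& valid_out) {
  for (uint16_t i : selection) {
    dst[i] = src1[i] + src2[i];
    valid_out[i] = valid1[i] && valid2[i];
  }
}
```

Skipping unselected rows means a highly selective upstream filter makes downstream arithmetic proportionally cheaper.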

15.5.4 Batch Comparison Semantics

Unlike scalar comparisons that produce boolean values, batch comparisons update the selection vector:

BATCH_CMP_GT_I64 Rs1, Rs2

new_selection = {}
For i in 0..batch_size:
  if current_selection.contains(i):
    if Rs1.data[i] > Rs2.data[i] AND !Rs1.null[i] AND !Rs2.null[i]:
      new_selection.add(i)
current_selection = new_selection

This design enables predicate chaining without intermediate materialization:

BATCH_CMP_GT_I64 age_col, const_30    ; age > 30
BATCH_CMP_LT_I64 salary_col, const_100000  ; AND salary < 100000
; Selection now contains rows where both predicates hold

15.5.5 Selection Vector Management

BATCH_SAVE_SEL: Saves the current selection for later OR operations:

BATCH_SAVE_SEL batch_reg
; Copies current_selection to batch_reg.selection

BATCH_OR: Performs OR by merging saved and current selections:

BATCH_OR batch_reg
; current_selection = union(current_selection, batch_reg.selection)

Example for age > 30 OR salary < 50000:

BATCH_CMP_GT_I64 age_col, const_30     ; Evaluate first predicate
BATCH_SAVE_SEL temp_batch              ; Save selection
BATCH_CLEAR_SEL                        ; Reset to all rows
BATCH_CMP_LT_I64 salary_col, const_50k ; Evaluate second predicate
BATCH_OR temp_batch                    ; Merge selections
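A scalar walk-through of this opcode sequence makes the save/clear/merge dance concrete (illustrative code; the comments name the opcode each step mirrors):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Evaluates "age > 30 OR salary < 50000" the way the bytecode does.
std::vector<uint16_t> eval_or(const std::vector<int64_t>& age,
                              const std::vector<int64_t>& salary) {
  std::vector<uint16_t> saved, current, merged;
  // BATCH_CMP_GT_I64 age_col, const_30
  for (uint16_t i = 0; i < age.size(); ++i)
    if (age[i] > 30) current.push_back(i);
  // BATCH_SAVE_SEL: stash the first predicate's selection
  saved = current;
  // BATCH_CLEAR_SEL (all rows), then BATCH_CMP_LT_I64 salary_col
  current.clear();
  for (uint16_t i = 0; i < salary.size(); ++i)
    if (salary[i] < 50000) current.push_back(i);
  // BATCH_OR: union of the two sorted index lists
  size_t a = 0, b = 0;
  while (a < saved.size() || b < current.size()) {
    if (b == current.size() || (a < saved.size() && saved[a] < current[b]))
      merged.push_back(saved[a++]);
    else if (a == saved.size() || current[b] < saved[a])
      merged.push_back(current[b++]);
    else { merged.push_back(saved[a]); ++a; ++b; }
  }
  return merged;
}
```

Note that BATCH_CLEAR_SEL is essential: without it, the second comparison would only inspect rows that already passed the first predicate, silently turning the OR into an AND.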

15.6 The Vectorized Interpreter

15.6.1 VectorizedContext

The vectorized interpreter maintains specialized state:

class VectorizedContext {
  // Batch registers (32 slots)
  std::vector<std::unique_ptr<ColumnBatch>> batch_registers_;

  // Current selection state
  SelectionVector current_selection_;

  // Cursor slot mapping (batch slot -> VM cursor slot)
  std::array<int8_t, kMaxBatchCursorSlots> cursor_slots_;

  // Output callback
  BatchEmitCallback batch_emit_callback_;

  // Hash tables for joins
  std::array<std::unique_ptr<VectorizedHashTable>,
             kMaxBatchRegisters> hash_tables_;
};

Batch Registers: 32 registers hold ColumnBatch objects, analogous to scalar VM registers but for columnar data.

Cursor Mapping: Batch cursor slots map to underlying VM cursor slots, enabling the vectorized interpreter to leverage existing cursor infrastructure.

15.6.2 Dispatch Loop

The vectorized interpreter uses computed goto dispatch, similar to the scalar interpreter:

void VectorizedInterpreter::dispatch_loop() {
  static void* dispatch_table[] = {
    [BATCH_SCAN_OPEN]  = &&op_batch_scan_open,
    [BATCH_SCAN_NEXT]  = &&op_batch_scan_next,
    [BATCH_ADD_I64]    = &&op_batch_add_i64,
    [BATCH_CMP_GT_I64] = &&op_batch_cmp_gt_i64,
    // ... 48 batch opcodes
  };

  DISPATCH_NEXT;

op_batch_add_i64:
  {
    uint8_t dst = fetch_byte();
    uint8_t src1 = fetch_byte();
    uint8_t src2 = fetch_byte();

    auto& dst_batch = batch_registers_[dst];
    auto& src1_batch = batch_registers_[src1];
    auto& src2_batch = batch_registers_[src2];

    batch_add_int64(
      dst_batch->column(0).as_int64(),
      src1_batch->column(0).as_int64(),
      src2_batch->column(0).as_int64(),
      dst_batch->row_count(),
      src1_batch->column(0).null_bitmap(),
      src2_batch->column(0).null_bitmap(),
      dst_batch->column(0).null_bitmap()
    );

    DISPATCH_NEXT;
  }

op_batch_cmp_gt_i64:
  {
    uint8_t src1 = fetch_byte();
    uint8_t src2 = fetch_byte();

    // Register-register overload of batch_cmp_gt_int64; the
    // constant-comparand form is shown in Section 15.3.4
    batch_cmp_gt_int64(
      current_selection_,
      batch_registers_[src1]->column(0).as_int64(),
      batch_registers_[src2]->column(0).as_int64(),
      batch_registers_[src1]->row_count(),
      batch_registers_[src1]->column(0).null_bitmap(),
      batch_registers_[src2]->column(0).null_bitmap()
    );

    DISPATCH_NEXT;
  }

  // ... more opcode handlers
}

15.6.3 Hybrid Execution

Some operations cannot be fully vectorized and require falling back to scalar processing. Cognica implements a hybrid execution model.

Batch-to-Scalar Conversion:

void emit_batch_to_scalar(const ColumnBatch& batch,
                          const SelectionVector& sel,
                          DocumentCallback callback) {
  for (size_t i = 0; i < sel.size(); ++i) {
    uint16_t row_idx = sel[i];
    Document doc = extract_row(batch, row_idx);
    callback(std::move(doc));
  }
}

Vectorization Eligibility: The compiler analyzes pipeline stages to determine vectorizability:

Stage Type                    Vectorizable   Reason
Filter (simple comparison)    Yes            Direct SIMD implementation
Project (column selection)    Yes            Zero-copy column reference
Sort                          Partial        Scan vectorized, sort scalar
Group/Aggregate               Partial        Hash build vectorized, aggregate scalar
Script (Lua/Python)           No             Requires row-at-a-time callbacks
Full-Text Search              No             Complex scoring logic

15.7 Vectorized Hash Operations

15.7.1 Hash Table Building

Hash joins in vectorized mode build hash tables batch-at-a-time:

void batch_hash_build(VectorizedHashTable& ht,
                      const ColumnBatch& keys,
                      const ColumnBatch& payloads,
                      const SelectionVector& sel) {
  // Compute hashes for all keys in batch
  std::vector<uint64_t> hashes(sel.size());
  batch_hash_int64(hashes.data(),
                   keys.column(0).as_int64(),
                   sel);

  // Insert into hash table
  for (size_t i = 0; i < sel.size(); ++i) {
    uint16_t row_idx = sel[i];
    uint64_t hash = hashes[i];
    ht.insert(hash, row_idx, keys, payloads);
  }
}

Hash Function: Cognica uses FNV-1a hashing for its simplicity and good distribution:

void batch_hash_int64(uint64_t* hashes, const int64_t* values,
                      const SelectionVector& sel) {
  constexpr uint64_t FNV_PRIME = 0x100000001b3;
  constexpr uint64_t FNV_OFFSET = 0xcbf29ce484222325;

  for (size_t i = 0; i < sel.size(); ++i) {
    uint64_t h = FNV_OFFSET;
    const auto* bytes = reinterpret_cast<const uint8_t*>(
      &values[sel[i]]);
    for (size_t j = 0; j < sizeof(int64_t); ++j) {
      h ^= bytes[j];
      h *= FNV_PRIME;
    }
    hashes[i] = h;
  }
}
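Pulled out as a stand-alone function, the per-value hash is easy to spot-check for determinism and dispersion:

```cpp
#include <cstdint>
#include <cstring>

// FNV-1a over the 8 bytes of an int64, matching the loop in
// batch_hash_int64 above.
uint64_t fnv1a_int64(int64_t value) {
  constexpr uint64_t kPrime = 0x100000001b3ULL;
  constexpr uint64_t kOffset = 0xcbf29ce484222325ULL;
  uint64_t h = kOffset;
  uint8_t bytes[sizeof(int64_t)];
  std::memcpy(bytes, &value, sizeof(bytes));  // Avoid aliasing UB.
  for (uint8_t b : bytes) {
    h ^= b;
    h *= kPrime;
  }
  return h;
}
```

Using memcpy rather than the reinterpret_cast in the batch kernel sidesteps strict-aliasing concerns; compilers emit the same code for both.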

15.7.2 Hash Table Probing

Probing processes a batch of keys against the hash table:

void batch_hash_probe(SelectionVector& matches,
                      std::vector<uint32_t>& match_positions,
                      const VectorizedHashTable& ht,
                      const ColumnBatch& probe_keys,
                      const SelectionVector& sel) {
  matches.clear();
  match_positions.clear();

  std::vector<uint64_t> hashes(sel.size());
  batch_hash_int64(hashes.data(),
                   probe_keys.column(0).as_int64(),
                   sel);

  for (size_t i = 0; i < sel.size(); ++i) {
    uint16_t row_idx = sel[i];
    uint64_t hash = hashes[i];

    auto it = ht.find(hash, probe_keys.column(0).as_int64()[row_idx]);
    if (it != ht.end()) {
      matches.push_back(row_idx);
      match_positions.push_back(it->build_row_idx);
    }
  }
}

15.8 Performance Characteristics

15.8.1 Theoretical Speedup

The vectorized execution speedup depends on several factors:

S = \frac{T_{\text{scalar}}}{T_{\text{vectorized}}} = \frac{n \cdot (C_d + C_{op})}{(n/B) \cdot C_d + n \cdot C_{op}/W}

where:

  • n = number of rows
  • B = batch size (1024)
  • C_d = dispatch cost per operation
  • C_op = computation cost per value
  • W = SIMD width (4 for AVX2, 8 for AVX-512)

For large n and small C_op (simple operations):

S \approx \frac{C_d + C_{op}}{C_d/B + C_{op}/W} = \frac{C_d/C_{op} + 1}{C_d/(B \cdot C_{op}) + 1/W}

With C_d/C_op = 10 (typical for interpreted execution), B = 1024, and W = 4:

S \approx \frac{10 + 1}{10/1024 + 0.25} \approx \frac{11}{0.26} \approx 42

In practice, memory bandwidth limits speedup to 3-10x for most workloads.

15.8.2 Selectivity Impact

Predicate selectivity significantly affects vectorized performance.

High Selectivity (few rows pass): Selection vectors remain small, but SIMD comparison still processes all values. Speedup is limited because most computation is "wasted" on non-qualifying rows.

Low Selectivity (most rows pass): Selection vectors are large, and subsequent operations benefit from dense SIMD processing. Speedup approaches the theoretical maximum.

The selectivity-adjusted speedup is:

S_{\text{adjusted}} = S \cdot \sigma + (1 - \sigma) \cdot S_{\text{filter}}

where σ is the selectivity and S_filter is the filter-only speedup.

15.8.3 Memory Bandwidth Considerations

For memory-bound operations, vectorization primarily improves cache utilization:

\text{Effective Bandwidth} = \frac{\text{Data Processed}}{\text{Cache Misses} \cdot \text{Miss Latency}}

Columnar layout enables sequential access patterns that maximize prefetcher effectiveness:

Access Pattern   Cache Behavior   Prefetch Benefit
Row-at-a-time    Random access    Poor
Column batch     Sequential       Excellent

With effective prefetching, memory latency is hidden and processing becomes compute-bound, where SIMD parallelism provides maximum benefit.

15.9 Implementation Example

15.9.1 Vectorized Filter Pipeline

Consider the query:

SELECT name, salary FROM employees WHERE age > 30 AND salary < 100000

Bytecode Generation:

; Initialize
BATCH_SCAN_OPEN 0, employees_collection

loop:
  ; Fetch next batch
  BATCH_SCAN_NEXT 0, BR0           ; BR0 = batch register 0
  JZ end                           ; Exit if no more batches

  ; Reset selection to all rows
  BATCH_CLEAR_SEL

  ; Extract columns
  BATCH_PROJECT BR1, BR0, age_col  ; BR1 = age column
  BATCH_PROJECT BR2, BR0, salary   ; BR2 = salary column

  ; Load constants
  BATCH_CONST_I64 BR3, 30          ; BR3 = constant 30
  BATCH_CONST_I64 BR4, 100000      ; BR4 = constant 100000

  ; Evaluate predicates
  BATCH_CMP_GT_I64 BR1, BR3        ; selection = rows where age > 30
  BATCH_CMP_LT_I64 BR2, BR4        ; selection &= rows where salary < 100000

  ; Project output columns
  BATCH_PROJECT BR5, BR0, name_col
  BATCH_PROJECT BR6, BR0, salary

  ; Emit results (applies selection automatically)
  BATCH_EMIT BR5
  BATCH_EMIT BR6

  JMP loop

end:
  BATCH_SCAN_CLOSE 0
  RET

15.9.2 Execution Trace

For a batch of 1024 rows with 300 qualifying:

  1. BATCH_SCAN_NEXT: Loads 1024 rows into BR0 (columnar format)
  2. BATCH_CLEAR_SEL: Selection = {0, 1, 2, ..., 1023}
  3. BATCH_PROJECT: Zero-copy column references
  4. BATCH_CMP_GT_I64: SIMD comparison, selection = {qualifying age rows}
  5. BATCH_CMP_LT_I64: SIMD comparison + intersection, selection = {300 rows}
  6. BATCH_EMIT: Outputs 300 rows using selection vector

Performance Metrics:

  • Dispatch overhead: 10 opcodes per batch (vs. 10,240 for row-at-a-time)
  • SIMD operations: 512 vector comparisons with 4-wide AVX2 (vs. 2,048 scalar)
  • Memory copies: 0 (selection vector avoids materialization)

15.10 Summary

Vectorized execution represents a fundamental shift in query processing, replacing the elegant but inefficient row-at-a-time model with batch-oriented, SIMD-accelerated processing. The key concepts covered in this chapter are:

Columnar Data Layout: The ColumnBatch and ColumnData structures organize data by column rather than row, enabling SIMD parallelism and improved cache utilization. Memory alignment to 64 bytes ensures optimal performance on modern CPUs.

Selection Vectors: Instead of materializing filtered rows, selection vectors track qualifying row indices. This zero-copy approach eliminates data movement and enables efficient predicate chaining.

SIMD Acceleration: Using the xsimd library, Cognica achieves portable SIMD execution across x86-64 (SSE4.2/AVX2/AVX-512) and ARM64 (NEON) platforms. Operations process 2-8 values per instruction.

Vectorized Opcodes: The CVM extended opcode space (0x20-0x67) provides 48 batch-oriented instructions covering scans, arithmetic, comparisons, logical operations, and hash operations.

Null Handling: Compact null bitmaps (1 bit per value) enable efficient three-valued logic with bulk AND/OR operations processing 64 null flags per instruction.

Hybrid Execution: When full vectorization is not possible (sorting, complex expressions), the system gracefully falls back to scalar processing while maintaining vectorized scans.

The combination of these techniques yields 3-10x performance improvements for analytical workloads, making vectorized execution essential for modern database systems processing large datasets.

References

  1. Boncz, P., Zukowski, M., & Nes, N. (2005). MonetDB/X100: Hyper-Pipelining Query Execution. CIDR.
  2. Polychroniou, O., Raghavan, A., & Ross, K. A. (2015). Rethinking SIMD Vectorization for In-Memory Databases. SIGMOD.
  3. Kersten, T., Leis, V., Kemper, A., Neumann, T., Pavlo, A., & Boncz, P. (2018). Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask. PVLDB.
  4. Intel. (2023). Intel Intrinsics Guide. https://software.intel.com/sites/landingpage/IntrinsicsGuide/
  5. xsimd Documentation. https://xsimd.readthedocs.io/

Copyright (c) 2023-2026 Cognica, Inc.