Chapter 17: Zero-Copy JOIN Implementation
17.1 Introduction
JOIN operations are fundamental to relational database systems, combining rows from multiple tables based on related columns. However, naive JOIN implementations suffer from significant memory overhead—materializing intermediate results for every matching pair of rows consumes memory proportional to the join cardinality, which can be enormous for large tables.
Cognica implements a zero-copy JOIN strategy that avoids materializing intermediate results until absolutely necessary. Instead of copying row data between operators, the system uses lightweight wrapper objects that reference original documents through pointers. This approach reduces memory consumption by orders of magnitude while maintaining the familiar SQL semantics.
This chapter explores Cognica's zero-copy JOIN implementation, covering the composite row abstraction, JOIN algorithms, bytecode generation, and memory management strategies for handling datasets larger than available memory.
17.1.1 The Cost of Materialization
Consider a JOIN between two tables with N and M rows respectively, producing |R| result rows. Traditional implementations materialize each joined row:
Memory Cost Analysis:
Memory ≈ |R| × (s_L + s_R)
where s_L and s_R are the average row sizes. For a JOIN with high selectivity (many matches), this can quickly exceed available memory.
Example: Joining a 1M-row users table with a 10M-row orders table where each user has 10 orders on average:
- Result cardinality: 1M users × 10 orders each = 10M rows
- Average row size: 200 bytes per side
- Memory required: 10M × (200 + 200) bytes = 4 GB
This memory pressure forces expensive disk spills and degrades performance significantly.
17.1.2 The Zero-Copy Insight
The key insight is that most JOIN result columns come from source tables unchanged—we simply select which columns to include in the output. Rather than copying data, we can:
- Reference the original documents through pointers
- Defer materialization until output is required
- Reuse composite row structures across iterations
The composite row stores only pointers (8 bytes each) plus metadata, reducing memory from gigabytes to kilobytes for the working set.
17.2 The CompositeRow Abstraction
17.2.1 Design Overview
The CompositeRow class represents a joined row as a collection of references to source documents:
class CompositeRow {
  std::vector<AliasedDocument> slots_;

public:
  void add_slot(std::string_view alias, const Document* doc);
  void clear();

  auto get_field(std::string_view qualified_name) const -> Value;
  auto get_field_by_slot(size_t slot, std::string_view field) const -> Value;

  auto materialize(const ColumnMappingList& mappings) const -> Document;
  auto materialize_all() const -> Document;
  auto materialize_all_qualified() const -> Document;
};
AliasedDocument: A lightweight wrapper pairing a table alias with a document pointer:
struct AliasedDocument {
  std::string_view alias;  // Table alias (e.g., "u", "o")
  const Document* doc;     // Pointer to source document
  // Total size: 24 bytes (16 for string_view + 8 for pointer)
};
17.2.2 Memory Layout
For a two-table JOIN, the CompositeRow contains just two slots:
CompositeRow for: SELECT u.id, u.name, o.amount FROM users u JOIN orders o ON u.id = o.user_id
Total overhead: ~64 bytes (two 24-byte slots plus vector metadata) vs. potentially several KB for a materialized row
17.2.3 Field Access
Field access in CompositeRow supports two modes:
Qualified Name Lookup: Search through slots for matching alias:
auto CompositeRow::get_field(std::string_view qualified_name) const -> Value {
  // Parse "alias.field" format
  auto dot_pos = qualified_name.find('.');
  auto alias = qualified_name.substr(0, dot_pos);
  auto field = qualified_name.substr(dot_pos + 1);

  // Search slots for matching alias
  for (const auto& slot : slots_) {
    if (slot.alias == alias) {
      return slot.doc->get(field);
    }
  }
  return Value::null();
}
Complexity: O(k), where k is the number of slots (typically 2-5).
Direct Slot Access: When the slot index is known at compile time:
auto CompositeRow::get_field_by_slot(size_t slot_idx,
                                     std::string_view field) const -> Value {
  return slots_[slot_idx].doc->get(field);
}
Complexity: O(1)—preferred when the planner can resolve slot indices.
17.2.4 Deferred Materialization
Materialization creates a concrete Document from the composite references:
auto CompositeRow::materialize(const ColumnMappingList& mappings) const
    -> Document {
  Document result;
  for (const auto& mapping : mappings.mappings) {
    // Get value from source slot
    const auto& slot = slots_[mapping.source_slot];
    Value value = slot.doc->get(mapping.source_field);

    // Add to result with mapped name
    result.set(mapping.output_name, std::move(value));
  }
  return result;
}
Materialization is triggered only when:
- Output must be emitted to the client
- Data must persist beyond cursor lifetime
- Working tables require owned copies
- Column reordering or renaming is needed
17.3 Column Mapping
17.3.1 The ColumnMappingList Structure
Column mappings define how source fields map to output columns:
struct ColumnMapping {
  uint8_t source_slot;                  // Index into CompositeRow slots
  std::string source_field;             // Field name in source document
  std::string output_name;              // Output column name
  std::optional<uint16_t> field_index;  // Optimization: cached field index
};

struct ColumnMappingList {
  std::vector<ColumnMapping> mappings;
};
17.3.2 Mapping Example
For the query:
SELECT u.id AS user_id, u.name, o.amount, o.created_at
FROM users u
JOIN orders o ON u.id = o.user_id
The ColumnMappingList contains:
| Index | source_slot | source_field | output_name |
|---|---|---|---|
| 0 | 0 (users) | id | user_id |
| 1 | 0 (users) | name | name |
| 2 | 1 (orders) | amount | amount |
| 3 | 1 (orders) | created_at | created_at |
17.3.3 Handling Name Collisions
When both tables have columns with the same name, qualified names resolve ambiguity:
auto CompositeRow::materialize_all_qualified() const -> Document {
  Document result;
  for (const auto& slot : slots_) {
    for (const auto& [field, value] : *slot.doc) {
      // Prefix field with alias: "u.id", "o.id"
      std::string qualified_name = std::string(slot.alias) + "." + field;
      result.set(qualified_name, value);
    }
  }
  return result;
}
17.4 JOIN Algorithms
17.4.1 Hash Join
Hash join is the preferred algorithm for equi-joins on large tables. It operates in two phases:
Build Phase: Scan the build side (typically smaller table), hash the join key, and store document pointers in a hash table.
Probe Phase: Scan the probe side (typically larger table), compute the hash, look up matching documents, and emit composite rows.
Hash Table Structure:
struct HashTable {
  std::unordered_multimap<int64_t, Document*> entries;
  bool active = false;
};
The multimap supports multiple documents with the same key hash, handling duplicate join keys correctly. Because entries are keyed by hash rather than by value, a collision-safe probe must also re-check the actual join key against each candidate document.
Build Phase Implementation:
void hash_table_insert(VMContext& ctx, uint8_t slot,
                       const VMValue& key, Document* doc) {
  int64_t hash = compute_hash(key);
  ctx.hash_tables_[slot].entries.insert({hash, doc});
}
Probe Phase Implementation:
auto hash_table_probe(VMContext& ctx, uint8_t slot,
                      const VMValue& key) -> Document* {
  int64_t hash = compute_hash(key);
  auto it = ctx.hash_tables_[slot].entries.find(hash);
  if (it != ctx.hash_tables_[slot].entries.end()) {
    return it->second;  // First matching document; callers that need every
                        // duplicate iterate equal_range(hash) instead
  }
  return nullptr;
}
Complexity Analysis:
| Phase | Time | Space |
|---|---|---|
| Build | O(N) | O(N) pointers |
| Probe | O(M) average | O(1) working set |
| Total | O(N + M) | O(N) |
where N is the build side cardinality and M is the probe side cardinality.
17.4.2 Nested Loop Join
Nested loop join is used for small tables or when index-based lookups are available:
// Outer loop
for (const auto& outer_doc : outer_cursor) {
  // Inner loop - reset for each outer row
  inner_cursor.reset();
  for (const auto& inner_doc : inner_cursor) {
    if (evaluate_join_condition(outer_doc, inner_doc)) {
      CompositeRow composite;
      composite.add_slot(outer_alias, &outer_doc);
      composite.add_slot(inner_alias, &inner_doc);
      emit(composite);
    }
  }
}
Complexity: O(N × M), where N is outer cardinality and M is inner cardinality.
When to Use:
- Small inner table (fits in cache)
- Index available on inner table
- Complex join conditions that cannot use hash lookup
- Cross joins (no join condition)
17.4.3 Build Side Selection
The optimizer selects the build side based on cardinality estimates:
The smaller relation becomes the build side because:
- Hash table memory is proportional to build side size
- Smaller hash table has better cache locality
- Probe phase processes more data in streaming fashion
auto select_build_side(const JoinNode& join) -> size_t {
  auto left_card = estimate_cardinality(join.left);
  auto right_card = estimate_cardinality(join.right);
  return (left_card <= right_card) ? 0 : 1;  // 0 = left, 1 = right
}
17.5 CVM JOIN Opcodes
17.5.1 Composite Row Operations
The CVM provides dedicated opcodes for composite row manipulation:
| Opcode | Code | Description |
|---|---|---|
| COMPOSITE_NEW | 0x13 | Create empty CompositeRow |
| COMPOSITE_ADD_SLOT | 0x14 | Add document reference with alias |
| COMPOSITE_GET_FIELD | 0x15 | Get field by qualified name |
| COMPOSITE_GET_SLOT | 0x16 | Get field by slot index + field |
| COMPOSITE_MATERIALIZE | 0x17 | Create Document with mappings |
| COMPOSITE_EMIT | 0x18 | Emit composite row to output |
| COMPOSITE_CLEAR | 0x19 | Clear slots for reuse |
| COMPOSITE_EMIT_MAPPED | 0x1A | Emit with column mapping |
| COMPOSITE_MATERIALIZE_QUALIFIED | 0x1C | Materialize with alias prefixes |
17.5.2 Hash Table Operations
| Opcode | Code | Description |
|---|---|---|
| HT_NEW | 0x10 | Create new hash table in slot |
| HT_INSERT | 0x11 | Insert key-document pair |
| HT_PROBE | 0x12 | Probe for matching document |
| HT_DESTROY | 0x1F | Destroy hash table |
17.5.3 Instruction Encoding
Composite and hash table operations use Format H (extended format):
Word 1: [0xFE:8][ExtOpcode:8][Dst:8][Src:8]
Word 2: [Operand1:16][Operand2:16]
Total: 64 bits
The 16-bit operand fields support:
- Up to 65,536 constant pool entries
- Up to 65,536 column mapping indices
- Up to 256 slots per composite row
17.5.4 Example: Hash Join Bytecode
For the query:
SELECT u.name, o.amount
FROM users u
JOIN orders o ON u.id = o.user_id
Generated bytecode:
; Phase 1: BUILD (users table)
HT_NEW R10 ; Create hash table in slot 0
CURSOR_OPEN R1, users_cursor ; Open users cursor
build_loop:
CURSOR_NEXT R2, R1 ; R2 = next users document
JUMP_NULL R2, build_done ; Exit if no more rows
GET_FIELD R3, R2, "id" ; R3 = u.id (join key)
HT_INSERT R10, R3, R2 ; hash_table[hash(R3)] = R2
JUMP build_loop
build_done:
; Keep users cursor open (documents referenced by hash table)
; Phase 2: PROBE (orders table)
CURSOR_OPEN R4, orders_cursor ; Open orders cursor
COMPOSITE_NEW R5 ; Create reusable composite row
probe_loop:
CURSOR_NEXT R6, R4 ; R6 = next orders document
JUMP_NULL R6, probe_done ; Exit if no more rows
GET_FIELD R7, R6, "user_id" ; R7 = o.user_id (join key)
HT_PROBE R8, R10, R7 ; R8 = matching users doc or null
JUMP_NULL R8, probe_loop ; Skip if no match
; Build composite row
COMPOSITE_CLEAR R5 ; Clear for reuse
COMPOSITE_ADD R5, R8, "u" ; Add users doc with alias "u"
COMPOSITE_ADD R5, R6, "o" ; Add orders doc with alias "o"
; Emit result
COMPOSITE_EMIT_MAPPED R5, mapping_0 ; Emit with column mappings
JUMP probe_loop
probe_done:
CURSOR_CLOSE R4 ; Close orders cursor
CURSOR_CLOSE R1 ; Close users cursor
HT_DESTROY R10 ; Free hash table
HALT
17.6 Outer JOIN Support
17.6.1 LEFT OUTER JOIN
Left outer joins must emit unmatched left-side rows with NULL right-side values:
// Track whether current outer row matched any inner row
bool matched = false;
for (const auto& inner_doc : inner_cursor) {
  if (evaluate_condition(outer_doc, inner_doc)) {
    emit_composite(outer_doc, inner_doc);
    matched = true;
  }
}
// Emit unmatched outer row with null inner
if (!matched) {
  emit_composite(outer_doc, null_document);
}
Bytecode Pattern:
probe_loop:
CURSOR_NEXT R6, R4
JUMP_NULL R6, probe_done
MOV_FALSE R9 ; matched = false
GET_FIELD R7, R6, "user_id"
HT_PROBE R8, R10, R7
JUMP_NULL R8, check_unmatched
; Match found
COMPOSITE_CLEAR R5
COMPOSITE_ADD R5, R8, "u"
COMPOSITE_ADD R5, R6, "o"
COMPOSITE_EMIT R5
MOV_TRUE R9 ; matched = true
JUMP probe_loop
check_unmatched:
JUMP_TRUE R9, probe_loop ; Skip if matched
; Emit with NULL right side
DOC_NEW R11 ; Create empty document
COMPOSITE_CLEAR R5
COMPOSITE_ADD R5, R11, "u" ; NULL users
COMPOSITE_ADD R5, R6, "o" ; orders doc
COMPOSITE_EMIT R5
JUMP probe_loop
17.6.2 RIGHT OUTER JOIN
Right outer joins require tracking which build-side rows were matched. After the probe phase completes, any build-side entries that were never matched are emitted with NULL probe-side values.
17.6.3 FULL OUTER JOIN
Full outer joins combine left and right outer join logic:
- Build hash table from right side with match tracking
- Probe with left side, mark matched right rows
- Emit matches and unmatched left rows
- Emit unmatched right rows
17.7 Memory Management
17.7.1 VMContext Resources
The VMContext manages JOIN-related resources:
class VMContext {
  // Hash tables for hash joins (up to 4 concurrent)
  static constexpr size_t kMaxHashTables = 4;
  std::array<HashTable, kMaxHashTables> hash_tables_;

  // Owned composite rows
  std::vector<std::unique_ptr<CompositeRow>> owned_composite_rows_;

  // Working tables for CTEs and subqueries
  std::vector<std::unique_ptr<WorkingTable>> working_tables_;

public:
  auto create_hash_table(uint8_t slot) -> HashTable*;
  void destroy_hash_table(uint8_t slot);
  auto create_composite_row() -> CompositeRow*;
  void reset();  // Clears all owned resources
};
17.7.2 Document Lifetime
Zero-copy JOINs depend on source documents remaining valid throughout execution:
Lifetime Guarantees:
- Cursor documents: Valid until cursor advances or closes
- Hash table entries: Build cursor remains open during probe phase
- Composite row references: Valid as long as source cursors are open
Invalidation Prevention:
- Build cursor is NOT closed until probe phase completes
- Composite rows are cleared (not destroyed) between iterations
- Working tables own document copies when persistence is needed
17.7.3 CompositeRow Reuse
The COMPOSITE_CLEAR operation enables efficient memory reuse:
void CompositeRow::clear() {
  slots_.clear();  // O(1) - vector size reset to 0
  // Capacity retained - no reallocation
}
Memory Pattern:
Without reuse (bad):
Iteration 1: allocate CompositeRow
Iteration 2: allocate CompositeRow
...
Iteration N: allocate CompositeRow
Memory: O(N) allocations
With reuse (good):
Once: allocate CompositeRow
Iteration 1: clear + add_slot
Iteration 2: clear + add_slot
...
Iteration N: clear + add_slot
Memory: O(1) allocation
17.8 External Hash Join
17.8.1 Memory Overflow Handling
When the build side exceeds available memory, Cognica falls back to external (Grace) hash join:
class ExternalHashJoiner {
  struct Config {
    uint64_t memory_limit = 256 * 1024 * 1024;  // 256MB
    int32_t num_partitions = 64;
    int32_t max_recursion_depth = 4;
    CompressionType compression = CompressionType::ZSTD;
  };

  // In-memory hash table (used when data fits)
  std::unordered_map<std::string, std::vector<Document>> hash_table_;

  // Partition files (used when memory exceeded)
  std::vector<std::unique_ptr<HashPartition>> build_partitions_;
  std::vector<std::unique_ptr<HashPartition>> probe_partitions_;
};
17.8.2 Partitioning Strategy
Grace hash join partitions both relations by hash value:
partition(key) = hash(key) mod P
where P is the number of partitions.
Key Property: Matching rows are guaranteed to be in the same partition number, enabling independent processing of each partition pair.
17.8.3 Recursive Partitioning
If a partition still exceeds memory, recursive partitioning splits it further:
void process_partition(size_t partition_idx) {
  auto& build_part = build_partitions_[partition_idx];
  auto& probe_part = probe_partitions_[partition_idx];

  size_t build_size = build_part->size_bytes();
  if (build_size <= config_.memory_limit) {
    // Process in memory
    process_in_memory(build_part, probe_part);
  } else if (recursion_depth_ < config_.max_recursion_depth) {
    // Recursively partition; repartition() appends the new partition
    // pairs and returns their indices
    auto sub_partition_indices = repartition(build_part, probe_part);
    for (size_t sub_idx : sub_partition_indices) {
      ++recursion_depth_;
      process_partition(sub_idx);
      --recursion_depth_;
    }
  } else {
    // Fall back to nested loop for this partition
    nested_loop_join(build_part, probe_part);
  }
}
17.8.4 Spill File Management
Partitions are stored in temporary files with optional compression:
class HashPartition {
  std::string temp_file_path_;
  std::unique_ptr<FileWriter> writer_;
  CompressionType compression_;

public:
  void add_document(Document&& doc) {
    auto serialized = serialize(doc, compression_);
    writer_->write(serialized);
  }

  auto read_all() -> std::vector<Document> {
    std::vector<Document> result;
    FileReader reader(temp_file_path_, compression_);
    while (auto doc = reader.read_next()) {
      result.push_back(std::move(*doc));
    }
    return result;
  }
};
Compression Benefits:
- ZSTD compression typically achieves 3-5x compression ratio
- Reduces I/O time for large partitions
- Trades CPU for I/O bandwidth (usually worthwhile)
17.9 Working Tables
17.9.1 Purpose and Use Cases
Working tables provide temporary storage for:
- Recursive CTEs: Accumulating results across iterations
- Multi-pass algorithms: Storing intermediate results
- Subquery materialization: Caching correlated subquery results
class WorkingTable {
  std::vector<Document> rows_;  // Owned document copies
  size_t scan_index_ = 0;       // Current scan position

public:
  void add_row(Document&& doc);  // Move semantics
  void swap_with(WorkingTable& other);

  void open_scan();
  auto scan_next() -> Document*;
  void close_scan();
};
17.9.2 Recursive CTE Execution
Recursive CTEs use working tables to iterate until fixpoint:
WITH RECURSIVE employee_hierarchy AS (
-- Base case
SELECT id, name, manager_id, 1 AS level
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive case
SELECT e.id, e.name, e.manager_id, h.level + 1
FROM employees e
JOIN employee_hierarchy h ON e.manager_id = h.id
)
SELECT * FROM employee_hierarchy;
Execution Pattern:
working_table = execute(base_case)
result_table = copy(working_table)
while (!working_table.empty()):
    temp_table = empty
    for row in working_table:
        for match in execute_recursive(row):
            temp_table.add(match)
            result_table.add(match)
    working_table = temp_table
return result_table
17.9.3 Working Table Bytecode
; Initialize
WORKING_TABLE_NEW R1 ; Create working table (current iteration input)
WORKING_TABLE_NEW R2 ; Create next table (rows produced this iteration)
WORKING_TABLE_NEW R3 ; Create result table (accumulated output)
; Execute base case, populate working and result tables
; ... base case bytecode ...
WORKING_TABLE_ADD R1, R4 ; Add to working
WORKING_TABLE_ADD R3, R4 ; Add to result
iteration_loop:
WORKING_TABLE_SCAN_OPEN R1 ; Open scan on working
scan_loop:
WORKING_TABLE_SCAN_NEXT R5, R1 ; Get next row
JUMP_NULL R5, iteration_done
; Execute recursive case with R5 as input
; ... recursive bytecode ...
WORKING_TABLE_ADD R2, R6 ; Add to next (input for next iteration)
WORKING_TABLE_ADD R3, R6 ; Add to result (accumulate)
JUMP scan_loop
iteration_done:
WORKING_TABLE_SCAN_CLOSE R1
WORKING_TABLE_SWAP R1, R2 ; Next becomes working
WORKING_TABLE_CLEAR R2 ; Clear next for the coming iteration
WORKING_TABLE_EMPTY R7, R1 ; Check if working empty
JUMP_FALSE R7, iteration_loop
; R3 now contains all results
EMIT_WORKING_TABLE R3
HALT
17.10 Multi-Way JOIN Optimization
17.10.1 Join Order Enumeration
For multi-way JOINs, the optimizer enumerates possible join orderings:
SELECT *
FROM A
JOIN B ON A.x = B.x
JOIN C ON B.y = C.y
JOIN D ON C.z = D.z
Possible orderings include:
- ((A ⋈ B) ⋈ C) ⋈ D
- (A ⋈ B) ⋈ (C ⋈ D)
- A ⋈ (B ⋈ (C ⋈ D))
- ... and many more
The number of distinct binary join tree shapes for n tables is the Catalan number C(n-1); combined with the n! ways to assign tables to leaves, the total number of orderings is n! × C(n-1).
For 4 tables: 4! × C(3) = 24 × 5 = 120 orderings.
17.10.2 Cost-Based Selection
The optimizer selects the ordering with minimum estimated cost, summing over all joins in the plan:
cost(plan) = Σ cost(join_i)
For hash join specifically:
cost(join) ≈ c_build × |build| + c_probe × |probe|
where c_build > c_probe due to hash table construction overhead.
17.10.3 Intermediate Result Handling
Multi-way JOINs may require intermediate materialization.
The planner decides when to materialize based on:
- Memory pressure
- Reuse opportunities
- Pipeline boundaries
17.11 Performance Characteristics
17.11.1 Memory Efficiency
| Approach | Memory per Match | 10M Matches |
|---|---|---|
| Full Materialization | 400 bytes | 4 GB |
| Zero-Copy (CompositeRow) | 48 bytes | 480 MB |
| With Reuse | 48 bytes total | 48 bytes |
17.11.2 Execution Time
| Operation | Traditional | Zero-Copy |
|---|---|---|
| Build composite | memcpy (100+ ns) | pointer assign (1 ns) |
| Field access | direct (1 ns) | indirect (3-5 ns) |
| Materialization | N/A | on-demand (100+ ns) |
Trade-off: Zero-copy adds 2-4 ns per field access but saves 100+ ns per row construction. For JOINs outputting many columns, field access overhead is negligible compared to avoided copies.
17.11.3 When Zero-Copy Excels
- High cardinality JOINs: Millions of matching pairs
- Wide rows: Many columns per table
- Selective output: Few columns needed from many available
- Pipeline processing: Results consumed immediately
17.11.4 When Materialization Is Preferred
- Repeated access: Same row accessed multiple times
- Complex expressions: Derived columns requiring multiple source fields
- Persistence required: Results stored in working tables
- Very narrow rows: Pointer overhead exceeds data size
17.12 Summary
Zero-copy JOIN implementation is a critical optimization that enables Cognica to process large JOIN results without excessive memory consumption. The key concepts covered in this chapter are:
CompositeRow Abstraction: Lightweight wrapper storing pointers to source documents with table aliases. Field access resolves references on-demand, avoiding data copying until materialization is required.
Deferred Materialization: Documents are created only when output is emitted, working tables require owned copies, or column reordering is needed. This lazy approach minimizes memory allocations in the critical path.
Hash Join Implementation: Build phase inserts document pointers (not copies) into hash table. Probe phase looks up matching pointers and creates composite rows. The build cursor remains open throughout to keep referenced documents valid.
CVM Opcodes: Dedicated instructions for composite row operations (COMPOSITE_NEW, COMPOSITE_ADD, COMPOSITE_CLEAR, COMPOSITE_EMIT) and hash table operations (HT_NEW, HT_INSERT, HT_PROBE) enable efficient bytecode generation.
Memory Reuse: The COMPOSITE_CLEAR operation resets composite rows for reuse, reducing allocations from O(N) to O(1) per JOIN execution.
External Hash Join: Grace hash join with partition spilling handles datasets exceeding available memory. Recursive partitioning and compression minimize I/O overhead.
Working Tables: Provide owned document storage for recursive CTEs, multi-pass algorithms, and subquery materialization where zero-copy semantics cannot be maintained.
The combination of zero-copy references, deferred materialization, and memory reuse enables Cognica to achieve 10-100x memory efficiency gains for large multi-way JOINs while maintaining familiar SQL semantics.
References
- Graefe, G. (1993). Query Evaluation Techniques for Large Databases. ACM Computing Surveys.
- Shapiro, L. D. (1986). Join Processing in Database Systems with Large Main Memories. ACM TODS.
- Kim, C., Kaldewey, T., Lee, V. W., et al. (2009). Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs. PVLDB.
- Barber, R., Lohman, G., Mohan, C., et al. (2014). In-Memory BLU Acceleration in IBM DB2 and dashDB. CIDR.
- Leis, V., Gubichev, A., Mirchev, A., et al. (2015). How Good Are Query Optimizers, Really? PVLDB.