Chapter 16: Copy-and-Patch JIT Compilation
16.1 Introduction
While the CVM bytecode interpreter provides excellent portability and debugging capabilities, interpretation overhead limits performance for compute-intensive workloads. Traditional Just-In-Time (JIT) compilation using frameworks like LLVM can generate highly optimized native code but introduces significant compilation latency—often hundreds of milliseconds for complex queries—making it unsuitable for short-running database queries.
Copy-and-patch compilation offers a compelling middle ground: compilation speeds measured in microseconds while achieving code quality within 10-20% of heavily optimized compilers. This technique, pioneered by Haoran Xu and Fredrik Kjolstad at Stanford, pre-compiles parameterized code templates called stencils and generates native code by copying stencils and patching in runtime values.
This chapter explores Cognica's copy-and-patch JIT compiler, which provides 2-5x speedup over interpretation for hot code paths while maintaining sub-millisecond compilation times.
16.1.1 The JIT Compilation Spectrum
JIT compilation strategies span a wide spectrum of compilation time versus code quality:
| Approach | Compilation Time | Code Quality | Use Case |
|---|---|---|---|
| Interpretation | 0 | Baseline | All queries |
| Copy-and-Patch | ~1ms/KB | 80-90% optimal | Hot loops |
| Method JIT | ~10ms/KB | 90-95% optimal | Warm methods |
| Tracing JIT | ~50ms/KB | 95%+ optimal | Hot traces |
| LLVM JIT | ~100ms+/KB | Optimal | Long-running |
For database workloads where queries often execute in milliseconds, the compilation overhead of traditional JIT compilers can exceed query execution time. Copy-and-patch compilation breaks this barrier by achieving native code generation in microseconds.
16.1.2 The Copy-and-Patch Insight
The fundamental insight behind copy-and-patch compilation is that most bytecode instructions map to a small set of machine code patterns. Consider the CVM ADD_I64 instruction:
ADD_I64 R3, R1, R2 ; R3 = R1 + R2
The corresponding x86-64 machine code follows a predictable pattern:
mov rax, [rdi + R1_OFFSET] ; Load R1 from VMContext
add rax, [rdi + R2_OFFSET] ; Add R2
mov [rdi + R3_OFFSET], rax ; Store to R3
The only values that change between instances are the register offsets (R1_OFFSET, R2_OFFSET, R3_OFFSET). Rather than generating this code from scratch, copy-and-patch:
- Pre-compiles the pattern with placeholder offsets
- At JIT time, copies the template
- Patches the placeholders with actual offsets
This approach eliminates instruction selection, register allocation, and code emission from the critical path, reducing compilation to simple memory copies and integer arithmetic.
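The three steps above can be condensed into a few lines of C++. The following is an illustrative toy, not Cognica's implementation: it copies a pre-built byte template and overwrites a 4-byte placeholder with a runtime value, which is all the work a copy-and-patch compiler does per instruction.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Toy 7-byte "machine code" template whose bytes 3..6 hold a 32-bit
// placeholder (0xDEAD0001, stored little-endian) to be patched at
// JIT time.
static const uint8_t kTemplate[7] = {0x48, 0x8B, 0x87,
                                     0x01, 0x00, 0xAD, 0xDE};

// Copy the template into an output buffer, then patch the
// placeholder at byte offset 3 with the actual register offset.
std::vector<uint8_t> copy_and_patch(uint32_t actual_offset) {
  std::vector<uint8_t> out(kTemplate, kTemplate + sizeof(kTemplate));
  std::memcpy(out.data() + 3, &actual_offset, sizeof(actual_offset));
  return out;
}
```

Compilation reduces to one memcpy of the template plus one memcpy per patch site; no instruction encoding happens at runtime.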
16.2 Tiered Compilation Architecture
16.2.1 Compilation Tiers
Cognica implements a three-tier compilation strategy:
Tier 0 - Interpreter: All code starts in the interpreter, which collects execution profiles. The interpreter uses computed goto dispatch for competitive baseline performance (Chapter 13).
Tier 1 - Baseline JIT: When execution count exceeds a threshold (default: 100 invocations or 1000 loop iterations), the baseline JIT compiles the function using direct bytecode-to-stencil translation. Compilation takes approximately 1ms per KB of bytecode.
Tier 2 - Optimized JIT: For very hot code (10x baseline threshold), the optimizing JIT applies additional transformations: constant propagation, dead code elimination, type specialization, and loop-invariant code motion. Compilation takes approximately 10ms per KB.
16.2.2 Tier Transition Thresholds
The transition thresholds balance compilation overhead against execution benefit. Compilation pays off when the execution time saved exceeds the one-time compilation cost:

N × T_interp > C_compile + N × T_jit

where:
- T_interp = interpretation time per invocation
- T_jit = JIT execution time per invocation
- N = expected remaining invocations
- C_compile = one-time compilation cost

Setting the two sides equal and solving for N gives the break-even invocation count:

N_break-even = C_compile / (T_interp - T_jit)

Plugging in measured values for T_interp, T_jit, and C_compile yields the break-even count for a given deployment. The default threshold of 100 is slightly aggressive, assuming that functions called 100 times are likely to be called many more times.
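The break-even formula is simple enough to express directly. The sketch below uses illustrative nanosecond figures, not measured Cognica numbers:

```cpp
#include <cstdint>

// Break-even invocation count: compiling pays off once the expected
// remaining invocations exceed C_compile / (T_interp - T_jit).
// All times are in nanoseconds.
constexpr uint64_t break_even(uint64_t t_interp_ns, uint64_t t_jit_ns,
                              uint64_t c_compile_ns) {
  return c_compile_ns / (t_interp_ns - t_jit_ns);
}
```

For example, with a hypothetical 1000 ns interpreted invocation, 250 ns JIT invocation, and 50 µs compile cost, the break-even count is 50000 / 750 = 66 invocations.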
16.2.3 Profiling Infrastructure
The interpreter collects profiles to guide tier transitions:
struct ExecutionProfile {
uint32_t invocation_count; // Function entry count
uint32_t backward_branch_count; // Loop iteration proxy
TypeFeedback type_feedback; // Observed types per instruction
BranchHistory branch_history; // Taken/not-taken statistics
};
Invocation Counting: Each function entry increments a counter. When the counter exceeds the tier-1 threshold, compilation is triggered.
Loop Detection: Backward branches indicate loops. The interpreter counts backward branch executions to identify hot loops that warrant compilation even in cold functions.
Type Feedback: Polymorphic instructions record observed types, enabling type specialization in the optimizing JIT.
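A minimal sketch of how the invocation and backward-branch counters might trigger tier-1 compilation follows. The threshold values are the defaults quoted in 16.2.1; the function names and structure are illustrative, not Cognica's actual API:

```cpp
#include <cstdint>

struct ExecutionProfile {
  uint32_t invocation_count = 0;
  uint32_t backward_branch_count = 0;
};

constexpr uint32_t kInvocationThreshold = 100;  // default from 16.2.1
constexpr uint32_t kLoopThreshold = 1000;       // default from 16.2.1

// Called on function entry; returns true exactly once, when the
// tier-1 invocation threshold is crossed.
bool on_function_entry(ExecutionProfile& p) {
  return ++p.invocation_count == kInvocationThreshold;
}

// Called on each backward branch; hot loops trigger compilation
// even inside otherwise cold functions.
bool on_backward_branch(ExecutionProfile& p) {
  return ++p.backward_branch_count == kLoopThreshold;
}
```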
16.3 Stencil Architecture
16.3.1 Stencil Definition
A stencil is a pre-compiled machine code template with designated patch points:
struct Stencil {
const uint8_t* code; // Pre-compiled machine code
size_t code_size; // Size in bytes
std::vector<PatchSite> patches; // Locations to patch
size_t alignment; // Required alignment (8 or 16)
const char* name; // Debug name, e.g., "add_i64"
Opcode opcode; // Corresponding CVM opcode
};
Code: The raw machine code bytes, compiled offline using a standard compiler (GCC/Clang) with placeholder values.
Patch Sites: Locations within the code that must be modified at JIT time, along with metadata describing how to compute the patch value.
Alignment: Some stencils require specific alignment for SIMD instructions or branch targets.
16.3.2 Patch Site Specification
Each patch site describes a location that needs runtime modification:
struct PatchSite {
uint32_t offset; // Byte offset in stencil code
PatchType type; // Type of patch needed
uint8_t size; // Patch size: 1, 2, 4, or 8 bytes
int8_t operand_index; // Which instruction operand (-1 for special)
};
enum class PatchType : uint8_t {
kRegisterOffset, // Offset into VMContext::registers_
kFloatRegisterOffset, // Offset into VMContext::float_registers_
kImmediate8, // 8-bit immediate constant
kImmediate16, // 16-bit immediate constant
kImmediate32, // 32-bit immediate constant
kImmediate64, // 64-bit immediate constant
kRelativeJump, // Relative jump offset (signed)
kAbsoluteAddress, // Absolute address (for calls)
kConstantPoolPtr, // Pointer into constant pool
kContextField, // Offset of VMContext field
kCallbackPtr, // External function callback pointer
};
16.3.3 Stencil Example: ADD_I64
Consider the x86-64 stencil for integer addition:
// Stencil source (compiled offline)
void stencil_add_i64(VMContext* ctx) {
int64_t* regs = ctx->registers_;
regs[DST_PLACEHOLDER] =
regs[SRC1_PLACEHOLDER] + regs[SRC2_PLACEHOLDER];
}
After compilation with placeholders, the machine code contains:
; Offset 0x00: Load src1
mov rax, QWORD PTR [rdi + 0xDEAD0001] ; Placeholder offset
; Offset 0x07: Load src2
add rax, QWORD PTR [rdi + 0xDEAD0002] ; Placeholder offset
; Offset 0x0E: Store to dst
mov QWORD PTR [rdi + 0xDEAD0003], rax ; Placeholder offset
The stencil metadata records the patch sites:
static const Stencil kAddI64Stencil = {
.code = add_i64_code,
.code_size = 21,
.patches = {
{.offset = 3, .type = kRegisterOffset, .size = 4, .operand_index = 1},
{.offset = 10, .type = kRegisterOffset, .size = 4, .operand_index = 2},
{.offset = 17, .type = kRegisterOffset, .size = 4, .operand_index = 0},
},
.alignment = 1,
.name = "add_i64",
.opcode = Opcode::ADD_I64,
};
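Putting the pieces together, patch application for this stencil might look like the sketch below. The 24-byte VMValue size and the patch-site offsets come from the text; the base offset value and helper names are illustrative assumptions:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

struct PatchSite {
  uint32_t offset;       // byte offset in the copied code
  int8_t operand_index;  // which instruction operand supplies the register
};

constexpr uint32_t kRegistersOffset = 64;  // illustrative offsetof value
constexpr uint32_t kVMValueSize = 24;      // sizeof(VMValue), from 16.4.2

// Patch the register-offset sites of a copied stencil.
// `operands` holds the instruction's register numbers {dst, src1, src2}.
void apply_register_patches(std::vector<uint8_t>& code,
                            const PatchSite* sites, size_t n_sites,
                            const uint8_t operands[3]) {
  for (size_t i = 0; i < n_sites; ++i) {
    uint32_t value =
        kRegistersOffset + operands[sites[i].operand_index] * kVMValueSize;
    std::memcpy(code.data() + sites[i].offset, &value, sizeof(value));
  }
}
```

For ADD_I64 R3, R1, R2, the three sites receive 64 + 1×24 = 88, 64 + 2×24 = 112, and 64 + 3×24 = 136 respectively.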
16.3.4 Stencil Library Organization
Cognica maintains separate stencil libraries for each target architecture:
src/cognica/cvm/jit/codegen/
├── x86_64/
│ ├── stencils.hpp # Stencil declarations
│ ├── stencils.cpp # Stencil definitions
│ └── stencil_data.inc # Raw machine code bytes
└── aarch64/
├── stencils.hpp
├── stencils.cpp
└── stencil_data.inc
Stencil Categories:
| Category | Examples | Count |
|---|---|---|
| Arithmetic | ADD_I64, MUL_F64, NEG_I64 | ~30 |
| Comparison | CMP_LT_I64, CMP_EQ_STR | ~20 |
| Control Flow | JMP, JZ, CALL, RET | ~15 |
| Memory | LOAD_CONST, STORE_LOCAL | ~20 |
| Type Ops | CAST_I64_F64, TYPE_CHECK | ~15 |
| Special | PROLOGUE, EPILOGUE, SPILL | ~10 |
Total: approximately 110 stencils per architecture.
16.4 Patch Value Computation
16.4.1 The PatchComputer Class
The PatchComputer class computes patch values from instruction operands:
class PatchComputer {
public:
static auto compute(PatchType type, const Instruction& inst,
int8_t operand_index, const JITContext& ctx)
-> uint64_t;
private:
static auto compute_register_offset(uint8_t reg) -> uint32_t;
static auto compute_relative_jump(uint32_t from, uint32_t to) -> int32_t;
static auto compute_callback_ptr(uint16_t func_id,
const JITContext& ctx) -> uint64_t;
};
16.4.2 Register Offset Computation
CVM registers are stored in the VMContext at fixed offsets. Given the VMValue structure size of 24 bytes:
auto PatchComputer::compute_register_offset(uint8_t reg) -> uint32_t {
constexpr uint32_t kRegistersOffset = offsetof(VMContext, registers_);
constexpr uint32_t kVMValueSize = sizeof(VMValue); // 24 bytes
return kRegistersOffset + reg * kVMValueSize;
}
For register R5: offset = kRegistersOffset + 5 × 24 = kRegistersOffset + 120 bytes.
16.4.3 Relative Jump Computation
Jump instructions use PC-relative addressing. The patch value is the signed offset from the end of the jump instruction to the target:
auto PatchComputer::compute_relative_jump(uint32_t from_offset,
uint32_t to_offset) -> int32_t {
// x86-64 relative jumps are computed from the end of the instruction
// JMP rel32 is 5 bytes, so we add 5 to from_offset
return static_cast<int32_t>(to_offset - (from_offset + 5));
}
For a jump from offset 100 to offset 250: rel32 = 250 - (100 + 5) = 145.
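The same arithmetic in standalone form (the 5-byte JMP rel32 length is x86-64-specific; backward jumps simply produce negative offsets):

```cpp
#include <cstdint>

// rel32 operand for a 5-byte x86-64 JMP: the target minus the address
// of the instruction *following* the jump.
constexpr int32_t relative_jump(uint32_t from_offset, uint32_t to_offset) {
  return static_cast<int32_t>(to_offset - (from_offset + 5));
}
```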
16.4.4 Callback Pointer Resolution
External function calls require absolute addresses of callback functions:
auto PatchComputer::compute_callback_ptr(uint16_t func_id,
const JITContext& ctx) -> uint64_t {
const auto& callback = ctx.callbacks().get(func_id);
return reinterpret_cast<uint64_t>(callback.function_ptr);
}
The JIT context maintains a table mapping function IDs to native function pointers, enabling dynamic linking of built-in functions, user-defined functions, and system callbacks.
16.5 Code Generation Pipeline
16.5.1 Baseline JIT Compilation
The baseline JIT performs direct bytecode-to-stencil translation:
class BaselineJIT {
public:
auto compile(const BytecodeModule& module) -> JITCode;
private:
void emit_prologue();
void emit_instruction(const Instruction& inst);
void emit_epilogue();
void apply_patches();
void resolve_jumps();
};
Compilation Algorithm:
function compile(module):
output = new CodeBuffer()
emit_prologue(output)
for each instruction in module.code:
stencil = lookup_stencil(instruction.opcode)
offset = output.size()
// Copy stencil code
output.append(stencil.code, stencil.code_size)
// Record patches for later resolution
for each patch in stencil.patches:
pending_patches.add(offset + patch.offset, patch, instruction)
// Record jump target for later resolution
if is_jump(instruction):
pending_jumps.add(instruction.target, offset)
emit_epilogue(output)
// Resolve all patches
for each (offset, patch, instruction) in pending_patches:
value = compute_patch_value(patch, instruction)
output.patch(offset, value, patch.size)
// Resolve jump targets
for each (target_bytecode, jump_offset) in pending_jumps:
target_native = bytecode_to_native[target_bytecode]
output.patch_relative_jump(jump_offset, target_native)
return finalize(output)
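The two-pass structure above (copy first, resolve jumps after every native offset is known) can be sketched independently of real machine code. Here the code is a plain byte buffer, each jump operand is a 32-bit relative offset, and all names are illustrative:

```cpp
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

struct PendingJump {
  uint32_t target_bytecode_pc;  // jump target, in bytecode units
  size_t rel32_offset;          // where the rel32 operand lives in `code`
};

// Second pass: once every bytecode pc has a known native offset,
// rewrite each pending rel32. The +4 makes the offset relative to the
// end of the 4-byte operand (which is the end of the instruction when
// the operand is its last field, as with JMP rel32).
void resolve_jumps(std::vector<uint8_t>& code,
                   const std::vector<PendingJump>& jumps,
                   const std::unordered_map<uint32_t, size_t>& native_of) {
  for (const auto& j : jumps) {
    size_t target_native = native_of.at(j.target_bytecode_pc);
    int32_t rel = static_cast<int32_t>(
        static_cast<int64_t>(target_native) -
        static_cast<int64_t>(j.rel32_offset + 4));
    std::memcpy(code.data() + j.rel32_offset, &rel, sizeof(rel));
  }
}
```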
16.5.2 Stencil Selection
The baseline JIT selects stencils based on opcode and operand types:
auto StencilSelector::select(Opcode opcode, const Instruction& inst)
-> const Stencil* {
// Primary opcode lookup
if (opcode < Opcode::EXTENDED) {
return &primary_stencils_[static_cast<uint8_t>(opcode)];
}
// Extended opcode lookup
auto ext_opcode = static_cast<ExtendedOpcode>(inst.operand(0));
return &extended_stencils_[static_cast<uint8_t>(ext_opcode)];
}
For polymorphic operations (e.g., ADD that can operate on integers, floats, or strings), type-specialized stencils may be selected based on type feedback:
auto StencilSelector::select_specialized(Opcode opcode,
const TypeFeedback& feedback)
-> const Stencil* {
if (feedback.is_monomorphic()) {
CVMType type = feedback.observed_type();
return lookup_specialized_stencil(opcode, type);
}
// Fall back to polymorphic stencil with runtime dispatch
return lookup_polymorphic_stencil(opcode);
}
16.5.3 Code Buffer Management
The CodeBuffer class manages the output machine code:
class CodeBuffer {
std::vector<uint8_t> code_;
size_t write_offset_;
public:
void append(const uint8_t* data, size_t size) {
code_.insert(code_.end(), data, data + size);
write_offset_ += size;
}
void patch(size_t offset, uint64_t value, uint8_t size) {
switch (size) {
case 1: code_[offset] = static_cast<uint8_t>(value); break;
case 2: memcpy(&code_[offset], &value, 2); break;
case 4: memcpy(&code_[offset], &value, 4); break;
case 8: memcpy(&code_[offset], &value, 8); break;
}
}
auto finalize() -> std::unique_ptr<uint8_t[]>;
};
16.5.4 Prologue and Epilogue Stencils
Every JIT-compiled function begins with a prologue and ends with an epilogue:
Prologue (x86-64):
push rbp ; Save frame pointer
mov rbp, rsp ; Set up frame
push rbx ; Save callee-saved registers
push r12
push r13
push r14
push r15
sub rsp, 0x?? ; Allocate spill slots (patched)
mov rbx, [rdi + R0_OFF] ; Load pinned registers
mov r12, [rdi + R1_OFF]
mov r13, [rdi + R2_OFF]
mov r14, [rdi + R3_OFF]
Epilogue (x86-64):
mov [rdi + R0_OFF], rbx ; Spill pinned registers
mov [rdi + R1_OFF], r12
mov [rdi + R2_OFF], r13
mov [rdi + R3_OFF], r14
add rsp, 0x?? ; Deallocate spill slots
pop r15 ; Restore callee-saved
pop r14
pop r13
pop r12
pop rbx
pop rbp
ret
16.6 Register Mapping
16.6.1 x86-64 Register Allocation
The x86-64 architecture provides 16 general-purpose registers, of which several are reserved or have special purposes:
// x86-64 Register Mapping
struct X64RegisterMap {
// Reserved (cannot be used for values)
static constexpr auto kContext = RDI; // VMContext* always in RDI
static constexpr auto kStackPtr = RSP; // Stack pointer
static constexpr auto kFramePtr = RBP; // Frame pointer
// Pinned VM registers (callee-saved, persist across calls)
static constexpr auto kR0 = RBX; // VM R0
static constexpr auto kR1 = R12; // VM R1
static constexpr auto kR2 = R13; // VM R2
static constexpr auto kR3 = R14; // VM R3
static constexpr auto kFrameCounter = R15; // Special: frame depth
// Scratch registers (caller-saved, may be clobbered)
  static constexpr auto kScratch = std::array{RAX, RCX, RDX, RSI, R8, R9, R10, R11};
};
Pinned Registers: VM registers R0-R3 are permanently mapped to callee-saved CPU registers. This avoids memory traffic for frequently accessed values.
Scratch Registers: Used for temporary values during complex operations. Must be saved before external calls.
Memory-Resident Registers: VM registers R4-R15 reside in memory (VMContext::registers_) and are loaded/stored as needed.
16.6.2 ARM64 Register Allocation
ARM64 provides more callee-saved registers, enabling more pinned VM registers:
// ARM64 Register Mapping
struct ARM64RegisterMap {
// Reserved
static constexpr auto kContext = X0; // VMContext*
static constexpr auto kStackPtr = SP; // Stack pointer
static constexpr auto kFramePtr = X29; // Frame pointer (FP)
static constexpr auto kLinkReg = X30; // Link register (LR)
// Pinned VM registers (X19-X28 are callee-saved)
static constexpr auto kR0 = X19;
static constexpr auto kR1 = X20;
static constexpr auto kR2 = X21;
static constexpr auto kR3 = X22;
static constexpr auto kR4 = X23;
static constexpr auto kR5 = X24;
static constexpr auto kR6 = X25;
static constexpr auto kR7 = X26;
// Scratch registers
  static constexpr auto kScratch = std::array{X1, X2, X3, X4, X5, X6, X7,
                                              X9, X10, X11, X12, X13, X14, X15};
};
ARM64's larger callee-saved register file allows pinning 8 VM registers versus only 4 on x86-64, reducing memory traffic for register-heavy code.
16.6.3 Register Spilling
When all scratch registers are in use, values must be spilled to memory:
class RegisterSpiller {
uint32_t spill_slot_offset_;
std::stack<uint32_t> free_slots_;
public:
auto allocate_spill_slot() -> uint32_t {
if (!free_slots_.empty()) {
uint32_t slot = free_slots_.top();
free_slots_.pop();
return slot;
}
uint32_t slot = spill_slot_offset_;
spill_slot_offset_ += 8; // 8 bytes per slot
return slot;
}
void free_spill_slot(uint32_t slot) {
free_slots_.push(slot);
}
};
The prologue allocates spill slots on the stack, and the spiller manages their allocation dynamically.
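The free list above implies freed slots are reused in LIFO order. The following is a self-contained copy of the spiller sketch with the slot counter zero-initialized (an assumption, since the in-text sketch omits a constructor) plus a small accessor for the high-water mark:

```cpp
#include <cstdint>
#include <stack>

class RegisterSpiller {
  uint32_t spill_slot_offset_ = 0;   // next fresh slot offset
  std::stack<uint32_t> free_slots_;  // freed slots, reused LIFO
 public:
  uint32_t allocate_spill_slot() {
    if (!free_slots_.empty()) {
      uint32_t slot = free_slots_.top();
      free_slots_.pop();
      return slot;  // reuse a previously freed slot
    }
    uint32_t slot = spill_slot_offset_;
    spill_slot_offset_ += 8;  // 8 bytes per slot
    return slot;
  }
  void free_spill_slot(uint32_t slot) { free_slots_.push(slot); }
  // Total stack space the prologue must reserve.
  uint32_t high_water_mark() const { return spill_slot_offset_; }
};
```

The high-water mark is what ultimately gets patched into the prologue's `sub rsp` instruction.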
16.7 Memory Management
16.7.1 Executable Memory Allocation
JIT-compiled code requires memory with execute permission. Cognica uses platform-specific APIs:
class CodeRegion {
void* base_;
size_t size_;
size_t used_;
public:
static auto allocate(size_t size) -> std::unique_ptr<CodeRegion> {
#ifdef __linux__
void* ptr = mmap(nullptr, size,
PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
#elif defined(__APPLE__)
void* ptr = mmap(nullptr, size,
PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_JIT, -1, 0);
#endif
if (ptr == MAP_FAILED) {
return nullptr; // mmap failed; caller must handle
}
return std::make_unique<CodeRegion>(ptr, size);
}
void make_executable() {
#ifdef __linux__
mprotect(base_, size_, PROT_READ | PROT_EXEC);
#elif defined(__APPLE__)
pthread_jit_write_protect_np(true);
#endif
}
};
Write-XOR-Execute (W^X): Modern security policies prevent memory from being simultaneously writable and executable. The JIT writes code with write permission, then changes to execute permission before invocation.
Apple Silicon Considerations: macOS on ARM64 requires MAP_JIT flag and uses pthread_jit_write_protect_np() to toggle between write and execute modes.
16.7.2 Code Cache Management
The code cache stores JIT-compiled functions with LRU eviction:
class CodeCache {
struct CacheEntry {
std::unique_ptr<JITCode> code;
uint64_t last_access;
uint32_t access_count;
};
std::unordered_map<uint64_t, CacheEntry> entries_;
size_t total_code_size_;
size_t max_code_size_;
public:
auto lookup(uint64_t key) -> JITCode* {
auto it = entries_.find(key);
if (it != entries_.end()) {
it->second.last_access = current_timestamp();
it->second.access_count++;
return it->second.code.get();
}
return nullptr;
}
void insert(uint64_t key, std::unique_ptr<JITCode> code) {
size_t code_size = code->size();
// Evict if necessary
while (total_code_size_ + code_size > max_code_size_) {
evict_lru();
}
entries_[key] = {std::move(code), current_timestamp(), 1};
total_code_size_ += code_size;
}
private:
void evict_lru();
};
Cache Key: A hash of the bytecode module uniquely identifies JIT code. Different compilation options (optimization level, type specialization) produce different keys.
Size Limits: The cache enforces configurable size limits (default: 64MB) to bound memory usage. LRU eviction removes infrequently accessed code.
16.7.3 Code Region Fragmentation
Over time, code allocation and eviction can fragment the executable memory region.
Cognica addresses fragmentation through:
- Bump Allocation: New code is allocated at the end of the region until full
- Region Compaction: When fragmentation exceeds a threshold, live code is copied to a new region
- Multiple Regions: Large allocations can span multiple regions
16.8 External Function Calls
16.8.1 Call Classification
External calls are classified by their requirements:
enum class CallType {
kBuiltin, // Fast path: no allocation, no side effects
kScalar, // User-defined: may allocate, needs GC safety
kExternal, // Lua/Python: full spill, may yield
kSPI, // Server Programming Interface: transaction-aware
};
Each call type has different spilling requirements:
| Call Type | Spill Pinned | Spill Scratch | Save Context |
|---|---|---|---|
| Builtin | No | Partial | No |
| Scalar | Yes | Yes | No |
| External | Yes | Yes | Yes |
| SPI | Yes | Yes | Yes |
16.8.2 Call Sequence
The JIT generates call sequences appropriate to the call type:
Builtin Call (minimal overhead):
; RAX = builtin function pointer
; Arguments in RDI (ctx), RSI (arg1), RDX (arg2)
call rax
; Result in RAX, no spilling needed
External Call (full spill):
; Spill all pinned registers to VMContext
mov [rdi + R0_OFF], rbx
mov [rdi + R1_OFF], r12
mov [rdi + R2_OFF], r13
mov [rdi + R3_OFF], r14
; Save current PC for stack unwinding
mov QWORD PTR [rdi + PC_OFF], current_pc
; Prepare arguments on operand stack
; ... argument marshaling ...
; Call external function
mov rax, [callback_table + func_id * 8]
call rax
; Check for success
test BYTE PTR [rdi + RESULT_SUCCESS_OFF], 1
jz handle_error
; Copy result to the destination register's slot, using a scratch
; register (rbx is a pinned register and is reloaded below)
mov rax, [rdi + RESULT_VALUE_OFF]
mov [rdi + DST_OFF], rax
; Reload pinned registers (may have been modified)
mov rbx, [rdi + R0_OFF]
mov r12, [rdi + R1_OFF]
mov r13, [rdi + R2_OFF]
mov r14, [rdi + R3_OFF]
16.8.3 Callback Preservation
The JIT preserves callback semantics established by the interpreter:
struct VMCallbacks {
CreateCursorFn create_cursor; // Open table/index cursor
CloseCursorFn close_cursor; // Close cursor
EmitRowFn emit_row; // Output result row
ExecuteSubqueryFn execute_subquery; // Nested query
AllocateStringFn allocate_string; // String allocation
RaiseExceptionFn raise_exception; // Exception handling
};
These callbacks enable the JIT-compiled code to interact with the broader database system without knowledge of its implementation.
16.9 Exception Handling
16.9.1 Exception Model
Cognica supports PL/pgSQL-style exception handling with EXCEPTION blocks:
BEGIN
INSERT INTO accounts VALUES (id, balance);
EXCEPTION
WHEN unique_violation THEN
UPDATE accounts SET balance = balance + amount WHERE id = $1;
END;
The JIT must support exception propagation while maintaining performance.
16.9.2 Handler Stack
Exception handlers form a stack in VMContext:
struct JITExceptionHandler {
uint32_t bytecode_pc; // Handler entry in bytecode
void* native_address; // Corresponding JIT code address
uint32_t call_depth; // Call stack depth at handler
uint32_t operand_depth; // Operand stack depth at handler
uint32_t register_mask; // Registers valid at handler
};
class VMContext {
std::vector<JITExceptionHandler> exception_handlers_;
};
16.9.3 Exception Stencils
The JIT uses specialized stencils for exception handling:
EXCEPTION_PUSH: Establishes a new handler
; Push handler to exception stack
mov rax, [rdi + HANDLER_STACK_PTR]
mov DWORD PTR [rax], handler_pc ; bytecode_pc
mov QWORD PTR [rax + 4], handler_native ; native_address
mov DWORD PTR [rax + 12], call_depth ; call_depth
mov DWORD PTR [rax + 16], operand_depth ; operand_depth
add QWORD PTR [rdi + HANDLER_STACK_PTR], 20 ; 20-byte packed entry
EXCEPTION_POP: Removes the current handler
sub QWORD PTR [rdi + HANDLER_STACK_PTR], 20
RAISE_EXCEPTION: Triggers exception handling
; Find matching handler
mov rsi, exception_type
call find_exception_handler
; If no handler, propagate
test rax, rax
jz propagate_to_caller
; Jump to handler's native address
mov rsp, [rax + HANDLER_STACK_DEPTH] ; Restore stack
jmp [rax + HANDLER_NATIVE_ADDR] ; Jump to handler
16.9.4 Stack Unwinding
When an exception occurs, the JIT must unwind the stack to the handler:
- Locate Handler: Search the handler stack for a matching exception type
- Restore Stack: Reset RSP to the handler's saved stack depth
- Restore Registers: Reload pinned registers from VMContext
- Transfer Control: Jump to the handler's native address
void JITRuntime::unwind_to_handler(VMContext* ctx,
const JITExceptionHandler& handler) {
// Restore call stack
ctx->call_stack_.resize(handler.call_depth);
// Restore operand stack
ctx->operand_stack_.resize(handler.operand_depth);
// Execution continues at handler.native_address
}
16.10 Optimizing JIT (Tier 2)
16.10.1 IR-Based Optimization
The optimizing JIT builds an intermediate representation for additional transformations:
class JITIRNode {
public:
enum class Kind {
kConstant,
kParameter,
kBinaryOp,
kUnaryOp,
kLoad,
kStore,
kCall,
kBranch,
kPhi,
};
Kind kind_;
CVMType type_;
std::vector<JITIRNode*> inputs_;
std::vector<JITIRNode*> users_;
};
16.10.2 Optimization Passes
The optimizing JIT applies several transformation passes:
Constant Propagation: Replaces operations on known constants with their results.
Before: LOAD_CONST R1, 10
LOAD_CONST R2, 20
ADD_I64 R3, R1, R2
After: LOAD_CONST R3, 30
Dead Code Elimination: Removes instructions whose results are never used.
Common Subexpression Elimination: Reuses previously computed values.
Before: MUL_I64 R1, R0, R0 ; R0 * R0
MUL_I64 R2, R0, R0 ; R0 * R0 (duplicate)
ADD_I64 R3, R1, R2
After: MUL_I64 R1, R0, R0 ; R0 * R0
ADD_I64 R3, R1, R1 ; Reuse R1
Loop-Invariant Code Motion: Hoists computations out of loops.
Before: loop:
LOAD R1, base_ptr
ADD R2, R1, offset
; ... use R2 ...
JMP loop
After: LOAD R1, base_ptr ; Hoisted
ADD R2, R1, offset ; Hoisted
loop:
; ... use R2 ...
JMP loop
Type Specialization: Uses type feedback to generate specialized code:
; Generic (polymorphic)
CALL type_check, R1
CMP result, kInt64
JNE slow_path
; ... int64 fast path ...
; Specialized (monomorphic, after observing only int64)
; Type check eliminated, directly execute int64 path
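Constant propagation on the bytecode form can be sketched as a forward pass that tracks which registers hold known constants. This illustrative version folds only ADD_I64 over two known operands, mirroring the before/after example above; the now-dead LOAD_CONSTs are left for dead code elimination to remove:

```cpp
#include <array>
#include <cstdint>
#include <optional>
#include <vector>

enum class Op { kLoadConst, kAddI64, kOther };

struct Inst {
  Op op;
  uint8_t dst, src1, src2;
  int64_t imm;  // used by kLoadConst
};

// Forward pass: remember constant-valued registers; rewrite ADD_I64
// over two known constants into a single LOAD_CONST of their sum.
std::vector<Inst> fold_constants(const std::vector<Inst>& in) {
  std::array<std::optional<int64_t>, 16> known{};
  std::vector<Inst> out;
  for (Inst inst : in) {
    if (inst.op == Op::kLoadConst) {
      known[inst.dst] = inst.imm;
    } else if (inst.op == Op::kAddI64 &&
               known[inst.src1] && known[inst.src2]) {
      inst = {Op::kLoadConst, inst.dst, 0, 0,
              *known[inst.src1] + *known[inst.src2]};
      known[inst.dst] = inst.imm;
    } else {
      known[inst.dst] = std::nullopt;  // dst no longer a known constant
    }
    out.push_back(inst);
  }
  return out;
}
```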
16.10.3 Code Quality Comparison
The optimization passes improve code quality significantly:
| Metric | Baseline JIT | Optimized JIT |
|---|---|---|
| Instructions per bytecode | 5-8 | 3-5 |
| Memory accesses | 2-3 per op | 1-2 per op |
| Branch mispredictions | Moderate | Low |
| Code size | 1.0x | 0.7x |
16.11 Performance Characteristics
16.11.1 Compilation Time
Copy-and-patch compilation is dramatically faster than traditional JIT:
| Approach | Time per KB | 10KB Function |
|---|---|---|
| LLVM -O0 | ~50ms | 500ms |
| LLVM -O2 | ~200ms | 2000ms |
| Baseline JIT | ~1ms | 10ms |
| Optimized JIT | ~10ms | 100ms |
The 50-200x compilation speedup makes JIT practical for short-running queries.
16.11.2 Execution Speedup
Performance improvement depends on workload characteristics:
| Workload Type | Baseline JIT | Optimized JIT |
|---|---|---|
| Tight arithmetic loops | 3-5x | 5-10x |
| Expression evaluation | 2-3x | 3-4x |
| Aggregate operations | 2-3x | 3-5x |
| Field access heavy | 1.5-2x | 2-3x |
| External call heavy | 1.0-1.2x | 1.2-1.5x |
Example: PL/pgSQL Loop
DO $$
DECLARE
i INTEGER := 0;
sum INTEGER := 0;
BEGIN
WHILE i < 1000000 LOOP
sum := sum + i;
i := i + 1;
END LOOP;
END $$;
| Execution Mode | Cycles/Iteration | Total Time |
|---|---|---|
| Interpreter | ~180 | 180ms |
| Baseline JIT | ~40 | 40ms |
| Optimized JIT | ~16 | 16ms |
Speedup: 11x (Optimized JIT vs. Interpreter)
16.11.3 When JIT Is Beneficial
JIT compilation provides the greatest benefit when:
- High iteration count: Loops executing thousands of times amortize compilation cost
- Compute-intensive: Arithmetic and comparison operations benefit most
- Register-heavy: Code using many VM registers benefits from pinned CPU registers
- Predictable types: Monomorphic operations enable type specialization
JIT provides minimal benefit when:
- I/O bound: Disk or network latency dominates
- External call heavy: Calls to Lua/Python or complex functions dominate
- Short-running: Queries executing in microseconds cannot amortize compilation
- Cold code: Functions executed once or twice
16.12 Summary
Copy-and-patch JIT compilation enables Cognica to achieve native code performance without the compilation latency of traditional JIT compilers. The key concepts covered in this chapter are:
Stencil-Based Compilation: Pre-compiled code templates with designated patch points eliminate instruction selection and code emission from the compilation critical path. Approximately 110 stencils per architecture cover all CVM opcodes.
Tiered Compilation: A three-tier system (Interpreter -> Baseline JIT -> Optimized JIT) balances compilation overhead against execution benefit. Profiling infrastructure guides tier transitions based on execution counts and type feedback.
Patch Value Computation: The PatchComputer class translates bytecode operands to native values: register offsets, relative jumps, and callback pointers. Careful offset arithmetic ensures correct execution.
Register Mapping: Pinned registers (4 on x86-64, 8 on ARM64) eliminate memory traffic for frequently accessed VM registers. Scratch registers handle temporary values with spilling when necessary.
Memory Management: Executable memory allocation with W^X protection, LRU code cache eviction, and region compaction manage JIT code lifecycle within bounded memory.
Exception Handling: Handler stacks and specialized stencils enable PL/pgSQL-style exception blocks with proper stack unwinding.
Optimization Passes: The optimizing JIT applies constant propagation, dead code elimination, CSE, LICM, and type specialization for an additional 2x speedup over baseline JIT.
The combination of fast compilation (~1ms/KB) and significant speedup (2-10x) makes copy-and-patch JIT ideal for database workloads where queries range from microseconds to seconds.
References
- Xu, H., & Kjolstad, F. (2021). Copy-and-Patch Compilation: A fast compilation algorithm for high-level languages and bytecode. OOPSLA.
- Aycock, J. (2003). A Brief History of Just-In-Time. ACM Computing Surveys.
- Hölzle, U., & Ungar, D. (1994). Optimizing Dynamically-Dispatched Calls with Run-Time Type Feedback. PLDI.
- Lattner, C., & Adve, V. (2004). LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation. CGO.
- Deutsch, L. P., & Schiffman, A. M. (1984). Efficient Implementation of the Smalltalk-80 System. POPL.