Chapter 16: Copy-and-Patch JIT Compilation

16.1 Introduction

While the CVM bytecode interpreter provides excellent portability and debugging capabilities, interpretation overhead limits performance for compute-intensive workloads. Traditional Just-In-Time (JIT) compilation using frameworks like LLVM can generate highly optimized native code but introduces significant compilation latency—often hundreds of milliseconds for complex queries—making it unsuitable for short-running database queries.

Copy-and-patch compilation offers a compelling middle ground: compilation speeds measured in microseconds while achieving code quality within 10-20% of heavily optimized compilers. This technique, pioneered by Haoran Xu and Fredrik Kjolstad at Stanford, pre-compiles parameterized code templates called stencils and generates native code by copying stencils and patching in runtime values.

This chapter explores Cognica's copy-and-patch JIT compiler, which provides 2-5x speedup over interpretation for hot code paths while maintaining sub-millisecond compilation times.

16.1.1 The JIT Compilation Spectrum

JIT compilation strategies span a wide spectrum of compilation time versus code quality:

Approach       | Compilation Time | Code Quality   | Use Case
---------------|------------------|----------------|--------------
Interpretation | 0                | Baseline       | All queries
Copy-and-Patch | ~1ms/KB          | 80-90% optimal | Hot loops
Method JIT     | ~10ms/KB         | 90-95% optimal | Warm methods
Tracing JIT    | ~50ms/KB         | 95%+ optimal   | Hot traces
LLVM JIT       | ~100ms+/KB       | Optimal        | Long-running

For database workloads where queries often execute in milliseconds, the compilation overhead of traditional JIT compilers can exceed query execution time. Copy-and-patch compilation breaks this barrier by achieving native code generation in microseconds.

16.1.2 The Copy-and-Patch Insight

The fundamental insight behind copy-and-patch compilation is that most bytecode instructions map to a small set of machine code patterns. Consider the CVM ADD_I64 instruction:

ADD_I64 R3, R1, R2  ; R3 = R1 + R2

The corresponding x86-64 machine code follows a predictable pattern:

mov  rax, [rdi + R1_OFFSET]   ; Load R1 from VMContext
add  rax, [rdi + R2_OFFSET]   ; Add R2
mov  [rdi + R3_OFFSET], rax   ; Store to R3

The only values that change between instances are the register offsets (R1_OFFSET, R2_OFFSET, R3_OFFSET). Rather than generating this code from scratch, copy-and-patch:

  1. Pre-compiles the pattern with placeholder offsets
  2. At JIT time, copies the template
  3. Patches the placeholders with actual offsets

This approach eliminates instruction selection, register allocation, and code emission from the critical path, reducing compilation to simple memory copies and integer arithmetic.
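
The three steps above can be sketched with a plain byte buffer. This is a minimal, architecture-neutral illustration; the template bytes and the copy_and_patch helper are hypothetical, not the Cognica API:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Step 1 happens offline: the template below stands in for pre-compiled
// machine code containing the little-endian placeholder 0xDEAD0001 at
// byte offset 3 (the disp32 of a `mov rax, [rdi+disp32]` instruction).
const std::vector<uint8_t> kTemplate = {0x48, 0x8B, 0x87,
                                        0x01, 0x00, 0xAD, 0xDE};

// Steps 2 and 3: copy the template, then overwrite the placeholder bytes
// with the runtime value. This is the entire "code generator".
std::vector<uint8_t> copy_and_patch(const std::vector<uint8_t>& tmpl,
                                    size_t patch_offset, uint32_t value) {
  std::vector<uint8_t> out = tmpl;                                // copy
  std::memcpy(out.data() + patch_offset, &value, sizeof(value));  // patch
  return out;
}
```

Patching a register offset such as 128 into the template reduces compilation to one memcpy per patch site, which is why the technique runs in microseconds rather than milliseconds.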

16.2 Tiered Compilation Architecture

16.2.1 Compilation Tiers

Cognica implements a three-tier compilation strategy:

Tier 0 - Interpreter: All code starts in the interpreter, which collects execution profiles. The interpreter uses computed goto dispatch for competitive baseline performance (Chapter 13).

Tier 1 - Baseline JIT: When execution count exceeds a threshold (default: 100 invocations or 1000 loop iterations), the baseline JIT compiles the function using direct bytecode-to-stencil translation. Compilation takes approximately 1ms per KB of bytecode.

Tier 2 - Optimized JIT: For very hot code (10x baseline threshold), the optimizing JIT applies additional transformations: constant propagation, dead code elimination, type specialization, and loop-invariant code motion. Compilation takes approximately 10ms per KB.

16.2.2 Tier Transition Thresholds

The transition thresholds balance compilation overhead against execution benefit:

\text{Benefit} = T_{\text{interp}} \cdot N_{\text{remaining}} - T_{\text{jit}} \cdot N_{\text{remaining}} - C_{\text{compile}}

where:

  • $T_{\text{interp}}$ = interpretation time per invocation
  • $T_{\text{jit}}$ = JIT execution time per invocation
  • $N_{\text{remaining}}$ = expected remaining invocations
  • $C_{\text{compile}}$ = one-time compilation cost

Setting $\text{Benefit} > 0$ and solving for $N_{\text{remaining}}$:

N_{\text{remaining}} > \frac{C_{\text{compile}}}{T_{\text{interp}} - T_{\text{jit}}}

With typical values ($C_{\text{compile}} = 1\,\text{ms}$, $T_{\text{interp}} = 12\,\mu\text{s}$, $T_{\text{jit}} = 4\,\mu\text{s}$):

N_{\text{remaining}} > \frac{1000\,\mu\text{s}}{8\,\mu\text{s}} = 125

The default threshold of 100 is slightly aggressive, assuming that functions called 100 times are likely to be called many more times.
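
The break-even arithmetic can be checked directly. A sketch with the microsecond figures above (the function name is illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Break-even invocation count: N_remaining > C_compile / (T_interp - T_jit).
// All inputs are in microseconds; returns INT64_MAX when JIT code is not
// actually faster, since compilation can then never pay off.
int64_t break_even_invocations(int64_t compile_cost_us, int64_t interp_us,
                               int64_t jit_us) {
  if (interp_us <= jit_us) return INT64_MAX;
  return compile_cost_us / (interp_us - jit_us);
}
```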

16.2.3 Profiling Infrastructure

The interpreter collects profiles to guide tier transitions:

struct ExecutionProfile {
  uint32_t invocation_count;      // Function entry count
  uint32_t backward_branch_count; // Loop iteration proxy
  TypeFeedback type_feedback;     // Observed types per instruction
  BranchHistory branch_history;   // Taken/not-taken statistics
};

Invocation Counting: Each function entry increments a counter. When the counter exceeds the tier-1 threshold, compilation is triggered.

Loop Detection: Backward branches indicate loops. The interpreter counts backward branch executions to identify hot loops that warrant compilation even in cold functions.

Type Feedback: Polymorphic instructions record observed types, enabling type specialization in the optimizing JIT.
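
A sketch of how these counters drive the tier-1 decision, using the default thresholds quoted in 16.2.1 (the simplified struct and function names are illustrative, not the Cognica API):

```cpp
#include <cassert>
#include <cstdint>

// Simplified profile: just the two counters that gate tier-1 compilation.
struct ExecutionProfile {
  uint32_t invocation_count;       // function entry count
  uint32_t backward_branch_count;  // loop iteration proxy
};

constexpr uint32_t kTier1Invocations = 100;     // default from 16.2.1
constexpr uint32_t kTier1LoopIterations = 1000; // default from 16.2.1

// Promote when either the whole function is warm or it contains a hot
// loop, so loop-heavy code compiles even inside rarely-called functions.
bool should_promote_to_baseline(const ExecutionProfile& p) {
  return p.invocation_count >= kTier1Invocations ||
         p.backward_branch_count >= kTier1LoopIterations;
}
```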

16.3 Stencil Architecture

16.3.1 Stencil Definition

A stencil is a pre-compiled machine code template with designated patch points:

struct Stencil {
  const uint8_t* code;              // Pre-compiled machine code
  size_t code_size;                 // Size in bytes
  std::vector<PatchSite> patches;   // Locations to patch
  size_t alignment;                 // Required alignment (8 or 16)
  const char* name;                 // Debug name, e.g., "add_i64"
  Opcode opcode;                    // Corresponding CVM opcode
};

Code: The raw machine code bytes, compiled offline using a standard compiler (GCC/Clang) with placeholder values.

Patch Sites: Locations within the code that must be modified at JIT time, along with metadata describing how to compute the patch value.

Alignment: Some stencils require specific alignment for SIMD instructions or branch targets.

16.3.2 Patch Site Specification

Each patch site describes a location that needs runtime modification:

struct PatchSite {
  uint32_t offset;       // Byte offset in stencil code
  PatchType type;        // Type of patch needed
  uint8_t size;          // Patch size: 1, 2, 4, or 8 bytes
  int8_t operand_index;  // Which instruction operand (-1 for special)
};

enum class PatchType : uint8_t {
  kRegisterOffset,       // Offset into VMContext::registers_
  kFloatRegisterOffset,  // Offset into VMContext::float_registers_
  kImmediate8,           // 8-bit immediate constant
  kImmediate16,          // 16-bit immediate constant
  kImmediate32,          // 32-bit immediate constant
  kImmediate64,          // 64-bit immediate constant
  kRelativeJump,         // Relative jump offset (signed)
  kAbsoluteAddress,      // Absolute address (for calls)
  kConstantPoolPtr,      // Pointer into constant pool
  kContextField,         // Offset of VMContext field
  kCallbackPtr,          // External function callback pointer
};

16.3.3 Stencil Example: ADD_I64

Consider the x86-64 stencil for integer addition:

// Stencil source (compiled offline)
void stencil_add_i64(VMContext* ctx) {
  int64_t* regs = ctx->registers_;
  regs[DST_PLACEHOLDER] =
    regs[SRC1_PLACEHOLDER] + regs[SRC2_PLACEHOLDER];
}

After compilation with placeholders, the machine code contains:

; Offset 0x00: Load src1
mov    rax, QWORD PTR [rdi + 0xDEAD0001]  ; Placeholder offset

; Offset 0x07: Load src2
add    rax, QWORD PTR [rdi + 0xDEAD0002]  ; Placeholder offset

; Offset 0x0E: Store to dst
mov    QWORD PTR [rdi + 0xDEAD0003], rax  ; Placeholder offset

The stencil metadata records the patch sites:

static const Stencil kAddI64Stencil = {
  .code = add_i64_code,
  .code_size = 21,
  .patches = {
    {.offset = 3,  .type = kRegisterOffset, .size = 4, .operand_index = 1},
    {.offset = 10, .type = kRegisterOffset, .size = 4, .operand_index = 2},
    {.offset = 17, .type = kRegisterOffset, .size = 4, .operand_index = 0},
  },
  .alignment = 1,
  .name = "add_i64",
  .opcode = Opcode::ADD_I64,
};

16.3.4 Stencil Library Organization

Cognica maintains separate stencil libraries for each target architecture:

src/cognica/cvm/jit/codegen/
├── x86_64/
│   ├── stencils.hpp        # Stencil declarations
│   ├── stencils.cpp        # Stencil definitions
│   └── stencil_data.inc    # Raw machine code bytes
└── aarch64/
    ├── stencils.hpp
    ├── stencils.cpp
    └── stencil_data.inc

Stencil Categories:

Category     | Examples                  | Count
-------------|---------------------------|------
Arithmetic   | ADD_I64, MUL_F64, NEG_I64 | ~30
Comparison   | CMP_LT_I64, CMP_EQ_STR    | ~20
Control Flow | JMP, JZ, CALL, RET        | ~15
Memory       | LOAD_CONST, STORE_LOCAL   | ~20
Type Ops     | CAST_I64_F64, TYPE_CHECK  | ~15
Special      | PROLOGUE, EPILOGUE, SPILL | ~10

Total: approximately 110 stencils per architecture.

16.4 Patch Value Computation

16.4.1 The PatchComputer Class

The PatchComputer class computes patch values from instruction operands:

class PatchComputer {
public:
  static auto compute(PatchType type, const Instruction& inst,
                      int8_t operand_index, const JITContext& ctx)
      -> uint64_t;

private:
  static auto compute_register_offset(uint8_t reg) -> uint32_t;
  static auto compute_relative_jump(uint32_t from, uint32_t to) -> int32_t;
  static auto compute_callback_ptr(uint16_t func_id,
                                   const JITContext& ctx) -> uint64_t;
};

16.4.2 Register Offset Computation

CVM registers are stored in the VMContext at fixed offsets. Given the VMValue structure size of 24 bytes:

auto PatchComputer::compute_register_offset(uint8_t reg) -> uint32_t {
  constexpr uint32_t kRegistersOffset = offsetof(VMContext, registers_);
  constexpr uint32_t kVMValueSize = sizeof(VMValue);  // 24 bytes

  return kRegistersOffset + reg * kVMValueSize;
}

For register R5:

\text{Offset} = \text{kRegistersOffset} + 5 \times 24 = 8 + 120 = 128
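
The arithmetic can be validated against a hypothetical VMContext layout matching the numbers above (one 8-byte field ahead of registers_, a 24-byte VMValue); the field names are illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 24-byte value cell: tag word, payload, auxiliary word.
struct VMValue {
  uint64_t type_tag;
  int64_t payload;
  uint64_t aux;
};

// Hypothetical context layout placing registers_ at offset 8.
struct VMContext {
  uint64_t pc;             // one 8-byte field before the register file
  VMValue registers_[16];
};

// Mirror of compute_register_offset using offsetof/sizeof.
uint32_t register_offset(uint8_t reg) {
  return static_cast<uint32_t>(offsetof(VMContext, registers_) +
                               reg * sizeof(VMValue));
}
```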

16.4.3 Relative Jump Computation

Jump instructions use PC-relative addressing. The patch value is the signed offset from the end of the jump instruction to the target:

auto PatchComputer::compute_relative_jump(uint32_t from_offset,
                                          uint32_t to_offset) -> int32_t {
  // x86-64 relative jumps are computed from the end of the instruction
  // JMP rel32 is 5 bytes, so we add 5 to from_offset
  return static_cast<int32_t>(to_offset - (from_offset + 5));
}

For a jump from offset 100 to offset 250:

\text{Relative} = 250 - (100 + 5) = 145
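
The same computation as a standalone helper, mirroring compute_relative_jump (JMP rel32 is 5 bytes on x86-64):

```cpp
#include <cassert>
#include <cstdint>

// Signed displacement measured from the end of the 5-byte JMP rel32
// instruction to the target; negative values encode backward jumps.
int32_t relative_jump(uint32_t from_offset, uint32_t to_offset) {
  constexpr uint32_t kJmpRel32Size = 5;
  return static_cast<int32_t>(to_offset - (from_offset + kJmpRel32Size));
}
```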

16.4.4 Callback Pointer Resolution

External function calls require absolute addresses of callback functions:

auto PatchComputer::compute_callback_ptr(uint16_t func_id,
                                         const JITContext& ctx) -> uint64_t {
  const auto& callback = ctx.callbacks().get(func_id);
  return reinterpret_cast<uint64_t>(callback.function_ptr);
}

The JIT context maintains a table mapping function IDs to native function pointers, enabling dynamic linking of built-in functions, user-defined functions, and system callbacks.

16.5 Code Generation Pipeline

16.5.1 Baseline JIT Compilation

The baseline JIT performs direct bytecode-to-stencil translation:

class BaselineJIT {
public:
  auto compile(const BytecodeModule& module) -> JITCode;

private:
  void emit_prologue();
  void emit_instruction(const Instruction& inst);
  void emit_epilogue();
  void apply_patches();
  void resolve_jumps();
};

Compilation Algorithm:

function compile(module):
  output = new CodeBuffer()

  emit_prologue(output)

  for each instruction in module.code:
    stencil = lookup_stencil(instruction.opcode)
    offset = output.size()

    // Copy stencil code
    output.append(stencil.code, stencil.code_size)

    // Record patches for later resolution
    for each patch in stencil.patches:
      pending_patches.add(offset + patch.offset, patch, instruction)

    // Record jump target for later resolution
    if is_jump(instruction):
      pending_jumps.add(instruction.target, offset)

  emit_epilogue(output)

  // Resolve all patches
  for each (offset, patch, instruction) in pending_patches:
    value = compute_patch_value(patch, instruction)
    output.patch(offset, value, patch.size)

  // Resolve jump targets
  for each (target_bytecode, jump_offset) in pending_jumps:
    target_native = bytecode_to_native[target_bytecode]
    output.patch_relative_jump(jump_offset, target_native)

  return finalize(output)
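
The final jump-resolution pass of this algorithm can be sketched as follows; the PendingJump record and the 1-byte 0xE9 opcode layout are assumptions for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

// One unresolved jump recorded during stencil emission.
struct PendingJump {
  uint32_t jump_native_offset;  // where the JMP starts in native code
  uint32_t target_bytecode_pc;  // bytecode offset it targets
};

// Once every bytecode offset has a native offset, patch each rel32.
// The displacement is relative to the end of the 5-byte instruction,
// and its bytes start right after the 1-byte opcode.
void resolve_jumps(std::vector<uint8_t>& code,
                   const std::vector<PendingJump>& jumps,
                   const std::unordered_map<uint32_t, uint32_t>& bc_to_native) {
  for (const auto& j : jumps) {
    uint32_t target_native = bc_to_native.at(j.target_bytecode_pc);
    int32_t rel =
        static_cast<int32_t>(target_native - (j.jump_native_offset + 5));
    std::memcpy(code.data() + j.jump_native_offset + 1, &rel, 4);
  }
}
```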

16.5.2 Stencil Selection

The baseline JIT selects stencils based on opcode and operand types:

auto StencilSelector::select(Opcode opcode, const Instruction& inst)
    -> const Stencil* {

  // Primary opcode lookup
  if (opcode < Opcode::EXTENDED) {
    return &primary_stencils_[static_cast<uint8_t>(opcode)];
  }

  // Extended opcode lookup
  auto ext_opcode = static_cast<ExtendedOpcode>(inst.operand(0));
  return &extended_stencils_[static_cast<uint8_t>(ext_opcode)];
}

For polymorphic operations (e.g., ADD that can operate on integers, floats, or strings), type-specialized stencils may be selected based on type feedback:

auto StencilSelector::select_specialized(Opcode opcode,
                                         const TypeFeedback& feedback)
    -> const Stencil* {

  if (feedback.is_monomorphic()) {
    CVMType type = feedback.observed_type();
    return lookup_specialized_stencil(opcode, type);
  }

  // Fall back to polymorphic stencil with runtime dispatch
  return lookup_polymorphic_stencil(opcode);
}

16.5.3 Code Buffer Management

The CodeBuffer class manages the output machine code:

class CodeBuffer {
  std::vector<uint8_t> code_;
  size_t write_offset_;

public:
  void append(const uint8_t* data, size_t size) {
    code_.insert(code_.end(), data, data + size);
    write_offset_ += size;
  }

  void patch(size_t offset, uint64_t value, uint8_t size) {
    // Writes the low `size` bytes of `value` in place; assumes a
    // little-endian host, which holds for both x86-64 and AArch64 targets.
    switch (size) {
      case 1: code_[offset] = static_cast<uint8_t>(value); break;
      case 2: memcpy(&code_[offset], &value, 2); break;
      case 4: memcpy(&code_[offset], &value, 4); break;
      case 8: memcpy(&code_[offset], &value, 8); break;
    }
  }

  auto finalize() -> std::unique_ptr<uint8_t[]>;
};

16.5.4 Prologue and Epilogue Stencils

Every JIT-compiled function begins with a prologue and ends with an epilogue:

Prologue (x86-64):

push   rbp                    ; Save frame pointer
mov    rbp, rsp               ; Set up frame
push   rbx                    ; Save callee-saved registers
push   r12
push   r13
push   r14
push   r15
sub    rsp, 0x??              ; Allocate spill slots (patched)
mov    rbx, [rdi + R0_OFF]    ; Load pinned registers
mov    r12, [rdi + R1_OFF]
mov    r13, [rdi + R2_OFF]
mov    r14, [rdi + R3_OFF]

Epilogue (x86-64):

mov    [rdi + R0_OFF], rbx    ; Spill pinned registers
mov    [rdi + R1_OFF], r12
mov    [rdi + R2_OFF], r13
mov    [rdi + R3_OFF], r14
add    rsp, 0x??              ; Deallocate spill slots
pop    r15                    ; Restore callee-saved
pop    r14
pop    r13
pop    r12
pop    rbx
pop    rbp
ret

16.6 Register Mapping

16.6.1 x86-64 Register Allocation

The x86-64 architecture provides 16 general-purpose registers, of which several are reserved or have special purposes:

// x86-64 Register Mapping
struct X64RegisterMap {
  // Reserved (cannot be used for values)
  static constexpr auto kContext = RDI;     // VMContext* always in RDI
  static constexpr auto kStackPtr = RSP;    // Stack pointer
  static constexpr auto kFramePtr = RBP;    // Frame pointer

  // Pinned VM registers (callee-saved, persist across calls)
  static constexpr auto kR0 = RBX;          // VM R0
  static constexpr auto kR1 = R12;          // VM R1
  static constexpr auto kR2 = R13;          // VM R2
  static constexpr auto kR3 = R14;          // VM R3
  static constexpr auto kFrameCounter = R15; // Special: frame depth

  // Scratch registers (caller-saved, may be clobbered)
  static constexpr auto kScratch[] = {RAX, RCX, RDX, RSI, R8, R9, R10, R11};
};

Pinned Registers: VM registers R0-R3 are permanently mapped to callee-saved CPU registers. This avoids memory traffic for frequently accessed values.

Scratch Registers: Used for temporary values during complex operations. Must be saved before external calls.

Memory-Resident Registers: VM registers R4-R15 reside in memory (VMContext::registers_) and are loaded/stored as needed.

16.6.2 ARM64 Register Allocation

ARM64 provides more callee-saved registers, enabling more pinned VM registers:

// ARM64 Register Mapping
struct ARM64RegisterMap {
  // Reserved
  static constexpr auto kContext = X0;      // VMContext*
  static constexpr auto kStackPtr = SP;     // Stack pointer
  static constexpr auto kFramePtr = X29;    // Frame pointer (FP)
  static constexpr auto kLinkReg = X30;     // Link register (LR)

  // Pinned VM registers (X19-X28 are callee-saved)
  static constexpr auto kR0 = X19;
  static constexpr auto kR1 = X20;
  static constexpr auto kR2 = X21;
  static constexpr auto kR3 = X22;
  static constexpr auto kR4 = X23;
  static constexpr auto kR5 = X24;
  static constexpr auto kR6 = X25;
  static constexpr auto kR7 = X26;

  // Scratch registers
  static constexpr auto kScratch[] = {X1, X2, X3, X4, X5, X6, X7,
                                      X9, X10, X11, X12, X13, X14, X15};
};

ARM64's larger callee-saved register file allows pinning 8 VM registers versus only 4 on x86-64, reducing memory traffic for register-heavy code.

16.6.3 Register Spilling

When all scratch registers are in use, values must be spilled to memory:

class RegisterSpiller {
  uint32_t spill_slot_offset_;
  std::stack<uint32_t> free_slots_;

public:
  auto allocate_spill_slot() -> uint32_t {
    if (!free_slots_.empty()) {
      uint32_t slot = free_slots_.top();
      free_slots_.pop();
      return slot;
    }
    uint32_t slot = spill_slot_offset_;
    spill_slot_offset_ += 8;  // 8 bytes per slot
    return slot;
  }

  void free_spill_slot(uint32_t slot) {
    free_slots_.push(slot);
  }
};

The prologue allocates spill slots on the stack, and the spiller manages their allocation dynamically.

16.7 Memory Management

16.7.1 Executable Memory Allocation

JIT-compiled code requires memory with execute permission. Cognica uses platform-specific APIs:

class CodeRegion {
  void* base_;
  size_t size_;
  size_t used_;

public:
  static auto allocate(size_t size) -> std::unique_ptr<CodeRegion> {
#ifdef __linux__
    void* ptr = mmap(nullptr, size,
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
#elif defined(__APPLE__)
    // MAP_JIT regions on Apple Silicon are mapped RWX up front;
    // pthread_jit_write_protect_np() later toggles the calling thread
    // between write access and execute access.
    void* ptr = mmap(nullptr, size,
                     PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_JIT, -1, 0);
#endif
    return std::make_unique<CodeRegion>(ptr, size);
  }

  void make_executable() {
#ifdef __linux__
    mprotect(base_, size_, PROT_READ | PROT_EXEC);
#elif defined(__APPLE__)
    pthread_jit_write_protect_np(true);
#endif
  }
};

Write-XOR-Execute (W^X): Modern security policies prevent memory from being simultaneously writable and executable. The JIT writes code with write permission, then changes to execute permission before invocation.

Apple Silicon Considerations: macOS on ARM64 requires the MAP_JIT flag and uses pthread_jit_write_protect_np() to toggle between write and execute modes.

16.7.2 Code Cache Management

The code cache stores JIT-compiled functions with LRU eviction:

class CodeCache {
  struct CacheEntry {
    std::unique_ptr<JITCode> code;
    uint64_t last_access;
    uint32_t access_count;
  };

  std::unordered_map<uint64_t, CacheEntry> entries_;
  size_t total_code_size_;
  size_t max_code_size_;

public:
  auto lookup(uint64_t key) -> JITCode* {
    auto it = entries_.find(key);
    if (it != entries_.end()) {
      it->second.last_access = current_timestamp();
      it->second.access_count++;
      return it->second.code.get();
    }
    return nullptr;
  }

  void insert(uint64_t key, std::unique_ptr<JITCode> code) {
    size_t code_size = code->size();

    // Evict if necessary
    while (total_code_size_ + code_size > max_code_size_) {
      evict_lru();
    }

    entries_[key] = {std::move(code), current_timestamp(), 1};
    total_code_size_ += code_size;
  }

private:
  void evict_lru();
};

Cache Key: A hash of the bytecode module uniquely identifies JIT code. Different compilation options (optimization level, type specialization) produce different keys.

Size Limits: The cache enforces configurable size limits (default: 64MB) to bound memory usage. LRU eviction removes infrequently accessed code.
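
The evict_lru step left undeclared above can be sketched as a linear scan over the entry metadata (a standalone, simplified version with illustrative names; O(n) per eviction is acceptable because evictions are far rarer than lookups):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Minimal metadata needed for the eviction decision.
struct CacheEntry {
  uint64_t code_size;
  uint64_t last_access;
};

// Evict the entry with the smallest last_access timestamp, adjust the
// cache's size accounting, and return the evicted key.
// Precondition: entries is non-empty.
uint64_t evict_lru(std::unordered_map<uint64_t, CacheEntry>& entries,
                   uint64_t& total_code_size) {
  auto victim = entries.begin();
  for (auto it = entries.begin(); it != entries.end(); ++it) {
    if (it->second.last_access < victim->second.last_access) victim = it;
  }
  uint64_t key = victim->first;
  total_code_size -= victim->second.code_size;
  entries.erase(victim);
  return key;
}
```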

16.7.3 Code Region Fragmentation

Over time, code allocation and eviction can fragment the executable memory region.

Cognica addresses fragmentation through:

  1. Bump Allocation: New code is allocated at the end of the region until full
  2. Region Compaction: When fragmentation exceeds a threshold, live code is copied to a new region
  3. Multiple Regions: Large allocations can span multiple regions
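
Strategy 1 can be sketched as a cursor that only moves forward; the BumpRegion class is illustrative, not the Cognica implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bump allocation within one code region: round the cursor up to the
// stencil's alignment (assumed to be a power of two), advance it, and
// report SIZE_MAX when full so the caller can open a new region or compact.
class BumpRegion {
 public:
  explicit BumpRegion(size_t capacity) : capacity_(capacity) {}

  size_t allocate(size_t size, size_t alignment) {
    size_t offset = (used_ + alignment - 1) & ~(alignment - 1);
    if (offset + size > capacity_) return SIZE_MAX;  // region full
    used_ = offset + size;
    return offset;  // byte offset of the new code within the region
  }

  size_t used() const { return used_; }

 private:
  size_t capacity_;
  size_t used_ = 0;
};
```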

16.8 External Function Calls

16.8.1 Call Classification

External calls are classified by their requirements:

enum class CallType {
  kBuiltin,    // Fast path: no allocation, no side effects
  kScalar,     // User-defined: may allocate, needs GC safety
  kExternal,   // Lua/Python: full spill, may yield
  kSPI,        // Server Programming Interface: transaction-aware
};

Each call type has different spilling requirements:

Call Type | Spill Pinned | Spill Scratch | Save Context
----------|--------------|---------------|-------------
Builtin   | No           | Partial       | No
Scalar    | Yes          | Yes           | No
External  | Yes          | Yes           | Yes
SPI       | Yes          | Yes           | Yes

16.8.2 Call Sequence

The JIT generates call sequences appropriate to the call type:

Builtin Call (minimal overhead):

; RAX = builtin function pointer
; Arguments in RDI (ctx), RSI (arg1), RDX (arg2)
call   rax
; Result in RAX, no spilling needed

External Call (full spill):

; Spill all pinned registers to VMContext
mov    [rdi + R0_OFF], rbx
mov    [rdi + R1_OFF], r12
mov    [rdi + R2_OFF], r13
mov    [rdi + R3_OFF], r14

; Save current PC for stack unwinding
mov    QWORD PTR [rdi + PC_OFF], current_pc

; Prepare arguments on operand stack
; ... argument marshaling ...

; Call external function
mov    rax, [callback_table + func_id * 8]
call   rax

; Check for success
test   BYTE PTR [rdi + RESULT_SUCCESS_OFF], 1
jz     handle_error

; Reload pinned registers (may have been modified)
mov    rbx, [rdi + R0_OFF]
mov    r12, [rdi + R1_OFF]
mov    r13, [rdi + R2_OFF]
mov    r14, [rdi + R3_OFF]

; Copy result to the destination (after the reload, so the reload
; cannot clobber it)
mov    rax, [rdi + RESULT_VALUE_OFF]

16.8.3 Callback Preservation

The JIT preserves callback semantics established by the interpreter:

struct VMCallbacks {
  CreateCursorFn create_cursor;      // Open table/index cursor
  CloseCursorFn close_cursor;        // Close cursor
  EmitRowFn emit_row;                // Output result row
  ExecuteSubqueryFn execute_subquery; // Nested query
  AllocateStringFn allocate_string;  // String allocation
  RaiseExceptionFn raise_exception;  // Exception handling
};

These callbacks enable the JIT-compiled code to interact with the broader database system without knowledge of its implementation.

16.9 Exception Handling

16.9.1 Exception Model

Cognica supports PL/pgSQL-style exception handling with EXCEPTION blocks:

BEGIN
  INSERT INTO accounts VALUES (id, balance);
EXCEPTION
  WHEN unique_violation THEN
    UPDATE accounts SET balance = balance + amount WHERE id = $1;
END;

The JIT must support exception propagation while maintaining performance.

16.9.2 Handler Stack

Exception handlers form a stack in VMContext:

struct JITExceptionHandler {
  uint32_t bytecode_pc;       // Handler entry in bytecode
  void* native_address;       // Corresponding JIT code address
  uint32_t call_depth;        // Call stack depth at handler
  uint32_t operand_depth;     // Operand stack depth at handler
  uint32_t register_mask;     // Registers valid at handler
};

class VMContext {
  std::vector<JITExceptionHandler> exception_handlers_;
};

16.9.3 Exception Stencils

The JIT uses specialized stencils for exception handling:

EXCEPTION_PUSH: Establishes a new handler

; Push handler to exception stack
mov    rax, [rdi + HANDLER_STACK_PTR]
mov    DWORD PTR [rax], handler_pc          ; bytecode_pc
mov    QWORD PTR [rax + 4], handler_native  ; native_address
mov    DWORD PTR [rax + 12], call_depth     ; call_depth
add    QWORD PTR [rdi + HANDLER_STACK_PTR], 20

EXCEPTION_POP: Removes the current handler

sub    QWORD PTR [rdi + HANDLER_STACK_PTR], 20

RAISE_EXCEPTION: Triggers exception handling

; Find matching handler
mov    rsi, exception_type
call   find_exception_handler

; If no handler, propagate
test   rax, rax
jz     propagate_to_caller

; Jump to handler's native address
mov    rsp, [rax + HANDLER_STACK_DEPTH]  ; Restore stack
jmp    [rax + HANDLER_NATIVE_ADDR]        ; Jump to handler

16.9.4 Stack Unwinding

When an exception occurs, the JIT must unwind the stack to the handler:

  1. Locate Handler: Search the handler stack for a matching exception type
  2. Restore Stack: Reset RSP to the handler's saved stack depth
  3. Restore Registers: Reload pinned registers from VMContext
  4. Transfer Control: Jump to the handler's native address

The runtime helper restores the interpreter-visible state before control transfers:

void JITRuntime::unwind_to_handler(VMContext* ctx,
                                   const JITExceptionHandler& handler) {
  // Restore call stack
  ctx->call_stack_.resize(handler.call_depth);

  // Restore operand stack
  ctx->operand_stack_.resize(handler.operand_depth);

  // Execution continues at handler.native_address
}

16.10 Optimizing JIT (Tier 2)

16.10.1 IR-Based Optimization

The optimizing JIT builds an intermediate representation for additional transformations:

class JITIRNode {
public:
  enum class Kind {
    kConstant,
    kParameter,
    kBinaryOp,
    kUnaryOp,
    kLoad,
    kStore,
    kCall,
    kBranch,
    kPhi,
  };

  Kind kind_;
  CVMType type_;
  std::vector<JITIRNode*> inputs_;
  std::vector<JITIRNode*> users_;
};

16.10.2 Optimization Passes

The optimizing JIT applies several transformation passes:

Constant Propagation: Replaces operations on known constants with their results.

Before: LOAD_CONST R1, 10
        LOAD_CONST R2, 20
        ADD_I64 R3, R1, R2

After:  LOAD_CONST R3, 30

Dead Code Elimination: Removes instructions whose results are never used.

Common Subexpression Elimination: Reuses previously computed values.

Before: MUL_I64 R1, R0, R0    ; R0 * R0
        MUL_I64 R2, R0, R0    ; R0 * R0 (duplicate)
        ADD_I64 R3, R1, R2

After:  MUL_I64 R1, R0, R0    ; R0 * R0
        ADD_I64 R3, R1, R1    ; Reuse R1

Loop-Invariant Code Motion: Hoists computations out of loops.

Before: loop:
          LOAD R1, base_ptr
          ADD R2, R1, offset
          ; ... use R2 ...
          JMP loop

After:  LOAD R1, base_ptr      ; Hoisted
        ADD R2, R1, offset     ; Hoisted
        loop:
          ; ... use R2 ...
          JMP loop

Type Specialization: Uses type feedback to generate specialized code:

; Generic (polymorphic)
CALL type_check, R1
CMP result, kInt64
JNE slow_path
; ... int64 fast path ...

; Specialized (monomorphic, after observing only int64)
; Type check eliminated, directly execute int64 path
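
The constant-propagation rewrite shown first can be expressed as a single forward pass over a toy instruction list (the Op enum and Inst layout are illustrative, not the CVM encoding):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy instruction set: just enough to sketch the pass.
enum class Op { kLoadConst, kAddI64 };

struct Inst {
  Op op;
  int dst;
  int64_t imm;     // kLoadConst: constant value
  int src1, src2;  // kAddI64: source registers
};

// Fold ADD_I64 over registers whose values are known constants into a
// single LOAD_CONST, mirroring the before/after example above.
std::vector<Inst> fold_constants(const std::vector<Inst>& in) {
  std::unordered_map<int, int64_t> known;  // reg -> constant value
  std::vector<Inst> out;
  for (const Inst& i : in) {
    if (i.op == Op::kLoadConst) {
      known[i.dst] = i.imm;
      out.push_back(i);
    } else if (i.op == Op::kAddI64 && known.count(i.src1) &&
               known.count(i.src2)) {
      int64_t v = known[i.src1] + known[i.src2];
      known[i.dst] = v;
      out.push_back({Op::kLoadConst, i.dst, v, 0, 0});
    } else {
      known.erase(i.dst);  // result no longer a known constant
      out.push_back(i);
    }
  }
  return out;
}
```

The stale LOAD_CONSTs for the source registers left behind are exactly what the dead code elimination pass then removes.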

16.10.3 Code Quality Comparison

The optimization passes improve code quality significantly:

Metric                    | Baseline JIT | Optimized JIT
--------------------------|--------------|--------------
Instructions per bytecode | 5-8          | 3-5
Memory accesses           | 2-3 per op   | 1-2 per op
Branch mispredictions     | Moderate     | Low
Code size                 | 1.0x         | 0.7x

16.11 Performance Characteristics

16.11.1 Compilation Time

Copy-and-patch compilation is dramatically faster than traditional JIT:

Approach      | Time per KB | 10KB Function
--------------|-------------|--------------
LLVM -O0      | ~50ms       | 500ms
LLVM -O2      | ~200ms      | 2000ms
Baseline JIT  | ~1ms        | 10ms
Optimized JIT | ~10ms       | 100ms

The 50-200x compilation speedup makes JIT practical for short-running queries.

16.11.2 Execution Speedup

Performance improvement depends on workload characteristics:

Workload Type          | Baseline JIT | Optimized JIT
-----------------------|--------------|--------------
Tight arithmetic loops | 3-5x         | 5-10x
Expression evaluation  | 2-3x         | 3-4x
Aggregate operations   | 2-3x         | 3-5x
Field access heavy     | 1.5-2x       | 2-3x
External call heavy    | 1.0-1.2x     | 1.2-1.5x

Example: PL/pgSQL Loop

DO $$
DECLARE
  i INTEGER := 0;
  sum INTEGER := 0;
BEGIN
  WHILE i < 1000000 LOOP
    sum := sum + i;
    i := i + 1;
  END LOOP;
END $$;

Execution Mode | Cycles/Iteration | Total Time
---------------|------------------|-----------
Interpreter    | ~180             | 180ms
Baseline JIT   | ~40              | 40ms
Optimized JIT  | ~16              | 16ms

Speedup: 11x (Optimized JIT vs. Interpreter)

16.11.3 When JIT Is Beneficial

JIT compilation provides the greatest benefit when:

  1. High iteration count: Loops executing thousands of times amortize compilation cost
  2. Compute-intensive: Arithmetic and comparison operations benefit most
  3. Register-heavy: Code using many VM registers benefits from pinned CPU registers
  4. Predictable types: Monomorphic operations enable type specialization

JIT provides minimal benefit when:

  1. I/O bound: Disk or network latency dominates
  2. External call heavy: Calls to Lua/Python or complex functions dominate
  3. Short-running: Queries executing in microseconds cannot amortize compilation
  4. Cold code: Functions executed once or twice

16.12 Summary

Copy-and-patch JIT compilation enables Cognica to achieve native code performance without the compilation latency of traditional JIT compilers. The key concepts covered in this chapter are:

Stencil-Based Compilation: Pre-compiled code templates with designated patch points eliminate instruction selection and code emission from the compilation critical path. Approximately 110 stencils per architecture cover all CVM opcodes.

Tiered Compilation: A three-tier system (Interpreter -> Baseline JIT -> Optimized JIT) balances compilation overhead against execution benefit. Profiling infrastructure guides tier transitions based on execution counts and type feedback.

Patch Value Computation: The PatchComputer class translates bytecode operands to native values: register offsets, relative jumps, and callback pointers. Careful offset arithmetic ensures correct execution.

Register Mapping: Pinned registers (4 on x86-64, 8 on ARM64) eliminate memory traffic for frequently accessed VM registers. Scratch registers handle temporary values with spilling when necessary.

Memory Management: Executable memory allocation with W^X protection, LRU code cache eviction, and region compaction manage JIT code lifecycle within bounded memory.

Exception Handling: Handler stacks and specialized stencils enable PL/pgSQL-style exception blocks with proper stack unwinding.

Optimization Passes: The optimizing JIT applies constant propagation, dead code elimination, CSE, LICM, and type specialization for an additional 2x speedup over baseline JIT.

The combination of fast compilation (~1ms/KB) and significant speedup (2-10x) makes copy-and-patch JIT ideal for database workloads where queries range from microseconds to seconds.

References

  1. Xu, H., & Kjolstad, F. (2021). Copy-and-Patch Compilation: A fast compilation algorithm for high-level languages and bytecode. OOPSLA.
  2. Aycock, J. (2003). A Brief History of Just-In-Time. ACM Computing Surveys.
  3. Holzle, U., & Ungar, D. (1994). Optimizing Dynamically-Dispatched Calls with Run-Time Type Feedback. PLDI.
  4. Lattner, C., & Adve, V. (2004). LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation. CGO.
  5. Deutsch, L. P., & Schiffman, A. M. (1984). Efficient Implementation of the Smalltalk-80 System. POPL.

Copyright (c) 2023-2026 Cognica, Inc.