Copy-and-Patch JIT: Achieving Native Code Performance with Microsecond Compilation

Posted on January 17, 2026

How Cognica Database Engine breaks the JIT compilation latency barrier


When you execute a SQL query, the database engine faces a fundamental tension: interpretation is slow but starts immediately, while compilation produces fast code but takes time. For queries that complete in milliseconds, spending hundreds of milliseconds on LLVM optimization is like warming up your car longer than your commute.

This post explores Copy-and-Patch JIT compilation, a technique we implemented in Cognica Database Engine that achieves 2-10x speedup over interpretation while keeping compilation time under one millisecond per kilobyte of bytecode. The approach, pioneered by Haoran Xu and Fredrik Kjolstad at Stanford, fundamentally reimagines what JIT compilation can be.

The JIT Compilation Dilemma

Traditional JIT compilers like those based on LLVM perform sophisticated optimizations: instruction selection, register allocation, dead code elimination, loop unrolling, and more. These transformations produce excellent code quality—often within a few percent of ahead-of-time compiled C++—but they come at a cost.

Consider the compilation time spectrum:

Approach         | Compilation Time | Code Quality    | Sweet Spot
Interpretation   | 0                | Baseline        | All queries
Copy-and-Patch   | ~1ms/KB          | 80-90% optimal  | Hot loops
Method JIT       | ~10ms/KB         | 90-95% optimal  | Warm methods
LLVM JIT         | ~100ms+/KB       | Optimal         | Long-running

For a database query that scans a table and computes an aggregate in 5 milliseconds, spending 200 milliseconds on LLVM compilation is absurd. The query would need to run 40 times just to break even on compilation overhead. Yet without JIT compilation, compute-intensive operations like expression evaluation and aggregation become bottlenecks.

Copy-and-Patch breaks this tradeoff by achieving compilation speeds measured in microseconds while delivering code quality within 10-20% of heavily optimized compilers.

The Core Insight: Most Instructions Are Templates

The key observation behind Copy-and-Patch is deceptively simple: most bytecode instructions map to a small, predictable set of machine code patterns.

Consider an integer addition instruction in Cognica's virtual machine:

ADD_I64 R3, R1, R2 ; R3 = R1 + R2

The corresponding x86-64 machine code always follows this pattern:

mov rax, [rdi + R1_OFFSET]   ; Load R1 from VMContext
add rax, [rdi + R2_OFFSET]   ; Add R2
mov [rdi + R3_OFFSET], rax   ; Store to R3

The only values that change between different instances of this instruction are the register offsets. The opcode selection, addressing mode, and instruction encoding are always identical.

Traditional JIT compilers rediscover this pattern every time they compile an ADD instruction. They run instruction selection algorithms, consider register allocation constraints, and emit the same bytes they emitted last time. Copy-and-Patch eliminates this redundancy by pre-compiling these patterns into stencils—machine code templates with designated patch points.

Stencils: Pre-Compiled Code Templates

A stencil is simply pre-compiled machine code with placeholder values that get filled in at JIT time:

struct Stencil {
  const uint8_t* code;              // Pre-compiled machine code bytes
  size_t code_size;                 // Size in bytes
  std::vector<PatchSite> patches;   // Locations requiring runtime patching
  size_t alignment;                 // Required alignment (8 or 16 bytes)
  const char* name;                 // Debug name, e.g., "add_i64"
  Opcode opcode;                    // Corresponding VM opcode
};

Each patch site describes exactly what needs to be modified:

struct PatchSite {
  uint32_t offset;        // Byte offset within the stencil
  PatchType type;         // What kind of value to patch in
  uint8_t size;           // Patch size: 1, 2, 4, or 8 bytes
  int8_t operand_index;   // Which instruction operand (-1 for special)
};

The patch types cover all the runtime values a stencil might need:

enum class PatchType : uint8_t {
  kRegisterOffset,    // Offset into VMContext::registers_
  kImmediate32,       // 32-bit immediate constant
  kImmediate64,       // 64-bit immediate constant
  kRelativeJump,      // PC-relative jump offset
  kAbsoluteAddress,   // Absolute address for calls
  kConstantPoolPtr,   // Pointer into constant pool
  kCallbackPtr,       // External function pointer
};
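As a concrete illustration, here is what the ADD_I64 stencil from earlier might look like written out by hand. The byte encoding is genuine x86-64 (REX.W-prefixed mov/add with a 32-bit displacement off rdi), but the operand ordering, the Opcode enumerator, and the literal layout are assumptions for this example rather than Cognica's actual definitions:

// Hypothetical, hand-written version of the ADD_I64 stencil.
// Each instruction ends with a 4-byte displacement (initially zero) that is
// patched with the appropriate register offset at JIT time.
const uint8_t kAddI64Code[] = {
    0x48, 0x8B, 0x87, 0x00, 0x00, 0x00, 0x00,  // mov rax, [rdi + R1_OFFSET]
    0x48, 0x03, 0x87, 0x00, 0x00, 0x00, 0x00,  // add rax, [rdi + R2_OFFSET]
    0x48, 0x89, 0x87, 0x00, 0x00, 0x00, 0x00,  // mov [rdi + R3_OFFSET], rax
};

const Stencil kAddI64Stencil = {
    .code = kAddI64Code,
    .code_size = sizeof(kAddI64Code),
    // Assumed operand order for "ADD_I64 R3, R1, R2": operand 0 is the
    // destination (R3), operands 1 and 2 are the sources (R1, R2).
    .patches = {{.offset = 3,  .type = PatchType::kRegisterOffset, .size = 4, .operand_index = 1},
                {.offset = 10, .type = PatchType::kRegisterOffset, .size = 4, .operand_index = 2},
                {.offset = 17, .type = PatchType::kRegisterOffset, .size = 4, .operand_index = 0}},
    .alignment = 16,
    .name = "add_i64",
    .opcode = Opcode::ADD_I64,  // assumed enumerator name
};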

Cognica maintains approximately 110 stencils per target architecture (x86-64 and ARM64), covering arithmetic operations, comparisons, control flow, memory access, type operations, and special cases like function prologues and epilogues.

The Compilation Algorithm

With stencils prepared, JIT compilation becomes remarkably simple:

function compile(bytecode_module):
    output = new CodeBuffer()
    emit_prologue(output)

    for each instruction in bytecode_module:
        stencil = lookup_stencil(instruction.opcode)
        current_offset = output.size()

        // Step 1: Copy the stencil's machine code
        output.append(stencil.code, stencil.code_size)

        // Step 2: Record patches for later resolution
        for each patch in stencil.patches:
            pending_patches.add(current_offset + patch.offset, patch, instruction)

    emit_epilogue(output)

    // Step 3: Apply all patches
    for each (offset, patch, instruction) in pending_patches:
        value = compute_patch_value(patch, instruction)
        output.patch(offset, value, patch.size)

    return finalize(output)

The critical insight is what's not happening here. There's no instruction selection—the stencil already contains the right instructions. There's no register allocation—the stencil already encodes register usage. There's no code emission—we're just copying bytes. The entire compilation reduces to memory copies and simple arithmetic to compute patch values.

Patch Value Computation

Computing patch values is straightforward arithmetic. For register offsets:

auto compute_register_offset(uint8_t reg) -> uint32_t {
  // VMContext stores registers at a fixed offset
  // Each VMValue is 24 bytes (value + type tag + flags)
  constexpr uint32_t kRegistersOffset = offsetof(VMContext, registers_);
  constexpr uint32_t kVMValueSize = 24;
  return kRegistersOffset + reg * kVMValueSize;
}

For relative jumps on x86-64:

auto compute_relative_jump(uint32_t from_offset, uint32_t to_offset) -> int32_t {
  // x86-64 relative jumps are computed from the end of the instruction
  // JMP rel32 is 5 bytes long
  return static_cast<int32_t>(to_offset - (from_offset + 5));
}

These calculations execute in nanoseconds, contributing negligibly to total compilation time.
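The final `output.patch(offset, value, patch.size)` step is equally cheap. A minimal sketch of how it could be implemented is shown below; the function name and buffer type are illustrative, and it assumes both the build host and the target are little-endian (true for x86-64 and ARM64):

#include <cstdint>
#include <cstring>
#include <vector>

// Write `size` bytes of `value` at `offset` within the code buffer.
// Copying the low-order bytes of a 64-bit value handles 1/2/4/8-byte
// patches uniformly on a little-endian host targeting a little-endian ISA.
void apply_patch(std::vector<uint8_t>& code, uint32_t offset,
                 uint64_t value, uint8_t size) {
  std::memcpy(code.data() + offset, &value, size);
}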

Tiered Compilation: Balancing Warmup and Peak Performance

Not all code deserves JIT compilation. A function called once should run in the interpreter; a tight loop executing millions of times deserves aggressive optimization. Cognica implements a three-tier system where code progressively advances through compilation stages based on observed execution frequency.

Tier 0 — Interpreter. All code starts here. Collects execution profiles and uses computed-goto dispatch for baseline performance. Compilation time: none. Speedup: baseline.

Tier 1 — Baseline JIT (Copy-and-Patch). Direct bytecode-to-stencil translation without optimization passes. Compilation time: ~1ms/KB. Speedup: 2-5x over the interpreter.

Tier 2 — Optimizing JIT. Builds IR and applies optimization passes including constant propagation, CSE, LICM, and type specialization. Compilation time: ~10ms/KB. Speedup: an additional ~2x over the baseline JIT.

Code transitions from Tier 0 to Tier 1 after reaching 100 invocations or 1000 loop iterations. The transition from Tier 1 to Tier 2 occurs after 1000 invocations or 10000 loop iterations. These thresholds ensure that compilation effort is invested only in code that will execute frequently enough to recoup the compilation cost.

The tier transition thresholds are derived from a cost-benefit analysis. Let $C_{\text{compile}}$ be the one-time compilation cost, $T_{\text{interp}}$ the interpretation time per invocation, $T_{\text{jit}}$ the JIT execution time per invocation, and $N$ the remaining expected invocations. Compilation is beneficial when:

$$(T_{\text{interp}} - T_{\text{jit}}) \cdot N > C_{\text{compile}}$$

Solving for $N$:

$$N > \frac{C_{\text{compile}}}{T_{\text{interp}} - T_{\text{jit}}}$$

With typical values ($C_{\text{compile}} = 1\,\text{ms}$, $T_{\text{interp}} = 12\,\mu\text{s}$, $T_{\text{jit}} = 4\,\mu\text{s}$), the break-even point is approximately 125 invocations. The default threshold of 100 is slightly aggressive, based on the observation that functions that have already been called 100 times tend to be called many more times.
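A sketch of how the tier-up decision might look in code, using the thresholds above (the counter fields and function name are illustrative, not Cognica's actual API):

#include <cstdint>

// Hypothetical profiling counters attached to each function.
struct FunctionProfile {
  uint64_t invocations = 0;
  uint64_t loop_iterations = 0;
  int tier = 0;  // 0 = interpreter, 1 = baseline JIT, 2 = optimizing JIT
};

// Thresholds from the post: 100 invocations or 1,000 loop iterations to
// reach Tier 1; 1,000 invocations or 10,000 loop iterations to reach Tier 2.
int next_tier(const FunctionProfile& p) {
  if (p.tier == 0 && (p.invocations >= 100 || p.loop_iterations >= 1'000)) {
    return 1;
  }
  if (p.tier == 1 && (p.invocations >= 1'000 || p.loop_iterations >= 10'000)) {
    return 2;
  }
  return p.tier;  // stay at the current tier
}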

How SQL Queries Get Accelerated

Now let's examine how Copy-and-Patch JIT specifically accelerates SQL query execution. The performance gains come from several complementary mechanisms.

Expression Evaluation

Consider a query with a complex WHERE clause:

SELECT *
FROM orders
WHERE total_amount > 1000
  AND EXTRACT(YEAR FROM order_date) = 2026
  AND customer_id IN (SELECT id FROM premium_customers);

The expression total_amount > 1000 AND EXTRACT(YEAR FROM order_date) = 2026 gets compiled to bytecode like:

    LOAD_FIELD R1, row, "total_amount"
    LOAD_CONST R2, 1000
    CMP_GT_I64 R3, R1, R2
    JZ skip_second_condition, R3
    LOAD_FIELD R4, row, "order_date"
    CALL R5, extract_year, R4
    LOAD_CONST R6, 2026
    CMP_EQ_I64 R7, R5, R6
    AND R3, R3, R7
skip_second_condition:
    ; R3 contains the final boolean result

In the interpreter, each instruction requires fetching the opcode from memory, dispatching to the handler (indirect jump or switch), decoding operands, executing the operation, and advancing the instruction pointer. This overhead dominates for simple operations. A comparison that executes in 1-2 CPU cycles gets wrapped in 15-20 cycles of dispatch overhead.
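To see where those extra cycles go, here is a minimal, self-contained sketch of a switch-based interpreter loop (the opcodes and operand encoding are simplified; Cognica's Tier 0 uses computed-goto dispatch, but the per-instruction bookkeeping is the same in spirit):

#include <cstdint>
#include <vector>

// Every instruction pays for an opcode fetch, a dispatch branch, operand
// decoding, and instruction-pointer bookkeeping before any useful work.
enum class Op : uint8_t { kLoadConst, kAddI64, kCmpGtI64, kHalt };

struct Insn {
  Op op;
  uint8_t dst, src1, src2;
  int64_t imm;
};

int64_t interpret(const std::vector<Insn>& code, int64_t regs[16]) {
  size_t ip = 0;
  for (;;) {
    const Insn& insn = code[ip];   // fetch
    switch (insn.op) {             // dispatch (jump table / indirect branch)
      case Op::kLoadConst: regs[insn.dst] = insn.imm; break;
      case Op::kAddI64:    regs[insn.dst] = regs[insn.src1] + regs[insn.src2]; break;
      case Op::kCmpGtI64:  regs[insn.dst] = regs[insn.src1] > regs[insn.src2]; break;
      case Op::kHalt:      return regs[0];
    }
    ++ip;                          // advance the instruction pointer
  }
}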

The JIT-compiled version eliminates all dispatch overhead. The stencils chain together into a continuous sequence of machine instructions:

; LOAD_FIELD R1, row, "total_amount"
mov rax, [rsi + 0x40]              ; Load field at offset 0x40
mov [rdi + 0x20], rax              ; Store to R1

; LOAD_CONST R2, 1000
mov QWORD PTR [rdi + 0x38], 1000   ; Store constant to R2

; CMP_GT_I64 R3, R1, R2
mov rax, [rdi + 0x20]              ; Load R1
cmp rax, [rdi + 0x38]              ; Compare with R2
setg al                            ; Set AL if greater
movzx eax, al
mov [rdi + 0x50], rax              ; Store to R3

; JZ skip_second_condition, R3
mov rax, [rdi + 0x50]
test rax, rax
jz skip_second_condition

; ... remaining instructions ...

For a table scan processing millions of rows, this 3-5x reduction in per-row overhead translates directly to wall-clock time savings.

Aggregate Functions

Aggregates like SUM, AVG, COUNT, MIN, and MAX execute tight loops that benefit enormously from JIT compilation:

SELECT region,
       SUM(amount) AS total,
       AVG(amount) AS average,
       COUNT(*) AS count
FROM sales
GROUP BY region;

The aggregation loop compiles to bytecode that accumulates values:

aggregate_loop:
    LOAD_FIELD R1, row, "amount"
    ADD_F64 R_sum, R_sum, R1        ; Accumulate sum
    ADD_I64 R_count, R_count, 1     ; Increment count
    NEXT_ROW row
    JNZ aggregate_loop, row

    DIV_F64 R_avg, R_sum, R_count   ; Compute average

The interpreter executes this loop with ~180 cycles per iteration. The baseline JIT reduces this to ~40 cycles. The optimizing JIT, with loop-invariant code motion and register pinning, achieves ~16 cycles per iteration—an 11x improvement.

Join Processing

Hash joins and nested loop joins contain performance-critical inner loops:

SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA';

The hash probe loop in a hash join:

probe_loop:
    LOAD_FIELD R1, probe_row, "customer_id"
    HASH R2, R1
    AND R2, R2, hash_mask          ; R2 = bucket index
    LOAD R3, hash_table, R2        ; R3 = bucket head
bucket_scan:
    JZ no_match, R3
    LOAD_FIELD R4, R3, "id"
    CMP_EQ R5, R1, R4
    JZ next_entry, R5
    ; Match found - emit joined row
    CALL emit_joined_row, probe_row, R3
next_entry:
    LOAD_FIELD R3, R3, "next"      ; Follow chain
    JMP bucket_scan
no_match:
    NEXT_ROW probe_row
    JNZ probe_loop, probe_row

JIT compilation accelerates both the hashing and the bucket chain traversal. For hash joins with good hash distribution (short chains), the speedup is 2-3x. For joins with hash collisions requiring longer chain traversals, the speedup can reach 4-5x due to better branch prediction in compiled code.

Stored Procedures and PL/pgSQL

The most dramatic improvements appear in procedural code. Consider a PL/pgSQL function that processes data:

CREATE FUNCTION calculate_bonuses() RETURNS void AS $$
DECLARE
    emp RECORD;
    bonus NUMERIC;
BEGIN
    FOR emp IN SELECT * FROM employees WHERE department = 'Sales' LOOP
        IF emp.sales_total > 100000 THEN
            bonus := emp.sales_total * 0.05;
        ELSIF emp.sales_total > 50000 THEN
            bonus := emp.sales_total * 0.03;
        ELSE
            bonus := emp.sales_total * 0.01;
        END IF;

        UPDATE employees SET bonus_amount = bonus WHERE id = emp.id;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

This function contains a cursor loop (many iterations), conditional branches (amenable to branch prediction), arithmetic operations (pure computation), and nested SQL statements (separate execution). The loop body—everything except the UPDATE statement—benefits significantly from JIT compilation. With 10,000 employees in Sales, the interpreted version might take 1.8 seconds while the JIT-compiled version completes in 160 milliseconds.

Window Functions

Window functions process rows while maintaining state across the window frame:

SELECT date,
       amount,
       SUM(amount) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS week_total,
       AVG(amount) OVER (ORDER BY date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS month_avg
FROM daily_sales;

The sliding window aggregation:

window_loop:
    ; Remove old values from window
    LOAD R1, window_start
    CMP_LT R2, R1, current_row - window_size
    JZ skip_remove, R2
    SUB_F64 R_sum, R_sum, [R1 + amount_offset]
    SUB_I64 R_count, R_count, 1
    ADD R1, R1, row_size
    STORE window_start, R1
skip_remove:
    ; Add new value to window
    LOAD_FIELD R3, current_row, "amount"
    ADD_F64 R_sum, R_sum, R3
    ADD_I64 R_count, R_count, 1

    ; Compute window result
    DIV_F64 R_avg, R_sum, R_count
    EMIT current_row, R_avg

    NEXT_ROW current_row
    JNZ window_loop, current_row

The tight loop with pointer arithmetic and accumulator updates is exactly the pattern where Copy-and-Patch shines. Window function queries commonly see 3-4x speedup.

Register Mapping: Bridging VM and Hardware

One source of overhead in bytecode execution is the impedance mismatch between VM registers (in memory) and CPU registers (in silicon). Cognica's JIT bridges this gap through register pinning.

On x86-64:

struct X64RegisterMap {
  // Reserved (cannot hold values)
  static constexpr auto kContext = RDI;    // VMContext* always here
  static constexpr auto kStackPtr = RSP;   // Hardware stack
  static constexpr auto kFramePtr = RBP;   // Frame pointer

  // Pinned VM registers (callee-saved, persist across calls)
  static constexpr auto kR0 = RBX;         // VM R0 lives in RBX
  static constexpr auto kR1 = R12;         // VM R1 lives in R12
  static constexpr auto kR2 = R13;         // VM R2 lives in R13
  static constexpr auto kR3 = R14;         // VM R3 lives in R14

  // Scratch registers (caller-saved, temporaries)
  static constexpr auto kScratch[] = {RAX, RCX, RDX, RSI, R8, R9, R10, R11};
};

VM registers R0-R3 are permanently mapped to callee-saved CPU registers. When the JIT compiles code that uses these registers, it generates instructions that operate directly on RBX, R12, R13, R14—no memory loads or stores required.

ARM64's more generous callee-saved register set allows pinning 8 VM registers instead of 4, further reducing memory traffic on Apple Silicon and other ARM64 platforms.
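A hypothetical ARM64 counterpart to the x86-64 map might look like the following; the register choices illustrate the callee-saved x19-x28 range and the AAPCS64 argument convention (first argument in x0), but are not necessarily Cognica's exact assignment:

#include <cstdint>

// Illustrative ARM64 register identifiers (values follow the hardware
// register numbering; SP is encoded as register 31).
enum Arm64Reg : uint8_t {
  X0 = 0, X9 = 9, X10, X11, X12, X13, X14, X15,
  X19 = 19, X20, X21, X22, X23, X24, X25, X26,
  X29 = 29, SP = 31,
};

struct Arm64RegisterMap {
  // Reserved (cannot hold VM values)
  static constexpr Arm64Reg kContext = X0;    // VMContext* arrives in the first argument register
  static constexpr Arm64Reg kFramePtr = X29;  // Frame pointer
  static constexpr Arm64Reg kStackPtr = SP;   // Hardware stack

  // Pinned VM registers (callee-saved x19-x26 persist across calls)
  static constexpr Arm64Reg kPinned[] = {X19, X20, X21, X22, X23, X24, X25, X26};

  // Scratch registers (caller-saved temporaries)
  static constexpr Arm64Reg kScratch[] = {X9, X10, X11, X12, X13, X14, X15};
};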

Memory Management: Executable Code in a Secure World

Modern operating systems enforce W^X (Write XOR Execute) policies: memory can be writable or executable, but not both simultaneously. The JIT must navigate this constraint:

class CodeRegion {
 public:
  static auto allocate(size_t size) -> std::unique_ptr<CodeRegion> {
#ifdef __linux__
    void* ptr = mmap(nullptr, size,
                     PROT_READ | PROT_WRITE,  // Start writable
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
#elif defined(__APPLE__)
    void* ptr = mmap(nullptr, size,
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_JIT, -1, 0);
#endif
    return std::make_unique<CodeRegion>(ptr, size);
  }

  void make_executable() {
#ifdef __linux__
    mprotect(base_, size_, PROT_READ | PROT_EXEC);
#elif defined(__APPLE__)
    pthread_jit_write_protect_np(true);  // Toggle to execute mode
#endif
  }
};

Apple Silicon requires special handling: the MAP_JIT flag and pthread_jit_write_protect_np() API for toggling between write and execute modes.
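Putting the pieces together, a rough sketch of the allocate-write-protect-execute flow might look like this; the base() accessor, the JitFunction signature, and the exact sequencing are assumptions for illustration:

#include <cstdint>
#include <cstring>
#if defined(__APPLE__)
#include <pthread.h>
#endif

// Hypothetical glue around the CodeRegion class above.
using JitFunction = void (*)(VMContext*);

JitFunction install_code(CodeRegion& region, const uint8_t* code, size_t size) {
#if defined(__APPLE__)
  pthread_jit_write_protect_np(0);         // allow writes to MAP_JIT memory on this thread
#endif
  std::memcpy(region.base(), code, size);  // copy the patched stencil bytes
  region.make_executable();                // W^X: flip the region to read+execute
  return reinterpret_cast<JitFunction>(region.base());
}

On ARM64 the instruction cache must also be invalidated for the freshly written range (for example via sys_icache_invalidate on macOS) before the returned function pointer is called.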

The code cache uses LRU eviction to bound memory usage (default: 64MB), and bump allocation with periodic compaction to manage fragmentation.
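A bump allocator over such a region is only a few lines; the sketch below is simplified (alignment fixed at 16 bytes, with LRU eviction and compaction omitted):

#include <cstddef>
#include <cstdint>

// Simplified bump allocator over a single code region. A real code cache
// also tracks allocations for LRU eviction and periodically compacts.
class BumpAllocator {
 public:
  BumpAllocator(uint8_t* base, size_t capacity)
      : base_(base), capacity_(capacity), used_(0) {}

  // Returns nullptr when the region is full, signalling that eviction
  // (or a new region) is needed.
  uint8_t* allocate(size_t size) {
    size_t aligned = (used_ + 15) & ~size_t{15};  // 16-byte alignment
    if (aligned + size > capacity_) return nullptr;
    used_ = aligned + size;
    return base_ + aligned;
  }

  void reset() { used_ = 0; }  // used after compaction

 private:
  uint8_t* base_;
  size_t capacity_;
  size_t used_;
};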

Optimization Passes in Tier 2

The optimizing JIT goes beyond simple stencil concatenation, building an IR and applying classical compiler optimizations.

Constant Propagation replaces operations on known constants:

Before:
    LOAD_CONST R1, 10
    LOAD_CONST R2, 20
    ADD_I64 R3, R1, R2

After:
    LOAD_CONST R3, 30

Common Subexpression Elimination reuses computed values:

Before:
    MUL_I64 R1, R0, R0
    MUL_I64 R2, R0, R0
    ADD_I64 R3, R1, R2

After:
    MUL_I64 R1, R0, R0
    ; R2 eliminated
    ADD_I64 R3, R1, R1

Loop-Invariant Code Motion hoists computations out of loops:

Before:
    loop:
        LOAD R1, base_ptr
        ADD R2, R1, offset
        ; use R2
        JMP loop

After:
    LOAD R1, base_ptr
    ADD R2, R1, offset
    loop:
        ; use R2
        JMP loop

Type Specialization uses profiling data to eliminate type checks:

; Before (polymorphic - handles int, float, string)
CALL type_check, R1
CMP result, kInt64
JNE slow_path
; int64 fast path...

; After (monomorphic - observed only int64)
; type check eliminated
; directly execute int64 path

These optimizations yield an additional 2x speedup over the baseline JIT, bringing total improvement to 5-10x for compute-intensive workloads.

Performance Characteristics

Compilation Time

The compilation speed advantage is dramatic:

Approach       | Time per KB | 10KB Function
LLVM -O0       | ~50ms       | 500ms
LLVM -O2       | ~200ms      | 2000ms
Baseline JIT   | ~1ms        | 10ms
Optimized JIT  | ~10ms       | 100ms

Copy-and-Patch is 50-200x faster than LLVM, making JIT compilation practical even for queries that execute in single-digit milliseconds.

Execution Speedup by Workload

Workload Type           | Baseline JIT | Optimized JIT
Tight arithmetic loops  | 3-5x         | 5-10x
Expression evaluation   | 2-3x         | 3-4x
Aggregate operations    | 2-3x         | 3-5x
Field access heavy      | 1.5-2x       | 2-3x
External call heavy     | 1.0-1.2x     | 1.2-1.5x

Concrete Example: PL/pgSQL Loop

DO $$
DECLARE
    i INTEGER := 0;
    sum BIGINT := 0;  -- BIGINT: the running total overflows a 32-bit INTEGER
BEGIN
    WHILE i < 1000000 LOOP
        sum := sum + i;
        i := i + 1;
    END LOOP;
END $$;

Execution Mode | Cycles/Iteration | Total Time
Interpreter    | ~180             | 180ms
Baseline JIT   | ~40              | 40ms
Optimized JIT  | ~16              | 16ms

Total speedup: 11x from interpreter to optimized JIT.

When JIT Helps (and When It Doesn't)

JIT compilation shines when the workload involves high iteration counts (loops executing thousands of times amortize compilation cost), compute-intensive operations (arithmetic, comparisons, and type conversions benefit most), register-heavy code (code using many VM registers benefits from CPU register pinning), and predictable types (monomorphic operations enable aggressive type specialization).

JIT provides minimal benefit when the workload is I/O bound (disk reads or network latency dominate), external call heavy (time spent in complex C functions or UDFs), short-running (queries completing in microseconds can't amortize compilation), or involves cold code paths (functions executed once or twice).

Conclusion

Copy-and-Patch JIT compilation represents a paradigm shift in database query execution. By recognizing that most bytecode instructions map to fixed machine code patterns, we eliminate the expensive parts of traditional JIT compilation—instruction selection, register allocation, code emission—from the critical path.

The result for Cognica is compelling: compilation speeds measured in microseconds (not milliseconds), enabling JIT for the short-running queries that dominate database workloads, while still achieving 80-90% of the code quality produced by heavyweight optimizing compilers.

For SQL queries specifically, this translates to 3-5x faster expression evaluation in WHERE clauses and SELECT lists, 3-5x faster aggregation for SUM, AVG, COUNT, MIN, MAX, 2-4x faster join processing in hash and nested loop joins, 5-10x faster stored procedures and PL/pgSQL functions, and 3-4x faster window functions with their sliding window loops.

The combination of sub-millisecond compilation and significant execution speedup makes Copy-and-Patch JIT ideal for the full spectrum of database workloads—from OLTP queries completing in microseconds to OLAP queries scanning millions of rows.


References

  1. Xu, H., & Kjolstad, F. (2021). Copy-and-Patch Compilation: A fast compilation algorithm for high-level languages and bytecode. OOPSLA.
  2. Neumann, T. (2011). Efficiently Compiling Efficient Query Plans for Modern Hardware. VLDB.
  3. Kersten, T., et al. (2018). Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask. VLDB.
  4. Lattner, C., & Adve, V. (2004). LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation. CGO.
