Copy-and-Patch JIT: Achieving Native Code Performance with Microsecond Compilation

Posted on January 17, 2026

How Cognica Database Engine breaks the JIT compilation latency barrier


When you execute a SQL query, the database engine faces a fundamental tension: interpretation is slow but starts immediately, while compilation produces fast code but takes time. For queries that complete in milliseconds, spending hundreds of milliseconds on LLVM optimization is like warming up your car longer than your commute.

This post explores Copy-and-Patch JIT compilation, a technique we implemented in Cognica Database Engine that achieves 2-10x speedup over interpretation while keeping compilation time under one millisecond per kilobyte of bytecode. The approach, pioneered by Haoran Xu and Fredrik Kjolstad at Stanford, fundamentally reimagines what JIT compilation can be.

The JIT Compilation Dilemma

Traditional JIT compilers like those based on LLVM perform sophisticated optimizations: instruction selection, register allocation, dead code elimination, loop unrolling, and more. These transformations produce excellent code quality—often within a few percent of ahead-of-time compiled C++—but they come at a cost.

Consider the compilation time spectrum:

Approach         | Compilation Time | Code Quality    | Sweet Spot
Interpretation   | 0                | Baseline        | All queries
Copy-and-Patch   | ~1ms/KB          | 80-90% optimal  | Hot loops
Method JIT       | ~10ms/KB         | 90-95% optimal  | Warm methods
LLVM JIT         | ~100ms+/KB       | Optimal         | Long-running

For a database query that scans a table and computes an aggregate in 5 milliseconds, spending 200 milliseconds on LLVM compilation is absurd. The query would need to run 40 times just to break even on compilation overhead. Yet without JIT compilation, compute-intensive operations like expression evaluation and aggregation become bottlenecks.

Copy-and-Patch breaks this tradeoff by achieving compilation speeds measured in microseconds while delivering code quality within 10-20% of heavily optimized compilers.

The Core Insight: Most Instructions Are Templates

The key observation behind Copy-and-Patch is deceptively simple: most bytecode instructions map to a small, predictable set of machine code patterns.

Consider an integer addition instruction in Cognica's virtual machine:

ADD_I64 R3, R1, R2 ; R3 = R1 + R2

The corresponding x86-64 machine code always follows this pattern:

mov rax, [rdi + R1_OFFSET]   ; Load R1 from VMContext
add rax, [rdi + R2_OFFSET]   ; Add R2
mov [rdi + R3_OFFSET], rax   ; Store to R3

The only values that change between different instances of this instruction are the register offsets. The opcode selection, addressing mode, and instruction encoding are always identical.

Traditional JIT compilers rediscover this pattern every time they compile an ADD instruction. They run instruction selection algorithms, consider register allocation constraints, and emit the same bytes they emitted last time. Copy-and-Patch eliminates this redundancy by pre-compiling these patterns into stencils—machine code templates with designated patch points.

Stencils: Pre-Compiled Code Templates

A stencil is simply pre-compiled machine code with placeholder values that get filled in at JIT time:

struct Stencil {
  const uint8_t* code;              // Pre-compiled machine code bytes
  size_t code_size;                 // Size in bytes
  std::vector<PatchSite> patches;   // Locations requiring runtime patching
  size_t alignment;                 // Required alignment (8 or 16 bytes)
  const char* name;                 // Debug name, e.g., "add_i64"
  Opcode opcode;                    // Corresponding VM opcode
};

Each patch site describes exactly what needs to be modified:

struct PatchSite {
  uint32_t offset;        // Byte offset within the stencil
  PatchType type;         // What kind of value to patch in
  uint8_t size;           // Patch size: 1, 2, 4, or 8 bytes
  int8_t operand_index;   // Which instruction operand (-1 for special)
};

The patch types cover all the runtime values a stencil might need:

enum class PatchType : uint8_t {
  kRegisterOffset,    // Offset into VMContext::registers_
  kImmediate32,       // 32-bit immediate constant
  kImmediate64,       // 64-bit immediate constant
  kRelativeJump,      // PC-relative jump offset
  kAbsoluteAddress,   // Absolute address for calls
  kConstantPoolPtr,   // Pointer into constant pool
  kCallbackPtr,       // External function pointer
};
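As a concrete illustration, here is what the ADD_I64 stencil from earlier might look like written out by hand. The byte encoding is genuine x86-64 (REX.W-prefixed mov/add with a 32-bit displacement off rdi), but the operand ordering, the Opcode enumerator, and the literal layout are assumptions for this example rather than Cognica's actual definitions:

// Hypothetical, hand-written version of the ADD_I64 stencil.
// Each instruction ends with a 4-byte displacement (initially zero) that is
// patched with the appropriate register offset at JIT time.
const uint8_t kAddI64Code[] = {
    0x48, 0x8B, 0x87, 0x00, 0x00, 0x00, 0x00,  // mov rax, [rdi + R1_OFFSET]
    0x48, 0x03, 0x87, 0x00, 0x00, 0x00, 0x00,  // add rax, [rdi + R2_OFFSET]
    0x48, 0x89, 0x87, 0x00, 0x00, 0x00, 0x00,  // mov [rdi + R3_OFFSET], rax
};

const Stencil kAddI64Stencil = {
    .code = kAddI64Code,
    .code_size = sizeof(kAddI64Code),
    // Assumed operand order for "ADD_I64 R3, R1, R2": operand 0 is the
    // destination (R3), operands 1 and 2 are the sources (R1, R2).
    .patches = {{.offset = 3,  .type = PatchType::kRegisterOffset, .size = 4, .operand_index = 1},
                {.offset = 10, .type = PatchType::kRegisterOffset, .size = 4, .operand_index = 2},
                {.offset = 17, .type = PatchType::kRegisterOffset, .size = 4, .operand_index = 0}},
    .alignment = 16,
    .name = "add_i64",
    .opcode = Opcode::ADD_I64,  // assumed enumerator name
};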

Cognica maintains approximately 110 stencils per target architecture (x86-64 and ARM64), covering arithmetic operations, comparisons, control flow, memory access, type operations, and special cases like function prologues and epilogues.

The Compilation Algorithm

With stencils prepared, JIT compilation becomes remarkably simple:

function compile(bytecode_module):
    output = new CodeBuffer()
    emit_prologue(output)

    for each instruction in bytecode_module:
        stencil = lookup_stencil(instruction.opcode)
        current_offset = output.size()

        // Step 1: Copy the stencil's machine code
        output.append(stencil.code, stencil.code_size)

        // Step 2: Record patches for later resolution
        for each patch in stencil.patches:
            pending_patches.add(current_offset + patch.offset, patch, instruction)

    emit_epilogue(output)

    // Step 3: Apply all patches
    for each (offset, patch, instruction) in pending_patches:
        value = compute_patch_value(patch, instruction)
        output.patch(offset, value, patch.size)

    return finalize(output)

The critical insight is what's not happening here. There's no instruction selection—the stencil already contains the right instructions. There's no register allocation—the stencil already encodes register usage. There's no code emission—we're just copying bytes. The entire compilation reduces to memory copies and simple arithmetic to compute patch values.

Patch Value Computation

Computing patch values is straightforward arithmetic. For register offsets:

auto compute_register_offset(uint8_t reg) -> uint32_t {
  // VMContext stores registers at a fixed offset
  // Each VMValue is 24 bytes (value + type tag + flags)
  constexpr uint32_t kRegistersOffset = offsetof(VMContext, registers_);
  constexpr uint32_t kVMValueSize = 24;
  return kRegistersOffset + reg * kVMValueSize;
}

For relative jumps on x86-64:

auto compute_relative_jump(uint32_t from_offset, uint32_t to_offset) -> int32_t {
  // x86-64 relative jumps are computed from the end of the instruction
  // JMP rel32 is 5 bytes long
  return static_cast<int32_t>(to_offset - (from_offset + 5));
}

These calculations execute in nanoseconds, contributing negligibly to total compilation time.
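The final `output.patch(offset, value, patch.size)` step is equally cheap. A minimal sketch of how it could be implemented is shown below; the function name and buffer type are illustrative, and it assumes both the build host and the target are little-endian (true for x86-64 and ARM64):

#include <cstdint>
#include <cstring>
#include <vector>

// Write `size` bytes of `value` at `offset` within the code buffer.
// Copying the low-order bytes of a 64-bit value handles 1/2/4/8-byte
// patches uniformly on a little-endian host targeting a little-endian ISA.
void apply_patch(std::vector<uint8_t>& code, uint32_t offset,
                 uint64_t value, uint8_t size) {
  std::memcpy(code.data() + offset, &value, size);
}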

Tiered Compilation: Balancing Warmup and Peak Performance

Not all code deserves JIT compilation. A function called once should run in the interpreter; a tight loop executing millions of times deserves aggressive optimization. Cognica implements a three-tier system where code progressively advances through compilation stages based on observed execution frequency.

Tier 0 — Interpreter. All code starts here. Collects execution profiles and uses computed-goto dispatch for baseline performance. Compilation time: none. Speedup: baseline.

Tier 1 — Baseline JIT (Copy-and-Patch). Direct bytecode-to-stencil translation without optimization passes. Compilation time: ~1ms/KB. Speedup: 2-5x over the interpreter.

Tier 2 — Optimizing JIT. Builds IR and applies optimization passes including constant propagation, CSE, LICM, and type specialization. Compilation time: ~10ms/KB. Speedup: an additional ~2x over the baseline JIT.

Code transitions from Tier 0 to Tier 1 after reaching 100 invocations or 1000 loop iterations. The transition from Tier 1 to Tier 2 occurs after 1000 invocations or 10000 loop iterations. These thresholds ensure that compilation effort is invested only in code that will execute frequently enough to recoup the compilation cost.

The tier transition thresholds are derived from a cost-benefit analysis. Let $C_{\text{compile}}$ be the one-time compilation cost, $T_{\text{interp}}$ the interpretation time per invocation, $T_{\text{jit}}$ the JIT execution time per invocation, and $N$ the remaining expected invocations. Compilation is beneficial when:

$$(T_{\text{interp}} - T_{\text{jit}}) \cdot N > C_{\text{compile}}$$

Solving for $N$:

$$N > \frac{C_{\text{compile}}}{T_{\text{interp}} - T_{\text{jit}}}$$

With typical values ($C_{\text{compile}} = 1\,\text{ms}$, $T_{\text{interp}} = 12\,\mu\text{s}$, $T_{\text{jit}} = 4\,\mu\text{s}$), the break-even point is approximately 125 invocations. The default threshold of 100 is slightly aggressive, based on the observation that functions that have already been called 100 times tend to be called many more times.
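A sketch of how the tier-up decision might look in code, using the thresholds above (the counter fields and function name are illustrative, not Cognica's actual API):

#include <cstdint>

// Hypothetical profiling counters attached to each function.
struct FunctionProfile {
  uint64_t invocations = 0;
  uint64_t loop_iterations = 0;
  int tier = 0;  // 0 = interpreter, 1 = baseline JIT, 2 = optimizing JIT
};

// Thresholds from the post: 100 invocations or 1,000 loop iterations to
// reach Tier 1; 1,000 invocations or 10,000 loop iterations to reach Tier 2.
int next_tier(const FunctionProfile& p) {
  if (p.tier == 0 && (p.invocations >= 100 || p.loop_iterations >= 1'000)) {
    return 1;
  }
  if (p.tier == 1 && (p.invocations >= 1'000 || p.loop_iterations >= 10'000)) {
    return 2;
  }
  return p.tier;  // stay at the current tier
}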

How SQL Queries Get Accelerated

Now let's examine how Copy-and-Patch JIT specifically accelerates SQL query execution. The performance gains come from several complementary mechanisms.

Expression Evaluation

Consider a query with a complex WHERE clause:

SELECT *
FROM orders
WHERE total_amount > 1000
  AND EXTRACT(YEAR FROM order_date) = 2026
  AND customer_id IN (SELECT id FROM premium_customers);

The expression total_amount > 1000 AND EXTRACT(YEAR FROM order_date) = 2026 gets compiled to bytecode like:

    LOAD_FIELD R1, row, "total_amount"
    LOAD_CONST R2, 1000
    CMP_GT_I64 R3, R1, R2
    JZ skip_second_condition, R3
    LOAD_FIELD R4, row, "order_date"
    CALL R5, extract_year, R4
    LOAD_CONST R6, 2026
    CMP_EQ_I64 R7, R5, R6
    AND R3, R3, R7
skip_second_condition:
    ; R3 contains the final boolean result

In the interpreter, each instruction requires fetching the opcode from memory, dispatching to the handler (indirect jump or switch), decoding operands, executing the operation, and advancing the instruction pointer. This overhead dominates for simple operations. A comparison that executes in 1-2 CPU cycles gets wrapped in 15-20 cycles of dispatch overhead.
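To see where those extra cycles go, here is a minimal, self-contained sketch of a switch-based interpreter loop (the opcodes and operand encoding are simplified; Cognica's Tier 0 uses computed-goto dispatch, but the per-instruction bookkeeping is the same in spirit):

#include <cstdint>
#include <vector>

// Every instruction pays for an opcode fetch, a dispatch branch, operand
// decoding, and instruction-pointer bookkeeping before any useful work.
enum class Op : uint8_t { kLoadConst, kAddI64, kCmpGtI64, kHalt };

struct Insn {
  Op op;
  uint8_t dst, src1, src2;
  int64_t imm;
};

int64_t interpret(const std::vector<Insn>& code, int64_t regs[16]) {
  size_t ip = 0;
  for (;;) {
    const Insn& insn = code[ip];   // fetch
    switch (insn.op) {             // dispatch (jump table / indirect branch)
      case Op::kLoadConst: regs[insn.dst] = insn.imm; break;
      case Op::kAddI64:    regs[insn.dst] = regs[insn.src1] + regs[insn.src2]; break;
      case Op::kCmpGtI64:  regs[insn.dst] = regs[insn.src1] > regs[insn.src2]; break;
      case Op::kHalt:      return regs[0];
    }
    ++ip;                          // advance the instruction pointer
  }
}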

The JIT-compiled version eliminates all dispatch overhead. The stencils chain together into a continuous sequence of machine instructions:

; LOAD_FIELD R1, row, "total_amount"
mov rax, [rsi + 0x40]              ; Load field at offset 0x40
mov [rdi + 0x20], rax              ; Store to R1

; LOAD_CONST R2, 1000
mov QWORD PTR [rdi + 0x38], 1000   ; Store constant to R2

; CMP_GT_I64 R3, R1, R2
mov rax, [rdi + 0x20]              ; Load R1
cmp rax, [rdi + 0x38]              ; Compare with R2
setg al                            ; Set AL if greater
movzx eax, al
mov [rdi + 0x50], rax              ; Store to R3

; JZ skip_second_condition, R3
mov rax, [rdi + 0x50]
test rax, rax
jz skip_second_condition

; ... remaining instructions ...

For a table scan processing millions of rows, this 3-5x reduction in per-row overhead translates directly to wall-clock time savings.

Aggregate Functions

Aggregates like SUM, AVG, COUNT, MIN, and MAX execute tight loops that benefit enormously from JIT compilation:

SELECT region,
       SUM(amount) AS total,
       AVG(amount) AS average,
       COUNT(*) AS count
FROM sales
GROUP BY region;

The aggregation loop compiles to bytecode that accumulates values:

aggregate_loop:
    LOAD_FIELD R1, row, "amount"
    ADD_F64 R_sum, R_sum, R1        ; Accumulate sum
    ADD_I64 R_count, R_count, 1     ; Increment count
    NEXT_ROW row
    JNZ aggregate_loop, row

    DIV_F64 R_avg, R_sum, R_count   ; Compute average

The interpreter executes this loop with ~180 cycles per iteration. The baseline JIT reduces this to ~40 cycles. The optimizing JIT, with loop-invariant code motion and register pinning, achieves ~16 cycles per iteration—an 11x improvement.

Join Processing

Hash joins and nested loop joins contain performance-critical inner loops:

SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA';

The hash probe loop in a hash join:

probe_loop:
    LOAD_FIELD R1, probe_row, "customer_id"
    HASH R2, R1
    AND R2, R2, hash_mask          ; R2 = bucket index
    LOAD R3, hash_table, R2        ; R3 = bucket head
bucket_scan:
    JZ no_match, R3
    LOAD_FIELD R4, R3, "id"
    CMP_EQ R5, R1, R4
    JZ next_entry, R5
    ; Match found - emit joined row
    CALL emit_joined_row, probe_row, R3
next_entry:
    LOAD_FIELD R3, R3, "next"      ; Follow chain
    JMP bucket_scan
no_match:
    NEXT_ROW probe_row
    JNZ probe_loop, probe_row

JIT compilation accelerates both the hashing and the bucket chain traversal. For hash joins with good hash distribution (short chains), the speedup is 2-3x. For joins with hash collisions requiring longer chain traversals, the speedup can reach 4-5x due to better branch prediction in compiled code.

Stored Procedures and PL/pgSQL

The most dramatic improvements appear in procedural code. Consider a PL/pgSQL function that processes data:

CREATE FUNCTION calculate_bonuses() RETURNS void AS $$
DECLARE
    emp RECORD;
    bonus NUMERIC;
BEGIN
    FOR emp IN SELECT * FROM employees WHERE department = 'Sales' LOOP
        IF emp.sales_total > 100000 THEN
            bonus := emp.sales_total * 0.05;
        ELSIF emp.sales_total > 50000 THEN
            bonus := emp.sales_total * 0.03;
        ELSE
            bonus := emp.sales_total * 0.01;
        END IF;

        UPDATE employees SET bonus_amount = bonus WHERE id = emp.id;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

This function contains a cursor loop (many iterations), conditional branches (amenable to branch prediction), arithmetic operations (pure computation), and nested SQL statements (separate execution). The loop body—everything except the UPDATE statement—benefits significantly from JIT compilation. With 10,000 employees in Sales, the interpreted version might take 1.8 seconds while the JIT-compiled version completes in 160 milliseconds.

Window Functions

Window functions process rows while maintaining state across the window frame:

SELECT date,
       amount,
       SUM(amount) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS week_total,
       AVG(amount) OVER (ORDER BY date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS month_avg
FROM daily_sales;

The sliding window aggregation:

window_loop:
    ; Remove old values from window
    LOAD R1, window_start
    CMP_LT R2, R1, current_row - window_size
    JZ skip_remove, R2
    SUB_F64 R_sum, R_sum, [R1 + amount_offset]
    SUB_I64 R_count, R_count, 1
    ADD R1, R1, row_size
    STORE window_start, R1
skip_remove:
    ; Add new value to window
    LOAD_FIELD R3, current_row, "amount"
    ADD_F64 R_sum, R_sum, R3
    ADD_I64 R_count, R_count, 1

    ; Compute window result
    DIV_F64 R_avg, R_sum, R_count
    EMIT current_row, R_avg

    NEXT_ROW current_row
    JNZ window_loop, current_row

The tight loop with pointer arithmetic and accumulator updates is exactly the pattern where Copy-and-Patch shines. Window function queries commonly see 3-4x speedup.

Register Mapping: Bridging VM and Hardware

One source of overhead in bytecode execution is the impedance mismatch between VM registers (in memory) and CPU registers (in silicon). Cognica's JIT bridges this gap through register pinning.

On x86-64:

struct X64RegisterMap {
  // Reserved (cannot hold values)
  static constexpr auto kContext = RDI;    // VMContext* always here
  static constexpr auto kStackPtr = RSP;   // Hardware stack
  static constexpr auto kFramePtr = RBP;   // Frame pointer

  // Pinned VM registers (callee-saved, persist across calls)
  static constexpr auto kR0 = RBX;         // VM R0 lives in RBX
  static constexpr auto kR1 = R12;         // VM R1 lives in R12
  static constexpr auto kR2 = R13;         // VM R2 lives in R13
  static constexpr auto kR3 = R14;         // VM R3 lives in R14

  // Scratch registers (caller-saved, temporaries)
  static constexpr auto kScratch[] = {RAX, RCX, RDX, RSI, R8, R9, R10, R11};
};

VM registers R0-R3 are permanently mapped to callee-saved CPU registers. When the JIT compiles code that uses these registers, it generates instructions that operate directly on RBX, R12, R13, R14—no memory loads or stores required.

ARM64's more generous callee-saved register set allows pinning 8 VM registers instead of 4, further reducing memory traffic on Apple Silicon and other ARM64 platforms.
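A hypothetical ARM64 counterpart to the x86-64 map might look like the following; the register choices illustrate the callee-saved x19-x28 range and the AAPCS64 argument convention (first argument in x0), but are not necessarily Cognica's exact assignment:

#include <cstdint>

// Illustrative ARM64 register identifiers (values follow the hardware
// register numbering; SP is encoded as register 31).
enum Arm64Reg : uint8_t {
  X0 = 0, X9 = 9, X10, X11, X12, X13, X14, X15,
  X19 = 19, X20, X21, X22, X23, X24, X25, X26,
  X29 = 29, SP = 31,
};

struct Arm64RegisterMap {
  // Reserved (cannot hold VM values)
  static constexpr Arm64Reg kContext = X0;    // VMContext* arrives in the first argument register
  static constexpr Arm64Reg kFramePtr = X29;  // Frame pointer
  static constexpr Arm64Reg kStackPtr = SP;   // Hardware stack

  // Pinned VM registers (callee-saved x19-x26 persist across calls)
  static constexpr Arm64Reg kPinned[] = {X19, X20, X21, X22, X23, X24, X25, X26};

  // Scratch registers (caller-saved temporaries)
  static constexpr Arm64Reg kScratch[] = {X9, X10, X11, X12, X13, X14, X15};
};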

Memory Management: Executable Code in a Secure World

Modern operating systems enforce W^X (Write XOR Execute) policies: memory can be writable or executable, but not both simultaneously. The JIT must navigate this constraint:

class CodeRegion {
 public:
  static auto allocate(size_t size) -> std::unique_ptr<CodeRegion> {
#ifdef __linux__
    void* ptr = mmap(nullptr, size,
                     PROT_READ | PROT_WRITE,  // Start writable
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
#elif defined(__APPLE__)
    void* ptr = mmap(nullptr, size,
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_JIT, -1, 0);
#endif
    return std::make_unique<CodeRegion>(ptr, size);
  }

  void make_executable() {
#ifdef __linux__
    mprotect(base_, size_, PROT_READ | PROT_EXEC);
#elif defined(__APPLE__)
    pthread_jit_write_protect_np(true);  // Toggle to execute mode
#endif
  }
};

Apple Silicon requires special handling: the MAP_JIT flag and pthread_jit_write_protect_np() API for toggling between write and execute modes.
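Putting the pieces together, a rough sketch of the allocate-write-protect-execute flow might look like this; the base() accessor, the JitFunction signature, and the exact sequencing are assumptions for illustration:

#include <cstdint>
#include <cstring>
#if defined(__APPLE__)
#include <pthread.h>
#endif

// Hypothetical glue around the CodeRegion class above.
using JitFunction = void (*)(VMContext*);

JitFunction install_code(CodeRegion& region, const uint8_t* code, size_t size) {
#if defined(__APPLE__)
  pthread_jit_write_protect_np(0);         // allow writes to MAP_JIT memory on this thread
#endif
  std::memcpy(region.base(), code, size);  // copy the patched stencil bytes
  region.make_executable();                // W^X: flip the region to read+execute
  return reinterpret_cast<JitFunction>(region.base());
}

On ARM64 the instruction cache must also be invalidated for the freshly written range (for example via sys_icache_invalidate on macOS) before the returned function pointer is called.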

The code cache uses LRU eviction to bound memory usage (default: 64MB), and bump allocation with periodic compaction to manage fragmentation.
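A bump allocator over such a region is only a few lines; the sketch below is simplified (alignment fixed at 16 bytes, with LRU eviction and compaction omitted):

#include <cstddef>
#include <cstdint>

// Simplified bump allocator over a single code region. A real code cache
// also tracks allocations for LRU eviction and periodically compacts.
class BumpAllocator {
 public:
  BumpAllocator(uint8_t* base, size_t capacity)
      : base_(base), capacity_(capacity), used_(0) {}

  // Returns nullptr when the region is full, signalling that eviction
  // (or a new region) is needed.
  uint8_t* allocate(size_t size) {
    size_t aligned = (used_ + 15) & ~size_t{15};  // 16-byte alignment
    if (aligned + size > capacity_) return nullptr;
    used_ = aligned + size;
    return base_ + aligned;
  }

  void reset() { used_ = 0; }  // used after compaction

 private:
  uint8_t* base_;
  size_t capacity_;
  size_t used_;
};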

Optimization Passes in Tier 2

The optimizing JIT goes beyond simple stencil concatenation, building an IR and applying classical compiler optimizations.

Constant Propagation replaces operations on known constants:

Before:
    LOAD_CONST R1, 10
    LOAD_CONST R2, 20
    ADD_I64 R3, R1, R2

After:
    LOAD_CONST R3, 30

Common Subexpression Elimination reuses computed values:

Before:
    MUL_I64 R1, R0, R0
    MUL_I64 R2, R0, R0
    ADD_I64 R3, R1, R2

After:
    MUL_I64 R1, R0, R0
    ; R2 eliminated
    ADD_I64 R3, R1, R1

Loop-Invariant Code Motion hoists computations out of loops:

Before:
    loop:
        LOAD R1, base_ptr
        ADD R2, R1, offset
        ; use R2
        JMP loop

After:
    LOAD R1, base_ptr
    ADD R2, R1, offset
    loop:
        ; use R2
        JMP loop

Type Specialization uses profiling data to eliminate type checks:

; Before (polymorphic - handles int, float, string)
CALL type_check, R1
CMP result, kInt64
JNE slow_path
; int64 fast path...

; After (monomorphic - observed only int64)
; type check eliminated
; directly execute int64 path

These optimizations yield an additional 2x speedup over the baseline JIT, bringing total improvement to 5-10x for compute-intensive workloads.

Performance Characteristics

Compilation Time

The compilation speed advantage is dramatic:

Approach       | Time per KB | 10KB Function
LLVM -O0       | ~50ms       | 500ms
LLVM -O2       | ~200ms      | 2000ms
Baseline JIT   | ~1ms        | 10ms
Optimized JIT  | ~10ms       | 100ms

Copy-and-Patch is 50-200x faster than LLVM, making JIT compilation practical even for queries that execute in single-digit milliseconds.

Execution Speedup by Workload

Workload Type           | Baseline JIT | Optimized JIT
Tight arithmetic loops  | 3-5x         | 5-10x
Expression evaluation   | 2-3x         | 3-4x
Aggregate operations    | 2-3x         | 3-5x
Field access heavy      | 1.5-2x       | 2-3x
External call heavy     | 1.0-1.2x     | 1.2-1.5x

Concrete Example: PL/pgSQL Loop

DO $$
DECLARE
    i INTEGER := 0;
    sum BIGINT := 0;  -- BIGINT: the running total overflows a 32-bit INTEGER
BEGIN
    WHILE i < 1000000 LOOP
        sum := sum + i;
        i := i + 1;
    END LOOP;
END $$;

Execution Mode | Cycles/Iteration | Total Time
Interpreter    | ~180             | 180ms
Baseline JIT   | ~40              | 40ms
Optimized JIT  | ~16              | 16ms

Total speedup: 11x from interpreter to optimized JIT.

When JIT Helps (and When It Doesn't)

JIT compilation shines when the workload involves high iteration counts (loops executing thousands of times amortize compilation cost), compute-intensive operations (arithmetic, comparisons, and type conversions benefit most), register-heavy code (code using many VM registers benefits from CPU register pinning), and predictable types (monomorphic operations enable aggressive type specialization).

JIT provides minimal benefit when the workload is I/O bound (disk reads or network latency dominate), external call heavy (time spent in complex C functions or UDFs), short-running (queries completing in microseconds can't amortize compilation), or involves cold code paths (functions executed once or twice).

Conclusion

Copy-and-Patch JIT compilation represents a paradigm shift in database query execution. By recognizing that most bytecode instructions map to fixed machine code patterns, we eliminate the expensive parts of traditional JIT compilation—instruction selection, register allocation, code emission—from the critical path.

The result for Cognica is compelling: compilation speeds measured in microseconds (not milliseconds), enabling JIT for the short-running queries that dominate database workloads, while still achieving 80-90% of the code quality produced by heavyweight optimizing compilers.

For SQL queries specifically, this translates to 3-5x faster expression evaluation in WHERE clauses and SELECT lists, 3-5x faster aggregation for SUM, AVG, COUNT, MIN, MAX, 2-4x faster join processing in hash and nested loop joins, 5-10x faster stored procedures and PL/pgSQL functions, and 3-4x faster window functions with their sliding window loops.

The combination of sub-millisecond compilation and significant execution speedup makes Copy-and-Patch JIT ideal for the full spectrum of database workloads—from OLTP queries completing in microseconds to OLAP queries scanning millions of rows.


References

  1. Xu, H., & Kjolstad, F. (2021). Copy-and-Patch Compilation: A fast compilation algorithm for high-level languages and bytecode. OOPSLA.
  2. Neumann, T. (2011). Efficiently Compiling Efficient Query Plans for Modern Hardware. VLDB.
  3. Kersten, T., et al. (2018). Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask. VLDB.
  4. Lattner, C., & Adve, V. (2004). LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation. CGO.
