JIT Toolchain: Building a Disassembler and CPU Emulator for Database Development
The essential infrastructure that makes Copy-and-Patch JIT development and debugging practical
In our previous post, we explored how Copy-and-Patch JIT compilation achieves native code performance with microsecond compilation times. But generating machine code is only half the battle. How do you debug a stencil that crashes? How do you verify that patched offsets land at the right instruction boundaries? How do you test JIT code on a development machine running a different CPU architecture?
This post dives into the JIT toolchain we built for Cognica Database Engine: a multi-architecture disassembler for validation and a software CPU emulator for cross-platform testing and debugging.
The Problem: JIT Development is Hard
JIT compilation introduces debugging challenges that traditional ahead-of-time compilation avoids:
- Invisible Code: JIT-compiled code doesn't exist until runtime. You can't run it through a debugger before execution.
- Patch Point Validation: Copy-and-Patch JIT relies on patching specific byte offsets. A patch that lands in the middle of an instruction causes crashes or silent corruption.
- Cross-Platform Development: Developers on Apple Silicon need to test x86-64 stencils. Developers on x86-64 need to verify ARM64 code.
- Performance Isolation: When a query runs slowly, is it the JIT code, the interpreter, or the query plan? Isolating JIT behavior requires controlled execution.
These challenges demand specialized tooling: a disassembler that understands our stencil format and an emulator that can execute stencils in isolation.
Architecture Overview
Our JIT toolchain consists of three main components that work together:
The disassemblers decode native machine code for validation and debugging output. The translators convert native code to an architecture-neutral intermediate representation (IR). The execution engine interprets the IR, enabling cross-platform execution and fine-grained debugging.
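To make that flow concrete, here is a minimal sketch of how the three components chain together, using the class names introduced later in this post. The load_stencil and report helpers are hypothetical names invented for this example, and the exact wiring inside the engine differs:

// Sketch only: illustrative wiring of disassembler -> translator -> engine.
auto stencil = load_stencil("add_int64");   // hypothetical helper

X86_64Disassembler disasm;
auto errors = disasm.validate_patches(stencil);  // 1. validate patch points
if (!errors.empty()) {
  report(errors);                                // hypothetical error reporting
}

X86_64Translator translator;
auto program = translator.translate(stencil);    // 2. lower to neutral IR

ExecutionEngine engine;
EmulatorState state;
engine.execute(program, state);                  // 3. interpret the IR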
The Disassemblers: Understanding What We Generated
Why Not Use Existing Tools?
Tools like objdump, llvm-objdump, and Capstone are excellent for general-purpose disassembly. But our stencils have specific requirements:
- Patch Validation: We need to verify that patch offsets align with instruction boundaries and target the correct immediate fields.
- Minimal Footprint: Adding a 50MB LLVM dependency for disassembly is excessive when we only use ~110 instruction patterns.
- Integration: We want disassembly as a first-class debugging feature, not an external tool invocation.
Our disassemblers support exactly the instruction subset used in stencils---nothing more, nothing less.
x86-64 Disassembler
x86-64's variable-length encoding makes disassembly challenging. An instruction can be 1-15 bytes, with complex prefix combinations:
// x86-64 disassembler structure
class X86_64Disassembler {
 public:
  auto disassemble(const Stencil& s) const -> DisassemblyResult;
  auto validate_patches(const Stencil& s) const -> std::vector<std::string>;
  static auto format(const DisassembledInst& inst) -> std::string;

 private:
  // REX prefix structure
  struct Rex {
    bool present;
    bool w;  // 64-bit operand size
    bool r;  // ModRM.reg extension
    bool x;  // SIB.index extension
    bool b;  // ModRM.rm extension
  };

  auto decode_one_(const uint8_t* code, size_t len, uint32_t offset) const
      -> DisassembledInst;
  static auto parse_rex_(uint8_t byte) -> Rex;
  auto decode_modrm_mem_(const uint8_t* code, size_t len, const Rex& rex,
                         bool is_64bit) const -> std::pair<std::string, size_t>;
};
The key complexity lies in the REX prefix and ModR/M byte parsing. A REX prefix (0x40-0x4F) extends register addressing to access r8-r15. The ModR/M byte encodes both the addressing mode and register operands:
auto X86_64Disassembler::parse_rex_(uint8_t byte) -> Rex {
  Rex rex;
  rex.present = (byte >= 0x40 && byte <= 0x4f);
  if (rex.present) {
    rex.w = (byte & 0x08) != 0;  // 64-bit operand
    rex.r = (byte & 0x04) != 0;  // Extends ModRM.reg
    rex.x = (byte & 0x02) != 0;  // Extends SIB.index
    rex.b = (byte & 0x01) != 0;  // Extends ModRM.rm
  }
  return rex;
}
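The ModR/M byte that follows the opcode splits into three fixed fields. A minimal illustration of that split is below; split_modrm is a name invented for this example, and the engine's decode_modrm_mem_ additionally handles SIB bytes and displacements:

// Illustrative only: splitting a ModR/M byte into its three fields.
struct ModRMFields { uint8_t mod, reg, rm; };

auto split_modrm(uint8_t byte) -> ModRMFields {
  return {
      static_cast<uint8_t>((byte >> 6) & 0x3),  // mod: addressing mode (0-3)
      static_cast<uint8_t>((byte >> 3) & 0x7),  // reg: +8 when REX.R is set
      static_cast<uint8_t>(byte & 0x7),         // rm:  +8 when REX.B is set
  };
}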
ARM64 Disassembler
ARM64's fixed 32-bit instruction encoding is simpler to decode but has its own subtleties. All instructions are 4 bytes, and the instruction category is determined by fixed bit positions:
class AArch64Disassembler {
 public:
  auto disassemble(const Stencil& s) const -> DisassemblyResult;
  auto validate_patches(const Stencil& s) const -> std::vector<std::string>;

 private:
  auto decode_one_(uint32_t instr, uint32_t offset) const -> DisassembledInst;

  // Bit field extractors
  static auto rd(uint32_t i) -> uint8_t { return i & 0x1f; }
  static auto rn(uint32_t i) -> uint8_t { return (i >> 5) & 0x1f; }
  static auto rm(uint32_t i) -> uint8_t { return (i >> 16) & 0x1f; }
  static auto rt(uint32_t i) -> uint8_t { return i & 0x1f; }
  static auto rt2(uint32_t i) -> uint8_t { return (i >> 10) & 0x1f; }

  // Instruction decoders by category
  auto decode_dp_reg_(uint32_t instr) const -> DisassembledInst;
  auto decode_dp_imm_(uint32_t instr) const -> DisassembledInst;
  auto decode_fp_(uint32_t instr) const -> DisassembledInst;
  auto decode_ldst_(uint32_t instr) const -> DisassembledInst;
  auto decode_branch_(uint32_t instr) const -> DisassembledInst;
};
ARM64 encodes register operands in fixed 5-bit fields (supporting 32 registers). The rd, rn, rm extractors pull these fields from their standard positions.
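For example, the word 0x8B020020 encodes ADD X0, X1, X2; the extractors above read the operands straight out of their fixed slots (a quick illustration, not engine code):

uint32_t instr = 0x8B020020;       // ADD X0, X1, X2 (shifted register form)
auto dst = instr & 0x1f;           // rd -> 0 (X0)
auto lhs = (instr >> 5) & 0x1f;    // rn -> 1 (X1)
auto rhs = (instr >> 16) & 0x1f;   // rm -> 2 (X2)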
Patch Validation
The most critical function is validating that patch points are legal:
auto X86_64Disassembler::validate_patches(const Stencil& s) const
    -> std::vector<std::string> {
  std::vector<std::string> errors;
  auto result = disassemble(s);

  for (const auto& patch : s.patches) {
    bool found = false;
    for (const auto& inst : result.instructions) {
      if (patch.offset >= inst.offset &&
          patch.offset < inst.offset + inst.length) {
        found = true;
        // Validate patch type matches instruction type
        if (patch.type == PatchType::kImmediate64 ||
            patch.type == PatchType::kAddress64) {
          if (!inst.has_immediate) {
            errors.push_back(fmt::format(
                "Patch at offset {} is in '{}' which has no immediate",
                patch.offset, inst.mnemonic));
          }
        } else if (patch.type == PatchType::kRelativeJump) {
          if (inst.mnemonic != "jmp" && inst.mnemonic != "jnz" &&
              inst.mnemonic != "jz") {
            errors.push_back(fmt::format(
                "Patch at offset {} (type=RelativeJump) is in '{}' "
                "which is not a jump",
                patch.offset, inst.mnemonic));
          }
        }
        break;
      }
    }
    if (!found && patch.offset < s.code.size()) {
      errors.push_back(fmt::format(
          "Patch at offset {} does not match any instruction", patch.offset));
    }
  }
  return errors;
}
This validation catches subtle bugs: a patch offset that's off by one byte would land in the middle of a multi-byte immediate, corrupting the instruction.
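As a concrete illustration (the encoding below is standard x86-64; the misplaced patch is hypothetical):

// "mov rax, imm64" is REX.W (0x48) + opcode 0xB8 + an 8-byte immediate,
// so the only legal Imm64 patch offset inside this instruction is 2.
const uint8_t mov_rax_imm64[10] = {0x48, 0xB8, 0, 0, 0, 0, 0, 0, 0, 0};
// A patch recorded at offset 3 would shift every immediate byte by one and
// spill the last byte into the following instruction -- silent corruption
// that only shows up when the stencil executes.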
The CPU Emulator: Executing Without Hardware
Why Emulate?
Native execution is fast but inflexible:
- Cross-Platform Testing: Test ARM64 stencils on x86-64 development machines (or vice versa)
- Deterministic Debugging: Step through execution instruction by instruction
- Isolation: Execute stencils without affecting system state
- Instrumentation: Count instructions, trace register changes, profile hot paths
The emulator trades execution speed for observability and portability.
Architecture-Neutral IR
The heart of the emulator is an intermediate representation that abstracts away architectural differences. Both x86-64 and ARM64 translate to the same IR opcodes:
enum class EmulatorOp : uint8_t {
  // Integer Arithmetic
  kAddInt64,   // dst = src1 + src2
  kSubInt64,   // dst = src1 - src2
  kMulInt64,   // dst = src1 * src2
  kDivInt64,   // dst = src1 / src2 (signed)
  kModInt64,   // dst = src1 % src2 (signed)
  kNegInt64,   // dst = -src1
  kMsubInt64,  // dst = src2 - src1 * imm_reg (multiply-subtract)

  // Bitwise Operations
  kAndInt64,   // dst = src1 & src2
  kOrInt64,    // dst = src1 | src2
  kXorInt64,   // dst = src1 ^ src2
  kNotInt64,   // dst = ~src1
  kMovkInt64,  // dst = (dst & ~mask) | (imm << shift) - MOVK semantics

  // Floating-Point Arithmetic
  kAddDouble,  // dst = src1 + src2
  kSubDouble,  // dst = src1 - src2
  kMulDouble,  // dst = src1 * src2
  kDivDouble,  // dst = src1 / src2
  kNegDouble,  // dst = -src1
  kXorDouble,  // dst = bitwise_xor(src1, src2) - for XORPD

  // Comparison (sets condition flags)
  kCmpInt64,   // flags = compare(src1, src2)
  kCmpDouble,  // flags = compare(src1, src2)
  kTestInt64,  // flags = test(src1 & src2)

  // Conditional Set (reads flags, writes 0 or 1)
  kSetEq, kSetNe, kSetLt, kSetLe, kSetGt, kSetGe,
  kSetAbove, kSetAboveEq, kSetBelow, kSetBelowEq,
  kSetParity, kSetNoParity,

  // Data Movement
  kMovInt64, kMovDouble, kCselInt64,
  kLoadImm64, kLoadImmDouble,
  kZeroExtend8To64, kZeroExtend16To64, kZeroExtend32To64,
  kSignExtend8To64, kSignExtend16To64, kSignExtend32To64,

  // Type Conversions
  kInt64ToDouble,  // f[dst] = (double)r[src1]
  kDoubleToInt64,  // r[dst] = (int64_t)f[src1]

  // Memory Operations
  kLoadMem8, kLoadMem16, kLoadMem32, kLoadMem64, kLoadMemDouble,
  kStoreMem8, kStoreMem16, kStoreMem32, kStoreMem64, kStoreMemDouble,

  // Stack Operations
  kPush64, kPop64, kAllocStack, kDeallocStack,

  // Control Flow
  kBranch,           // pc = target (unconditional)
  kBranchIfZero,     // if (src1 == 0) pc = target
  kBranchIfNotZero,  // if (src1 != 0) pc = target
  kBranchIfEq, kBranchIfNe, kBranchIfLt,
  kBranchIfLe, kBranchIfGt, kBranchIfGe,

  // Function Calls
  kCallHelper,  // call helper function by ID
  kReturn,      // return from stencil

  kNop,
  kBreakpoint,
};
Each IR instruction is a compact struct:
struct EmulatorInst {
  EmulatorOp op;  // Operation type
  uint8_t dst;    // Destination register (0-31)
  uint8_t src1;   // First source register
  uint8_t src2;   // Second source register
  union {
    int64_t imm_i64;     // 64-bit immediate integer
    double imm_f64;      // 64-bit immediate double
    uint32_t target;     // Branch target (instruction index)
    uint32_t helper_id;  // Helper function identifier
    int32_t mem_offset;  // Memory access offset
  };
  uint32_t source_offset;  // Byte offset in original machine code
};
The Translation Pipeline
Translators convert architecture-specific machine code into the neutral IR.
The x86-64 translator handles the variable-length encoding complexity:
class X86_64Translator {
 public:
  auto translate(const Stencil& stencil) -> EmulatorProgram;
  auto translate(std::span<const uint8_t> code, JITType result_type,
                 StencilOp stencil_op) -> EmulatorProgram;

 private:
  struct Rex {
    bool present = false;
    bool w = false;  // 64-bit operand size
    bool r = false;  // ModRM.reg extension
    bool x = false;  // SIB.index extension
    bool b = false;  // ModRM.rm extension
  };

  struct ModRM {
    uint8_t mod;  // Addressing mode (0-3)
    uint8_t reg;  // Register operand / opcode extension
    uint8_t rm;   // Register/Memory operand
  };

  // Get emulator register index from x86-64 register encoding
  static auto gpr_to_emu_reg_(uint8_t reg, bool rex_ext) -> uint8_t;

  // Translate specific instruction types
  void emit_add_(const DecodedInst& inst, const uint8_t* code);
  void emit_mov_(const DecodedInst& inst, const uint8_t* code);
  void emit_cmp_(const DecodedInst& inst, const uint8_t* code);
  void emit_jcc_(uint8_t cc, const DecodedInst& inst, const uint8_t* code);

  // SSE instructions
  void emit_sse_arith_(uint8_t op, const DecodedInst& inst);
  void emit_cvtsi2sd_(const DecodedInst& inst, const uint8_t* code);

  // Resolve branch targets after first pass
  void resolve_branch_targets_();

  std::unordered_map<uint32_t, uint32_t> offset_to_index_;
  std::vector<std::pair<uint32_t, uint32_t>> pending_branches_;
};
The ARM64 translator leverages the regular instruction format:
class AArch64Translator {
 public:
  auto translate(const Stencil& stencil) -> EmulatorProgram;

 private:
  // Bit field extractors
  static auto rd_(uint32_t instr) -> uint8_t { return instr & 0x1f; }
  static auto rn_(uint32_t instr) -> uint8_t { return (instr >> 5) & 0x1f; }
  static auto rm_(uint32_t instr) -> uint8_t { return (instr >> 16) & 0x1f; }
  static auto sf_(uint32_t instr) -> bool { return (instr >> 31) & 1; }

  // Immediate extractors with sign extension
  static auto imm19_(uint32_t instr) -> int32_t {
    auto imm = static_cast<int32_t>((instr >> 5) & 0x7ffff);
    if (imm & 0x40000) {
      imm |= static_cast<int32_t>(0xfff80000);  // Sign extend
    }
    return imm * 4;  // Scale by 4 for branch offset
  }

  // Instruction category translators
  void translate_dp_reg_(uint32_t instr);  // Data processing (register)
  void translate_dp_imm_(uint32_t instr);  // Data processing (immediate)
  void translate_fp_(uint32_t instr);      // Floating-point
  void translate_ldst_(uint32_t instr);    // Load/store
  void translate_branch_(uint32_t instr);  // Branches
};
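As a rough sketch of what one of these category translators does, an ADD (shifted register) lowers to a single IR instruction. This is illustrative only: it handles just the ADD case, and it assumes an EmulatorProgram member named program_ that collects translated instructions, which isn't shown in the class above:

// Sketch, under the assumptions stated above.
void AArch64Translator::translate_dp_reg_(uint32_t instr) {
  // Example: 0x8B020020 decodes to ADD X0, X1, X2
  EmulatorInst out{};
  out.op = EmulatorOp::kAddInt64;
  out.dst = rd_(instr);   // X0
  out.src1 = rn_(instr);  // X1
  out.src2 = rm_(instr);  // X2
  program_.instructions.push_back(out);
}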
Emulator State: Registers, Flags, and Memory
The emulator maintains a complete CPU state:
// Register file holding both integer and floating-point registers
struct RegisterFile {
  std::array<int64_t, 32> r;  // Integer registers
  std::array<double, 32> f;   // Floating-point registers
  uint64_t sp;                // Stack pointer
  uint64_t pc;                // Program counter (instruction index)

  // ARM64 XZR (zero register) handling
  auto get_int(uint8_t reg, bool is_arm64 = false) const -> int64_t {
    if (is_arm64 && reg == 31) return 0;  // XZR always reads as 0
    return r[reg];
  }

  void set_int(uint8_t reg, int64_t value, bool is_arm64 = false) {
    if (is_arm64 && reg == 31) return;  // Writes to XZR are discarded
    r[reg] = value;
  }
};

// Condition flags (NZCV - compatible with both x86-64 and AArch64)
struct ConditionFlags {
  bool n;  // Negative: result is negative
  bool z;  // Zero: result is zero
  bool c;  // Carry: unsigned overflow/borrow
  bool v;  // oVerflow: signed overflow

  // Update flags after integer subtraction (CMP)
  void update_from_sub(int64_t src1, int64_t src2) {
    auto result = src1 - src2;
    n = (result < 0);
    z = (result == 0);
    // Carry for subtraction: set if no borrow (src1 >= src2 unsigned)
    c = (static_cast<uint64_t>(src1) >= static_cast<uint64_t>(src2));
    // Overflow: signs differ and result sign differs from src1
    v = (((src1 ^ src2) & (src1 ^ result)) < 0);
  }

  // IEEE 754 floating-point comparison flags
  void update_from_fcmp(double src1, double src2) {
    if (std::isnan(src1) || std::isnan(src2)) {
      n = false; z = false; c = true; v = true;  // Unordered
    } else if (src1 == src2) {
      n = false; z = true; c = true; v = false;
    } else if (src1 < src2) {
      n = true; z = false; c = false; v = false;
    } else {
      n = false; z = false; c = true; v = false;
    }
  }

  // Condition evaluation
  auto is_equal() const -> bool { return z; }
  auto is_less_than() const -> bool { return n != v; }
  auto is_greater_than() const -> bool { return !z && (n == v); }
  auto is_above() const -> bool { return c && !z; }  // Unsigned
};
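A quick sanity check of the flag logic above, comparing 3 against 5:

ConditionFlags flags;
flags.update_from_sub(3, 5);   // result -2: n=1, z=0, c=0 (borrow), v=0
assert(flags.is_less_than());  // n != v  =>  3 < 5 (signed)
assert(!flags.is_above());     // c == 0  =>  not above (unsigned)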
Memory is emulated with a stack and optional literal pool:
class EmulatorMemory {
 public:
  explicit EmulatorMemory(size_t stack_size = 64 * 1024);

  // Stack operations
  void push(int64_t value, uint64_t& sp);
  auto pop(uint64_t& sp) -> int64_t;
  void alloc(size_t bytes, uint64_t& sp);
  void dealloc(size_t bytes, uint64_t& sp);

  // Memory access
  auto load64(uint64_t addr) const -> int64_t;
  void store64(uint64_t addr, int64_t value);
  auto load_double(uint64_t addr) const -> double;
  void store_double(uint64_t addr, double value);

 private:
  std::vector<uint8_t> stack_;
  std::vector<uint8_t> literal_pool_;
  uint64_t stack_base_;
};
The Execution Engine
The execution engine is a straightforward interpreter:
class ExecutionEngine {
 public:
  auto execute(const EmulatorProgram& program, EmulatorState& state)
      -> ExecutionResult;
  void execute_instruction(const EmulatorInst& inst, EmulatorState& state);

 private:
  // Integer arithmetic
  void exec_add_int64_(const EmulatorInst& inst, EmulatorState& state) {
    auto a = state.regs().get_int(inst.src1, state.is_arm64());
    auto b = state.regs().get_int(inst.src2, state.is_arm64());
    state.regs().set_int(inst.dst, a + b, state.is_arm64());
  }

  void exec_div_int64_(const EmulatorInst& inst, EmulatorState& state) {
    auto dividend = state.regs().get_int(inst.src1, state.is_arm64());
    auto divisor = state.regs().get_int(inst.src2, state.is_arm64());
    if (divisor == 0) {
      throw error::division_by_zero(state.regs().pc, create_snapshot_(state));
    }
    // Handle INT64_MIN / -1 overflow case
    if (dividend == std::numeric_limits<int64_t>::min() && divisor == -1) {
      state.regs().set_int(inst.dst, std::numeric_limits<int64_t>::min(),
                           state.is_arm64());
    } else {
      state.regs().set_int(inst.dst, dividend / divisor, state.is_arm64());
    }
  }

  // Comparison setting flags
  void exec_cmp_int64_(const EmulatorInst& inst, EmulatorState& state) {
    auto a = state.regs().get_int(inst.src1, state.is_arm64());
    auto b = state.regs().get_int(inst.src2, state.is_arm64());
    state.flags().update_from_sub(a, b);
  }

  // Conditional branching
  void exec_branch_if_lt_(const EmulatorInst& inst, EmulatorState& state) {
    if (state.flags().is_less_than()) {
      state.regs().pc = inst.target;
      ++stats_.branches_taken;
    } else {
      ++state.regs().pc;
      ++stats_.branches_not_taken;
    }
  }
};
The main execution loop:
auto ExecutionEngine::execute(const EmulatorProgram& program,
                              EmulatorState& state) -> ExecutionResult {
  state.regs().pc = 0;
  state.set_returned(false);

  while (state.regs().pc < program.instructions.size() &&
         !state.has_returned()) {
    // Check instruction limit
    if (config_.max_instructions > 0 &&
        stats_.instructions_executed >= config_.max_instructions) {
      throw error::max_instructions_reached(config_.max_instructions,
                                            state.regs().pc);
    }

    const auto& inst = program.instructions[state.regs().pc];
    execute_instruction(inst, state);
    ++stats_.instructions_executed;

    // Advance PC (unless branch already modified it)
    if (!is_control_flow(inst.op)) {
      ++state.regs().pc;
    }
  }
  return ExecutionResult::kSuccess;
}
The Debugger: Step Through JIT Code
The debugger provides fine-grained control over emulator execution:
class Debugger {
 public:
  // Breakpoint Management
  auto add_breakpoint(uint32_t instruction_index) -> uint32_t;
  void remove_breakpoint(uint32_t instruction_index);
  auto has_breakpoint(uint32_t instruction_index) const -> bool;

  // Execution Control
  void attach(ExecutionEngine* engine, const EmulatorProgram* program,
              EmulatorState* state);
  auto run() -> DebugState;   // Run until breakpoint
  auto step() -> DebugState;  // Single step
  auto continue_execution() -> DebugState;

  // State Inspection
  auto get_int_register(uint8_t reg) const -> std::optional<int64_t>;
  auto get_fp_register(uint8_t reg) const -> std::optional<double>;
  auto get_flags() const -> std::optional<ConditionFlags>;
  auto get_stack_pointer() const -> std::optional<uint64_t>;

  // Execution History
  auto history() const -> const std::deque<ExecutionRecord>&;
  void set_history_enabled(bool enabled);

 private:
  ExecutionEngine* engine_;
  const EmulatorProgram* program_;
  EmulatorState* emulator_state_;
  std::vector<Breakpoint> breakpoints_;
  std::deque<ExecutionRecord> history_;
};
Execution history records register changes for each instruction:
struct ExecutionRecord {
  uint32_t instruction_index;
  EmulatorOp op;
  uint32_t source_offset;
  bool is_fp;

  // Integer register values
  int64_t dst_before, src1_before, src2_before, dst_after;

  // FP register values
  double fp_dst_before, fp_src1_before, fp_src2_before, fp_dst_after;
};
This enables powerful debugging workflows: set a breakpoint, run to it, then step through while watching register values change.
The Helper Dispatcher: Bridging to the Document API
JIT stencils need to access document fields, but we can't execute real pointer dereferences in emulation. The HelperDispatcher provides document API functions:
enum class HelperId : uint32_t {
  // Field access
  kGetInt64Field,
  kGetDoubleField,
  kGetBoolField,
  kGetStringField,
  kIsFieldNull,

  // Cached field access
  kGetInt64FieldCached,
  kGetDoubleFieldCached,

  // String operations
  kStringEq,
  kStringNe,
  kStringLt,
  kStringContains,
  kStringStartsWith,
  kStringEndsWith,

  // Array operations
  kArrayGetInt64,
  kArrayContainsInt64,
  kGetArrayLength,
};

class HelperDispatcher {
 public:
  void dispatch(HelperId id, EmulatorState& state);

 private:
  void dispatch_get_int64_field_(EmulatorState& state) {
    // Extract document pointer and field name from calling convention
    auto doc_ptr = reinterpret_cast<const Document*>(
        state.is_arm64() ? state.regs().r[arm64_reg::kArg0]
                         : state.regs().r[x86_reg::kArg0]);
    auto field_name_ptr = reinterpret_cast<const char*>(
        state.is_arm64() ? state.regs().r[arm64_reg::kArg1]
                         : state.regs().r[x86_reg::kArg1]);

    // Actually call the document API
    auto result = doc_ptr->get_int64(field_name_ptr);

    // Store result in return register
    if (state.is_arm64()) {
      state.regs().r[arm64_reg::kReturnInt] = result;
    } else {
      state.regs().r[x86_reg::kReturnInt] = result;
    }
  }
};
This allows stencils that call helper functions to execute correctly in emulation while still accessing real document data.
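For reference, the execution engine's handler for kCallHelper is essentially a hand-off to the dispatcher. The sketch below assumes the engine holds a dispatcher_ pointer, which isn't shown in the ExecutionEngine class earlier; the real wiring may differ:

// Sketch: routing a kCallHelper IR instruction to the dispatcher.
void ExecutionEngine::exec_call_helper_(const EmulatorInst& inst,
                                        EmulatorState& state) {
  // Arguments and the return value travel through the emulated register
  // file, following the native calling convention of the stencil's ISA.
  dispatcher_->dispatch(static_cast<HelperId>(inst.helper_id), state);
}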
The High-Level Interface
The StencilEmulator ties everything together:
class StencilEmulator {
 public:
  explicit StencilEmulator(const EmulatorConfig& config = {});

  // Execute a stencil with a document
  auto execute(const Stencil& stencil, const Document& doc) -> TypedValue;

  // Execute a stencil (pure arithmetic)
  auto execute(const Stencil& stencil) -> TypedValue;

  // Debugging interface
  auto debugger() -> Debugger& { return debugger_; }
  auto prepare(const Stencil& stencil) -> const EmulatorProgram&;
  auto step() -> bool;
  auto run_to_completion() -> TypedValue;

  // Architecture detection
  static auto detect_architecture(const Stencil& stencil) -> ArchType;
  static auto is_x86_64(const Stencil& stencil) -> bool;
  static auto is_aarch64(const Stencil& stencil) -> bool;

  // Translation cache
  void clear_cache();
  auto cache_hits() const -> uint64_t;
  auto cache_misses() const -> uint64_t;

 private:
  auto translate_(const Stencil& stencil) -> const EmulatorProgram&;

  EmulatorConfig config_;
  ExecutionEngine engine_;
  EmulatorState state_;
  Debugger debugger_;
  HelperDispatcher dispatcher_;
  std::unordered_map<uint64_t, EmulatorProgram> cache_;
};
Architecture detection looks at the first few bytes to identify the instruction set:
auto StencilEmulator::detect_architecture(const Stencil& stencil) -> ArchType {
  if (stencil.code.size() < 4) return ArchType::kUnknown;

  // ARM64: fixed 32-bit instructions, often start with specific patterns
  // x86-64: variable length, common patterns include REX prefixes (0x40-0x4F)
  // Heuristic: ARM64 instructions have distinctive bit patterns
  uint32_t first_word = stencil.code[0] | (stencil.code[1] << 8) |
                        (stencil.code[2] << 16) | (stencil.code[3] << 24);

  // Check for common ARM64 patterns
  if ((first_word & 0x9F000000) == 0x91000000       // ADD immediate
      || (first_word & 0xFF000000) == 0xD6000000    // BR/BLR/RET
      || (first_word & 0x7F800000) == 0x2A000000) { // ORR shifted register
    return ArchType::kAArch64;
  }
  return ArchType::kX86_64;
}
Usage Example
Here's how you might use the emulator for debugging:
auto stencil = get_add_int64_stencil();
auto emulator = StencilEmulator {};

// Enable debugging
auto& dbg = emulator.debugger();
dbg.set_history_enabled(true);

// Prepare the stencil (translate to IR)
auto& program = emulator.prepare(stencil);

// Set a breakpoint at instruction 3
dbg.add_breakpoint(3);

// Attach debugger
dbg.attach(&emulator.engine(), &program, &emulator.state());

// Run until breakpoint
auto result = dbg.run();
if (result == DebugState::kBreakpoint) {
  // Inspect state
  std::cout << "Stopped at instruction " << dbg.current_instruction() << "\n";
  std::cout << "Registers:\n" << dbg.format_registers() << "\n";

  // Step through remaining instructions
  while (dbg.step() != DebugState::kFinished) {
    std::cout << dbg.format_current_instruction() << "\n";
  }
}

// Print execution history
std::cout << dbg.format_history(20) << "\n";
Performance Characteristics
The emulator prioritizes observability over speed:
| Metric | Native Execution | Emulator |
|---|---|---|
| Speed | ~1 ns/instruction | ~50-100 ns/instruction |
| Debugging | External tools only | Built-in stepping, breakpoints |
| Platform | Native architecture only | Cross-platform |
| Instrumentation | Requires sampling | Exact instruction counts |
The 50-100x slowdown is acceptable for testing and debugging. For benchmarking JIT code quality, we use native execution.
Conclusion
Building a JIT compiler requires more than just code generation. The disassembler validates that generated code is correct---that patch points land at instruction boundaries, that immediates are in the right positions. The emulator enables cross-platform development, deterministic debugging, and fine-grained performance analysis.
These tools transformed our JIT development workflow:
- Stencil Authors can validate patches without running the code
- Debuggers can step through compiled expressions instruction by instruction
- CI Systems can run the full test suite on any architecture
- Performance Engineers can get exact instruction counts without sampling overhead
The investment in tooling pays for itself many times over. JIT compilation is inherently error-prone---invisible code, platform-specific behavior, subtle encoding bugs. Good tools make the invisible visible.
References
- Intel 64 and IA-32 Architectures Software Developer's Manual
- ARM Architecture Reference Manual for A-profile architecture
- Xu, H., & Kjolstad, F. (2021). Copy-and-Patch Compilation: A fast compilation algorithm for high-level languages and bytecode. OOPSLA.
- Bellard, F. (2005). QEMU, a Fast and Portable Dynamic Translator. USENIX Annual Technical Conference.