JIT Toolchain: Building a Disassembler and CPU Emulator for Database Development
The essential infrastructure that makes Copy-and-Patch JIT development and debugging practical
In our previous post, we explored how Copy-and-Patch JIT compilation achieves native code performance with microsecond compilation times. But generating machine code is only half the battle. How do you debug a stencil that crashes? How do you verify that patched offsets land at the right instruction boundaries? How do you test JIT code on a development machine running a different CPU architecture?
This post dives into the JIT toolchain we built for Cognica Database Engine: a multi-architecture disassembler for validation and a software CPU emulator for cross-platform testing and debugging.
The Problem: JIT Development is Hard
JIT compilation introduces debugging challenges that traditional ahead-of-time compilation avoids:
- Invisible Code: JIT-compiled code doesn't exist until runtime. You can't run it through a debugger before execution.
- Patch Point Validation: Copy-and-Patch JIT relies on patching specific byte offsets. A patch that lands in the middle of an instruction causes crashes or silent corruption.
- Cross-Platform Development: Developers on Apple Silicon need to test x86-64 stencils. Developers on x86-64 need to verify ARM64 code.
- Performance Isolation: When a query runs slowly, is it the JIT code, the interpreter, or the query plan? Isolating JIT behavior requires controlled execution.
These challenges demand specialized tooling: a disassembler that understands our stencil format and an emulator that can execute stencils in isolation.
Architecture Overview
Our JIT toolchain consists of three main components that work together:
The disassemblers decode native machine code for validation and debugging output. The translators convert native code to an architecture-neutral intermediate representation (IR). The execution engine interprets the IR, enabling cross-platform execution and fine-grained debugging.
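To make that flow concrete, here is a minimal sketch of how the three components chain together, using the class names introduced later in this post. The load_stencil and report helpers are hypothetical names invented for this example, and the exact wiring inside the engine differs:

// Sketch only: illustrative wiring of disassembler -> translator -> engine.
auto stencil = load_stencil("add_int64");   // hypothetical helper

X86_64Disassembler disasm;
auto errors = disasm.validate_patches(stencil);  // 1. validate patch points
if (!errors.empty()) {
  report(errors);                                // hypothetical error reporting
}

X86_64Translator translator;
auto program = translator.translate(stencil);    // 2. lower to neutral IR

ExecutionEngine engine;
EmulatorState state;
engine.execute(program, state);                  // 3. interpret the IR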
The Disassemblers: Understanding What We Generated
Why Not Use Existing Tools?
Tools like objdump, llvm-objdump, and Capstone are excellent for general-purpose disassembly. But our stencils have specific requirements:
- Patch Validation: We need to verify that patch offsets align with instruction boundaries and target the correct immediate fields.
- Minimal Footprint: Adding a 50MB LLVM dependency for disassembly is excessive when we only use ~110 instruction patterns.
- Integration: We want disassembly as a first-class debugging feature, not an external tool invocation.
Our disassemblers support exactly the instruction subset used in stencils---nothing more, nothing less.
x86-64 Disassembler
x86-64's variable-length encoding makes disassembly challenging. An instruction can be 1-15 bytes, with complex prefix combinations:
// x86-64 disassembler structure
class X86_64Disassembler {
 public:
  auto disassemble(const Stencil& s) const -> DisassemblyResult;
  auto validate_patches(const Stencil& s) const -> std::vector<std::string>;
  static auto format(const DisassembledInst& inst) -> std::string;

 private:
  // REX prefix structure
  struct Rex {
    bool present;
    bool w;  // 64-bit operand size
    bool r;  // ModRM.reg extension
    bool x;  // SIB.index extension
    bool b;  // ModRM.rm extension
  };

  auto decode_one_(const uint8_t* code, size_t len, uint32_t offset) const
      -> DisassembledInst;
  static auto parse_rex_(uint8_t byte) -> Rex;
  auto decode_modrm_mem_(const uint8_t* code, size_t len, const Rex& rex,
                         bool is_64bit) const -> std::pair<std::string, size_t>;
};
The key complexity lies in the REX prefix and ModR/M byte parsing. A REX prefix (0x40-0x4F) extends register addressing to access r8-r15. The ModR/M byte encodes both the addressing mode and register operands:
auto X86_64Disassembler::parse_rex_(uint8_t byte) -> Rex {
  Rex rex;
  rex.present = (byte >= 0x40 && byte <= 0x4f);
  if (rex.present) {
    rex.w = (byte & 0x08) != 0;  // 64-bit operand
    rex.r = (byte & 0x04) != 0;  // Extends ModRM.reg
    rex.x = (byte & 0x02) != 0;  // Extends SIB.index
    rex.b = (byte & 0x01) != 0;  // Extends ModRM.rm
  }
  return rex;
}
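The ModR/M byte that follows the opcode splits into three fixed fields. A minimal illustration of that split is below; split_modrm is a name invented for this example, and the engine's decode_modrm_mem_ additionally handles SIB bytes and displacements:

// Illustrative only: splitting a ModR/M byte into its three fields.
struct ModRMFields { uint8_t mod, reg, rm; };

auto split_modrm(uint8_t byte) -> ModRMFields {
  return {
      static_cast<uint8_t>((byte >> 6) & 0x3),  // mod: addressing mode (0-3)
      static_cast<uint8_t>((byte >> 3) & 0x7),  // reg: +8 when REX.R is set
      static_cast<uint8_t>(byte & 0x7),         // rm:  +8 when REX.B is set
  };
}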
ARM64 Disassembler
ARM64's fixed 32-bit instruction encoding is simpler to decode but has its own subtleties. All instructions are 4 bytes, and the instruction category is determined by fixed bit positions:
class AArch64Disassembler {
 public:
  auto disassemble(const Stencil& s) const -> DisassemblyResult;
  auto validate_patches(const Stencil& s) const -> std::vector<std::string>;

 private:
  auto decode_one_(uint32_t instr, uint32_t offset) const -> DisassembledInst;

  // Bit field extractors
  static auto rd(uint32_t i) -> uint8_t { return i & 0x1f; }
  static auto rn(uint32_t i) -> uint8_t { return (i >> 5) & 0x1f; }
  static auto rm(uint32_t i) -> uint8_t { return (i >> 16) & 0x1f; }
  static auto rt(uint32_t i) -> uint8_t { return i & 0x1f; }
  static auto rt2(uint32_t i) -> uint8_t { return (i >> 10) & 0x1f; }

  // Instruction decoders by category
  auto decode_dp_reg_(uint32_t instr) const -> DisassembledInst;
  auto decode_dp_imm_(uint32_t instr) const -> DisassembledInst;
  auto decode_fp_(uint32_t instr) const -> DisassembledInst;
  auto decode_ldst_(uint32_t instr) const -> DisassembledInst;
  auto decode_branch_(uint32_t instr) const -> DisassembledInst;
};
ARM64 encodes register operands in fixed 5-bit fields (supporting 32 registers). The rd, rn, rm extractors pull these fields from their standard positions.
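For example, the word 0x8B020020 encodes ADD X0, X1, X2; the extractors above read the operands straight out of their fixed slots (a quick illustration, not engine code):

uint32_t instr = 0x8B020020;       // ADD X0, X1, X2 (shifted register form)
auto dst = instr & 0x1f;           // rd -> 0 (X0)
auto lhs = (instr >> 5) & 0x1f;    // rn -> 1 (X1)
auto rhs = (instr >> 16) & 0x1f;   // rm -> 2 (X2)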
Patch Validation
The most critical function is validating that patch points are legal:
auto X86_64Disassembler::validate_patches(const Stencil& s) const
    -> std::vector<std::string> {
  std::vector<std::string> errors;
  auto result = disassemble(s);

  for (const auto& patch : s.patches) {
    bool found = false;
    for (const auto& inst : result.instructions) {
      if (patch.offset >= inst.offset &&
          patch.offset < inst.offset + inst.length) {
        found = true;
        // Validate patch type matches instruction type
        if (patch.type == PatchType::kImmediate64 ||
            patch.type == PatchType::kAddress64) {
          if (!inst.has_immediate) {
            errors.push_back(fmt::format(
                "Patch at offset {} is in '{}' which has no immediate",
                patch.offset, inst.mnemonic));
          }
        } else if (patch.type == PatchType::kRelativeJump) {
          if (inst.mnemonic != "jmp" && inst.mnemonic != "jnz" &&
              inst.mnemonic != "jz") {
            errors.push_back(fmt::format(
                "Patch at offset {} (type=RelativeJump) is in '{}' "
                "which is not a jump",
                patch.offset, inst.mnemonic));
          }
        }
        break;
      }
    }
    if (!found && patch.offset < s.code.size()) {
      errors.push_back(fmt::format(
          "Patch at offset {} does not match any instruction", patch.offset));
    }
  }
  return errors;
}
This validation catches subtle bugs: a patch offset that's off by one byte would land in the middle of a multi-byte immediate, corrupting the instruction.
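As a concrete illustration (the encoding below is standard x86-64; the misplaced patch is hypothetical):

// "mov rax, imm64" is REX.W (0x48) + opcode 0xB8 + an 8-byte immediate,
// so the only legal Imm64 patch offset inside this instruction is 2.
const uint8_t mov_rax_imm64[10] = {0x48, 0xB8, 0, 0, 0, 0, 0, 0, 0, 0};
// A patch recorded at offset 3 would shift every immediate byte by one and
// spill the last byte into the following instruction -- silent corruption
// that only shows up when the stencil executes.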
The CPU Emulator: Executing Without Hardware
Why Emulate?
Native execution is fast but inflexible:
- Cross-Platform Testing: Test ARM64 stencils on x86-64 development machines (or vice versa)
- Deterministic Debugging: Step through execution instruction by instruction
- Isolation: Execute stencils without affecting system state
- Instrumentation: Count instructions, trace register changes, profile hot paths
The emulator trades execution speed for observability and portability.
Architecture-Neutral IR
The heart of the emulator is an intermediate representation that abstracts away architectural differences. Both x86-64 and ARM64 translate to the same IR opcodes:
enum class EmulatorOp : uint8_t {
  // Integer Arithmetic
  kAddInt64,   // dst = src1 + src2
  kSubInt64,   // dst = src1 - src2
  kMulInt64,   // dst = src1 * src2
  kDivInt64,   // dst = src1 / src2 (signed)
  kModInt64,   // dst = src1 % src2 (signed)
  kNegInt64,   // dst = -src1
  kMsubInt64,  // dst = src2 - src1 * imm_reg (multiply-subtract)

  // Bitwise Operations
  kAndInt64,   // dst = src1 & src2
  kOrInt64,    // dst = src1 | src2
  kXorInt64,   // dst = src1 ^ src2
  kNotInt64,   // dst = ~src1
  kMovkInt64,  // dst = (dst & ~mask) | (imm << shift) - MOVK semantics

  // Floating-Point Arithmetic
  kAddDouble,  // dst = src1 + src2
  kSubDouble,  // dst = src1 - src2
  kMulDouble,  // dst = src1 * src2
  kDivDouble,  // dst = src1 / src2
  kNegDouble,  // dst = -src1
  kXorDouble,  // dst = bitwise_xor(src1, src2) - for XORPD

  // Comparison (sets condition flags)
  kCmpInt64,   // flags = compare(src1, src2)
  kCmpDouble,  // flags = compare(src1, src2)
  kTestInt64,  // flags = test(src1 & src2)

  // Conditional Set (reads flags, writes 0 or 1)
  kSetEq, kSetNe, kSetLt, kSetLe, kSetGt, kSetGe,
  kSetAbove, kSetAboveEq, kSetBelow, kSetBelowEq,
  kSetParity, kSetNoParity,

  // Data Movement
  kMovInt64, kMovDouble, kCselInt64,
  kLoadImm64, kLoadImmDouble,
  kZeroExtend8To64, kZeroExtend16To64, kZeroExtend32To64,
  kSignExtend8To64, kSignExtend16To64, kSignExtend32To64,

  // Type Conversions
  kInt64ToDouble,  // f[dst] = (double)r[src1]
  kDoubleToInt64,  // r[dst] = (int64_t)f[src1]

  // Memory Operations
  kLoadMem8, kLoadMem16, kLoadMem32, kLoadMem64, kLoadMemDouble,
  kStoreMem8, kStoreMem16, kStoreMem32, kStoreMem64, kStoreMemDouble,

  // Stack Operations
  kPush64, kPop64, kAllocStack, kDeallocStack,

  // Control Flow
  kBranch,           // pc = target (unconditional)
  kBranchIfZero,     // if (src1 == 0) pc = target
  kBranchIfNotZero,  // if (src1 != 0) pc = target
  kBranchIfEq, kBranchIfNe, kBranchIfLt,
  kBranchIfLe, kBranchIfGt, kBranchIfGe,

  // Function Calls
  kCallHelper,  // call helper function by ID
  kReturn,      // return from stencil

  kNop,
  kBreakpoint,
};
Each IR instruction is a compact struct:
struct EmulatorInst {
  EmulatorOp op;  // Operation type
  uint8_t dst;    // Destination register (0-31)
  uint8_t src1;   // First source register
  uint8_t src2;   // Second source register
  union {
    int64_t imm_i64;     // 64-bit immediate integer
    double imm_f64;      // 64-bit immediate double
    uint32_t target;     // Branch target (instruction index)
    uint32_t helper_id;  // Helper function identifier
    int32_t mem_offset;  // Memory access offset
  };
  uint32_t source_offset;  // Byte offset in original machine code
};
The Translation Pipeline
Translators convert architecture-specific machine code into the neutral IR.
The x86-64 translator handles the variable-length encoding complexity:
class X86_64Translator {
 public:
  auto translate(const Stencil& stencil) -> EmulatorProgram;
  auto translate(std::span<const uint8_t> code, JITType result_type,
                 StencilOp stencil_op) -> EmulatorProgram;

 private:
  struct Rex {
    bool present = false;
    bool w = false;  // 64-bit operand size
    bool r = false;  // ModRM.reg extension
    bool x = false;  // SIB.index extension
    bool b = false;  // ModRM.rm extension
  };

  struct ModRM {
    uint8_t mod;  // Addressing mode (0-3)
    uint8_t reg;  // Register operand / opcode extension
    uint8_t rm;   // Register/Memory operand
  };

  // Get emulator register index from x86-64 register encoding
  static auto gpr_to_emu_reg_(uint8_t reg, bool rex_ext) -> uint8_t;

  // Translate specific instruction types
  void emit_add_(const DecodedInst& inst, const uint8_t* code);
  void emit_mov_(const DecodedInst& inst, const uint8_t* code);
  void emit_cmp_(const DecodedInst& inst, const uint8_t* code);
  void emit_jcc_(uint8_t cc, const DecodedInst& inst, const uint8_t* code);

  // SSE instructions
  void emit_sse_arith_(uint8_t op, const DecodedInst& inst);
  void emit_cvtsi2sd_(const DecodedInst& inst, const uint8_t* code);

  // Resolve branch targets after first pass
  void resolve_branch_targets_();

  std::unordered_map<uint32_t, uint32_t> offset_to_index_;
  std::vector<std::pair<uint32_t, uint32_t>> pending_branches_;
};
The ARM64 translator leverages the regular instruction format:
class AArch64Translator {
 public:
  auto translate(const Stencil& stencil) -> EmulatorProgram;

 private:
  // Bit field extractors
  static auto rd_(uint32_t instr) -> uint8_t { return instr & 0x1f; }
  static auto rn_(uint32_t instr) -> uint8_t { return (instr >> 5) & 0x1f; }
  static auto rm_(uint32_t instr) -> uint8_t { return (instr >> 16) & 0x1f; }
  static auto sf_(uint32_t instr) -> bool { return (instr >> 31) & 1; }

  // Immediate extractors with sign extension
  static auto imm19_(uint32_t instr) -> int32_t {
    auto imm = static_cast<int32_t>((instr >> 5) & 0x7ffff);
    if (imm & 0x40000) {
      imm |= static_cast<int32_t>(0xfff80000);  // Sign extend
    }
    return imm * 4;  // Scale by 4 for branch offset
  }

  // Instruction category translators
  void translate_dp_reg_(uint32_t instr);  // Data processing (register)
  void translate_dp_imm_(uint32_t instr);  // Data processing (immediate)
  void translate_fp_(uint32_t instr);      // Floating-point
  void translate_ldst_(uint32_t instr);    // Load/store
  void translate_branch_(uint32_t instr);  // Branches
};
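As a rough sketch of what one of these category translators does, an ADD (shifted register) lowers to a single IR instruction. This is illustrative only: it handles just the ADD case, and it assumes an EmulatorProgram member named program_ that collects translated instructions, which isn't shown in the class above:

// Sketch, under the assumptions stated above.
void AArch64Translator::translate_dp_reg_(uint32_t instr) {
  // Example: 0x8B020020 decodes to ADD X0, X1, X2
  EmulatorInst out{};
  out.op = EmulatorOp::kAddInt64;
  out.dst = rd_(instr);   // X0
  out.src1 = rn_(instr);  // X1
  out.src2 = rm_(instr);  // X2
  program_.instructions.push_back(out);
}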
Emulator State: Registers, Flags, and Memory
The emulator maintains a complete CPU state:
// Register file holding both integer and floating-point registers
struct RegisterFile {
  std::array<int64_t, 32> r;  // Integer registers
  std::array<double, 32> f;   // Floating-point registers
  uint64_t sp;                // Stack pointer
  uint64_t pc;                // Program counter (instruction index)

  // ARM64 XZR (zero register) handling
  auto get_int(uint8_t reg, bool is_arm64 = false) const -> int64_t {
    if (is_arm64 && reg == 31) return 0;  // XZR always reads as 0
    return r[reg];
  }

  void set_int(uint8_t reg, int64_t value, bool is_arm64 = false) {
    if (is_arm64 && reg == 31) return;  // Writes to XZR are discarded
    r[reg] = value;
  }
};

// Condition flags (NZCV - compatible with both x86-64 and AArch64)
struct ConditionFlags {
  bool n;  // Negative: result is negative
  bool z;  // Zero: result is zero
  bool c;  // Carry: unsigned overflow/borrow
  bool v;  // oVerflow: signed overflow

  // Update flags after integer subtraction (CMP)
  void update_from_sub(int64_t src1, int64_t src2) {
    auto result = src1 - src2;
    n = (result < 0);
    z = (result == 0);
    // Carry for subtraction: set if no borrow (src1 >= src2 unsigned)
    c = (static_cast<uint64_t>(src1) >= static_cast<uint64_t>(src2));
    // Overflow: signs differ and result sign differs from src1
    v = (((src1 ^ src2) & (src1 ^ result)) < 0);
  }

  // IEEE 754 floating-point comparison flags
  void update_from_fcmp(double src1, double src2) {
    if (std::isnan(src1) || std::isnan(src2)) {
      n = false; z = false; c = true; v = true;  // Unordered
    } else if (src1 == src2) {
      n = false; z = true; c = true; v = false;
    } else if (src1 < src2) {
      n = true; z = false; c = false; v = false;
    } else {
      n = false; z = false; c = true; v = false;
    }
  }

  // Condition evaluation
  auto is_equal() const -> bool { return z; }
  auto is_less_than() const -> bool { return n != v; }
  auto is_greater_than() const -> bool { return !z && (n == v); }
  auto is_above() const -> bool { return c && !z; }  // Unsigned
};
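A quick sanity check of the flag logic above, comparing 3 against 5:

ConditionFlags flags;
flags.update_from_sub(3, 5);   // result -2: n=1, z=0, c=0 (borrow), v=0
assert(flags.is_less_than());  // n != v  =>  3 < 5 (signed)
assert(!flags.is_above());     // c == 0  =>  not above (unsigned)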
Memory is emulated with a stack and optional literal pool:
class EmulatorMemory {
 public:
  explicit EmulatorMemory(size_t stack_size = 64 * 1024);

  // Stack operations
  void push(int64_t value, uint64_t& sp);
  auto pop(uint64_t& sp) -> int64_t;
  void alloc(size_t bytes, uint64_t& sp);
  void dealloc(size_t bytes, uint64_t& sp);

  // Memory access
  auto load64(uint64_t addr) const -> int64_t;
  void store64(uint64_t addr, int64_t value);
  auto load_double(uint64_t addr) const -> double;
  void store_double(uint64_t addr, double value);

 private:
  std::vector<uint8_t> stack_;
  std::vector<uint8_t> literal_pool_;
  uint64_t stack_base_;
};
The Execution Engine
The execution engine is a straightforward interpreter:
class ExecutionEngine {
 public:
  auto execute(const EmulatorProgram& program, EmulatorState& state)
      -> ExecutionResult;
  void execute_instruction(const EmulatorInst& inst, EmulatorState& state);

 private:
  // Integer arithmetic
  void exec_add_int64_(const EmulatorInst& inst, EmulatorState& state) {
    auto a = state.regs().get_int(inst.src1, state.is_arm64());
    auto b = state.regs().get_int(inst.src2, state.is_arm64());
    state.regs().set_int(inst.dst, a + b, state.is_arm64());
  }

  void exec_div_int64_(const EmulatorInst& inst, EmulatorState& state) {
    auto dividend = state.regs().get_int(inst.src1, state.is_arm64());
    auto divisor = state.regs().get_int(inst.src2, state.is_arm64());
    if (divisor == 0) {
      throw error::division_by_zero(state.regs().pc, create_snapshot_(state));
    }
    // Handle INT64_MIN / -1 overflow case
    if (dividend == std::numeric_limits<int64_t>::min() && divisor == -1) {
      state.regs().set_int(inst.dst, std::numeric_limits<int64_t>::min(),
                           state.is_arm64());
    } else {
      state.regs().set_int(inst.dst, dividend / divisor, state.is_arm64());
    }
  }

  // Comparison setting flags
  void exec_cmp_int64_(const EmulatorInst& inst, EmulatorState& state) {
    auto a = state.regs().get_int(inst.src1, state.is_arm64());
    auto b = state.regs().get_int(inst.src2, state.is_arm64());
    state.flags().update_from_sub(a, b);
  }

  // Conditional branching
  void exec_branch_if_lt_(const EmulatorInst& inst, EmulatorState& state) {
    if (state.flags().is_less_than()) {
      state.regs().pc = inst.target;
      ++stats_.branches_taken;
    } else {
      ++state.regs().pc;
      ++stats_.branches_not_taken;
    }
  }
};
The main execution loop:
auto ExecutionEngine::execute(const EmulatorProgram& program,
                              EmulatorState& state) -> ExecutionResult {
  state.regs().pc = 0;
  state.set_returned(false);

  while (state.regs().pc < program.instructions.size() &&
         !state.has_returned()) {
    // Check instruction limit
    if (config_.max_instructions > 0 &&
        stats_.instructions_executed >= config_.max_instructions) {
      throw error::max_instructions_reached(config_.max_instructions,
                                            state.regs().pc);
    }

    const auto& inst = program.instructions[state.regs().pc];
    execute_instruction(inst, state);
    ++stats_.instructions_executed;

    // Advance PC (unless branch already modified it)
    if (!is_control_flow(inst.op)) {
      ++state.regs().pc;
    }
  }
  return ExecutionResult::kSuccess;
}
The Debugger: Step Through JIT Code
The debugger provides fine-grained control over emulator execution:
class Debugger {
 public:
  // Breakpoint Management
  auto add_breakpoint(uint32_t instruction_index) -> uint32_t;
  void remove_breakpoint(uint32_t instruction_index);
  auto has_breakpoint(uint32_t instruction_index) const -> bool;

  // Execution Control
  void attach(ExecutionEngine* engine, const EmulatorProgram* program,
              EmulatorState* state);
  auto run() -> DebugState;   // Run until breakpoint
  auto step() -> DebugState;  // Single step
  auto continue_execution() -> DebugState;

  // State Inspection
  auto get_int_register(uint8_t reg) const -> std::optional<int64_t>;
  auto get_fp_register(uint8_t reg) const -> std::optional<double>;
  auto get_flags() const -> std::optional<ConditionFlags>;
  auto get_stack_pointer() const -> std::optional<uint64_t>;

  // Execution History
  auto history() const -> const std::deque<ExecutionRecord>&;
  void set_history_enabled(bool enabled);

 private:
  ExecutionEngine* engine_;
  const EmulatorProgram* program_;
  EmulatorState* emulator_state_;
  std::vector<Breakpoint> breakpoints_;
  std::deque<ExecutionRecord> history_;
};
Execution history records register changes for each instruction:
struct ExecutionRecord {
  uint32_t instruction_index;
  EmulatorOp op;
  uint32_t source_offset;
  bool is_fp;

  // Integer register values
  int64_t dst_before, src1_before, src2_before, dst_after;

  // FP register values
  double fp_dst_before, fp_src1_before, fp_src2_before, fp_dst_after;
};
This enables powerful debugging workflows: set a breakpoint, run to it, then step through while watching register values change.
The Helper Dispatcher: Bridging to the Document API
JIT stencils need to access document fields, but we can't execute real pointer dereferences in emulation. The HelperDispatcher provides document API functions:
enum class HelperId : uint32_t {
  // Field access
  kGetInt64Field,
  kGetDoubleField,
  kGetBoolField,
  kGetStringField,
  kIsFieldNull,

  // Cached field access
  kGetInt64FieldCached,
  kGetDoubleFieldCached,

  // String operations
  kStringEq,
  kStringNe,
  kStringLt,
  kStringContains,
  kStringStartsWith,
  kStringEndsWith,

  // Array operations
  kArrayGetInt64,
  kArrayContainsInt64,
  kGetArrayLength,
};

class HelperDispatcher {
 public:
  void dispatch(HelperId id, EmulatorState& state);

 private:
  void dispatch_get_int64_field_(EmulatorState& state) {
    // Extract document pointer and field name from calling convention
    auto doc_ptr = reinterpret_cast<const Document*>(
        state.is_arm64() ? state.regs().r[arm64_reg::kArg0]
                         : state.regs().r[x86_reg::kArg0]);
    auto field_name_ptr = reinterpret_cast<const char*>(
        state.is_arm64() ? state.regs().r[arm64_reg::kArg1]
                         : state.regs().r[x86_reg::kArg1]);

    // Actually call the document API
    auto result = doc_ptr->get_int64(field_name_ptr);

    // Store result in return register
    if (state.is_arm64()) {
      state.regs().r[arm64_reg::kReturnInt] = result;
    } else {
      state.regs().r[x86_reg::kReturnInt] = result;
    }
  }
};
This allows stencils that call helper functions to execute correctly in emulation while still accessing real document data.
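For reference, the execution engine's handler for kCallHelper is essentially a hand-off to the dispatcher. The sketch below assumes the engine holds a dispatcher_ pointer, which isn't shown in the ExecutionEngine class earlier; the real wiring may differ:

// Sketch: routing a kCallHelper IR instruction to the dispatcher.
void ExecutionEngine::exec_call_helper_(const EmulatorInst& inst,
                                        EmulatorState& state) {
  // Arguments and the return value travel through the emulated register
  // file, following the native calling convention of the stencil's ISA.
  dispatcher_->dispatch(static_cast<HelperId>(inst.helper_id), state);
}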
The High-Level Interface
The StencilEmulator ties everything together:
class StencilEmulator {
 public:
  explicit StencilEmulator(const EmulatorConfig& config = {});

  // Execute a stencil with a document
  auto execute(const Stencil& stencil, const Document& doc) -> TypedValue;

  // Execute a stencil (pure arithmetic)
  auto execute(const Stencil& stencil) -> TypedValue;

  // Debugging interface
  auto debugger() -> Debugger& { return debugger_; }
  auto prepare(const Stencil& stencil) -> const EmulatorProgram&;
  auto step() -> bool;
  auto run_to_completion() -> TypedValue;

  // Architecture detection
  static auto detect_architecture(const Stencil& stencil) -> ArchType;
  static auto is_x86_64(const Stencil& stencil) -> bool;
  static auto is_aarch64(const Stencil& stencil) -> bool;

  // Translation cache
  void clear_cache();
  auto cache_hits() const -> uint64_t;
  auto cache_misses() const -> uint64_t;

 private:
  auto translate_(const Stencil& stencil) -> const EmulatorProgram&;

  EmulatorConfig config_;
  ExecutionEngine engine_;
  EmulatorState state_;
  Debugger debugger_;
  HelperDispatcher dispatcher_;
  std::unordered_map<uint64_t, EmulatorProgram> cache_;
};
Architecture detection looks at the first few bytes to identify the instruction set:
auto StencilEmulator::detect_architecture(const Stencil& stencil) -> ArchType {
  if (stencil.code.size() < 4) return ArchType::kUnknown;

  // ARM64: fixed 32-bit instructions, often start with specific patterns
  // x86-64: variable length, common patterns include REX prefixes (0x40-0x4F)
  // Heuristic: ARM64 instructions have distinctive bit patterns
  uint32_t first_word = stencil.code[0] | (stencil.code[1] << 8) |
                        (stencil.code[2] << 16) | (stencil.code[3] << 24);

  // Check for common ARM64 patterns
  if ((first_word & 0x9F000000) == 0x91000000       // ADD immediate
      || (first_word & 0xFF000000) == 0xD6000000    // BR/BLR/RET
      || (first_word & 0x7F800000) == 0x2A000000) { // ORR shifted register
    return ArchType::kAArch64;
  }
  return ArchType::kX86_64;
}
Usage Example
Here's how you might use the emulator for debugging:
auto stencil = get_add_int64_stencil();
auto emulator = StencilEmulator {};

// Enable debugging
auto& dbg = emulator.debugger();
dbg.set_history_enabled(true);

// Prepare the stencil (translate to IR)
auto& program = emulator.prepare(stencil);

// Set a breakpoint at instruction 3
dbg.add_breakpoint(3);

// Attach debugger
dbg.attach(&emulator.engine(), &program, &emulator.state());

// Run until breakpoint
auto result = dbg.run();
if (result == DebugState::kBreakpoint) {
  // Inspect state
  std::cout << "Stopped at instruction " << dbg.current_instruction() << "\n";
  std::cout << "Registers:\n" << dbg.format_registers() << "\n";

  // Step through remaining instructions
  while (dbg.step() != DebugState::kFinished) {
    std::cout << dbg.format_current_instruction() << "\n";
  }
}

// Print execution history
std::cout << dbg.format_history(20) << "\n";
Performance Characteristics
The emulator prioritizes observability over speed:
| Metric | Native Execution | Emulator |
|---|---|---|
| Speed | ~1 ns/instruction | ~50-100 ns/instruction |
| Debugging | External tools only | Built-in stepping, breakpoints |
| Platform | Native architecture only | Cross-platform |
| Instrumentation | Requires sampling | Exact instruction counts |
The 50-100x slowdown is acceptable for testing and debugging. For benchmarking JIT code quality, we use native execution.
Conclusion
Building a JIT compiler requires more than just code generation. The disassembler validates that generated code is correct---that patch points land at instruction boundaries, that immediates are in the right positions. The emulator enables cross-platform development, deterministic debugging, and fine-grained performance analysis.
These tools transformed our JIT development workflow:
- Stencil Authors can validate patches without running the code
- Debuggers can step through compiled expressions instruction by instruction
- CI Systems can run the full test suite on any architecture
- Performance Engineers can get exact instruction counts without sampling overhead
The investment in tooling pays for itself many times over. JIT compilation is inherently error-prone---invisible code, platform-specific behavior, subtle encoding bugs. Good tools make the invisible visible.
References
- Intel 64 and IA-32 Architectures Software Developer's Manual
- ARM Architecture Reference Manual for A-profile architecture
- Xu, H., & Kjolstad, F. (2021). Copy-and-Patch Compilation: A fast compilation algorithm for high-level languages and bytecode. OOPSLA.
- Bellard, F. (2005). QEMU, a Fast and Portable Dynamic Translator. USENIX Annual Technical Conference.