Magic boxes

If you have a thing that looks like source code->magic box->program you run, compilers are the magic box. If you pass source code around everywhere and run the magic box late enough1 people call it “just-in-time” and then act very impressed. If the magic box just runs the source code and doesn’t output a program you run, then they call it an interpreter instead, and are no longer impressed.

Sometimes those interpreter magic boxes have another magic box inside them that does bytecode->second, smaller magic box->program you run. If you run the second, smaller magic box on snippets of the program that you notice you’re running a lot, then you call it a hotspot JIT that you integrated into an interpreter, and I’m pretty sure Google or Oracle immediately offers you $300k/yr to suffer through absolutely unhinged CS PhD code for a living2.

We’re gonna write a magic box today.

Someone else’s magic box

I could write a JIT background introduction post here, but really there are a bunch of them already and I don’t think I would add very much writing another near-identical one. Instead, I’ll be referring to this blog post: none of this post will make sense unless you read it. I’ll wait.

Ok cool you’re back. Good blog post, right? That’s basically JIT compilers 101: you can JIT compile an interpreter by abstractly interpreting what operations the program would do if you started at the first bytecode operation and ran each sequential one, but instead of doing the operations immediately you instead record what the operation would do, and then at the end have one linear set of operations that, if you jump to the entry, will have the same result as executing all the operations in sequence. This final set of operations is compiled assembly code, and so Blazing Fast(tm).

Or, well. That’s the hope. The final set of operations is only as fast as what you generate: you could in fact record every single instruction that was executed while you were doing the interpreter part, and then run all those instructions again for your “JIT compiled output”, and while it would technically be JIT compiled it wouldn’t be a very good magic box and probably wouldn’t actually be much faster or different3.

A smarter magic box

Smart JITs record what information they were able to glean statically from the first pass, in order to emit final output code that doesn’t have to do operations that are redundant or useless for the output. Very smart JITs like V8 or HotSpot have so many of these that they are internally basically as complex as, or more complex than, LLVM or GCC are for normal compilation, because the job is essentially the same: optimizing code. That’s too hard so we’re going to mostly ignore it for now.4 There is one thing we can make slightly smarter, though.

That BF JIT from the blog post you read (you did read it, right?) put all the operands for each operation at a constant location each time. BF is simple enough that this isn’t a problem for it at all, since it’s really a 1-register machine: > implicitly depends on the BF data pointer being in register r13 for it to manipulate, and there are no other live values in the VM other than values stored in the data array. The problem you then have if you were, say, making a calculator is: how would you JIT compile +, which has two operands and not just a static “current register”? For this example, we’ll be using the expression (0+1)+(2+3), which you can evaluate with a simple program in Rust:

enum Expr {
    Num(i64),
    Add(Box<Expr>, Box<Expr>),
    // ... more operations
}
type Value = i64;

fn eval(e: &Expr) -> Value {
    match e {
        Expr::Num(u) => *u,
        Expr::Add(left, right) => {
            let left = eval(left);
            let right = eval(right);
            left + right
        }
    }
}
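
For concreteness, here’s what building that expression and walking it with eval looks like (just a quick sketch; the expr binding is mine, not anything special):

// (0+1)+(2+3) as an Expr tree, evaluated with the tree-walking interpreter.
let expr = Expr::Add(
    Box::new(Expr::Add(Box::new(Expr::Num(0)), Box::new(Expr::Num(1)))),
    Box::new(Expr::Add(Box::new(Expr::Num(2)), Box::new(Expr::Num(3)))),
);
assert_eq!(eval(&expr), 6);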

First, a possibly-obvious observation: a recursive evaluation of a tree can be turned into an equivalent set of linear operations, by instead recording, in order5, each operation you need to evaluate into a linear list. (0+1)+(2+3) becomes 0, 1, +, 2, 3, +, + in this way. I’ve always called it “linearization”, presumably after stealing it from a paper I read once upon a time; other people call it “flattening” an AST or (if they’re nerds) “A-Normal Form conversion”. This is part-and-parcel of JIT compilation of expression trees, or any non-JIT compiler: you can think of the expression tree as identical to this instead:

produce 0
produce 1
evaluate +
produce 2
produce 3
evaluate +
evaluate +
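
If you did want that intermediate list as an actual data structure, the recording step is only a few lines. Here’s a sketch (the Op enum is hypothetical, purely for illustration):

// Flatten an Expr into a post-order list of operations.
enum Op {
    Produce(i64),
    EvaluateAdd,
}

fn linearize(e: &Expr, out: &mut Vec<Op>) {
    match e {
        Expr::Num(u) => out.push(Op::Produce(*u)),
        Expr::Add(left, right) => {
            linearize(left, out);      // operands get recorded first...
            linearize(right, out);
            out.push(Op::EvaluateAdd); // ...then the operation that consumes them
        }
    }
}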

And, instead of first recording your traversal and then processing the linear list later, you can just do your processing in the recursive traversal directly.

If the equivalent code you are emitting from your JIT compiler for + expects that the left and right operands and the resulting value are always in the same registers each time (say, rax and rdi, with the output in rax), then you also have to make sure that trees of operations correctly put the operands in the correct place in between operations, and that none of them ever clobber values that you put in those registers and depend on being kept the same across operations. Consider how you would evaluate (0+1)+(2+3) if each + was emitted identically: you evaluate the left tree 0+1, whose operands come in rax and rdi and which leaves its output in rax = 1, and now have to execute 2+3 which…needs rax = 2 and rdi = 3 when it evaluates. Messy!

Here a lot of simple JIT blog posts cheat. They just use the processor stack. + takes two operands and outputs one result; you could write it in assembly as pop rax; pop rcx; lea rax, [rax+rcx]; push rax, so the operands and results need to be at the top of the stack for each assembly snippet, and then whatever registers it uses can be anything you want, because none of them have to be kept untouched across operations. You only have to make sure that you don’t clobber any stack values, which is pretty easy - your stack when running the right subtree of the above example is then [3][2][1] and the + only twiddles the top of the stack and leaves 1 untouched, producing [5][1].
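
Here’s a sketch of what that cheating looks like, using the same dynasm-rs setup as the full example a little further down (the function name is made up for this post):

// Every operation reads its operands off the top of the machine stack and pushes
// its result back, so nothing has to survive in a register across operations.
fn compile_cheating(e: &Expr, ops: &mut Assembler) {
    match e {
        Expr::Num(u) => {
            dynasm!(ops
                ; mov rax, QWORD *u
                ; push rax
            );
        },
        Expr::Add(left, right) => {
            compile_cheating(left, ops);
            compile_cheating(right, ops);
            dynasm!(ops
                ; pop rax              // right operand (pushed last)
                ; pop rcx              // left operand
                ; lea rax, [rax + rcx]
                ; push rax             // result goes back on the stack
            );
        },
    }
}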

Here’s a neat trick, which someone told me is technically a variant of “Sethi–Ullman local register allocation” but I can never remember and so call “Oberon-0 abstract stack machine register windowing stuff”6 or similar every time I mention it: Keep track of what stack offset you would be running at, if you did the push/pop stack stuff for operation inputs and outputs. And then map values at each stack offset to a register, and instead of needing to shuffle values to different registers to keep them alive, you emit a different assembly snippet for each operation depending on what offset it would be executing at, if it was using the stack.

Here’s an example, using the yaxpeax_x86 and dynasm-rs crates for dynamic assembly creation.

use dynasmrt::x64::Assembler;
use dynasmrt::{dynasm, DynasmApi};
use yaxpeax_x86::long_mode::RegSpec;
use once_cell::sync::Lazy;

// abstract stack slot -> register map
#[allow(non_upper_case_globals)]
static regmap: Lazy<Vec<RegSpec>> = Lazy::new(|| vec![RegSpec::rax(), RegSpec::rdi(), RegSpec::rsi()]);
fn compile(e: &Expr, offset: usize, ops: &mut Assembler) -> usize {
    match e {
        Expr::Num(u) => {
            dynasm!(ops
                ; mov Rq(regmap[offset].num()), QWORD *u
            );
            offset + 1 // abstract stack push
        },
        Expr::Add(left, right) => {
            let after_left = compile(left, offset, ops);
            let after_right = compile(right, after_left, ops);
            let pop_right = after_right - 1; // abstract stack pop
            let pop_left = pop_right - 1; // abstract stack pop
            dynasm!(ops
                ; add Rq(regmap[pop_left].num()), Rq(regmap[pop_right].num())
            );
            pop_left + 1 // abstract stack push
        },
    }
}

fn jit_and_run(e: &Expr) -> Value {
    let mut ops = Assembler::new().unwrap();
    let start = ops.offset();
    let res = compile(e, 0, &mut ops);
    assert_eq!(res, 1); // the whole tree of operations should boil down to one result
    dynasm!(ops
        ; ret
    );
    let _end = ops.offset(); // not used here, but marks where the emitted code ends
    let buf = ops.finalize().unwrap();
    let compiled: extern "C" fn()->Value = unsafe { core::mem::transmute(buf.ptr(start)) };
    compiled()
}
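
A quick sanity check, reusing the expr we built back in the eval sketch (again just illustrative glue):

// The JIT-compiled code should agree with the tree-walking interpreter.
assert_eq!(jit_and_run(&expr), eval(&expr)); // both give 6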

If we had the stack->register map as rax, rdi, rsi for example7, then the final assembly for (0+1)+(2+3) would instead be

mov rax, 0; // 0
mov rdi, 1; // 1
add rax, rdi; // +
mov rdi, 2; // 2
mov rsi, 3; // 3
add rdi, rsi; // +
add rax, rdi; // +
ret

Does that set of operations look familiar? It’s the same as the linearized traversal of the tree from one of the above paragraphs. This program doesn’t use any stack, only registers, and is Blazing Fast(tm). If you run out of registers because your abstract stack offset gets too high, then you can fall back to normal push and pop like the “cheating” example - or implement some dumb least-recently-used cache for the registers and insert spilling when you need values at abstract stack offsets that don’t map to registers, and you’ve accidentally re-created Linear Scan Register Allocation (which is a good thing).
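
Here’s a hedged sketch of the simpler fallback (no LRU cache): abstract stack slots that fit in regmap stay in registers, deeper slots spill to the real stack, and rcx/r11 act as scratch since they aren’t in the register map. The function name, again, is made up for this post:

// Slots 0..regmap.len() live in registers exactly as before; anything deeper
// lives on the machine stack. rcx and r11 are only scratch within one snippet,
// so nothing needs to survive in them across operations.
fn compile_spilling(e: &Expr, offset: usize, ops: &mut Assembler) -> usize {
    let nregs = regmap.len();
    match e {
        Expr::Num(u) => {
            if offset < nregs {
                dynasm!(ops
                    ; mov Rq(regmap[offset].num()), QWORD *u
                );
            } else {
                // no register left for this slot: spill to the real stack
                dynasm!(ops
                    ; mov rcx, QWORD *u
                    ; push rcx
                );
            }
            offset + 1 // abstract stack push
        },
        Expr::Add(left, right) => {
            let after_left = compile_spilling(left, offset, ops);
            let after_right = compile_spilling(right, after_left, ops);
            let pop_right = after_right - 1; // abstract stack pop
            let pop_left = pop_right - 1;    // abstract stack pop
            if pop_right < nregs {
                // both operands are in registers: identical to the non-spilling version
                dynasm!(ops
                    ; add Rq(regmap[pop_left].num()), Rq(regmap[pop_right].num())
                );
            } else if pop_left < nregs {
                // the right operand was spilled, the left is still in a register
                dynasm!(ops
                    ; pop rcx
                    ; add Rq(regmap[pop_left].num()), rcx
                );
            } else {
                // both operands were spilled; the result stays on the stack too
                dynasm!(ops
                    ; pop rcx  // right operand (pushed last)
                    ; pop r11  // left operand
                    ; add r11, rcx
                    ; push r11
                );
            }
            pop_left + 1 // abstract stack push
        },
    }
}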

This is a specific case of optimizing an interpreter when JITting, where you are able to pre-compute some unique location for each operation’s inputs and outputs and emit code tailored for those specific locations, instead of emitting extraneous operations shuffling data around unnecessarily - and it’s a case that isn’t quite touched on in the original BF JIT series of blog posts, which talks more about optimizing sequences of operations themselves via peephole optimizations, and then skips to using LLVM, which handles global register allocation and removes extraneous moves for entire functions, removing the need to do this style of optimization yourself. Most other JIT-related blog posts I’ve seen online also don’t handle anything more complex than BF, or implement anything more than the push/pop machinery.

A cute paper that extends this idea is “Copy-and-Patch Compilation: A fast compilation algorithm for high-level languages and bytecode” by Haoran Xu and Fredrik Kjolstad. It uses C++ to generate several different versions of the functions which implement bytecode operations, using templating to create at compile time “stencil” versions of each function which pad the argument and result values with dummy values. This essentially tricks the compiler into generating the various shifted assembly snippets itself, with the register used for each abstract stack index being the corresponding register in the function ABI.8

This is, in my head at least, also closely related to the “Delayed Code Generation in Smalltalk-80” paper, which not only allocates a unique location for values while doing code generation, but also delays emitting the code to evaluate those values until they are observed by some use, which itself knows where it would like the value to be put at the end - and thus can emit code that computes values in that desired destination location directly. This technique was used in the V8 JavaScript JIT for their one-pass code generation at one point in time: they had a JIT that was “being stupid”, they replaced it with something “not being stupid”, and they reaped some performance gains despite it being fairly conceptually simple. And, hopefully, now that you know this exists, you can too.


Footnotes:

  1. Instead of only running the magic box once, and then distributing the output program to people via unhinged gzip tarball infrastructure, which is the source of numerous headaches for everyone involved and leads to worse things, like learning NixOS. 

  2. I can’t confirm this though. If Google or Oracle people are reading this, feel free to prove me right by offering me $300k/yr. 

  3. I of course am a fool and have another side-project that basically does exactly this. It probably isn’t a good idea, but it’s also kind of a fun idea, which is what really matters. 

  4. The sequel of the blog post I just made you read has a very nice gradual introduction to what kind of optimizations you can do with a JIT compiler, if you’re interested. 

  5. Technically, a post-order traversal, since you record operations in the order they complete: you want 0 and 1 to be recorded before the + that uses their results. 

  6. In loving reference to the paper for the Oberon-0 compiler, section 10.1, which uses this technique. 

  7. These sound like they were picked out of a hat, but are fairly intentional: the System V ABI that we will be telling Rust the resulting function follows has a list of registers you are allowed to overwrite and a set of registers you aren’t, and all of these registers are allowed to be overwritten. Additionally, rax is the register for the output of our function: since we always evaluate to one register-sized value, we make the first abstract stack slot be that register, so running the expression leaves the result where Rust expects the output to be. 

  8. I tried to steal, er, lightly adapt the paper to a Rust implementation, and hit Several Problems. They use a continuation-passing-style method of having the stencils fit together, with the functions ending in a tailcall to the next operation (which they then patch via ELF object crimes when manually stitching them together in the JIT code buffer). Rust doesn’t have guaranteed tailcalls, unlike C, however - in fact, Rust, even in the extremely trivial cases where anyone with half a brain could look at the source code and reason “oh yeah you could tailcall here”, will not, in fact, have LLVM emit tailcalls for functions. Which means you can’t really stitch together a bunch of function bodies in Rust, because each function will be creating stack frames and only tearing down those stack frames after the next function returns, and basically it’s a giant mess. Rust also doesn’t even expose the GHC ABI that Copy-and-Patch uses to get Bonus Registers from the function ABI: the closest thing is an "unadjusted" ABI that has absolutely zero documentation for what it does or why it exists, and also a big unstable feature warning saying “DO NOT USE THIS EVER”, which I of course ignored in the course of the experiment. But it doesn’t have the GHC ABI, which is kinda important for tricking the compiler into giving you good codegen from the abstract stack offset stencils.