Binary Coverage the Wrong Way

The Right Way

Way back in the age of the dinosaurs, if you wrote a program and wanted to test that it was correct, you mostly had two choices: you could either manually construct malformed inputs and try them in succession, or hook up a program which would randomly generate inputs and try them automatically ¹. Those programs were called fuzzers, or sometimes generators, and this was “fuzz testing”. It also for quite a long time wasn’t very good: the generator would have no feedback from the program under test except “did this random input crash or not”. This feedback, although useful, doesn’t allow the fuzzer to tell if it is making progress relative to the last random input or doing a good job, and so your fuzzer would just repeatedly hit the same shallow code. You’d leave it running for a week and then when you come back realize it was never generating inputs that would hit interesting parts of your program, and they were all immediately being thrown out or never progressing past a if(packet_header == 0xdeadbeef) check.

American Fuzzy Lop revolutionized the field by introducing coverage-guided fuzzing. Instead of having the only feedback to your fuzzer be whether an input crashed or not, it also recoded what code coverage it hit over the course of a program execution; if an input executes a piece of the program that it never reached before, then it hit new code, and probably is now able to also test some new interesting surface area of the program which might have some bugs. This was immensely more effective than “dumb fuzzing” of yesteryear.

AFL collects coverage for programs by compiling the program under test with a custom clang compiler pass, which adds coverage feedback on CFG edges: if you hit a new basic block in the code, you get a new coverage bitmap entry set². If running the program on an input sets results in a bitmap which has new bits set than previously, it hit new code; that input can be saved to use as the seed input to mutate for future executions.

You can also do feedback based fuzzing for blackbox binaries that you don’t have the source code to. AFL++ for example has a qemu_mode backend: it executes the program under test with qemu-user, which will lift your x86 assembly to QEMU’s TCG IR and then lower it back with its JIT compiler to x86, but now which can safely execute under its type-2 hypervisor environment. AFL++ has a fork of QEMU which adds a custom TCG hook that adds coverage instrumentation to blocks in the middle-end; the IR for each block is extended with TCG operations that write a coverage bitmap entry the same way that AFL enhances clang IR with extra LLVM operations. Despite sounding like a lot of work, this qemu_mode is only like ~3-5x slower than running a program natively.

The gold standard for blackbox binary coverage is to let hardware do it for you. Instead of needing to add extra operations to the program under test, which need to be executed when you run a fuzz testcase, you can instead use Intel PT. Enabling certain performance counter configuration registers will start processor tracing: your CPU will then automatically record metadata as part of its execution pipeline, and the fuzzer can look at the resulting metadata to pull out a trace of what basic blocks it executed. This is unreasonably effective, with Intel PT based tracing only being something like ~10% slower than normal native execution.

The issue, of course, is that if you’re like me your laptop has an AMD processor and not Intel. AMD have their own magic performance counting features, including a branch trace buffer that records what code is being executed, but its 1) much worse documented 2) much slower than Intel PT 3) impossible to use in a precise manner - that is, the counters are either sampling based and so only give you 1/n events, or may end up dropping events if a metadata buffer fill up before you’re able to drain it. In either case that makes it sadly unsuitable for fuzzing (you technically could build a fuzzer on top of merely stochastic coverage, and it might even not be that bad. But in practice no one has done so, and it’s probably not worth it.)

Which brings me to my bad idea. I wanted to write my own fuzzer for a side-project, and so wanted to collect precise blackbox binary coverage. But didn’t want to write my own dynamic recompilation engine (again), or buy a new laptop. What is a software engineer to do?

The Wrong Way

We can instead instrument the guest code by replacing all the branches with INT3 instructions. Wait come back I can explain.

In order to collect precise coverage we need the program to do something whenever it hits a branch. We can’t insert new behavior in each block, because it already is using all of the bytes in the block for its normal instructions: trying to resize blocks requires fixing up code pointers, and then you’re just back at QEMU-but-bad. The minimal length of a branch instruction is two bytes long: Jcc has a form which is only a single opcode byte plus a displacement to branch with. INT3 is only one instruction long, hijacks control flow in a way that doesn’t clobber GPRs which would prevent continuing execution after the instruction, and gives us the location of the branch that we overwrite as context so we can compute the coverage bitmap entry that we want to record in our coverage map. It’s perfect!

Unfortunately, if you tried to do this on Linux with ptrace you’d probably want to die. Linux will turn hit INT3 into a SIGTRAP. That can be caught by a ptrace based debugger that is supervising your process. That signal will include the tracee’s PC and register state, which you could use to implement the coverage hook, and then PTRACE_CONT to continue the process. It’s just that this entire dance would be several context switches: guest program in ring 3 -> INT3 trap to x86 interrupt handler in ring 0 -> Linux kernel delivers a ptrace event in your debugger in ring 3 -> your debugger PTRACE_CONT via syscall to ring 0 -> Linux kernel syscall implementation resumes the guest in ring 3. Each of those context switches are very expensive, especially because context switching between privilege levels requires even more work from your processor than same-ring context switches. Oh, if only there was another way!

The answer, of course, is virtualization extensions. Your computer can run a smaller computer inside of it through the magical power of Intel VT-x (or, as we call it on Linux, kvm). The code inside the smaller computer thinks it is running completely bare metal: that guest code can configure interrupt handlers and MSRs and paging however it wants, just like a normal kernel, and even run userspace code under the guest kernel ³. All while, in reality, the guest kernel is being ran in a virtualized machine hosted by your normal Linux kernel; you can configure it so that when the guest kernel tries to do I/O to the real world, they actually trap to the host kernel, which turns them into a userspace event that the “virtual machine manager” can implement and then resume the virtual machine with the result of. Crucially, while the virtual machine is executing, it is executing bare metal - your processor is actually running all of the instructions for the guest VM, and they all use the normal processor state like general purpose registers and hardware page tables, just with some extra bits that say to also do a hypervisor translation for the physical memory so it can’t reach outside the bounds of the VM and things. The important thing is that it means the VM runs at basically 100% native speed, except for when it needs to VMExit back to the host kernel because the guest kernel tried to do something that traps to the hypervisor. It turns out this works so well, in fact, that basically the entire internet runs on top of kvm in the form of cloud based hypervisors like Firecracker.

TinyKVM is a neat project by the author of Varnish⁴. It implements a VMM as a C++ library, which you can use to run arbitrary Linux binaries: it sets up a minimal guest VM with the required x86 processor state to execute usermode code, with a tiny guest kernel that has interrupts which mostly just trap to the VMM in the host. That means that you can run a guest binary in the VM, and when it does a syscall like write it bubbles up to a VMExit, the host VMM sees that it did a syscall with rax=1 and all the other GPRs for arguments, and then do its own write syscall with the real FD and arguments on the host.

This doesn’t automatically lets us solve the issue of too many context switches for the INT3 hooks. But because the guest kernel is a full kernel, it can configure the interrupt handle that INT3 context switches to however it wants - and I was able to fork tinykvm into something I call TinyCOV, which hacks up the guest kernel a bit. x86 actually has the privilege level for interrupts configured as part of the IDT data structure describing the interrupt gate! That means that we can have INT3 trap from the guest userspace code to the guest kernel interrupt handler but stay in ring 3, and even put our coverage hook implementation directly in the guest “kernel” code to be executed before returning from the interrupt back to continue execution. This cuts down each INT3 coverage hook down to only two context switches, to and from the guest interrupt routine, and zero privilege level changes. And it means we never even need to VMExit either, because the guest kernel is able to handle the interrupt entirely itself instead of needing the VMM to emulate some privileged operation.

Because we overwrite the branch itself with an INT3, just returning from the interrupt back to where we trapped doesn’t actually work and we’d end up skipping the actual branch operation. Since INT3 is only one byte but the minimum length of a branch is two however, we can use the second byte to record some metadata for the branch hook: we can store an index there, and then use the index plus the offset of the INT3 into the page to uniquely identify the location, and then JIT compile the same branch instruction that we overwrote (but with some jump instructions for either side of the branch back to the original successor instructions). Handling dynamic jumps, like JNZ rax is a bit trickier, but still possible.

Since we still have the ability to VMExit from the kernel back to the VMM if we want, we also are able to leverage that in order to implement code discovery: tinycov hooks branch instructions a block at a time, starting from the binary entrypoint. It writes a sentinel value as the metadata byte for the INT3 the first time it hooks a branch, and then in the guest kernel traps to the hypervisor if it sees the coverage hook still has the sentinel value and thus its the first time it’s ran. The VMM is then able to disassemble the successors of the basic block and hook their branches, pushing the code discovery frontier forward, and replace the sentinel value with the correct trampoline index value for the JIT code so that future hits of the branch can stay entirely in the guest kernel.

Results

Despite how scuffed all of this sounds, it does actually work: I’m able to run random binaries under tinycov and collect a coverage bitmap of code that was hit, plus a mode to instead generate a full program trace. Because it works by hijacking control flow entirely, we can even put arbitrary code in the INT3 kernel handler: I was able to implement CmpCOV style dictionary building by recording if a branch ends in a cmp instruction and recording the register operand value, since the kernel interrupt handler (and the VMM VMExit handler) have access to the full set of GPRs from the program under test synchronously, which you can’t quite do with Intel PT style asynchronous coverage traces.

That said, how fast is tinycov compared to QEMU? The answer is, uh, really slow. I ran some benchmarks, and it’s like 700% slower - and worse on branch heavy code like bzip4, which will be taking more INT3 handlers per instruction ran. CPU out-of-order pipelines really don’t like taking an interrupt every few instructions, even if the context switch stays at the same privilege level. I am happy that the idea of using userspace interrupt handlers under kvm works; I did some benchmarks and it’s like 50% faster than using “proper” ring 3 -> ring 0 interrupts instead! But even 50% faster than doing it the slower way is not enough for this to really be usable when QEMU is so much faster.

Technically, there are some neat things that this would enable you to do which you can’t with QEMU or Intel PT based coverage. For one, because we’re trapping to a single coverage hook which is actually implemented in native code, it’s really easy to add custom coverage hooks: for some fuzzing targets you want to introduce some extra state into the coverage bitmap calculation in order to get the fuzzer to treat identical control flow traces as new program states to continue exploring from, like if you’re fuzzing a language runtime with a virtual PC or some other data-driven loop. If you want to do this with QEMU you have to either write your own TCG hook (hard), or set code callbacks which end up taking a software-VMExit from the JIT code (slow).

Because we’re hooking code in place we also have the ability to unhook code, which I think you could do something cool with. With QEMU the codegen will emit direct branches between JIT blocks to be fast, but the instrumentation being emitted in TCG and thus part of the translated block instructions means that you can’t remove the hook in a block afterwards as you’ll still have those incoming edges. With tinycov we could remove the coverage hook from branches which we decided “aren’t worth it”, or that we have seen both the true and false exits for and so know that it can’t ever lead to new code - meaning that those branches can be rewritten back to their original bytes, and so execute at literally native speed again.

In practice, I think this is mostly just a toy and cute idea, although I am pleased that it works at all. I’ll probably try plugging it in to be used as the coverage collection implementation for LibAFL just to see how much worse it does than QEMU as part of a full fuzzer, but unfortunately no one should really try to use this in production. It was pretty fun to get working though, and I was pleasantly surprised how easy it was to hack up tinykvm to support the functionality I wanted. Any excuse to do very cursed x86 kernel dev things in a side-project, and staying small enough to actually finish, is a win in my books.

Footnote:

Some people would tell you there was a third option, which was “formally prove your program”. They’re lying to you. ↩
In reality it’s a bit more complicated; most fuzzers now actually use N-gram coverage, where you combine the last N branches together to calculate the bitmap entry to set in order to expose more path-dependent information. Having some weird coverage metric is still, to a first approximation, recording “what code was hit” though, so the distinction doesn’t matter that much - and if you have some way of recording branch history, you can probably also compute N-gram coverage instead. ↩
In fact, if you’re using Windows 11, the “bare metal” Windows kernel you think you’re running has actually been running as a virtualized guest this entire time. Isn’t that kind of neat? ↩
Yes, that Varnish. The HTTP cache. ↩