The fastest Intel-PT decoder for fuzzing
libxdc (eXtremely fast DeCoder) aims to be the best Intel-PT decoding library for fuzzing purposes. It was designed to power various of our binary-only fuzzers such as kAFL, Redqueen, Grimoire and Nyx. We heavily optimized the library towards decoding similar traces over and over again. In contrast to other options with similar goals, we decode the full information to yield precise coverage information. Thanks to the magic of computed gotos, struct of arrays, heavy caching, branchless code and tons of unreasonable micro optimizations, it is by far the fastest Intel-PT decoder we found.
Using libxdc is rather simple:
page_cache_t* page_cache = page_cache_new(page_cache_file); void* bitmap = malloc(0x10000); libxdc_t* decoder = libxdc_init(filter, &page_cache_fetch, page_cache, bitmap, 0x10000); ret = libxdc_decode(decoder, trace, trace_size);
To create a a decoder you use
libxdc_t* libxdc_init(uint64_t filter, void* (*page_cache_fetch_fptr)(void*, uint64_t, bool*), void* page_cache_fetch_opaque, void* bitmap_ptr, size_t bitmap_size). This function takes a set of 4
filterranges as you configured them during tracing. To disassemble the target, a callback
page_cache_fetch_fptrhas to be provided that allows libxdc to request memory. To help with this task, a opaque
void*can be passed to
libxdc_initthat will be passed to this callback. In our fuzzers we use this callback to access the targets memory. This can also be used to deal with dynamically generated code such as manually loaded libraries, swapped out pages or unpackers. If the callback requests a page that is not currently available in our memory snapshot, we place a hardware breakpoint on the address that was requested for disassembly. Then we perform a second execution without tracing, dumping the page to the cache when the instruction is executed.
Lastly, a pointer to the bitmap and it's size needs to be provided. After we obtained a decoder object, we can use
libxdc_decodeto decode a trace. The bitmap will be filled according to the trace by the decoder. Between individual test cases, the callee is responsible for resetting the bitmap to 0 after each test case.
To obtain more information on the decoding process you can register additional callbacks:
You can register a callback that is triggered for each new basic block found (
libxdc_register_bb_callback). This can be used to perform more in depth analysis of the code that is being executed.
By adding an edge callback (
libxdc_register_edge_callback) and enabling trace mode (
libxdc_enable_tracing) you can perform a decoding where each edge taken is passed back to you for inspection. Manually enabling/disabling the trace mode is necessary, since in contrast to the basic block callback this slows down the decoding process.
You need to set at least one of the filter ranges. They need to be set to the same values used during the intel-pt tracing. If you do not configure the right ranges you will see an assertion
count_tnt(self->tnt_cache_state) != 0.
You need to append a single byte 0x55 to the trace, but excluded it from the trace size. This byte is used to mark the end of the input and allows us to increase the performance during decoding. If you do not do this, an assertion is raised.
This decoder expects that return compression and generation of CYC, TSC and MTC packets are disabled. You can disable those options if you are using perf. KVM-PT disables them by default. To manually disable return compression set bit 11 (
IA32_RTIT_CTLMSR to 1. CYC, MTC and TSC packet generation can be disabled if bit 1 (
CYCEn), bit 9 (
MTCEn) and bit 10 (
TSCEn) are set to 0.
Extraordinary claims require extraordinary evidence. As such, we of course perform a set of entirely unfair and unscientific experiments to substantiate our performance claims. To check the decoding/disassembly performance of various available options for dealing with Intel PT data we setup an experiment in which we picked about a GB of traces from fuzzing various targets. We extracted various alternative approaches from other fuzzers such as WinAFL, Honggfuzz, PTrix and Killerbeez to create a controlled evaluation setup. Most of these tools only consider a subset of the data for performance reasons. It should be noted that some of these tools claim performance improvements over full decoders such as our approach used in kAFL. Notably, PTrix evaluated against various alternatives, but measured only a single execution on cold caches. As we shall see, while these tools provide data that is far less useful for fuzzing, they do not actually provide performance advantages.
We compare five alternatives to libxdc:
libipt, the original decoding library provided by Intel. libipt can fully decoding all information available in Intel-PT data streams.
WinAFL-libipt, a version of libipt with an additional caching layer introduced by Ivan Fratric for use during fuzzing in WinAFL. Like libipt, this approach is decoding full edge coverage.
PTrix, to increase performance, the authors of PTrix chose not to recover full edge coverage. Instead, individual runs of taken/not taken decisions are hashed together. This way, PTrix avoids to disassemble the target, but it will result in significantly noisier coverage.
Hongfuzz: Similarly to PTrix, honggfuzz avoids disassembling the stream and instead only considers the contend of the TIP packages. It should be noted that it seems like the authors ignored TIP compression and hence cause a significant amount of collisions in the bitmap.
The last option, a Intel PT "decoder" used by Killerbeez takes this idea even further: All of the data is simply hashed in two distinct large streams. Any change to the path taken, no matter how small will result in a completely different hash. We have strong reason to suspect that this approach is almost useless for fuzzing, as it will generate a very large number of useless queue entries. However, for the sake of completeness, we are including it in our comparison.
We should mention that we did not compare against PTFuzz, as the authors used the outdated kAFL code base that libxdc is also based on. As an amusing Sidenote: the authors of PTFuzz chose to not mention this fact in the paper, instead claiming that they developed their own decoder.
We used these libraries to decode a few thousand iterations of Intel-PT data resulting from fuzzing some random targets that we used for debugging purposes and measured the time used.
As you can see in this entirely unscientific benchmark, libxdc is about 15x-30x faster than libipt and quite a bit faster than all other options that are returning a coverage bitmap. Only the last option (Killerbeez) manages to perform similar, while producing vastly less useful results.
The bars show the average slowdown over the best decoder. The error bars represent the best and the worst performance across 3 runs each. The experiment setup and data can be found here
libxdc depends on capstone v4. Unfortunately, many distributions don't contain this version in their package management. To install libcapstone v4:
git clone https://github.com/aquynh/capstone.git cd capstone git checkout v4 make sudo make install
If libcapstone v4 is installed, you can install libxdc:
git clone https://github.com/nyx-fuzz/libxdc.git cd libxdc make install
This builds both a .so and a static .a version of the library. For performance reasons we recommend to use the static version IF you are able to use link time optimizations during the compilation of your tool.
Should you find a bug in libxdc's decoding and need help fixing it, please make sure that the report includes the target memory dump and the trace dump. If you found and fixed a bug on your own: We are very open to patches, please create a pull request!
The library is provided under MIT license.
Free Software Hell Yeah!