Need help with stoke?
Click the “chat” button below for chat support from the developer who created it, or find similar developers for support.

About the developer

596 Stars 60 Forks Other 3.5K Commits 60 Opened issues


STOKE: A stochastic superoptimizer and program synthesizer

Services available


Need anything else?

Contributors list

Build Status


STOKE is a stochastic optimizer and program synthesizer for the x86-64 instruction set. STOKE uses random search to explore the extremely high-dimensional space of all possible program transformations. Although any one random transformation is unlikely to produce a code sequence that is desirable, the repeated application of millions of transformations is sufficient to produce novel and non-obvious code sequences. STOKE can be used in many different scenarios, such as optimizing code for performance or size, synthesizing an implementation from scratch or to trade accuracy of floating point computations for performance. As a superoptimizer, STOKE has been shown to outperform the code produced by general-purpose and domain-specific compilers, and in some cases expert hand-written code.

In addition to searching over programs, STOKE contains verification infrastructure to show the equivalence between x86-64 programs. STOKE can consider test-cases, perform bounded verification all the way to fully formal verification that shows the equivalence for all possible inputs.

STOKE has appeared in a number of publications. For a thorough introduction to the design of STOKE, see:

Additionally, the work Semantic Program Alignment for Equivalence Checking (PLDI 2019), was developed from this codebase. The fork is available here.

Status of this Work

STOKE isn't production ready. It's a research prototype that demonstrates the viability of superoptimization techniques in various domains. It's not a general-purpose tool. The papers above describe specific areas where successes have been shown beyond the state of the art: in optimizing straight line code, code where correctness can be relaxed (e.g. floating point), synthesizing sematic specifications for an instruction set, and in optimizing code containing loops with special compilation requirements (e.g. Google Native Client). We're not quite at the point where we can take a generic loop and expect to improve gcc/llvm -O3 code. In part, this is because these compilers have decades of work behind them to make them really great.

At this point, nobody is actively working on this code base. This repository now serves as an artifact for several of the above research papers. We will accept pull requests and answer inquiries as time permits.

Table of Contents

  1. Hardware Prerequisites
  2. Using Docker
  3. Downloading and Building STOKE
  4. Using STOKE
    1. Running Example
    2. Compiling and Disassembling Your Code
    3. Test Case Generation
    4. Final Configuration
    5. Starting STOKE
    6. Rewriting the Binary
    7. Using the formal validator
  5. Additional Features
  6. User FAQ
  7. Developer FAQ
  8. Extending STOKE
    1. Code Organization
    2. Gadgets
    3. Initial Search State
    4. Search Transformations
    5. Cost Function
    6. Live-out Error
    7. Verification Strategy
    8. Command Line Args
  9. Contact

Hardware Prerequisites

STOKE will run on modern 64-bit x86 processors. It will run best on Haswell or newer machines, but it can also run okay on Sandy Bridge. With Sandy Bridge processors, there won't be support for AVX2 instructions.

It should run on newer architectures, but we haven't tested it. However, stoke only supports a subset of instructions that were widely available when it was initially developed. As a result, targets generated by newer compilers might not work with STOKE. (Adding support for an instruction mostly involves adding a spreadsheet entry in the x64asm project, and optionally adding validator support).

Using Docker

STOKE has many dependencies and we think the best way to get up-and-running with a development environment is to use docker. Simply:

$ sudo docker pull stanfordpl/stoke:latest

These docker images run an SSH server. We recommend starting the image like so:

$ sudo docker run -d -P --name stoke stanfordpl/stoke:ARCH

then one can SSH as follows:

$ sudo docker port stoke 22

$ ssh -pXXXXX [email protected] (password is 'stoke')


Note that there are other docker images from other travis-ci builds available in the stanfordpl/stoke-test repository. These should be available for recent branches and pull requests, for example.

You can build your own docker images by running

docker build .
in the top level of this repository. These are built upon the
image, which contains compiled versions of Z3 and CVC4. If you want to upgrade Z3 or CVC4, it will require rebuilding these imsages. The
dockerfile may be used for this purpose (but it's not part of continuous integration, so it may require some manual work to get it to happen).

Downloading and Building STOKE

STOKE should work on Ubuntu 14.04 and Ubuntu 16.04. Regardless of distribution, the key to making stoke right is using gcc version 4.9. Below that, the compiler doesn't support enough features to build our code. Above that, there are some issues with an ABI change in gcc-5.

$ sudo apt-get install bison ccache cmake doxygen exuberant-ctags flex g++-4.9 g++-multilib gcc-4.9 ghc git libantlr3c-dev libboost-dev libboost-filesystem-dev libboost-thread-dev libcln-dev libghc-regex-compat-dev libghc-regex-tdfa-dev libghc-split-dev libjsoncpp-dev python subversion libiml-dev libgmp-dev libboost-regex-dev autoconf libtool antlr pccts pkg-config

Note that your distribution might not have g++-4.9 by default. You may consider installing a PPA as described here:

The rest of the dependencies will be fetched automatically as part of the build process.

The entire STOKE code base is available on GitHub under the Apache Software License version 2.0 at To check it out, type:

$ git clone

This will check out the default

branch. Unless you are looking for a specific version or modification of STOKE, this is the branch to use. It contains all the latest changes and is reasonably stable. This branch is supposed to always pass all tests.

See the previous sections on how to download STOKE, a list of dependencies, and to check your hardware support level. The remainder of STOKE's software dependencies are available on GitHub and will be downloaded automatically the first time that STOKE is built. When you build STOKE the first time, run

$ ./

This will figure out the correct build parameters (such as the platform). To build STOKE, run

$ make

To add STOKE and its related components to your path, type:

$ export PATH=$PATH://bin

To run the tests, choose the appropriate command:

$ make test

The files generated during the build process can be deleted by typing:

$ make clean

To delete STOKE's dependencies as well (this is useful if an error occurs during the first build), type:

$ make dist_clean


The following toy example shows a typical workflow for using STOKE. All of the following code can be found in the

directory. As this code is tested using our continuous integration system, the code there will always be up-to-date, but this README can fall behind.

Running Example

Consider a C++ program that repeatedly counts the number of bits (population count) in the 64-bit representation of an integer. (Keeping track of a running sum prevents

from eliminating the calls to

#include #include <stddef.h> #include <stdint.h>

using namespace std;

size_t popcnt(uint64_t x) { int res = 0; for ( ; x > 0; x >>= 1 ) { res += x & 0x1ull; } return res; }

int main(int argc, char** argv) { const auto itr = atoi(argv[1]);

auto ret = 0; for ( auto i = 0; i < itr; ++i ) { ret += popcnt(i); }

return ret; } </stdint.h></stddef.h>

Compiling and Disassembling your Code

STOKE is a compiler and programming language agnostic optimization tool. It can be applied to any x86-64 ELF binary. Although this example uses the GNU toolchain, nothing prevents the use of other tools. To build this code with full optimizations, type:

$ g++ -std=c++11 -O3 -fno-inline

Note that turning on optimizations (at least -O1) is helpful to remove unneeded stack operations, and gives STOKE a better program to start from. Usually, if using STOKE for optimization, starting with a better program often results in a better program. To measure runtime, type:

$ time ./a.out 100000000

real 0m1.046s user 0m1.047s sys 0m0.000s

A profiler will reveal that the runtime of

is dominated by calls to the
function. STOKE can be used to improve the implementation of this function as follows. The first step is to disassemble the program by typing:
$ stoke extract -i ./a.out -o bins

This will produce a directory named

that contains the text of every function contained in the binary

Help for stoke or any of its subcommands can be obtained by typing:

$ stoke -h
$ stoke  -h

STOKE can accept arguments either through the command line or through a configuration file. The invocation of

stoke extract
shown above is equivalent to the following:
$ stoke extract --config extract.conf


##### stoke extract config file

-i ./a.out # Path to the elf binary to disassemble -o bins # Path to the directory to store disassembled text in

Every STOKE subcommand can be used to generate example configuration files by typing:

$ stoke  --example_config 

was compiled using
, the text of the
function will appear under the mangled name
  .globl _Z6popcntm
  .type _Z6popcntm, @function
  xorl   %eax,%eax
  testq  %rdi,%rdi
  je     .L_4005b0
  movq   %rdi,%rdx
  andl   $0x1,%edx
  addq   %rdx,%rax
  shrq   $0x1,%rdi
  jne    .L_4005a0
  .size _Z6popcntm, .-_Z6popcntm

Test Case Generation

The next step is to generate a set of testcases for guiding STOKE's search procedure. There are a few ways of genrating testcases:

  1. Random generation + backtracking
  2. Symbolic Execution + random search
  3. Custom test case generator
  4. Dynamically recording execution data from a sample program

Option 1 is the easiest to start with, but can be limitted. It reliably works if there are no branches or instructions that trigger exceptions (like division). It tends to have trouble if the input code has both control flow branches and memory dereferences. To give this a try, one can run:

stoke testcase --target bins/_Z6popcntm.s --max_testcases 1024 -o

It will generate 1024 random test cases to standard output and save them in
. It will put random values in registers, and then try to fill in dereferenced memory locations with random values.

Option 2 takes more compute time than option 1, and does well in different circumstances. It uses STOKE's formal verification tools to symbolically execute the code on paths up to a certain bound, generating a few test cases for each path. It uses random search to produce extra test cases beyond these. It's good for exercising corner cases in code. It tends to do poorly in cases where (i) a loop executes a fixed, large number of iterations, meaning there are no short paths through the program; (ii) where there's an exponential number of paths; and (iii) when there are a lot of memory dereferences and the bound is high. In the case of the tutorial, a bound of 64 is needed to exercise all the relevant program paths, but the tool handles this:

stoke_tcgen --target bins/_Z6popcntm.s --bound 64 --output

Option 3 means writing code to generate your own test cases for your problem. This gives you the most versitility and can be used in almost any situation. Often you can use a combination of domain knowledge and randomness to create test cases that thoroughly explore paths through the program, especially paths involving long-running loops. Combining option 3 and option 2 is often a powerful combination. For an example of this, see the code in

in the
branch. This code generates random test cases for a particular set of benchmarks that need arrays of fixed length and some static read-only data.

Lastly, option 4, dynamic instrumentation of a program, offers the flexibility of option 3 and (ideally) is easier to use. Right now, unfortunately, the tools are a bit buggy (see #971). When the tool is working, these can be obtained by typing:

$ stoke testcase --config testcase.conf


##### stoke testcase config file

--bin ./a.out # The name of the binary to use to generate testcases --args 10000000 # Command line arguments that should be passed to ./a.out --functions bins # Disassembly directory created by stoke extract

-o tcs/_Z6popcntm # Path to file to write testcases to

--fxn _Z6popcntm # The name of the function to generate testcases for --max_testcases 1024 # The maximum number of testcases to generate.

The resulting file will contain 1024 entires, all of the form:

Testcase 0:

%rax 00 00 00 00 00 98 96 80 %rcx 00 00 00 00 00 00 00 00 %rdx 00 00 00 00 00 00 00 0a %rbx 00 00 00 00 00 00 00 01 %rsp 00 00 7f ff 97 44 36 28 %rbp 00 00 00 00 00 00 00 00 %rsi 19 99 99 99 99 99 99 99 %rdi 00 00 00 00 00 00 00 00 %r8 00 00 2a c9 68 1a 50 40 %r9 00 00 7f ff 97 44 46 01 %r10 00 00 00 00 00 98 96 80 %r11 00 00 00 00 00 00 00 0a %r12 00 00 00 00 00 98 96 80 %r13 00 00 7f ff 97 44 37 20 %r14 00 00 00 00 00 00 00 00 %r15 00 00 00 00 00 00 00 00

%ymm0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ff 00 00 %ymm1 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 2f 2f 2f 2f 2f 2f 2f 2f 2f 2f 2f 2f 2f 2f 2f 2f %ymm2 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 %ymm3 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ff 00 00 00 00 00 00 00 ff %ymm4 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 %ymm5 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 %ymm6 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 %ymm7 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 %ymm8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 %ymm9 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 %ymm10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 %ymm11 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 %ymm12 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 %ymm13 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 %ymm14 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 %ymm15 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

%cf 0 %1 1 %pf 1 %0 0 %af 0 %0 0 %zf 0 %sf 0 %tf 0 %if 1 %df 0 %of 0 %iopl[0] 0 %iopl[1] 0 %nt 0 %0 0 %rf 0 %vm 0 %ac 0 %vif 0 %vip 0 %id 0

[ 00007fff 97443630 - 00007fff 97443620 ] [ 1 valid rows shown ]

00007fff 97443628 d d d d d d d d 00 00 00 00 00 40 04 6c

[ 00000000 00000000 - 00000000 00000000 ] [ 0 valid rows shown ]

[ 00000000 00000000 - 00000000 00000000 ] [ 0 valid rows shown ]

0 more segment(s)

Each entry corresponds to the hardware state that was observed just prior to an execution of the

function. The first 60 rows represent the contents of general purpose, sse, and eflags registers, and the remaining rows represent the contents of memory, both on the stack and the heap. Memory is shown eight bytes at a time, where a block of eight bytes appears only if the target dereferenced at least one of those bytes. Each row contains values and state flags. Bytes are flagged as either (v)alid (the target dereferenced this byte), or (.)invalid (the target did not dereference this byte).

Final Configuration

Each of the random transformations performed by STOKE are evaluated with respect to the contents of this file. Rewrites are compiled into a sandbox and executed beginning from the machine state represented by each entry. Rewrites are only permitted to dereference defined locations. This includes registers that are flagged as

, below), memory locations that are flagged as 'v', or locations that were written previously. Rewrites are permitted to write values to all registers and to any memory location that is flagged as valid.

STOKE will produce optimal code that works on the testcases. The testcases need to be selected to help ensure that STOKE doesn't produce an incorrect rewrite. In our file in

we choose arguments to the
function to make sure that it sometimes provides arguments that use more than 32 bits. Otherwise, STOKE will sometimes produce a rewrite using the
instruction, which only operates on the bottom half of the register, instead of the
instruction, which operates on the whole thing. Alternatively you can use the formal validator in bounded mode with a large bound (over 32). This large bound is tractable because this example doesn't has a small number of cases for memory aliasing (namely, none at all!). If a counterexample is found it can be automatically added to the search so STOKE won't make this mistake again.

The STOKE sandbox will safely halt the execution of rewrites that perform undefined behavior. This includes leaving registers in a state that violates the x86-64 callee-save ABI, dereferencing invalid memory, performing a computation that results in a floating-point exception, or becoming trapped in a loop that performs more than

, below).

Starting STOKE

The final step is to use these testcases and the target code contained in

to run STOKE search in synthesis mode (i.e., trying to find a program starting from the empty program) by typing:
$ stoke synthesize --config synthesize.conf


##### stoke search config file

--out result.s # Path to write results to

--target bins/_Z6popcntm.s # Path to the function to optimize

--def_in "{ %rdi }" # The registers that are defined on entry to the target --live_out "{ %rax }" # The registers that are live on exit from the target

--testcases # Path to testcase file --training_set "{ 0 ... 7 }" # Testcases to use for measuring correctness during search --test_set "{ 8 ... 1023 }" # Testcases to use as holdout set for checking correctness

--distance hamming # Metric for measuring error between live-outs --misalign_penalty 1 # Penalty for results that appear in the wrong location --reduction sum # Method for summing errors across testcases --sig_penalty 9999 # Score to assign to rewrites that produce non-zero signals

--cost "correctness + latency" # Measure performance by summing instruction latencies

--global_swap_mass 0 # Proposal mass --instruction_mass 1 # Proposal mass --local_swap_mass 1 # Proposal mass --opcode_mass 1 # Proposal mass --operand_mass 1 # Proposal mass --rotate_mass 0 # Proposal mass

--beta 1 # Search annealing constant --initial_instruction_number 5 # The number of nops to start with

--statistics_interval 100000 # Print statistics every 100k proposals --timeout_iterations 16000000 # Propose 16m modifications total before giving up --cycle_timeout 1000000 # Try 1m modifications before restarting

--strategy hold_out # Verify results using a larger hold out testcase set

STOKE search will produce two types of status messages. Progress update messages will be printed whenever STOKE discovers a new lowest cost verified or unverified rewrite. The code shown on the left is not equivalent to the target code; the code shown on the right is with respect to the current set of testcases.

Progress Update: 

Lowest Cost Discovered (9) Lowest Known Correct Cost (15)

btrq $0xffffffffffffffc0, %rdi testq %rdi, %rdi
retq je .L_X64ASM_0
xorl %eax, %eax
movl %edi, %edx
andl $0x1, %edx
addl %edx, %eax
shrq $0x1, %rdi
jne .L_X64ASM_1
xorl %eax, %eax

Statistics updates will be printed every

proposals. Statistics are shown for the number of proposals that have taken place, elapsed time, proposal throughput, and for each of the transformations specified to have non-zero mass in
Statistics Update: 

Iterations: 100000 Elapsed Time: 0.0836948s Iterations/s: 1.19482e+06

Move Type Proposed Succeeded Accepted

Instruction 16.791% 5.83% 2.009%
Opcode 16.646% 8.857% 4.013%
Operand 16.593% 10.444% 6.864%
Rotate 16.611% 0.791% 0.789%
Local Swap 16.597% 1.556% 1.128%
Global Swap 16.762% 7.066% 6.08%
Extension 0% 0% 0%

Total 100% 34.544% 20.883%

When search has run to completion, STOKE will write the lowest cost verified rewrite that it discovered to

. Because this is a particularly simple example, STOKE is almost guaranteed to produce the optimal rewrite:
  .globl _Z6popcntm
  .type _Z6popcntm @function
  popcnt %rdi, %rax
  .size _Z6popcntm, .-_Z6popcntm

Rewriting the Binary

This result can then be patched back into the original binary by typing:

$ stoke replace --config replace.conf


##### stoke replace config file

-i ./a.out # Path to the elf binary to patch --rewrite result.s # Path to the replacement function

And runtime can once again be measured by typing:

$ time ./a.out 100000000

real 0m0.133s user 0m0.109s sys 0m0.000s

As expected, the results are close to an order of magnitude faster than the original.

Using the Formal Validator

STOKE includes a formal validator. It's design and interface are detailed in the

file. To use the formal validator instead of hold out testing, specify
--strategy bounded
for any STOKE binary that you use. For code with loops, all paths will be explored up to a certain depth, specified using the --bound argument, which defaults to 2. There's also
--strategy ddec
which attempts to run the data-driven equivalence checking algorithm; however, the current implementation isn't very robust -- please file bug reports with (target, rewrite) pairs that fail to validate but should.

The bounded validator can be used to verify the example, but it takes a little while! One can run

make check
to do a fast check or
stoke debug verify --def_in "{ %rax %rdi }" --live_out "{ %rax }" --target bins/_Z6popcntm.s --rewrite result.s --abi_check --strategy bounded --bound 64
to do a complete proof of equivalence. The faster check uses a bound of 8. Roughly speaking, this checks that the rewrite is correct when the input value is only 8 bits in size. Increasing the boundwill check more cases, and when the bound is 64 it will check all of them, but running time is exponential in the bound.

Another example of using the validator can be found in the

folder; this example has a Makefile much like the tutorial's and should be easy to follow. The key difference is that the pairity example does not use testcases to guide search. Instead, after producing a candidate rewrite, the validator checks for equivalence. If the codes are not equivalent, a counterexample is found, and this is used as a new testcase to help guide search.

There are some important limitations to keep in mind while using the validator:

  • Only some instructions are supported. The
    flag can be used to only propose instructions that can be validated.
  • Only the general purpose registers, SSE registers (
    ) and five of the status flags (
    ) are supported.
  • Memory is now fully supported, even in the presence of complex aliasing.

Additional Features

In addition to the subcommands described above, STOKE has facilities for debugging and benchmarking the performance of each of its core components. See

stoke --help
for an up-to-date list.
  • stoke debug cfg
    : Generate a pdf of a control flow graph.
  • stoke debug cost
    : Compute the cost of a rewrite.
  • stoke debug diff
    : Diff the resulting state of two functions.
  • stoke debug effect
    : Show the effect of a function on the state.
  • stoke debug formula
    : Show the SMT formula for a straight-line piece of code.
  • stoke debug sandbox
    : Step through the execution of a rewrite.
  • stoke debug search
    : View the changes produced by performing and undoing a program transformation.
  • stoke debug simplify
    : Take an x86 program and simplify it (by removing redundant instructions).
  • stoke debug state
    : Check the behavior of operators that manipulate hardware machine states.
  • stoke debug tunit
    : Show the instruction sizes and RIP-offsets for a code.
  • stoke debug verify
    : Check the equivalence of two programs.
  • stoke benchmark cfg
    : Measure the time required to recompute a control flow graph.
  • stoke benchmark cost
    : Measure the time required to compute a cost function.
  • stoke benchmark sandbox
    : Measure the time required to execute a program in a STOKE sandbox.
  • stoke benchmark search
    : Measure the time required to perform and undo a transformation to a program.
  • stoke benchmark state
    : Measure the time required to reset the memory of a hardware machine state.
  • stoke benchmark verify
    : Measure the time required to check the equivalence of two programs.

Shell completion

STOKE also comes with support for bash and zsh completion. To enable either, type:

$ make bashcompletion $ make zshcompletion

Using functions to be proposed by STOKE

STOKE can not only propose instructions when searching for programs, but also propose calls to a list of known functions using the

command-line argument. To decide whether these functions read any undefined state (before proposing them), we use a dataflow analysis. Sometimes, the dataflow analysis can be too imprecise, which is why STOKE allows the user to annotate dataflow information in comments. Here is an example of a function that clears the overflow flag. STOKEs dataflow analysis is too imprecise for this code.
  .globl clear_of
  .type clear_of, @function
#! maybe-read { }
#! maybe-write { %of %r15 }
#! maybe-undef { }
  popq %r15
  andq $0xfffff7ff, %r15
  pushq %r15

.size clear_of, .-clear_of

Note that it is enough to specify the maybe sets, as STOKE will automatically realize that the must sets need to be contained in the maybe set.

User FAQ

What is the different between
stoke synthesize
stoke optimize

Both use the same core search algorithm, but in synthesis mode, STOKE starts from the empty program and tries to find a rewrite from scratch. This is great for finding implementations that are very different than the target. In optimization mode however, STOKE starts from an initial program, usually the target. This allows STOKE to work on much longer programs (because it already starts with a correct program) and apply optimizations to that program.

stoke replace
errors with
New function has N bytes, but the old one had M
. What does that mean?

Right now,

stoke replace
has a limitation where it can only replace a function if the old implementation has at least the size (in bytes) of the new implementation.

If you still want to use

stoke replace
, and if you control the compilation of the binary, a workaround is to make the old implementation artificially larger by using the compiler flag
for some large enough
, say 64. In this case, the compiler will align functions at
bytes, which typically requires padding the functions with
s. This increases the chance of
stoke replace
to succeed.

Developer FAQ

How does the assembler work (and how do I debug it?)

There is a good explanation in the issue tracker. We also have a script to compare how gcc and the x64asm assembler assemble an instruction.

How can I run STOKE in gdb?

STOKEs sandbox catches SIGFPEs, and thus running STOKEs search in the sandbox causes gdb to pause very often. To not have it stop for SIGFPEs (they are almost never a problem for STOKE), run this inside gdb:

handle SIGFPE nostop noprint

You can enable this by default by running the following command:

echo "handle SIGFPE nostop noprint" > .gdbinit

Extending STOKE

This repository contains a minimal implementation of STOKE as described in the academic papers about STOKE. Most, but not all of the features described in those papers appear here. Developers who are interested in refining these features or adding their own extensions are encouraged to try modifying this implementation as described below.

Code Organization

The STOKE source is organized into modules, each of which correspond to a subdirectory of the

  • src/analysis
    : An aliasing analysis used by the validator.
  • src/cfg
    : Control flow graph representation and program analysis.
  • src/cost
    : Different cost functions that can be used in the search.
  • src/disassembler
    : Runs objdump and parses the results into STOKE's format.
  • src/expr
    : A helper used to parse arithmetic expressions in config files.
  • src/ext
    : External dependencies.
  • src/sandbox
    : A sandbox for testing proposed code on the hardware.
  • src/search
    : An implementation of an MCMC-like algorithm for search.
  • src/state
    : Data structures to represent concrete machine states (testcases).
  • src/stategen
    : Generates concrete machine states (testcaes) for a piece of code.
  • src/symstate
    : Models the symbolic state of the hardware; only used by the formal validator.
  • src/target
    : Code to find which instruction sets the CPU supports.
  • src/transform
    : Transforms used during search to mutate the code.
  • src/tunit
    : Classes for representing a function (x86-64 code along with a name and other metadata).
  • src/verifier
    : Wrappers around verification techniques such as testing for formal validation.
  • src/validator
    : The formal validator for proving two codes equivalent.


directory has the code that performs application logic and reads command line arguments.
  • tools/apps
    : The application logic for stoke binaries
  • tools/args
    : Lists of command line arguments used by a gadget (see below).
  • tools/gadgets
    : Modules used by applications to configure internal APIs with command line arguments.
  • tools/io
    : Code to read/write certain kinds of command line arguments.
  • tools/scripts
    : Where we put stuff when we don't have a better place. Nothing to see here.
  • tools/target
    : Arbitrarily named directory with code to read CPU features from cpuinfo.


The stoke codebase is setup in a very modular way. We have components like the

, which emulates execution of a rewrite on hardware. Or, we have subclasses of
which evaluate the quality of a rewrite. Or, we have an
which is used by the formal validator to query a backend like Z3 or CVC4.

Often, several stoke applications will wish to configure one of these modules in the same way, depending on command line arguments. Thus, we have "Gadgets". A "Gadget" is a subclass of the class you wish to configure which takes care of extracting all the appropriate command line arguments. Some Gadgets, like

just define a constructor so that modifies the object's configuration. More involved ones, like
actually do work to create a new
object and define methods that act as a wrapper.

Therefore, if you want to add a command line option to an existing component of stoke, you normally are going to want to modify the gadget for that component in

and add the argument in
. Once you do that, it should show up uniformly in all of the stoke tools that use that module.

Initial Search State

Initial state types are defined in

along with an additional type for user-defined extensions.
enum class Init {

// Add user-defined extensions here ... EXTENSION };

Initial state is specified using the

command line argument which controls the initial values given to the current, lowest cost, and lowest cost correct search states. This value affects the behavior of the
Search::configure() const
method, which dispatches to the family of
Search::configure_xxxxx() const
methods. User-defined extensions should be placed in the
Search::configure_extension() const
method, which can be triggered by specifying
--init extension
void Search::configure_extension(const Cfg& target, SearchState& state) const {
  // Add user-defined logic here ...

// Invariant 1: Search state should agree with target on boundary conditions. assert(state.current.def_ins() == target.def_ins()); assert(state.current.live_outs() == target.live_outs());

assert(state.best_yet.def_ins() == target.def_ins()); assert(state.best_yet.live_outs() == target.live_outs());

assert(state.best_correct.def_ins() == target.def_ins()); assert(state.best_correct.live_outs() == target.live_outs());

// Invariant 2: Search state must be in a valid state. This function isn't on // a critical path, so this can safely be accomplished by calling state.current.recompute(); state.best_yet.recompute(); state.best_correct.recompute();

// Invariant 3: Search state must agree on first instruction. This instruction // must be the label definition that appears in the target. assert(state.current.get_code()[0] == target.get_code()[0]); assert(state.best_yet.get_code()[0] == target.get_code()[0]); assert(state.best_correct.get_code()[0] == target.get_code()[0]);

// See Search::configure for additional invariants }

Search Transformations

Transformation types are defined in the

directory. Each transform is a subclass of the abstract class
. Existing transforms are,

| Name | Description | | ---- | ----------- | | addnops | Adds one extra nop instruction into the rewrite. | | delete | Deletes one instruction at random. | | instruction | Replaces an instruction with another one chosen at random. | | opcode | Replaces an instruction's opcode with a new one that takes operands of the same type. | | operand | Replaces an operand of one instruction with another. | | rotate | Formerly "resize". Moves an instruction from one basic block to another, and shifts all the instructions in between. | | localswap | Takes two instructions in the same basic block and swaps them. | | global_swap | Takes two instructions in the entire program and swaps them. | | weighted | Selects from among several other transforms at random. |

These subclasses each implement

operator()(Cfg& cfg)
to mutate a Cfg. This function returns an object,
that contains all the information needed to undo this transformation, and also whether the transform succeeded (transforms are allowed to fail). It's common for this object to be set with indexes of instructions in the code that were modified, for example. The subclass also implements
undo(Cfg& cfg, TransformInfo ti)

Transforms will often want to select from a collection of operands and opcodes, and for this purpose they can access the

protected variable of the Transform
superclass.  This is of type
and allows access to
these collections.  This makes it possible to configure the collection of
available opcodes and operands independently of the transforms.  Also, the
superclass has a
gen_` member which is used to produce random numbers with a known seed.

Transformation weights are specified using the family of

command line arguments. These values control the distribution of proposals that are made by the WeightedTransform, which is the transform used by the search.

A simple example of how to impelement a transform is in

. Note that all transforms must appropriately make a call to recompute any
information that needs to be updated and ensure that
returns true when done (you can assume it returns true at the beginning of the function).

Cost Function

A cost function is specified using the

command line argument. It's an expression composed using standard unsigned arithmetic operators. As variables, you can use several measurements of the current rewrite. The most important of these is
. The value
is (by default) the number of bits that differ in the outputs of the target versus the rewrite summed across all testcases. There are some tunable options for this, for example, for floating point computations. In all cases, lower cost is better.

Some other important cost-variables you can use are:

| Name | Description | | ---- | ----------- | | binsize | The size (in bytes) of the assembled rewrite using the x64asm library. | | correctness | How "correct" the rewrite's output appears. Very configurable. | | size | The number of instructions in the assembled rewrite. | | latency | A poor-man's estimate of the rewrite latency, in clock cycles, based on the per-opcode latency table in

. | | measured | An estimate of running time by counting the number of instructions actually executed on the testcases. Good for loops and algorithmic improvements. | | sseavx | Returns '1' if both avx and sse instructions are used (this is usually bad!), and '0' otherwise. Often used with a multiplier like
correctness + 1000*sseavx
| | nongoal | Returns '1' if the code (after minimization) is found to be equivalent to one in
. Can also be used with a multiplier. |

In typical usage, you will combine the value of

with other values you want to optimize for. A good starting point is
correctness + measured
correctness + latency
(the latter being default). Improvements might assign an SSE-AVX penalty, like
correctness + latency + 10000*sseavx

To add a new cost function, drop a file into

that subclasses
. Look at
for a simple example. It comes down to overloading the
function to return the value you want. Look at
for an example of how to use runtime data from the sandbox to generate values. Then, add an entry to the map in
so that your new function can be found on the command line.

Live-out Error

Live-out error measurement types are defined in

along with an additional type for user-defined extensions.
enum class Distance {

// Add user-defined extensions here ... EXTENSION };

Measurement type is specified using the

command line argument. This value controls the behavior of the
method, which dispatches to the family of
CorrectnessCost::xxxxx_distance() const
methods, each of which represent a method computing the distance between 64-bit values. User-defined extensions should be placed in the
CostFunction::extension_distance() const
method, which can be triggered by specifying
--distance extension
Cost CostFunction::extension_distance(uint64_t x, uint64_t y) const {                                            
  Cost res = 0;

// Add user-defined implementation here ...

// Invariant 1: Return value should not exceed max_error_cost assert(res <= max_error_cost);

return res;

Verification Strategy

The verification strategy specifies what kind of verification to do on the rewrite. It's controlled using the

command line argument. Right now, the options are 'holdout', 'straightline' or 'bounded'.

Command Line Args

Command line arguments can be added to any of the STOKE subcommands using the following syntax. Argument separators which are printed as part of help messages are specified by defining a heading variable:

auto& heading = Heading::create("Heading Description:");

Command line flags are specified by declaring a

auto& flag = FlagArg::create("flag_name")
  .description("What this flag does");

Any of the built-in c++ primitive types are specified by declaring a

auto& val = ValueArg::create("value_name")
  .description("What this value represents")

User-defined types are specified by additionally providing function objects that define I/O methods.

struct T {
  int x, y, z;

struct Reader { void operator()(istream& is, T& t) const { is >> t.x >> t.y >> t.z; } };

struct Writer { void operator()(ostream& os, const T& t) const { os << t.x << " " << t.y << " " << t.z; } };

auto& val = ValueArg::create("value_name") .alternate("alternate_value_name") .usage(" ") .description("What this value represents") .default_val({0,0,0});

For complex values that are better suited to being read from files, a

may be more appropriate than a
. The syntax is identical.
auto& val = FileArg::create("value_name")
  .description("What this value represents")


Questions and comments are encouraged. Please reach us through the GitHub issue tracker, or alternatively at

[email protected]

We use cookies. If you continue to browse the site, you agree to the use of cookies. For more information on our use of cookies please see our Privacy Policy.