Thermite SIMD: Melt your CPU
NOTE: This crate is not yet on crates.io, but I do own the name and will publish it there when ready
Thermite is a WIP SIMD library focused on providing portable SIMD acceleration of SoA (Structure of Arrays) algorithms, using consistent-length¹ SIMD vectors for lockstep iteration and computation.
Thermite provides highly optimized feature-rich backends for SSE2, SSE4.2, AVX and AVX2, with planned support for AVX512, ARM/Aarch64 NEON, and WASM SIMD extensions.
In addition, Thermite includes a highly optimized vectorized math library with many special math functions and algorithms, specialized for both single and double precision.
¹ Within a given instruction set, all vectors have the same number of lanes, regardless of element size. Refer to issue #1.
Thermite was conceived while working on the Raygon renderer, when it was decided we needed a state-of-the-art, high-performance SIMD vector library focused on facilitating SoA algorithms. Using SIMD for AoS values was a nightmare of constant vector shuffling and unnecessary horizontal operations. We also weren't able to take full advantage of AVX2, since 3D vectors only use 3 or 4 lanes of a regular 128-bit register.
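To illustrate the layout difference, here is a minimal plain-Rust sketch (no Thermite APIs; all names are illustrative):

```rust
// Array of Structures: a SIMD register loaded from one point holds x, y, z
// of a single element, wasting lanes and forcing shuffles and horizontal ops.
#[allow(dead_code)]
struct PointAos {
    x: f32,
    y: f32,
    z: f32,
}

// Structure of Arrays: each field is contiguous, so a SIMD vector can load
// N consecutive x values and process N points in lockstep, filling every lane.
struct PointsSoa {
    x: Vec<f32>,
    y: Vec<f32>,
    z: Vec<f32>,
}

impl PointsSoa {
    // Squared length of every point; the per-element arithmetic maps directly
    // onto per-lane SIMD operations when the slices are chunked by vector width.
    fn length_squared(&self) -> Vec<f32> {
        self.x
            .iter()
            .zip(&self.y)
            .zip(&self.z)
            .map(|((&x, &y), &z)| x * x + y * y + z * z)
            .collect()
    }
}
```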
Alternatives such as `SIMDeez`, `faster`, or redesigning `packed_simd` were all considered, but each has its flaws. `SIMDeez` is rather limited in functionality, and its handling of `target_feature` leaves much to be desired. `faster` fits well into the SoA paradigm, but its iterator-based API is rather unwieldy, and it is lacking many features. `packed_simd` isn't bad, but it is also missing many features and relies on the Nightly-only `"platform-intrinsic"`s, which can produce suboptimal code in some cases.
Therefore, the only solution was to write my own, and thus Thermite was born.
The primary goal of Thermite is to provide optimal codegen for every backend instruction set, and to provide a consistent set of features on top of all of them, in such a way as to encourage chunked SoA or AoSoA algorithms regardless of what data types you need. Furthermore, multiple instruction sets can easily be targeted within a single binary with the `#[dispatch]` procedural macro, which ensures optimal codegen for each.
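As a rough illustration of the kind of runtime dispatch such a macro automates, here is a hand-written sketch using std's feature detection (function names are illustrative; Thermite's generated code is more sophisticated and is not shown in this README):

```rust
// Plain scalar fallback, always available.
fn sum_scalar(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

// Compiled with AVX2 enabled; a real implementation would use 256-bit
// intrinsics in the body. `target_feature` makes the function unsafe to
// call unless AVX2 support has been verified at runtime.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

// Single entry point that picks the best available implementation at
// runtime, similar in spirit to what a #[dispatch] annotation expands to.
fn sum(xs: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return unsafe { sum_avx2(xs) };
        }
    }
    sum_scalar(xs)
}
```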
For optimal performance, ensure your `Cargo.toml` profiles look something like this:

```toml
[profile.dev]
opt-level = 2       # Required to inline SIMD intrinsics internally

[profile.release]
opt-level = 3       # Should be at least 2; level 1 will not use SIMD intrinsics
lto = 'thin'        # 'fat' LTO may also improve things, but will increase compile time
codegen-units = 1   # Required for optimal inlining and optimizations
incremental = false # Release builds will take longer to compile, but inter-crate optimizations may work better
panic = 'abort'     # Very few functions in Thermite panic, but aborting will avoid the unwind mechanism overhead
```
For repeated mask queries such as `none`, consider using the bitmask directly to avoid recomputing it.
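The idea can be sketched with a plain integer standing in for a mask's movemask-style bitmask (illustrative only; this is not Thermite's actual mask API):

```rust
// One bitmask extraction serves every query, instead of performing a
// separate horizontal reduction over the mask vector for each of them.
fn mask_queries(bitmask: u16, lane_count: u32) -> (bool, bool, bool) {
    let full = (1u16 << lane_count) - 1; // all lanes set
    let any = bitmask != 0;              // at least one lane set
    let all = bitmask == full;           // every lane set
    let none = bitmask == 0;             // no lane set
    (any, all, none)
}
```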
Cargo features:

- `alloc` (enabled by default): enables aligned allocation of buffers suitable for reading and writing with SIMD.
- `nightly`: enables nightly-only optimizations such as accelerated half-precision encoding/decoding.
- `math` (enabled by default): enables the vectorized math modules.
- Another feature enables the vectorized random number modules.
Real fused multiply-add instructions are only enabled on AVX2 platforms. However, because FMA is used not only for performance but also for its extended precision, falling back to a split multiply and addition incurs two rounding errors, which may be unacceptable for some applications. Therefore, the `emulate_fma` Cargo feature will enable a slower but more accurate implementation on older platforms.
For single-precision floats, this is most easily done by casting to double precision, performing separate multiply and add operations, then casting back. For double precision, an infinite-precision implementation based on libm is used.
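A minimal sketch of the single-precision strategy (scalar here for clarity; the actual implementation would operate on whole SIMD vectors, and the function name is illustrative):

```rust
// Emulated single-precision FMA via widening: the f64 product of two f32
// values is exact (24 + 24 significand bits fit in f64's 53), so the result
// matches a true FMA except in rare double-rounding cases, avoiding the
// extra rounding error of a split f32 multiply and add.
fn fma_f32_emulated(a: f32, b: f32, c: f32) -> f32 {
    ((a as f64) * (b as f64) + (c as f64)) as f32
}
```

For example, with a = 1 + ε, b = 1 - ε, c = -1 (ε being `f32::EPSILON`), a split `a * b + c` rounds the product to 1.0 and returns 0.0, while the widened version recovers the true residual -ε².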
On SSE2 platforms, the double-precision emulation may fall back to scalar operations, as the effort needed to make it branchless would likely cost more than it saves. As of this writing, it has not been implemented, so benchmarks will reveal what is needed later.