Need help with sse-popcount?
Click the “chat” button below for chat support from the developer who created it, or find similar developers for support.

About the developer

WojciechMula
245 Stars 41 Forks BSD 2-Clause "Simplified" License 130 Commits 0 Opened issues

Description

SIMD (SSE) population count --- http://0x80.pl/articles/sse-popcount.html

Services available

!
?

Need anything else?

Contributors list

# 78,587
Shell
cpp11
c-plus-...
automat...
84 commits
# 88,092
Shell
MATLAB
bundle-...
c-plus-...
3 commits
# 498,232
C++
C
Shell
avx512
3 commits
# 52,400
Shell
C++
nacl
TeX
2 commits
# 128,892
C++
C
Shell
avx512
2 commits
# 270,247
C
C++
avx512
arm-neo...
2 commits
# 207,816
onedriv...
Linux
D
Shell
2 commits
# 603,786
C++
C
Shell
avx512
1 commit
# 486,798
C++
C
Shell
avx512
1 commit

========================================================================

SIMD popcount

Sample programs for my article http://0x80.pl/articles/sse-popcount.html

.. image:: https://travis-ci.org/WojciechMula/sse-popcount.svg?branch=master :target: https://travis-ci.org/WojciechMula/sse-popcount

Paper

Daniel Lemire, Nathan Kurz and I published an article

Faster Population Counts using AVX2 Instructions
__.

__ https://arxiv.org/abs/1611.07612

Introduction

Subdirectory original contains code from 2008 --- it is 32-bit and GCC-centric. The root directory contains fresh C++11 code, written with intrinsics and tested on 64-bit machines.

There are two programs:

  • verify
    --- it tests if all non-lookup implementations counts bits properly;
  • speed
    --- benchmarks different implementations of popcount procedure; please read help to find all options (run the program without arguments).

There are several targets:

  • default --- builtin functions, SSE and popcnt instructions;
  • AVX2 --- all above plus AVX2 implementations;
  • AVX512BW --- all above plus experimental AVX512BW code;
  • AVX512VBMI --- all above plus experimental AVX512VBMI code;
  • AVX512 VPOPCNT --- all above plus experimental AVX512 VPOPCNT code (should be compilable with very recent GCC__, software emulator doesn't support this extension yet);
  • arm --- builtin and ARM Neon implementations.

Type

make help
to find out details. To run the default target benchmark simply type
make
.

__ https://github.com/gcc-mirror/gcc/commit/e0aa57d6b04908affdf4655a6b4a9f2d4d03483b

Available implementations

+---------------------------------------+------------------------------------------------------------------+ | procedure | description | +=======================================+==================================================================+ | lookup-8 | lookup in std::uint8t[256] LUT | +---------------------------------------+------------------------------------------------------------------+ | lookup-64 | lookup in std::uint64t[256] LUT | +---------------------------------------+------------------------------------------------------------------+ | bit-parallel | naive bit parallel method | +---------------------------------------+------------------------------------------------------------------+ | bit-parallel-optimized | a bit better bit parallel | +---------------------------------------+------------------------------------------------------------------+ | bit-parallel-optimized2 | better utilization of 2- and 4-bit subwords | +---------------------------------------+------------------------------------------------------------------+ | bit-parallel-mul | bit-parallel with fewer instructions | +---------------------------------------+------------------------------------------------------------------+ | bit-parallel32 | naive bit parallel method (32 bit) | +---------------------------------------+------------------------------------------------------------------+ | bit-parallel-optimized32 | a bit better bit parallel (32 bit) | +---------------------------------------+------------------------------------------------------------------+ | harley-seal | Harley-Seal popcount (4th iteration) | +---------------------------------------+------------------------------------------------------------------+ | sse-bit-parallel | SSE implementation of bit-parallel-optimized (unrolled) | +---------------------------------------+------------------------------------------------------------------+ | sse-bit-parallel-original | SSE implementation of bit-parallel-optimized | +---------------------------------------+------------------------------------------------------------------+ | sse-bit-parallel-better | SSE implementation of bit-parallel with fewer instructions | +---------------------------------------+------------------------------------------------------------------+ | sse-harley-seal | SSE implementation of Harley-Seal | +---------------------------------------+------------------------------------------------------------------+ | sse-lookup | SSSE3 variant using pshufb instruction (unrolled) | +---------------------------------------+------------------------------------------------------------------+ | sse-lookup-original | SSSE3 variant using pshufb instruction | +---------------------------------------+------------------------------------------------------------------+ | avx2-lookup | AVX2 variant using pshufb instruction (unrolled) | +---------------------------------------+------------------------------------------------------------------+ | avx2-lookup-original | AVX2 variant using pshufb instruction | +---------------------------------------+------------------------------------------------------------------+ | avx2-harley-seal | AVX2 implementation of Harley-Seal | +---------------------------------------+------------------------------------------------------------------+ | cpu | CPU instruction popcnt (64-bit variant) | +---------------------------------------+------------------------------------------------------------------+ | sse-cpu | load data with SSE, then count bits using popcnt | +---------------------------------------+------------------------------------------------------------------+ | avx2-cpu | load data with AVX2, then count bits using popcnt | +---------------------------------------+------------------------------------------------------------------+ | avx512-harley-seal | AVX512 implementation of Harley-Seal | +---------------------------------------+------------------------------------------------------------------+ | avx512bw-shuf | AVX512BW implementation uses shuffle instruction | +---------------------------------------+------------------------------------------------------------------+ | avx512vbmi-shuf | AVX512VBMI implementation uses shuffle instruction | +---------------------------------------+------------------------------------------------------------------+ | avx512-vpopcnt | AVX512 VPOPCNT | +---------------------------------------+------------------------------------------------------------------+ | builtin-popcnt | builtin for popcnt | +---------------------------------------+------------------------------------------------------------------+ | builtin-popcnt32 | builtin for popcnt (32-bit variant) | +---------------------------------------+------------------------------------------------------------------+ | builtin-popcnt-unrolled | unrolled builtin-popcnt | +---------------------------------------+------------------------------------------------------------------+ | builtin-popcnt-unrolled32 | unrolled builtin-popcnt32 | +---------------------------------------+------------------------------------------------------------------+ | builtin-popcnt-unrolled-errata | unrolled builtin-popcnt avoiding false-dependency | +---------------------------------------+------------------------------------------------------------------+ | builtin-popcnt-unrolled-errata-manual | unrolled builtin-popcnt avoiding false-dependency (asembly code) | +---------------------------------------+------------------------------------------------------------------+ | builtin-popcnt-movdq | builtin-popcnt where data is loaded via SSE registers | +---------------------------------------+------------------------------------------------------------------+ | builtin-popcnt-movdq-unrolled | builtin-popcnt-movdq unrolled | +---------------------------------------+------------------------------------------------------------------+ | builtin-popcnt-movdq-unrolled_manual | builtin-popcnt-movdq unrolled (assembly code) | +---------------------------------------+------------------------------------------------------------------+ | neon-vcnt | ARM Neon using VCNT | +---------------------------------------+------------------------------------------------------------------+ | neon-HS | Harley-Seal using Neon VCNT | +---------------------------------------+------------------------------------------------------------------+ | aarch64-cnt | ARMv8 Neon using CNT | +---------------------------------------+------------------------------------------------------------------+

Performance results

The subdirectory results__ contains performance results from various computers. If you can, please contribute.

__ results/README.rst

Acknowledgments

  • Kim Walisch (@kimwalisch) wrote Harley-Seal scalar implementation.
  • Simon Lindholm (@simonlindholm) added unrolled versions of procedures.
  • Dan Luu (@danluu) agreed to include his procedures (
    builint-*
    ) into this project. More details in Dan's article
    Hand coded assembly
    beats intrinsics in speed and simplicity
    __

__ http://danluu.com/assembly-intrinsics/

See also

  • libpopcnt__ --- library by Kim Walisch utilizing methods from our paper.

__ https://github.com/kimwalisch/libpopcnt

.. vim: nowrap

We use cookies. If you continue to browse the site, you agree to the use of cookies. For more information on our use of cookies please see our Privacy Policy.