experimental and very fast implementation of a grep
This is my own, experimental, parallel version of grep so I can test various strategies to speed up access to large directory trees. On Flash storage or SSDs, you can easily outsmart common greps by up a factor of 8.
-O -- print file offset of match -l -- do not print the matching line (Useful if you want to see _all_ offsets; if you also print the line, only the first match in the line counts) -s -- single match; dont search file further after first match (similar to grep on a binary) -H -- use hyerscan lib for scanning (see build instructions) -S -- only for hyperscan: interpret pattern as string literal instead of regex -L -- machine has low mem; half chunk-size (default 1GB) may be used multiple times -I -- enable highlighting of matches (useful) -n -- Use n cores in parallel (recommended for flash/SSD) n <= 1 uses single-core -r -- recurse on directory -R -- same as -r
grab uses the pcre library, so basically its equivalent to a
grep -P -a. The
-Pis important, since Perl-Compatible Regular Expressions have different characteristics than basic regexes.
There are two branches.
greppin. Master is the 'traditional' grab that should compile and run on most POSIX systems.
greppincomes with its own optimized and parallelized version of
readdir(), which again doubles speed on the top of speedup that the
masterbranch already provides. The
greppinbranch runs on Linux, BSD and OSX.
greppinalso comes with support for Intel's hyperscan libraries that try to exploit CPU's SIMD instructions if possible (AVX2, AVX512 etc.) when compiling the regex pattern into JIT code.
You will most likely want to build the
$ git checkout greppin [...] $ cd src; make [...]
On BSD systems you need
make. If you want to do cutting edge tech with greppin's multiple regex engine and hyperscan support, you first need to get and build that:
$ git clone https://github.com/intel/hyperscan [...] $ cd hyperscan $ mkdir build; cd build $ cmake -DFAT_RUNTIME=1 -DBUILD_STATIC_AND_SHARED=1 .. [...] $ make [...]
This will build so called fat runtime of the hyperscan libs which contain support for all CPU families in order to select the right compilation pattern at runtime for most performance. Once the build finishes, you build greppin against that:
(inside grab cloned repo)
$ cd src $ HYPERSCAN_BUILD=/path/to/hyperscan/build make -f Makefile.hs [...]
This will produce a
greppinbinary that enables the
-Hoption to load a different engine at runtime, trying to exploit all possible performance bits.
You could link it against already installed libs, but the API just recently added some functions in the 5.x version and most distros ship with 4.x.
grab is using
mmap(2)and matches the whole file blob without counting newlines (which grep is doing even if there is no match [as of a grep code review of mine in 2012; things may be different today]) which is a lot faster than
read(2)-ing the file in small chunks and counting the newlines. If available, grab also uses the PCRE JIT feature. However, speedups are only measurable on large file trees or fast HDDs or SSDs. In the later case, the speedup can be really drastically (up to 3 times faster) if matching recursively and in parallel. Since storage is the bottleneck, parallelizing the search on HDDs makes no sense, as the seeking takes more time than just doing stuff in linear.
Additionally, grab is skipping files which are too small to contain the regular expression. For larger regex's in a recursive search, this can skip quite good amount of files without even opening them.
A quite new pcre lib is required, on some older systems the build can fail due to a missing
Files are mmaped and matched in chunks of 1Gig. For files which are larger, the last 4096 byte (1 page) of a chunk are overlapped, so that matches on a 1 Gig boundary can be found. In this case, you see the match doubled (but with the same offset).
If you measure grep vs. grab, keep in mind to drop the dentry and page caches between each run:
echo 3 > /proc/sys/vm/drop_caches
Note, that grep will print only a 'Binary file matches', if it detects binary files, while grab will print all matches, unless
-sis given. So, for a speed test you have to search for an expression that does not exist in the data, in order to enforce searching of the entire files.
grab was made to quickly grep through large directory trees without indexing. The original grep has by far a more complete option-set. The speedup for a single file match is very small, if at all measureable.
For SSDs, the multicore option makes sense. For HDDs it does not, since the head has to be positioned back and forth between the threads, potentially destroying the locality principle and killing performance.
greppinbranch features its own parallel version of
nftw(), so the time of idling of N - 1 cores when the 1st core builds the directory tree can also be used for working. Additional to that, since locking is required in the threads anyway, it also comes with its own faster and lockfree
readdir()implementation to save quite some
Whats left to note: grab will traverse directories physically, i.e. it will not follow symlinks.
This shows the speedup on a 4-core machine with a search on a SSD:
[email protected]:~# echo 3 > /proc/sys/vm/drop_caches [email protected]:~# time grep -r foobardoesnotexist /source/linux
real 0m13.135s user 0m4.023s sys 0m5.581s
[email protected]:~# echo 3 > /proc/sys/vm/drop_caches [email protected]:~# time grep -a -P -r linus /source/linux/|wc -l 16918
real 0m8.773s user 0m4.670s sys 0m5.837s [email protected]:~#
Yes! ~ 9s vs. ~ 72s! Thats 8x as fast on a 4-core SSD machine as the traditional grep.
Just to proof that it resulted in the same output:
[email protected]:~# echo 3 > /proc/sys/vm/drop_caches [email protected]:~# greppin -n 4 -r linus /source/linux/|sort|md5sum a1f9fe635bd22575a4cce851e79d26a0 - [email protected]:~# echo 3 > /proc/sys/vm/drop_caches [email protected]:~# grep -P -a -r linus /source/linux/|sort|md5sum a1f9fe635bd22575a4cce851e79d26a0 - [email protected]:~#
In the single core comparison, speedup also depends on which CPU the kernel actually scheduls the grep, so a grab may or may not be faster (mostly it is). If the load is equal among the single-core tests, grab will see a speedup if searching on large file trees. On multi-core setups, grab can benefit ofcorse.
I recently learned about this project via twitter when I was asked for performance comparisons. The project can be found here. I tested their official build
ripgrep-12.1.1-x86_64-unknown-linux-musl.tar.gzand run it on the same 4-core SSD machine as in the above tests. The net result is: both runs have about the same speed with a tendency of greppin being a few percent faster when ripgrep is invoked with
-P. ripgrep with
-eis a few percent faster than greppin with
-H. ITW, it actually also depends on the directory tree layout since my
nftw()implementation takes care to distribute load across cores even on unbalanced directory trees at the price of locking. ripgrep claims to use a lockfree parallel directory iterator written in Rust. I see small speedup of ripgrep when scanning
/usrbut I see the same amount of speedup (just few %) in greppin when scanning my Linux source trees.
The main speedup thats inside their benchmark tables stems from the fact that ripgrep ignores a lot of files when invoked without special options as well as treating binary files as a single-match target (similar to grep). In order to have comparable results, keep in mind to (4 is the number of cores):
echo 3 > /proc/sys/vm/drop_cachesbetween each run
-j 4 -a --no-unicode --no-pcre2-unicode -uuu --mmapto ripgrep, since it will by default match Unicode which is 3 times slower, and tries to compensate the speedloss by skipping 'ignore'-based files.
-eis faster than
-P, so better choose
-e, but thats not as powerful as a PCRE
wc -lto check whether the amount of match reports is equal and no files were missing in the scan
-H -n 4to greppin if you want best performance.
-His PCRE compatible with only very few exceptions (according to hyperscan docu)
setfattr -n user.pax.flags -v "m" /path/to/binaryif you run on grsec systems and require rwx JIT mappings
Then just go ahead and check the timings! :)