A tool for bulk text comparison and analysis
This is a new version of Superfastmatch, written in C++ to improve matching performance, with an index held entirely in memory to improve response times.
The point of the software is to index large amounts of text in memory. There is therefore little reason to run it on a 32-bit OS, with its 4GB cap on memory, so a 64-bit OS is assumed.
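If you are unsure whether a machine meets the 64-bit assumption, a quick check along these lines may help. This is only a sketch: `is_64bit_arch` is a hypothetical helper, not part of Superfastmatch, and the list of architecture names is illustrative rather than exhaustive.

```shell
#!/bin/sh
# Return success if the given architecture string (as printed by `uname -m`)
# is one of a few common 64-bit names. The list is illustrative, not complete.
is_64bit_arch() {
    case "$1" in
        x86_64|amd64|aarch64|arm64|ppc64|ppc64le|s390x) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the machine this script is running on.
if is_64bit_arch "$(uname -m)"; then
    echo "64-bit OS detected"
else
    echo "warning: $(uname -m) does not look like a 64-bit OS" >&2
fi
```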
The process for installation is as follows:
Superfastmatch depends on these libraries:
You might be able to get away with installing the .deb packages on the listed project pages, but this is untested.
The easier route is to run:
and wait for everything to build. The script will ask you for your sudo password, which is required to install the libraries.
On Ubuntu you'll need to do this first:
sudo apt-get install libunwind7-dev mercurial curl build-essential zlib1g-dev
And you might also need a:
after the script has finished.
On Fedora/Amazon AMI this will allow bootstrap.sh to complete:
sudo yum update
sudo yum install git
sudo yum install svn
sudo yum install gcc
sudo yum install gcc-c++
sudo yum install zlib-devel
sudo yum install mercurial
wget http://download.savannah.gnu.org/releases/libunwind/libunwind-0.99.tar.gz
tar xzf libunwind-0.99.tar.gz
cd libunwind-0.99
./configure && make && sudo make install
You might also have to add /usr/local/lib to /etc/ld.so.conf.
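One way to add /usr/local/lib to /etc/ld.so.conf without creating duplicate entries is sketched below. `ensure_lib_path` is a hypothetical helper written for this example. It takes the config file as an argument so that it can be tried on a scratch file first.

```shell
#!/bin/sh
# Append a directory to a linker config file unless that exact line
# is already present.
# Usage: ensure_lib_path DIR CONF_FILE
ensure_lib_path() {
    # -q: quiet, -x: match the whole line, -F: treat DIR as a fixed string
    grep -qxF -- "$1" "$2" 2>/dev/null || printf '%s\n' "$1" >> "$2"
}
```

To apply it to the real file, the function would need to run as root, for example `ensure_lib_path /usr/local/lib /etc/ld.so.conf` in a root shell. Running `sudo ldconfig` afterwards should refresh the linker cache.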
After the libraries are installed, you can run:
to run the unit tests for the code.
After that you can run:
to get a superfastmatch instance running. Nothing is configurable from the command line yet; that is coming soon.
Visit http://127.0.0.1:8080 to test the interface.
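If you would rather check from a terminal that the instance is answering, a small polling helper such as the following might do. It is a sketch that assumes curl is installed (it is in the Ubuntu package list above). `wait_for_http` is a hypothetical name, and the only URL used is the one given above.

```shell
#!/bin/sh
# Poll a URL until it answers with a successful HTTP status, or give up.
# Usage: wait_for_http URL TRIES
wait_for_http() {
    url=$1
    tries=$2
    while [ "$tries" -gt 0 ]; do
        # -s: silent, -f: treat HTTP errors as failure, -o: discard the body
        if curl -sf -o /dev/null --max-time 5 "$url"; then
            return 0
        fi
        tries=$((tries - 1))
        # Pause between attempts, but not after the last one.
        if [ "$tries" -gt 0 ]; then
            sleep 1
        fi
    done
    return 1
}
```

For example, `wait_for_http http://127.0.0.1:8080 10 && echo "superfastmatch is up"` retries about once a second, up to ten times.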
For a quick introduction to what can be found with superfastmatch try this:
If you have a machine with less than 8GB of memory and fewer than 4 cores, run:
./superfastmatch -debug -hash_width 24 -reset -slot_count 2 -thread_count 2 -window_size 30
otherwise this will be much faster:
./superfastmatch -debug -reset -window_size 30
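The choice between the two commands above can be scripted. The sketch below follows the rule exactly as worded (reduced settings only when the machine has both less than 8GB of memory and fewer than 4 cores). `choose_flags` is a hypothetical helper, and the probes for memory and core count are Linux-specific assumptions. A machine sold as "8GB" usually reports slightly less than that in /proc/meminfo, so it would get the reduced settings if it also has fewer than 4 cores.

```shell
#!/bin/sh
# Print the flags suggested above for a machine with the given
# total memory (in kB) and core count.
# Usage: choose_flags MEM_KB CORES
choose_flags() {
    eight_gb_kb=$((8 * 1024 * 1024))
    if [ "$1" -lt "$eight_gb_kb" ] && [ "$2" -lt 4 ]; then
        echo "-debug -hash_width 24 -reset -slot_count 2 -thread_count 2 -window_size 30"
    else
        echo "-debug -reset -window_size 30"
    fi
}

# Linux-specific probes: MemTotal from /proc/meminfo, core count from nproc.
# This only prints the suggested command; it does not start superfastmatch.
if [ -r /proc/meminfo ] && command -v nproc >/dev/null 2>&1; then
    total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    echo "./superfastmatch $(choose_flags "$total_kb" "$(nproc)")"
fi
```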
And then finally, in another terminal window, run:
to load some example documents and associate them with each other. You can view the results in the browser at:
See contrib/init.d for an example init.d script. It makes use of fuser, which may require:
sudo apt-get install psmisc
This is still an early release, halfway between Alpha and Beta! There are known issues with large documents that affect the document list and detail pages, and the full REST specification is not yet implemented. Lots of fixes, new features and performance improvements are currently in development, so keep checking the commit log!
Thanks to Martin Moore and Ben Campbell at Media Standards Trust for ongoing support for the project and to Tom Lee, Drew Vogel, Kaitlin Lee and James Turk at Sunlight Labs for being willing testers, early adopters and proponents of open source!
Thanks also to Mikio Hirabayashi for assistance and the excellent open source Kyoto Cabinet and Kyoto Tycoon, to Craig Silverstein for accepting and improving this patch, to Neil Fraser for useful hints and inspiration from Diff-Match-Patch and to Austin Appleby for hashing advice.