String distance functions for R
The package offers the following main functions:
stringdist computes pairwise distances between two input character vectors (the shorter one is recycled)
stringdistmatrix computes the distance matrix for one or two vectors
stringsim computes a string similarity between 0 and 1, based on stringdist
amatch is a fuzzy matching equivalent of R's native match
ain is a fuzzy matching equivalent of R's native %in%
afind finds the location of fuzzy matches of a short string in a long string.
seq_dist, seq_distmatrix, seq_amatch and seq_ain for distances between, and matching of, integer sequences (see also the hashr package).
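The main functions listed above can be tried out with a short sketch like the following. The input strings are made up for illustration, and the `method` and `maxDist` values are choices made here, not recommendations from the package.

```r
library(stringdist)

# Illustrative inputs (not taken from the package documentation)
a <- c("kitten", "flaw")
b <- c("sitting", "lawn")

# Pairwise Levenshtein distances between a[i] and b[i]
d <- stringdist(a, b, method = "lv")

# Distance matrix: one row per element of the first vector,
# one column per element of the second
m <- stringdistmatrix(c("foo", "bar"), c("foo", "baz", "boo"), method = "lv")

# Similarity between 0 and 1 derived from the Levenshtein distance
s <- stringsim("kitten", "sitting", method = "lv")

# Fuzzy versions of match() and %in%; maxDist bounds the allowed distance
idx   <- amatch("leia", c("uhura", "leela"), maxDist = 2)
found <- ain("leia", c("uhura", "leela"), maxDist = 2)
```

Here `idx` holds the position of the closest table entry within `maxDist` (or `NA` if there is none), and `found` is the corresponding logical answer.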
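These functions are built upon C code that re-implements some common (weighted) string distance functions. Distance functions include Hamming, Levenshtein (weighted), restricted Damerau-Levenshtein (weighted, also known as optimal string alignment), full Damerau-Levenshtein (weighted), longest common substring, q-gram, cosine and Jaccard distances on q-gram count vectors, Jaro and Jaro-Winkler, and a soundex-based distance.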
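As a minimal sketch, the distance function is selected through the `method` argument of `stringdist`. The example strings below are chosen for illustration; the method names (`"osa"`, `"dl"`, `"lv"`, `"hamming"`, `"jw"`) are the ones the package documents.

```r
library(stringdist)

# Optimal string alignment (the default) edits each substring at most once
d_osa <- stringdist("ca", "abc", method = "osa")

# Full Damerau-Levenshtein may transpose and then edit again: ca -> ac -> abc
d_dl <- stringdist("ca", "abc", method = "dl")

# Plain Levenshtein has no transpositions
d_lv <- stringdist("ca", "abc", method = "lv")

# Hamming distance counts differing positions in equal-length strings
d_ham <- stringdist("abc", "abd", method = "hamming")

# Jaro distance (Jaro-Winkler with the default prefix weight p = 0)
d_jw <- stringdist("martha", "marhta", method = "jw")
```

Comparing `d_osa` with `d_dl` shows the practical difference between the restricted and the full Damerau-Levenshtein variants.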
Also, there are some utility functions:
qgrams() tabulates the qgrams in one or more character vectors
seq_qgrams() tabulates the qgrams (sometimes called ngrams) in one or more integer vectors
phonetic() computes phonetic codes of strings (currently only soundex)
printable_ascii() is a utility function that detects non-printable ASCII or non-ASCII characters.
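The utility functions can be explored with a small sketch such as the one below. The inputs are invented for illustration, and `q = 2` is an arbitrary choice.

```r
library(stringdist)

# Tabulate the 2-grams of a single string: "he", "el", "ll", "lo"
tab <- qgrams("hello", q = 2)

# Soundex codes; similar-sounding names receive the same code
codes <- phonetic(c("Robert", "Rupert"))

# TRUE where a string consists of printable ASCII characters only
ok <- printable_ascii(c("abc", "caf\u00e9"))
```

Checking input with `printable_ascii()` first can be useful because soundex is defined for ASCII letters.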
As of version 0.9.5.0 you can call a number of stringdist functions directly from the C code of your R package. The description of the API can be found by typing ?stringdist_api in the R console, or by browsing to "User guides, package vignettes and other documentation" in the package's help index.
Examples of packages that link to stringdist can be found here and here.
To install the latest release from CRAN, open an R terminal and type install.packages("stringdist").
To obtain the package from the very latest source code, open a bash terminal (or git bash if you work under Windows with msysgit) and type
    git clone https://github.com/markvanderloo/stringdist.git
    cd stringdist
    bash ./build.bash
    R CMD INSTALL output/stringdist_*.tar.gz
Warning: the GitHub version can change at any time and may not even build properly. As most of the code is written in C, the development version may crash your R session.
The following arguments have been obsolete since 2015 and have been removed in the 0.9.5.0 release (spring 2018)
Parallelization used to be based on R's parallel package, which works by spawning several R sessions in the background. As of version 0.9.0, stringdist uses the more efficient OpenMP interface to parallelize everything under the hood.
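A brief sketch of how the number of threads might be controlled, assuming the `nthread` argument that `stringdist` exposes in versions with OpenMP support. The inputs and thread counts are illustrative only.

```r
library(stringdist)

a <- c("kitten", "flaw", "saturday")
b <- c("sitting", "lawn", "sunday")

# Same computation with one and with two threads; the thread count
# should affect speed only, never the result
d1 <- stringdist(a, b, method = "lv", nthread = 1)
d2 <- stringdist(a, b, method = "lv", nthread = 2)

same <- identical(d1, d2)
```

For inputs this small the threading overhead dominates; any gain would only show on long vectors.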
The following arguments have become obsolete and will be removed somewhere in 2016:
* Argument [...] stringdistmatrix.
* Argument [...] amatch).
* Argument [...]