Benchmark of Nearest Neighbor Search on High Dimensional Data
No Data
Nearest neighbor search (NNS) is a fundamental and essential operation in applications from many domains, such as databases, machine learning, multimedia, and computer vision. Hundreds of algorithms have been proposed to attack the problem from different perspectives, yet there is no open and comprehensive comparison. By "comprehensive", we mean using state-of-the-art algorithms proposed from different research areas, and evaluating them on a diverse range of datasets.
To aid researchers and practitioners working on or whose work depends on the problem, we setup this benchmark for Nearest Neighbor Search (NNS) based on the Euclidean distance on High Dimensional Data. The benefit is twofold:
We also would like to leverage the entire community to collaboratively build and maintain this benchmark. For example, submitting new algorithms, useful datasets, or offering suggestions or improvements.
We limit the scope of this benchmark by imposing the following constraints.
Representative and competitive approximate NNS algorithms. It has been well recognized that the exact NNS algorithms often cannot even outperform even the simple linear scan algorithm when the dimensionality is high. We therefore only consider the approximate solutions. In this benchmark, we consider the state-of-the-art algorithms in several research areas and those from practitioners.
No hardware specific optimizations. Not all the implementations we obtained or implemented have the same level of sophistication in utilizing the hardware specific features to speed up the query processing. To perform a (relatively) fair comparison and gain insight into the real strength and weakness of algorithms, we modified several implementations so that, as far as we are aware, no implementation of algorithm in this benchmark uses multiple threads, multiple CPUs, SIMD instructions, hardware pre-fetching, or GPUs.
Dense vectors. We mainly focus on the case where the input data vectors are dense, i.e., non-zero in most of the dimensions.
Support the Euclidian distance. The Euclidean distance is one of the most widely used measures on high-dimensional datasets. It is also supported by most of the NNS algorithms.
Exact kNN as the ground truth. In some existing works, each data point has a label (typically in classification or clustering applications) and the labels are regarded as the ground truth when evaluating the recall of approximate NN algorithms. In our benchmark, we use the exact kNN points as the ground truth as this works for all datasets and majority of the applications.
Currently, we evaluate 15 representative NNS algorithms on 20 datasets where details are reported in our Nearest Neighbor Search (NNS) Experimental Evaluation Paper[1].
Thanks to the original authors, all the algorithms considered in this benchmark have their sources publicly available. Algorithms are implemented in C++ unless otherwise specified. We carefully go through all the implementations and make necessary modifications to for fair comparisons. For instance, we re-implement the search process of some algorithms in C++. We also disable the multi-threads, SIMD instructions, fast-math, and hardware prefetching technique.
Below are brief introuductions of the algorithms evaluated in the benchmark as well as our revisions, which are grouped into three/four categories.
QALSH Query-Aware LSH ([2], PVLDB’15). Originial source code
SRS ([3], PVLDB’14). Originial source code. Note that both external memory and in-memory versions of SRS are available.
SGH Scalable Graph Hashing ([4], IJCAI’15). Originial source code.
AGH Anchor Graph Hashing ([5], ICML’11). Originial source code
NSH Neighbor-Sensitive Hashing ([6], PVLDB’15). Originial source code
SH Selective Hashing ([7], KDD’15). Originial source code. Note that we have confirmed with the authors that the source code released implements the algorithm in their KDD'15 paper, not that in their previous SIGMOD'14 paper.
OPQ Optimal Product Quantization ([8], TPAMI’14). Originial source code. Note that, in our impementation, we use the inverted multi-indexing technique [17] to perform non-exhaustive search for OPQ.
NAPP Neighborhood APProximation index ([9], PVLDB’15). Originial source code
-Ofastcompiler option.
FLANN ([10], TPAMI’14). Originial source code
Annoy ([11]). Originial source code
-ffast-mathcompiler option in Annoy.
VP-Tree Vantage-Point tree ([12], NIPS’13). Originial source code
-Ofastcompiler option.
SW Small World Graph ([13], IS'14). Originial source code
-Ofastcompiler option.
HNSW Hierarchical Navigable Small World ([20], arXiv'16). Originial source code
-Ofastcompiler option.
RCT Rank Cover Tree ([14], TPAMI'15). Originial source code is obtained from authors by email, who kindly allow us to release them through this benchmark.
KGraph ([15] [16], WWW'11). Originial source code
metric.cpp
DPG Diversified Proximity Graph ([1]). Originial source code We engineered a new algorithm, DPG, that constructs an alternative neighborhood graph index, yet uses the same search algorithm as the KGraph. DPG is more robust than KGraph across different datasets. For more details about the algorithm and its analysis, please refer to our NNS Experimental Evaluation Paper[1].
NSG Navigating Spreading-out Graph ([21], VLDB 2019). Originial source code
Currently, we use
Table 1 summarizes the characteristics of the datasets including the number of data points (n), dimensionality (d), Relative Contrast (RC [18]), local intrinsic dimensionality (LID [19]), and data type, where RC and LID are used to describe the hardness of the datasets.
| Name | n X 10^3 | d | RC | LID | Type| | ------ | ------ | ------ | ------ | ------ | -------- | | Nus* | 269 | 500 | 1.67 | 24.5 | Image | | Gist* | 983 | 960 | 1.94 | 18.9 | Image | | Rand* | 1,000 | 100 | 3.05 | 58.7 | Synthetic | | Glove* | 1,192 | 100 | 1.82 | 20.0 | Text | | Cifa | 50 | 512 | 1.97 | 9.0 | Image | | Audio | 53 | 192 | 2.97 | 5.6 | Audio | | Mnist | 69 | 784 | 2.38 | 6.5 | Image | | Sun | 79 | 512 | 1.94 | 9.9 | Image | | Enron | 95 | 1,369 | 6.39 | 11.7 | Text | | Trevi | 100 | 4,096 | 2.95 | 9.2 | Image | | Notre | 333 | 128 | 3.22 | 9.0 | Image | | Yout | 346 | 1,770 | 2.29 | 12.6 | Video | | Msong | 922 | 420 | 3.81 | 9.5 | Audio | | Sift | 994 | 128 | 3.50 | 9.3 | Image | | Deep | 1,000 | 128 | 1.96 | 12.1 | Image | | Ben | 1,098 | 128 | 1.96 | 8.3 | Image | | Imag | 2,340 | 150 | 2.54 | 11.6 | Image | | Gauss | 2,000 | 512 | 3.36 | 19.6 | Synthetic | | UQ-V | 3,038 | 256 | 8.39 | 7.2 | Video | | BANN | 10,000 | 128 | 2.60 | 10.3 | Image |
Table 1: Dataset Summary
We mark the first four datasets in Table 1 with asterisks to indicate that they are hard datasets compared with others.
Below, we give more descriptions of these datasets.
Sift consists of 1 million 128-d SIFT vectors.
Nusw includes around $2.7$ million web images, each as a 500-dimensional bag-of-words vector.
Msong is a collection of audio features and metadata for a million contemporary popular music tracks with $420$ dimensions.
Gist is is an image dataset which contains about 1 million data points with 960 dimensions.
Deep contains deep neural codes of natural images obtained from the activations of a convolutional neural network, which contains about 1 million data points with 256 dimensions.
Bann is used to evaluate the scalability of the algorithms, where 1M, 2M, 4M, 6M, 8M, and 10M data points are sampled from 128-dimensional SIFT descriptors extracted from natural images.
Guass is generated by randomly choosing $1000$ cluster centers with in space $[0,10]^512$, and each cluster follows the a Gaussian distribution with deviation 1 on each dimension.
Random contains $1$M randomly chosen points in a unit hypersphere with dimensionality 100.
Audio has about 0.05 million 192-d audio feature vectors extracted by Marsyas library from DARPA TIMIT audio speed dataset.
Cifar is a labeled subset of TinyImage dataset, which consists of 60000 32 X color images in 10 classes, with each image represented by a 512-d GIST feature vector.
Enron origins from a collection of emails. Yifang et. al. extract bi-grams and form feature vectors of 1369 dimensions.
Glove contains 1.2 million 100-d word feature vectors extracted from Tweets.
Imag is introduced and employed by The ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which contains about 2.4 million data points with 150-d dense SIFT features.
Mnist consists of 70k images of hand-written digits, each as a 784-d vector concatenating all pixels. we randomly sample 1k as the queries and use the remaining as the data base.
Sun contains about 0.08 million 512-d GIST features of images.
Notre contains about 0.3 million 128-d features of a set of Flickr images and a reconstruction.
UKbench contains about 1 million 128-d features of images.
Trevi consists of 0.4 million X 1024 bitmap(.bmp) images, each containing a
16 X 16array of image patch. Each patch is sampled as
64 X 64grayscale, with a canonical scale and orientation. Therefore, Trevi patch dataset consists of around 100,000 4096-d vectors.
UQ_V is a video dataset. A local feature based on some keyframes are extracted which include 256 dimensions.
Yout contains 3,425 videos of 1,595 different people. All the videos were downloaded from YouTube. An average of 2.15 videos are available for each subject. The shortest clip duration is 48 frames, the longest clip is 6,070 frames, and the average length of a video clip is 181.3 frames.
Please refer to our NNS Experimental Evaluation paper[1].
Our own code is released under the Apache License Version 2.0. Copyright is owned by DBWang Group (University of New South Wales, Australia) and Database group at QCIS, UTS (Centre for Quantum Computation & Intelligent Systems, The University of Technology Sydney, Australia).
Below are the license information for the included implementations:
KGraph: BSD license. Users are still encouraged to download the binary distributions from its home site so building issues can be avoided.
NMSLib and Annoy: Apache License Version 2.0.
AGH: License information below
Terms of Use -- Copyright (c) 2009-2011 by -- DVMM Laboratory Department of Electrical Engineering Columbia University Rm 1312 S.W. Mudd, 500 West 120th Street New York, NY 10027 USA -- If it is your intention to use this code for non-commercial purposes, such as in academic research, this code is free. -- If you use this code in your research, please acknowledge the authors, and cite our related publication: --Wei Liu, Jun Wang, Sanjiv Kumar, and Shih-Fu Chang, "Hashing with Graphs," International Conference on Machine Learning (ICML), Bellevue, Washington, USA, 2011
SRS: GPL License.
FLANN: BSD License.
RCT: The Authors grant us the permission to release source code by email.
Algorithms whose license information are not mentioned: NSH, OPQ, QALSH, SGH, and SH.
Here, we present the Recall with respect to Speedup of all the algorithms on four datasets, the k is set to be 20.
Sift Gist Glove MillionSong
[1] W. Li, Y. Zhang, Y. Sun, W. Wang, W. Zhang, X. Lin, Nearest Neighbor Search on High Dimensional Data — Experiments, Analyses, and Improvement (v1.0). CoRR, vol. abs/1610.02455, 2016. Online version
[2] Q. Huang, J. Feng, Y. Zhang, Q. Fang, and W. Ng. Query-aware locality-sensitive hashing for approximate nearest neighbor search. PVLDB, 9(1):1–12, 2015.
[3] Y. Sun, W. Wang, J. Qin, Y. Zhang, and X. Lin. SRS: solving c-approximate nearest neighbor queries in high dimensional euclidean space with a tiny index. PVLDB,8(1):1–12, 2014
[4] Q. Jiang and W. Li. Scalable graph hashing with feature transformation. In IJCAI, pages 2248–2254, 2015.
[5] W. Liu, J. Wang, S. Kumar, and S. Chang. Hashing with graphs. In ICML, pages 1–8, 2011.
[6] Y. Park, M. J. Cafarella, and B. Mozafari. Neighbor-sensitive hashing. In PVLDB, 9(3):144–155, 2015.
[7] J. Gao, H. V. Jagadish, B. C. Ooi, and S. Wang. Selective hashing: Closing the gap between radius search and k-nn search. In SIGKDD, 2015.
[8] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization. IEEE Trans. Pattern Anal. Mach. Intell., 36(4):744–755, 2014.
[9] B. Naidan, L. Boytsov, and E. Nyberg. Permutation search methods are efficient, yet faster search is possible. PVLDB, 8(12):1618–1629, 2015.
[10] M. Muja and D. G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. IEEE Trans. Pattern Anal. Mach. Intell., 36(11):2227–2240, 2014.
[11] E. Bernhardsson. Annoy at github.
[12] L. Boytsov and B. Naidan. Learning to prune in metric and non-metric spaces. In NIPS, 2013.
[13] Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov. Approximate nearest neighbor algorithm based on navigable small world graphs. Inf. Syst., 45:61–68, 2014
[14] M. E. Houle and M. Nett. Rank-based similarity search: Reducing the dimensional dependence. IEEE TPAMI, 37(1):136–150, 2015.
[15] W. Dong. Kgraph website.
[16] W. Dong, M. Charikar, and K. Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In WWW, 2011.
[17] A. Babenko and V. S. Lempitsky. The inverted multi-index. In CVPR, pages 3069–3076, 2012.
[18] J. He, S. Kumar, and S. Chang. On the difficulty of nearest neighbor search. In ICML, 2012.
[19] L. Amsaleg, O. Chelly, T. Furon, S. Girard, M. E. Houle, K. Kawarabayashi, and M. Nett. Estimating local intrinsic dimensionality. In SIGKDD, 2015.
[20] Yu. A. Malkov and D. A. Yashunin Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs, In arXiv, 2016
[21] C. Fu, C. Wang, and D. Cai. Fast approximate nearest neighbor search with navigating spreading-out graphs. In VLDB, pages 461- 474, 2019.