Performance Benchmark of Locality Sensitive Hashing and KD-Tree Algorithms

1 The Purposes

  • Get familiar with the common ANN algorithms, such as KD-Tree and LSH
  • Learn the implementation of LSH and related coding skills
  • Analyze the performance of KD-Tree and LSH under different dimensionalities

2 The Principles

2.1 ANN

Efficient and accurate methods for Approximate Nearest Neighbor (ANN) queries are important for large-scale dataset applications such as image feature matching, recommendation systems, DNA sequencing, and so on.
Popular ANN methods include randomized kd-trees (KD-Tree), hierarchical k-means (HKM), and locality-sensitive hashing (LSH).

2.1.1 KD-Tree

KD-Tree partitions a vector space by recursively generating hyperplanes that cut along the coordinate with maximal variance in the data. KD-Tree is good for searching in low-dimensional spaces, while its efficiency decreases as dimensionality grows beyond about 10. Its complexity is summarized below, with a usage sketch after the list.

  • Building a kd-tree: O(N log N) time and O(N) space
  • Nearest neighbor search: O(log N)
  • M nearest neighbors: O(M log N)
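As an illustration, here is a minimal sketch of building and querying randomized kd-trees through pyflann (the bindings used later in the Procedures section); the random arrays stand in for real feature vectors, and the `trees`/`checks` values are arbitrary placeholders:

```python
import numpy as np
from pyflann import FLANN  # Python bindings shipped with FLANN

# Toy data standing in for the real feature vectors.
dataset = np.random.rand(10000, 128).astype(np.float32)
queries = np.random.rand(100, 128).astype(np.float32)

flann = FLANN()
# Build a forest of 4 randomized kd-trees: O(N log N) time, O(N) space.
flann.build_index(dataset, algorithm='kdtree', trees=4)
# Approximate search; 'checks' bounds how many leaves are visited.
neighbors, dists = flann.nn_index(queries, num_neighbors=5, checks=128)
```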

2.1.2 LSH

The LSH algorithm maps similar input items to the same hash code with higher probability than dissimilar items. For a given key size K, we generate K randomized hyperplanes and check whether a point P is above or below each hyperplane, assigning 1 or 0 accordingly. We then perform a Hamming-distance search over the K-bit keys to find approximate nearest neighbors. A numpy sketch follows the complexity summary below.

  • Building an LSH table: O(N*K) time and O(N) space, where K is the key size
  • Nearest neighbor search: O(1) in the best case
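A minimal numpy sketch of the random-hyperplane scheme described above; the key size and data are toy values, and only the exact bucket is probed rather than a full Hamming-ball search:

```python
import numpy as np

rng = np.random.RandomState(0)
data = rng.rand(1000, 128)       # toy stand-in for the feature vectors
query = rng.rand(128)

K = 16                           # key size: number of random hyperplanes
planes = rng.randn(K, 128)       # one hyperplane normal per key bit

def hash_key(point):
    # Bit b is 1 if the point lies above hyperplane b, 0 if below.
    return tuple((planes.dot(point) > 0).astype(int))

# Build the table: O(N*K) time, O(N) space.
table = {}
for i, p in enumerate(data):
    table.setdefault(hash_key(p), []).append(i)

# Candidates from the query's own bucket; a fuller version would also
# probe keys within a small Hamming distance of the query's key.
candidates = table.get(hash_key(query), [])
```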

2.2 Related Open Source Projects

FLANN is a library for performing fast approximate nearest neighbor searches in high-dimensional spaces. It contains a collection of algorithms its authors found to work best for nearest neighbor search, and a system for automatically choosing the best algorithm and optimal parameters depending on the dataset.

LsHash is an open-source project on GitHub that implements fast locality-sensitive hashing in Python with persistence support.

We benchmark our KD-Tree performance using the FLANN and pyflann projects. The dynamic library libflann.so is compiled and linked through the pyflann Python bindings. Our LSH algorithm is based on LsHash with some modifications.
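For reference, LsHash's public API is small; a hedged usage sketch on toy data (the parameter values below are placeholders, not our benchmark settings):

```python
import numpy as np
from lshash import LSHash  # pip install lshash

data = np.random.rand(1000, 128)

# hash_size is the key size K; num_hashtables is the table number L.
lsh = LSHash(hash_size=16, input_dim=128, num_hashtables=4)
for vec in data:
    lsh.index(vec)

# Candidate neighbors ranked by distance to the query point.
results = lsh.query(data[0], num_results=10)
```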

2.3 Datasets

2.3.1 SIFT10K

The SIFT10K dataset contains 128-dimensional SIFT features. It comprises 10 thousand data vectors and one hundred query vectors. For each query vector, a ground truth file is also given with the top k=100 nearest neighbors.

2.3.2 SIFT1M

The SIFT1M dataset contains 128-dimensional SIFT features. It comprises 1 million data vectors and 10 thousand query vectors. For each query vector, a ground truth file is also given with the top k=100 nearest neighbors.

2.3.3 GIST1M

The GIST1M dataset contains 960-dimensional GIST features. It comprises 1 million data vectors and one thousand query vectors. For each query vector, a ground truth file is also given with the top k=100 nearest neighbors.
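All three datasets come from the TEXMEX corpus, which stores vectors in the .fvecs/.ivecs layout: each record is a 4-byte integer dimension d followed by d 4-byte values. A small reader, assuming that layout:

```python
import numpy as np

def read_fvecs(path):
    # .fvecs: each record is a 4-byte int dimension d, then d float32s.
    raw = np.fromfile(path, dtype=np.float32)
    dim = raw[:1].view(np.int32)[0]
    return raw.reshape(-1, dim + 1)[:, 1:]

def read_ivecs(path):
    # .ivecs: same layout with int32 entries; used by ground-truth files.
    raw = np.fromfile(path, dtype=np.int32)
    dim = raw[0]
    return raw.reshape(-1, dim + 1)[:, 1:]
```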

2.4 Measure

We choose three main factors to benchmark the performance of KD-Tree and LSH: precision, which can be seen as a measure of exactness or quality, along with time and memory cost.
For each experiment, we plot an (x, y) line for each algorithm, where y is the search time gain over linear search and x is the corresponding precision. The precision is the average precision over all query points.
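Concretely, with toy numbers, the two measures might be computed as follows (the result lists and timings are placeholders, not real data):

```python
import numpy as np

def precision(found, truth):
    # Fraction of the ground-truth top-k that the ANN search returned.
    return len(set(found) & set(truth)) / float(len(truth))

found_ids = [[1, 2, 5], [7, 8, 9]]   # toy ANN results per query
truth_ids = [[1, 2, 3], [7, 8, 9]]   # toy ground truth per query
avg_precision = np.mean([precision(f, t)
                         for f, t in zip(found_ids, truth_ids)])  # x-axis

linear_time, ann_time = 2.0, 0.1     # toy timings in seconds
speedup = linear_time / ann_time     # y-axis: gain over linear search
```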
The expected experiment figure may look like Fig. 1:

Fig. 1: Expected experiment figure comparing the kd-tree and LSH algorithms. THIS IS NOT ACTUAL EXPERIMENTAL DATA!

3 The Procedures

This is the machine configuration:

  • Ubuntu 16.04, 64-bit, with Intel Core i7-4790K 4.0GHz x8 and 16GB RAM
  • Main programming languages: Python 2.7 & C++11
  • FLANN 1.8.4 and pyflann bindings
  • LSH implemented in Python, based on the open-source LsHash

3.1 Data Dimensionality

The standard dataset SIFT1M contains 128-dimensional SIFT features and GIST1M comprises 960-dimensional GIST features. We compare the performance of KD-Tree and LSH on dimensions 128 and 960, respectively.

For KD-Tree, the parameters include the tree number K and how many leaves to visit when searching for neighbors, namely Checks. For LSH, the parameters include the table number L and the key size.

For each dimension, we will calculate the distribution of (precision, search time gain over linear) and plot it using Python matplotlib.
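A sketch of such a sweep on the KD-Tree side, assuming toy random data and arbitrary parameter values; the exact linear scan doubles as ground truth and as the timing baseline:

```python
import time
import numpy as np
import matplotlib.pyplot as plt
from pyflann import FLANN

data = np.random.rand(10000, 128).astype(np.float32)   # toy stand-in
queries = np.random.rand(100, 128).astype(np.float32)
k = 100

# Exact results from a linear scan: ground truth plus timing baseline.
t0 = time.time()
exact, _ = FLANN().nn(data, queries, k, algorithm='linear')
linear_time = time.time() - t0

flann = FLANN()
flann.build_index(data, algorithm='kdtree', trees=4)

points = []
for checks in (16, 32, 64, 128, 256):
    t0 = time.time()
    approx, _ = flann.nn_index(queries, num_neighbors=k, checks=checks)
    speedup = linear_time / (time.time() - t0)
    prec = np.mean([len(set(a) & set(e)) / float(k)
                    for a, e in zip(approx, exact)])
    points.append((prec, speedup))

xs, ys = zip(*points)
plt.plot(xs, ys, 'o-', label='kd-tree (trees=4)')
plt.xlabel('precision')
plt.ylabel('search time gain over linear')
plt.legend()
plt.show()
```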
Fig. 2 and Fig. 3 show the corresponding expected results.

Fig. 2: (precision, search time gain over linear) distribution on SIFT dataset

Fig. 3: (precision, search time gain over linear) distribution on GIST dataset

Besides, we will calculate the memory cost for both 128 and 960 dimensionalities.
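One simple way to record the memory cost on Linux is the peak resident set size of the benchmarking process; this is a coarse measure that includes the Python interpreter itself:

```python
import resource

# Peak resident set size so far, reported in kilobytes on Linux.
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print('peak memory: %.1f MB' % (peak_kb / 1024.0))
```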

3.2 Dataset Size

We select dataset sizes of 10K and 1M from the standard datasets SIFT10K and SIFT1M. We discard the GIST1M dataset because its GIST feature dimension differs. For a better comparison, SIFT1B would also need to be taken into account; however, it is too large to run on my computer at present.
In this experiment, we will vary the tree number K and the search leaves Checks for the KD-Tree algorithm, and the table number L and key size for the LSH algorithm, to get the distribution of (precision, search time gain over linear) under different dataset sizes.
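The LSH side of the sweep might look like this, assuming toy data and the LsHash wrapper from Section 2.2 (the parameter grids and query are placeholders):

```python
import time
import numpy as np
from lshash import LSHash

data = np.random.rand(10000, 128)   # toy stand-in for SIFT vectors
query = data[0]

for L in (1, 2, 4):                 # table number
    for key_size in (8, 16, 32):    # bits per key
        lsh = LSHash(hash_size=key_size, input_dim=128, num_hashtables=L)
        for vec in data:
            lsh.index(vec)
        t0 = time.time()
        res = lsh.query(query, num_results=100)
        print(L, key_size, time.time() - t0, len(res))
```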
Fig. 4 and Fig. 5 show the corresponding expected results.

Fig. 4: (precision, search time gain over linear) distribution on SIFT10K dataset

Fig. 5: (precision, search time gain over linear) distribution on SIFT1M dataset

Besides, we will calculate the memory cost on both the 10K and 1M datasets.

3.3 Varying k

The ground truth dataset comprises the top k=100 neighbors for each query. To compare the performance of KD-Tree and LSH at specific values of k, we select k=1, 3, 30, 100. We choose SIFT1M as the dataset and discard SIFT10K and GIST1M so that every dimension other than k stays the same.
In this experiment, we will vary the tree number K and the search leaves Checks for the KD-Tree algorithm, and the table number L and key size for the LSH algorithm, to get the distribution of (precision, search time gain over linear) under different k.
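A sketch of the precision-at-k computation against the top-100 ground truth, with toy index arrays in place of real search results:

```python
import numpy as np

def precision_at_k(approx, truth, k):
    # Overlap between the first k approximate neighbors and the first k
    # ground-truth neighbors, averaged over all queries.
    hits = [len(set(a[:k]) & set(t[:k])) for a, t in zip(approx, truth)]
    return np.mean(hits) / float(k)

rng = np.random.RandomState(0)
truth = [rng.permutation(10000)[:100] for _ in range(10)]  # toy top-100
approx = [t[rng.permutation(100)] for t in truth]          # toy results

for k in (1, 3, 30, 100):
    print(k, precision_at_k(approx, truth, k))
```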

Fig. 6: (precision, search time gain over linear) distribution on SIFT1M dataset with query k=1

Fig. 7: (precision, search time gain over linear) distribution on SIFT1M dataset with query k=3

Fig. 8: (precision, search time gain over linear) distribution on SIFT1M dataset with query k=30

Fig. 9: (precision, search time gain over linear) distribution on SIFT1M dataset with query k=100

4 The Results

To be continued...

