1. 1
Learning to Hash for Large-Scale Search
Xu Jiaming
Chinese Academy of Sciences
2014-07-04 @CUHK
2. 2
Motivation
Similarity-based search is widely used in many applications
– Image/video search and retrieval: find the most similar images/videos
– Audio search: find similar songs
– Product search: find shoes with similar style but different color
– Patient search: find patients with similar diagnostic status
Two key components:
– Similarity/distance measure
– Indexing scheme
Whittlesearch (Kovashka et al. 2013)
- 2013 CIKM Tutorial by Jun Wang
3. 3
A Conceptual Diagram for Hashing Based Image Search System
[Diagram components: Image Database; Hash Function Design; Indexing and Search; Similarity Search & Retrieval; Reranking / Refinement; Visual Search Applications]
Designing compact yet accurate hashing codes is a
critical component to make the search effective
- 2013 CIKM Tutorial by Jun Wang
9. 9
STH [2010-SIGIR]
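Self-Taught Hashing (STH)
Pipeline: Unsupervised Learning, then Supervised Learning
Laplacian Eigenmap:
$$\min \sum_{i,j} S_{ij}\,\|y_i - y_j\|^2 \quad \text{s.t.}\quad y_i \in \{-1,1\}^k,\ \ \sum_i y_i = 0,\ \ \frac{1}{n}\sum_i y_i y_i^T = I$$
In matrix form, with Y the matrix whose rows are the codes y_i:
$$\min \operatorname{trace}\big(Y^T (D - W)\, Y\big) \quad \text{s.t.}\quad Y(i,j) \in \{-1,1\},\ \ Y^T \mathbf{1} = 0,\ \ Y^T Y = I$$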
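A minimal sketch of one common way to approximate the Laplacian Eigenmap objective on this slide: drop the binary constraint, take the eigenvectors of the graph Laplacian D − W with the smallest non-trivial eigenvalues, then threshold each dimension at its median to get bits in {−1, 1}. The k-nearest-neighbour cosine graph, the function name `sth_codes`, and its parameters are assumptions for illustration and do not come from the slide.

```python
import numpy as np

def sth_codes(X, k_bits=4, n_neighbors=5):
    """Relaxed Laplacian Eigenmap codes, thresholded at the median."""
    n = X.shape[0]
    # Cosine similarity between rows of X (assumed similarity measure).
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)               # exclude self-neighbours
    # Symmetric k-nearest-neighbour affinity matrix W with non-negative weights.
    W = np.zeros((n, n))
    nn = np.argsort(-sim, axis=1)[:, :n_neighbors]
    for i in range(n):
        W[i, nn[i]] = np.maximum(sim[i, nn[i]], 0.0)
    W = np.maximum(W, W.T)
    D = np.diag(W.sum(axis=1))
    # Eigenvectors of L = D - W, returned with eigenvalues in ascending order.
    _, vecs = np.linalg.eigh(D - W)
    # Skip the first eigenvector (constant when the graph is connected),
    # since it cannot satisfy Y^T 1 = 0.
    Y_real = vecs[:, 1:k_bits + 1]
    # Median threshold per dimension gives bits in {-1, +1}.
    return np.where(Y_real > np.median(Y_real, axis=0), 1, -1)

rng = np.random.default_rng(0)
X = rng.random((40, 10))                         # toy data: 40 items, 10 features
codes = sth_codes(X, k_bits=4, n_neighbors=5)
print(codes.shape)
```

In STH the codes obtained this way serve as training labels for the supervised stage, which learns to predict each bit for unseen queries; that stage is not shown here.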
12. 12
ITQ [2011-CVPR, 2013-TPAMI]
Iterative Quantization
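Apply PCA for dimensionality reduction: find W to maximize
$$\frac{1}{n}\operatorname{tr}\big(W^T X^T X W\big) \quad \text{s.t.}\quad W^T W = I$$
Keep the top c eigenvectors of the data covariance matrix X^T X to obtain W; the projected data is V = XW.
Note that if W is an optimal solution, then WR is also optimal for any orthogonal c × c matrix R.
Key idea: find R to minimize the quantization loss
$$Q(B, R) = \|B - VR\|_F^2, \qquad B \in \{-1,1\}^{n \times c}$$
Expanding gives Q(B, R) = nc + \|V\|_F^2 − 2 tr(B R^T V^T). nc and V are fixed, so this is equivalent to maximizing
$$\operatorname{tr}\big(B R^T V^T\big)$$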
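The alternating scheme behind ITQ can be sketched as follows, assuming zero-centred data, a random orthogonal initialisation, and a fixed iteration count. With R fixed, the best B is the sign of VR; with B fixed, the best orthogonal R comes from an SVD (the orthogonal Procrustes problem). The function name `itq`, its arguments, and the iteration count are illustrative choices, not taken from the slide.

```python
import numpy as np

def itq(X, n_bits, n_iter=50, seed=0):
    """Return (B, W, R, losses): codes, PCA projection, rotation, loss per iteration."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                      # zero-centre the data
    # PCA: top n_bits eigenvectors of the covariance matrix.
    cov = Xc.T @ Xc / Xc.shape[0]
    _, vecs = np.linalg.eigh(cov)                # eigenvalues ascending
    W = vecs[:, ::-1][:, :n_bits]
    V = Xc @ W                                   # projected data
    # Random orthogonal initialisation of R via QR decomposition.
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))
    losses = []
    for _ in range(n_iter):
        B = np.where(V @ R >= 0, 1.0, -1.0)      # fix R, update B
        U, _, Vt = np.linalg.svd(V.T @ B)        # fix B, update R (Procrustes)
        R = U @ Vt
        losses.append(np.linalg.norm(B - V @ R) ** 2)
    B = np.where(V @ R >= 0, 1.0, -1.0)
    return B, W, R, losses

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 16))               # toy data
B, W, R, losses = itq(X, n_bits=8)
print(losses[0], losses[-1])
```

Each half-step can only lower the quantization loss, so the recorded losses should be non-increasing up to rounding error.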
15. 15
SHU [2013-IJCAI]
Smart Hashing Update
1. Consistency-based Selection;
2. Similarity-based Selection.
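$$\mathrm{Diff}(k, j) = \min\{\mathrm{num}(k, j, -1),\ \mathrm{num}(k, j, 1)\}$$
$$\min_{H_l \in \{-1,1\}^{l \times r}} Q = \left\| \frac{1}{r} H_l H_l^T - S \right\|_F^2$$
$$\min_{k \in \{1,2,\dots,r\}} R^k = \left\| rS - H_{r-1}^k \big(H_{r-1}^k\big)^T \right\|_F^2$$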
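A small sketch of how the Diff(k, j) quantity might be computed. It assumes that num(k, j, b) counts the labeled samples of class j whose k-th hash bit equals b, and that the bits with the largest total Diff over classes are the ones picked for updating. Both readings are assumptions about the slide's notation, and `consistency_diff` and `select_bits` are hypothetical names.

```python
import numpy as np

def consistency_diff(H, labels):
    """diff[k, j] = min(# class-j samples with bit k == -1, # with bit k == +1)."""
    classes = np.unique(labels)
    diff = np.zeros((H.shape[1], classes.size), dtype=int)
    for j, c in enumerate(classes):
        Hc = H[labels == c]                      # codes of class c, shape (n_c, r)
        num_pos = (Hc == 1).sum(axis=0)
        num_neg = (Hc == -1).sum(axis=0)
        diff[:, j] = np.minimum(num_neg, num_pos)
    return diff

def select_bits(H, labels, t):
    """Pick the t bits with the largest total inconsistency (assumed rule)."""
    total = consistency_diff(H, labels).sum(axis=1)
    return np.argsort(-total, kind="stable")[:t]

# Toy codes: 4 samples, 3 bits, 2 classes.
H = np.array([[ 1,  1, -1],
              [ 1, -1, -1],
              [-1,  1,  1],
              [-1, -1,  1]])
labels = np.array([0, 0, 1, 1])
diff = consistency_diff(H, labels)
chosen = select_bits(H, labels, t=1)
print(diff, chosen)
```

In the toy data, bits 0 and 2 agree within each class while bit 1 splits both classes evenly, so bit 1 is the one selected.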
16. 16
TSH [2014-ACL]
Two-Stage Hashing
– LSH for neighbor candidate pruning; ITQ for effective re-ranking.
– LSH captures term similarity; ITQ captures topic similarity.
Advantages:
– High hash lookup success rate, attained by the LSH stage
– High search precision, due to the ITQ re-ranking stage
– Scans only a small portion of the entire dataset
– Integrates two similarity measures
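The two-stage idea can be sketched as a bucket lookup followed by a Hamming re-ranking. This is a simplified illustration: stage 1 uses sign-random-projection LSH with a single hash table, and stage 2 takes a precomputed binary code matrix `codes` standing in for the ITQ codes. All function names and parameters are hypothetical.

```python
import numpy as np
from collections import defaultdict

def lsh_keys(X, planes):
    """Sign-random-projection LSH: one bit per hyperplane, packed as bytes."""
    return [row.tobytes() for row in (X @ planes >= 0).astype(np.uint8)]

def build_index(X, planes):
    """Map each LSH key to the list of item indices that fall in that bucket."""
    table = defaultdict(list)
    for i, key in enumerate(lsh_keys(X, planes)):
        table[key].append(i)
    return table

def two_stage_search(q, q_code, table, planes, codes, top_k=3):
    # Stage 1: candidates are the items sharing the query's LSH bucket.
    key = lsh_keys(q[None, :], planes)[0]
    cand = np.array(table.get(key, []), dtype=int)
    if cand.size == 0:
        return cand
    # Stage 2: re-rank candidates by Hamming distance between binary codes.
    dist = (codes[cand] != q_code).sum(axis=1)
    return cand[np.argsort(dist, kind="stable")[:top_k]]

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 32))               # toy dataset
planes = rng.standard_normal((32, 6))            # 6-bit LSH keys
codes = X @ rng.standard_normal((32, 16)) >= 0   # stand-in for ITQ codes
table = build_index(X, planes)
hits = two_stage_search(X[0], codes[0], table, planes, codes)
print(hits)
```

Only the items in the query's bucket are scanned in stage 2, which is where the "small portion of the dataset" advantage comes from.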
17. 17
SHTTM [2013-SIGIR]
Semantic Hashing Using Tags and Topic Modeling
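Hash Code Learning:
$$\min_{Y,U}\ \underbrace{\|T - U^T Y\|_F^2 + C\,\|U\|_F^2}_{\text{Tag Consistency}} + \underbrace{\gamma\,\|Y - \theta\|_F^2}_{\text{Similarity Preservation}} \quad \text{s.t.}\quad Y \in \{-1,1\}^{k \times n},\ \ Y\mathbf{1} = 0$$
Hash Function Learning:
$$y_j = f(x_j) = W x_j$$
$$W^* = \arg\min_{W} \sum_{j=1}^{n} \|y_j - W x_j\|^2 + \lambda \|W\|^2 \ \Rightarrow\ W^* = Y X^T \big(X X^T + \lambda I\big)^{-1}$$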
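The closed-form hash function W* = Y X^T (X X^T + λI)^{-1} is a ridge regression solution and can be computed directly. The sketch below assumes X holds the features as columns (d × n) and Y holds the learned codes as columns (k × n). The function names and the sign-based binarisation of new points are illustrative assumptions.

```python
import numpy as np

def learn_hash_function(X, Y, lam=1.0):
    """Closed-form ridge solution W* = Y X^T (X X^T + lam I)^{-1}.

    X: (d, n) features as columns; Y: (k, n) codes as columns.
    """
    d = X.shape[0]
    A = X @ X.T + lam * np.eye(d)                # symmetric positive definite
    # Solve W A = Y X^T without forming the inverse explicitly.
    return np.linalg.solve(A, X @ Y.T).T

def hash_codes(W, X):
    """Binarise W x for each column x by sign (assumed thresholding)."""
    return np.where(W @ X >= 0, 1, -1)

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 100))               # d = 10 features, n = 100 items
Y = np.where(rng.standard_normal((4, 100)) >= 0, 1.0, -1.0)   # toy 4-bit codes
lam = 0.5
W = learn_hash_function(X, Y, lam)
new_codes = hash_codes(W, X)
print(W.shape, new_codes.shape)
```

Solving the linear system instead of inverting X X^T + λI is a numerical convenience; the result is the same matrix.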
18. 18
DVH [2013-ICML]
Predictable Dual-View Hashing
The goal is to find two sets of hyperplanes that map the visual and textual spaces into a common subspace.
CCA
Multi-SVM
19. 19
MVH [2011-SIGIR]
Composite Hashing with Multiple Information Sources
Overall Objective
$$J(Y, W, \alpha) = J_S(Y) + J_C(Y, W)$$
$$= C_1 \sum_{k=1}^{M} \operatorname{tr}\big(Y \tilde{L}^{(k)} Y^T\big) + C_2 \Big\| Y - \sum_{k=1}^{M} \alpha_k \big(W^{(k)}\big)^T X^{(k)} \Big\|^2 + \sum_{k=1}^{M} \big\| W^{(k)} \big\|^2$$
24. 24
References
[1]. Gionis A, Indyk P, Motwani R. Similarity search in high dimensions via
hashing[C]//VLDB. 1999, 99: 518-529.
[2]. Andoni A, Indyk P. Near-optimal hashing algorithms for approximate nearest neighbor
in high dimensions[C]//Foundations of Computer Science, 2006. FOCS'06. 47th Annual
IEEE Symposium on. IEEE, 2006: 459-468.
[3]. Andoni A, Indyk P. Near-optimal hashing algorithms for approximate nearest
neighbor in high dimensions[J]. Communications of the ACM, 2008, 51(1): 117.
[4]. Charikar M S. Similarity estimation techniques from rounding
algorithms[C]//Proceedings of the thirty-fourth annual ACM symposium on Theory of
computing. ACM, 2002: 380-388.
[5]. Manku G S, Jain A, Das Sarma A. Detecting near-duplicates for web
crawling[C]//Proceedings of the 16th international conference on World Wide Web. ACM,
2007: 141-150.
[6]. Zhang D, Wang J, Cai D, et al. Self-taught hashing for fast similarity
search[C]//Proceedings of the 33rd international ACM SIGIR conference on Research
and development in information retrieval. ACM, 2010: 18-25.
[7]. Liu W, Wang J, Ji R, et al. Supervised hashing with kernels[C]//Computer Vision and
Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012: 2074-2081.
25. 25
References
[8]. Gong Y, Lazebnik S. Iterative quantization: A procrustean approach to learning binary
codes[C]//Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on.
IEEE, 2011: 817-824.
[9]. Gong Y, Lazebnik S, Gordo A, et al. Iterative quantization: A procrustean approach to
learning binary codes for large-scale image retrieval[J]. Pattern Analysis and Machine
Intelligence, IEEE Transactions on, 2013, 35(12): 2916-2929.
[10]. Lin G, Shen C, Suter D, et al. A general two-step approach to learning-based
hashing[C]//Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE,
2013: 2552-2559.
[11]. Yang Q, Huang L K, Zheng W S, et al. Smart hashing update for fast
response[C]//Proceedings of the Twenty-Third international joint conference on Artificial
Intelligence. AAAI Press, 2013: 1855-1861.
[12]. Li H, Liu W, Ji H. Two-Stage Hashing for Fast Document Retrieval[C]//ACL. 2014.
[13]. Wang Q, Zhang D, Si L. Semantic hashing using tags and topic
modeling[C]//Proceedings of the 36th international ACM SIGIR conference on Research
and development in information retrieval. ACM, 2013: 213-222.
[14]. Rastegari M, Choi J, Fakhraei S, et al. Predictable Dual-View
Hashing[C]//Proceedings of The 30th International Conference on Machine Learning.
2013: 1328-1336.
26. 26
References
[15]. Zhang D, Wang F, Si L. Composite hashing with multiple information
sources[C]//Proceedings of the 34th international ACM SIGIR conference on Research
and development in Information Retrieval. ACM, 2011: 225-234.
[16]. Szmit, Radosław. "Locality Sensitive Hashing for Similarity Search Using
MapReduce on Large Scale Data." Language Processing and Intelligent Information
Systems. Springer Berlin Heidelberg, 2013. 171-178.
[17]. Blog: Location Sensitive Hashing in Map Reduce:
http://horicky.blogspot.hk/2012/09/location-sensitive-hashing-in-map-reduce.html
[18]. Likelike Project: https://github.com/takahi-i/likelike
[19]. Jun Wang. Learning to Hash for Large-Scale Search. 2013 CIKM Tutorial.