Slide 1
Hashing: Object Embedding
Reporter: Xu Jiaming (Ph.D. Student)
Date: 2014.03.27
Computational-Brain Research Center
Institute of Automation, Chinese Academy of Sciences
Report
Slide 2
First, What is Embedding?
[Source]: https://en.wikipedia.org/wiki/Embedding
When some object X is said to be embedded in another object Y,
the embedding is given by some injective and structure-
preserving map f : X → Y. The precise meaning of "structure-
preserving" depends on the kind of mathematical structure of which
X and Y are instances.
Structure-Preserving in IR:
f : X → Y,  such that  Sim(X₁, X₂) ≈ Sim(Y₁, Y₂)
Slide 3
Then, What is Hash?
[Source]: https://en.wikipedia.org/wiki/Hash_table
The hash function will assign each key to a unique bucket, but this
situation is rarely achievable in practice (usually some keys will
hash to the same bucket). Instead, most hash table designs assume
that hash collisions—different keys that are assigned by the hash
function to the same bucket—will occur and must be
accommodated in some way.
Slide 4
Combine the Two Properties
[1998, Piotr Indyk, cited: 1847]
Locality Sensitive Hashing
if D(p, q) ≤ r,         then Pr[h(p) = h(q)] ≥ p₁
if D(p, q) > (1 + ε)r,  then Pr[h(p) = h(q)] ≤ p₂
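As a concrete toy instance of this definition, the random-hyperplane family (SimHash) can be sketched as follows; this is one classic LSH family, not necessarily the exact construction of the cited paper, and the dimensions, bit count, and noise level below are illustrative choices:

```python
import numpy as np

# Sketch of random-hyperplane LSH: each bit is the sign of a projection onto
# a random hyperplane, so nearby vectors collide in more bits than distant
# ones -- the locality-sensitive property stated above.
rng = np.random.default_rng(0)

def make_hash(dim, n_bits):
    planes = rng.standard_normal((n_bits, dim))             # random hyperplanes
    return lambda x: tuple(int(b) for b in (planes @ x > 0))  # sign pattern

h = make_hash(dim=50, n_bits=8)

d_near = d_far = 0
for _ in range(200):
    x = rng.standard_normal(50)
    near = x + 0.1 * rng.standard_normal(50)   # small perturbation of x
    far = rng.standard_normal(50)              # unrelated random vector
    cx, cn, cf = h(x), h(near), h(far)
    d_near += sum(a != b for a, b in zip(cx, cn))
    d_far += sum(a != b for a, b in zip(cx, cf))

# Average Hamming distance to the perturbed copy stays far below the
# distance to the random vector, mirroring Pr[h(p) = h(q)] falling with D(p, q).
print(d_near / 200, d_far / 200)
```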
Slide 9
Data-Aware: Spectral Hashing [NIPS.2008]
min:  Σ_ij S_ij ‖y_i − y_j‖²
s.t.: y_i ∈ {−1, 1}^k
      Σ_i y_i = 0
      (1/n) Σ_i y_i y_iᵀ = I

In matrix form:

min:  trace(Yᵀ(D − W)Y)
s.t.: Y(i, j) ∈ {−1, 1}
      Yᵀ1 = 0
      YᵀY = I

Laplacian Eigenmap
XW = Y
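The spectral relaxation above can be sketched on toy data: drop the binary constraint, solve the Laplacian eigenproblem (the Laplacian Eigenmap step), and binarize by sign. The two-cluster dataset and Gaussian affinity below are made-up illustrations; the actual paper instead uses analytic eigenfunctions of an assumed uniform data distribution for efficiency.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)),
               rng.normal(5, 1, (20, 2))])          # two separated clusters

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
W = np.exp(-sq / 2.0)                                # Gaussian affinity S_ij
D = np.diag(W.sum(axis=1))                           # degree matrix
L = D - W                                            # graph Laplacian

# Smallest non-trivial eigenvectors of L (skip the constant one at
# eigenvalue 0), thresholded at zero to obtain k-bit codes.
k = 2
vals, vecs = np.linalg.eigh(L)                       # eigenvalues ascending
Y = np.sign(vecs[:, 1:k + 1])

# The first bit follows the Fiedler vector and separates the two clusters.
print(Y[:3, 0], Y[20:23, 0])
```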
Slide 10
Some Questions?
1. Can we obtain hashing codes by binarizing the real-valued low-
dimensional vectors such as LSI?
2. Can we get hashing codes by Deep Learning approaches such
as RBM, or AutoEncoder?
Slide 11
Some Questions?
1. Can we obtain hashing codes by binarizing the real-valued low-
dimensional vectors such as LSI?
Of course!
[R. Salakhutdinov, G. Hinton. Semantic Hashing, SIGIR2007]
2. Can we get hashing codes by Deep Learning approaches such
as RBM, or AutoEncoder?
No problem!
[R. Salakhutdinov, G. Hinton. Semantic Hashing, SIGIR2007]
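Question 1 can be illustrated with a minimal sketch: run LSI (truncated SVD) on a tiny made-up term-document matrix and threshold the low-dimensional document vectors. This only demonstrates the binarization idea; it is not the Semantic Hashing model itself, which trains a deep autoencoder.

```python
import numpy as np

# 6 terms x 4 documents: docs 0,1 share one group of terms,
# docs 2,3 share a different group (values are made up).
A = np.array([[2, 3, 0, 0],
              [1, 2, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 3, 2],
              [0, 0, 1, 3],
              [0, 0, 2, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs = Vt[:k].T                                   # LSI document vectors, 4 x k
codes = (docs > docs.mean(axis=0)).astype(int)    # binarize per dimension

# Documents with similar term usage receive identical codes,
# and the two groups receive different codes.
print(codes)
```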
Slide 13
1/9 - ICML2013:
Title: Learning Hash Functions Using Column Generation
Authors: Xi Li, Guosheng Lin, Chunhua Shen, Anton van den Hengel, Anthony Dick
Organization: The University of Adelaide (Australia)
Based On: NIPS2005: Distance Metric Learning for Large Margin Nearest Neighbor Classification
Motivation: In content-based image retrieval, to collect feedback, users may be asked to report
whether image x looks more similar to x⁺ than to a third image x⁻. This task is typically much
easier than labeling each individual image.
min_{w,ξ}  1ᵀw + C Σ_{i=1}^{J} ξ_i
s.t.:  w ≥ 0,  ξ ≥ 0;
       d_H(x_i, x_i⁻) − d_H(x_i, x_i⁺) ≥ 1 − ξ_i,  ∀i
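The triplet constraint above can be checked numerically for one made-up triplet, taking d_H to be a weighted Hamming distance with nonnegative bit weights w (all codes and weights below are illustrative, not learned):

```python
import numpy as np

w = np.array([0.5, 1.0, 0.25])              # nonnegative per-bit weights

def d_h(code_a, code_b):
    # Weighted Hamming distance: sum of weights over disagreeing bits.
    return float(w @ (code_a != code_b))

x     = np.array([1, 1, 0])
x_pos = np.array([1, 1, 1])                 # labeled similar to x
x_neg = np.array([0, 0, 1])                 # labeled dissimilar to x

# The constraint requires the dissimilar pair to be farther than the
# similar pair by a margin of 1, up to slack xi_i.
margin = d_h(x, x_neg) - d_h(x, x_pos)
xi = max(0.0, 1.0 - margin)                 # slack needed for this triplet
print(margin, xi)
```

Here d_h(x, x⁻) = 1.75 and d_h(x, x⁺) = 0.25, so the margin of 1.5 already exceeds 1 and no slack is needed.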
Slide 14
2/9 - ICML2013:
Title: Predictable Dual-View Hashing
Authors: Mohammad Rastegari, Jonghyun Choi, Shobeir Fakhraei, Hal Daume III, Larry S. Davis
Organization: The University of Maryland (USA)
Motivation: It is often the case that information about data is available from two or more views, e.g.,
images and their textual descriptions. It is highly desirable to embed information from both domains in
the binary codes, to increase search and retrieval capabilities.
min_{W_T, W_V, Y_T, Y_V}  ‖W_TᵀX_T − Y_T‖² + ‖Y_T Y_Tᵀ − I‖² + ‖W_VᵀX_V − Y_V‖² + ‖Y_V Y_Vᵀ − I‖²
s.t.:  Y_T = sgn(W_TᵀX_T)
       Y_V = sgn(W_VᵀX_V)
Slide 15
3/9 - SIGIR2013:
Title: Semantic Hashing Using Tags and Topic Modeling.
Authors: Qifan Wang, Dan Zhang, Luo Si
Organization: Purdue University (USA)
Motivation: Two major issues are not addressed in the existing hashing methods: (1) Tag information
is not fully utilized in previous methods. Most existing methods only deal with the contents of
documents without utilizing the information contained in tags; (2) Document similarity in the
original keyword feature space is used as guidance for generating hashing codes in previous methods,
which may not fully reflect the semantic relationship.
min_{Y,U}  ‖T − UᵀY‖²_F + C‖U‖²_F + γ‖Y − θ‖²_F
s.t.:  Y ∈ {−1, 1}^{k×n},  Y1 = 0
Slide 16
3/9 - SIGIR2013 (cont.): Semantic Hashing Using Tags and Topic Modeling
[Results figure: experiments on 20Newsgroups]
Slide 17
4/9 - IJCAI2013:
Title: A Unified Approximate Nearest Neighbor Search Scheme by Combining Data Structure and
Hashing.
Authors: Debing Zhang, Genmao Yang, Yao Hu, Zhongming Jin, Deng Cai, Xiaofei He
Organization: Zhejiang University (China)
Motivation: Traditionally, to solve the problem of nearest neighbor search, researchers have mainly
focused on building effective data structures, such as hierarchical k-means trees, or on using hashing
methods to accelerate the query process. In this paper, the authors propose a novel unified approximate
nearest neighbor search scheme that combines the advantages of an effective data structure with the
fast Hamming distance computation of hashing methods.
Slide 18
5/9 - CVPR2013:
Title: K-means Hashing: an Affinity-Preserving Quantization Method for Learning Binary
Compact Codes.
Authors: Kaiming He, Fang Wen, Jian Sun
Organization: Microsoft Research Asia (China)
Motivation: Both Hamming-based and lookup-based methods have attracted growing interest recently,
and each category has its benefits depending on the scenario. Lookup-based methods have been shown
to be more accurate than some Hamming-based methods at the same code length. However, lookup-based
distance computation is slower than Hamming distance computation. Hamming-based methods also have
the advantage that the distance computation is problem-independent.
E_aff = Σ_{i=0}^{k−1} Σ_{j=0}^{k−1} w_ij ( d(c_i, c_j) − d_h(i, j) )²
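The affinity-preserving error above can be evaluated directly on a toy 2-bit codebook (k = 4 codewords at the corners of a unit square), with the pair weights w_ij set to 1 uniformly for simplicity (the paper weights pairs by how often they occur in the data):

```python
import numpy as np
from itertools import product

def e_aff(centers):
    # E_aff with w_ij = 1: squared gap between the Euclidean distance of
    # codewords c_i, c_j and the Hamming distance of their binary indices.
    k = len(centers)
    err = 0.0
    for i, j in product(range(k), range(k)):
        d_euc = np.linalg.norm(centers[i] - centers[j])
        d_ham = bin(i ^ j).count("1")        # Hamming distance of indices i, j
        err += (d_euc - d_ham) ** 2
    return err

# Indexing the corners so that 1-bit flips move to adjacent corners
# preserves affinity better than an indexing where 1-bit flips jump
# across the diagonal.
good = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)  # index i at its bits
bad  = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], float)
print(e_aff(good), e_aff(bad))
```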
Slide 19
6/9 - ICCV2013:
Title: Complementary Projection Hashing.
Authors: Zhongming Jin¹, Yao Hu¹, Yue Lin¹, Debing Zhang¹, Shiding Lin², Deng Cai¹, Xuelong Li³
Organization: 1. Zhejiang University, 2. Baidu Inc., 3. Chinese Academy of Sciences, Xi’an (China)
Motivation: 1. (a) Hyperplane a crosses a sparse region, and the neighbors are quantized into the
same bucket; (b) hyperplane b crosses a dense region, and the neighbors are quantized into different
buckets. Apparently, hyperplane a is more suitable as a hashing function. 2. (a)(b) Both hyperplane a
and hyperplane b can evenly separate the data. (c) However, putting them together does not generate a
good two-bit hash function. (d) A better example of a two-bit hash function.
Slide 20
7/9 - CVPR2013:
Title: Hash Bit Selection: a Unified Solution for Selection Problems in Hashing.
Authors: Xianglong Liu¹, Junfeng He²,³, Bo Lang¹, Shih-Fu Chang².
Organization: 1. Beihang University(China), 2. Columbia University(US), 3. Facebook(US)
Motivation: Recent years have witnessed the active development of hashing techniques for nearest
neighbor search over big datasets. However, to apply hashing techniques successfully, several
important issues remain open: selecting features, hashing algorithms, parameter settings, kernels, etc.
Slide 21
8/9 - ICCV2013:
Title: A General Two-Step Approach to Learning-Based Hashing.
Authors: Guosheng Lin, Chunhua Shen, David Suter, Anton van den Hengel.
Organization: University of Adelaide (Australia)
Based On: SIGIR2010: Self-Taught Hashing for Fast Similarity Search
Motivation: Most existing approaches to hashing apply a single form of hash function, and an
optimization process that is typically deeply coupled to this specific form. This tight coupling
restricts the flexibility of the method to respond to the data, and can result in complex optimization
problems that are difficult to solve. Their framework decomposes the hashing learning problem into two
steps: hash bit learning, and hash function learning based on the learned bits.
Slide 22
9/9 - IJCAI2013:
Title: Smart Hashing Update for Fast Response.
Authors: Qiang Yang, Long-Kai Huang, Wei-Shi Zheng, Yingbiao Ling.
Organization: Sun Yat-sen University (China)
Based On: DMKD2012: Active Hashing and Its Application to Image and Text Retrieval
Motivation: Although most existing hashing-based methods have been shown to obtain high accuracy,
they are regarded as passive hashing and assume that the labeled points are provided in advance. In this
paper, the authors consider updating a hashing model over gradually increasing labeled data while
responding quickly to users, called smart hashing update (SHU).
1. Consistency-based Selection;
2. Similarity-based Selection.
[CVPR.2012]
Diff(k, j) = min{ num(k, j, −1), num(k, j, 1) }

min_{H_l ∈ {−1,1}^{l×r}}  Q = ‖(1/r) H_l H_lᵀ − S_l‖²_F

min_{k ∈ {1,2,…,r}}  R_k = ‖rS − H_{r−1}^k (H_{r−1}^k)ᵀ‖²_F
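One plausible reading of the consistency-based score above (an assumption for illustration, not taken verbatim from the paper): num(k, j, v) counts the labeled similar pairs in set j whose codes agree (v = 1) or disagree (v = −1) on bit k, so Diff(k, j) is small when bit k treats the pairs decisively and large when the bit is ambiguous. A toy computation with made-up codes and pairs:

```python
import numpy as np

H = np.array([[ 1,  1, -1],
              [ 1, -1, -1],
              [ 1,  1,  1],
              [-1, -1,  1]])               # 4 labeled items x r = 3 bits

similar_pairs = [(0, 1), (0, 2)]           # labeled "must-link" pairs

def diff(k, pairs):
    # num(k, j, 1): pairs whose codes agree on bit k; num(k, j, -1): the rest.
    agree = sum(int(H[a, k] == H[b, k]) for a, b in pairs)
    disagree = len(pairs) - agree
    return min(agree, disagree)

scores = [diff(k, similar_pairs) for k in range(H.shape[1])]
print(scores)  # bit 0 agrees on every similar pair, so its score is 0
```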
Slide 23
Reporter: Xu Jiaming (Ph.D. Student)
Date: 2014.03.27
Computational-Brain Research Center
Institute of Automation, Chinese Academy of Sciences
[Speaker notes] This ICML paper is work from the University of Maryland (USA). Its motivation is that information is now typically acquired through multiple channels, for example images together with their textual descriptions; the paper learns hash codes from both domains at once, so that text and images are ultimately mapped into the same Hamming space for retrieval. The lower-right corner shows their experimental examples: in the bottom row, the query "Plane flying on the air" indeed returns pictures of planes flying in the sky. Likewise, for the top query "Laptop placed on the table", the returned images contain not only laptops but also a television.
[Speaker notes] This SIGIR paper is work from Purdue University (USA). The head of their lab is Luo Si, who completed his bachelor's and master's degrees at Tsinghua University, then went to CMU for another master's and a Ph.D., and afterwards joined the faculty at Purdue, focusing on information retrieval, machine learning, and natural language processing. Most of his students also came to Purdue from Tsinghua, and the lab has done a lot of meaningful work in IR. The motivation of this paper comes from two directions: (1) existing hashing methods do not make use of tag information; since some earlier hashing work did in fact consider it, the authors add the qualifier "fully"; (2) earlier hashing work preserved inter-document similarity through the original document features, which cannot fully reflect the semantic relationships between documents, so they introduce topic modeling. When I was writing a paper last year I also surveyed this area and indeed found no prior work introducing topic models into hashing, so with a newcomer's fearlessness I wrote up a fast retrieval method based on topic features. Accordingly, this paper also claims that, as far as the authors know, theirs is the first work to bring topic models into hashing. In the formula, θ is the topic feature, T is the tag information, and U is a latent variable that models the tags; the constraints apply to the hash codes. The lower-right corner shows their experimental results on 20Newsgroups and WebKB; the improvement is actually not very pronounced, since apart from SSH, none of the baselines uses tag information.