1
Hashing: Object Embedding
Reporter: Xu Jiaming (Ph.D. Student)
Date: 2014.03.27
Computational-Brain Research Center
Institute of Automation, Chinese Academy of Sciences
Report
2
First, What is Embedding?
[Source]: https://en.wikipedia.org/wiki/Embedding
When some object X is said to be embedded in another object Y,
the embedding is given by some injective and structure-
preserving map f : X → Y. The precise meaning of "structure-
preserving" depends on the kind of mathematical structure of which
X and Y are instances.
Structure-Preserving in IR:
$f : X \to Y$ such that $\mathrm{Sim}(X_1, X_2) \approx \mathrm{Sim}(Y_1, Y_2)$
3
Then, What is Hash?
[Source]: https://en.wikipedia.org/wiki/Hash_table
The hash function will assign each key to a unique bucket, but this
situation is rarely achievable in practice (usually some keys will
hash to the same bucket). Instead, most hash table designs assume
that hash collisions—different keys that are assigned by the hash
function to the same bucket—will occur and must be
accommodated in some way.
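As a minimal illustration of bucket assignment and collisions (my own toy example, not from the slide), the Python sketch below hashes a handful of string keys into four buckets; with more keys than buckets, some bucket necessarily holds several keys.

```python
# Minimal illustration: mapping keys to a small number of buckets with
# Python's built-in hash() inevitably produces collisions.
keys = ["apple", "banana", "cherry", "date", "elderberry", "fig", "grape"]
num_buckets = 4

buckets = {}
for key in keys:
    b = hash(key) % num_buckets            # bucket index for this key
    buckets.setdefault(b, []).append(key)

for b, members in sorted(buckets.items()):
    # Buckets holding more than one key are hash collisions that the
    # table design has to accommodate (e.g. by chaining).
    print(b, members)
```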
4
Combine the Two Properties
[1998, Piotr Indyk, cited: 1847]
Locality Sensitive Hashing
If $D(p, q) \le r$, then $\Pr[h(p) = h(q)] \ge p_1$;
if $D(p, q) > (1 + \varepsilon) r$, then $\Pr[h(p) = h(q)] \le p_2$ (with $p_1 > p_2$).
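To make the definition concrete, here is a hedged sketch of one well-known locality-sensitive family for cosine similarity, sign random projections, where for unit-norm inputs $\Pr[h(p) = h(q)] = 1 - \theta(p, q)/\pi$. This is an illustration of the property, not Indyk's original scheme; the dimensions, seed, and helper names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_family(dim, num_bits):
    """Draw num_bits random hyperplanes; each one defines a single hash bit."""
    return rng.normal(size=(num_bits, dim))

def lsh_hash(planes, x):
    """One bit per hyperplane: 1 if x lies above the plane, 0 otherwise."""
    return (planes @ x > 0).astype(np.uint8)

planes = lsh_family(dim=50, num_bits=32)
q = rng.normal(size=50)
near = q + 0.1 * rng.normal(size=50)     # small perturbation of q
far = rng.normal(size=50)                # unrelated vector

# Nearby vectors agree on most bits; distant vectors agree roughly at chance.
print("bits agreeing with near point:", np.sum(lsh_hash(planes, q) == lsh_hash(planes, near)))
print("bits agreeing with far point: ", np.sum(lsh_hash(planes, q) == lsh_hash(planes, far)))
```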
5
Overview of Hashing
(Figure) Binary reduction: real world (e.g., 2000 values) → binary space (e.g., 32 bits)
6
Facing Big Data
Approximation
7
Learning to Hash
Description / Methods:
Data-Oblivious: LSH, Kernel-LSH, SimHash, …
Data-Aware: LSI, RBM, SpH, STH, …
8
Data-Oblivious: SimHash [WWW.2007]
(Figure) SimHash pipeline over a text with observed features w1, w2, …, wn:
Step 1: Compute TF-IDF weights w1, w2, …, wn for the observed features.
Step 2: Hash each feature to a bit signature (e.g., w1 → 100110, w2 → 110000, wn → 001001).
Step 3: Expand each signature into signed weights: +wi for a 1 bit, −wi for a 0 bit.
Step 4: Sum the signed weights per bit position (e.g., 13, 108, −22, −5, −32, 55).
Step 5: Generate the fingerprint from the signs of the sums (e.g., 1, 1, 0, 0, 0, 1).
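A minimal Python sketch of the five steps above (an illustration of the idea, not the WWW 2007 implementation; the 64-bit length, the MD5-based feature hash, and the toy documents are my own choices):

```python
import hashlib
import numpy as np

def simhash(weighted_features, num_bits=64):
    totals = np.zeros(num_bits)
    for token, weight in weighted_features.items():           # Step 1: weights (e.g. TF-IDF)
        digest = hashlib.md5(token.encode("utf-8")).digest()  # Step 2: hash function
        bits = np.unpackbits(np.frombuffer(digest, dtype=np.uint8))[:num_bits]  # Step 3: signature
        totals += np.where(bits == 1, weight, -weight)        # Step 4: sum (+w for 1, -w for 0)
    return (totals > 0).astype(np.uint8)                      # Step 5: fingerprint from signs

doc_a = {"hashing": 3.2, "binary": 1.5, "retrieval": 2.1}
doc_b = {"hashing": 3.0, "binary": 1.4, "search": 0.9}
fa, fb = simhash(doc_a), simhash(doc_b)
print("Hamming distance:", int(np.sum(fa != fb)))
```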
9
Data-Aware: Spectral Hashing [NIPS.2008]
$\min \; \sum_{ij} S_{ij} \, \lVert y_i - y_j \rVert^2$
$\text{s.t. } y_i \in \{-1, 1\}^k, \quad \sum_i y_i = 0, \quad \frac{1}{n} \sum_i y_i y_i^{T} = I$

In matrix form:
$\min \; \operatorname{trace}\left( Y^{T} (D - W) Y \right)$
$\text{s.t. } Y(i, j) \in \{-1, 1\}, \quad Y^{T} \mathbf{1} = 0, \quad Y^{T} Y = I$

Relaxation: solved as a Laplacian Eigenmap problem, with $Y = XW$.
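A hedged sketch of the relaxed formulation above (illustration only; the full NIPS 2008 algorithm additionally assumes a specific data distribution and an analytic out-of-sample extension, and the Gaussian affinity, bandwidth, and toy data below are my own choices):

```python
import numpy as np

def spectral_codes(X, k=4, sigma=1.0):
    # Affinity matrix from pairwise squared Euclidean distances.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    W = np.exp(-d2 / (2 * sigma**2))
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # graph Laplacian
    vals, vecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    Y = vecs[:, 1:k + 1]                        # skip the trivial constant eigenvector
    return (Y > 0).astype(np.uint8)             # threshold at zero to get k-bit codes

X = np.random.default_rng(0).normal(size=(20, 5))
print(spectral_codes(X, k=4))
```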
10
Some Questions?
1. Can we obtain hashing codes by binarizing the real-valued low-
dimensional vectors such as LSI?
2. Can we get hashing codes by Deep Learning approaches such
as RBM, or AutoEncoder?
11
Some Questions?
1. Can we obtain hashing codes by binarizing the real-valued low-
dimensional vectors such as LSI?
Of Course !
[R. Salakhutdinov, G. Hinton. Semantic Hashing, SIGIR2007]
2. Can we get hashing codes by Deep Learning approaches such
as RBM, or AutoEncoder?
No Problem !
[R. Salakhutdinov, G. Hinton. Semantic Hashing, SIGIR2007]
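For question 1, here is a minimal sketch of the binarize-LSI baseline mentioned above (my own illustration in the spirit of the Semantic Hashing baseline, not the paper's code; the per-dimension median threshold and the toy matrix are assumptions):

```python
import numpy as np

def lsi_hash(term_doc, k=8):
    # term_doc: documents x terms matrix (e.g. TF-IDF weights).
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    Z = U[:, :k] * s[:k]                  # k-dimensional LSI representation
    thresholds = np.median(Z, axis=0)     # per-dimension threshold -> roughly balanced bits
    return (Z > thresholds).astype(np.uint8)

docs = np.random.default_rng(1).random((30, 100))   # toy stand-in for TF-IDF vectors
print(lsi_hash(docs, k=8)[:5])
```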
12
In 2013, What Did They Think About?
Total: 30
13
1/9 - ICML2013:
Title: Learning Hash Functions Using Column Generation
Authors: Xi Li, Guosheng Lin, Chunhua Shen, Anton van den Hengel, Anthony Dick
Organization: The University of Adelaide (Australia)
Based On: NIPS2005: Distance Metric Learning for Large Margin Nearest Neighbor Classification
Motivation: In content based image retrieval, to collect feedback, users may be required to report
whether image x looks more similar to x+ than it is to a third image x−. This task is typically much easier than labeling each individual image.
$\min_{\mathbf{w}, \boldsymbol{\xi}} \; \lVert \mathbf{w} \rVert_1 + C \sum_{i=1}^{J} \xi_i$
$\text{s.t. } \mathbf{w} \ge 0, \; \boldsymbol{\xi} \ge 0; \quad d_H(\mathbf{x}_i, \mathbf{x}_i^{-}) - d_H(\mathbf{x}_i, \mathbf{x}_i^{+}) \ge 1 - \xi_i, \; \forall i$
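The constraint reads as a triplet hinge on Hamming distances. A hedged sketch of evaluating it (my own illustration; the paper's actual solver learns weighted hash functions by column generation, which is not reproduced here):

```python
import numpy as np

def hamming(a, b):
    return np.sum(a != b, axis=-1)

def triplet_slacks(codes_q, codes_pos, codes_neg):
    # xi_i = max(0, 1 - (d_H(x, x-) - d_H(x, x+))): zero when the dissimilar
    # image is already at least one bit further away than the similar one.
    margin_violation = 1.0 - (hamming(codes_q, codes_neg) - hamming(codes_q, codes_pos))
    return np.maximum(0.0, margin_violation)

rng = np.random.default_rng(0)
q, pos, neg = rng.integers(0, 2, size=(3, 5, 16))   # 5 toy triplets of 16-bit codes
print(triplet_slacks(q, pos, neg))
```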
14
2/9 - ICML2013:
Title: Predictable Dual-View Hashing
Authors: Mohammad Rastegari, Jonghyun Choi, Shobeir Fakhraei, Hal Daume III, Larry S. Davis
Organization: The University of Maryland (USA)
Motivation: It is often the case that information about data is available from two or more views, e.g.,
images and their textual descriptions. It is highly desirable to embed information from both domains in
the binary codes, to increase search and retrieval capabilities.
$\min_{\mathbf{W}, \mathbf{Y}} \; \lVert W_T^{T} X_T - Y_T \rVert_2^2 + \lVert Y_T Y_T^{T} - I \rVert_2^2 + \lVert W_V^{T} X_V - Y_V \rVert_2^2 + \lVert Y_V Y_V^{T} - I \rVert_2^2$
$\text{s.t. } Y_T = \operatorname{sgn}(W_T^{T} X_T), \quad Y_V = \operatorname{sgn}(W_V^{T} X_V)$
15
3/9 - SIGIR2013:
Title: Semantic Hashing Using Tags and Topic Modeling.
Authors: Qifan Wang, Dan Zhang, Luo Si
Organization: Purdue University (USA)
Motivation: Two major issues are not addressed in the existing hashing methods: (1) Tag information
is not fully utilized in previous methods. Most existing methods only deal with the contents of
documents without utilizing the information contained in tags; (2) Document similarity in the
original keyword feature space is used as guidance for generating hashing codes in previous methods,
which may not fully reflect the semantic relationship.
$\min_{\mathbf{Y}, \mathbf{U}} \; \lVert \mathbf{T} - \mathbf{U}^{T} \mathbf{Y} \rVert^2 + C \lVert \mathbf{U} \rVert^2 + \gamma \lVert \mathbf{Y} - g(\boldsymbol{\theta}) \rVert_F^2$
$\text{s.t. } \mathbf{Y} \in \{-1, 1\}^{k \times n}, \quad \mathbf{Y} \mathbf{1} = 0$
16
3/9 - SIGIR2013:
Title: Semantic Hashing Using Tags and Topic Modeling.
Authors: Qifan Wang, Dan Zhang, Luo Si
Organization: Purdue University (USA)
Motivation: Two major issues are not addressed in the existing hashing methods: (1) Tag information
is not fully utilized in previous methods. Most existing methods only deal with the contents of
documents without utilizing the information contained in tags; (2) Document similarity in the
original keyword feature space is used as guidance for generating hashing codes in previous methods,
which may not fully reflect the semantic relationship.
$\min_{\mathbf{Y}, \mathbf{U}} \; \lVert \mathbf{T} - \mathbf{U}^{T} \mathbf{Y} \rVert^2 + C \lVert \mathbf{U} \rVert^2 + \gamma \lVert \mathbf{Y} - g(\boldsymbol{\theta}) \rVert_F^2$
$\text{s.t. } \mathbf{Y} \in \{-1, 1\}^{k \times n}, \quad \mathbf{Y} \mathbf{1} = 0$
Our experiments on
20Newsgroups
17
4/9 - IJCAI2013:
Title: A Unified Approximate Nearest Neighbor Search Scheme by Combining Data Structure and
Hashing.
Authors: Debing Zhang, Genmao Yang, Yao Hu, Zhongming Jin, Deng Cai, Xiaofei He
Organization: Zhejiang University (China)
Motivation: Traditionally, to solve the problem of nearest neighbor search, researchers mainly focus on building effective data structures such as hierarchical k-means trees, or on using hashing methods to
accelerate the query process. In this paper, we propose a novel unified approximate nearest neighbor
search scheme to combine the advantages of both the effective data structure and the fast Hamming
distance computation in hashing methods.
18
5/9 - CVPR2013:
Title: K-means Hashing: an Affinity-Preserving Quantization Method for Learning Binary
Compact Codes.
Authors: Kaiming He, Fang Wen, Jian Sun
Organization: Microsoft Research Asia (China)
Motivation: Both Hamming-based methods and lookup-based methods have attracted growing interest recently, and each category has its benefits depending on the scenario. Lookup-based methods have been shown to be more accurate than some Hamming methods at the same code length. However, lookup-based distance computation is slower than Hamming distance computation. Hamming methods also have the advantage that the distance computation is problem-independent.
$E_{\mathrm{aff}} = \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} w_{ij} \left( d(c_i, c_j) - d_h(i, j) \right)^2$
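A hedged sketch of evaluating this affinity-preserving error (an illustration of the objective only, not the paper's alternating optimization; the uniform weights, toy codewords, and the plain scaled-Hamming form used for d_h are my simplifications of the paper's definition):

```python
import numpy as np
from itertools import product

def affinity_error(centers, bits, weights, scale=1.0):
    # centers[i]: codeword c_i; bits[i]: its binary index; weights[i, j]: pair weight w_ij.
    k = len(centers)
    err = 0.0
    for i, j in product(range(k), range(k)):
        d_euclid = np.linalg.norm(centers[i] - centers[j])
        d_hamming = scale * np.sum(bits[i] != bits[j])   # simplified stand-in for d_h(i, j)
        err += weights[i, j] * (d_euclid - d_hamming) ** 2
    return err

centers = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
bits = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # 2-bit indices of the 4 codewords
weights = np.full((4, 4), 1.0 / 16)                 # uniform pair weights for the sketch
print(affinity_error(centers, bits, weights))
```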
19
6/9 - ICCV2013:
Title: Complementary Projection Hashing.
Authors: Zhongming Jin¹, Yao Hu¹, Yue Lin¹, Debing Zhang¹, Shiding Lin², Deng Cai¹, Xuelong Li³
Organization: 1. Zhejiang University, 2. Baidu Inc., 3. Chinese Academy of Sciences, Xi’an (China)
Motivation: 1. (a) Hyperplane a crosses the sparse region, and the neighbors are quantized into the same bucket; (b) hyperplane b crosses the dense region, and the neighbors are quantized into different buckets. Apparently, hyperplane a is more suitable as a hashing function. 2. (a), (b) Both hyperplane a and hyperplane b can evenly separate the data. (c) However, putting them together does not generate a good two-bit hash function. (d) A better example of a two-bit hash function.
20
7/9 - CVPR2013:
Title: Hash Bit Selection: a Unified Solution for Selection Problems in Hashing.
Authors: Xianglong Liu¹, Junfeng He²,³, Bo Lang¹, Shih-Fu Chang².
Organization: 1. Beihang University (China), 2. Columbia University (US), 3. Facebook (US)
Motivation: Recent years have witnessed the active development of hashing techniques for nearest
neighbor search over big datasets. However, to apply hashing techniques successfully, there are
several important issues remaining open in selecting features, hashing algorithms, parameter settings, kernels, etc.
21
8/9 - ICCV2013:
Title: A General Two-Step Approach to Learning-Based Hashing.
Authors: Guosheng Lin, Chunhua Shen, David Suter, Anton van den Hengel.
Organization: University of Adelaide (Australia)
Based On: SIGIR2010: Self-Taught Hashing for Fast Similarity Search
Motivation: Most existing approaches to hashing apply a single form of hash function, and an
optimization process which is typically deeply coupled to this specific form. This tight coupling
restricts the flexibility of the method to respond to the data, and can result in complex optimization
problems that are difficult to solve. Their framework decomposes the hash learning problem into two steps: hash bit learning, and hash function learning based on the learned bits.
22
9/9 - IJCAI2013:
Title: Smart Hashing Update for Fast Response.
Authors: Qiang Yang, Long-Kai Huang, Wei-Shi Zheng, Yingbiao Ling.
Organization: Sun Yat-sen University (China)
Based On: DMKD2012: Active Hashing and Its Application to Image and Text Retrieval
Motivation: Although most existing hashing-based methods have been proven to obtain high accuracy,
they are regarded as passive hashing and assume that the labeled points are provided in advance. In this
paper, they consider updating a hashing model upon gradually increasing labeled data so as to respond quickly to users, an approach called smart hashing update (SHU).
1. Consistency-based Selection;
2. Similarity-based Selection.
[CVPR.2012]
$\mathrm{Diff}(k, j) = \min\{ \mathrm{num}(k, j, -1), \, \mathrm{num}(k, j, 1) \}$

$Q = \min_{H_l \in \{-1, 1\}^{l \times r}} \frac{1}{l} \, \lVert H_l H_l^{T} - r S \rVert_F^2$

$R_k = \min_{k \in \{1, 2, \ldots, r\}} \lVert r S - H_{r-1}^{k} (H_{r-1}^{k})^{T} \rVert_F^2$
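A hedged sketch of the similarity-based selection scores as reconstructed above (my own illustration, not the authors' code; the toy labels, code matrix, and the 1/l normalization follow my reading of the garbled formulas):

```python
import numpy as np

def fit_error(H, S):
    # Q: how well the labeled codes H (l x r, entries in {-1, 1}) reproduce rS.
    r = H.shape[1]
    return np.linalg.norm(H @ H.T - r * S, ord="fro") ** 2 / H.shape[0]

def bits_to_update(H, S, t=2):
    # R_k: the same fit with bit k removed; the t bits whose removal lowers the
    # residual most (the minimizers) are selected for re-learning.
    r = H.shape[1]
    leave_one_out = np.array([
        np.linalg.norm(r * S - np.delete(H, k, axis=1) @ np.delete(H, k, axis=1).T,
                       ord="fro") ** 2
        for k in range(r)
    ])
    return np.argsort(leave_one_out)[:t]

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=12)
S = np.where(labels[:, None] == labels[None, :], 1.0, -1.0)   # pairwise label agreement
H = rng.choice([-1, 1], size=(12, 8))                         # 8-bit codes for 12 labeled points
print("Q =", fit_error(H, S))
print("bits to update:", bits_to_update(H, S, t=3))
```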
23
Reporter: Xu Jiaming (Ph.D. Student)
Date: 2014.03.27
Computational-Brain Research Center
Institute of Automation, Chinese Academy of Sciences
Editor's Notes
1. Hello everyone. My report today is on hashing; I gave it the subtitle "Object Embedding", which is admittedly a bit of an attention-grabber, since "word embedding" is such a fashionable term.
2. Let us start with a small detour: what is an embedding? Wikipedia explains that when an object X is said to be embedded in another object Y, the embedding is given by some injective, structure-preserving map, and the precise meaning of "structure-preserving" depends on the mathematical structure of X and Y. Put simply, an embedding is a structure-preserving mapping, in most cases a dimensionality reduction; the familiar factorization methods such as SVD, PCA, and LDA are all embeddings. In IR, structure preservation means that after the mapping, the similarity between objects in the original space stays roughly unchanged in the target space.
3. Next, what is a hash? For anyone who writes code, hashing is the familiar key-value structure, MD5 being the best-known algorithm. The design goal is to spread keys as uniformly as possible over the value space: different keys map to different values, and even a tiny change in a key changes the value drastically. Wikipedia says a hash function ideally assigns each key a unique bucket, while also noting that in practice this is hard to achieve and some keys always end up in the same bucket; this is a hash collision, which hash function design tries to avoid as much as possible. Hashing is so widely used because key lookup is extremely fast, O(1) in the ideal case. But such scattering hash functions clearly do not satisfy the embedding property, because they do not preserve the intrinsic structure of the objects.
4. So we naturally ask: can we design a hash such that similar keys produce identical or nearby values under some metric, and the values even retain information about the original space, so that identical or similar documents can be retrieved quickly via hashing or compared for similarity fast? Locality Sensitive Hashing (LSH), proposed in 1998 by Indyk together with his advisor while he was a PhD student at Stanford, meets exactly this need. For any p, q in a space S, a hash function h from S to U is locality sensitive with respect to the distance D(p, q) if it satisfies the conditions in the formula; intuitively, points that are close in the original space fall into the same bucket with high probability.
5. Viewed simply, hashing does one thing: it reduces objects from a high-dimensional space to binary codes while preserving the similarity information of the original space. This low-dimensional binary representation is very useful in information retrieval. With mobile applications booming, if we take a photo with a phone and use it for retrieval, uploading a 32-bit code to the server is far faster than uploading the image, and bit operations on the server are far faster than real-valued operations. In the big-data era hashing becomes even more important; the massive-information groups at Microsoft and Baidu are both working on hashing for big data.
6. Hashing's trade of exactness for speed also suggests an attitude toward big data: approximation.
7. Hashing methods fall into two kinds, data-aware and data-oblivious. We introduce one method of each kind.
8. Data-oblivious algorithms are mostly based on random projections, for example the SimHash paper Google published at WWW 2007. It splits a text into n tokens and maps them through hash functions onto several planes; the hash function itself is not similarity-preserving. Summing bit-wise and then binarizing gives the hash fingerprint. The whole algorithm is very simple. Random projection can be understood as cutting the high-dimensional space: points above a cutting plane are set to 1 and points below to −1; after many cuts, two objects that are similar in the original space produce similar binary codes, while dissimilar objects produce dissimilar codes. Some papers describe random-projection hashing as a kind of boosting: many weak hashes are combined into the strong hash we want.
9. Data-aware hashing algorithms mostly adapt existing machine learning methods, for example this NIPS 2008 paper. First a hash objective is constructed: let y be the code after the mapping and S the similarity matrix; when the similarity between x_i and x_j is large, y_i and y_j should be close, and when it is small, y_i and y_j may be far apart. Optimizing this objective directly admits a trivial solution in which all y are equal, so three constraints are added: y is binary; the −1/+1 entries of each bit are balanced; and different bits are uncorrelated. Under these constraints the problem is hard to solve, but relaxing the binary constraint turns it into a spectral clustering problem, which is well studied; the usual solution is Laplacian eigenmaps, giving a real-valued Y that is then thresholded into hash codes. A short digression: manifold learning involves spectral clustering and Laplacian eigenmaps, and papers sometimes speak of "a low-dimensional manifold embedded in a high-dimensional space". The Earth's surface, for instance, is a two-dimensional manifold embedded in three-dimensional space: locally it is a plane, from afar it is a sphere, and its two-dimensional properties are the more important ones. When we reduce dimensionality with spectral methods, we build an affinity matrix of similarities, and the Laplacian eigenmap captures the main structure, so the low-dimensional result preserves the intrinsic properties of the original space. In the figure at the lower right, Laplacian eigenmaps and K-means give different results: Euclidean K-means and manifold learning understand the distance between two points differently. OK, back to the main topic.
10. At this point two questions may come up. First, can we obtain hash codes simply by binarizing a low-dimensional real-valued vector? Second, since the features in each layer of a deep learning model are mostly binary, can we take the features of some layer as the hash code?
11. No problem. Look at the 2007 Semantic Hashing paper. Its authors, Salakhutdinov and Hinton, should be familiar: the Science paper this student-advisor pair published in 2006 launched the new wave of deep learning. In their SIGIR 2007 Semantic Hashing paper they train an RBM on the data and take a binarized feature layer as the hash code, and their baseline reduces the data with LSI and thresholds the result into a binary vector. That said, such direct binarization usually does not work very well. The second chapter of Salakhutdinov's PhD thesis is Semantic Hashing, yet since 2007 only a few papers have used deep learning for hashing; still, whenever people mention the term "semantic hashing" or use LSI as a baseline, they politely cite this paper.
12. So what did people do on hashing in 2013? I did a quick survey: among the 16 international conferences I checked, 30 papers involve hashing. Below I pick out some of them and introduce only their ideas. The venues that published the most hashing papers last year were IJCAI, CVPR, and CIKM with 5 each, while VLDB had none, even though the earliest hashing papers actually came from VLDB.
13. This ICML paper learns hash functions by column generation. I list the authors and affiliations so you can see which colleagues are working on this, and where possible the work each paper builds on. The idea: in content-based image retrieval, when collecting user feedback, asking a user whether an image x looks more similar to x+ than to a third image x− is much easier than labeling every single image. They therefore add a constraint that the distance between x and x− must exceed the distance between x and x+, and solve the problem by column generation. We skip the algorithmic details and care only about the motivation and the idea. The lower left shows their results: the first figure is the retrieval result for a laptop query, and below it the digit recognition results on MNIST.
14. This ICML paper is from the University of Maryland. Its motivation is that information usually comes through several channels, for example images and their textual descriptions; it encodes both domains at the same time, so that text and images end up in the same Hamming space for retrieval. The lower right shows their examples: in the bottom row the query "Plane flying on the air" indeed returns pictures of planes in the sky, and in the top row the query "Laptop placed on the table" returns not only laptops but also televisions.
15. This SIGIR paper is from Purdue University. The head of the lab is Luo Si, who finished his bachelor's and master's at Tsinghua, then took a master's and PhD at CMU before joining Purdue, focusing on information retrieval, machine learning, and NLP; most of his students also come from Tsinghua, and the lab has done a lot of solid IR work. The motivation is twofold: (1) existing hashing methods do not fully exploit tag information (some earlier work does use tags, hence the word "fully"); (2) previous hashing work preserves document similarity computed from raw keyword features, which may not fully reflect semantic relationships, so they bring in topic modeling. When I wrote a paper last year I surveyed this area too, found that nobody had introduced topic models into hashing, and boldly wrote a fast retrieval method based on topic features; this paper likewise states that, as far as they know, it is the first to introduce topic models into hashing. In the formula, theta is the topic feature, T the tag information, and U a latent variable that models the tags; the constraints act on the hash codes. The lower right shows their results on 20Newsgroups and WebKB; the improvement is actually not very pronounced, since apart from SSH none of their baselines use label information.
16. The upper-left figure shows our own hashing-based text retrieval results on 20Newsgroups, where we incorporate topic features and tag information in a different way; the recall-precision curve suggests that adding tag information can improve the results considerably. Note that, to demonstrate robustness to missing tags, the SIGIR paper randomly removed part of the tag information in its experiments.
17. This IJCAI paper is from Zhejiang University, home to two well-known researchers, Xiaofei He and Deng Cai. Xiaofei He graduated from Zhejiang University in 2000, did his PhD at the University of Chicago, joined Yahoo Research in 2005, and in 2007, at 29, was hired back by the State Key Lab of Zhejiang University's computer science college, reaching full professor within about a year; moves from industry back to academia usually signal deep expertise, and this one caused a small stir in Chinese machine learning and computer vision circles. His citations now exceed 10,000. Deng Cai finished his PhD in the US in 2008 under Jiawei Han and joined Zhejiang University as an associate professor, working with Xiaofei He. A bit of gossip as an aside: Professor Han founded the KDD conference but is also nicknamed the "paper-flooding king" of the Chinese community, and rumor has it that Xiaofei He turned a single idea into some thirty papers during his PhD; of course this is just the usual sniping among peers, and the two did win the AAAI 2012 Best Paper (on summarization) and the ACM-MM 2010 Best Paper (on social-media-based music recommendation) together. Back to the IJCAI paper: its motivation is that traditional approaches to nearest neighbor search either build effective data structures such as hierarchical k-means trees or use hashing to speed up queries; this paper proposes an approximate nearest neighbor search scheme that combines the tree-based data structure with the fast Hamming distance computation of hashing. The idea is clear and needs no extra derivation, so the paper has no equations, only a figure and experiments. The left figure is the k-means tree built offline; the right figure is the query process: when the query reaches level i, step (a) uses hashing to find the 4 most similar centers, step (b) uses exact ranking to keep the 2 most similar ones, each of which has three sub-clusters, and the query moves on to level i+1. The idea is simple, but it is a good framework for fast retrieval: hashing over the full dataset still returns too many candidates, whereas here comparisons are made hierarchically against cluster centers only.
18. Since we have mentioned both hashing and k-means, here is another piece of work that combines them, by Kaiming He of Microsoft Research Asia, also a well-known figure in computer vision; his CVPR 2009 Best Paper on image dehazing was presented here by Feng Yuan last year. Motivation: Hamming-based and lookup-based methods both receive wide attention in retrieval, and each has advantages in different settings: at the same code length, lookup-based methods are more accurate, while Hamming-based methods answer queries faster. As the figure shows, Hamming-based hashing maps objects into a binary Hamming space, while lookup-based k-means finds representative cluster centers. Can we learn both at once, carrying out the hash mapping and the clustering jointly? The objective says exactly that: when two k-means centers are close, the Hamming distance between their codes should also be small, and vice versa. When Liu Pengcheng presented this paper last year he pointed out an amusing Easter egg: the algorithm's abbreviation, KMH, matches the author's initials.
19. This ICCV paper also comes from Deng Cai's lab at Zhejiang University, with participation from Baidu Research and the Xi'an branch of the Chinese Academy of Sciences. Recall that random-projection hashing cuts the high-dimensional space with hyperplanes. Their motivation has two parts. First, as in the left figure, hyperplane a passes through the sparse region between two clusters, so the neighbors in each cluster are quantized into the same bucket, whereas hyperplane b passes through a dense region and splits neighbors into different buckets; hyperplane a is clearly the better hash function. Second, hyperplanes a and b each separate the data evenly, but combined as in (c) they still do not give a good pair of bits, whereas the combination in (d) is best: the two hyperplanes separate the four groups, and each group is mapped into its own bucket.
20. This CVPR paper is joint work by Beihang University, Columbia University, and Facebook. The motivation is that although hashing is already very successful for nearest neighbor search over big data, several questions remain open: which of several features, which of several hashing algorithms, and which parameter settings to choose. The paper offers a framework that generates binary codes for the same dataset under different features, hashing algorithms, and parameter settings, builds a graph over the resulting bits, and then selects the best nodes from the graph.
21. This ICCV paper is from the University of Adelaide in Australia. Its motivation is that most existing hashing methods couple the learning of the dataset's binary codes with the learning of the hash prediction function; this tight coupling limits flexibility and makes the optimization complex and hard to solve. They propose a framework that splits hashing into two stages: first learn hash codes for the existing data, then learn hash functions from those codes. If the motivation is unclear, look at the figure below, which comes from their main reference, Self-Taught Hashing (SIGIR 2010), a typical two-stage hashing method from Luo Si's lab at Purdue (its third author is Deng Cai of Zhejiang University, presumably from a joint visit). In the figure, a document collection is first reduced by an unsupervised method to binary hash codes, the first stage; then, treating the learned codes as binary labels, a supervised method learns a hash function, the second stage. Both stages are offline, and only the query runs online. STH is itself already a two-stage framework, and this ICCV paper essentially builds on it to propose a general two-step hash learning framework.
22. This IJCAI paper is from Sun Yat-sen University and builds on a 2012 DMKD paper on active hashing (DMKD is a B-ranked journal in the retrieval area). Their motivation is quite practical: existing hashing methods already achieve good accuracy, but they are mostly passive and assume the labeled data is provided up front. This paper asks how to update the hashing model quickly as labeled data gradually arrives, so as to respond to users fast; they call this Smart Hashing Update. In active learning, the system automatically selects data for the user to label and then updates the whole model using both the existing and the newly labeled data. Their procedure is shown in the figure: each time the user labels new data, it is added to the current dataset, the system decides which hash bits need updating, and only the hash functions of the chosen t bits are re-learned in that round. Choosing those t bits is the key step, and the paper gives two strategies. (1) Consistency-based selection checks, for each bit, how consistent the codes of samples from the same class are, i.e., whether that bit is mostly +1 or mostly −1 within a class; bits with poor consistency are selected for updating. Its drawback is that it ignores the similarity between existing and new data, so (2) similarity-based selection measures how well the codes of same-class samples fit together: CVPR 2012 gives a quality measure (the formula with H the codes of the labeled samples and S the affinity matrix), and the smaller it is, the better. To find the t worst hash functions, the measure is recomputed with each bit k removed in turn, and the t bits whose removal changes the measure most noticeably are selected and re-learned on the labeled data.
23. There was also an ACL paper applying hashing to machine translation and another CVPR paper on medical image retrieval; the remaining papers are too many to list one by one.