Text retrieval and mining

Experience report on an integration
Big Data & Machine Learning
6 April 2017, Plaine Images

Reminders, Documents, Invoices, Contacts, Source, Mail, Notes, Bookmarks, Spreadsheets, Messages, Presentation, Planning, Plans, Mind map, Articles
Memorize, maintain, organize, re-find (navigation, search)
(control, history, familiarity) ≠ Web
Needs
Tasks

• 1,228,900 documents, ~160 GB
• 314 file types
• 17,500 folders
[Chart: number of files per type, vertical axis from 0 to 100,000]

Guiding idea: make better use of document content throughout
the search process (indexing, querying, presentation of results, …)

Linguistic processing
Text Retrieval, Text Mining, Text Navigation, Text Visualisation
Word associations, entity extraction, language models, text similarity, text summarization, text clustering, text classification, topic extraction, word prediction, …
More knowledge, more structure
Access to information: indexing, querying, ranking, relevance feedback
Personal Information Space
Supervised, unsupervised

• More structure, more links, more analyses, … for search and decision making
• Unsupervised mining, fast mining tasks (topic modeling)

Personal Information Space (big text data)
=> Text Retrieval, driven by the user query
=> Relevant documents (small text data)
=> Text Mining: summary, structure, knowledge
=> Input for new querying
[Diagram: terms (t) and documents (d) flowing through the retrieval and mining loop]

If we go further still into semantic analysis, then we might be able to recognize
dog as an animal. We also can recognize boy as a person, and playground as a
location and analyze their relations. One deduction could be that the dog was
chasing the boy, and the boy is on the playground. This will add more entities and
relations, through entity-relation recognition. Now, we can count the most frequent
person that appears in this whole collection of news articles. Or, whenever you see
a mention of this person you also tend to see mentions of another person or object.
These types of repeated patterns can potentially make very good features.
Example sentence: A dog is chasing a boy on the playground
String of characters
Sequence of words
+ POS tags: A (Det) dog (Noun) is (Aux) chasing (Verb) a (Det) boy (Noun) on (Prep) the (Det) playground (Noun)
+ Syntactic structures: noun phrases (A dog, a boy, the playground), complex verb, prep phrase, verb phrase, sentence
+ Entities and relations: dog = Animal, boy = Person, playground = Location, relations CHASE and ON
+ Logic predicates: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1)
+ Speech acts: Speech act = REQUEST
Deeper NLP: requires more human effort; less accurate
Closer to knowledge representation
Figure 3.3 Illustration of different levels of text representation. Source: Text Data Management and Analysis, C. Zhai

Collection matrices (terms × documents): rows t1 … tm, columns d1 … dn, weights w11 … wmn
• Search result: relevant documents
• Term similarity and association (=> query completion)
• Document similarity (=> document clustering)
• …
Document matrices (terms × sentences): rows t1 … tm, columns p1 … pn, weights w11 … wmn
• Document summarization
• Keyword extraction
• …
doc, d1 … => vectorization + index
Note: probabilistic modeling as an alternative (predictive)

            d1  d2  d3  d4  d5  d6  d7  d8  d9
human        1   0   1   1   0   0   0   0   0
interface    1   0   1   0   0   0   0   0   0
computer     1   1   0   0   0   0   0   0   0
user         0   1   0   0   1   0   0   0   0
system       0   1   1   2   0   0   0   0   0
response     0   1   0   0   1   0   0   0   0
time         0   1   0   0   1   0   0   0   0
EPS          0   0   1   1   0   0   0   0   0
survey       0   1   0   0   0   0   0   0   1
trees        0   0   0   0   0   1   1   1   0
graph        0   0   0   0   0   0   1   1   1
minors       0   0   0   0   0   0   0   1   1
Source: Indexing by Latent Semantic Analysis, Deerwester, Dumais et al., 1990
[Figure: term space with axes interface, computer and human, showing documents d1 to d4; likewise for the document space]

• Tuple < T, C, R, W, M, d, S >
• T: terms appearing in the contexts
• C: contexts in which the terms appear
• R: co-occurrence relation between terms and contexts
• W: term weighting scheme (optional)
• M: distributional matrix T x C
• d: dimensionality reduction function, d : M -> Mo (optional)
• S: distance measure between vectors in M or Mo
Instantiated according to the task
General methods (LSA, …)
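To make the tuple more concrete, here is a minimal Python sketch of one possible instantiation, where the contexts C are whole documents, R is "term occurs in context", and W is an optional weighting function. The function name `build_matrix`, the toy contexts, and the binary weighting are assumptions chosen for illustration; they are not taken from the talk.

```python
from collections import Counter

def build_matrix(contexts, weight=None):
    """Build the distributional matrix M (T x C) from tokenized contexts.

    T is the sorted vocabulary, C the list of contexts, and R the
    'term occurs in context' relation, stored here as raw counts.
    """
    terms = sorted({tok for ctx in contexts for tok in ctx})   # T
    counts = [Counter(ctx) for ctx in contexts]                # R, one counter per context
    M = [[counts[j][t] for j in range(len(contexts))] for t in terms]
    if weight is not None:                                     # W (optional)
        M = [[weight(x) for x in row] for row in M]
    return terms, M

# Hypothetical toy contexts (two tiny "documents").
contexts = [
    ["human", "interface", "computer"],
    ["user", "computer", "system", "system"],
]

terms, M = build_matrix(contexts)
# Same matrix with a binary weighting scheme (presence or absence).
_, M_binary = build_matrix(contexts, weight=lambda x: 1 if x else 0)

for term, row in zip(terms, M):
    print(term, row)
```

The reduction function d and the distance S are left out here; they would be applied to M afterwards, depending on the task.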

• Document clustering
Terms × documents matrix M: rows t1 … tm, columns d1 … dn, weights w11 … wmn
$$\mathrm{sim}_{\cos}(d_1, d_2) = \frac{\sum_{j=1}^{m} w_{j1} \, w_{j2}}{\sqrt{\sum_{j=1}^{m} (w_{j1})^2} \cdot \sqrt{\sum_{j=1}^{m} (w_{j2})^2}}$$
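As a minimal sketch of the cosine measure on this slide, the Python below compares document columns of a small terms × documents matrix. The matrix values and the helper names `cosine_similarity` and `column` are illustrative assumptions.

```python
import math

def cosine_similarity(col_a, col_b):
    """Cosine of the angle between two document vectors (columns of M)."""
    dot = sum(a * b for a, b in zip(col_a, col_b))
    norm_a = math.sqrt(sum(a * a for a in col_a))
    norm_b = math.sqrt(sum(b * b for b in col_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)

def column(matrix, j):
    """Extract column j (one document) from a list-of-rows matrix."""
    return [row[j] for row in matrix]

# Hypothetical toy matrix: 4 terms (rows) x 3 documents (columns).
M = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
]

d1, d2, d3 = (column(M, j) for j in range(3))
print(round(cosine_similarity(d1, d2), 3))  # one shared term out of two each
print(round(cosine_similarity(d1, d3), 3))  # no shared term
```

Documents that share no term get a similarity of 0, which is one motivation for projecting onto latent factors before comparing.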

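[Diagram: the documents d1 … are indexed; the relation R links the terms T (t1, t2, t3, … tm) to the contexts C (d1, d2, d3, … dn), which gives the matrix M]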

Terms × documents matrix M: rows t1 … tm, columns d1 … dn, weights w11 … wmn
LSA: $M \approx M_k = U_k \times \Sigma_k \times W_k^{T}$
Projection of the terms and documents onto k latent factors (k < n, k < m)
$\mathrm{sim}_{LSA}(d_i, d_j) = \cos(W_k^{\cdot,i}, W_k^{\cdot,j})$, where $W_k^{\cdot,i}$ is the vector of document $d_i$ in the latent space
Clustering of the documents returned by a search (content-based): K-means on d1, d2, d3, … dn
Benefit: reduced-dimension matrices for computing document similarities
[Figure: rank-k approximation of M. Mk (m x n) = U (m x r) × Σ (r x r) × WT (r x n), keeping the first k columns of U (term vectors), the k largest of the ordered singular values in Σ, and the first k rows of WT (document vectors)]
Likewise for the terms
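The rank-k factorization can be sketched with NumPy's `np.linalg.svd`. This is a sketch under assumptions, not the implementation used in the talk: the toy matrix, the helper names `truncated_svd` and `cosine`, and the choice of k = 2 are all illustrative.

```python
import numpy as np

def truncated_svd(M, k):
    """Return U_k, the k largest singular values, and W_k^T for matrix M."""
    U, sigma, Wt = np.linalg.svd(np.asarray(M, dtype=float), full_matrices=False)
    return U[:, :k], sigma[:k], Wt[:k, :]

def cosine(u, v):
    """Cosine similarity between two vectors, 0.0 if either is null."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

# Hypothetical toy matrix: 4 terms x 4 documents, two obvious topics.
M = [
    [2, 2, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

U_k, sigma_k, Wt_k = truncated_svd(M, k=2)

# Rank-k approximation M_k = U_k * Sigma_k * W_k^T.
M_k = U_k @ np.diag(sigma_k) @ Wt_k

# Each column of W_k^T is a document in the k-dimensional latent space.
docs_k = Wt_k
sim_12 = cosine(docs_k[:, 0], docs_k[:, 1])  # same topic
sim_13 = cosine(docs_k[:, 0], docs_k[:, 2])  # different topics
print(round(sim_12, 3), round(abs(sim_13), 3))
```

The document vectors have k components instead of m, which is the saving the slide points to for similarity computations and for K-means on search results.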

• Document summarization
doc => terms × sentences matrix M: rows t1 … tm, columns p1 … pn, weights w11 … wmn
=> sentence similarity matrix (symmetric): rows and columns p1 … pn, entries s11 … snn
Ways to compute the sentence similarities:
• $M^{T} M$
• or LSA: $\mathrm{sim}_{LSA}(p_i, p_j) = \cos(W_k^{\cdot,i}, W_k^{\cdot,j})$
• "Content overlap": $\mathrm{sim}_{overlap}(p_i, p_j) = \frac{|\{ w_k \mid w_k \in p_i \;\&\; w_k \in p_j \}|}{\log(|S_i|) * \log(|S_j|)}$
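A minimal sketch of the first option, the product of the transposed terms × sentences matrix with itself, in plain Python. The toy matrix and the function name `gram_matrix` are assumptions for illustration only.

```python
def gram_matrix(M):
    """Return S = M^T M for a terms x sentences matrix M (list of rows).

    S[i][j] is the dot product of sentence columns i and j, so S is
    symmetric and S[i][i] is the squared norm of sentence i.
    """
    n = len(M[0])
    return [[sum(row[i] * row[j] for row in M) for j in range(n)]
            for i in range(n)]

# Hypothetical toy matrix: 3 terms (rows) x 3 sentences (columns).
M = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
]

S = gram_matrix(M)
for row in S:
    print(row)
```

With raw counts the entries grow with sentence length; normalizing the columns of M first would turn S into a cosine similarity matrix.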

Sentence similarity matrix: rows and columns p1 … pn, entries s11 … snn
• Clustering (K-means): partition of the document into groups of sentences + centroids
• PhraseRank (~ PageRank): ordered list of sentences by score (after convergence), e.g. p1: 0.086, p2: 0.083, p3: 0.095, .., pn-1: 0.088, pn: 0.0734
Extraction: sentence 1, sentence 2, sentence 3 => summary
[Graph: sentences p1, p2, p3 linked by edges weighted 0.12, 0.56, 0.65]
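The PageRank-style scoring of sentences can be sketched as a power iteration over the sentence similarity matrix. The talk does not give the exact update rule, so the weighted PageRank form below, the damping factor 0.85, the tolerance, and the name `phrase_rank` are all assumptions.

```python
def phrase_rank(S, damping=0.85, tol=1e-8, max_iter=200):
    """Score sentences by a weighted PageRank over similarity matrix S.

    Self-similarities (the diagonal of S) are ignored. Each sentence
    spreads its score to the others in proportion to their similarity.
    """
    n = len(S)
    # Total outgoing weight of each sentence, diagonal excluded.
    out = [sum(S[i][j] for j in range(n) if j != i) for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new = []
        for i in range(n):
            incoming = sum(S[j][i] / out[j] * scores[j]
                           for j in range(n) if j != i and out[j] > 0)
            new.append((1 - damping) / n + damping * incoming)
        converged = sum(abs(a - b) for a, b in zip(new, scores)) < tol
        scores = new
        if converged:
            break
    return scores

# Hypothetical similarity matrix for three sentences: p2 overlaps with
# both p1 and p3, which do not overlap with each other.
S = [
    [2, 1, 0],
    [1, 2, 1],
    [0, 1, 1],
]

scores = phrase_rank(S)
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
print([f"p{i + 1}" for i in ranking])
```

A summary would then be extracted by keeping the top-ranked sentences, usually restored to their original order in the document.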

• Extraction of key terms from a document or a collection
• Extraction of term associations => query completion
doc => terms × terms matrix for 1 document: rows t1 … tm, columns t1 … tn, weights w11 … wmn
d1 … => terms × terms matrix for n documents: rows t1 … tm, columns t1 … tn, weights w11 … wmn

Terms × terms matrix (frequencies or similarities): rows t1 … tm, columns t1 … tn, entries s11 … smn
• Reuse of existing matrices (terms × documents, terms × sentences): $M M^{T}$ or LSA
• Co-occurrence analyzer run on the text (doc). Relations:
  - Word windows
  - Same syntactic unit
  - Syntactic (noun phrase)
  - …
TextRank => list of key terms (word cloud), e.g. t1: rank 10, t2: rank 1, t3: rank 25, .., tn-1: rank 3, tn: rank 57
Association analysis => list of associations or clusters (query completion), e.g. t1: {t3 < t10}, t2: {t28}, …, tn: {t10 < t35 < t2}
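Reusing a terms × documents matrix to get term associations can be sketched as follows: the product of M with its transpose counts co-occurrences, and the strongest associates of a query term become completion candidates. The toy matrix, the term list, and the helper names `term_cooccurrence` and `suggest` are assumptions for illustration.

```python
def term_cooccurrence(M):
    """Return M M^T for a terms x documents matrix M (list of rows)."""
    m = len(M)
    return [[sum(a * b for a, b in zip(M[i], M[j])) for j in range(m)]
            for i in range(m)]

def suggest(term, terms, cooc, top=3):
    """Return up to `top` terms that co-occur most with `term`."""
    i = terms.index(term)
    candidates = [(cooc[i][j], terms[j]) for j in range(len(terms))
                  if j != i and cooc[i][j] > 0]
    # Strongest association first, ties broken alphabetically.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in candidates[:top]]

# Hypothetical toy data: 5 terms (rows) x 5 documents (columns).
terms = ["human", "interface", "computer", "user", "system"]
M = [
    [1, 0, 1, 1, 0],  # human
    [1, 0, 1, 0, 0],  # interface
    [1, 1, 0, 0, 0],  # computer
    [0, 1, 0, 0, 1],  # user
    [0, 1, 1, 2, 0],  # system
]

cooc = term_cooccurrence(M)
print(suggest("human", terms, cooc))
```

The same `suggest` step would work on a matrix produced by a window-based co-occurrence analyzer or by LSA term similarities, since it only needs a terms × terms score table.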

: K = 2 (0.01 s), K = 10 (6 s), K = 20 (7.5 s)
N = 25 (50%, 0.1 s), N = 225 (50%, 0.5 s), N = 530 (25%, 0.9 s)
: N = 1 (0.77 s), N = 10 (1.25 s), N = 300 (3 s)

3. Problem Formulation and Experiments
[Diagram: the user (environment) submits a query (state st); the retrieval system (agent) returns a document list (action at); the user examines the document list and generates implicit feedback, which an evaluation measure turns into the reward rt]
Figure 3.1: The IR problem modeled as a contextual bandit problem, with IR terminology in black and corresponding RL terminology in green and italics.
of previously displayed results. This renders the problem a contextual bandit problem (Barto et al., 1981; Langford and Zhang, 2008) (§2.4.1).
Because our algorithms learn online, we need to measure their online performance, i.e., how well they address users' information needs while learning. Previous work in learning to rank for IR has considered only final performance, i.e., performance on unseen data after training is completed (Liu, 2009), and, in the case of active learning,
