Experience from a Big Data & Machine Learning integration
6 April 2017, Plaine Images
Desktop: recap
Documents: invoices, contacts, source code, mails, notes, bookmarks, spreadsheets, messages, presentations, schedules, plans, mind maps, articles
Tasks: memorize, maintain, organize, re-find (navigation, search)
Needs: (control, history, familiarity) ≠ Web
• 1,228,900 documents, ~160 GB
• 314 file types
• 17,500 folders

[Bar chart: number of files per type, y-axis from 0 to 100,000]
Desktop Search Engine
Guiding idea: make better use of document content in the search process (indexing, querying, result presentation, …)
Linguistic processing: Text Retrieval, Text Mining, Text Navigation, Text Visualization

- Word associations
- Entity extraction
- Language models
- Text similarity
- Text summarization
- Text clustering
- Text classification
- Topic extraction
- Word prediction
- …

More knowledge, more structure → information access: indexing, querying, ranking, relevance feedback

Personal Information Space; supervised / unsupervised
• More structure, more links, more analyses, … for search and decision making
• Unsupervised mining, fast mining tasks (topic modeling)
User query → Personal Information Space (Big Text Data) → Text Retrieval → relevant documents (Small Text Data) → Text Mining → summary, structure, knowledge → feeds a new query

[Diagram: clouds of terms (t) and documents (d) flowing between retrieval and mining]
If we go further still into semantic analysis, then we might be able to recognize
dog as an animal. We also can recognize boy as a person, and playground as a
location and analyze their relations. One deduction could be that the dog was
chasing the boy, and the boy is on the playground. This will add more entities and
relations, through entity-relation recognition. Now, we can count the most frequent
person that appears in this whole collection of news articles. Or, whenever you see
a mention of this person you also tend to see mentions of another person or object.
These types of repeated patterns can potentially make very good features.
[Figure: the sentence "A dog is chasing a boy on the playground" analyzed at increasing depth]

- String of characters
- Sequence of words
- + POS tags (Det, Noun, Aux, Verb, Prep)
- + Syntactic structures (noun phrases, complex verb, verb phrase, prepositional phrase, sentence)
- + Entities and relations (Dog → Animal, Boy → Person, Playground → Location; CHASE, ON)
- + Logic predicates: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1)
- + Speech acts (Speech act = REQUEST)

Deeper NLP requires more human effort and is less accurate, but is closer to knowledge representation.

Figure 3.3 Illustration of different levels of text representation. Source: Text Data Management and Analysis, C. Zhai
Collection matrix (terms × documents):

       d1   d2  …  dn
t1    w11  w12 …  w1n
t2    w21  w22 …  w2n
…
tm    wm1  wm2 …  wmn

Document matrix (terms × passages): same layout with columns p1 p2 … pn.

Search result: relevant documents
- Term similarity and association (=> query completion)
- Document similarity (=> document clustering)
- Document summarization
- Keyword extraction
- …
doc d1 → vectorization + index
Note: probabilistic (predictive) modeling as an alternative
d1 d2 d3 d4 d5 d6 d7 d8 d9
human 1 0 1 1 0 0 0 0 0
interface 1 0 1 0 0 0 0 0 0
computer 1 1 0 0 0 0 0 0 0
user 0 1 0 0 1 0 0 0 0
system 0 1 1 2 0 0 0 0 0
response 0 1 0 0 1 0 0 0 0
time 0 1 0 0 1 0 0 0 0
EPS 0 0 1 1 0 0 0 0 0
survey 0 1 0 0 0 0 0 0 1
trees 0 0 0 0 0 1 1 1 0
graph 0 0 0 0 0 0 1 1 1
minors 0 0 0 0 0 0 0 1 1
Indexing by Latent Semantic Analysis, Deerwester, Dumais et al., 1990
[Plot: term space spanned by human, interface, computer, with documents d1 … d4 positioned; likewise for the document space]
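As an illustrative sketch (not from the slides), the Deerwester matrix above can be decomposed with a plain NumPy SVD; `k = 2` mirrors the two-dimensional term/document plot:

```python
import numpy as np

# Term-document matrix from Deerwester et al. (1990), rows = terms
# (human, interface, computer, user, system, response, time, EPS,
# survey, trees, graph, minors), columns = d1..d9.
M = np.array([
    [1, 0, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
])

# SVD: M = U @ diag(s) @ Vt, singular values in decreasing order.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Rank-k LSA approximation: project terms and documents on k latent factors.
k = 2
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
term_coords = U[:, :k] * s[:k]    # term positions in the 2-D latent space
doc_coords = Vt[:k, :].T * s[:k]  # document positions
```

Plotting `term_coords` and `doc_coords` reproduces the kind of two-dimensional term/document map shown on the slide.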
• Tuple < T, C, R, W, M, d, S >
• T: terms appearing in the contexts
• C: contexts in which the terms appear
• R: co-occurrence relation between terms and contexts
• W: term weighting scheme (optional)
• M: distributional matrix T × C
• d: dimensionality-reduction function, d : M -> Mo (optional)
• S: distance measure between vectors in M or Mo

Instantiated according to the task; general methods (LSA, …)
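One possible instantiation of the tuple, sketched with illustrative data: T = terms, C = documents, R = occurrence counts, W = identity weighting, no reduction d, and cosine distance for S:

```python
import numpy as np

T = ["human", "interface", "computer"]   # terms
C = ["d1", "d2", "d3"]                   # contexts (here: documents)
# R with identity weighting W gives the distributional matrix M (|T| x |C|):
M = np.array([
    [1, 0, 1],   # human
    [1, 0, 1],   # interface
    [1, 1, 0],   # computer
])

def S(u, v):
    """Cosine distance between two term vectors of M (or of a reduced Mo)."""
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
```

Swapping in a weighting scheme W (e.g. tf-idf) or a reduction d (e.g. LSA) yields the other instantiations used in the following slides.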
• Partitionnement de documents
w11 w12… w1n
w21 w22… w2n
… …
wm1 wm2… wmn
d1 d2 … dn
t1
t2
…
tm
𝑠𝑖𝑚$%& 𝑑1, 𝑑2 =
∑ 𝑤𝑗1 ∗ 𝑤𝑗20
123
∑ (𝑤𝑗1)60
123 ∗ ∑ (𝑤𝑗2)60
123
	
…
d1
index
.
.
.
tm
t1
t2
t3
d1
d2
d3
dn
R
T C
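The VSM similarity above is plain cosine similarity between term-weight vectors; a minimal sketch with illustrative vectors:

```python
import numpy as np

def sim_vsm(d1, d2):
    """Cosine similarity between two term-weight vectors."""
    den = np.sqrt(np.sum(d1 ** 2)) * np.sqrt(np.sum(d2 ** 2))
    return float(np.dot(d1, d2) / den) if den else 0.0

# Two hypothetical documents over a 4-term vocabulary:
d1 = np.array([1.0, 0.0, 2.0, 1.0])
d2 = np.array([1.0, 1.0, 0.0, 1.0])
print(round(sim_vsm(d1, d1), 6))  # 1.0
```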
M (terms × documents) decomposed by LSA:

$M \approx U_k \times \Sigma_k \times W_k^{T}$

Projection of terms and documents onto k latent factors (k < n, k < m):

$\mathrm{sim}_{lsa}(d_i, d_j) = \cos(W_k[\cdot, i],\; W_k[\cdot, j])$

Clustering (K-means) of the documents returned by a search (content-based).

Benefit: reduced-dimension matrices for document-similarity computations.

SVD: M_k (m × n), the rank-k approximation of M, is the product of U (m × r, term vectors), Σ (r × r, ordered singular values) and W^T (r × n, document vectors), each truncated to the top k factors. The same applies to terms.
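A hedged sketch of the reduced-dimension similarity: after truncating the SVD to k factors, documents are compared in the latent space via the columns of Σ_k W_k^T. The matrix below is random placeholder data; K-means could then be run on `docs_k`:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((12, 9))  # placeholder terms x documents weight matrix

U, s, Wt = np.linalg.svd(M, full_matrices=False)
k = 2
# Document vectors in the k-dimensional latent space (columns of Sigma_k Wt_k).
docs_k = (np.diag(s[:k]) @ Wt[:k, :]).T  # shape (n_docs, k)

def sim_lsa(i, j):
    """Cosine similarity of documents i and j in the latent space."""
    a, b = docs_k[i], docs_k[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

The same construction with `U[:, :k]` gives term vectors, matching the slide's note that the same applies to terms.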
• Document summarization

Terms × sentences matrix (rows t1 … tm, columns p1 … pn) → sentence-similarity matrix (p1 … pn × p1 … pn, symmetric), computed either as $M^{T} M$, via LSA, or with a "content overlap" measure:

$\mathrm{sim}_{lsa}(p_i, p_j) = \cos(W_k[\cdot, i],\; W_k[\cdot, j])$

$\mathrm{sim}_{overlap}(p_i, p_j) = \dfrac{|\{w_k \mid w_k \in p_i \wedge w_k \in p_j\}|}{\log|S_i| + \log|S_j|}$
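A minimal sketch of the "content overlap" measure, following the TextRank formulation (shared words between two sentences, normalized by the sum of the log sentence lengths); the token lists are illustrative:

```python
import math

def sim_overlap(p_i, p_j):
    """Shared-word count normalized by log sentence lengths (TextRank-style)."""
    shared = set(p_i) & set(p_j)
    denom = math.log(len(p_i)) + math.log(len(p_j))
    return len(shared) / denom if denom > 0 else 0.0

s1 = ["the", "dog", "chases", "the", "boy"]
s2 = ["the", "boy", "runs", "on", "the", "playground"]
```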
The sentence-similarity matrix (p1 … pn × p1 … pn) feeds two routes:

- Clustering (K-means): partition of the document into sentence groups + centroids → extraction of representative sentences.
- PhraseRank (~ PageRank): ranked list of sentences by score after convergence (e.g. p1: 0.086, p2: 0.083, p3: 0.095, …, pn-1: 0.088, pn: 0.0734).

The top-scoring sentences (e.g. p1: 0.12, p2: 0.56, p3: 0.65) are extracted to form the summary.
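The PhraseRank step can be sketched as a PageRank iteration over a sentence-similarity matrix; the similarity values below are illustrative, and the damping `d = 0.85` is the usual PageRank default:

```python
import numpy as np

def phrase_rank(S, d=0.85, tol=1e-6, max_iter=100):
    """PageRank-style scoring over a symmetric sentence-similarity matrix S.
    S[i, j] = similarity between sentences i and j (zero diagonal)."""
    n = S.shape[0]
    # Column-normalize so each sentence distributes its score over neighbors.
    col_sums = S.sum(axis=0)
    P = S / np.where(col_sums == 0, 1, col_sums)
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = (1 - d) / n + d * P @ r
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r

# Toy similarity matrix for 4 sentences (hypothetical values):
S = np.array([
    [0.0, 0.5, 0.1, 0.0],
    [0.5, 0.0, 0.4, 0.2],
    [0.1, 0.4, 0.0, 0.3],
    [0.0, 0.2, 0.3, 0.0],
])
scores = phrase_rank(S)
summary_order = np.argsort(scores)[::-1]  # best sentences first
```

After convergence, the sentences at the top of `summary_order` are extracted to build the summary.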
• Key-term extraction from a document or a collection
• Term-association extraction => query completion

Terms × terms matrix for one document, and terms × terms matrix for the n documents:

       t1   t2  …  tn
t1    w11  w12 …  w1n
t2    w21  w22 …  w2n
…
tm    wm1  wm2 …  wmn
The terms × terms similarity matrix (s11 … smn) is obtained either by reusing the existing matrices (terms × documents and terms × sentences, via $M M^{T}$ or LSA) or with a co-occurrence analyzer over the raw text.

Co-occurrence relations:
- word windows
- same syntactic unit
- syntactic (noun phrase)
- …

From the terms × terms matrix (frequencies or similarities):
- TextRank → list of key terms (e.g. t1: rank 10, t2: rank 1, t3: rank 25, …, tn-1: rank 3, tn: rank 57), word cloud
- Association analysis → list of associations or clusters (e.g. t1: {t3 < t10}, t2: {t28}, …, tn: {t10 < t35 < t2}) → query completion
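The co-occurrence analyzer over raw text can be sketched with the "word window" relation; the window size and the toy tokenized corpus are illustrative:

```python
from collections import defaultdict

def cooccurrence(sentences, window=2):
    """Count term co-occurrences within a sliding word window.
    Returns a dict {(t1, t2): count} with each pair in sorted order."""
    counts = defaultdict(int)
    for tokens in sentences:
        for i, t in enumerate(tokens):
            for u in tokens[i + 1 : i + 1 + window]:
                if t != u:
                    counts[tuple(sorted((t, u)))] += 1
    return dict(counts)

# Toy tokenized corpus (hypothetical):
docs = [
    ["big", "data", "machine", "learning"],
    ["machine", "learning", "for", "text", "mining"],
]
pairs = cooccurrence(docs, window=2)
# ("learning", "machine") co-occurs in both sentences:
print(pairs[("learning", "machine")])  # 2
```

The resulting counts populate the terms × terms matrix that TextRank or association analysis then consumes.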
: K = 2 (0.01 s), K = 10 (6 s), K = 20 (7.5 s)
: N = 25 (50%, 0.1 s), N = 225 (50%, 0.5 s), N = 530 (25%, 0.9 s)
: N = 1 (0.77 s), N = 10 (1.25 s), N = 300 (3 s)
3. Problem Formulation and Experiments

[Figure 3.1: The IR problem modeled as a contextual bandit problem, with IR terminology in black and corresponding RL terminology in green and italics. The retrieval system (agent) receives a query (state s_t) and emits a document list (action a_t); the user (environment) examines the document list and generates implicit feedback, whose evaluation measure is the reward r_t.]

… of previously displayed results. This renders the problem a contextual bandit problem (Barto et al., 1981; Langford and Zhang, 2008) (§2.4.1).

Because our algorithms learn online, we need to measure their online performance, i.e., how well they address users' information needs while learning. Previous work in learning to rank for IR has considered only final performance, i.e., performance on unseen data after training is completed (Liu, 2009), and, in the case of active learning,