Capturing Themed Evidence, a Hybrid Approach
K-Cap 2019
19-21 November 2019
Marina del Rey, California, United States
Enrico Daga and Enrico Motta
The Open University
enrico.daga@open.ac.uk
Motivation
The task of identifying pieces of evidence in texts is of fundamental
importance in supporting qualitative studies in various domains,
especially in the humanities (e.g. historiographic methodology)
Problems: the activity (a) requires effort / time, (b) is not systematic, (c) is
prone to errors, and (d) the methodology is (often) not documented
Case study: the Listening Experience Database
• An open and freely searchable database that brings together a mass of
data about people’s experiences of listening to music of all kinds, in any
historical period and any culture.
• Sophisticated data model, natively in RDF / SPARQL
• Linked Open Data: http://data.open.ac.uk/context/led
• Since 2012, the LED project has collected over 10,000 unique listening
experiences from a variety of textual sources
https://led.kmi.open.ac.uk/
How to support users in capturing themed evidence?
• We coin the expression themed evidence, to refer to (direct or
indirect) traces of a fact or situation relevant to a theme of interest
and study the problem of identifying them in texts.
• The task of identifying themed evidence is at the intersection between
topical text classification (finding texts relevant to a certain theme) and
event retrieval (finding events mentioned in texts).
• Not all topical texts are themed evidence and the nature of the event
itself is often assumed, implicit, and left to the reader
Finding Listening Experiences (theme: music)
• RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of
amateurs who perform admirably the best orchestral works. The usual supper
followed. After propitiating me with a trio from ’Cosi Fan Tutte’, they drew me to
the piano.
• MASONB-31, positive: In the evening we went to Rev. Baptist Noel’s chapel,
where one is always sure of edification from the sermon if not from the psalms.
• MASONB-88, negative: Flags and pendants were suspended from the windows,
[. . . ] the colors of the German States were waving harmoniously together, and
the banners of the Fine Arts, with appropriate inscriptions, particularly those of
music, poetry and painting, were especially honored, and floated triumphant
amidst the standards of electorates, dukedoms, and kingdoms.
A Hybrid Approach
• Themed evidence is a subset of topical texts (e.g. about “music”) - distributional
semantics
• Common knowledge graphs include large amounts of interlinked entities,
including topical entities (in the category “music”) - entity linking to structured
knowledge
• Background knowledge can be used for learning features and tuning elements
of the method - corpus based analysis
• We formalise the task as a binary classification problem; approach in three steps:
1. Statistical relatedness analysis
2. Themed-entity detection
3. Hybridisation phase
Background Knowledge // Listening Experiences
• LE Database includes text excerpts that can be analysed as positive examples.
• Project Gutenberg: more than 58k books in the public domain (48,790 in English)
• Reuters-21578 (Reu): 21,578 news articles of various categories. It does not
include music.
• The UK Reading Experience Database (UK RED) investigates the evidence of
reading in Britain
• DBpedia is a large knowledge graph published as Linked Data. It includes a SPARQL
endpoint and an NER tool: DBpedia Spotlight
1> Statistical Relatedness Analysis
• We compute embeddings (Word2Vec) on Project Gutenberg (1.5B words!)
and develop a domain dictionary of 10k terms related to a core term:
music[n] (1.0) <- core term
melody[n] (7.8010)
guitar[n] (6.8451)
inspiriting[j] (6.3402)
heartful[j] (4.2634)
psalm[n] (4.0559)
…
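A minimal sketch of how a dictionary like this could be derived from word embeddings. The toy vectors, the `build_dictionary` helper, and the use of cosine similarity as the relatedness score are assumptions for illustration only; the scores listed on the slide come from the authors' own scoring, which is not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def build_dictionary(embeddings, core_term, size):
    """Rank all terms by similarity to the core term and keep the top `size`."""
    core_vec = embeddings[core_term]
    scored = [
        (term, cosine(vec, core_vec))
        for term, vec in embeddings.items()
        if term != core_term
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:size]

# Toy 3-dimensional vectors (hypothetical values, not trained embeddings).
toy_embeddings = {
    "music[n]": [1.0, 0.0, 0.0],
    "melody[n]": [0.9, 0.1, 0.0],
    "guitar[n]": [0.7, 0.3, 0.1],
    "window[n]": [0.0, 1.0, 0.2],
}

dictionary = build_dictionary(toy_embeddings, "music[n]", size=2)
for term, score in dictionary:
    print(f"{term}\t{score:.4f}")
```

In practice the vectors would come from a model trained on the corpus, and `size` would be 10,000 as stated above.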
1> Statistical Relatedness Analysis
0 rontgen[N]
1 play[V]
2 Brahms[N]
3 symphony[N]
4 another[D]
5 musical[J]
6 take[V]
7 always[R]
8 happen[V]
9 specially[R]
10 count[V]
11 something[N]
12 sort[N]
1> Distribution Analysis // Learning the threshold
We analyse the distribution of the dictionary scores over one positive corpus (LED)
and two negative corpora (RED and Reu), and calculate both the average score x and
the standard deviation σx on the positive corpus.
These values partition the corpus into four bands (the “quartiles”), where r is the
relatedness score of a text:
(1) r < (x − σx);
(2) (x − σx) < r < x;
(3) x < r < (x + σx);
(4) r > (x + σx).
This gives 3 candidate threshold values:
th1: r > (x − σx),
th2: r > x, and th3: r > (x + σx).
[Plot: score distribution; axes labelled Scores and Items]
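The threshold learning can be sketched as follows. This is an illustration under assumptions: the scores are made-up numbers, the helper names `learn_thresholds` and `band` are hypothetical, and the population standard deviation is used because the slide does not say which variant was applied.

```python
import statistics

def learn_thresholds(positive_scores):
    """Return (th1, th2, th3) = (mean - sd, mean, mean + sd) of the positive corpus."""
    mean = statistics.mean(positive_scores)
    sd = statistics.pstdev(positive_scores)  # population SD (an assumption)
    return mean - sd, mean, mean + sd

def band(r, thresholds):
    """Map a score r to band 1..4: one plus the number of thresholds it exceeds."""
    return 1 + sum(r > th for th in thresholds)

# Hypothetical relatedness scores for items of the positive corpus.
positive_scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
th1, th2, th3 = learn_thresholds(positive_scores)
print(th1, th2, th3)
print([band(r, (th1, th2, th3)) for r in positive_scores])
```

A classifier would then accept a text when its score exceeds whichever of the three thresholds is selected.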
1> Statistical relatedness // Example
RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of
amateurs who perform admirably the best orchestral works. The usual
supper followed. After propitiating me with a trio from 'Cosi Fan Tutte', they
drew me to the piano.
• Anacreontic[n]: 4.13048797627
• amateur[n]: 4.60138704262
• admirably[r]: 3.65226351076
• orchestral[j]: 7.09262661606
• trio[n]: 5.60459207257
• piano[n]: 6.36957273307
1> Statistical relatedness // Example
MASONB-31, positive: In the evening we went to Rev. Baptist Noel's chapel,
where one is always sure of edification from the sermon if not from the
psalms.
psalm[n]: 4.05596201177
1> Statistical relatedness // Example
MASONB-88, negative: Flags and pendants were suspended from the windows, [...]
the colours of the German States were waving harmoniously together, and the
banners of the Fine Arts, with appropriate inscriptions, particularly those of music,
poetry and painting, were especially honored, and floated triumphant amidst the
standards of electorates, dukedoms, and kingdoms.
harmoniously[r]:4.96754289705
music[n]:1.0
poetry[n]:5.93071678171
painting[n]:4.39244380382
triumphant[j]:3.80869437369
amidst[i]:3.6638322575
Problems
• Named entities may not be sufficiently represented in the dictionary
(e.g. "Prélude à l'après-midi d'un faune").
• Many entities may not appear in the trained embeddings.
• Terms may have a low score because they are not statistically relevant.
• However, the presence of named entities is a clue of possible evidence.
• Distributional approaches alone inherit the ambiguity of the core term, for
example, figurative use (sounds good?)
2> Themed entity detection
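• DBpedia Spotlight to identify %entities%
• SPARQL query to filter the ones related to
cat:Music (the DBpedia category Music)
• Where %entities% are the resources identified by
the NER engine, and %d% is the maximum number of
skos:broader steps, set to 5 (values above 5 introduce
too much noise).
# Prefix bindings are assumed (not shown on the slide)
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX cat: <http://dbpedia.org/resource/Category:>
SELECT DISTINCT ?sub WHERE {
  VALUES ?sub { %entities% }
  ?sub dc:subject ?subject .
  # bounded path length: not standard SPARQL 1.1, support depends on the endpoint
  ?subject skos:broader{0:%d%} cat:Music
}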
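A small sketch of how the two placeholders could be filled in before the query is sent to the endpoint. The `build_query` helper and the plain string substitution are assumptions for illustration; the template body is copied from the slide, and sending the query is left out.

```python
QUERY_TEMPLATE = """\
SELECT DISTINCT ?sub WHERE {
  VALUES ?sub { %entities% }
  ?sub dc:subject ?subject .
  ?subject skos:broader{0:%d%} cat:Music
}"""

def build_query(entity_uris, depth=5):
    """Substitute the %entities% and %d% placeholders in the template."""
    # Each resource URI is written as an IRI in angle brackets.
    values = " ".join(f"<{uri}>" for uri in entity_uris)
    return (QUERY_TEMPLATE
            .replace("%entities%", values)
            .replace("%d%", str(depth)))

# Resources as they might be returned by the NER engine for one text.
entities = [
    "http://dbpedia.org/resource/Piano",
    "http://dbpedia.org/resource/Psalms",
]
query = build_query(entities)
print(query)
```

The default `depth=5` mirrors the parameter value stated above.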
2> Themed-entity detection // Example
RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of
amateurs who perform admirably the best orchestral works. The usual
supper followed. After propitiating me with a trio from 'Cosi Fan Tutte', they
drew me to the piano.
http://dbpedia.org/resource/Anacreontic_Society
http://dbpedia.org/resource/Orchestra
http://dbpedia.org/resource/Trio_(music)
http://dbpedia.org/resource/Così_fan_tutte
http://dbpedia.org/resource/Piano
2> Themed-entity detection // Example
MASONB-31, positive: In the evening we went to Rev. Baptist Noel's chapel,
where one is always sure of edification from the sermon if not from the
psalms.
http://dbpedia.org/resource/Evening_Prayer_(Anglican)
http://dbpedia.org/resource/Psalms
2> Themed-entity detection // Example
MASONB-88, negative: Flags and pendants were suspended from the windows, [...]
the colours of the German States were waving harmoniously together, and the
banners of the Fine Arts, with appropriate inscriptions, particularly those of music,
poetry and painting, were especially honored, and floated triumphant amidst the
standards of electorates, dukedoms, and kingdoms.
http://dbpedia.org/resource/Music
3> Hybridisation
Entity boost: promote terms mapped to entities
PoS filter: demote terms other than verbs and nouns, to privilege
factual statements
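The two adjustments can be sketched as below. The slide does not say how promotion and demotion are computed, so the multiplicative factors, their values (`entity_boost=2.0`, `pos_penalty=0.5`), and the `hybrid_score` helper are hypothetical. The term scores are taken from the RECMUS-619 example.

```python
def hybrid_score(term_scores, entity_terms, entity_boost=2.0, pos_penalty=0.5):
    """Adjust each term score with an entity boost and a PoS filter.

    term_scores: dict mapping 'lemma[pos]' keys to relatedness scores.
    entity_terms: set of keys that the entity linker mapped to a themed entity.
    """
    adjusted = {}
    for key, score in term_scores.items():
        # Extract the PoS tag between the square brackets, e.g. 'n' in 'piano[n]'.
        pos = key[key.index("[") + 1 : key.index("]")]
        if key in entity_terms:
            score *= entity_boost   # promote terms mapped to entities
        if pos not in ("n", "v"):
            score *= pos_penalty    # demote terms that are not nouns or verbs
        adjusted[key] = score
    return adjusted

term_scores = {
    "piano[n]": 6.36957273307,
    "amateur[n]": 4.60138704262,
    "admirably[r]": 3.65226351076,
}
adjusted = hybrid_score(term_scores, entity_terms={"piano[n]"})
for key, score in adjusted.items():
    print(key, score)
```

Here the noun linked to an entity is promoted, the plain noun is left unchanged, and the adverb is demoted.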
3> Hybridisation // Examples
• RECMUS-619: Introduced to the Anacreontic Society, consisting of
amateurs who perform admirably the best orchestral works. The usual
supper followed. After propitiating me with a trio from 'Cosi Fan Tutte',
they drew me to the piano
• MASONB-31: In the evening we went to Rev. Baptist Noel's chapel, where
one is always sure of edification from the sermon if not from the psalms.
• MASONB-88: Flags and pendants were suspended from the windows, [...]
the colours of the German States were waving harmoniously together, and
the banners of the Fine Arts, with appropriate inscriptions, particularly
those of music, poetry and painting, were especially honored, and floated
triumphant amidst the standards of electorates, dukedoms, and kingdoms.
http://led.kmi.open.ac.uk/discovery/findler
Evaluation // Gold Standard
• 500 positive samples sourced from 17
books in the LED collection
• 500 negative samples sourced from the
same books
• Negative samples picked with similar length
of each positive (avg length ~125 words)
• Accurate: Fleiss’ kappa reports substantial
agreement among annotators
• Pessimistic: negative samples are more similar
to positives than to RED and Reu
• Also a gold standard produced from RED, to
evaluate portability
Both gold standards are published openly for reuse
Evaluation // Methods
• Fo: Random Forest Classifier (ML) // trained on LED, RED, and Reu // Test ~80% Acc
• St: Statistical // a dictionary from Gutenberg’s Music shelf // AVG TF/IDF
(more details in the paper)
• Em: Statistical relatedness component only (Embeddings)
• En: Themed entity detection component (Entity)
• Em+F: Statistical relatedness + PoS Filter (Embeddings - Filtered)
• Hy-F: No filter, only entity boost (Hybrid - Unfiltered)
• Hy: Our Hybrid approach
• Hy/R: Our Hybrid approach on the Reading Experience Database (to test
portability). Core concept: book[n] and core entity: dbc:Literature
Evaluation // Discussion
Fo: high precision, low recall, accuracy slightly
above random (robust GS)
En alone has a performance slightly above
random: gold standard is pessimistic
Without applying noise correction (POS filter),
precision is generally lower
Hy-F shows the impact of entity detection on
recall
Hy: best of both worlds. Substantial agreement
with annotators (Cohen’s K)
Hy/R: our approach is applicable to other
domains with little configuration
See the paper for more observations and for an analysis
of errors
The results are very good: 87% F-Measure & Accuracy
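For reference, the measures named in this evaluation can be computed from binary labels as sketched below. The label lists are toy data, not the gold standard, and `binary_metrics` is a hypothetical helper; it does not guard against division by zero.

```python
def binary_metrics(gold, predicted):
    """Precision, recall, F-measure, accuracy and Cohen's kappa for binary labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    tn = sum(1 for g, p in pairs if not g and not p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    n = len(pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / n
    # Agreement expected by chance: both positive plus both negative.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
        "kappa": kappa,
    }

# Toy labels: 1 = themed evidence, 0 = not themed evidence.
gold = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = binary_metrics(gold, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

On a balanced gold standard such as the one described here (500 positive and 500 negative samples), chance agreement is close to 0.5, which is why accuracy and kappa are reported together.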
Future work
• Applying the method to scanned books (FindLEr demo) involves other issues before
classification, incl. segmentation and clutter (indexes, references, …)
• In the absence of a knowledge base of annotated documents, how to learn the parameters -
threshold & default score?
• Experiment with other embedding techniques (ELMo, BERT), extract multi-word
expressions, and try other entity linkers (Wikifier/Wikidata)
• We performed a single-concept search; what about a multi-concept search (music & children,
music & war)?
• Searching repositories instead of books (some workflow issues here…)
• KE to support the curation of the documentary evidence. See the Sciknow position
paper “Challenging knowledge extraction to support the curation of documentary evidence in
the humanities”
Questions?
Feedback: @enridaga | www.enridaga.net

Discover Why Less is More in B2B Research
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 

Capturing Themed Evidence, a Hybrid Approach

  • 1. Capturing Themed Evidence, a Hybrid Approach K-Cap 2019 19-21 November 2019 Marina del Rey, California, United States Enrico Daga and Enrico Motta The Open University enrico.daga@open.ac.uk
  • 2. Motivation The task of identifying pieces of evidence in texts is of fundamental importance in supporting qualitative studies in various domains, especially in the humanities (e.g. historiographic methodology) Problems: the activity (a) requires effort / time, (b) is not systematic, (c) is prone to errors, and (d) the methodology is (often) not documented
  • 3. Case study: the Listening Experience Database • An open and freely searchable database that brings together a mass of data about people’s experiences of listening to music of all kinds, in any historical period and any culture. • Sophisticated data model, natively in RDF / SPARQL • Linked Open Data: http://data.open.ac.uk/context/led • Since 2012, the LED project has collected over 10,000 unique listening experiences from a variety of textual sources https://led.kmi.open.ac.uk/
  • 4. How to support users in capturing themed evidence? • We coin the expression themed evidence to refer to (direct or indirect) traces of a fact or situation relevant to a theme of interest, and study the problem of identifying them in texts. • The task of identifying themed evidence is at the intersection between topical text classification (finding texts relevant to a certain theme) and event retrieval (finding events mentioned in texts). • Not all topical texts are themed evidence, and the nature of the event itself is often assumed, implicit, and left to the reader
  • 5. Finding Listening Experiences (theme: music) • RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of amateurs who perform admirably the best orchestral works. The usual supper followed. After propitiating me with a trio from ’Cosi Fan Tutte’, they drew me to the piano. • MASONB-31, positive: In the evening we went to Rev. Baptist Noel’s chapel, where one is always sure of edification from the sermon if not from the psalms. • MASONB-88, negative: Flags and pendants were suspended from the windows, [. . . ] the colors of the German States were waving harmoniously together, and the banners of the Fine Arts, with appropriate inscriptions, particularly those of music, poetry and painting, were especially honored, and floated triumphant amidst the standards of electorates, dukedoms, and kingdoms.
  • 6. A Hybrid Approach • Themed evidence is a subset of topical texts (e.g. about “music”) - distributional semantics • Common knowledge graphs include large amounts of interlinked entities, including topical entities (in the category “music”) - entity linking to structured knowledge • Background knowledge can be used for learning features and tuning elements of the method - corpus-based analysis • We formalise the task as a binary classification problem and approach it in three steps: 1. Statistical relatedness analysis 2. Themed-entity detection 3. Hybridisation phase
  • 7. Background Knowledge // Listening Experiences • The LE Database includes text excerpts that can be analysed as positive examples. • Project Gutenberg: >58k books in the public domain (48,790 in English) • Reuters-21578 (Reu): 21,578 news articles of various categories. It does not include music. • The UK Reading Experience Database (UK RED) investigates the evidence of reading in Britain • DBpedia is a large knowledge graph published as Linked Data. It includes a SPARQL endpoint and a NER tool: DBpedia Spotlight
  • 8. 1> Statistical Relatedness Analysis • We compute embeddings (Word2Vec) on Project Gutenberg (1.5B words!) and develop a domain dictionary of 10k terms related to a core term: music[n] (1.0) <- core term melody[n] (7.8010) guitar[n] (6.8451) inspiriting[j] (6.3402) heartful[j] (4.2634) psalm[n] (4.0559) …
  • 9. 1> Statistical Relatedness Analysis 0 rontgen[N] 1 play[V] 2 Brahms[N] 3 symphony[N] 4 another[D] 5 musical[J] 6 take[V] 7 always[R] 8 happen[V] 9 specially[R] 10 count[V] 11 something[N] 12 sort[N]
  • 10. 1> Distribution Analysis // Learning the threshold We analyse the distribution of the dictionary scores on one positive corpus (LED) and two negative corpora (RED and Reu), and calculate both the average score x̄ and the standard deviation σx on the positive corpus. These values partition the corpus into four ranges of the score r: (1) r < x̄ − σx; (2) x̄ − σx < r < x̄; (3) x̄ < r < x̄ + σx; (4) r > x̄ + σx. This gives 3 threshold values: th1: r > x̄ − σx, th2: r > x̄, and th3: r > x̄ + σx. [Figure: distribution of scores; axes: Scores, Items]
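
A minimal sketch of how the three thresholds could be derived, assuming each passage of the positive corpus has already been reduced to a single relatedness score. The function names and the use of the population standard deviation are assumptions made for illustration, not details taken from the slides.

```python
from statistics import mean, pstdev


def learn_thresholds(positive_scores):
    """Return (th1, th2, th3) = (mean - sd, mean, mean + sd).

    `positive_scores` holds one relatedness score per passage of the
    positive corpus. Population standard deviation is an assumption.
    """
    avg = mean(positive_scores)
    sd = pstdev(positive_scores)
    return avg - sd, avg, avg + sd


def score_range(r, thresholds):
    """Return which of the four ranges (1..4) the score r falls into."""
    th1, th2, th3 = thresholds
    if r <= th1:
        return 1
    if r <= th2:
        return 2
    if r <= th3:
        return 3
    return 4


def passes(r, threshold):
    """A passage is a candidate when its score exceeds the chosen threshold."""
    return r > threshold
```

Choosing th1 favours recall (more passages pass), while th3 favours precision; which one works best is an empirical question answered against the gold standard.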
  • 11. 1> Statistical relatedness // Example RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of amateurs who perform admirably the best orchestral works. The usual supper followed. After propitiating me with a trio from 'Cosi Fan Tutte', they drew me to the piano. • Anacreontic[n]: 4.13048797627 • amateur[n]: 4.60138704262 • admirably[r]: 3.65226351076 • orchestral[j]: 7.09262661606 • trio[n]: 5.60459207257 • piano[n]: 6.36957273307
  • 12. 1> Statistical relatedness // Example MASONB-31, positive: In the evening we went to Rev. Baptist Noel's chapel, where one is always sure of edification from the sermon if not from the psalms. psalm[n]: 4.05596201177
  • 13. 1> Statistical relatedness // Example MASONB-88, negative: Flags and pendants were suspended from the windows, [...] the colours of the German States were waving harmoniously together, and the banners of the Fine Arts, with appropriate inscriptions, particularly those of music, poetry and painting, were especially honored, and floated triumphant amidst the standards of electorates, dukedoms, and kingdoms. • harmoniously[r]: 4.96754289705 • music[n]: 1.0 • poetry[n]: 5.93071678171 • painting[n]: 4.39244380382 • triumphant[j]: 3.80869437369 • amidst[i]: 3.6638322575
  • 14. Problems • Named entities may not be sufficiently represented in the dictionary (e.g. "Prélude à l'après-midi d'un faune"). • Many entities may not appear in the trained embeddings. • Terms may have a low score because they are not statistically relevant. • However, the presence of named entities is a clue of possible evidence. • Distributional approaches alone inherit the ambiguity of the core term, for example, figurative use (sounds good?)
  • 15. 2> Themed entity detection • DBpedia Spotlight to identify %entities% • SPARQL query to filter the ones related to dbcat:Music • Here %entities% are the resources identified by the NER engine, and %d% is a parameter, set to 5 (values above 5 introduce too much noise). SELECT distinct ?sub WHERE { VALUES ?sub { %entities% } ?sub dc:subject ?subject . ?subject skos:broader{0:%d%} cat:Music }
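
The template substitution described on this slide can be sketched as follows. The query text is copied from the slide as-is, including its prefixes and path-length syntax, which may need adapting to the SPARQL endpoint in use; the names `QUERY_TEMPLATE` and `build_query` are hypothetical.

```python
# Query template copied from the slide; prefix declarations are not shown
# there and are therefore not added here.
QUERY_TEMPLATE = (
    "SELECT distinct ?sub WHERE { "
    "VALUES ?sub { %entities% } "
    "?sub dc:subject ?subject . "
    "?subject skos:broader{0:%d%} cat:Music }"
)


def build_query(entity_uris, depth=5):
    """Fill %entities% and %d% in the template.

    `entity_uris` are the resource URIs returned by the entity linker;
    `depth` is the maximum number of skos:broader hops (5 on the slide).
    """
    if not entity_uris:
        raise ValueError("at least one entity URI is required")
    values = " ".join(f"<{uri}>" for uri in entity_uris)
    # Substitute the depth first so that nothing inside a URI is touched.
    query = QUERY_TEMPLATE.replace("%d%", str(depth))
    return query.replace("%entities%", values)


if __name__ == "__main__":
    print(build_query([
        "http://dbpedia.org/resource/Piano",
        "http://dbpedia.org/resource/Psalms",
    ]))
```

Entities returned by the query are treated as themed entities; those filtered out are ignored by the later steps.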
  • 16. 2> Themed-entity detection // Example RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of amateurs who perform admirably the best orchestral works. The usual supper followed. After propitiating me with a trio from 'Cosi Fan Tutte', they drew me to the piano. http://dbpedia.org/resource/Anacreontic_Society http://dbpedia.org/resource/Orchestra http://dbpedia.org/resource/Trio_(music) http://dbpedia.org/resource/Così_fan_tutte http://dbpedia.org/resource/Piano
  • 17. 2> Themed-entity detection // Example MASONB-31, positive: In the evening we went to Rev. Baptist Noel's chapel, where one is always sure of edification from the sermon if not from the psalms. http://dbpedia.org/resource/Evening_Prayer_(Anglican) http://dbpedia.org/resource/Psalms
  • 18. 2> Themed-entity detection // Example MASONB-88, negative: Flags and pendants were suspended from the windows, [...] the colours of the German States were waving harmoniously together, and the banners of the Fine Arts, with appropriate inscriptions, particularly those of music, poetry and painting, were especially honored, and floated triumphant amidst the standards of electorates, dukedoms, and kingdoms. http://dbpedia.org/resource/Music
  • 19. 3> Hybridisation • Entity boost: promote terms mapped to entities • PoS filter: demote terms other than verbs and nouns, to privilege factual statements
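
One way the entity boost and PoS filter might combine is sketched below. The slides do not give the boost and demotion factors or the way term scores are aggregated into a passage score, so `ENTITY_BOOST`, `POS_DEMOTION`, and the averaging step are all assumptions chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass
class Term:
    lemma: str
    pos: str         # PoS tag as in the dictionary: 'n', 'v', 'j', 'r', ...
    score: float     # relatedness score from the domain dictionary
    is_entity: bool  # True if the term was linked to a themed entity


ENTITY_BOOST = 2.0   # hypothetical multiplier for terms mapped to entities
POS_DEMOTION = 0.5   # hypothetical multiplier for terms that are not n/v
KEPT_POS = {"n", "v"}


def hybrid_term_score(term):
    """Apply the PoS filter, then the entity boost, to one term."""
    score = term.score
    if term.pos not in KEPT_POS:
        score *= POS_DEMOTION
    if term.is_entity:
        score *= ENTITY_BOOST
    return score


def hybrid_passage_score(terms):
    """Average the adjusted term scores (the aggregation is an assumption)."""
    if not terms:
        return 0.0
    return sum(hybrid_term_score(t) for t in terms) / len(terms)
```

The resulting passage score would then be compared against the threshold learned from the positive corpus.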
  • 20. 3> Hybridisation // Examples • RECMUS-619: Introduced to the Anacreontic Society, consisting of amateurs who perform admirably the best orchestral works. The usual supper followed. After propitiating me with a trio from 'Cosi Fan Tutte', they drew me to the piano • MASONB-31: In the evening we went to Rev. Baptist Noel's chapel, where one is always sure of edification from the sermon if not from the psalms. • MASONB-88: Flags and pendants were suspended from the windows, [...] the colours of the German States were waving harmoniously together, and the banners of the Fine Arts, with appropriate inscriptions, particularly those of music, poetry and painting, were especially honored, and floated triumphant amidst the standards of electorates, dukedoms, and kingdoms.
  • 22. Evaluation // Gold Standard • 500 positive samples sourced from 17 books in the LED collection • 500 negative samples sourced from the same books • Negative samples picked with a length similar to each positive (avg length ~125 words) • Accurate: Fleiss’ kappa reports substantial agreement among annotators • Pessimistic: negative samples are more similar to positives than to RED and Reu • Also a gold standard produced from RED, to evaluate portability • Both gold standards are published openly for reuse: http://led.kmi.open.ac.uk/discovery/findler
  • 23. Evaluation // Methods • Fo: Random Forest Classifier (ML) // trained on LED, RED, and Reu // Test ~80% Acc • St: Statistical // a dictionary from Gutenberg’s Music shelf // AVG TF/IDF (more details in the paper) • Em: Statistical relatedness component only (Embeddings) • En: Themed entity detection component (Entity) • Em+F: Statistical relatedness + PoS Filter (Embeddings - Filtered) • Hy-F: No filter, only entity boost (Hybrid - Unfiltered) • Hy: Our Hybrid approach • Hy/R: Our Hybrid approach on the Reading Experience Database (to test portability). Core concept: book[n] and core entity: dbc:Literature
  • 24. Evaluation // Discussion • Fo: high precision, low recall, accuracy slightly above random (robust GS) • En alone performs slightly above random: the gold standard is pessimistic • Without applying noise correction (PoS filter), precision is generally lower • Hy-F shows the impact of entity detection on recall • Hy: best of both worlds. Substantial agreement with annotators (Cohen’s kappa) • Hy/R: our approach is applicable to other domains with small configuration changes • See the paper for more observations and for an analysis of errors • The results are very good: 87% F-Measure & Accuracy
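
For reference, the measures quoted in the discussion (precision, recall, F-measure, accuracy) can be computed for a binary classifier as in the sketch below. The function name and the toy labels in the usage example are illustrative and are not data from the evaluation.

```python
def binary_metrics(gold, predicted):
    """Precision, recall, F-measure and accuracy for boolean labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # true positives
    fp = sum(1 for g, p in pairs if not g and p)      # false positives
    fn = sum(1 for g, p in pairs if g and not p)      # false negatives
    tn = sum(1 for g, p in pairs if not g and not p)  # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


if __name__ == "__main__":
    # Toy labels: True = passage is a listening experience.
    gold = [True, True, True, False, False, False]
    predicted = [True, True, False, True, False, False]
    print(binary_metrics(gold, predicted))
```

On a balanced gold standard such as the 500 positive and 500 negative samples described earlier, a random classifier is expected to reach about 50% accuracy, which is the baseline behind the "slightly above random" remarks.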
  • 25. Future work • Applying the method to scans of books (FindLEr demo) involves other issues before classification, incl. segmentation and clutter (indexes, references, …) • In the absence of a knowledge base of annotated documents, how can the parameters (threshold & default score) be learned? • Experiment with other embedding techniques (ELMo, BERT), extract multi-word expressions, and try other entity linkers (Wikifier/Wikidata) • We performed a single-concept search; what about a multi-concept search (music & children, music & war)? • Searching repositories instead of books (some workflow issues here…) • KE to support the curation of the documentary evidence. See the Sciknow position paper “Challenging knowledge extraction to support the curation of documentary evidence in the humanities” http://led.kmi.open.ac.uk/discovery/findler