The task of identifying pieces of evidence in texts is of fundamental importance in supporting qualitative studies in various domains, especially in the humanities. In this paper, we coin the expression themed evidence, to refer to (direct or indirect) traces of a fact or situation relevant to a theme of interest and study the problem of identifying them in texts. We devise a generic framework aimed at capturing themed evidence in texts based on a hybrid approach, combining statistical natural language processing, background knowledge, and Semantic Web technologies. The effectiveness of the method is demonstrated on a case study of a digital humanities database aimed at collecting and curating a repository of evidence of experiences of listening to music. Extensive experiments demonstrate that our hybrid approach outperforms alternative solutions. We also evidence its generality by testing it on a different use case in the digital humanities.
1. Capturing Themed Evidence, a Hybrid Approach
K-Cap 2019
19-21 November 2019
Marina del Rey, California, United States
Enrico Daga and Enrico Motta
The Open University
enrico.daga@open.ac.uk
2. Motivation
The task of identifying pieces of evidence in texts is of fundamental
importance in supporting qualitative studies in various domains,
especially in the humanities (e.g. historiographic methodology)
Problems: the activity (a) requires effort / time, (b) is not systematic, (c) is
prone to errors, and (d) the methodology is (often) not documented
3. Case study: the Listening Experience Database
• An open and freely searchable database that brings together a mass of
data about people’s experiences of listening to music of all kinds, in any
historical period and any culture.
• Sophisticated data model, natively in RDF / SPARQL
• Linked Open Data: http://data.open.ac.uk/context/led
• Since 2012, the LED project has collected over 10,000 unique listening
experiences from a variety of textual sources
https://led.kmi.open.ac.uk/
4. How to support users in capturing themed evidence?
• We coin the expression themed evidence, to refer to (direct or
indirect) traces of a fact or situation relevant to a theme of interest
and study the problem of identifying them in texts.
• The task of identifying themed evidence is at the intersection between
topical text classification (finding texts relevant to a certain theme) and
event retrieval (finding events mentioned in texts).
• Not all topical texts are themed evidence and the nature of the event
itself is often assumed, implicit, and left to the reader
5. Finding Listening Experiences (theme: music)
• RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of
amateurs who perform admirably the best orchestral works. The usual supper
followed. After propitiating me with a trio from ’Cosi Fan Tutte’, they drew me to
the piano.
• MASONB-31, positive: In the evening we went to Rev. Baptist Noel’s chapel,
where one is always sure of edification from the sermon if not from the psalms.
• MASONB-88, negative: Flags and pendants were suspended from the windows,
[. . . ] the colors of the German States were waving harmoniously together, and
the banners of the Fine Arts, with appropriate inscriptions, particularly those of
music, poetry and painting, were especially honored, and floated triumphant
amidst the standards of electorates, dukedoms, and kingdoms.
6. A Hybrid Approach
• Themed evidence is a subset of topical texts (e.g. about “music”) - distributional
semantics
• Common knowledge graphs include large numbers of interlinked entities,
including topical entities (in the category “music”) - entity linking to structured
knowledge
• Background knowledge can be used for learning features and tuning elements
of the method - corpus-based analysis
• We formalise the task as a binary classification problem; approach in three steps:
1. Statistical relatedness analysis
2. Themed-entity detection
3. Hybridisation phase
7. Background Knowledge // Listening Experiences
• LE Database includes text excerpts that can be analysed as positive examples.
• Project Gutenberg: >58k books in the public domain (48,790 in English)
• Reuters-21578 (Reu): 21,578 news articles of various categories. It does not
include music.
• The UK Reading Experience Database (UK RED) investigates the evidence of
reading in Britain
• DBpedia is a large knowledge graph published as Linked Data. Includes SPARQL
endpoint and a NER tool: DBpedia Spotlight
8. 1> Statistical Relatedness Analysis
• Computing embeddings (Word2Vec) on Project Gutenberg (1.5B words!),
we develop a domain dictionary of 10k terms related to a core term:
music[n] (1.0) <— core term
melody[n] (7.8010)
guitar[n] (6.8451)
inspiriting[j] (6.3402)
heartful[j] (4.2634)
psalm[n] (4.0559)
…
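A minimal sketch of the dictionary-building step, under stated assumptions: the three-dimensional vectors below are hypothetical toy values standing in for the Word2Vec embeddings trained on Project Gutenberg, and cosine similarity is used as a stand-in for the relatedness score. The slides do not state the scoring function, and the scores listed above are on a different scale.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_dictionary(embeddings, core_term, size):
    """Rank every other term by similarity to the core term; keep the top `size`."""
    core = embeddings[core_term]
    scored = [
        (term, cosine(core, vec))
        for term, vec in embeddings.items()
        if term != core_term
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:size]

# Hypothetical toy vectors, for illustration only (not real embeddings).
toy_embeddings = {
    "music[n]": [1.0, 0.0, 0.0],
    "melody[n]": [0.9, 0.1, 0.0],
    "psalm[n]": [0.5, 0.5, 0.0],
    "window[n]": [0.0, 0.0, 1.0],
}
dictionary = build_dictionary(toy_embeddings, "music[n]", size=2)
```

With the toy vectors, the two terms closest to music[n] are kept and the unrelated window[n] is dropped. In the setting described on the slide, `size` would be 10k.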
10. 1> Distribution Analysis // Learning the threshold
We analyse the distribution of the dictionary over one positive corpus (LED) and
two negative corpora (RED and Reu), and calculate both the average score x̄ and
the standard deviation σx on the positive corpus.
These values partition the corpus into four ranges of the relatedness score r:
(1) r < (x̄ − σx); (2) (x̄ − σx) < r < x̄;
(3) x̄ < r < (x̄ + σx); (4) r > (x̄ + σx).
This gives 3 threshold values:
th1: r > (x̄ − σx),
th2: r > x̄, and th3: r > (x̄ + σx).
[Figure: score distribution plot; axes labelled Scores and Items]
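The threshold-learning step on this slide can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the sample scores are made up, the population standard deviation is used (the slides do not say whether the population or sample form applies), and a score that falls exactly on a boundary is assigned to the upper range.

```python
from statistics import mean, pstdev

def learn_thresholds(positive_scores):
    """Return (th1, th2, th3) = (mean - sd, mean, mean + sd) of the positive corpus."""
    avg = mean(positive_scores)
    sd = pstdev(positive_scores)  # assumption: population standard deviation
    return avg - sd, avg, avg + sd

def score_range(score, thresholds):
    """Return which of the four ranges (1 to 4) a score falls into."""
    th1, th2, th3 = thresholds
    if score < th1:
        return 1
    if score < th2:
        return 2
    if score < th3:
        return 3
    return 4

# Hypothetical per-text scores from a positive corpus, for illustration only.
positive_scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
thresholds = learn_thresholds(positive_scores)
```

For these sample scores the mean is 5 and the standard deviation is 2, so the three candidate thresholds are 3, 5 and 7. A classifier built on th2, for instance, would accept any text whose score exceeds 5.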
11. 1> Statistical relatedness // Example
RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of
amateurs who perform admirably the best orchestral works. The usual
supper followed. After propitiating me with a trio from 'Cosi Fan Tutte', they
drew me to the piano.
• Anacreontic[n]: 4.13048797627
• amateur[n]: 4.60138704262
• admirably[r]: 3.65226351076
• orchestral[j]: 7.09262661606
• trio[n]: 5.60459207257
• piano[n]: 6.36957273307
12. 1> Statistical relatedness // Example
MASONB-31, positive: In the evening we went to Rev. Baptist Noel's chapel,
where one is always sure of edification from the sermon if not from the
psalms.
psalm[n]: 4.05596201177
13. 1> Statistical relatedness // Example
MASONB-88, negative: Flags and pendants were suspended from the windows, [...]
the colours of the German States were waving harmoniously together, and the
banners of the Fine Arts, with appropriate inscriptions, particularly those of music,
poetry and painting, were especially honored, and floated triumphant amidst the
standards of electorates, dukedoms, and kingdoms.
harmoniously[r]: 4.96754289705
music[n]: 1.0
poetry[n]: 5.93071678171
painting[n]: 4.39244380382
triumphant[j]: 3.80869437369
amidst[i]: 3.6638322575
14. Problems
• Named entities may not be sufficiently represented in the dictionary
(e.g. "Prélude à l'après-midi d'un faune").
• Many entities may not appear in the trained embeddings.
• Terms may have a low score because they are not statistically relevant.
• However, the presence of named entities is a clue of possible evidence.
• Distributional approaches alone inherit the ambiguity of the core term, for
example in figurative use (sounds good?)
15. 2> Themed entity detection
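• DBpedia Spotlight to identify %entities%
• SPARQL query to filter the ones related to
the DBpedia category Music (cat:Music)
• %entities% are the resources identified by
the NER engine, and %d% is the maximum category depth, set to 5
(values above 5 introduce too much noise).
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX cat: <http://dbpedia.org/resource/Category:>
SELECT DISTINCT ?sub WHERE {
  VALUES ?sub { %entities% }
  ?sub dc:subject ?subject .
  # bounded path length is a non-standard extension; support depends on the endpoint
  ?subject skos:broader{0:%d%} cat:Music
}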
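As an illustration of how the two placeholders might be filled before the query is sent, here is a small sketch. The function name `fill_query` is hypothetical, the template repeats only the query body from the slide (prefix declarations omitted), and no endpoint is contacted.

```python
# Query body copied from the slide; %entities% and %d% are its placeholders.
QUERY_TEMPLATE = """SELECT distinct ?sub WHERE {
  VALUES ?sub { %entities% }
  ?sub dc:subject ?subject .
  ?subject skos:broader{0:%d%} cat:Music
}"""

def fill_query(entity_uris, depth):
    """Substitute the entity URIs and the depth parameter into the template."""
    # Each URI becomes a SPARQL IRI reference in angle brackets.
    values = " ".join(f"<{uri}>" for uri in entity_uris)
    return QUERY_TEMPLATE.replace("%entities%", values).replace("%d%", str(depth))

# Entities taken from the examples in these slides.
query = fill_query(
    ["http://dbpedia.org/resource/Piano", "http://dbpedia.org/resource/Psalms"],
    depth=5,
)
```

The entities that the filled query returns are the ones treated as themed, so a text with no returned entity gets no support from this component.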
16. 2> Themed-entity detection // Example
RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of
amateurs who perform admirably the best orchestral works. The usual
supper followed. After propitiating me with a trio from 'Cosi Fan Tutte', they
drew me to the piano.
http://dbpedia.org/resource/Anacreontic_Society
http://dbpedia.org/resource/Orchestra
http://dbpedia.org/resource/Trio_(music)
http://dbpedia.org/resource/Così_fan_tutte
http://dbpedia.org/resource/Piano
17. 2> Themed-entity detection // Example
MASONB-31, positive: In the evening we went to Rev. Baptist Noel's chapel,
where one is always sure of edification from the sermon if not from the
psalms.
http://dbpedia.org/resource/Evening_Prayer_(Anglican)
http://dbpedia.org/resource/Psalms
18. 2> Themed-entity detection // Example
MASONB-88, negative: Flags and pendants were suspended from the windows, [...]
the colours of the German States were waving harmoniously together, and the
banners of the Fine Arts, with appropriate inscriptions, particularly those of music,
poetry and painting, were especially honored, and floated triumphant amidst the
standards of electorates, dukedoms, and kingdoms.
http://dbpedia.org/resource/Music
19. 3> Hybridisation
• Entity boost: promote terms mapped to entities
• PoS filter: demote terms other than verbs and nouns, to privilege
factual statements
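The two adjustments can be sketched as below. The slides do not say how terms are promoted or demoted, so everything numeric here is an assumption: the boost is modelled as a multiplier of 2.0, demotion as a multiplier of 0.0, and the tag letter "v" for verbs is assumed by analogy with the n, j, r and i tags shown in the examples. The input scores are rounded from the RECMUS-619 example.

```python
KEPT_POS = {"n", "v"}  # nouns and verbs; "v" is an assumed tag letter

def hybridise(terms, entity_lemmas, boost=2.0, demotion=0.0):
    """Apply the PoS filter, then the entity boost, to (lemma, pos, score) triples."""
    adjusted = []
    for lemma, pos, score in terms:
        if pos not in KEPT_POS:
            score *= demotion  # PoS filter: demote non-noun, non-verb terms
        if lemma in entity_lemmas:
            score *= boost  # entity boost: promote terms mapped to themed entities
        adjusted.append((lemma, score))
    return adjusted

# Scores rounded from the RECMUS-619 example; "piano" maps to a themed entity.
terms = [("amateur", "n", 4.60), ("admirably", "r", 3.65), ("piano", "n", 6.37)]
adjusted = hybridise(terms, entity_lemmas={"piano"})
```

Under these assumed multipliers the adverb is zeroed, the plain noun keeps its score, and the entity-linked noun is doubled. How the adjusted term scores are combined into a text-level score is not shown on the slides and is left out here.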
20. 3> Hybridisation // Examples
• RECMUS-619: Introduced to the Anacreontic Society, consisting of
amateurs who perform admirably the best orchestral works. The usual
supper followed. After propitiating me with a trio from 'Cosi Fan Tutte',
they drew me to the piano
• MASONB-31: In the evening we went to Rev. Baptist Noel's chapel, where
one is always sure of edification from the sermon if not from the psalms.
• MASONB-88: Flags and pendants were suspended from the windows, [...]
the colours of the German States were waving harmoniously together, and
the banners of the Fine Arts, with appropriate inscriptions, particularly
those of music, poetry and painting, were especially honored, and floated
triumphant amidst the standards of electorates, dukedoms, and kingdoms.
22. Evaluation // Gold Standard
• 500 positive samples sourced from 17
books in the LED collection
• 500 negative samples sourced from the
same books
• Negative samples picked with similar length
of each positive (avg length ~125 words)
• Accurate: Fleiss’ kappa reports substantial
agreement among annotators
• Pessimistic: negative samples more similar
to positives than to RED and Reu
• Also a gold standard produced from RED, to
evaluate portability
Both GS published openly for reuse
http://led.kmi.open.ac.uk/discovery/findler
23. Evaluation // Methods
• Fo: Random Forest Classifier (ML) // trained on LED, RED, and Reu // Test ~80% Acc
• St: Statistical // a dictionary from Gutenberg’s Music shelf // AVG TF/IDF
(more details in the paper)
• Em: Statistical relatedness component only (Embeddings)
• En: Themed entity detection component (Entity)
• Em+F: Statistical relatedness + PoS Filter (Embeddings - Filtered)
• Hy-F: No filter, only entity boost (Hybrid - Unfiltered)
• Hy: Our Hybrid approach
• Hy/R: Our Hybrid approach on the Reading Experience Database (to test
portability). Core concept: book[n] and core entity: dbc:Literature
24. Evaluation // Discussion
Fo: high precision, low recall, accuracy slightly
above random (robust GS)
En alone has a performance slightly above
random: gold standard is pessimistic
Without applying noise correction (POS filter),
precision is generally lower
Hy-F shows the impact of entity detection on
recall
Hy: best of both worlds. Substantial agreement
with annotators (Cohen’s K)
Hy/R: our approach is applicable to other
domains with little configuration
See the paper for more observations and for an analysis
of errors
The results are very good: 87% F-Measure & Accuracy
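For reference, the measures named in this discussion follow the standard definitions for binary classification, sketched below. The labels are made-up illustration data, not the gold standard or the reported results.

```python
def evaluate(gold, predicted):
    """Compute precision, recall, F-measure and accuracy for boolean labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    tn = sum(1 for g, p in pairs if not g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }

# Hypothetical labels: True marks a text judged to be themed evidence.
gold = [True, True, True, False, False, False]
predicted = [True, True, False, True, False, False]
scores = evaluate(gold, predicted)
```

On a balanced gold standard such as the one described above (500 positive and 500 negative samples), a random classifier is expected to reach about 50% accuracy, which is the baseline the "slightly above random" remarks refer to.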
25. Future work
• Applying the method to the scan of books (FindLEr demo) involves other issues before
classification, incl. segmentation, and clutter (indexes, references, …)
• In the absence of a knowledge base of annotated documents, how to learn the parameters -
threshold & default score?
• Experiment with other embedding techniques (ELMo, BERT), extract multi-word
expressions, and try other entity linkers (Wikifier/Wikidata)
• We performed a single-concept search; what about a multi-concept search (music & children,
music & war)?
• Searching repositories instead of books (some workflow issues here…)
• KE to support the curation of the documentary evidence. See the Sciknow position
paper “Challenging knowledge extraction to support the curation of documentary evidence in
the humanities”