Capturing Themed Evidence, a Hybrid Approach

K-Cap 2019
19-21 November 2019
Marina del Rey, California, United States
Enrico Daga and Enrico Motta
The Open University
enrico.daga@open.ac.uk

Motivation
The task of identifying pieces of evidence in texts is of fundamental
importance in supporting qualitative studies in various domains,
especially in the humanities (e.g. historiographic methodology)
Problems: the activity (a) requires effort / time, (b) is not systematic, (c) is
prone to errors, and (d) the methodology is (often) not documented

Case study: the Listening Experience Database
• An open and freely searchable database that brings together a mass of
data about people’s experiences of listening to music of all kinds, in any
historical period and any culture.
• Sophisticated data model, natively in RDF / SPARQL
• Linked Open Data: http://data.open.ac.uk/context/led
• Since 2012, the LED project has collected over 10,000 unique listening
experiences from a variety of textual sources
https://led.kmi.open.ac.uk/

How to support users in capturing themed evidence?
• We coin the expression themed evidence, to refer to (direct or
indirect) traces of a fact or situation relevant to a theme of interest
and study the problem of identifying them in texts.
• The task of identifying themed evidence is at the intersection between
topical text classification (finding texts relevant to a certain theme) and
event retrieval (finding events mentioned in texts).
• Not all topical texts are themed evidence and the nature of the event
itself is often assumed, implicit, and left to the reader

Finding Listening Experiences (theme: music)
• RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of
amateurs who perform admirably the best orchestral works. The usual supper
followed. After propitiating me with a trio from ’Cosi Fan Tutte’, they drew me to
the piano.
• MASONB-31, positive: In the evening we went to Rev. Baptist Noel’s chapel,
where one is always sure of edification from the sermon if not from the psalms.
• MASONB-88, negative: Flags and pendants were suspended from the windows,
[. . . ] the colors of the German States were waving harmoniously together, and
the banners of the Fine Arts, with appropriate inscriptions, particularly those of
music, poetry and painting, were especially honored, and floated triumphant
amidst the standards of electorates, dukedoms, and kingdoms.

A Hybrid Approach
• Themed evidence is a subset of topical texts (e.g. about “music”) - distributional
semantics
• Common knowledge graphs include large numbers of interlinked entities,
including topical entities (in the category “music”) - entity linking to structured
knowledge
• Background knowledge can be used for learning features and tuning elements
of the method - corpus based analysis
• We formalise the task as a binary classification problem; approach in three steps:
1. Statistical relatedness analysis
2. Themed-entity detection
3. Hybridisation phase

Background Knowledge // Listening Experiences
• LE Database includes text excerpts that can be analysed as positive examples.
• Project Gutenberg: >58k books in the public domain (48,790 in English)
• Reuters-21578 (Reu): 21,578 news articles of various categories. It does not
include music.
• The UK Reading Experience Database (UK RED) investigates the evidence of
reading in Britain
• DBpedia is a large knowledge graph published as Linked Data. It includes a SPARQL
endpoint and an NER tool: DBpedia Spotlight

1> Statistical Relatedness Analysis
• Computing embeddings (Word2Vec) on Project Gutenberg (1.5B words!),
we develop a domain dictionary of 10k terms related to a core term:
music[n] (1.0) <— core term
melody[n] (7.8010)
guitar[n] (6.8451)
inspiriting[j] (6.3402)
heartful[j] (4.2634)
psalm[n] (4.0559)
…
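
A minimal sketch of how such a dictionary could be ranked, assuming plain cosine similarity between word vectors. The toy three-dimensional vectors and the names `cosine` and `build_dictionary` are illustrative and not taken from the slides. The scores listed above (e.g. 7.8010) are evidently on a different scale, and that scaling is not described here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_dictionary(vectors, core_term, size):
    """Rank every term by similarity to the core term and keep the top `size`."""
    core_vec = vectors[core_term]
    scored = [(term, cosine(core_vec, vec))
              for term, vec in vectors.items() if term != core_term]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:size]

# Hypothetical toy embeddings keyed as lemma[pos], as on the slides.
TOY_VECTORS = {
    "music[n]":  [0.9, 0.1, 0.0],
    "melody[n]": [0.8, 0.2, 0.1],
    "guitar[n]": [0.7, 0.3, 0.0],
    "tax[n]":    [0.0, 0.1, 0.9],
}

for term, score in build_dictionary(TOY_VECTORS, "music[n]", 2):
    print(term, round(score, 3))
```

In practice the vectors would come from the trained Word2Vec model, and `size` would be 10,000 as stated above.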

1> Statistical Relatedness Analysis
0 rontgen[N]
1 play[V]
2 Brahms[N]
3 symphony[N]
4 another[D]
5 musical[J]
6 take[V]
7 always[R]
8 happen[V]
9 specially[R]
10 count[V]
11 something[N]
12 sort[N]

1> Distribution Analysis // Learning the threshold
We analyse how the dictionary is distributed over one positive corpus (LED) and two
negative corpora (RED and Reu), and calculate both the average score x̄ and the
standard deviation σx on the positive corpus.
These values partition the corpus into four bands (called quartiles here) by score r:
(1) r < x̄ − σx; (2) x̄ − σx < r < x̄;
(3) x̄ < r < x̄ + σx; (4) r > x̄ + σx.
This gives 3 threshold values:
th1: r > x̄ − σx,
th2: r > x̄, and th3: r > x̄ + σx.
[Figure: scores plotted against items]
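
The threshold learning can be sketched as follows, assuming each text in the positive corpus already has a single relatedness score. The function names and the toy scores are hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the slides do not say which variant was used.

```python
from statistics import mean, pstdev

def learn_thresholds(positive_scores):
    """Derive th1, th2, th3 from the mean and standard deviation of positive scores."""
    avg = mean(positive_scores)
    sd = pstdev(positive_scores)  # assumption: population standard deviation
    return {"th1": avg - sd, "th2": avg, "th3": avg + sd}

def band(score, th):
    """Return which of the four bands (1..4) a score falls into."""
    if score < th["th1"]:
        return 1
    if score < th["th2"]:
        return 2
    if score < th["th3"]:
        return 3
    return 4

def is_candidate(score, threshold):
    """A text is kept as candidate evidence when its score exceeds the threshold."""
    return score > threshold

# Hypothetical scores of texts in the positive corpus.
toy_scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
thresholds = learn_thresholds(toy_scores)
print(thresholds)
print(band(6.0, thresholds), is_candidate(6.0, thresholds["th2"]))
```

Choosing th1 favours recall, while th3 favours precision; which one works best is an empirical question for the evaluation.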

1> Statistical relatedness // Example
RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of
amateurs who perform admirably the best orchestral works. The usual
supper followed. After propitiating me with a trio from 'Cosi Fan Tutte', they
drew me to the piano.
• Anacreontic[n]: 4.13048797627
• amateur[n]: 4.60138704262
• admirably[r]: 3.65226351076
• orchestral[j]: 7.09262661606
• trio[n]: 5.60459207257
• piano[n]: 6.36957273307

MASONB-31, positive: In the evening we went to Rev. Baptist Noel's chapel,
where one is always sure of edification from the sermon if not from the
psalms.
psalm[n]: 4.05596201177

MASONB-88, negative: Flags and pendants were suspended from the windows, [...]
the colours of the German States were waving harmoniously together, and the
banners of the Fine Arts, with appropriate inscriptions, particularly those of music,
poetry and painting, were especially honored, and floated triumphant amidst the
standards of electorates, dukedoms, and kingdoms.
harmoniously[r]:4.96754289705
music[n]:1.0
poetry[n]:5.93071678171
painting[n]:4.39244380382
triumphant[j]:3.80869437369
amidst[i]:3.6638322575

Problems
• Named entities may not be sufficiently represented in the dictionary
(e.g. "Prélude à l'après-midi d'un faune").
• Many entities may not appear in the trained embeddings.
• Terms may have a low score because they are not statistically relevant.
• However, the presence of named entities is a clue of possible evidence.
• Distributional approaches alone inherit ambiguity of the core term, for
example, figurative use (sounds good?)

2> Themed entity detection
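• DBpedia Spotlight to identify %entities%
• SPARQL query to filter the ones related to
cat:Music
• Where %entities% are the resources identified by
the NER engine, and %d% is a parameter, set to 5
(>5 too much noise).
# Prefix bindings are assumed; the slide does not show them.
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX cat: <http://dbpedia.org/resource/Category:>
SELECT DISTINCT ?sub WHERE {
  VALUES ?sub { %entities% }
  ?sub dc:subject ?subject .
  # Bounded path as written on the slide; range quantifiers are not in standard SPARQL 1.1.
  ?subject skos:broader{0:%d%} cat:Music
}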
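
A hedged sketch of filling in the query template from code. Because range quantifiers on property paths are not part of standard SPARQL 1.1, this version swaps in an equivalent standard form: a sequence of %d% optional `skos:broader?` steps, which matches paths of length 0 to %d%. The function name `build_themed_entity_query` and the prefix bindings for `dc:` and `cat:` are assumptions; the slides do not show them.

```python
PREFIXES = """PREFIX dc: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX cat: <http://dbpedia.org/resource/Category:>
"""

def build_themed_entity_query(entity_uris, depth=5, category="Music"):
    """Build a query keeping only entities within `depth` broader-hops of a category."""
    if not entity_uris:
        raise ValueError("at least one entity URI is required")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    for uri in entity_uris:
        # Reject characters that would break out of the <...> IRI syntax.
        if any(ch in uri for ch in '<>" \n\t'):
            raise ValueError(f"unsafe URI: {uri!r}")
    values = " ".join(f"<{uri}>" for uri in entity_uris)
    # dc:subject followed by 0..depth optional skos:broader steps.
    steps = ["dc:subject"] + ["skos:broader?"] * depth
    path = "/".join(steps)
    return (
        PREFIXES
        + "SELECT DISTINCT ?sub WHERE {\n"
        + f"  VALUES ?sub {{ {values} }}\n"
        + f"  ?sub {path} cat:{category} .\n"
        + "}"
    )

query = build_themed_entity_query(
    ["http://dbpedia.org/resource/Piano",
     "http://dbpedia.org/resource/Trio_(music)"],
    depth=5,
)
print(query)
```

The resulting string can be sent to any SPARQL 1.1 endpoint; the slide's own quantifier form depends on endpoint-specific support.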

2> Themed-entity detection // Example
RECMUS-619, positive: Introduced to the Anacreontic Society, consisting of
amateurs who perform admirably the best orchestral works. The usual
supper followed. After propitiating me with a trio from 'Cosi Fan Tutte', they
drew me to the piano.
http://dbpedia.org/resource/Anacreontic_Society
http://dbpedia.org/resource/Orchestra
http://dbpedia.org/resource/Trio_(music)
http://dbpedia.org/resource/Così_fan_tutte
http://dbpedia.org/resource/Piano

MASONB-31, positive: In the evening we went to Rev. Baptist Noel's chapel,
where one is always sure of edification from the sermon if not from the
psalms.
http://dbpedia.org/resource/Evening_Prayer_(Anglican)
http://dbpedia.org/resource/Psalms

MASONB-88, negative: Flags and pendants were suspended from the windows, [...]
the colours of the German States were waving harmoniously together, and the
banners of the Fine Arts, with appropriate inscriptions, particularly those of music,
poetry and painting, were especially honored, and floated triumphant amidst the
standards of electorates, dukedoms, and kingdoms.
http://dbpedia.org/resource/Music

3> Hybridisation
• Entity boost: promote terms mapped to entities
• PoS filter: demote terms other than verbs and nouns, to privilege
factual statements
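
One possible reading of these two adjustments, as a sketch only. The slides do not give the boost or demotion formulas, so the multipliers, the default score for entities missing from the dictionary, the rule that the entity boost takes precedence over the PoS filter, and the averaging of term scores into a text score are all assumptions. The term scores are rounded from the earlier examples, and the entity flags follow my reading of the entity detection example.

```python
from dataclasses import dataclass

ENTITY_BOOST = 2.0          # hypothetical multiplier for themed entities
DEMOTION = 0.5              # hypothetical multiplier for terms that are not nouns or verbs
DEFAULT_ENTITY_SCORE = 4.0  # hypothetical score for entities absent from the dictionary

@dataclass
class Term:
    lemma: str
    pos: str                     # single-letter tag as on the slides: n, v, j, r, i
    score: float = 0.0           # dictionary relatedness score, 0.0 if absent
    themed_entity: bool = False  # True if linked to an entity under the theme category

def adjusted(term):
    """Apply the entity boost first, otherwise the PoS filter."""
    if term.themed_entity:
        base = term.score if term.score > 0 else DEFAULT_ENTITY_SCORE
        return base * ENTITY_BOOST
    if term.pos not in ("n", "v"):
        return term.score * DEMOTION
    return term.score

def text_score(terms, hybrid=True):
    """Average term score, with or without the hybrid adjustments."""
    if not terms:
        return 0.0
    values = [adjusted(t) if hybrid else t.score for t in terms]
    return sum(values) / len(values)

RECMUS_619 = [
    Term("Anacreontic", "n", 4.13, True), Term("amateur", "n", 4.60),
    Term("admirably", "r", 3.65), Term("orchestral", "j", 7.09, True),
    Term("trio", "n", 5.60, True), Term("piano", "n", 6.37, True),
]
MASONB_88 = [
    Term("harmoniously", "r", 4.97), Term("music", "n", 1.0, True),
    Term("poetry", "n", 5.93), Term("painting", "n", 4.39),
    Term("triumphant", "j", 3.81), Term("amidst", "i", 3.66),
]

for name, terms in (("RECMUS-619", RECMUS_619), ("MASONB-88", MASONB_88)):
    print(name, round(text_score(terms, hybrid=False), 2), round(text_score(terms), 2))
```

With these hypothetical parameters, the adjustments widen the gap between the positive and the negative excerpt compared with the unadjusted averages.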

3> Hybridisation // Examples
• RECMUS-619: Introduced to the Anacreontic Society, consisting of
amateurs who perform admirably the best orchestral works. The usual
supper followed. After propitiating me with a trio from 'Cosi Fan Tutte',
they drew me to the piano.
• MASONB-31: In the evening we went to Rev. Baptist Noel's chapel, where
one is always sure of edification from the sermon if not from the psalms.
• MASONB-88: Flags and pendants were suspended from the windows, [...]
the colours of the German States were waving harmoniously together, and
the banners of the Fine Arts, with appropriate inscriptions, particularly
those of music, poetry and painting, were especially honored, and floated
triumphant amidst the standards of electorates, dukedoms, and kingdoms.

http://led.kmi.open.ac.uk/discovery/findler

Evaluation // Gold Standard
• 500 positive samples sourced from 17
books in the LED collection
• 500 negative samples sourced from the
same books
• Negative samples picked to match the length
of each positive (avg length ~125 words)
• Accurate: Fleiss’ kappa reports substantial
agreement among annotators
• Pessimistic: negative samples more similar
to positives than to RED and Reu
• Also a gold standard produced from RED, to
evaluate portability
Both GS published openly for reuse

Evaluation // Methods
• Fo: Random Forest Classifier (ML) // trained on LED, RED, and Reu // Test ~80% Acc
• St: Statistical // a dictionary from Gutenberg’s Music shelf // AVG TF/IDF
(more details in the paper)
• Em: Statistical relatedness component only (Embeddings)
• En: Themed entity detection component (Entity)
• Em+F: Statistical relatedness + PoS Filter (Embeddings - Filtered)
• Hy-F: No filter, only entity boost (Hybrid - Unfiltered)
• Hy: Our Hybrid approach
• Hy/R: Our Hybrid approach on the Reading Experience Database (to test
portability). Core concept: book[n] and core entity: dbc:Literature

Evaluation // Discussion
• Fo: high precision, low recall, accuracy slightly above random (robust GS)
• En alone performs slightly above random: the gold standard is pessimistic
• Without noise correction (PoS filter), precision is generally lower
• Hy-F shows the impact of entity detection on recall
• Hy: best of both worlds. Substantial agreement with annotators (Cohen's kappa)
• Hy/R: our approach is applicable to other domains with little configuration
See the paper for more observations and for an analysis of errors
The results are very good: 87% F-Measure & Accuracy
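
For reference, the measures named in this evaluation can be computed from binary labels as in the sketch below. The function name and the toy labels are hypothetical and do not reproduce the reported results. Cohen's kappa is computed here by treating the gold standard and the classifier as two raters.

```python
def binary_metrics(gold, predicted):
    """Precision, recall, F-measure, accuracy and Cohen's kappa for 0/1 labels."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    tn = sum(1 for g, p in pairs if not g and not p)
    n = len(pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    accuracy = (tp + tn) / n

    # Agreement expected by chance, from each rater's label frequencies.
    expected = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0

    return {"precision": precision, "recall": recall, "f_measure": f_measure,
            "accuracy": accuracy, "kappa": kappa}

# Hypothetical toy labels: 1 = themed evidence, 0 = not.
gold = [1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(binary_metrics(gold, pred))
```

On a balanced gold standard such as the one described (500 positive and 500 negative samples), a random classifier has an expected accuracy of 50%, which is the baseline the discussion refers to.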

Future work
• Applying the method to the scan of books (FindLEr demo) involves other issues before
classification, incl. segmentation, and clutter (indexes, references, …)
• In the absence of a knowledge base of annotated documents, how to learn the parameters -
threshold & default score?
• Experiment with other embedding techniques (ELMo, BERT), extract multi-word
expressions, and try other entity linkers (Wikifier/Wikidata)
• We performed a single-concept search; what about a multi-concept search (music & children,
music & war)?
• Searching repositories instead of books (some workflow issues here…)
• KE to support the curation of the documentary evidence. See the Sciknow position
paper “Challenging knowledge extraction to support the curation of documentary evidence in
the humanities”

Questions?
Feedback: @enridaga | www.enridaga.net
