Enhancing access to information within digital heritage archives, e.g. the New York Times corpus, by identifying discourse phenomena and by searching and filtering events according to multiple facets.
5. Searching
Give a query as input
Obtain a set of relevant articles
Keywords vs. semantics
– Synonyms
– Hyponyms
– Spelling variants
– Inflections
– Relations between query terms
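The gap between keyword and semantic matching can be sketched with a toy query expander. This is only an illustration, not the method used in the talk: the `EXPANSIONS` lexicon and the function names are hypothetical and hand-written, and relations between query terms are not modelled.

```python
# Toy lexicon (hypothetical, hand-written for illustration only).
EXPANSIONS = {
    "crime": {
        "synonyms": ["offence"],
        "hyponyms": ["murder", "robbery"],
        "spelling_variants": ["offense"],
        "inflections": ["crimes"],
    },
}


def expand(term):
    """Return the lower-cased term plus every related form in the toy lexicon."""
    forms = {term.lower()}
    for related in EXPANSIONS.get(term.lower(), {}).values():
        forms.update(related)
    return forms


def keyword_match(query, text):
    """Literal matching: every query term must occur as a token of the text."""
    tokens = set(text.lower().split())
    return all(term.lower() in tokens for term in query.split())


def expanded_match(query, text):
    """Looser matching: each query term may be satisfied by any related form."""
    tokens = set(text.lower().split())
    return all(expand(term) & tokens for term in query.split())


headline = "Trio jailed after robbery in Sandwich"
print(keyword_match("crime Sandwich", headline))
print(expanded_match("crime Sandwich", headline))
```

The literal matcher misses the headline because the word "crime" never occurs in it, while the expanded matcher accepts "robbery" as a hyponym.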
7. Searching
Keywords
Crimes in the town of Sandwich
– Crime Sandwich by Click Bang Productions on SoundCloud
– Sandwich Crime - Topix
– Crime on rye: Four accused of stealing $10 sandwich from car
– Crime Scene Sandwich Bags
– Crime rate in Sandwich, Illinois (IL): murders, rapes, robberies
– Ham Sandwich Nation: Due Process When Everything is a Crime
9. Searching
Semantics
Crimes in the town of Sandwich
– Kent Police issue warning after fake £20 notes reported in Sandwich
– Trio jailed for total of 30 years after crime spree in Sandwich
– Murder at Sandwich - Kent
10. Semantic search engine
Features
Specification of semantic types of search terms: town:Sandwich
Normalisation of semantic entities: Sandwich, Kent = Sandwich, UK
Relations between search terms to describe events: location:Sandwich
Restrictions on discourse context of retrieved events
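The first three features can be illustrated with a minimal sketch. Only the `type:value` notation and the Sandwich, Kent = Sandwich, UK equivalence come from the slide; the `CANONICAL` table, the entity identifier, the event dictionary and all function names are assumptions made for illustration and do not describe the actual search engine.

```python
# Hypothetical normalisation table: surface forms mapped to one canonical id.
CANONICAL = {
    "sandwich, kent": "Sandwich_Kent_UK",
    "sandwich, uk": "Sandwich_Kent_UK",
}


def parse_term(term):
    """Split 'type:value' into (type, value); untyped terms get type None."""
    sem_type, sep, value = term.partition(":")
    if not sep:
        return (None, term.strip())
    return (sem_type.strip().lower(), value.strip())


def normalise(name):
    """Map a surface form to its canonical id, or return it unchanged."""
    cleaned = name.strip()
    return CANONICAL.get(cleaned.lower(), cleaned)


def matches(event, term):
    """Check whether an event (role -> value dict) satisfies one query term."""
    role, value = parse_term(term)
    if role is None:
        # Untyped term: accept a match in any role of the event.
        return any(normalise(v) == normalise(value) for v in event.values())
    return role in event and normalise(event[role]) == normalise(value)


# Hypothetical extracted event with a typed location argument.
event = {"trigger": "crime spree", "location": "Sandwich, UK"}
print(matches(event, "location:Sandwich, Kent"))
print(matches(event, "location:Sandwich, Illinois"))
```

Because both surface forms normalise to the same identifier, a query for Sandwich, Kent retrieves an event annotated with Sandwich, UK, while Sandwich, Illinois stays distinct.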
12. Discourse interpretation
The story
Karl Munro may have killed Sunita in Weatherfield in 2013.
According to Karl Munro, Craig Tinker set Sunita on fire in Weatherfield in 2013.
Karl Munro said he will kill Sunita.
Karl Munro didn’t fail to kill Sunita in Weatherfield in 2013.
Stella Price condemned all of Karl’s wrongdoings.
13. ACE corpus
2005 version
599 news-domain documents
– News articles
– Transcripts of broadcast news
– Transcripts of broadcast conversation
– Conversational telephone speech
– Weblogs
– Discussion fora
Discourse-related attributes
– Polarity
– Tense
– Specificity
– Modality
– Source type
– Subjectivity
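Restricting retrieved events by discourse context can be sketched as attribute filtering. The attribute names follow the list above, but the value labels ("asserted", "speculated", "author", "other") and the readings assigned to the three story sentences are assumptions made for illustration, not annotations taken from the ACE corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """An event mention with a few discourse attributes (hypothetical values)."""
    text: str
    polarity: str = "positive"
    tense: str = "past"
    modality: str = "asserted"
    source_type: str = "author"


# Illustrative readings of three sentences from the story slide.
EVENTS = [
    Event("Karl Munro may have killed Sunita in Weatherfield in 2013.",
          modality="speculated"),
    Event("According to Karl Munro, Craig Tinker set Sunita on fire "
          "in Weatherfield in 2013.",
          source_type="other"),
    Event("Karl Munro said he will kill Sunita.",
          tense="future", source_type="other"),
]


def filter_events(events, **constraints):
    """Keep the events whose attributes equal every given constraint."""
    return [
        e for e in events
        if all(getattr(e, name) == value for name, value in constraints.items())
    ]


for e in filter_events(EVENTS, modality="asserted", tense="past"):
    print(e.text)
```

With these assumed labels, asking for asserted past events drops the speculated killing and the future threat, and keeps only the reported one.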
15. New York Times corpus
Digital archive
20 years' worth of news articles (1.8M)
Includes annotations of
– Metadata
– Named entities
– Normalisation
Facilitates diachronic studies
– Language evolution
– Social change
– Development of events
25. Final remarks
Other domains
The same technique can be adapted to other domains
Previously developed
– EUPMC – medical journal articles
– ASCOT – clinical trials
26. Final remarks
Summary
Enhanced access to information within digital heritage archives (NYT)
Identified discourse phenomena to search for and filter events
Created ISHER, semantic search engine to access the NYT corpus
Future work
Apply to new domains and institutional repositories
Customise towards social unrest
Other languages in danger of digital extinction – Meta-Net
Diachronic studies