Keynote at HELDIG Summit
7 November 2019, University of Helsinki, Finland https://www.helsinki.fi/en/helsinki-centre-for-digital-humanities/heldig-digital-humanities-summit-2019
Abstract:
The use of computational methods in humanities research is gaining popularity and leading to new insights. But as we move from distant reading methods to deeper language understanding, we find that many state-of-the-art language technology tools don't behave quite as advertised in publications. The corpora humanities scholars investigate display a wide range of language phenomena, and humanities scholars do not necessarily have the same goals when applying these language technology tools as the computational linguists who developed them. The variety in time span, genre, digitisation quality and corpus heterogeneity shows the gap between the two research domains.
In this talk, I will discuss several projects in which we needed to address the mismatch between language technology tools and the humanities research objectives, and how we can go forward in fitting our computational methods to the diversity of humanities research questions.
Transcript:
Square pegs and round holes: addressing the mismatch between humanities questions and the state-of-the-art in language technology
1. Square pegs and round holes:
addressing the mismatch between humanities questions
and the state-of-the-art in language technology
Marieke.van.Erp@dh.huc.knaw.nl
@merpeltje
DIGITAL HUMANITIES LAB
2. Three use cases:
• Messy data: EviDENce project
• OCR troubles: Historical Recipe Web
• Genre mismatch: Why Language
Technology Can’t Handle Game of Thrones
(yet)
3. EviDENce - Ego Documents Events ModelliNg
How individuals recall war and violence
Hucopix - Elodie Burillon
4. Ego Documents Events modelliNg - how individuals recall war and violence
Sources: - Oral history interview transcripts WW2 (450)
Aims: - Better understand nature of and change in eyewitness reports
- Further develop event detection as means for extracting relevant
information from large and complex textual datasets
10. Actors: Matching of manual and automatic annotation
Manual                    Count   NLP pipeline                             Count
Ik (I)                    215     Ik                                       61
we (we)                   68      we                                       10
vader (father)            30      “mijn vader” (my father)                 1
moeder (mother)           11      “mijn moeder” (my mother)                2
broer (brother)           9       “broer en zus” (brother and sister)      1
                                  “een broer” (a brother)                  1
vrienden (friends)        8       not found                                1
Amerikanen (Americans)    7       Location: “Amerikanen”                   6
ouders (parents)          6       not found                                4
die Duitsers (those Germans) 5    Location: “Duitsers”                     9
11. Taking a step back: what does
the research question really need?
• EviDENce historians are interested in
relevant passages
• NLP pipeline analyses texts down to word
level
• Should we be using an NLP pipeline at all?
Image source: https://cdn.xingosoftware.com/dedikkeblauwe/images/fetch/dpr_2/https%3A%2F%2Fwww.dedikkeblauwe.nl%2Fassets%2Fupload%2Fimages%2F49%2F20190131165659_Kanon-op-mug.png
12. Back to the drawing board!
• Current pipeline is error prone
• Humanities scholars are not trained to think
in NLP modules and linguistic layers
• Can we gather text passages describing
violence without deep text analysis?
• Three approaches:
• keyword expansion
• doc2vec
• ElasticSearch
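The first of these approaches, keyword expansion, can be sketched roughly as follows. This is a minimal illustration and not the EviDENce project's implementation: the seed terms, the `EXPANSIONS` table and the helper names `expand` and `find_passages` are all assumptions made for the example, and the sample passages are invented.

```python
import re

# Hypothetical seed keywords for violence-related passages.
SEEDS = {"shot", "bomb"}

# Hypothetical expansion table; in practice the related terms could come
# from a thesaurus or from nearest neighbours in an embedding space.
EXPANSIONS = {
    "shot": {"shooting", "gunfire"},
    "bomb": {"bombing", "bombardment"},
}

def expand(seeds, expansions):
    """Return the seed set plus every expansion term listed for a seed."""
    expanded = set(seeds)
    for seed in seeds:
        expanded |= expansions.get(seed, set())
    return expanded

def find_passages(passages, keywords):
    """Rank passage indices by keyword hits; passages without hits are dropped."""
    scored = []
    for index, text in enumerate(passages):
        tokens = re.findall(r"\w+", text.lower())
        hits = sum(1 for token in tokens if token in keywords)
        if hits:
            scored.append((hits, index))
    # Most hits first; ties keep document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored]

passages = [
    "We walked to school every morning.",
    "Then the bombing started and we heard gunfire nearby.",
    "My father was shot at the border.",
]
ranked = find_passages(passages, expand(SEEDS, EXPANSIONS))
print(ranked)
```

The point of the sketch is that passages are retrieved without any parsing or entity recognition, which is what makes this kind of approach attractive when the research question only needs relevant passages.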
13.
14.
15.
16. Take home message
• Choose the right tool!
• It takes time to understand each other
• Next week we’ll know what other historians
think of our approach :)
17. Constructing a Recipe Web from
Historical Newspapers
Marieke van Erp @merpeltje
Melvin Wevers @melvinwevers
Hugo Huurdeman @timelessfuture
Image source: https://static.ah.nl/static/recepten/img_006188_890x594_JPG.jpg
18. Butter, salt & pepper
• Analysis of food customs:
• historians
• dieticians
• ethnologists
• 1945 - 1995 Parool, Volkskrant, NRC & Trouw
• Dataset and code available through: https://github.com/DHLab-nl/historical-recipe-web
• Winner National Library - Rijksmuseum -
Network Digital Heritage HackaLOD Hackathon
• You & other researchers are invited to work
with us on case studies around food culture
Image source: https://assets3.thrillist.com/v1/image/1623749/size/tl-horizontal_main_2x.jpg
19. Newspapers as a source for
recipes
• perception of a Dutch food culture formed
in the 1950s
• newspapers are producers and messengers
of public discourse
• newspapers contain views on daily life and
customs
• But:
• keyword search for ‘recepten’ (recipes) is
imprecise
• noise from digitisation process
Image source: delpher.nl
20. Newspaper dataset
• Dutch National Library has digitised 90+
million book, newspaper and magazine pages
• Newspapers published between 1618 - 1995
from the Netherlands, the Dutch Indies
(present day Indonesia), the Antilles, the US
and Surinam (15% of all newspapers
published in the Netherlands)
• Available via website, data dump (until 1876)
and API (with agreement)
Newspaper Pages Articles Tokens
Parool 14,194 2,380,697 612,036,106
Volkskrant 13,628 2,248,652 744,275,792
NRC 7,199 947,198 489,397,816
Trouw 13,891 2,578,731 656,941,631
Total: 48,912 8,155,278 2,502,651,345
article: https://www.delpher.nl/nl/kranten/view?coll=ddd&identifier=ddd:010627319:mpeg21:a0067
21. Source: https://resolver.kb.nl/resolve?urn=ABCDDD:010848341:mpeg21:a0207
dinsdag
6 ossestaartsoep
HUt *orstjes
l 0( * bonen met ananas
t t e bonen met ananas
Va0,1 2 blikken witte bonen In 1 uitje,
1 eetlepel ?lWd- 2 eetlepels keuken-
12 knakwostjes, 1 klein
„ ftaLananasDlokJes- SoJrJi het uitJe
en meng dit ,Qoe h bonen met
tomatensaus. Nir,;e groente in een
ingevette ?fd h ste schaal. Roer de
mos- Je hni?or de stroop en giet hier
23. From newspapers to a recipe web
[Figure: pipeline diagram. Labels: Recipe articles; Seed list; Text classification; Recipe text detection; Information Extraction and Multilabel Classification (ingredient and quantity extraction, recipe tags); Structured newspaper recipes; Enrichment (origin, DBpedia link, scientific name); Structured and enriched newspaper recipes. Outputs listed: Ingredients, Recipe tags, Recipe descriptions.]
24. What & how much?
• articles cannot automatically be segmented
• OCR errors and non-grammatical
sentences are a hurdle for standard NLP
pipelines
• lexicon-based extraction of ingredients and
quantities
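A lexicon-based extractor of this kind might look roughly like the sketch below. It is an assumption-laden illustration, not the code in the historical-recipe-web repository: the mini-lexicons `INGREDIENTS` and `UNITS`, the pattern and the `extract` helper are invented for the example, with entries borrowed from the OCR sample on the earlier slide.

```python
import re

# Hypothetical mini-lexicons; a real lexicon would be far larger.
INGREDIENTS = ["witte bonen", "ananas", "stroop", "tomatensaus"]
UNITS = ["blikken", "blik", "eetlepels", "eetlepel", "gram"]

# Longest entries first, so a longer entry wins over a shorter prefix of it.
_ingredient_alt = "|".join(
    re.escape(item) for item in sorted(INGREDIENTS, key=len, reverse=True)
)
_unit_alt = "|".join(
    re.escape(unit) for unit in sorted(UNITS, key=len, reverse=True)
)

# Optional number, optional unit, then an ingredient from the lexicon.
PATTERN = re.compile(
    rf"\b(?:(?P<amount>\d+)\s+)?(?:(?P<unit>{_unit_alt})\s+)?"
    rf"(?P<ingredient>{_ingredient_alt})\b",
    re.IGNORECASE,
)

def extract(text):
    """Return (amount, unit, ingredient) tuples; missing parts are None."""
    results = []
    for match in PATTERN.finditer(text):
        amount = int(match.group("amount")) if match.group("amount") else None
        unit = match.group("unit")
        results.append(
            (amount, unit.lower() if unit else None, match.group("ingredient").lower())
        )
    return results

clean = extract("2 blikken witte bonen, 1 eetlepel stroop en ananas")
noisy = extract("?lWd- 2 eetlepels keuken-")
print(clean)
print(noisy)
```

The second call shows the limitation discussed on the following slides: an OCR-damaged or out-of-lexicon ingredient simply yields no match.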
Image source: https://cdn.pixabay.com/photo/2014/11/15/20/30/kitchen-scale-532651_960_720.jpg
25. Evaluation
• 100 articles were manually annotated using
Recogito
• OCR errors in ingredients or quantities marked
separately
• Inter-annotator agreement (IAA) .85, but OCR boundaries are difficult:
jºar,anen’ vs ◦ºar,anen’
• Most precise lexicon: F1 = .67
• More research is needed for out-of-lexicon
ingredients
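For readers less familiar with the F1 figure reported above, the sketch below shows how precision, recall and F1 are computed from a set of gold annotations and a set of system predictions. The annotation tuples are invented for illustration and do not come from the evaluation set; exact-match scoring on (article, ingredient) pairs is an assumption.

```python
def precision_recall_f1(gold, predicted):
    """Compute exact-match precision, recall and F1 over two collections."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical (article id, ingredient) annotations.
gold = {("a1", "witte bonen"), ("a1", "ananas"), ("a1", "stroop"), ("a1", "uitje")}
predicted = {("a1", "witte bonen"), ("a1", "stroop"), ("a1", "tomaten")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

With exact matching, an OCR-garbled span that differs from the gold span by a single character counts as both a miss and a false alarm, which is one reason boundary decisions matter for the score.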
27. 27,411 new (old) recipes
• 34,479 Tags
• 365,133 ingredients
• >17,000 Links to external sources
• Data and software available at: https://github.com/DHLab-nl/historical-recipe-web
Source: https://static.ah.nl/static/recepten/img_074629_890x594_JPG.jpg
28. Source: https://resolver.kb.nl/resolve?urn=ABCDDD:010848341:mpeg21:a0207
Take home message
• OCR errors can impact information extraction
• OCR post-correction is an active research
field, but errors will remain
• Focus on most important elements to extract
source: https://resolver.kb.nl/resolve?urn=ABCDDD:010877049:mpeg21:a0158
30. Why Language Technology Can’t
Handle Game of Thrones (yet)
Niels Dekker, Tobias Kuhn & Marieke van Erp
Image source: https://anibundel.files.wordpress.com/2015/04/jonsnow-leaves-ygritte.jpg
31. Background
• Characters and relations are backbone of
stories
• Computational methods allow for scaling
up network extraction and analysis
• Relies on named entity recognition
• Most work thus far focuses on 19th and
early 20th century novels
• Research question: how do these tools
perform on modern science fiction/fantasy
novels?
32. Experimental setup
• Collect 20 ‘old’ and 20 ‘new’ novels
• Annotate first chapters for entities and
relationships between entities (gold
standard)
• Evaluate entity recognition tools on the sets
of ‘old’ and ‘new’ novels
• Compare system outputs to gold standard
annotations
• Bonus: compare network structures
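One simple way to compare network structures is to derive an edge list from the per-sentence character annotations and check which gold edges the system output misses. The sketch below assumes sentence-level co-occurrence as the edge definition, which the slides do not specify; the function name and the toy annotations are illustrative only.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(sentences):
    """Count how often two characters are annotated in the same sentence."""
    edges = Counter()
    for characters in sentences:
        # Sort so that each pair has one canonical orientation.
        for pair in combinations(sorted(set(characters)), 2):
            edges[pair] += 1
    return edges

# Hypothetical per-sentence character annotations.
gold = [["Alice", "Rabbit"], ["Alice"], ["Alice", "Rabbit", "Queen"]]
system = [["Alice", "Rabbit"], ["Alice"], ["Alice", "Queen"]]

gold_edges = cooccurrence_edges(gold)
system_edges = cooccurrence_edges(system)
missing = set(gold_edges) - set(system_edges)
print(missing)
```

In this toy case a single missed mention in the third sentence removes an edge from the network, and it also lowers the weight of the Alice and Rabbit edge.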
Image source: delpher.nl
Image source: https://cdn-images-1.medium.com/max/2400/1*QbCo9uE7jPbt1ttnMsqOog.jpeg
33. 19th and early 20th century novels, based on The Guardian’s Top 100 Classic novels +
availability through Project Gutenberg + used in earlier studies
35. Data preprocessing
• All books converted to plain text format
• Ensure all texts have the same character
encoding
• Pro tip: check that your documents contain
no odd or inconsistent quotation marks
• Appendices, glossaries and reviews were
removed manually
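The quotation mark check can be partly automated. The following is a minimal sketch under the assumption that mapping typographic quotes to plain ASCII is acceptable for the downstream tools; the mapping table and function names are illustrative and are not taken from the study's code.

```python
import unicodedata

# Map typographic quotation marks to plain ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u201e": '"',  # double low-9 quotation mark
    "\u00ab": '"',  # left-pointing double angle quotation mark
    "\u00bb": '"',  # right-pointing double angle quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (also used as apostrophe)
    "\u201a": "'",  # single low-9 quotation mark
})

def normalise(text):
    """Apply Unicode NFC normalisation, then unify quotation marks."""
    return unicodedata.normalize("NFC", text).translate(QUOTE_MAP)

def has_unbalanced_quotes(text):
    """Flag text with an odd number of double quotes after normalisation."""
    return normalise(text).count('"') % 2 == 1

sample = "\u201cCurious,\u201d said Alice. \u2018Very curious.\u2019"
print(normalise(sample))
```

An odd count of double quotes is only a rough warning sign, since multi-paragraph speech often opens each paragraph with a quote and closes only the last one, so flagged files still need a manual look.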
Image source: https://www.dataentryoutsourced.com/blog/wp-content/uploads/2015/03/Post-091-640x200.jpg
36. Gold standard annotations
• Chapter lengths varied from 84 to 1,442
sentences
• On average, 300 sentences per book were
selected, ending at a chapter boundary
• e.g. the third chapter in Alice in
Wonderland ended after sentence
315, so for that book the first three
chapters were annotated
• 2 annotators (not the authors of the study)
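The selection rule can be sketched as picking the chapter boundary whose cumulative sentence count lies closest to 300. That "closest boundary" reading is an assumption; the slide only states that the cut was made close to a chapter boundary. The per-chapter lengths below are hypothetical, chosen so that the third chapter ends after sentence 315 as in the Alice in Wonderland example.

```python
def chapters_to_annotate(chapter_lengths, target=300):
    """Return how many leading chapters to annotate.

    Picks the chapter boundary whose cumulative sentence count is closest
    to `target`; ties go to the earlier boundary.
    """
    best_count, best_distance, total = 0, None, 0
    for count, length in enumerate(chapter_lengths, start=1):
        total += length
        distance = abs(total - target)
        if best_distance is None or distance < best_distance:
            best_count, best_distance = count, distance
    return best_count

# Hypothetical sentence counts per chapter; cumulative totals 110, 205, 315, 455.
alice_like = [110, 95, 110, 140]
selected = chapters_to_annotate(alice_like)
print(selected)
```

A book whose first chapter alone runs to 1,442 sentences would still yield one chapter under this rule, which is consistent with the wide range of chapter lengths noted above.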
Image source: https://panmacmillan.azureedge.net/pmk11/panmacmillan/files/media/panmacmillan/blogs/tws/august%202017/alice-in-wonderland-knowledge-quiz-header.png
37. Annotation Instructions
• For each sentence:
• Identify all characters in it
• Identify anaphoric references (e.g. she
refers to Alice)
• To speed up the process, annotators were
provided with a list of characters derived
automatically
• Missing characters could be added to the
list
• Ignore generic pronouns, exclamations,
generic noun phrases, non-human named
characters (Buckbeak)
Image source: https://vignette.wikia.nocookie.net/p__/images/3/35/Erich_Mueller_and_Shannon_McGrath_are_glued_together_back_to_back_with_Tree_Resin.jpeg/revision/latest?cb=20170331180847&path-prefix=protagonist
38. Named Entity Recognisers:
BookNLP
• NLP pipeline modified to deal with books
• POS tagging, dependency parsing, NER,
character name clustering, quotation
speaker identification, pronominal
coreference resolution, supersense tagging
• NER module based on Stanford NER, with
some modifications
• We focus on NER, character name
clustering and pronominal character
resolution modules in our evaluation
• https://github.com/dbamman/book-nlp
Image source: https://cdn.aarp.net/content/dam/aarp/money/budgeting_savings/2016/04/1140-yeager-sell-your-used-books.imgcache.rev6feda141288df73e8fd100822bb375ea.jpg
39.
40. Intermediate conclusion
• No difference between ‘old’ and ‘new’
books
• Within categories, great variety in entity
distributions and results
• If a central entity is missed, the
performance suffers greatly (e.g.
Brave New World)
• Coreference resolution particularly difficult
in this domain
Image source: https://www.nuffoodsspectrum.in/uploads/articles/quarterly_results_bg-4192.jpg
41. [Figure: character network graph; node labels garbled in extraction and not recoverable]
42. [Figure: character network graph with fewer nodes; node labels garbled in extraction and not recoverable]
44. From NLP output to KGs
• Names aren’t just about labels
• Context has meaning too
• Collapse Dany/Daenerys?
• depends on your research question
• NLP often stops after recognising names
and coreference links
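Whether to collapse name variants can be kept as an explicit, switchable step between NLP output and the knowledge graph. The sketch below assumes a hand-made alias table and a flat list of recognised mentions; the table, the `count_mentions` helper and the mention list are illustrative and not part of any tool discussed here.

```python
from collections import Counter

# Hypothetical alias table mapping name variants to one canonical label.
ALIASES = {"Dany": "Daenerys", "Daenerys Targaryen": "Daenerys"}

def count_mentions(mentions, collapse=True):
    """Count mentions per name, optionally collapsing aliases first."""
    if collapse:
        mentions = [ALIASES.get(mention, mention) for mention in mentions]
    return Counter(mentions)

# Hypothetical recogniser output for a handful of sentences.
mentions = ["Dany", "Daenerys", "Jon", "Dany", "Daenerys Targaryen"]

collapsed = count_mentions(mentions, collapse=True)
separate = count_mentions(mentions, collapse=False)
print(collapsed)
print(separate)
```

Keeping the variants separate preserves who uses which name in which context, while collapsing gives one node per character; the research question decides which view is needed.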
Image source: http://imagens.tiespecialistas.com.br/2011/10/Figura02.png
48. Why is fiction hard for NLP?
• Fiction writers don’t have to abide by
conventions: they can use language more
creatively than newspaper journalists
• mix languages
• make up languages
• use nicknames
• Narratives written from first-person
perspective confuse the software
Image source: https://steamuserimages-a.akamaihd.net/ugc/859477733475369907/F34770D6EFEC30A70A84BEFE93C2C522C0B4A902/
49. Performance fixes
• Replace word names (names that are also ordinary words) with generic names
• Remove apostrophes from names
• But:
• Requires manual intervention
• Doesn’t scale
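Both fixes amount to rewriting the text before it reaches the recogniser. The sketch below is a hedged illustration: the replacement table, the character names "Mercy" and "D'Arvit" and the two helper functions are invented for the example and are not the procedure used in the paper.

```python
import re

# Hypothetical table: names that are ordinary words mapped to generic names.
WORD_NAME_REPLACEMENTS = {"Mercy": "Mary"}

# Hypothetical list of character names containing apostrophes.
APOSTROPHE_NAMES = ["D'Arvit"]

def replace_word_names(text, replacements):
    """Swap capitalised word names for generic names (case-sensitive)."""
    for original, generic in replacements.items():
        text = re.sub(rf"\b{re.escape(original)}\b", generic, text)
    return text

def remove_name_apostrophes(text, names):
    """Strip straight and curly apostrophes from the listed names only."""
    for name in names:
        cleaned = name.replace("'", "").replace("\u2019", "")
        text = text.replace(name, cleaned)
    return text

sentence = "Mercy showed no mercy to D'Arvit."
fixed = remove_name_apostrophes(
    replace_word_names(sentence, WORD_NAME_REPLACEMENTS), APOSTROPHE_NAMES
)
print(fixed)
```

Both tables have to be curated by hand for every book, which is exactly why this kind of fix requires manual intervention and does not scale.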
51. Where to go from here?
• More robust NLP tools are necessary to better
understand novels (and other non-newspaper
texts)
• Background knowledge can help (e.g. GoT
Wiki lists all Daenerys’ nicknames)
• But: not all books are that popular
• Also: different names are used in different
contexts, you may not want to collapse them!
• Always: don’t just assume it works, look into
your data!
• Full paper at: http://peerj.com/articles/cs-189
Image source: https://news.images.itv.com/image/file/1232718/stream_img.jpg
52. Conclusions
• Huge gap between NLP research and use
cases
• Understanding of each other’s tools
and questions
• What NLP tools can handle
• First: What does the research question
really need?
• Then: What is the mismatch between my
data and what the tools can handle?
• Next: Let’s get to work, there’s lots to do!
53. COST Action 18209
• Web-centred Linguistic Data Science
• Various use cases (also digital humanities!)
• Management committee members
representing Finland: Jouni Tuominen &
Eero Hyvönen and Mietta Lennes &
Minna Tamper
• Website still under construction, for now:
https://www.cost.eu/actions/CA18209/
54. Work in progress
• Historical Image Analysis (@MelvinWevers)
• Global Apple Pie (with Ulbe Bosma & Rebeca Ibáñez-Martín)
• 18th century career mobility (DHLab + HI + DI)
• What makes or breaks an idea? (@AdinaNerghes)
• Amsterdam Time Machine (@merpeltje)
55. Teaser: CULTURAIL
“Cultural AI is the study, design and development
of socio-technological AI systems that are implicitly
or explicitly aware of the subtle and subjective
richness of human culture. It is as much about using
AI for analyzing human culture as it is about using
knowledge and expertise from the humanities to
analyze and improve AI technology. It studies how
to deal with cultural bias in data and technology and
how to build AI that is optimized for cultural and
ethical values.”
Van Erp, Van den Bosch & Van Ossenbruggen, 2019
Image source: https://accuform-img2.akamaized.net/files/damObject/Image/huge/FRW304.jpg