Square pegs and round holes: addressing the mismatch between humanities questions and the state-of-the-art in language technology

Marieke.van.Erp@dh.huc.knaw.nl

@merpeltje
Digital Humanities Lab

Three use cases:
• Messy data: EviDENce project

• OCR troubles: Historical Recipe Web

• Genre mismatch: Why Language
Technology Can’t Handle Game of Thrones
(yet)

EviDENce - Ego Documents Events ModelliNg
How individuals recall war and violence
Hucopix - Elodie Burillon

Ego Documents Events modelliNg - how individuals recall war and violence
Sources: - Oral history interview transcripts WW2 (450)
Aims: - Better understand nature of and change in eyewitness reports
- Further develop event detection as means for extracting relevant
information from large and complex textual datasets

[Figure: a manually annotated fragment next to the same fragment annotated by the NLP pipeline]

Manual                                NLP Pipeline
bombardement                     28   zijn    394
brand                             5   hebben   84
Arbeitseinsatz                    4   zeggen   78
onderduiken                       4   gaan     43
razzia                            4   zitten   42
Amerikaans bombardement           2   weten    39
gevochten                         2   doen     33
mobilisatietijd                   2   komen    28
toen ging het allemaal branden    2   horen    27
verraden                          2   wonen    26
Events - most frequent terms

Manual                NLP Pipeline
bombardement    28    SRL: Subject or object: “het Engels bombardement”   1
                      SRL: Subject or object: “het bombardement”          1
                      bombarderen                                         5
brand            5    branden                                             5
                      afbranden                                           3
                      SRL: Subject or object: “die brand”                 1
Arbeitseinsatz   4    Location: “Arbeitseinsatz”                          1
onderduiken      4    onderduiken                                         4
razzia           4    Time: “als er razzia komen”                         1
gevochten        2    vechten                                             3
Events: Matching of manual and automatic annotation

Manual                NLP Pipeline
Ik             215    Ik                         61
we              68    we                         10
vader           30    “mijn vader”                1
moeder          11    “mijn moeder”               2
broer            9    “broer en zus”              1
                      “een broer”                 1
vrienden         8    not found                   1
Amerikanen       7    Location: “Amerikanen”      6
ouders           6    not found                   4
die Duitsers     5    Location: “Duitsers”        9
Actors: Matching of manual and automatic annotation

Taking a step back: what does
the research question really need?
• EviDENce historians are interested in
relevant passages

• NLP pipeline analyses texts down to word
level

• Should we be using an NLP pipeline at all?
Image source: https://cdn.xingosoftware.com/dedikkeblauwe/images/fetch/dpr_2/https%3A%2F%2Fwww.dedikkeblauwe.nl%2Fassets%2Fupload%2Fimages%2F49%2F20190131165659_Kanon-op-mug.png

Back to the drawing board!
• Current pipeline is error prone

• Humanities scholars are not trained to think
in NLP modules and linguistic layers

• Can we gather text passages describing
violence without deep text analysis?

• Three approaches:

• keyword expansion

• doc2vec

• ElasticSearch
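
Of the three approaches, keyword expansion is the easiest to illustrate without extra tooling. The sketch below is a minimal, hypothetical version and not the EviDENce implementation: it starts from a few seed terms, adds the words that most often share a passage with a seed, and ranks passages by how many expanded keywords they contain. The seed terms come from the frequency tables earlier in the talk; the stop word list, function names and the `top_n` cut-off are assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption, far from complete).
STOPWORDS = {"de", "het", "een", "en", "in", "van", "er", "was", "we", "ik", "toen", "op"}

def tokenize(text):
    """Lowercase the text and keep runs of letters that are not stop words."""
    return [t for t in re.findall(r"[^\W\d_]+", text.lower()) if t not in STOPWORDS]

def expand_keywords(passages, seeds, top_n=3):
    """Return the seeds plus the top_n words that most often share a passage with a seed."""
    seeds = set(seeds)
    cooccurrence = Counter()
    for passage in passages:
        tokens = set(tokenize(passage))
        if tokens & seeds:
            cooccurrence.update(tokens - seeds)
    return seeds | {word for word, _ in cooccurrence.most_common(top_n)}

def rank_passages(passages, keywords):
    """Score each passage by the number of distinct keywords it contains; drop zero scores."""
    scored = [(len(set(tokenize(p)) & set(keywords)), p) for p in passages]
    return sorted((sp for sp in scored if sp[0] > 0), key=lambda sp: sp[0], reverse=True)
```

Calling `expand_keywords(passages, {"bombardement", "razzia"})` and passing the result to `rank_passages` gives a ranked list of candidate passages for a historian to read. Stop word filtering matters here: as the frequency table shows, very common verbs such as "zijn" and "hebben" would otherwise dominate the expansion.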

Take home message
• Choose the right tool!

• It takes time to understand each other

• Next week we’ll know what other historians
think of our approach :)

Constructing a Recipe Web from
Historical Newspapers
Marieke van Erp @merpeltje

Melvin Wevers @melvinwevers

Hugo Huurdeman @timelessfuture
Image source: https://static.ah.nl/static/recepten/img_006188_890x594_JPG.jpg

Butter, salt & pepper
• Analysis of food customs:

• historians

• dieticians

• ethnologists

• 1945 - 1995 Parool, Volkskrant, NRC & Trouw

• Dataset and code available through:
https://github.com/DHLab-nl/historical-recipe-web

• Winner National Library - Rijksmuseum -
Network Digital Heritage HackaLOD Hackathon

• You & other researchers are invited to work
with us on case studies around food culture
Digital Humanities Lab
Image source: https://assets3.thrillist.com/v1/image/1623749/size/tl-horizontal_main_2x.jpg

Newspapers as a source for
recipes
• perception of a Dutch food culture formed
in the 1950s

• newspapers are producers and messengers
of public discourse

• newspapers contain views on daily life and
customs

• But:

• keyword search for ‘recepten’ (recipes)
is imprecise

• noise from digitisation process
Image source: delpher.nl

Newspaper dataset
• Dutch National Library has digitised 90+
million book, newspaper and magazine pages

• Newspapers published between 1618 - 1995
from the Netherlands, the Dutch Indies
(present day Indonesia), the Antilles, the US
and Surinam (15% of all newspapers
published in the Netherlands)

• Available via website, data dump (until 1876)
and API (with agreement)
Newspaper     Pages    Articles         Tokens
Parool       14,194   2,380,697    612,036,106
Volkskrant   13,628   2,248,652    744,275,792
NRC           7,199     947,198    489,397,816
Trouw        13,891   2,578,731    656,941,631
Total:       48,912   8,155,278  2,502,651,345
Article: https://www.delpher.nl/nl/kranten/view?coll=ddd&identifier=ddd:010627319:mpeg21:a0067

Source: https://resolver.kb.nl/resolve?urn=ABCDDD:010848341:mpeg21:a0207
dinsdag
6 ossestaartsoep
HUt *orstjes
l 0( * bonen met ananas
t t e bonen met ananas
Va0,1 2 blikken witte bonen In 1 uitje,
1 eetlepel ?lWd- 2 eetlepels keuken-
12 knakwostjes, 1 klein
„ ftaLananasDlokJes- SoJrJi het uitJe
en meng dit ,Qoe h bonen met
tomatensaus. Nir,;e groente in een
ingevette ?fd h ste schaal. Roer de
mos- Je hni?or de stroop en giet hier

OCR Quality

From newspapers to a recipe web
[Pipeline diagram. Recoverable stages: seed list and text classification (recipe text detection) → recipe articles → information extraction and multilabel classification (ingredient and quantity extraction, recipe tags) → structured newspaper recipes → enrichment (origin, DBpedia link, scientific name) → structured and enriched newspaper recipes. Outputs: ingredients, recipe tags, recipe descriptions]

What & how much?
• articles cannot automatically be segmented

• OCR errors and non-grammatical
sentences are a hurdle for standard NLP
pipelines

• lexicon-based extraction of ingredients and
quantities
Image source: https://cdn.pixabay.com/photo/2014/11/15/20/30/kitchen-scale-532651_960_720.jpg
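
A lexicon-based extractor can be sketched in a few lines. The example below is an illustration under stated assumptions, not the project's code (which is in the GitHub repository mentioned in the slides): the two small lexicons, the 30-character look-back window and the function name `extract_ingredients` are all hypothetical. It looks up known ingredient names and then checks the text immediately before each hit for a number and an optional unit.

```python
import re

# Toy lexicons (assumptions for illustration; real lexicons are far larger).
INGREDIENTS = {"witte bonen", "ananas", "stroop", "tomatensaus"}
UNITS = {"blik", "blikken", "eetlepel", "eetlepels", "gram", "liter"}

# "<number> [word] " directly before the ingredient, e.g. "2 blikken ".
QUANTITY_RE = re.compile(r"(\d+(?:[.,]\d+)?)\s+(?:(\w+)\s+)?$")

def extract_ingredients(text, ingredients=INGREDIENTS, units=UNITS):
    """Return (ingredient, amount, unit) tuples in order of appearance."""
    lowered = text.lower()
    hits = []
    for ingredient in ingredients:
        for match in re.finditer(r"\b" + re.escape(ingredient) + r"\b", lowered):
            # Only inspect a short window before the ingredient name.
            window = lowered[max(0, match.start() - 30):match.start()]
            quantity = QUANTITY_RE.search(window)
            amount = quantity.group(1) if quantity else None
            unit = quantity.group(2) if quantity else None
            if unit not in units:
                unit = None  # the word before the ingredient is not a known unit
            hits.append((match.start(), ingredient, amount, unit))
    return [(name, amount, unit) for _, name, amount, unit in sorted(hits)]
```

Exact matching like this is precise on clean text but misses OCR-damaged spellings such as "knakwostjes" in the sample above, which is one reason out-of-lexicon and misrecognised ingredients remain a problem.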

Evaluation
• 100 articles were manually annotated using
Recogito

• OCR errors in ingredients or quantities marked
separately

• Inter-annotator agreement (IAA) of .85, but OCR
boundaries are difficult: ‘jºar,anen’ vs ‘◦ºar,anen’

• Most precise lexicon: F1 = .67

• More research is needed for out-of-lexicon
ingredients
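
For readers less familiar with the F1 score: it is the harmonic mean of precision P (the share of extracted items that are correct) and recall R (the share of annotated items that were found), F1 = 2PR / (P + R). The sketch below shows the calculation on sets of annotations. It assumes exact matching of (article, ingredient) pairs, which may differ from the matching criteria used in the actual evaluation, and the function name is hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Compare two collections of annotations using exact matching."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With exact matching, an ingredient whose boundaries differ by one OCR-garbled character counts as both a miss and a false alarm, so boundary disagreements like the one above lower precision and recall together.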

Results ingredients extraction

27,411 new (old) recipes
• 34,479 Tags

• 365,133 ingredients

• >17,000 Links to external sources

• Data and software available at:
https://github.com/DHLab-nl/historical-recipe-web
Image source: https://static.ah.nl/static/recepten/img_074629_890x594_JPG.jpg
Digital Humanities Lab

Source: https://resolver.kb.nl/resolve?urn=ABCDDD:010848341:mpeg21:a0207
Take home message
• OCR errors can impact information extraction

• OCR post-correction is an active research
field, but errors will remain

• Focus on most important elements to extract
source: https://resolver.kb.nl/resolve?urn=ABCDDD:010877049:mpeg21:a0158

Acknowledgements:
Image source: https://twelvemilesfromalemondotcom.files.wordpress.com/2014/09/img_0326.jpg

Why Language Technology Can’t
Handle Game of Thrones (yet)
Niels Dekker, Tobias Kuhn & Marieke van Erp
Image source: https://anibundel.files.wordpress.com/2015/04/jonsnow-leaves-ygritte.jpg

Background
• Characters and relations are backbone of
stories

• Computational methods allow for scaling
up network extraction and analysis

• Relies on named entity recognition

• Most work thus far focuses on 19th and
early 20th century novels

• Research question: how do these tools
perform on modern science fiction/fantasy
novels?

Experimental setup
• Collect 20 ‘old’ and 20 ‘new’ novels

• Annotate first chapters for entities and
relationships between entities (gold
standard)

• Evaluate entity recognition tools on the sets
of ‘old’ and ‘new’ novels

• Compare system outputs to gold standard
annotations

• Bonus: compare network structures
Image source: delpher.nl
Image source: https://cdn-images-1.medium.com/max/2400/1*QbCo9uE7jPbt1ttnMsqOog.jpeg

19th and early 20th century novels, based on The Guardian’s Top 100 Classic novels +
availability through Project Gutenberg + used in earlier studies

‘New’ Science Fiction and Fantasy novels, based on list from BestFantasyBooks.com

Data preprocessing
• All books converted to plain text format

• Ensure all texts have the same character
encoding

• Pro tip: check that there are no
odd or inconsistent quotation marks in
your documents

• Appendices, glossaries and reviews were
removed manually
Image source: https://www.dataentryoutsourced.com/blog/wp-content/uploads/2015/03/Post-091-640x200.jpg
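
The quotation mark tip can be sketched as a small normalisation step. This is a minimal illustration, not the preprocessing code used in the study: the mapping table covers only a handful of common typographic quote characters, and the function names are hypothetical.

```python
from collections import Counter

# Map common typographic quotes to plain ASCII quotes (illustrative, not exhaustive).
QUOTE_MAP = {
    "\u2018": "'", "\u2019": "'", "\u201a": "'", "\u201b": "'",
    "\u201c": '"', "\u201d": '"', "\u201e": '"', "\u00ab": '"', "\u00bb": '"',
}
QUOTE_TABLE = str.maketrans(QUOTE_MAP)

def count_typographic_quotes(text):
    """Report which non-ASCII quote characters occur, as a quick consistency check."""
    return Counter(ch for ch in text if ch in QUOTE_MAP)

def normalise_quotes(text):
    """Replace typographic quotes with their plain ASCII counterparts."""
    return text.translate(QUOTE_TABLE)
```

Running `count_typographic_quotes` first shows whether a book mixes several quoting styles. Reading every file with an explicit encoding, for example `open(path, encoding="utf-8")`, keeps the character encoding consistent across the collection.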

Gold standard annotations
• Chapter lengths varied from 84 to 1,442
sentences

• On average about 300 sentences per book were
selected, ending at a nearby chapter boundary

• e.g. the third chapter in Alice in
Wonderland ended after sentence
315, so for that book the first three
chapters were annotated

• 2 annotators (not the authors of the study)
Image source: https://panmacmillan.azureedge.net/pmk11/panmacmillan/files/media/panmacmillan/blogs/tws/august%202017/alice-in-wonderland-knowledge-quiz-header.png
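
The selection rule can be written down as a short function. The sketch assumes that "close to a chapter boundary" means choosing the boundary nearest to the 300-sentence target; the slides do not spell out the exact rule, so treat this as one plausible reading. The function name is hypothetical.

```python
def chapters_to_annotate(chapter_end_sentences, target=300):
    """Return how many chapters to annotate.

    chapter_end_sentences holds, for each chapter in order, the number of the
    sentence on which that chapter ends (counted from the start of the book).
    """
    if not chapter_end_sentences:
        raise ValueError("at least one chapter is required")
    # Pick the chapter whose end lies closest to the target sentence count.
    best = min(
        range(len(chapter_end_sentences)),
        key=lambda i: abs(chapter_end_sentences[i] - target),
    )
    return best + 1
```

For a book whose third chapter ends at sentence 315, as in the Alice in Wonderland example, the function returns 3 whenever the neighbouring chapter boundaries lie further from 300 than that.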

Annotation Instructions
• For each sentence:

• Identify all characters in it

• Identify anaphoric references (e.g. she
refers to Alice)

• To speed up the process, annotators were
provided with a list of characters derived
automatically

• Missing characters could be added to the
list

• Ignore generic pronouns, exclamations,
generic noun phrases, non-human named
characters (e.g. Buckbeak)
Image source: https://vignette.wikia.nocookie.net/p__/images/3/35/Erich_Mueller_and_Shannon_McGrath_are_glued_together_back_to_back_with_Tree_Resin.jpeg/revision/latest?cb=20170331180847&path-prefix=protagonist

Named Entity Recognisers:
BookNLP
• NLP pipeline modified to deal with books

• POS tagging, dependency parsing, NER,
character name clustering, quotation
speaker identification, pronominal
coreference resolution, supersense tagging

• NER module based on Stanford NER, with
some modifications

• We focus on NER, character name
clustering and pronominal character
resolution modules in our evaluation

• https://github.com/dbamman/book-nlp
Image source: https://cdn.aarp.net/content/dam/aarp/money/budgeting_savings/2016/04/1140-yeager-sell-your-used-books.imgcache.rev6feda141288df73e8fd100822bb375ea.jpg

Intermediate conclusion
• No difference between ‘old’ and ‘new’
books

• Within categories, great variety in entity
distributions and results

• If a central entity is missed, the
performance suffers greatly (e.g.
Brave New World)

• Coreference resolution particularly difficult
in this domain
Image source: https://www.nuffoodsspectrum.in/uploads/articles/quarterly_results_bg-4192.jpg

[Figure: character network graphs; node labels (character names) garbled in extraction]

Image source: https://i.pinimg.com/originals/30/25/20/302520dbb49bb4a01b5687a7e6c6bf60.jpg

From NLP output to knowledge graphs (KGs)
• Names aren’t just about labels

• Context has meaning too

• Collapse Dany/Daenerys?

• depends on your research question

• NLP often stops after recognising names
and coreference links
Image source: http://imagens.tiespecialistas.com.br/2011/10/Figura02.png
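
To make the Dany/Daenerys question concrete, the sketch below builds a simple character network from per-sentence mention lists, with alias collapsing as an explicit, optional step. It is an illustration under stated assumptions and not the method from the paper: edges are plain sentence-level co-occurrence counts, and the function name and the alias table are hypothetical.

```python
from collections import Counter
from itertools import combinations

def build_network(sentence_mentions, aliases=None):
    """Count how often two characters are mentioned in the same sentence.

    sentence_mentions: one list of recognised names per sentence.
    aliases: optional mapping from a name variant to its canonical form.
    """
    aliases = aliases or {}
    edges = Counter()
    for mentions in sentence_mentions:
        # Apply the alias table (if any), then deduplicate within the sentence.
        characters = sorted({aliases.get(name, name) for name in mentions})
        for pair in combinations(characters, 2):
            edges[pair] += 1
    return edges
```

Calling the function with and without `aliases={"Dany": "Daenerys"}` yields two different graphs from the same NLP output. Keeping the collapse step as a parameter leaves that decision with the researcher, since which graph is right depends on the research question.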

The Three Musketeers: F1 32 - 48

The Three Musketeers after rewriting d’Artagnan to Dartagnan

Why is fiction hard for NLP?
• Fiction writers don’t have to abide by
conventions: they can use language more
creatively than newspaper journalists

• mix languages

• make up languages

• use nicknames

• Narratives written from first-person
perspective confuse the software
Image source: https://steamuserimages-a.akamaihd.net/ugc/859477733475369907/F34770D6EFEC30A70A84BEFE93C2C522C0B4A902/

Performance fixes
• Replace word names with generic names

• Remove apostrophes from names

• But:

• Requires manual intervention

• Doesn’t scale

Image source: https://static.boredpanda.com/blog/wp-content/uploads/2015/10/funny-game-of-thrones-memes-fb__700.jpg
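
The apostrophe fix (d’Artagnan to Dartagnan) can be approximated with a regular expression. The sketch below is a heuristic written for illustration, not the rewriting procedure used in the study: it assumes that a name with an internal apostrophe has a capital letter directly after the apostrophe, which leaves possessives and contractions alone but can still misfire on unusual punctuation. The function name is hypothetical.

```python
import re

# Letters, an apostrophe (straight or curly), then a capitalised remainder.
NAME_WITH_APOSTROPHE = re.compile(r"\b([A-Za-z]+)['\u2019]([A-Z][a-z]+)")

def remove_name_apostrophes(text):
    """Rewrite names such as d'Artagnan to Dartagnan."""
    def join(match):
        prefix, rest = match.group(1), match.group(2)
        return prefix.capitalize() + rest.lower()
    return NAME_WITH_APOSTROPHE.sub(join, text)
```

Even with a rule like this, someone has to check the output for each book, which is why the fix requires manual intervention and does not scale.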

Where to go from here?
• More robust NLP tools are necessary to better
understand novels (and other non-newspaper
texts)

• Background knowledge can help (e.g. GoT
Wiki lists all Daenerys’ nicknames)

• But: not all books are that popular

• Also: different names are used in different
contexts, you may not want to collapse them!

• Always: don’t just assume it works, look into
your data!

• Full paper at: http://peerj.com/articles/cs-189
Digital Humanities Lab
Image source: https://news.images.itv.com/image/file/1232718/stream_img.jpg

Conclusions
• Huge gap between NLP research and use
cases

• Understanding of each other’s tools
and questions

• What NLP tools can handle

• First: What does the research question
really need?

• Then: What is the mismatch between my
data and what the tools can handle?

• Next: Let’s get to work, there’s lots to do!

COST Action 18209
• Web-centred Linguistic Data Science

• Various use cases (also digital humanities!)

• Management committee members
representing Finland: Jouni Tuominen &
Eero Hyvönen and Mietta Lennes &
Minna Tamper

• Website still under construction, for now:
https://www.cost.eu/actions/CA18209/

Work in progress
• Historical Image Analysis (@MelvinWevers)

• Global Apple Pie (with Ulbe Bosma & Rebeca Ibáñez-Martín)

• 18th century career mobility (DHLab + HI + DI)

• What makes or breaks an idea? (@AdinaNerghes)

• Amsterdam Time Machine (@merpeltje)

Teaser: CULTURAIL
“Cultural AI is the study, design and development
of socio-technological AI systems that are implicitly
or explicitly aware of the subtle and subjective
richness of human culture. It is as much about using
AI for analyzing human culture as it is about using
knowledge and expertise from the humanities to
analyze and improve AI technology. It studies how
to deal with cultural bias in data and technology and
how to build AI that is optimized for cultural and
ethical values.”

Van Erp, Van den Bosch & Van Ossenbruggen, 2019
Image source: https://accuform-img2.akamaized.net/files/damObject/Image/huge/FRW304.jpg
