Square pegs and round holes:
addressing the mismatch between humanities questions
and the state-of-the-art in language technology
Marieke.van.Erp@dh.huc.knaw.nl

@merpeltje
D I G I TA L H U M A N I T I E S L A B
Three use cases:
• Messy data: EviDENce project

• OCR troubles: Historical Recipe Web 

• Genre mismatch: Why Language
Technology Can’t Handle Game of Thrones
(yet)
EviDENce - Ego Documents Events ModelliNg
How individuals recall war and violence
Hucopix - Elodie Burillon
Sources:
• Oral history interview transcripts WW2 (450)
Aims:
• Better understand the nature of, and change in, eyewitness reports
• Further develop event detection as a means of extracting relevant
information from large and complex textual datasets
NLP Pipeline
[Figure: a manually annotated fragment shown next to the same fragment annotated by the NLP pipeline]
Events - most frequent terms

Manual annotation                 Count   NLP pipeline   Count
bombardement                         28   zijn             394
brand                                 5   hebben            84
Arbeitseinsatz                        4   zeggen            78
onderduiken                           4   gaan              43
razzia                                4   zitten            42
Amerikaans bombardement               2   weten             39
gevochten                             2   doen              33
mobilisatietijd                       2   komen             28
toen ging het allemaal branden        2   horen             27
verraden                              2   wonen             26
Events: Matching of manual and automatic annotation

Manual (count)         NLP pipeline (count)
bombardement (28)      SRL: Subject or object: “het Engels bombardement” (1)
                       SRL: Subject or object: “het bombardement” (1)
                       bombarderen (5)
brand (5)              branden (5)
                       afbranden (3)
                       SRL: Subject or object: “die brand” (1)
Arbeitseinsatz (4)     Location: “Arbeitseinsatz” (1)
onderduiken (4)        onderduiken (4)
razzia (4)             Time: “als er razzia komen” (1)
gevochten (2)          vechten (3)
Actors: Matching of manual and automatic annotation

Manual (count)         NLP pipeline (count)
Ik (215)               Ik (61)
we (68)                we (10)
vader (30)             “mijn vader” (1)
moeder (11)            “mijn moeder” (2)
broer (9)              “broer en zus” (1)
                       “een broer” (1)
vrienden (8)           not found (1)
Amerikanen (7)         Location: “Amerikanen” (6)
ouders (6)             not found (4)
die Duitsers (5)       Location: “Duitsers” (9)
Taking a step back: what does
the research question really need?
• EviDENce historians are interested in
relevant passages 

• NLP pipeline analyses texts down to word
level 

• Should we be using an NLP pipeline at all?
Image source: https://cdn.xingosoftware.com/dedikkeblauwe/images/fetch/dpr_2/https%3A%2F%2Fwww.dedikkeblauwe.nl%2Fassets%2Fupload%2Fimages%2F49%2F20190131165659_Kanon-op-mug.png
Back to the drawing board!
• Current pipeline is error-prone

• Humanities scholars are not trained to think
in NLP modules and linguistic layers 

• Can we gather text passages describing
violence without deep text analysis?

• Three approaches: 

• keyword expansion

• doc2vec

• ElasticSearch
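To make the first of these approaches concrete, here is a minimal sketch of keyword expansion for passage retrieval. It is an illustration under stated assumptions, not the EviDENce implementation: the function names, the co-occurrence heuristic and the stopword handling are all hypothetical choices made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a passage and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def expand_keywords(passages, seeds, stopwords=frozenset(), top_n=3):
    """Add the top_n words that most often co-occur with a seed term.

    Assumption: co-occurrence is counted once per passage, and
    stopwords are never added to the keyword set.
    """
    seeds = set(seeds)
    cooccurring = Counter()
    for passage in passages:
        tokens = set(tokenize(passage))
        if tokens & seeds:
            cooccurring.update(tokens - seeds - set(stopwords))
    return seeds | {word for word, _ in cooccurring.most_common(top_n)}


def rank_passages(passages, keywords):
    """Return (score, passage) pairs with at least one keyword hit, best first."""
    scored = []
    for passage in passages:
        score = sum(1 for token in tokenize(passage) if token in keywords)
        if score:
            scored.append((score, passage))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Starting from a seed such as "bombardement", the expanded set can pull in related terms such as "brand", so passages that never mention the seed are still retrieved. No parsing or semantic role labelling is involved.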
Take home message
• Choose the right tool! 

• It takes time to understand each other 

• Next week we’ll know what other historians
think of our approach :)
Constructing a Recipe Web from
Historical Newspapers
Marieke van Erp @merpeltje

Melvin Wevers @melvinwevers

Hugo Huurdeman @timelessfuture
Image source: https://static.ah.nl/static/recepten/img_006188_890x594_JPG.jpg
Butter, salt & pepper
• Analysis of food customs: 

• historians 

• dieticians 

• ethnologists 

• 1945 - 1995 Parool, Volkskrant, NRC & Trouw

• Dataset and code available through: https://github.com/DHLab-nl/historical-recipe-web 

• Winner National Library - Rijksmuseum -
Network Digital Heritage HackaLOD Hackathon

• You & other researchers are invited to work
with us on case studies around food culture
Image source: https://assets3.thrillist.com/v1/image/1623749/size/tl-horizontal_main_2x.jpg
Newspapers as a source for
recipes
• perception of a Dutch food culture formed
in the 1950s 

• newspapers are producers and messengers
of public discourse 

• newspapers contain views on daily life and
customs 

• But:

• keyword search for ‘recepten’
imprecise 

• noise from digitisation process
Image source: delpher.nl
Newspaper dataset
• Dutch National Library has digitised 90+
million book, newspaper and magazine pages 

• Newspapers published between 1618 - 1995
from the Netherlands, the Dutch Indies
(present day Indonesia), the Antilles, the US
and Surinam (15% of all newspapers
published in the Netherlands)

• Available via website, data dump (until 1876)
and API (with agreement)
Newspaper      Pages    Articles          Tokens
Parool        14,194   2,380,697     612,036,106
Volkskrant    13,628   2,248,652     744,275,792
NRC            7,199     947,198     489,397,816
Trouw         13,891   2,578,731     656,941,631
Total         48,912   8,155,278   2,502,651,345
article: https://www.delpher.nl/nl/kranten/view?coll=ddd&identifier=ddd:010627319:mpeg21:a0067
Source: https://resolver.kb.nl/resolve?urn=ABCDDD:010848341:mpeg21:a0207
dinsdag
6 ossestaartsoep
HUt *orstjes
l 0( * bonen met ananas
t t e bonen met ananas
Va0,1 2 blikken witte bonen In 1 uitje,
1 eetlepel ?lWd- 2 eetlepels keuken-
12 knakwostjes, 1 klein
„ ftaLananasDlokJes- SoJrJi het uitJe
en meng dit ,Qoe h bonen met
tomatensaus. Nir,;e groente in een
ingevette ?fd h ste schaal. Roer de
mos- Je hni?or de stroop en giet hier
OCR Quality
From newspapers to a recipe web
[Figure: pipeline diagram. Labels: Recipe articles; Recipe text detection (seed list, text classification); Information extraction and multilabel classification (ingredient and quantity extraction, recipe tags); Structured newspaper recipes; Enrichment (origin, DBpedia link, scientific name); Structured and enriched newspaper recipes. Recipe web elements: ingredients, recipe tags, recipe descriptions.]
What & how much?
• articles cannot automatically be segmented 

• OCR errors and non-grammatical
sentences are a hurdle for standard NLP
pipelines 

• lexicon-based extraction of ingredients and
quantities
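As an illustration of what lexicon-based extraction can look like, the sketch below looks up ingredient names in a small lexicon and then checks the preceding tokens for a unit and a quantity. The mini-lexicons, the function name and the look-back heuristic are assumptions made for this example; the project's actual code is in the repository linked in this talk.

```python
import re

# Hypothetical mini-lexicons; a real lexicon would be far larger.
UNITS = {"eetlepel", "eetlepels", "blik", "blikken", "gram", "liter"}
INGREDIENTS = {"witte bonen", "ananas", "stroop", "knakworstjes"}

QUANTITY = re.compile(r"^\d+(?:[.,]\d+)?$")
TOKEN = re.compile(r"\d+(?:[.,]\d+)?|[^\W\d_]+")


def extract_ingredients(text, ingredients=INGREDIENTS, units=UNITS, max_len=3):
    """Return (quantity, unit, ingredient) triples found by lexicon lookup."""
    tokens = TOKEN.findall(text.lower())
    results = []
    i = 0
    while i < len(tokens):
        match_len = 0
        # Prefer the longest ingredient name starting at position i.
        for n in range(max_len, 0, -1):
            if i + n > len(tokens):
                continue
            if " ".join(tokens[i:i + n]) in ingredients:
                match_len = n
                break
        if not match_len:
            i += 1
            continue
        name = " ".join(tokens[i:i + match_len])
        unit = quantity = None
        j = i - 1
        # Look back for an optional unit, then an optional quantity.
        if j >= 0 and tokens[j] in units:
            unit = tokens[j]
            j -= 1
        if j >= 0 and QUANTITY.match(tokens[j]):
            quantity = tokens[j]
        results.append((quantity, unit, name))
        i += match_len
    return results
```

Because the lookup needs neither sentence boundaries nor a parse, it still works on ungrammatical OCR output, as long as the ingredient itself is spelled as in the lexicon.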
Image source: https://cdn.pixabay.com/photo/2014/11/15/20/30/kitchen-scale-532651_960_720.jpg
Evaluation
• 100 articles were manually annotated using
Recogito

• OCR errors in ingredients or quantities marked
separately 

• Inter-annotator agreement (IAA) 0.85, but OCR boundaries are difficult:
jºar,anen’ vs ◦ºar,anen’

• Most precise lexicon: F1 = 0.67 

• More research is needed for out-of-lexicon
ingredients
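For readers less familiar with the F1 measure reported here, this short sketch shows how precision, recall and F1 follow from a set of gold annotations and a set of system outputs. Representing an annotation as an (article id, ingredient) pair and scoring by exact match are assumptions made for the example; the evaluation in the talk may have used a different matching criterion.

```python
def precision_recall_f1(gold, predicted):
    """Compute exact-match precision, recall and F1 over two collections."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```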
Results ingredients extraction
27,411 new (old) recipes
• 34,479 Tags

• 365,133 ingredients

• >17,000 Links to external sources

• Data and software available at: https://github.com/DHLab-nl/historical-recipe-web
Source: https://static.ah.nl/static/recepten/img_074629_890x594_JPG.jpg
Source: https://resolver.kb.nl/resolve?urn=ABCDDD:010848341:mpeg21:a0207
Take home message
• OCR errors can impact information extraction 

• OCR post-correction is an active research
field, but errors will remain 

• Focus on most important elements to extract
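One common way to soften the impact of OCR errors on lexicon lookup is approximate string matching. The sketch below uses Python's standard `difflib` module for this. It is offered as an illustration only: the talk does not say this technique was used in the project, and the function name and the 0.8 cutoff are assumptions.

```python
import difflib


def match_ocr_token(token, lexicon, cutoff=0.8):
    """Return the closest lexicon entry for a possibly misrecognised token.

    Returns None when no entry is at least `cutoff` similar.
    """
    matches = difflib.get_close_matches(token.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With a lexicon containing "knakworstjes", the OCR variant "knakwostjes" from the sample above is mapped back to the intended word, while unrelated tokens are rejected. A lower cutoff recovers more damaged tokens at the cost of more false matches.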
source: https://resolver.kb.nl/resolve?urn=ABCDDD:010877049:mpeg21:a0158
Acknowledgements:
Image source: https://twelvemilesfromalemondotcom.files.wordpress.com/2014/09/img_0326.jpg
Why Language Technology Can’t
Handle Game of Thrones (yet)
Niels Dekker, Tobias Kuhn & Marieke van Erp
Image source: https://anibundel.files.wordpress.com/2015/04/jonsnow-leaves-ygritte.jpg
Background
• Characters and relations are backbone of
stories 

• Computational methods allow for scaling
up network extraction and analysis 

• Relies on named entity recognition 

• Most work thus far focuses on 19th and
early 20th century novels 

• Research question: how do these tools
perform on modern science fiction/fantasy
novels?
Experimental setup
• Collect 20 ‘old’ and 20 ‘new’ novels 

• Annotate first chapters for entities and
relationships between entities (gold
standard)

• Evaluate entity recognition tools on the sets
of ‘old’ and ‘new’ novels 

• Compare system outputs to gold standard
annotations 

• Bonus: compare network structures
Image source: delpher.nl
Image source: https://cdn-images-1.medium.com/max/2400/1*QbCo9uE7jPbt1ttnMsqOog.jpeg
19th and early 20th century novels, based on The Guardian’s Top 100 Classic novels +
availability through Project Gutenberg + used in earlier studies
‘New’ Science Fiction and Fantasy novels, based on list from BestFantasyBooks.com
Data preprocessing
• All books converted to plain text format 

• Ensure all texts have the same character
encoding 

• Pro tip: check that there are no
odd or inconsistent quotation marks in
your documents

• Appendices, glossaries and reviews were
removed manually
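The quotation mark check can be partly automated. The sketch below maps typographic quotation marks to their plain ASCII equivalents and counts which variants occur in a text. The choice of characters in the mapping and both function names are assumptions for this example; the talk does not specify how the normalisation was done.

```python
from collections import Counter

# Typographic single and double quotation marks mapped to ASCII.
QUOTE_MAP = {
    "\u2018": "'", "\u2019": "'", "\u201a": "'", "\u201b": "'",
    "\u201c": '"', "\u201d": '"', "\u201e": '"', "\u201f": '"',
    "\u00ab": '"', "\u00bb": '"',
}
QUOTE_TABLE = str.maketrans(QUOTE_MAP)


def count_odd_quotes(text):
    """Count each non-ASCII quotation mark that occurs in the text."""
    return Counter(ch for ch in text if ch in QUOTE_MAP)


def normalise_quotes(text):
    """Replace typographic quotation marks with plain ASCII ones."""
    return text.translate(QUOTE_TABLE)
```

Running `count_odd_quotes` before normalising gives a quick overview of how inconsistent a book is, which helps decide whether a manual look is needed.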
Image source: https://www.dataentryoutsourced.com/blog/wp-content/uploads/2015/03/Post-091-640x200.jpg
Gold standard annotations
• Chapter lengths varied from 84 to 1,442
sentences 

• An average of 300 sentences close to a
chapter boundary was selected 

• e.g. the third chapter in Alice in
Wonderland ended after sentence
315, so for that book the first three
chapters were annotated

• 2 annotators (not the authors of the study)
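The selection rule can be sketched as follows, assuming it means "pick the chapter boundary whose cumulative sentence count is closest to 300". That reading, the function name and the tie-breaking behaviour are assumptions. In the usage example only the value 315 for Alice in Wonderland comes from the slide; the other chapter lengths are made up.

```python
def pick_chapter_cutoff(chapter_end_sentences, target=300):
    """Pick the chapter boundary closest to the target sentence count.

    chapter_end_sentences holds, per chapter, the cumulative number of
    sentences at the end of that chapter. Returns a tuple of
    (number of chapters to annotate, sentences covered).
    On a tie, the earlier boundary wins.
    """
    best = min(
        range(len(chapter_end_sentences)),
        key=lambda i: abs(chapter_end_sentences[i] - target),
    )
    return best + 1, chapter_end_sentences[best]


# Hypothetical chapter ends; only 315 is taken from the slide.
print(pick_chapter_cutoff([100, 210, 315, 450]))
```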
Image source: https://panmacmillan.azureedge.net/pmk11/panmacmillan/files/media/panmacmillan/blogs/tws/august%202017/alice-in-wonderland-knowledge-quiz-header.png
Annotation Instructions
• For each sentence:

• Identify all characters in it 

• Identify anaphoric references (e.g. she
refers to Alice) 

• To speed up the process, annotators were
provided with a list of characters derived
automatically

• Missing characters could be added to the
list 

• Ignore generic pronouns, exclamations,
generic noun phrases, non-human named
characters (Buckbeak)
Image source: https://vignette.wikia.nocookie.net/p__/images/3/35/Erich_Mueller_and_Shannon_McGrath_are_glued_together_back_to_back_with_Tree_Resin.jpeg/revision/latest?cb=20170331180847&path-prefix=protagonist
Named Entity Recognisers:
BookNLP
• NLP pipeline modified to deal with books 

• POS tagging, dependency parsing, NER,
character name clustering, quotation
speaker identification, pronominal
coreference resolution, supersense tagging

• NER module based on Stanford NER, with
some modifications 

• We focus on NER, character name
clustering and pronominal character
resolution modules in our evaluation

• https://github.com/dbamman/book-nlp
Image source: https://cdn.aarp.net/content/dam/aarp/money/budgeting_savings/2016/04/1140-yeager-sell-your-used-books.imgcache.rev6feda141288df73e8fd100822bb375ea.jpg
Intermediate conclusion
• No difference between ‘old’ and ‘new’
books 

• Within categories, great variety in entity
distributions and results 

• If a central entity is missed, the
performance suffers greatly (e.g.
Brave New World)

• Coreference resolution particularly difficult
in this domain
Image source: https://www.nuffoodsspectrum.in/uploads/articles/quarterly_results_bg-4192.jpg
[Figure: character network graphs; the node labels (character names) did not survive text extraction]
Image source: https://i.pinimg.com/originals/30/25/20/302520dbb49bb4a01b5687a7e6c6bf60.jpg
From NLP output to KGs
• Names aren’t just about labels 

• Context has meaning too 

• Collapse Dany/Daenerys? 

• depends on your research question

• NLP often stops after recognising names
and coreference links
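The effect of collapsing names can be made visible with a small sketch that builds co-occurrence edges from per-sentence character mentions, with and without an alias table. Sentence-level co-occurrence, the function name and the example mentions are assumptions for illustration, not necessarily the construction used in the study.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(sentence_mentions, aliases=None):
    """Count how often two characters are mentioned in the same sentence.

    aliases optionally maps a name variant to a canonical name; leave it
    out to keep variants such as Dany and Daenerys as separate nodes.
    """
    aliases = aliases or {}
    edges = Counter()
    for mentions in sentence_mentions:
        names = sorted({aliases.get(name, name) for name in mentions})
        for pair in combinations(names, 2):
            edges[pair] += 1
    return edges


# Hypothetical mentions, one list per sentence.
mentions = [["Dany", "Jorah"], ["Daenerys", "Jorah"], ["Dany", "Daenerys"]]
separate = cooccurrence_edges(mentions)
collapsed = cooccurrence_edges(mentions, aliases={"Dany": "Daenerys"})
```

Without the alias table the graph has three nodes and three edges; with it, two nodes and one edge of weight 2 remain. Which of the two graphs is the right one depends on whether the research question cares about how a character is addressed.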
Image source: http://imagens.tiespecialistas.com.br/2011/10/Figura02.png
The Three Musketeers: F1 32 - 48
The Three Musketeers after rewriting d’Artagnan to Dartagnan
Why is fiction hard for NLP?
• Fiction writers don’t have to abide by
conventions: they can use language more
creatively than newspaper journalists

• mix languages

• make up languages 

• use nicknames 

• Narratives written from first-person
perspective confuse the software
Image source: https://steamuserimages-a.akamaihd.net/ugc/859477733475369907/F34770D6EFEC30A70A84BEFE93C2C522C0B4A902/
Performance fixes
• Replace word names with generic names

• Remove apostrophes from names 

• But:

• Requires manual intervention

• Doesn’t scale
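The apostrophe fix can be sketched as a simple text rewrite applied before the NER tool is run. The function name and the set of apostrophe characters are assumptions; the d’Artagnan to Dartagnan rewrite is the example from the talk. The list of names has to be supplied by hand, which is exactly the manual step that does not scale.

```python
# ASCII apostrophe, right single quotation mark, modifier letter apostrophe.
APOSTROPHES = "'\u2019\u02bc"
STRIP_TABLE = {ord(ch): None for ch in APOSTROPHES}


def strip_name_apostrophes(text, names):
    """Rewrite each listed name without its apostrophe, e.g. d’Artagnan -> Dartagnan."""
    for name in names:
        if not any(ch in name for ch in APOSTROPHES):
            continue  # leave names without apostrophes untouched
        plain = name.translate(STRIP_TABLE)
        # Capitalise each word so the result still looks like a name.
        plain = " ".join(word.capitalize() for word in plain.split(" "))
        text = text.replace(name, plain)
    return text
```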
Image source: https://static.boredpanda.com/blog/wp-content/uploads/2015/10/funny-game-of-thrones-memes-fb__700.jpg
Where to go from here?
• More robust NLP tools are necessary to better
understand novels (and other non-newspaper
texts)

• Background knowledge can help (e.g. GoT
Wiki lists all Daenerys’ nicknames)

• But: not all books are that popular 

• Also: different names are used in different
contexts, you may not want to collapse them! 

• Always: don’t just assume it works, look into
your data! 

• Full paper at: http://peerj.com/articles/cs-189
Image source: https://news.images.itv.com/image/file/1232718/stream_img.jpg
Conclusions
• Huge gap between NLP research and use
cases 

• Understanding of each other’s tools
and questions 

• What NLP tools can handle 

• First: What does the research question
really need? 

• Then: What is the mismatch between my
data and what the tools can handle? 

• Next: Let’s get to work, there’s lots to do!
COST Action 18209
• Web-centred Linguistic Data Science

• Various use cases (also digital humanities!) 

• Management committee members
representing Finland: Jouni Tuominen &
Eero Hyvönen and Mietta Lennes &
Minna Tamper 

• Website still under construction, for now:
https://www.cost.eu/actions/CA18209/
Work in progress
• Historical Image Analysis (@MelvinWevers)
• Global Apple Pie (with Ulbe Bosma & Rebeca Ibáñez-Martín)
• 18th century career mobility (DHLab + HI + DI)
• What makes or breaks an idea? (@AdinaNerghes)
• Amsterdam Time Machine (@merpeltje)
Teaser: CULTURAIL
“Cultural AI is the study, design and development
of socio-technological AI systems that are implicitly
or explicitly aware of the subtle and subjective
richness of human culture. It is as much about using
AI for analyzing human culture as it is about using
knowledge and expertise from the humanities to
analyze and improve AI technology. It studies how
to deal with cultural bias in data and technology and
how to build AI that is optimized for cultural and
ethical values.”

Van Erp, Van den Bosch & Van Ossenbruggen, 2019
Image source: https://accuform-img2.akamaized.net/files/damObject/Image/huge/FRW304.jpg
dhlab.nl

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

Square pegs and round holes: addressing the mismatch between humanities questions and the state-of-the-art in language technology

  • 1. Square pegs and round holes: addressing the mismatch between humanities questions and the state-of-the-art in language technology Marieke.van.Erp@dh.huc.knaw.nl merpeltje D I G I TA L H U M A N I T I E S L A B
  • 2. Three use cases: • Messy data: EviDENce project • OCR troubles: Historical Recipe Web • Genre mismatch: Why Language Technology Can’t Handle Game of Thrones (yet)
  • 3. EviDENce - Ego Documents Events ModelliNg How individuals recall war and violence Hucopix - Elodie Burillon
  • 4. Ego Documents Events modelliNg - how individuals recall war and violence Sources: - Oral history interview transcripts WW2 (450) Aims: - Better understand nature of and change in eyewitness reports - Further develop event detection as means for extracting relevant information from large and complex textual datasets
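  • 8. Events - most frequent terms (term, count; English glosses in parentheses)
      Manual | NLP Pipeline
      bombardement (bombardment) 28 | zijn (to be) 394
      brand (fire) 5 | hebben (to have) 84
      Arbeitseinsatz (forced labour deployment) 4 | zeggen (to say) 78
      onderduiken (to go into hiding) 4 | gaan (to go) 43
      razzia (raid) 4 | zitten (to sit) 42
      Amerikaans bombardement (American bombardment) 2 | weten (to know) 39
      gevochten (fought) 2 | doen (to do) 33
      mobilisatietijd (mobilisation period) 2 | komen (to come) 28
      toen ging het allemaal branden (then everything started to burn) 2 | horen (to hear) 27
      verraden (betrayed) 2 | wonen (to live) 26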
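  • 9. Events: Matching of manual and automatic annotation (term, count; English glosses in parentheses)
      Manual | NLP Pipeline
      bombardement (bombardment) 28 | SRL: Subject or object: “het Engels bombardement” (the English bombardment) 1; SRL: Subject or object: “het bombardement” (the bombardment) 1; bombarderen (to bomb) 5
      brand (fire) 5 | branden (to burn) 5; afbranden (to burn down) 3; SRL: Subject or object: “die brand” (that fire) 1
      Arbeitseinsatz (forced labour deployment) 4 | Location: “Arbeitseinsatz” 1
      onderduiken (to go into hiding) 4 | onderduiken 4
      razzia (raid) 4 | Time: “als er razzia komen” (when raids come) 1
      gevochten (fought) 2 | vechten (to fight) 3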
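  • 10. Actors: Matching of manual and automatic annotation (term, count; English glosses in parentheses)
      Manual | NLP Pipeline
      Ik (I) 215 | Ik 61
      we (we) 68 | we 10
      vader (father) 30 | “mijn vader” (my father) 1
      moeder (mother) 11 | “mijn moeder” (my mother) 2
      broer (brother) 9 | “broer en zus” (brother and sister) 1; “een broer” (a brother) 1
      vrienden (friends) 8 | not found 1
      Amerikanen (Americans) 7 | Location: “Amerikanen” 6
      ouders (parents) 6 | not found 4
      die Duitsers (those Germans) 5 | Location: “Duitsers” 9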
  • 11. Taking a step back: what does the research question really need? • EviDENce historians are interested in relevant passages • NLP pipeline analyses texts down to word level • Should we be using an NLP pipeline at all? Image source: https://cdn.xingosoftware.com/dedikkeblauwe/images/fetch/dpr_2/https%3A%2F%2Fwww.dedikkeblauwe.nl%2Fassets%2Fupload%2Fimages%2F49%2F20190131165659_Kanon-op-mug.png
  • 12. Back to the drawing board! • Current pipeline is error-prone • Humanities scholars are not trained to think in NLP modules and linguistic layers • Can we gather text passages describing violence without deep text analysis? • Three approaches: • keyword expansion • doc2vec • ElasticSearch
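The keyword-expansion idea on slide 12 can be illustrated with a minimal sketch. It is not the EviDENce project's implementation: the prefix-matching expansion rule, the function names and the three example passages are all assumptions made for illustration, using only the Python standard library.

```python
import re


def tokenize(text):
    """Lowercase a passage and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def expand_keywords(seeds, passages, min_prefix=5):
    """Expand seed keywords with corpus words sharing their first min_prefix letters.

    Hypothetical expansion rule: a cheap stand-in for stemming or embeddings.
    """
    vocab = {tok for passage in passages for tok in tokenize(passage)}
    expanded = set(seeds)
    for seed in seeds:
        if len(seed) < min_prefix:
            continue  # too short to expand safely
        prefix = seed[:min_prefix]
        expanded.update(word for word in vocab if word.startswith(prefix))
    return expanded


def rank_passages(passages, keywords):
    """Return indices of passages with at least one keyword hit, best first."""
    scored = []
    for idx, passage in enumerate(passages):
        hits = sum(1 for tok in tokenize(passage) if tok in keywords)
        if hits:
            scored.append((hits, idx))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored]


# Made-up example passages (not taken from the interview transcripts).
passages = [
    "Het bombardement begon in de nacht.",
    "We gingen naar school.",
    "Ze bombarderen de stad en alles ging branden.",
]
keywords = expand_keywords({"bombardement", "brand"}, passages)
ranking = rank_passages(passages, keywords)
```

With these toy passages the seed "bombardement" also picks up "bombarderen" and "brand" picks up "branden", so the third passage ranks above the first and the second is dropped. Whether such a shallow approach retrieves the passages historians care about still has to be checked against the data.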
  • 16. Take home message • Choose the right tool! • It takes time to understand each other • Next week we’ll know what other historians think of our approach :)
  • 17. Constructing a Recipe Web from Historical Newspapers Marieke van Erp @merpeltje Melvin Wevers @melvinwevers Hugo Huurdeman @timelessfuture Image source: https://static.ah.nl/static/recepten/img_006188_890x594_JPG.jpg
  • 18. Butter, salt & pepper • Analysis of food customs: • historians • dieticians • ethnologists • 1945 - 1995 Parool, Volkskrant, NRC & Trouw • Dataset and code available through: https://github.com/DHLab-nl/historical-recipe-web • Winner National Library - Rijksmuseum - Network Digital Heritage HackaLOD Hackathon • You & other researchers are invited to work with us on case studies around food culture Image source: https://assets3.thrillist.com/v1/image/1623749/size/tl-horizontal_main_2x.jpg
  • 19. Newspapers as a source for recipes • perception of a Dutch food culture formed in the 1950s • newspapers are producers and messengers of public discourse • newspapers contain views on daily life and customs • But: • keyword search for ‘recepten’ (recipes) is imprecise • noise from digitisation process Image source: delpher.nl
  • 20. Newspaper dataset • Dutch National Library has digitised 90+ million book, newspaper and magazine pages • Newspapers published between 1618 - 1995 from the Netherlands, the Dutch Indies (present day Indonesia), the Antilles, the US and Surinam (15% of all newspapers published in the Netherlands) • Available via website, data dump (until 1876) and API (with agreement)
      Newspaper | Pages | Articles | Tokens
      Parool | 14,194 | 2,380,697 | 612,036,106
      Volkskrant | 13,628 | 2,248,652 | 744,275,792
      NRC | 7,199 | 947,198 | 489,397,816
      Trouw | 13,891 | 2,578,731 | 656,941,631
      Total | 48,912 | 8,155,278 | 2,502,651,345
      Article: https://www.delpher.nl/nl/kranten/view?coll=ddd&identifier=ddd:010627319:mpeg21:a0067
  • 21. Source: https://resolver.kb.nl/resolve?urn=ABCDDD:010848341:mpeg21:a0207 OCR text as digitised: dinsdag 6 ossestaartsoep HUt *orstjes l 0( * bonen met ananas t t e bonen met ananas Va0,1 2 blikken witte bonen In 1 uitje, 1 eetlepel ?lWd- 2 eetlepels keuken- 12 knakwostjes, 1 klein „ ftaLananasDlokJes- SoJrJi het uitJe en meng dit ,Qoe h bonen met tomatensaus. Nir,;e groente in een ingevette ?fd h ste schaal. Roer de mos- Je hni?or de stroop en giet hier
  • 22. OCR Quality
  • 23. From newspapers to a recipe web (diagram labels): Ingredients; Recipe tags; Recipe descriptions; Recipe articles; Information Extraction and Multilabel Classification; Enrichment; Ingredient and quantity extraction; Recipe tags; Structured newspaper recipes; Origin; DBpedia link; Scientific name; Recipe text detection; Structured and enriched newspaper recipes; Seed list; Text classification
  • 24. What & how much? • articles cannot automatically be segmented • OCR errors and non-grammatical sentences are a hurdle for standard NLP pipelines • lexicon-based extraction of ingredients and quantities Image source: https://cdn.pixabay.com/photo/2014/11/15/20/30/kitchen-scale-532651_960_720.jpg
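A minimal sketch of what lexicon-based extraction of ingredients and quantities (slide 24) could look like. The two small lexicons, the regular expression and the example sentence are hypothetical and much smaller than anything a real project would use; the code in the historical-recipe-web repository may work quite differently.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
INGREDIENTS = {"witte bonen", "ananas", "stroop", "mosterd"}
UNITS = {"blik", "blikken", "eetlepel", "eetlepels", "gram", "liter"}

# A number (optionally with a decimal comma or point) followed by one word.
QUANTITY_RE = re.compile(r"(\d+(?:[.,]\d+)?)\s+(\w+)")


def extract_ingredients(text, lexicon=INGREDIENTS):
    """Return lexicon entries found in the text, in order of appearance."""
    lowered = text.lower()
    found = []
    for entry in lexicon:
        pattern = r"\b" + re.escape(entry) + r"\b"
        for match in re.finditer(pattern, lowered):
            found.append((match.start(), entry))
    return [entry for _, entry in sorted(found)]


def extract_quantities(text, units=UNITS):
    """Return (number, unit) pairs whose unit word is in the unit lexicon."""
    pairs = QUANTITY_RE.findall(text.lower())
    return [(number, unit) for number, unit in pairs if unit in units]


# Made-up, clean example sentence; real OCR output is far noisier.
sentence = "2 blikken witte bonen, 1 eetlepel stroop, 12 knakworstjes"
ingredients = extract_ingredients(sentence)
quantities = extract_quantities(sentence)
```

Because matching is exact, an OCR variant of a lexicon word is silently missed, and "12 knakworstjes" yields no quantity because the word after the number is not a known unit. This is the out-of-lexicon problem in miniature.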
  • 25. Evaluation • 100 articles were manually annotated using Recogito • OCR errors in ingredients or quantities marked separately • Inter-annotator agreement (IAA) .85 but OCR boundaries difficult: jºar,anen’ vs ◦ºar,anen’ • Most precise lexicon: F1 = .67 • More research is needed for out-of-lexicon ingredients
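For readers unfamiliar with the F1 score reported on slide 25, the following sketch shows how precision, recall and F1 are commonly computed from a set of gold annotations and a set of extracted items. The example sets are made up, and this does not reproduce the evaluation script or the .67 figure from the talk.

```python
def precision_recall_f1(gold, predicted):
    """Compute set-based precision, recall and F1 for extracted items."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: 4 annotated ingredients, 3 extracted, 2 of them correct.
gold_items = {"boter", "zout", "peper", "ui"}
extracted_items = {"boter", "zout", "suiker"}
precision, recall, f1 = precision_recall_f1(gold_items, extracted_items)
# precision = 2/3, recall = 1/2, f1 = 4/7
```

F1 is the harmonic mean of precision and recall, so a lexicon that is precise but misses many ingredients still scores low.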
  • 27. 27,411 new (old) recipes • 34,479 Tags • 365,133 ingredients • >17,000 Links to external sources • Data and software available at: https://github.com/DHLab-nl/historical-recipe-web Source: https://static.ah.nl/static/recepten/img_074629_890x594_JPG.jpg
  • 28. Take home message • OCR errors can impact information extraction • OCR post-correction is an active research field, but errors will remain • Focus on most important elements to extract Sources: https://resolver.kb.nl/resolve?urn=ABCDDD:010848341:mpeg21:a0207 and https://resolver.kb.nl/resolve?urn=ABCDDD:010877049:mpeg21:a0158
  • 30. Why Language Technology Can’t Handle Game of Thrones (yet) Niels Dekker, Tobias Kuhn & Marieke van Erp Image source: https://anibundel.files.wordpress.com/2015/04/jonsnow-leaves-ygritte.jpg
  • 31. Background • Characters and relations are backbone of stories • Computational methods allow for scaling up network extraction and analysis • Relies on named entity recognition • Most work thus far focuses on 19th and early 20th century novels • Research question: how do these tools perform on modern science fiction/fantasy novels?
  • 32. Experimental setup • Collect 20 ‘old’ and 20 ‘new’ novels • Annotate first chapters for entities and relationships between entities (gold standard) • Evaluate entity recognition tools on the sets of ‘old’ and ‘new’ novels • Compare system outputs to gold standard annotations • Bonus: compare network structures Image source: delpher.nl Image source: https://cdn-images-1.medium.com/max/2400/1*QbCo9uE7jPbt1ttnMsqOog.jpeg
  • 33. 19th and early 20th century novels, based on The Guardian’s Top 100 Classic novels + availability through Project Gutenberg + used in earlier studies
  • 34. ‘New’ Science Fiction and Fantasy novels, based on list from BestFantasyBooks.com
  • 35. Data preprocessing • All books converted to plain text format • Ensure all texts have the same character encoding • Pro tip: check that there are no odd or inconsistent quotation marks in your documents • Appendices, glossaries and reviews were removed manually Image source: https://www.dataentryoutsourced.com/blog/wp-content/uploads/2015/03/Post-091-640x200.jpg
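The preprocessing steps on slide 35 (one character encoding, consistent quotation marks) might be sketched as follows. The quote mapping, the function names and the choice of UTF-8 with NFC normalisation are assumptions for illustration; the study itself does not specify how this was done.

```python
import unicodedata

# Map typographic quote variants to plain ASCII quotes (hypothetical choice).
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'", "\u201a": "'", "\u201b": "'",
    "\u201c": '"', "\u201d": '"', "\u201e": '"',
    "\u00ab": '"', "\u00bb": '"',
})


def normalise_text(raw, encoding="utf-8"):
    """Decode bytes, apply Unicode NFC normalisation and unify quotation marks."""
    text = raw.decode(encoding, errors="replace")
    text = unicodedata.normalize("NFC", text)
    return text.translate(QUOTE_MAP)


def has_unbalanced_double_quotes(text):
    """Flag texts with an odd number of double quotes for manual inspection."""
    return text.count('"') % 2 == 1


raw_bytes = "\u201cHello,\u201d said d\u2019Artagnan.".encode("utf-8")
clean = normalise_text(raw_bytes)
```

Quotation marks matter here because quote attribution and speaker identification depend on finding where direct speech starts and ends, so a quick check such as `has_unbalanced_double_quotes` can point to files that need a manual look.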
  • 36. Gold standard annotations • Chapter lengths varied from 84 to 1,442 sentences • An average of 300 sentences close to a chapter boundary was selected • e.g. the third chapter in Alice in Wonderland ended after sentence 315, so for that book the first three chapters were annotated • 2 annotators (not the authors of the study) Image source: https://panmacmillan.azureedge.net/pmk11/panmacmillan/files/media/panmacmillan/blogs/tws/august%202017/alice-in-wonderland-knowledge-quiz-header.png
  • 37. Annotation Instructions • For each sentence: • Identify all characters in it • Identify anaphoric references (e.g. she refers to Alice) • To speed up the process, annotators were provided with a list of characters derived automatically • Missing characters could be added to the list • Ignore generic pronouns, exclamations, generic noun phrases, non-human named characters (Buckbeak) Image source: https://vignette.wikia.nocookie.net/p__/images/3/35/Erich_Mueller_and_Shannon_McGrath_are_glued_together_back_to_back_with_Tree_Resin.jpeg/revision/latest?cb=20170331180847&path-prefix=protagonist
  • 38. Named Entity Recognisers: BookNLP • NLP pipeline modified to deal with books • POS tagging, dependency parsing, NER, character name clustering, quotation speaker identification, pronominal coreference resolution, supersense tagging • NER module based on Stanford NER, with some modifications • We focus on NER, character name clustering and pronominal character resolution modules in our evaluation • https://github.com/dbamman/book-nlp Image source: https://cdn.aarp.net/content/dam/aarp/money/budgeting_savings/2016/04/1140-yeager-sell-your-used-books.imgcache.rev6feda141288df73e8fd100822bb375ea.jpg
  • 40. Intermediate conclusion • No difference between ‘old’ and ‘new’ books • Within categories, great variety in entity distributions and results • If a central entity is missed, the performance suffers greatly (e.g. Brave New World) • Coreference resolution particularly difficult in this domain Image source: https://www.nuffoodsspectrum.in/uploads/articles/quarterly_results_bg-4192.jpg
  • 41. [Figure: character network graph; node labels not recoverable from the text extraction]
  • 42. [Figure: character network graph; node labels not recoverable from the text extraction]
  • 44. From NLP output to KGs • Names aren’t just about labels • Context has meaning too • Collapse Dany/Daenerys? • depends on your research question • NLP often stops after recognising names and coreference links Image source: http://imagens.tiespecialistas.com.br/2011/10/Figura02.png
  • 46. The Three Musketeers: F1 32 - 48
  • 47. The Three Musketeers after rewriting d’Artagnan to Dartagnan
  • 48. Why is fiction hard for NLP? • Fiction writers don’t have to abide by conventions: they can use language more creatively than newspaper journalists • mix languages • make up languages • use nicknames • Narratives written from first-person perspective confuse the software Image source: https://steamuserimages-a.akamaihd.net/ugc/859477733475369907/F34770D6EFEC30A70A84BEFE93C2C522C0B4A902/
  • 49. Performance fixes • Replace word names with generic names • Remove apostrophes from names • But: • Requires manual intervention • Doesn’t scale
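The apostrophe fix on slide 49 (rewriting d’Artagnan to Dartagnan) amounts to a simple string rewrite. The sketch below is an illustration under assumptions: the function name is hypothetical, the list of names has to be supplied by hand, and the re-capitalisation only makes sense for single-token names.

```python
import re


def strip_name_apostrophes(text, names):
    """Rewrite each listed name without apostrophes, e.g. d’Artagnan -> Dartagnan.

    The names list is compiled manually, which is why this fix does not scale.
    """
    for name in names:
        plain = re.sub(r"['\u2019]", "", name)
        # Capitalise only the first letter so the result looks like a regular name.
        plain = plain[:1].upper() + plain[1:].lower()
        text = text.replace(name, plain)
    return text


sentence = "Then d\u2019Artagnan drew his sword."
rewritten = strip_name_apostrophes(sentence, ["d\u2019Artagnan"])
```

The rewrite changes the text the tool sees, so any character offsets in the output refer to the modified text and have to be mapped back if the original wording matters for the research question.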
  • 51. Where to go from here? • More robust NLP tools are necessary to better understand novels (and other non-newspaper texts) • Background knowledge can help (e.g. GoT Wiki lists all Daenerys’ nicknames) • But: not all books are that popular • Also: different names are used in different contexts, you may not want to collapse them! • Always: don’t just assume it works, look into your data! • Full paper at: http://peerj.com/articles/cs-189 Image source: https://news.images.itv.com/image/file/1232718/stream_img.jpg
  • 52. Conclusions • Huge gap between NLP research and use cases • Understanding of each other’s tools and questions • What NLP tools can handle • First: What does the research question really need? • Then: What is the mismatch between my data and what the tools can handle? • Next: Let’s get to work, there’s lots to do!
  • 53. COST Action 18209 • Web-centred Linguistic Data Science • Various use cases (also digital humanities!) • Management committee members representing Finland: Jouni Tuominen & Eero Hyvönen and Mietta Lennes & Minna Tamper • Website still under construction, for now: https://www.cost.eu/actions/CA18209/
  • 54. Work in progress • Historical Image Analysis (@MelvinWevers) • Global Apple Pie (with Ulbe Bosma & Rebeca Ibáñez-Martín) • 18th century career mobility (DHLab + HI + DI) • What makes or breaks an idea? (@AdinaNerghes) • Amsterdam Time Machine (@merpeltje)
  • 55. Teaser: CULTURAIL “Cultural AI is the study, design and development of socio-technological AI systems that are implicitly or explicitly aware of the subtle and subjective richness of human culture. It is as much about using AI for analyzing human culture as it is about using knowledge and expertise from the humanities to analyze and improve AI technology. It studies how to deal with cultural bias in data and technology and how to build AI that is optimized for cultural and ethical values.” Van Erp, Van den Bosch & Van Ossenbruggen, 2019 Image source: https://accuform-img2.akamaized.net/files/damObject/Image/huge/FRW304.jpg