
Square pegs and round holes: addressing the mismatch between humanities questions and the state-of-the-art in language technology

Keynote at HELDIG Summit
7 November 2019, University of Helsinki, Finland https://www.helsinki.fi/en/helsinki-centre-for-digital-humanities/heldig-digital-humanities-summit-2019

Abstract:
The use of computational methods in humanities research is gaining popularity and leading to new insights. But as we move from distant reading methods to deeper language understanding, we find that many state-of-the-art language technology tools don't behave quite as advertised in publications. The corpora humanities scholars investigate display a wide range of language phenomena, and humanities scholars do not necessarily have the same goals when they apply these language technology tools as the computational linguists who developed them. The variety in time span, genre, digitisation quality and corpus heterogeneity shows the gap between the two research domains.

In this talk, I will discuss several projects in which we needed to address the mismatch between language technology tools and the humanities research objectives, and how we can go forward in fitting our computational methods to the diversity of humanities research questions.


  1. Square pegs and round holes: addressing the mismatch between humanities questions and the state-of-the-art in language technology
     Marieke.van.Erp@dh.huc.knaw.nl, @merpeltje
     Digital Humanities Lab
  2. Three use cases:
     • Messy data: EviDENce project
     • OCR troubles: Historical Recipe Web
     • Genre mismatch: Why Language Technology Can’t Handle Game of Thrones (yet)
  3. EviDENce - Ego Documents Events ModelliNg
     How individuals recall war and violence
     Hucopix - Elodie Burillon
  4. Ego Documents Events modelliNg - how individuals recall war and violence
     Sources:
     - Oral history interview transcripts WW2 (450)
     Aims:
     - Better understand nature of and change in eyewitness reports
     - Further develop event detection as means for extracting relevant information from large and complex textual datasets
  5. NLP Pipeline
  6. manually annotated fragment / Annotated by NLP Pipeline
  7. Events - most frequent terms
     Manual                          Count   NLP Pipeline  Count
     bombardement                       28   zijn            394
     brand                               5   hebben           84
     Arbeitseinsatz                      4   zeggen           78
     onderduiken                         4   gaan             43
     razzia                              4   zitten           42
     Amerikaans bombardement             2   weten            39
     gevochten                           2   doen             33
     mobilisatietijd                     2   komen            28
     toen ging het allemaal branden      2   horen            27
     verraden                            2   wonen            26
  8. Events: Matching of manual and automatic annotation
     Manual               NLP Pipeline
     bombardement 28      SRL: Subject or object: “het Engels bombardement” 1
                          SRL: Subject or object: “het bombardement” 1
                          bombarderen 5
     brand 5              branden 5
                          afbranden 3
                          SRL: Subject or object: “die brand” 1
     Arbeitseinsatz 4     Location: “Arbeitseinsatz” 1
     onderduiken 4        onderduiken 4
     razzia 4             Time: “als er razzia komen” 1
     gevochten 2          vechten 3
  9. Actors: Matching of manual and automatic annotation
     Manual               NLP Pipeline
     Ik 215               Ik 61
     we 68                we 10
     vader 30             “mijn vader” 1
     moeder 11            “mijn moeder” 2
     broer 9              “broer en zus” 1
                          “een broer” 1
     vrienden 8           not found 1
     Amerikanen 7         Location: “Amerikanen” 6
     ouders 6             not found 4
     die Duitsers 5       Location: “Duitsers” 9
  10. Taking a step back: what does the research question really need?
      • EviDENce historians are interested in relevant passages
      • NLP pipeline analyses texts down to word level
      • Should we be using an NLP pipeline at all?
      Image source: https://cdn.xingosoftware.com/dedikkeblauwe/images/fetch/dpr_2/https%3A%2F%2Fwww.dedikkeblauwe.nl%2Fassets%2Fupload%2Fimages%2F49%2F20190131165659_Kanon-op-mug.png
  11. Back to the drawing board!
      • Current pipeline is error prone
      • Humanities scholars are not trained to think in NLP modules and linguistic layers
      • Can we gather text passages describing violence without deep text analysis?
      • Three approaches:
        • keyword expansion
        • doc2vec
        • ElasticSearch
  12. Take home message
      • Choose the right tool!
      • It takes time to understand each other
      • Next week we’ll know what other historians think of our approach :)
  13. Constructing a Recipe Web from Historical Newspapers
      Marieke van Erp @merpeltje
      Melvin Wevers @melvinwevers
      Hugo Huurdeman @timelessfuture
      Image source: https://static.ah.nl/static/recepten/img_006188_890x594_JPG.jpg
  14. Butter, salt & pepper
      • Analysis of food customs:
        • historians
        • dieticians
        • ethnologists
      • 1945 - 1995 Parool, Volkskrant, NRC & Trouw
      • Dataset and code available through: https://github.com/DHLab-nl/historical-recipe-web
      • Winner National Library - Rijksmuseum - Network Digital Heritage HackaLOD Hackathon
      • You & other researchers are invited to work with us on case studies around food culture
      Image source: https://assets3.thrillist.com/v1/image/1623749/size/tl-horizontal_main_2x.jpg
  15. Newspapers as a source for recipes
      • perception of a Dutch food culture formed in the 1950s
      • newspapers are producers and messengers of public discourse
      • newspapers contain views on daily life and customs
      • But:
        • keyword search for ‘recepten’ imprecise
        • noise from digitisation process
      Image source: delpher.nl
  16. Newspaper dataset
      • Dutch National Library has digitised 90+ million book, newspaper and magazine pages
      • Newspapers published between 1618 - 1995 from the Netherlands, the Dutch Indies (present day Indonesia), the Antilles, the US and Surinam (15% of all newspapers published in the Netherlands)
      • Available via website, data dump (until 1876) and API (with agreement)

                    Pages     Articles         Tokens
      Parool       14,194    2,380,697    612,036,106
      Volkskrant   13,628    2,248,652    744,275,792
      NRC           7,199      947,198    489,397,816
      Trouw        13,891    2,578,731    656,941,631
      Total:       48,912    8,155,278  2,502,651,345

      article: https://www.delpher.nl/nl/kranten/view?coll=ddd&identifier=ddd:010627319:mpeg21:a0067
  17. Source: https://resolver.kb.nl/resolve?urn=ABCDDD:010848341:mpeg21:a0207
      OCR output as shown on the slide:
      dinsdag 6 ossestaartsoep HUt *orstjes l 0( * bonen met ananas t t e bonen met ananas Va0,1 2 blikken witte bonen In 1 uitje, 1 eetlepel ?lWd- 2 eetlepels keuken- 12 knakwostjes, 1 klein „ ftaLananasDlokJes- SoJrJi het uitJe en meng dit ,Qoe h bonen met tomatensaus. Nir,;e groente in een ingevette ?fd h ste schaal. Roer de mos- Je hni?or de stroop en giet hier
  18. OCR Quality
  19. From newspapers to a recipe web
      [Diagram; labels in extraction order: Ingredients; Recipe tags; Recipe descriptions; Recipe articles; Information Extraction and Multilabel Classification; Enrichment; Ingredient and quantity extraction; Recipe tags; Structured newspaper recipes; Origin; DBpedia link; Scientific name; Recipe text detection; Structured and enriched newspaper recipes; Seed list; Text classification]
  20. What & how much?
      • articles cannot automatically be segmented
      • OCR errors and non-grammatical sentences are a hurdle for standard NLP pipelines
      • lexicon-based extraction of ingredients and quantities
      Image source: https://cdn.pixabay.com/photo/2014/11/15/20/30/kitchen-scale-532651_960_720.jpg
  21. Evaluation
      • 100 articles were manually annotated using Recogito
      • OCR errors in ingredients or quantities marked separately
      • IAA .85 but OCR boundaries difficult: jºar,anen’ vs ◦ºar,anen’
      • Most precise lexicon: F1 = .67
      • More research is needed for out-of-lexicon ingredients
  22. Results ingredients extraction
  23. 27,411 new (old) recipes
      • 34,479 Tags
      • 365,133 ingredients
      • >17,000 Links to external sources
      • Data and software available at: https://github.com/DHLab-nl/historical-recipe-web
      Source: https://static.ah.nl/static/recepten/img_074629_890x594_JPG.jpg
  24. Take home message
      • OCR errors can impact information extraction
      • OCR post-correction is an active research field, but errors will remain
      • Focus on most important elements to extract
      Source: https://resolver.kb.nl/resolve?urn=ABCDDD:010848341:mpeg21:a0207
      Source: https://resolver.kb.nl/resolve?urn=ABCDDD:010877049:mpeg21:a0158
  25. Acknowledgements:
      Image source: https://twelvemilesfromalemondotcom.files.wordpress.com/2014/09/img_0326.jpg
  26. Why Language Technology Can’t Handle Game of Thrones (yet)
      Niels Dekker, Tobias Kuhn & Marieke van Erp
      Image source: https://anibundel.files.wordpress.com/2015/04/jonsnow-leaves-ygritte.jpg
  27. Background
      • Characters and relations are the backbone of stories
      • Computational methods allow for scaling up network extraction and analysis
      • Relies on named entity recognition
      • Most work thus far focuses on 19th and early 20th century novels
      • Research question: how do these tools perform on modern science fiction/fantasy novels?
  28. Experimental setup
      • Collect 20 ‘old’ and 20 ‘new’ novels
      • Annotate first chapters for entities and relationships between entities (gold standard)
      • Evaluate entity recognition tools on the sets of ‘old’ and ‘new’ novels
      • Compare system outputs to gold standard annotations
      • Bonus: compare network structures
      Image source: delpher.nl
      Image source: https://cdn-images-1.medium.com/max/2400/1*QbCo9uE7jPbt1ttnMsqOog.jpeg
  29. 19th and early 20th century novels, based on The Guardian’s Top 100 Classic novels + availability through Project Gutenberg + used in earlier studies
  30. ‘New’ Science Fiction and Fantasy novels, based on list from BestFantasyBooks.com
  31. Data preprocessing
      • All books converted to plain text format
      • Ensure all texts have the same character encoding
      • Pro tip: check that there are no odd or inconsistent quotation marks in your documents
      • Appendices, glossaries and reviews were removed manually
      Image source: https://www.dataentryoutsourced.com/blog/wp-content/uploads/2015/03/Post-091-640x200.jpg
  32. Gold standard annotations
      • Chapter lengths varied from 84 to 1,442 sentences
      • An average of 300 sentences close to a chapter boundary was selected
        • e.g. the third chapter in Alice in Wonderland ended after sentence 315, so for that book the first three chapters were annotated
      • 2 annotators (not the authors of the study)
      Image source: https://panmacmillan.azureedge.net/pmk11/panmacmillan/files/media/panmacmillan/blogs/tws/august%202017/alice-in-wonderland-knowledge-quiz-header.png
  33. Annotation Instructions
      • For each sentence:
        • Identify all characters in it
        • Identify anaphoric references (e.g. she refers to Alice)
      • To speed up the process, annotators were provided with a list of characters derived automatically
      • Missing characters could be added to the list
      • Ignore generic pronouns, exclamations, generic noun phrases, non-human named characters (Buckbeak)
      Image source: https://vignette.wikia.nocookie.net/p__/images/3/35/Erich_Mueller_and_Shannon_McGrath_are_glued_together_back_to_back_with_Tree_Resin.jpeg/revision/latest?cb=20170331180847&path-prefix=protagonist
  34. Named Entity Recognisers: BookNLP
      • NLP pipeline modified to deal with books
      • POS tagging, dependency parsing, NER, character name clustering, quotation speaker identification, pronominal coreference resolution, supersense tagging
      • NER module based on Stanford NER, with some modifications
      • We focus on NER, character name clustering and pronominal character resolution modules in our evaluation
      • https://github.com/dbamman/book-nlp
      Image source: https://cdn.aarp.net/content/dam/aarp/money/budgeting_savings/2016/04/1140-yeager-sell-your-used-books.imgcache.rev6feda141288df73e8fd100822bb375ea.jpg
  35. Intermediate conclusion
      • No difference between ‘old’ and ‘new’ books
      • Within categories, great variety in entity distributions and results
      • If a central entity is missed, the performance suffers greatly (e.g. Brave New World)
      • Coreference resolution particularly difficult in this domain
      Image source: https://www.nuffoodsspectrum.in/uploads/articles/quarterly_results_bg-4192.jpg
  36. [Figure: network graph of characters; node labels are garbled in the extraction and not recoverable]
  37. [Figure: smaller network graph of characters; node labels are garbled in the extraction and not recoverable]
  38. Image source: https://i.pinimg.com/originals/30/25/20/302520dbb49bb4a01b5687a7e6c6bf60.jpg
  39. From NLP output to KGs
      • Names aren’t just about labels
      • Context has meaning too
      • Collapse Dany/Daenerys?
        • depends on your research question
      • NLP often stops after recognising names and coreference links
      Image source: http://imagens.tiespecialistas.com.br/2011/10/Figura02.png
  40. The Three Musketeers: F1 32 - 48
  41. The Three Musketeers after rewriting d’Artagnan to Dartagnan
  42. Why is fiction hard for NLP?
      • Fiction writers don’t have to abide by conventions: they can use language more creatively than newspaper journalists
        • mix languages
        • make up languages
        • use nicknames
      • Narratives written from first-person perspective confuse the software
      Image source: https://steamuserimages-a.akamaihd.net/ugc/859477733475369907/F34770D6EFEC30A70A84BEFE93C2C522C0B4A902/
  43. Performance fixes
      • Replace word names with generic names
      • Remove apostrophes from names
      • But:
        • Requires manual intervention
        • Doesn’t scale
  44. Image source: https://static.boredpanda.com/blog/wp-content/uploads/2015/10/funny-game-of-thrones-memes-fb__700.jpg
  45. Where to go from here?
      • More robust NLP tools are necessary to better understand novels (and other non-newspaper texts)
      • Background knowledge can help (e.g. GoT Wiki lists all Daenerys’ nicknames)
      • But: not all books are that popular
      • Also: different names are used in different contexts, you may not want to collapse them!
      • Always: don’t just assume it works, look into your data!
      • Full paper at: http://peerj.com/articles/cs-189
      Image source: https://news.images.itv.com/image/file/1232718/stream_img.jpg
  46. Conclusions
      • Huge gap between NLP research and use cases
      • Understanding of each other’s tools and questions
      • What NLP tools can handle
      • First: What does the research question really need?
      • Then: What is the mismatch between my data and what the tools can handle?
      • Next: Let’s get to work, there’s lots to do!
  47. COST Action 18209
      • Web-centred Linguistic Data Science
      • Various use cases (also digital humanities!)
      • Management committee members representing Finland: Jouni Tuominen & Eero Hyvönen and Mietta Lennes & Minna Tamper
      • Website still under construction, for now: https://www.cost.eu/actions/CA18209/
  48. Work in progress
      • Historical Image Analysis (@MelvinWevers)
      • Global Apple Pie (with Ulbe Bosma & Rebeca Ibáñez-Martín)
      • 18th century career mobility (DHLab + HI + DI)
      • What makes or breaks an idea? (@AdinaNerghes)
      • Amsterdam Time Machine (@merpeltje)
  49. Teaser: CULTURAIL
      “Cultural AI is the study, design and development of socio-technological AI systems that are implicitly or explicitly aware of the subtle and subjective richness of human culture. It is as much about using AI for analyzing human culture as it is about using knowledge and expertise from the humanities to analyze and improve AI technology. It studies how to deal with cultural bias in data and technology and how to build AI that is optimized for cultural and ethical values.”
      Van Erp, Van den Bosch & Van Ossenbruggen, 2019
      Image source: https://accuform-img2.akamaized.net/files/damObject/Image/huge/FRW304.jpg
  50. dhlab.nl
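
Slide 11 lists keyword expansion as one way to gather passages about violence without deep text analysis. The slides do not say how the expansion was done, so the following is only a minimal sketch under assumptions: it expands a seed list with terms that co-occur with the seeds more often than they occur elsewhere. The function names, the stopword list and the toy passages are all invented for illustration; only the terms "bombardement" and "brand" come from slide 7.

```python
import re
from collections import Counter

# Invented toy passages; the real corpus is the interview transcripts.
PASSAGES = [
    "Na het bombardement stond de hele straat in brand",
    "Het bombardement kwam en de brand was overal",
    "Bij het bombardement vloog de kerk in brand",
    "De brand verwoestte ons huis",
    "We gingen naar school in de stad",
]
# Hypothetical stopword list, just large enough for the toy data.
STOPWORDS = frozenset({"de", "het", "en", "in", "was", "na", "we", "ons", "naar", "bij"})


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def expand_keywords(passages, seeds, top_n=3, stopwords=frozenset()):
    """Return up to top_n terms that co-occur with the seeds more often than not."""
    seeds = {s.lower() for s in seeds}
    with_seed, without_seed = Counter(), Counter()
    for passage in passages:
        tokens = set(tokenize(passage)) - stopwords
        target = with_seed if tokens & seeds else without_seed
        target.update(tokens - seeds)
    # Score = passages with a seed minus passages without a seed.
    scored = {term: count - without_seed[term] for term, count in with_seed.items()}
    ranked = sorted(scored, key=lambda term: (-scored[term], term))
    return [term for term in ranked if scored[term] > 0][:top_n]


def find_passages(passages, keywords):
    """Return the passages that contain at least one keyword."""
    keywords = {k.lower() for k in keywords}
    return [p for p in passages if set(tokenize(p)) & keywords]


EXPANDED = expand_keywords(PASSAGES, {"bombardement"}, top_n=1, stopwords=STOPWORDS)
```

On the toy data the seed alone retrieves three passages, while the seed plus the expanded term also retrieves the passage that mentions only the fire. On real transcripts the expansion would need manual review, since frequent but irrelevant terms can score highly.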
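
Slide 20 names lexicon-based extraction of ingredients and quantities as the approach taken for the recipe web. The project's actual code is in the repository linked on slide 14; the sketch below is not that code, only an illustration of the general idea with a regular expression built from two small lexicons. The lexicons, the pattern and the function names are assumptions. The noisy sample reuses the OCR misspelling "knakwostjes" from slide 17 to show why out-of-lexicon forms are missed.

```python
import re

# Hypothetical mini-lexicons; a real lexicon would be far larger.
INGREDIENTS = {"witte bonen", "bonen", "knakworstjes", "stroop"}
UNITS = {"blik", "blikken", "eetlepel", "eetlepels"}


def _alternation(terms):
    """Build a regex alternation, longest term first so 'witte bonen' beats 'bonen'."""
    return "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))


def build_pattern(ingredients, units):
    """Match '<number> [unit] <ingredient>' where unit and ingredient come from the lexicons."""
    return re.compile(
        r"\b(?P<qty>\d+)\s+"
        rf"(?:(?P<unit>{_alternation(units)})\s+)?"
        rf"(?P<ingredient>{_alternation(ingredients)})\b",
        re.IGNORECASE,
    )


def extract(text, pattern):
    """Return (quantity, unit or None, ingredient) tuples found in the text."""
    return [
        (int(m["qty"]), m["unit"], m["ingredient"].lower())
        for m in pattern.finditer(text)
    ]


PATTERN = build_pattern(INGREDIENTS, UNITS)
CLEAN = "2 blikken witte bonen, 12 knakworstjes en 1 eetlepel stroop"
NOISY = "2 blikken witte bonen, 12 knakwostjes"  # OCR error as on slide 17
```

With the clean sample all three ingredients are found; with the noisy sample only the white beans are, because the misspelled sausage is not in the lexicon. This is the out-of-lexicon problem slide 21 flags as needing more research.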
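
Slides 21, 28 and 40 report F1 scores from comparing system output with gold standard annotations. The slides do not specify the matching criteria (exact or partial spans, per sentence or per document), so this sketch assumes the simplest case: annotations are compared as exact-match sets. The tuple layout (article index, ingredient) and the toy data are invented for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Compute exact-match precision, recall and F1 between two annotation sets."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented toy annotations: (article index, ingredient).
GOLD = {(0, "witte bonen"), (0, "knakworstjes"), (1, "stroop"), (1, "ananas")}
PREDICTED = {(0, "witte bonen"), (1, "stroop"), (1, "bonen")}

PRECISION, RECALL, F1 = precision_recall_f1(GOLD, PREDICTED)
```

Here two of the three predictions are correct and two of the four gold items are found, so precision is 2/3, recall is 1/2 and F1 is 4/7. A partial-match scheme would credit "bonen" against "witte bonen" and give different numbers, which is why the matching rule matters when comparing reported scores.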
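
Slides 41 and 43 describe removing apostrophes from names, with d’Artagnan rewritten to Dartagnan, and note that this requires manual intervention. A minimal sketch of that rewrite follows, assuming a hand-curated list of single-token names; the function names are hypothetical, and the paper linked on slide 45 should be consulted for what the authors actually did.

```python
def normalise_name(name):
    """Drop typographic and straight apostrophes, then capitalise the result.

    Assumes a single-token name: str.capitalize lowercases everything after
    the first letter, so multi-word names would lose their inner capitals.
    """
    stripped = name.replace("\u2019", "").replace("'", "")
    return stripped.capitalize()


def rewrite_names(text, names):
    """Replace each listed name in the text by its normalised form."""
    # Longest names first so a short name never clobbers part of a longer one.
    for name in sorted(names, key=len, reverse=True):
        text = text.replace(name, normalise_name(name))
    return text


# The name list is the manual step: someone has to compile it per book.
NAMES = ["d\u2019Artagnan"]
SAMPLE = "\u201cWell,\u201d said d\u2019Artagnan."
REWRITTEN = rewrite_names(SAMPLE, NAMES)
```

The rewrite itself is trivial; the cost is in compiling the name list for every book, which matches the slide's point that the fix does not scale.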
