Why Language Technology Can’t
Handle Game of Thrones (yet)
Marieke van Erp merpeltje

Joint work with: 

Niels Dekker & Tobias Kuhn
Image source: https://anibundel.files.wordpress.com/2015/04/jonsnow-leaves-ygritte.jpg
This talk
• NLP 101 

• Recognising named entities
in fiction

• Digital Humanities @KNAW
HuC
DIGITAL HUMANITIES LAB
Image source: https://vignette.wikia.nocookie.net/pirates/images/3/3c/MediterraneanProfile.jpg/revision/latest?cb=20120312215230
Image source: http://www.jvwmoergestel.nl/site/wp-content/uploads/2016/12/KerstWoordzoeker.jpg
NLP 101
Image source: https://i.ytimg.com/vi/iuumnjJWFO4/maxresdefault.jpg
NLP 101: What is Text Mining?
• Extracting knowledge and information from texts in natural language:
• metadata for a text: author, publisher, time of publication, topic, its language, URL, URLs to and
from a web text
• people mentioned in text, but also companies, organisations, places, dates → links to Wikipedia,
Wikification of text
• Amounts: prices, age, size, distance, weight
• Facts (statements), concepts (terms) and relations between concepts
• Sentiment (positive/negative), opinions
• Emotions, purpose, intention, humour, sarcasm, irony, threats, style (formal, informal), genre (blog,
news, science, tax form)
Types of Knowledge to Extract
• Conceptual relations: define possible relations between concepts in an ontology, e.g.
what things have weight, size, age, get born, eat, drink, get an education, work, marry,
do sports, live and die.
• Factual relations: actual instantiations of concepts and relations that hold in
some world (time and place), e.g. Barack Obama was born on August 4, 1961, in Honolulu,
Hawaii.
• Factual relations need to fit the ontological model, but the ontology does not predict
actual facts, only possible ones.
• Opinions: epistemic and modal relations (believe, wish, hope, fear, expect) between
source and target expressed as a private state of the source, e.g. I am a fan of Barack
Obama, I believe Barack Obama will help people.
Text Mining pipeline
• Analysis starts at token-level
• Moves up to phrases, sentences and
documents
• Performance goes down as the analysis
becomes deeper
• Statistical methods mostly used, but hybrid
methods are a promising research topic
Pipeline diagram: Input text → Tokenisation → Lexical Analysis → Syntactic Analysis → Semantic Analysis → Pragmatic Analysis → Speaker's intended meaning
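As a rough illustration of the lowest pipeline levels, the sketch below splits text into sentences and then into tokens using only regular expressions. The function names and the splitting rules are assumptions made for this example; real tokenisers handle abbreviations, quotations and many other cases that this sketch ignores.

```python
import re

def split_sentences(text):
    # Naive rule (an assumption): a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenise(sentence):
    # A token is a word (optionally with internal apostrophes) or one punctuation mark.
    return re.findall(r"\w+(?:['\u2019]\w+)*|[^\w\s]", sentence)

text = "Jon left. Ygritte didn't follow!"
sentences = split_sentences(text)
tokens = [tokenise(s) for s in sentences]
```

Every later stage builds on these token and sentence boundaries, so mistakes made here carry through to the deeper analyses.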
Companies want text mining
• From click logs they can see what people looked at on their site
• To know what people think about it, they need to mine reviews, tweets etc.: text
mining
• To stay ahead of their competitors, they need to obfuscate their patents, and
find relevant patents from competitors: text mining
• To aid their information departments, they need access to relevant
information: text mining
Humanities researchers want text mining
• To evaluate gender bias in large corpora: http://literaryquality.huygens.knaw.nl/
• To trace concepts through time: https://www.esciencecenter.nl/project/evidence
• To detect and model populist movements on social media: https://www.meertens.knaw.nl/cms/en/research/projects/259-het-dagelijks-leven/145541-populisme-social-media-en-religie
• Analysis of church registers, letters, ship journals etc…
State-of-the-art
• POS tagging: 97%
• Sentiment Analysis: 95% (document level) / 54% (fine-grained sentence level)
• Named Entity Recognition: 90%
• Temporal information extraction: 77%
Note: this holds for English and on standardised datasets
Image source: https://memegenerator.net/img/instances/56709008/are-we-there-yet.jpg
Image source: https://i.redd.it/dmnouc4hip521.jpg
Recognising named entities in fiction
Image source: https://wp-media.patheos.com/blogs/sites/1186/2019/04/mauricio-santos-503880-unsplash.jpg
Background
• Characters and relations are the backbone of
stories 

• Computational methods allow for scaling
up network extraction and analysis 

• Relies on named entity recognition 

• Most work thus far focuses on 19th and
early 20th century novels 

• Research question: how do these tools
perform on modern science fiction/fantasy
novels?
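To make "network extraction" concrete, here is a minimal sketch that builds a weighted character network from per-sentence character lists. It assumes sentence-level co-occurrence as the edge criterion, which the slides do not specify; the function name and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(sentence_characters):
    """Count how often each pair of characters shares a sentence."""
    edges = Counter()
    for characters in sentence_characters:
        # sorted() gives every unordered pair one canonical key.
        for pair in combinations(sorted(set(characters)), 2):
            edges[pair] += 1
    return edges

# Toy input: one list of recognised characters per sentence (made up).
sentences = [
    ["Jon Snow", "Ygritte"],
    ["Jon Snow"],
    ["Ygritte", "Jon Snow", "Mance Rayder"],
]
edges = cooccurrence_edges(sentences)
```

Because the edges are derived directly from the recognised names, any name the recogniser misses or splits in two changes the shape of the network.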
Image source: https://newleftreview.org/system/dragonfly/production/2019/03/09/9rcllsj7us_3020501.gif
Experimental setup
• Collect 20 ‘old’ and 20 ‘new’ novels 

• Annotate first chapters for entities and
relationships between entities (gold
standard)

• Run 4 named entity recognisers on the sets
of ‘old’ and ‘new’ novels 

• Compare system outputs to gold standard
annotations 

• Bonus: compare network structures
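The comparison against the gold standard can be sketched with the usual precision, recall and F1 definitions. The representation below, a set of (sentence number, character) pairs, is an assumption made for illustration and not necessarily the matching scheme used in the study; the example annotations are made up.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted annotations against gold annotations (both iterables)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical annotations as (sentence number, character) pairs.
gold = {(1, "Alice"), (2, "Alice"), (2, "White Rabbit"), (3, "Dinah")}
predicted = {(1, "Alice"), (2, "White Rabbit"), (3, "Mouse")}
p, r, f1 = precision_recall_f1(gold, predicted)
```

Here two of three predictions are correct and two of four gold annotations are found, so precision is 2/3, recall is 1/2 and F1 is 4/7.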
Image source: delpher.nl
Image source: https://cdn-images-1.medium.com/max/2400/1*QbCo9uE7jPbt1ttnMsqOog.jpeg
19th and early 20th century novels, based on The Guardian’s Top 100 Classic novels +
availability through Project Gutenberg + used in earlier studies
‘New’ Science Fiction and Fantasy novels, based on list from BestFantasyBooks.com
Data preprocessing
• All books converted to plain text format 

• Ensure all texts have the same character
encoding 

• Pro tip: check that there are no
odd or inconsistent quotation marks in
your documents

• Appendices, glossaries and reviews were
removed manually
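The encoding and quotation-mark checks above could look roughly like the following sketch. The mapping table and the function name are assumptions for this example; which characters need normalising depends on the actual source files.

```python
import unicodedata

# Map typographic quotes to plain ASCII ones (an illustrative selection).
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u00ab": '"', "\u00bb": '"',   # guillemets
})

def normalise_text(raw_bytes, encoding="utf-8"):
    # Decode with a known encoding; undecodable bytes become U+FFFD so they stay visible.
    text = raw_bytes.decode(encoding, errors="replace")
    # Use one Unicode normal form so accented characters compare equal.
    text = unicodedata.normalize("NFC", text)
    return text.translate(QUOTE_MAP)

sample = "\u201cHello,\u201d said d\u2019Artagnan.".encode("utf-8")
clean = normalise_text(sample)
```

Counting the remaining U+FFFD characters after this step is a quick way to spot files that were read with the wrong encoding.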
Image source: https://www.dataentryoutsourced.com/blog/wp-content/uploads/2015/03/Post-091-640x200.jpg
Gold standard annotations
• Chapter lengths varied from 84 to 1,442
sentences 

• On average, 300 sentences per book were selected,
ending at a nearby chapter boundary 

• e.g. the third chapter in Alice in
Wonderland ended after sentence
315, so for that book the first three
chapters were annotated

• 2 annotators (not the authors of the study)
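One way to read the selection rule is "cut at the chapter boundary closest to 300 sentences". The sketch below implements that reading, which is an assumption. Only the value 315 for the third chapter of Alice in Wonderland comes from the slide; the other chapter lengths are invented for the example.

```python
def pick_cutoff(chapter_end_sentences, target=300):
    """Return (number of chapters, last sentence) for the boundary nearest the target."""
    best_index = min(
        range(len(chapter_end_sentences)),
        key=lambda i: abs(chapter_end_sentences[i] - target),
    )
    return best_index + 1, chapter_end_sentences[best_index]

# Cumulative sentence count at the end of each chapter.
# Only 315 is taken from the slide; 96, 203 and 447 are hypothetical.
alice_chapter_ends = [96, 203, 315, 447]
chapters, last_sentence = pick_cutoff(alice_chapter_ends)
```

With these numbers the function selects the first three chapters, matching the Alice in Wonderland example.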
Image source: https://panmacmillan.azureedge.net/pmk11/panmacmillan/files/media/panmacmillan/blogs/tws/august%202017/alice-in-wonderland-knowledge-quiz-header.png
Annotation Instructions
• For each sentence:

• Identify all characters in it 

• Identify anaphoric references (e.g. "she"
referring to Alice) 

• To speed up the process, annotators were
provided with a list of characters derived
automatically

• Missing characters could be added to the
list 

• Ignore generic pronouns, exclamations,
generic noun phrases, non-human named
characters (Buckbeak)
Image source: https://vignette.wikia.nocookie.net/p__/images/3/35/Erich_Mueller_and_Shannon_McGrath_are_glued_together_back_to_back_with_Tree_Resin.jpeg/revision/latest?cb=20170331180847&path-prefix=protagonist
Named Entity Recognisers:
BookNLP
• NLP pipeline modified to deal with books 

• POS tagging, dependency parsing, NER,
character name clustering, quotation
speaker identification, pronominal
coreference resolution, supersense tagging

• NER module based on Stanford NER, with
some modifications 

• We focus on NER, character name
clustering and pronominal character
resolution modules in our evaluation

• https://github.com/dbamman/book-nlp
Image source: https://cdn.aarp.net/content/dam/aarp/money/budgeting_savings/2016/04/1140-yeager-sell-your-used-books.imgcache.rev6feda141288df73e8fd100822bb375ea.jpg
Named Entity Recognisers:
Stanford NER
• State-of-the-art CRF NER system

• Trained on CoNLL 2003 data (Reuters
newswire articles from 1996-08-20 to
1997-08-19)

• Cited 2,720 times 

• F1 = 86.31 on CoNLL 2003 test set

• https://nlp.stanford.edu/software/CRF-NER.html
Named Entity Recognisers:
Illinois Tagger
• Perceptron-based classifier 

• Includes contextual information

• 10,146 downloads 

• F1 = 90.57 on CoNLL 2003 test set 

• https://cogcomp.org/page/software_view/NETagger
Image source: delpher.nl
Named Entity Recognisers:
IXA-Pipe-NERC
• Perceptron model 

• Additional background information from
Brown clusters

• F1 = 91.36 on CoNLL 2003 test set 

• https://github.com/ixa-ehu/ixa-pipe-nerc
[Figure: character network extracted from Game of Thrones, shown in full and as a zoomed-in detail. Node labels include reordered or noisy names such as "Snow Jon", "Lord lord Tyrion", "Jofftey" and "bush", alongside nicknames such as "Dany", "Khaleesi" and "Littlefinger".]
Image source: https://i.pinimg.com/originals/30/25/20/302520dbb49bb4a01b5687a7e6c6bf60.jpg
Discussion
• No difference in performance between ‘old’ and ‘new’
books 

• Within categories, great variety in entity
distributions and results 

• If a central entity is missed, the
performance suffers greatly (e.g.
Brave New World)

• Coreference resolution particularly difficult
in this domain
Image source: https://www.nuffoodsspectrum.in/uploads/articles/quarterly_results_bg-4192.jpg
Why is fiction hard for NLP?
• Fiction writers don’t have to abide by
conventions: they can use language more
creatively than newspaper journalists

• mix languages

• make up languages 

• use nicknames 

• Narratives written from first-person
perspective confuse the software
Image source: https://steamuserimages-a.akamaihd.net/ugc/859477733475369907/F34770D6EFEC30A70A84BEFE93C2C522C0B4A902/
[Figure: character network extracted from The Three Musketeers. Node labels include "Monsieur Athos", "Milady Clarik", "Lord de Winter" and "M. d’Artagnan".]
The Three Musketeers: F1 32 - 48
[Figure: second character network for The Three Musketeers. Node labels now include "M. Dartagnan", "Milady de Winter" and "Lord de De Winter".]
The Three Musketeers after rewriting d’Artagnan to Dartagnan
Image source: https://static.boredpanda.com/blog/wp-content/uploads/2015/10/funny-game-of-thrones-memes-fb__700.jpg
Performance fixes
• Replace names that are also ordinary words with generic names

• Remove apostrophes from names 

• But:

• Requires manual intervention

• Doesn’t scale
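A minimal sketch of the apostrophe fix, using the d’Artagnan to Dartagnan rewrite as the example. The replacement table has to be written by hand, which is exactly the manual step noted above; the function name and the sample sentence are assumptions for illustration.

```python
def rewrite_names(text, replacements):
    """Replace each listed name with its rewritten form before running NER."""
    # Longest names first, so a short name never clobbers part of a longer one.
    for original in sorted(replacements, key=len, reverse=True):
        text = text.replace(original, replacements[original])
    return text

# Hand-made table covering both the curly and the straight apostrophe.
replacements = {
    "d\u2019Artagnan": "Dartagnan",
    "d'Artagnan": "Dartagnan",
}
text = "M. d\u2019Artagnan bowed. Then d'Artagnan left."
fixed = rewrite_names(text, replacements)
```

The same table can be used in reverse after recognition to restore the original spelling in the output.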
Where to go from here?
• More robust NLP tools are necessary to better
understand novels (and other non-newspaper
texts)

• Background knowledge can help (e.g. GoT
Wiki lists all Daenerys’ nicknames)

• But: not all books are that popular 

• Also: different names are used in different
contexts; you may not want to collapse them! 

• Always: don’t just assume it works, look into
your data! 

• Full paper at: http://peerj.com/articles/cs-189
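The background-knowledge idea can be sketched as a lookup from nicknames to a canonical character, while keeping the surface form so that names are not collapsed irreversibly. The alias table below is a hand-written stand-in for what a fan wiki might provide, and the function name is hypothetical.

```python
# Hand-written alias table (a stand-in for externally sourced background knowledge).
ALIASES = {
    "Dany": "Daenerys Targaryen",
    "Khaleesi": "Daenerys Targaryen",
}

def link_mentions(mentions, aliases):
    """Pair every mention with its canonical name, keeping the original form."""
    return [(mention, aliases.get(mention, mention)) for mention in mentions]

linked = link_mentions(["Dany", "Khaleesi", "Jon Snow"], ALIASES)
```

Keeping both values lets an analysis group mentions by character when building a network, yet still study which name is used in which context.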
Image source: https://news.images.itv.com/image/file/1232718/stream_img.jpg
Digital Humanities Lab
History,
Literary Studies,
History of Science
& Scholarship
Social History
Dutch Language
& Culture
https://huc.knaw.nl/
Slide by Antal van den Bosch
Cultural Artificial Intelligence
Making AI culturally aware
Appreciate the user
Being contextually appropriate
Understand the issues
What do you get when you invert
“Digital Humanities”?
Slide by Antal van den Bosch
Applications of Cultural AI: Filters and flags
• Toxicity
• Protective filters (like spam filters and
ad blockers)
• Gender
• Linguistic filters and helpers
• Fake news
• Meme detectors, explanations
Slide by Antal van den Bosch
Theory of Cultural AI: Understanding & nuance
• Understanding concepts
• Changes over time
• Perspectives
• Evolution
• Knowing the origins of digital
stories
• Understanding viral potential
• Language is “social and
cultural data” (Nguyen, 2017)
Slide by Antal van den Bosch
Some DHLab projects
• Food culture via newspaper recipes
(Meertens and IISH)

• Analysing online debates: refugee vs
migrant (with EUR)

• Amsterdam Time Machine (with many
partners)

• Tracing 18th century career trajectories
(with HuC-DI & Huygens Institute)

• Analysing the concept ‘violence’ through
time (with NLeSc, OU & NIOD)
Debates on the refugee crisis
• From 2015 on, wider use of both
‘European refugee crisis’ and ‘European
migrant crisis’ in the news and social
media 

• “Framing labels” (Knoll, Redlawsk, &
Sanborn, 2011) imply two different frames:

• ‘Refugee’ – people fleeing conflict or
persecution

• ‘Migrant’ – improving economic situation

• Mixed usage and mislabeling have
implications for refugees, e.g., negative
influence on perceptions of host countries
DHLab@HuC:
Advancing the humanities through digital methods
• DHLabHuC / adinanerghes / melvinwevers
/ merpeltje 

• https://dhlab.nl (under construction)
Melvin Wevers, Adina Nerghes, Marieke van Erp
Thank you for your attention!

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Dernier (20)

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 

Why language technology can’t handle Game of Thrones (yet)

  • 1. Why Language Technology Can’t Handle Game of Thrones (yet) Marieke van Erp merpeltje Joint work with: Niels Dekker & Tobias Kuhn Image source: https://anibundel.files.wordpress.com/2015/04/jonsnow-leaves-ygritte.jpg
  • 2. This talk • NLP 101 • Recognising named entities in fiction • Digital Humanities @KNAW HuC Digital Humanities Lab Image source: https://vignette.wikia.nocookie.net/pirates/images/3/3c/MediterraneanProfile.jpg/revision/latest?cb=20120312215230
  • 5. NLP 101 Image source: https://i.ytimg.com/vi/iuumnjJWFO4/maxresdefault.jpg
  • 6. NLP 101: What is Text Mining? • Extracting knowledge and information from texts in natural language: • metadata for a text: author, publisher, time of publication, topic, its language, URL, URLs to and from a web text • people mentioned in text, but also companies, organisations, places, dates → links to Wikipedia, Wikification of text • Amounts: prices, age, size, distance, weight • Facts (statements), concepts (terms) and relations between concepts • Sentiment (positive/negative), opinions • Emotions, purpose, intention, humour, sarcasm, irony, threats, style (formal, informal), genre (blog, news, science, tax form)
  • 7. Types of Knowledge to Extract • Conceptual relations: define possible relations between concepts in an ontology, e.g. what things have weight, size, age, get born, eat, drink, get an education, work, marry, do sports, live and die. • Factual relations: actual instantiations of concepts and relations that are the case in some world (time and place), e.g. Barack Obama was born on August 4, 1961, in Honolulu, Hawaii. • Factual relations need to fit the ontological model, but the ontology does not predict actual facts, only the possible facts! • Opinions: epistemic and modal relations (believe, wish, hope, fear, expect) between source and target, expressed as a private state of the source, e.g. I am a fan of Barack Obama, I believe Barack Obama will help people.
  • 10. Text Mining pipeline • Analysis starts at token level • Moves up to phrases, sentences and documents • Performance goes down as the analysis becomes deeper • Statistical methods are mostly used, but hybrid methods are a promising research topic • Pipeline stages: Input text → Tokenisation → Lexical Analysis → Syntactic Analysis → Semantic Analysis → Pragmatic Analysis → Speaker's intended meaning
  • 11. Companies want text mining • From click logs they can see what people looked at on their site • To know what they think about it they need to mine reviews, tweets etc: text mining • To stay ahead of their competitors, they need to obfuscate their patents, and find relevant patents from competitors: text mining • To aid their information departments, they need access to relevant information: text mining
  • 12. Humanities researchers want text mining • To evaluate gender bias in large corpora: http://literaryquality.huygens.knaw.nl/ • To trace concepts through time: https://www.esciencecenter.nl/project/evidence • To detect and model populist movements on social media: https://www.meertens.knaw.nl/cms/en/research/projects/259-het-dagelijks-leven/145541-populisme-social-media-en-religie • To analyse church registers, letters, ship journals etc.
  • 13. State-of-the-art • POS tagging: 97% • Sentiment Analysis: 95% (document level) / 54% (fine-grained sentence level) • Named Entity Recognition: 90% • Temporal information extraction: 77% Note: this holds for English and on standardised datasets
  • 16. Recognising named entities in fiction Image source: https://wp-media.patheos.com/blogs/sites/1186/2019/04/mauricio-santos-503880-unsplash.jpg
  • 17. Background • Characters and relations are the backbone of stories • Computational methods allow for scaling up network extraction and analysis • This relies on named entity recognition • Most work thus far focuses on 19th and early 20th century novels • Research question: how do these tools perform on modern science fiction/fantasy novels? Image source: https://newleftreview.org/system/dragonfly/production/2019/03/09/9rcllsj7us_3020501.gif
  • 18. Experimental setup • Collect 20 ‘old’ and 20 ‘new’ novels • Annotate first chapters for entities and relationships between entities (gold standard) • Run 4 named entity recognisers on the sets of ‘old’ and ‘new’ novels • Compare system outputs to gold standard annotations • Bonus: compare network structures Image source: delpher.nl Image source: https://cdn-images-1.medium.com/max/2400/1*QbCo9uE7jPbt1ttnMsqOog.jpeg
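The comparison step in the setup above can be sketched as a precision/recall/F1 computation. This is a minimal illustration, not the scoring script used in the study: it assumes exact matching of `(sentence_index, name)` pairs, and the function name and the sample data are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted entity mentions against gold mentions (exact match)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (sentence index, character name) pairs.
gold = {(0, "Alice"), (1, "Alice"), (1, "Rabbit"), (2, "Dinah")}
predicted = {(0, "Alice"), (1, "Rabbit"), (2, "Alice")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Exact matching is the strictest choice; a real evaluation may also want to credit partial matches such as "Lord Mormont" versus "Mormont".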
  • 19. 19th and early 20th century novels, based on The Guardian’s Top 100 Classic novels + availability through Project Gutenberg + used in earlier studies
  • 20. ‘New’ Science Fiction and Fantasy novels, based on list from BestFantasyBooks.com
  • 21. Data preprocessing • All books converted to plain text format • Ensure all texts have the same character encoding • Pro tip: check that there are no odd or inconsistent quotation marks in your documents • Appendices, glossaries and reviews were removed manually Image source: https://www.dataentryoutsourced.com/blog/wp-content/uploads/2015/03/Post-091-640x200.jpg
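The encoding and quotation-mark checks above could look roughly like the sketch below. It is an assumption about how such a clean-up might be done, not the preprocessing code from the study; the function names and the choice to map everything to straight ASCII quotes are illustrative.

```python
import unicodedata

# Map common typographic quotation marks to straight ASCII quotes.
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'", "\u201a": "'", "\u201b": "'",
    "\u201c": '"', "\u201d": '"', "\u201e": '"', "\u201f": '"',
    "\u00ab": '"', "\u00bb": '"',
})


def normalise_text(raw_bytes, encoding="utf-8"):
    """Decode a book, apply NFC normalisation, and straighten quotation marks."""
    text = raw_bytes.decode(encoding)
    text = unicodedata.normalize("NFC", text)
    return text.translate(QUOTE_MAP)


def odd_quote_characters(text):
    """List leftover characters in the Unicode quote categories (Pi, Pf)."""
    return sorted({ch for ch in text if unicodedata.category(ch) in ("Pi", "Pf")})


sample = "\u2018Who are you?\u2019 said the \u201cCaterpillar\u201d.".encode("utf-8")
print(normalise_text(sample))
print(odd_quote_characters("\u2039rare\u203a"))  # quote styles the map does not cover
```

Running `odd_quote_characters` over each normalised book gives a quick list of quote styles that still need a manual look.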
  • 22. Gold standard annotations • Chapter lengths varied from 84 to 1,442 sentences • For each book, about 300 sentences were selected, ending at a nearby chapter boundary • e.g. the third chapter in Alice in Wonderland ended after sentence 315, so for that book the first three chapters were annotated • 2 annotators (not the authors of the study) Image source: https://panmacmillan.azureedge.net/pmk11/panmacmillan/files/media/panmacmillan/blogs/tws/august%202017/alice-in-wonderland-knowledge-quiz-header.png
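One way to read the selection rule above is "annotate the leading chapters whose end lies closest to sentence 300". The sketch below implements that reading as an assumption; the function name is hypothetical, and of the chapter end positions only the value 315 for chapter three comes from the slide, the rest are made up for illustration.

```python
def chapters_to_annotate(chapter_end_sentences, target=300):
    """Return how many leading chapters to annotate.

    chapter_end_sentences[i] is the number of the last sentence of chapter i + 1,
    counted from the start of the book.
    """
    if not chapter_end_sentences:
        raise ValueError("need at least one chapter boundary")
    best_index = min(
        range(len(chapter_end_sentences)),
        key=lambda i: abs(chapter_end_sentences[i] - target),
    )
    return best_index + 1


# Illustrative boundaries; only 315 (end of chapter three) is taken from the slide.
alice_chapter_ends = [98, 204, 315, 450]
print(chapters_to_annotate(alice_chapter_ends))
```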
  • 23. Annotation Instructions • For each sentence: • Identify all characters in it • Identify anaphoric references (e.g. she refers to Alice) • To speed up the process, annotators were provided with a list of characters derived automatically • Missing characters could be added to the list • Ignore generic pronouns, exclamations, generic noun phrases, non-human named characters (e.g. Buckbeak) Image source: https://vignette.wikia.nocookie.net/p__/images/3/35/Erich_Mueller_and_Shannon_McGrath_are_glued_together_back_to_back_with_Tree_Resin.jpeg/revision/latest?cb=20170331180847&path-prefix=protagonist
  • 24. Named Entity Recognisers: BookNLP • NLP pipeline modified to deal with books • POS tagging, dependency parsing, NER, character name clustering, quotation speaker identification, pronominal coreference resolution, supersense tagging • NER module based on Stanford NER, with some modifications • We focus on NER, character name clustering and pronominal character resolution modules in our evaluation • https://github.com/dbamman/book-nlp Image source: https://cdn.aarp.net/content/dam/aarp/money/budgeting_savings/2016/04/1140-yeager-sell-your-used-books.imgcache.rev6feda141288df73e8fd100822bb375ea.jpg
  • 25. Named Entity Recognisers: Stanford NER • State-of-the-art CRF NER system • Trained on CoNLL 2003 data (Reuters newswire articles from 1996-08-20 to 1997-08-19) • Cited 2,720 times • F1 = 86.31 on CoNLL 2003 test set • https://nlp.stanford.edu/software/CRF-NER.html
  • 26. Named Entity Recognisers: Illinois Tagger • Perceptron-based classifier • Includes contextual information • 10,146 downloads • F1 = 90.57 on CoNLL 2003 test set • https://cogcomp.org/page/software_view/NETagger Image source: delpher.nl
  • 27. Named Entity Recognisers: IXA-Pipe-NERC • Perceptron model • Additional background information from Brown clusters • F1 = 91.36 on CoNLL 2003 test set • https://github.com/ixa-ehu/ixa-pipe-nerc
  • 34. [Figure: character network with one node per extracted name; labels include Ned Stark, Arya Stark, Robb Stark and Daenerys Targaryen. Node labels omitted.]
  • 35. [Figure: cropped view of the character network; labels include Daenerys Targaryen, Viserys Targaryen and Drogo Khal. Node labels omitted.]
  • 39. Discussion • No difference between ‘old’ and ‘new’ books • Within categories, great variety in entity distributions and results • If a central entity is missed, the performance suffers greatly (e.g. Brave New World) • Coreference resolution particularly difficult in this domain Image source: https://www.nuffoodsspectrum.in/uploads/articles/quarterly_results_bg-4192.jpg
  • 40. Why is fiction hard for NLP? • Fiction writers don’t have to abide by conventions: they can use language more creatively than newspaper journalists • mix languages • make up languages • use nicknames • Narratives written from first-person perspective confuse the software Image source: https://steamuserimages-a.akamaihd.net/ugc/859477733475369907/F34770D6EFEC30A70A84BEFE93C2C522C0B4A902/
  • 41. [Figure: character network for The Three Musketeers; node labels omitted.] The Three Musketeers: F1 32 - 48
  • 42. [Figure: character network for The Three Musketeers; node labels omitted.] The Three Musketeers after rewriting d’Artagnan to Dartagnan
  • 44. Performance fixes • Replace word names with generic names • Remove apostrophes from names • But: • Requires manual intervention • Doesn’t scale
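The apostrophe fix above (d’Artagnan rewritten to Dartagnan) can be approximated with a regular expression. This is a hedged sketch rather than the rewriting used in the study, where the fix required manual intervention; the pattern only covers the elided particle d’ before a capitalised name, and the function name is hypothetical.

```python
import re

# An elided particle (d' or D', straight or curly apostrophe) before a capitalised name.
ELIDED_NAME = re.compile(r"\b([dD])['\u2019]([A-Z][a-z]+)")


def rewrite_elided_names(text):
    """Rewrite names such as d'Artagnan to Dartagnan."""
    return ELIDED_NAME.sub(
        lambda match: match.group(1).upper() + match.group(2).lower(), text
    )


print(rewrite_elided_names("M. d\u2019Artagnan greeted Athos."))
```

A pattern like this still has to be checked per book, which is the scaling problem the slide points out.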
  • 45. Where to go from here? • More robust NLP tools are necessary to better understand novels (and other non-newspaper texts) • Background knowledge can help (e.g. the GoT Wiki lists all of Daenerys’s nicknames) • But: not all books are that popular • Also: different names are used in different contexts, you may not want to collapse them! • Always: don’t just assume it works, look into your data! • Full paper at: http://peerj.com/articles/cs-189 Image source: https://news.images.itv.com/image/file/1232718/stream_img.jpg
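The point about background knowledge can be illustrated with a small alias table. The sketch below is an assumption about how such a lookup might be wired in, not part of the evaluated systems; the table is hand-written for illustration rather than scraped from a wiki, and the surface form is kept next to the canonical name so that context-specific names are not lost.

```python
# Illustrative alias table; a real one could be derived from a fan wiki.
ALIASES = {
    "Dany": "Daenerys Targaryen",
    "Khaleesi": "Daenerys Targaryen",
    "Daenerys": "Daenerys Targaryen",
}


def link_mentions(mentions, aliases):
    """Pair each surface form with a canonical name, keeping the original form."""
    return [(mention, aliases.get(mention, mention)) for mention in mentions]


for surface, canonical in link_mentions(["Dany", "Jon Snow", "Khaleesi"], ALIASES):
    print(f"{surface} -> {canonical}")
```

Returning pairs instead of overwriting the mention leaves the decision to collapse names, or not, to the analysis that follows.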
  • 46. Digital Humanities Lab • History, Literary Studies, History of Science & Scholarship • Social History • Dutch Language & Culture • https://huc.knaw.nl/
  • 47. Slide by Antal van den Bosch
  • 48. Slide by Antal van den Bosch
  • 49. Cultural Artificial Intelligence • Making AI culturally aware • Appreciate the user • Being contextually appropriate • Understand the issues • What do you get when you invert “Digital Humanities”? Slide by Antal van den Bosch
  • 50. Applications of Cultural AI: Filters and flags • Toxicity • Protective filters (like spam filters and ad blockers) • Gender • Linguistic filters and helpers • Fake news • Meme detectors, explanations Slide by Antal van den Bosch
  • 51. Theory of Cultural AI: Understanding & nuance • Understanding concepts • Changes over time • Perspectives • Evolution • Knowing the origins of digital stories • Understanding viral potential • Language is “social and cultural data” (Nguyen, 2017) Slide by Antal van den Bosch
  • 52. Some DHLab projects • Food culture via newspaper recipes (Meertens and IISH) • Analysing online debates: refugee vs migrant (with EUR) • Amsterdam Time Machine (with many partners) • Tracing 18th century career trajectories (with HuC-DI & Huygens Institute) • Analysing the concept ‘violence’ through time (with NLeSc, OU & NIOD)
  • 53. Debates on the refugee crisis • From 2015 on, wider use of both ‘European refugee crisis’ and ‘European migrant crisis’ in the news and social media • “Framing labels” (Knoll, Redlawsk, & Sanborn, 2011) imply two different frames: • ‘Refugee’ – people fleeing conflict or persecution • ‘Migrant’ – people seeking to improve their economic situation • Mixed usage and mislabeling have implications for refugees, e.g. negative influence on perceptions of host countries
  • 55. DHLab@HuC: Advancing the humanities through digital methods • DHLabHuC / adinanerghes / melvinwevers / merpeltje • https://dhlab.nl (under construction) • Melvin Wevers, Adina Nerghes, Marieke van Erp
  • 56. Thank you for your attention!