Stylometry of literary papyri
Holger Essler, Jeremi K. Ochab
Institute of Physics
Jagiellonian University
DATeCH 2019
10th May 2019 Brussels
Questions & Aims
How can we correct, improve or disambiguate
uncertain or unspecific metadata?
› Author
› Genre
› Place
› Date
› Keywords
› …
› Can we extract them from text?
Data
Metadata
Processing
Data
https://github.com/DCLP/idp.data/tree/dclp/DCLP
Data
14624
metadata
279 transcriptions (paraliterary)
298 transcriptions
748 transcriptions
Data: metadata
14624
metadata
298 transcriptions
748 transcriptions
• Greek
• known author
• >50 words
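The three selection criteria on this slide can be read as a filter over the transcription records. A minimal sketch, assuming each record is a dictionary with the hypothetical keys `language`, `author` and `text` (these names are illustrative and not taken from the DCLP data), and counting words by whitespace splitting:

```python
def keep_record(record, min_words=50):
    """Return True for a Greek text with a known author and more than min_words words.

    The keys "language", "author" and "text" are hypothetical field names.
    """
    is_greek = record.get("language") == "Greek"
    has_author = bool(record.get("author"))
    long_enough = len(record.get("text", "").split()) > min_words
    return is_greek and has_author and long_enough


def filter_records(records, min_words=50):
    """Keep only the records that satisfy all three criteria."""
    return [r for r in records if keep_record(r, min_words)]
```

How words are counted in damaged or fragmentary lines is a separate decision that this sketch does not address.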
Data: metadata
14624
metadata
298 transcriptions
www.trismegistos.org/place/2722
www.trismegistos.org/authorwork/3062
Data: metadata
14624
metadata
298 transcriptions
Philodemus
Single-text authors
• Author
• Place (written): region, nome
• Place (found): region, nome
• Date: not before, not after
• Keywords, e.g.: prose, gospel, literature, Christian
• Material, e.g.: papyrus, pottery
Data: cleaning
14624
298 transcriptions
http://papyri.info/docs/leiden_plus
Data: cleaning
298 transcriptions
Manually tagged for:
• line break, paragraph, gap, hand-shift markers
• spelling corrections
• unclear reading
• deletions
• expanded abbreviations
• orthographic regularisation
• diacritical marks
• certainty or reason for above
Two strategies:
› diversifying: retaining <orig>, <hi>, but omitting <reg> and <ex>
› normalising: omitting <orig>, <hi>, but retaining <reg> and <ex>
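The two cleaning strategies on this slide (diversifying and normalising) can be sketched as a text extraction that skips a chosen set of XML elements. This is only an illustration under stated assumptions: the input is assumed to be a well-formed XML fragment, "omitting" is read here as dropping the element together with its text, and the function names are hypothetical. The authors' actual pipeline may differ.

```python
import xml.etree.ElementTree as ET

# Elements dropped by each strategy, as named on the slide.
DIVERSIFYING_DROP = {"reg", "ex"}   # keep <orig> and <hi>
NORMALISING_DROP = {"orig", "hi"}   # keep <reg> and <ex>


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}reg' down to 'reg'."""
    return tag.rsplit("}", 1)[-1]


def collect_text(elem, drop):
    """Concatenate text below elem, skipping every element whose name is in drop.

    Text that follows a dropped element (its tail) is still kept.
    """
    parts = []
    if local_name(elem.tag) not in drop:
        parts.append(elem.text or "")
        for child in elem:
            parts.append(collect_text(child, drop))
            parts.append(child.tail or "")
    return "".join(parts)


def clean(xml_string, strategy):
    """Return plain text for strategy 'diversifying' or 'normalising'."""
    drop = DIVERSIFYING_DROP if strategy == "diversifying" else NORMALISING_DROP
    return collect_text(ET.fromstring(xml_string), drop)
```

For a fragment such as `<ab>foo <choice><orig>teh</orig><reg>the</reg></choice> k<ex>ai</ex> end</ab>`, the diversifying variant keeps the original spelling and leaves the abbreviation unexpanded, while the normalising variant keeps the regularised spelling and the expansion.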
Methods
Distance-based clustering
Community detection in networks
Clustering quality measures
Distance-based clustering
Compute text similarity » word frequencies
› Burrows’s Delta » Manhattan (taxicab) distance of z-scored frequencies
› Eder’s, Argamon’s, …
› cosine Delta
Hierarchically cluster (unsupervised)
› single, complete, …
› Ward linkage
J.K. Ochab, J. Byszuk, S. Pielström, M. Eder, "Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) versus Network Clustering (Community Detection)," DH Conference Abstracts 2019
Burrows, J.F. (2002). “Delta”: a measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3), 267–287.
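The two distance measures named on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the relative frequencies of the most frequent words are already available as one list per text (same word order in every list), needs at least two texts, and uses the sample standard deviation for z-scoring. The function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def z_scores(freqs):
    """Z-score each word's relative frequency across all texts.

    freqs: one list of relative frequencies per text, same word order in each.
    """
    columns = list(zip(*freqs))
    mus = [mean(col) for col in columns]
    # A word with identical frequency everywhere has zero spread; avoid dividing by 0.
    sigmas = [stdev(col) or 1.0 for col in columns]
    return [[(f - m) / s for f, m, s in zip(row, mus, sigmas)] for row in freqs]


def burrows_delta(za, zb):
    """Mean absolute difference of z-scores (Manhattan distance divided by word count)."""
    return mean(abs(a - b) for a, b in zip(za, zb))


def cosine_delta(za, zb):
    """One minus the cosine similarity of two z-score vectors."""
    dot = sum(a * b for a, b in zip(za, zb))
    norm = sqrt(sum(a * a for a in za)) * sqrt(sum(b * b for b in zb))
    return 1.0 - dot / norm if norm else 0.0
```

The resulting pairwise distances are what a hierarchical clustering routine (for example with Ward linkage) would take as input.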
Community detection in networks
› Louvain (modularity)
› Infomap
› OSLOM
› …
MEJ Newman, The Structure and Function of Complex Networks, SIAM Review 45 (2003) 167–256
Maciej Eder. 2017. Visualization in stylometry: Cluster analysis using networks. Digital Scholarship in the Humanities 32, 1 (2017), 50–64.
J.K. Ochab, J. Byszuk, S. Pielström, M. Eder, "Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) versus Network Clustering (Community Detection)," DH Conference Abstracts 2019
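Modularity, the quantity that the Louvain method optimises, can be computed directly for a given partition. The sketch below is not a Louvain implementation: it only scores a partition of a weighted, undirected network without self-loops. The edge-list format and the names `edges` and `community_of` are assumptions made for illustration.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q of a partition of a weighted, undirected graph.

    edges: iterable of (u, v, weight) with u != v, each edge listed once.
    community_of: dict mapping every node to its community label.
    """
    edges = list(edges)
    m = sum(w for _, _, w in edges)          # total edge weight
    intra = defaultdict(float)               # weight of edges inside each community
    degree = defaultdict(float)              # summed node strength per community
    for u, v, w in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += w
        degree[cv] += w
        if cu == cv:
            intra[cu] += w
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)
```

For two triangles joined by a single edge, splitting the nodes into the two triangles gives Q = 5/14, whereas putting all nodes into one community gives Q = 0.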
Clustering quality measures
Many different indices:
› Jaccard, Dunn, silhouette, Davies-Bouldin, …
› Rand index » adjusted
› mutual information » normalised » adjusted » standardised
› some selection bias remaining (number and size of clusters)
Nguyen Xuan Vinh, Julien Epps, and James Bailey. 2009. Information theoretic measures for clusterings comparison: is a correction for chance necessary? In Proceedings of the 26th International Conference on Machine Learning. PMLR. 1073–1080.
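As an illustration of a chance-corrected index, the adjusted Rand index can be computed from the contingency table of two labelings. This sketch covers only the adjusted Rand index. Adjusted mutual information, the index reported in the results, additionally needs an expected mutual information term that is not shown here. The function name is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_a)
    if n < 2:
        return 1.0
    # Pairs of items that share a cluster in both labelings, in A only, in B only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)    # value expected by chance
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:                  # degenerate case, e.g. one cluster each
        return 1.0
    return (sum_cells - expected) / (maximum - expected)
```

The index is 1 for identical partitions regardless of how the clusters are labelled, and can become negative when agreement is below the chance level.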
Results
It is hard!
Results
› Best network clustering
» modularity optimisation: AMI (adjusted mutual information) = 0.22 (very low)
» number of clusters: 7
› Which similarity measure
» Burrows’s delta: AMI<0.1 (terrible)
» cosine delta: AMI=0.25 (very low, ~0.6 in novels)
» number of clusters: 15-25 (close)
Roger Guimera, Marta Sales-Pardo, and Luís A Nunes Amaral. 2004. Modularity from
fluctuations in random graphs and complex networks. Physical Review E 70, 2 (2004),
025101.
Results
Maciej Eder. 2017. Visualization in stylometry: Cluster analysis using networks. Digital
Scholarship in the Humanities 32, 1 (2017), 50–64.
Conclusions
› Results
o clustering depends on text regularisation
o trade-off between sparseness and distinctiveness of features(?)
› Problems
o imbalanced data
o texts too small
› Outlook
o N-grams + SVD to circumvent sparseness
o augment texts preserved by medieval transmission
o supervised ML to narrow down: genre/text type, dates, places, …
o documentary papyri
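The first outlook item could start from n-gram profiles like the following. This is only a sketch under assumptions: the slide does not say whether word or character n-grams are meant, so character n-grams are assumed here, and the function names are hypothetical. The SVD step would need a linear-algebra library and is not shown.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a text (whitespace is kept)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def relative_frequencies(counts):
    """Turn raw n-gram counts into relative frequencies."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}
```

Short n-grams occur far more often than whole words, which is why they are a common way to obtain denser features from very short texts.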
J Rybicki
Institute of English Studies
Jagiellonian University
Grants: 2017/26/E/HS2/01019
M Eder
J Byszuk
H Essler
S Pielström
References:
› J.K. Ochab, J. Byszuk, S. Pielström, M. Eder, "Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) versus Network Clustering (Community Detection)," DH Conference Abstracts 2019.
› computationalstylistics.github.io
› https://github.com/computationalstylistics/stylometry_of_papyri
Thank you!
Questions?

Contenu connexe

Similaire à Session6 02.jeremi ochab

AI Beyond Deep Learning
AI Beyond Deep LearningAI Beyond Deep Learning
AI Beyond Deep LearningAndre Freitas
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data MiningAbcdDcba12
 
A Recommender Story: Improving Backend Data Quality While Reducing Costs
A Recommender Story: Improving Backend Data Quality While Reducing CostsA Recommender Story: Improving Backend Data Quality While Reducing Costs
A Recommender Story: Improving Backend Data Quality While Reducing CostsDatabricks
 
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...Andre Freitas
 
Multi-Model Data Query Languages and Processing Paradigms
Multi-Model Data Query Languages and Processing ParadigmsMulti-Model Data Query Languages and Processing Paradigms
Multi-Model Data Query Languages and Processing ParadigmsJiaheng Lu
 
bridging formal semantics and social semantics on the web
bridging formal semantics and social semantics on the webbridging formal semantics and social semantics on the web
bridging formal semantics and social semantics on the webFabien Gandon
 
Improving Semantic Search Using Query Log Analysis
Improving Semantic Search Using Query Log AnalysisImproving Semantic Search Using Query Log Analysis
Improving Semantic Search Using Query Log AnalysisStuart Wrigley
 
A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and ...
A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and ...A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and ...
A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and ...Marko Rodriguez
 
Computer Software in Qualitative Research: An Introduction to NVivo
Computer Software in Qualitative Research: An Introduction to NVivoComputer Software in Qualitative Research: An Introduction to NVivo
Computer Software in Qualitative Research: An Introduction to NVivoAdam Perzynski, PhD
 
Easing embedding learning by comprehensive transcription of heterogeneous inf...
Easing embedding learning by comprehensive transcription of heterogeneous inf...Easing embedding learning by comprehensive transcription of heterogeneous inf...
Easing embedding learning by comprehensive transcription of heterogeneous inf...paper_reader
 
Creating an Urban Legend: A System for Electrophysiology Data Management and ...
Creating an Urban Legend: A System for Electrophysiology Data Management and ...Creating an Urban Legend: A System for Electrophysiology Data Management and ...
Creating an Urban Legend: A System for Electrophysiology Data Management and ...Anita de Waard
 
Higher-order spectral graph clustering with motifs
Higher-order spectral graph clustering with motifsHigher-order spectral graph clustering with motifs
Higher-order spectral graph clustering with motifsAustin Benson
 
P2P Resource Discovery for the Browser
P2P Resource Discovery for the BrowserP2P Resource Discovery for the Browser
P2P Resource Discovery for the BrowserDavid Dias
 

Similaire à Session6 02.jeremi ochab (20)

AI Beyond Deep Learning
AI Beyond Deep LearningAI Beyond Deep Learning
AI Beyond Deep Learning
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Sensors1(1)
Sensors1(1)Sensors1(1)
Sensors1(1)
 
A Recommender Story: Improving Backend Data Quality While Reducing Costs
A Recommender Story: Improving Backend Data Quality While Reducing CostsA Recommender Story: Improving Backend Data Quality While Reducing Costs
A Recommender Story: Improving Backend Data Quality While Reducing Costs
 
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
 
Multi-Model Data Query Languages and Processing Paradigms
Multi-Model Data Query Languages and Processing ParadigmsMulti-Model Data Query Languages and Processing Paradigms
Multi-Model Data Query Languages and Processing Paradigms
 
CBS CEDAR Presentation
CBS CEDAR PresentationCBS CEDAR Presentation
CBS CEDAR Presentation
 
bridging formal semantics and social semantics on the web
bridging formal semantics and social semantics on the webbridging formal semantics and social semantics on the web
bridging formal semantics and social semantics on the web
 
Improving Semantic Search Using Query Log Analysis
Improving Semantic Search Using Query Log AnalysisImproving Semantic Search Using Query Log Analysis
Improving Semantic Search Using Query Log Analysis
 
A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and ...
A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and ...A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and ...
A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and ...
 
Computer Software in Qualitative Research: An Introduction to NVivo
Computer Software in Qualitative Research: An Introduction to NVivoComputer Software in Qualitative Research: An Introduction to NVivo
Computer Software in Qualitative Research: An Introduction to NVivo
 
Easing embedding learning by comprehensive transcription of heterogeneous inf...
Easing embedding learning by comprehensive transcription of heterogeneous inf...Easing embedding learning by comprehensive transcription of heterogeneous inf...
Easing embedding learning by comprehensive transcription of heterogeneous inf...
 
unit 1 DATA MINING.ppt
unit 1 DATA MINING.pptunit 1 DATA MINING.ppt
unit 1 DATA MINING.ppt
 
Fusing semantic data
Fusing semantic dataFusing semantic data
Fusing semantic data
 
Creating an Urban Legend: A System for Electrophysiology Data Management and ...
Creating an Urban Legend: A System for Electrophysiology Data Management and ...Creating an Urban Legend: A System for Electrophysiology Data Management and ...
Creating an Urban Legend: A System for Electrophysiology Data Management and ...
 
Higher-order spectral graph clustering with motifs
Higher-order spectral graph clustering with motifsHigher-order spectral graph clustering with motifs
Higher-order spectral graph clustering with motifs
 
Tianpei research summary
Tianpei research summaryTianpei research summary
Tianpei research summary
 
tianpei_research_summary
tianpei_research_summarytianpei_research_summary
tianpei_research_summary
 
Sybrandt Thesis Proposal Presentation
Sybrandt Thesis Proposal PresentationSybrandt Thesis Proposal Presentation
Sybrandt Thesis Proposal Presentation
 
P2P Resource Discovery for the Browser
P2P Resource Discovery for the BrowserP2P Resource Discovery for the Browser
P2P Resource Discovery for the Browser
 

Plus de IMPACT Centre of Competence

Plus de IMPACT Centre of Competence (20)

Session6 01.helmut schmid
Session6 01.helmut schmidSession6 01.helmut schmid
Session6 01.helmut schmid
 
Session1 03.hsian-an wang
Session1 03.hsian-an wangSession1 03.hsian-an wang
Session1 03.hsian-an wang
 
Session7 03.katrien depuydt
Session7 03.katrien depuydtSession7 03.katrien depuydt
Session7 03.katrien depuydt
 
Session7 02.peter kiraly
Session7 02.peter kiralySession7 02.peter kiraly
Session7 02.peter kiraly
 
Session6 04.giuseppe celano
Session6 04.giuseppe celanoSession6 04.giuseppe celano
Session6 04.giuseppe celano
 
Session6 03.sandra young
Session6 03.sandra youngSession6 03.sandra young
Session6 03.sandra young
 
Session5 04.evangelos varthis
Session5 04.evangelos varthisSession5 04.evangelos varthis
Session5 04.evangelos varthis
 
Session5 03.george rehm
Session5 03.george rehmSession5 03.george rehm
Session5 03.george rehm
 
Session5 02.tom derrick
Session5 02.tom derrickSession5 02.tom derrick
Session5 02.tom derrick
 
Session5 01.rutger vankoert
Session5 01.rutger vankoertSession5 01.rutger vankoert
Session5 01.rutger vankoert
 
Session4 04.senka drobac
Session4 04.senka drobacSession4 04.senka drobac
Session4 04.senka drobac
 
Session3 04.arnau baro
Session3 04.arnau baroSession3 04.arnau baro
Session3 04.arnau baro
 
Session3 03.christian clausner
Session3 03.christian clausnerSession3 03.christian clausner
Session3 03.christian clausner
 
Session3 02.kimmo ketunnen
Session3 02.kimmo ketunnenSession3 02.kimmo ketunnen
Session3 02.kimmo ketunnen
 
Session3 01.clemens neudecker
Session3 01.clemens neudeckerSession3 01.clemens neudecker
Session3 01.clemens neudecker
 
Session2 04.ashkan ashkpour
Session2 04.ashkan ashkpourSession2 04.ashkan ashkpour
Session2 04.ashkan ashkpour
 
Session2 03.juri opitz
Session2 03.juri opitzSession2 03.juri opitz
Session2 03.juri opitz
 
Session2 02.christian reul
Session2 02.christian reulSession2 02.christian reul
Session2 02.christian reul
 
Session2 01.emad mohamed
Session2 01.emad mohamedSession2 01.emad mohamed
Session2 01.emad mohamed
 
Session1 04.florian fink
Session1 04.florian finkSession1 04.florian fink
Session1 04.florian fink
 

Dernier

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 

Dernier (20)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 

Session6 02.jeremi ochab

  • 1. Stylometry of literary papyri Holger Essler, Jeremi K. Ochab Institute of Physics Jagiellonian University DATeCH 2019 10th May 2019 Brussels
  • 3. Questions&Aims How can we correct/improve/disambiguate uncertain or not specific metadata? › Author › Genre › Place › Date › Keywords › …
  • 4. Questions&Aims How can we correct/improve/disambiguate uncertain or not specific metadata? › Author › Genre › Place › Date › Keywords › … › Can we extract them from text?
  • 5. Questions&Aims How can we correct/improve/disambiguate uncertain or not specific metadata? › Author › Genre › Place › Date › Keywords › … › Can we extract them from text?
  • 15. Data: metadata 14624 metadata 748 transcriptions • Greek • known author • >50 words
  • 16. Data: metadata 14624 metadata 298 transcriptions 748 transcriptions • Greek • known author • >50 words
  • 19. Data: metadata 14624 metadata 298 transcriptions • Author • Place (written): region, nome • Place (found): region, nome • Date: not before, not after • Keywords, e.g.: prose, gospel, literature, Christian • Material, e.g.: papyrus, pottery
  • 20. Data: metadata 14624 metadata 298 transcriptions • Author • Place (written): region, nome • Place (found): region, nome • Date: not before, not after • Keywords, e.g.: prose, gospel, literature, Christian • Material, e.g.: papyrus, pottery
  • 21. Data: metadata 14624 metadata 298 transcriptions Philodemus Single-text authors • Author • Place (written): region, nome • Place (found): region, nome • Date: not before, not after • Keywords, e.g.: prose, gospel, literature, Christian • Material, e.g.: papyrus, pottery
  • 27. Data: cleaning 14624 298 transcriptions Manually tagged for: • line break, paragraph, gap, hand-shift markers • spelling corrections • unclear reading • deletions • expanded abbreviations • orthographic regularisation • diacritical marks • certainty or reason for above
  • 29. Data: cleaning 298 transcriptions Two strategies: › diversifying: by retaining <orig>, <hi>, but omitting <reg> and <ex> › normalising: by omitting <orig>, <hi>, but retaining <reg> and <ex> Manually tagged for: • line break, paragraph, gap, hand-shift markers • spelling corrections • unclear reading • deletions • expanded abbreviations • orthographic regularisation • diacritical marks • certainty or reason for above
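The two cleaning strategies on the slide above can be sketched as a filter over the tagged transcription. This is only an illustration under stated assumptions: the fragment is hypothetical (not taken from the corpus), TEI namespaces are left out, and "omitting" a tag is interpreted here as dropping the element together with its text, which may not be how the authors treat every tag (for <hi> in particular). The helper name `extract_text` is made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Tag sets follow the slide; reading "omit" as "drop element and its text" is an assumption.
DIVERSIFYING_DROP = {"reg", "ex"}   # keep <orig>, <hi>
NORMALISING_DROP = {"orig", "hi"}   # keep <reg>, <ex>


def extract_text(element, drop):
    """Collect the text of `element`, skipping descendants whose tag is in `drop`."""
    parts = [element.text or ""]
    for child in element:
        if child.tag not in drop:
            parts.append(extract_text(child, drop))
        # Tail text follows the child's closing tag and belongs to the parent, so it is kept.
        parts.append(child.tail or "")
    return "".join(parts)


# Hypothetical fragment: one regularised spelling and one expanded abbreviation.
SAMPLE = (
    "<ab>και <choice><reg>τιμη</reg><orig>τειμη</orig></choice> "
    "θ<ex>εο</ex>υ</ab>"
)

root = ET.fromstring(SAMPLE)
diversified = extract_text(root, DIVERSIFYING_DROP)
normalised = extract_text(root, NORMALISING_DROP)
print(diversified)
print(normalised)
```

The diversifying variant keeps the scribe's spelling and the unexpanded abbreviation, so it yields more distinct word forms; the normalising variant merges them into regularised forms.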
  • 30. Methods Distance-based clustering Community detection in networks Clustering quality measures
  • 32. Distance-based clustering Compute text similarity » word frequencies
  • 33. Distance-based clustering Compute text similarity » word frequencies › Burrows's Delta » Manhattan (taxicab) distance of z-scored frequencies › Eder's, Argamon's, … › cosine delta Burrows, J.F. (2002). “Delta”: A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3), 267–287.
  • 34. Distance-based clustering Compute text similarity » word frequencies › Burrows's Delta » Manhattan (taxicab) distance of z-scored frequencies › Eder's, Argamon's, … › cosine delta Hierarchically cluster (unsupervised) › single, complete, … › Ward linkage Burrows, J.F. (2002). “Delta”: A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3), 267–287.
  • 35. Distance-based clustering Compute text similarity » word frequencies › Burrows's Delta » Manhattan (taxicab) distance of z-scored frequencies › Eder's, Argamon's, … › cosine delta Hierarchically cluster (unsupervised) › single, complete, … › Ward linkage J.K. Ochab, J. Byszuk, S. Pielström, M. Eder, "Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) versus Network Clustering (Community Detection)," DH Conference Abstracts 2019 Burrows, J.F. (2002). “Delta”: A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3), 267–287.
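As a rough illustration of the distance step named on these slides, the sketch below z-scores relative word frequencies across texts and compares two texts with Burrows's Delta (mean absolute difference of z-scores) and cosine delta (one minus the cosine of the angle between z-score vectors). The three token lists are invented toy data, the sample standard deviation is an arbitrary choice, and the hierarchical linkage step is not shown; none of this is the authors' actual pipeline.

```python
import math
from collections import Counter


def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary word in one text."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def z_scores(freq_rows):
    """Standardise each word's frequency across texts (column-wise z-score)."""
    n = len(freq_rows)
    columns = list(zip(*freq_rows))
    means = [sum(col) / n for col in columns]
    # Sample standard deviation (n - 1); an assumption of this sketch.
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / (n - 1))
        for col, m in zip(columns, means)
    ]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in freq_rows
    ]


def burrows_delta(z_a, z_b):
    """Manhattan distance between z-score vectors, averaged over words."""
    return sum(abs(a - b) for a, b in zip(z_a, z_b)) / len(z_a)


def cosine_delta(z_a, z_b):
    """One minus the cosine similarity of the z-score vectors."""
    dot = sum(a * b for a, b in zip(z_a, z_b))
    norm_a = math.sqrt(sum(a * a for a in z_a))
    norm_b = math.sqrt(sum(b * b for b in z_b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # cosine is undefined for a zero vector; treated as maximally uninformative
    return 1.0 - dot / (norm_a * norm_b)


# Invented toy texts, already tokenised.
texts = [
    "και το και δε το και".split(),
    "δε δε το και δε το".split(),
    "το το το και δε το".split(),
]
vocabulary = ["και", "το", "δε"]

z = z_scores([relative_frequencies(t, vocabulary) for t in texts])
delta_matrix = [[burrows_delta(a, b) for b in z] for a in z]
cosine_matrix = [[cosine_delta(a, b) for b in z] for a in z]
```

Either matrix could then be handed to a hierarchical clustering routine with the chosen linkage (single, complete, Ward).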
  • 36. Community detection in networks MEJ Newman, The Structure and Function of Complex Networks, SIAM REVIEW 45 (2003) 167–256
  • 37. Community detection in networks › Louvain (modularity) › Infomap › OSLOM › … MEJ Newman, The Structure and Function of Complex Networks, SIAM REVIEW 45 (2003) 167–256
  • 38. Community detection in networks MEJ Newman, The Structure and Function of Complex Networks, SIAM REVIEW 45 (2003) 167–256 Maciej Eder. 2017. Visualization in stylometry: Cluster analysis using networks. Digital Scholarship in the Humanities 32, 1 (2017), 50–64. › Louvain (modularity) › Infomap › OSLOM › …
  • 39. Community detection in networks MEJ Newman, The Structure and Function of Complex Networks, SIAM REVIEW 45 (2003) 167–256 Maciej Eder. 2017. Visualization in stylometry: Cluster analysis using networks. Digital Scholarship in the Humanities 32, 1 (2017), 50–64. J.K. Ochab, J. Byszuk, S. Pielström, M. Eder, "Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) versus Network Clustering (Community Detection)," DH Conference Abstracts 2019 › Louvain (modularity) › Infomap › OSLOM › …
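Modularity-based methods such as Louvain search for the partition of a network that maximises the modularity score Q. The sketch below only evaluates Q for a given partition; it is not the Louvain algorithm itself, and the six-node graph (two triangles joined by one edge) is an invented example rather than a network of papyri.

```python
def modularity(weights, communities):
    """Newman modularity Q of a partition of an undirected weighted graph.

    `weights` is a symmetric adjacency matrix; `communities[i]` is the label of node i.
    Q = (1 / 2m) * sum_ij [A_ij - k_i * k_j / 2m] * delta(c_i, c_j)
    """
    n = len(weights)
    degree = [sum(row) for row in weights]
    two_m = sum(degree)  # twice the total edge weight
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                q += weights[i][j] - degree[i] * degree[j] / two_m
    return q / two_m


# Toy graph: triangle 0-1-2, triangle 3-4-5, bridge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adjacency = [[0] * 6 for _ in range(6)]
for a, b in edges:
    adjacency[a][b] = 1
    adjacency[b][a] = 1

q_split = modularity(adjacency, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_single = modularity(adjacency, [0, 0, 0, 0, 0, 0])  # everything in one community
print(q_split, q_single)
```

Splitting the graph along the bridge scores higher than lumping all nodes together, which is the kind of comparison a modularity optimiser repeats over many candidate partitions.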
  • 40. Clustering quality measures Many different indices: › Jaccard, Dunn, silhouette, Davies-Bouldin, …
  • 41. Clustering quality measures Many different indices: › Jaccard, Dunn, silhouette, Davies-Bouldin, … › Rand index › mutual information
  • 42. Clustering quality measures Nguyen Xuan Vinh, Julien Epps, and James Bailey. 2009. Information theoretic measures for clusterings comparison: is a correction for chance necessary?. In Proceedings of the 26th International Conference on Machine Learning. PMLR. 1073–1080. Many different indices: › Jaccard, Dunn, silhouette, Davies-Bouldin, … › Rand index » adjusted › mutual inf. » normalised » adjusted » standardised
  • 43. Clustering quality measures Nguyen Xuan Vinh, Julien Epps, and James Bailey. 2009. Information theoretic measures for clusterings comparison: is a correction for chance necessary?. In Proceedings of the 26th International Conference on Machine Learning. PMLR. 1073–1080. Many different indices: › Jaccard, Dunn, silhouette, Davies-Bouldin, … › Rand index » adjusted › mutual inf. » normalised » adjusted » standardised › some selection bias remaining (number and size of clusters)
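To make the idea of a chance-corrected index concrete, here is a minimal sketch of the adjusted Rand index, one of the measures listed above. It compares a clustering against reference labels (for instance, known authors) and subtracts the agreement expected by chance. The results slides report AMI, not ARI; this example and its label lists are illustrative assumptions only.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_true)
    # Pairs of items that share a cluster in both labelings (contingency-table cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs sharing a cluster in each labeling separately (row and column marginals).
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both labelings are trivial and identical
    return (sum_cells - expected) / (maximum - expected)


authors = ["A", "A", "B", "B"]          # hypothetical reference labels
same_up_to_names = [1, 1, 0, 0]         # identical grouping, different label names
crossed = [0, 1, 0, 1]                  # splits every author across clusters

ari_same = adjusted_rand_index(authors, same_up_to_names)
ari_crossed = adjusted_rand_index(authors, crossed)
print(ari_same, ari_crossed)
```

A score of 1 means the groupings coincide, values near 0 mean agreement no better than chance, and negative values mean agreement worse than chance.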
  • 46. Results › Best network clustering » modularity optimisation: AMI=0.22 (very low) » number of clusters: 7 Roger Guimera, Marta Sales-Pardo, and Luís A Nunes Amaral. 2004. Modularity from fluctuations in random graphs and complex networks. Physical Review E 70, 2 (2004), 025101.
  • 47. Results › Best network clustering » modularity optimisation: AMI=0.22 (very low) » number of clusters: 7 › Which similarity measure » Burrows’s delta: AMI<0.1 (terrible) » cosine delta: AMI=0.25 (very low, ~0.6 in novels) » number of clusters: 15-25 (close) Roger Guimera, Marta Sales-Pardo, and Luís A Nunes Amaral. 2004. Modularity from fluctuations in random graphs and complex networks. Physical Review E 70, 2 (2004), 025101.
  • 51. Results Maciej Eder. 2017. Visualization in stylometry: Cluster analysis using networks. Digital Scholarship in the Humanities 32, 1 (2017), 50–64.
  • 53. Conclusions › Results o clustering depends on text regularisation o trade-off between sparseness and distinctiveness of features(?) › Problems: imbalanced data, text sizes › Outlook: N-grams + SVD to circumvent sparseness; augment texts preserved by medieval transmission; supervised ML; predict or narrow down: genre/text type, dates, places, …; documentary papyri
  • 54. Conclusions › Results o clustering depends on text regularisation o trade-off between sparseness and distinctiveness of features(?) › Problems o imbalanced data o texts too small › Outlook: N-grams + SVD to circumvent sparseness; augment texts preserved by medieval transmission; supervised ML; predict or narrow down: genre/text type, dates, places, …; documentary papyri
  • 55. Conclusions › Results o clustering depends on text regularisation o trade-off between sparseness and distinctiveness of features(?) › Problems o imbalanced data o texts too small › Outlook o N-grams + SVD to circumvent sparseness o augment texts preserved by medieval transmission o supervised ML to narrow down: genre/text type, dates, places, … o documentary papyri
  • 56. J Rybicki Institute of English Studies Jagiellonian University Grants: 2017/26/E/HS2/01019 M Eder J Byszuk H Essler S Pielström References: › J.K. Ochab, J. Byszuk, S. Pielström, M. Eder, "Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) versus Network Clustering (Community Detection)," DH Conference Abstracts 2019. › computationalstylistics.github.io › https://github.com/computationalstylistics/stylometry_of_papyri
  • 57. Thank you! Questions? J Rybicki Institute of English Studies Jagiellonian University Grants: 2017/26/E/HS2/01019 M Eder J Byszuk H Essler S Pielström References: › J.K. Ochab, J. Byszuk, S. Pielström, M. Eder, "Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) versus Network Clustering (Community Detection)," DH Conference Abstracts 2019. › computationalstylistics.github.io › https://github.com/computationalstylistics/stylometry_of_papyri