Session6 02.jeremi ochab

Stylometry of literary papyri
Holger Essler, Jeremi K. Ochab
Institute of Physics
Jagiellonian University
DATeCH 2019
10th May 2019 Brussels

Questions & Aims
How can we correct/improve/disambiguate uncertain or non-specific metadata?
› Author
› Genre
› Place
› Date
› Keywords
› …
› Can we extract them from text?

Data
https://github.com/DCLP/idp.data/tree/dclp/DCLP

Data
14624 metadata
748 transcriptions
298 transcriptions
279 transcriptions (paraliterary)

Data: metadata
14624 metadata
748 transcriptions
• Greek
• known author
• >50 words
298 transcriptions

Data: metadata
14624 metadata
298 transcriptions
www.trismegistos.org/place/2722
www.trismegistos.org/authorwork/3062

Data: metadata
14624 metadata
298 transcriptions
Philodemus
Single-text authors
• Author
• Place (written): region, nome
• Place (found): region, nome
• Date: not before, not after
• Keywords, e.g.: prose, gospel, literature, Christian
• Material, e.g.: papyrus, pottery

Data: cleaning
14624
298 transcriptions
http://papyri.info/docs/leiden_plus

Data: cleaning
298 transcriptions
Manually tagged for:
• line break, paragraph, gap, hand-shift markers
• spelling corrections
• unclear reading
• deletions
• expanded abbreviations
• orthographic regularisation
• diacritical marks
• certainty or reason for above
Two strategies:
› diversifying: by retaining <orig>, <hi>, but omitting <reg> and <ex>
› normalising: by omitting <orig>, <hi>, but retaining <reg> and <ex>
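The two cleaning strategies could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the sample markup, the helper names `local` and `extract_text`, and the decision to drop an omitted element together with its text content are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Elements omitted under each strategy (tag names taken from the slide).
DIVERSIFY_DROP = {"reg", "ex"}   # keep <orig>, <hi>
NORMALISE_DROP = {"orig", "hi"}  # keep <reg>, <ex>

def local(tag):
    """Strip an XML namespace prefix such as '{http://www.tei-c.org/ns/1.0}'."""
    return tag.rsplit("}", 1)[-1]

def extract_text(elem, drop):
    """Collect text recursively, skipping elements whose tag is in `drop`.

    Assumption: a dropped element loses its own content, but the text
    that follows it (its tail) is always kept.
    """
    parts = [elem.text or ""]
    for child in elem:
        if local(child.tag) not in drop:
            parts.append(extract_text(child, drop))
        parts.append(child.tail or "")
    return "".join(parts)

# Hypothetical fragment: a regularised spelling and an expanded abbreviation.
sample = (
    "<ab><choice><reg>τιμήν</reg><orig>τειμήν</orig></choice>"
    " κ<ex>αί</ex> θεός</ab>"
)
root = ET.fromstring(sample)

diversified = extract_text(root, DIVERSIFY_DROP)
normalised = extract_text(root, NORMALISE_DROP)
print(diversified)
print(normalised)
```

The diversifying variant keeps the scribe's spelling and the unexpanded abbreviation, while the normalising variant keeps the editor's regularised and expanded forms.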

Methods
Distance-based clustering
Community detection in networks
Clustering quality measures

Compute text similarity » word frequencies
› Burrows's Delta » taxicab (Manhattan) distance of z-scored frequencies
› Eder's, Argamon's, …
› cosine delta
Burrows, J.F. (2002). "Delta": A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3), 267–287.
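As a rough illustration of the two distances named above, the sketch below z-scores a small table of relative word frequencies and computes Burrows's Delta (mean absolute difference of z-scores) and cosine delta (one minus the cosine similarity of z-score vectors). The frequency values are invented toy numbers, and the function names are hypothetical.

```python
import numpy as np

def zscore(freqs):
    """Z-score each word (column) across texts (rows)."""
    mu = freqs.mean(axis=0)
    sd = freqs.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0  # avoid division by zero for constant columns
    return (freqs - mu) / sd

def burrows_delta(z):
    """Pairwise mean absolute difference of z-scores (scaled Manhattan distance)."""
    return np.abs(z[:, None, :] - z[None, :, :]).mean(axis=2)

def cosine_delta(z):
    """Pairwise cosine distance between z-score vectors."""
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = z / norms
    return 1.0 - unit @ unit.T

# Toy data: 4 texts x 3 most frequent words (relative frequencies, invented).
freqs = np.array([
    [0.05, 0.03, 0.01],
    [0.04, 0.03, 0.02],
    [0.01, 0.06, 0.03],
    [0.02, 0.05, 0.03],
])
z = zscore(freqs)
delta_burrows = burrows_delta(z)
delta_cosine = cosine_delta(z)
print(np.round(delta_burrows, 3))
print(np.round(delta_cosine, 3))
```

Both functions return a symmetric texts-by-texts matrix with a zero diagonal, which can be passed on to clustering or network construction.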

› cosine delta
Hierarchically cluster (unsupervised)
› single, complete, …
› Ward linkage

J.K. Ochab, J. Byszuk, S. Pielström, M. Eder, "Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) versus Network Clustering (Community Detection)," DH Conference Abstracts 2019
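A minimal sketch of unsupervised hierarchical clustering with Ward linkage, using SciPy. The toy points stand in for texts and are invented. Euclidean distances are used here because Ward linkage formally assumes them; applying it to a cosine delta matrix, as on the slide, is a common practical shortcut rather than something this example demonstrates.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy feature vectors: two well-separated groups of three "texts" (invented).
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])

# Condensed pairwise distance vector (Euclidean by default).
condensed = pdist(points)

# Build the dendrogram with Ward linkage; "single" or "complete" work the same way.
tree = linkage(condensed, method="ward")

# Cut the tree into at most 2 flat clusters.
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

The number of clusters passed to `fcluster` has to be chosen by the analyst, which is one of the differences from community detection.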

Community detection
in networks
› Louvain (modularity)
› Infomap
› OSLOM
› …
MEJ Newman, The Structure and Function of Complex Networks, SIAM Review 45 (2003) 167–256
Maciej Eder. 2017. Visualization in stylometry: Cluster analysis using networks. Digital Scholarship in the Humanities 32, 1 (2017), 50–64.
J.K. Ochab, J. Byszuk, S. Pielström, M. Eder, "Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) versus Network Clustering (Community Detection)," DH Conference Abstracts 2019
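A sketch of the network route, assuming NetworkX (2.7 or later) is available. Texts become nodes, each text is linked to its nearest neighbours, and Louvain modularity optimisation finds the communities. The k-nearest-neighbour construction with rank-based weights and the helper name `knn_graph` are assumptions for illustration, not necessarily the network used in the talk.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import louvain_communities

def knn_graph(dist, k=2):
    """Link every text to its k nearest neighbours; closer neighbours get larger weights."""
    n = dist.shape[0]
    graph = nx.Graph()
    graph.add_nodes_from(range(n))
    for i in range(n):
        neighbours = [int(j) for j in np.argsort(dist[i]) if j != i][:k]
        for rank, j in enumerate(neighbours):
            weight = k - rank
            if graph.has_edge(i, j):
                graph[i][j]["weight"] += weight
            else:
                graph.add_edge(i, j, weight=weight)
    return graph

# Toy distance matrix built from two well-separated groups of points (invented).
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

graph = knn_graph(dist, k=2)
communities = louvain_communities(graph, weight="weight", seed=0)
print(sorted(sorted(c) for c in communities))
```

Unlike the dendrogram cut, the number of communities is not fixed in advance; it comes out of the modularity optimisation.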

Many different indices:
› Jaccard, Dunn, silhouette, Davies-Bouldin, …
› Rand index » adjusted
› mutual information » normalised » adjusted » standardised
› some selection bias remaining (number and size of clusters)
Nguyen Xuan Vinh, Julien Epps, and James Bailey. 2009. Information theoretic measures for clusterings comparison: is a correction for chance necessary? In Proceedings of the 26th International Conference on Machine Learning. PMLR. 1073–1080.
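To make the idea of correction for chance concrete, here is a small standard-library sketch of the adjusted Rand index computed from the contingency table of two labellings. The author and cluster labels are invented; the adjusted mutual information follows the same pattern (observed score minus expected score, divided by maximum minus expected) but is not implemented here.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index: pair-counting agreement corrected for chance."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0
    # Pairs of items placed together in both labelings, in A, and in B.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    maximum = (sum_a + sum_b) / 2          # best achievable agreement
    if maximum == expected:
        return 1.0
    return (sum_cells - expected) / (maximum - expected)

authors = ["A", "A", "B", "B"]   # hypothetical ground truth
perfect = [1, 1, 2, 2]           # clustering that recovers the authors
crossed = [1, 2, 1, 2]           # clustering that splits every author

ari_perfect = adjusted_rand_index(authors, perfect)
ari_crossed = adjusted_rand_index(authors, crossed)
print(ari_perfect, ari_crossed)
```

A value near 0 means agreement no better than chance, and negative values mean worse than chance, which is why adjusted scores are easier to compare across different numbers of clusters.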

Results
› Best network clustering
» modularity optimisation: AMI=0.22 (very low)
» number of clusters: 7
› Which similarity measure
» Burrows’s delta: AMI<0.1 (terrible)
» cosine delta: AMI=0.25 (very low, ~0.6 in novels)
» number of clusters: 15-25 (close)
Roger Guimera, Marta Sales-Pardo, and Luís A Nunes Amaral. 2004. Modularity from fluctuations in random graphs and complex networks. Physical Review E 70, 2 (2004), 025101.

Conclusions
› Results
o clustering depends on text regularisation
o trade-off between sparseness and distinctiveness of features(?)
› Problems
o imbalanced data
o texts too small
› Outlook
o N-grams + SVD to circumvent sparseness
o augment texts preserved by medieval transmission
o supervised ML to narrow down: genre/text type, dates, places, …
o Documentary papyri
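The "N-grams + SVD" item of the outlook might look something like the sketch below: character n-gram frequencies give a very sparse matrix for short texts, and a truncated SVD projects it onto a few dense dimensions. This is a guess at the general shape of the idea, not a method reported in the talk; the sample strings, the choice of trigrams, and the helper names are all assumptions.

```python
from collections import Counter
import numpy as np

def char_ngrams(text, n=3):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_matrix(texts, n=3):
    """Texts x n-grams matrix of relative frequencies."""
    counts = [Counter(char_ngrams(t, n)) for t in texts]
    vocab = sorted(set().union(*counts))
    matrix = np.array([[c[g] for g in vocab] for c in counts], dtype=float)
    row_sums = matrix.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # guard against texts shorter than n
    return matrix / row_sums, vocab

def svd_reduce(matrix, k=2):
    """Project centred rows onto the first k singular directions."""
    centred = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, :k] * s[:k]

# Hypothetical short texts (unaccented Greek, for illustration only).
texts = [
    "εν αρχη ην ο λογος",
    "και ο λογος ην προς τον θεον",
    "τον θεον ουδεις εωρακεν",
]
matrix, vocab = ngram_matrix(texts, n=3)
reduced = svd_reduce(matrix, k=2)
print(matrix.shape, reduced.shape)
```

The reduced vectors could then replace raw word frequencies as input to the distance and clustering steps.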

Thank you!
Questions?
J Rybicki
Institute of English Studies
Grants: 2017/26/E/HS2/01019
M Eder
J Byszuk
H Essler
S Pielström
References:
› J.K. Ochab, J. Byszuk, S. Pielström, M. Eder, "Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) versus Network Clustering (Community Detection)," DH Conference Abstracts 2019.
› computationalstylistics.github.io
› https://github.com/computationalstylistics/stylometry_of_papyri
