4. Questions & Aims
How can we correct, improve, or disambiguate
uncertain or unspecific metadata?
› Author
› Genre
› Place
› Date
› Keywords
› …
› Can we extract them from the text?
19. Data: metadata
14624 metadata records
298 transcriptions
• Author
• Place (written): region, nome
• Place (found): region, nome
• Date: not before, not after
• Keywords, e.g.: prose, gospel, literature, Christian
• Material, e.g.: papyrus, pottery
29. Data: cleaning
298 transcriptions
Two strategies:
› diversifying: retain <orig> and <hi>, omit <reg> and <ex>
› normalising: omit <orig> and <hi>, retain <reg> and <ex>
Manually tagged for:
• line break, paragraph, gap, hand-shift markers
• spelling corrections
• unclear reading
• deletions
• expanded abbreviations
• orthographic regularisation
• diacritical marks
• certainty or reason for above
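The two cleaning strategies on this slide can be sketched as a tag filter over the transcription markup. This is only an illustration under stated assumptions: the sample fragment is invented, the names `DROP`, `flatten` and `clean` are hypothetical, and the sketch assumes that "omitting" an element removes its enclosed text as well as the tag (the slide does not say whether, for <hi>, only the markup should go).

```python
import xml.etree.ElementTree as ET

# Tags omitted under each strategy, as listed on the slide.
DROP = {
    "diversifying": {"reg", "ex"},   # keep <orig>, <hi>
    "normalising": {"orig", "hi"},   # keep <reg>, <ex>
}

def local_name(tag):
    """Strip an XML namespace prefix such as '{http://www.tei-c.org/ns/1.0}reg'."""
    return tag.rsplit("}", 1)[-1]

def flatten(elem, drop):
    """Collect the text of elem, skipping descendants whose tag is in drop.

    Assumption: a dropped element loses its enclosed text; the text that
    follows it (its tail) is always kept.
    """
    parts = [elem.text or ""]
    for child in elem:
        if local_name(child.tag) not in drop:
            parts.append(flatten(child, drop))
        parts.append(child.tail or "")
    return "".join(parts)

def clean(xml_fragment, strategy):
    """Return whitespace-normalised plain text for one strategy."""
    root = ET.fromstring(xml_fragment)
    return " ".join(flatten(root, DROP[strategy]).split())

# Invented sample, not taken from the corpus.
SAMPLE = (
    "<ab>"
    "<choice><orig>teh</orig><reg>the</reg></choice> "
    "<expan>k<ex>ing</ex></expan> "
    '<hi rend="supraline">x</hi>'
    "</ab>"
)

print(clean(SAMPLE, "diversifying"))  # teh k x
print(clean(SAMPLE, "normalising"))   # the king
```

The diversifying output keeps the scribe's spelling and the unexpanded abbreviation, while the normalising output keeps the editor's regularised forms, so the two strategies yield different word-frequency profiles for the same papyrus.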
35. Distance-based clustering
Compute text similarity » word frequencies
› Burrows’s Delta » Manhattan (taxicab) distance of z-scored frequencies
› Eder’s, Argamon’s, …
› cosine Delta
Hierarchically cluster (unsupervised)
› single, complete, …
› Ward linkage
J.K. Ochab, J. Byszuk, S. Pielström, M. Eder, "Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) versus Network Clustering (Community Detection)," DH Conference Abstracts 2019.
Burrows, J.F. (2002). “Delta”: A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3), 267–287.
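The two distances named on this slide can be sketched in plain Python. This is a minimal illustration, not the pipeline used for the papyri: the toy texts are invented, the function names are hypothetical, Burrows's Delta is taken as the mean absolute difference of z-scores (Manhattan distance divided by the number of words), and cosine Delta as one minus the cosine similarity of the z-score vectors.

```python
import math
from collections import Counter

def most_frequent_words(texts, n):
    """Return the n most frequent words over the whole corpus."""
    totals = Counter(tok for tokens in texts for tok in tokens)
    return [word for word, _ in totals.most_common(n)]

def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in one text."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]

def zscore_columns(rows):
    """Z-score each word (column) across texts (rows), sample std. dev."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return [[(x - m) / s if s > 0 else 0.0
             for x, m, s in zip(row, means, sds)] for row in rows]

def burrows_delta(z1, z2):
    """Mean absolute difference of z-scores (scaled Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(z1, z2)) / len(z1)

def cosine_delta(z1, z2):
    """One minus cosine similarity of z-scores; assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(z1, z2))
    norm = math.sqrt(sum(a * a for a in z1)) * math.sqrt(sum(b * b for b in z2))
    return 1.0 - dot / norm

# Invented toy corpus: texts 0 and 1 differ in a single word.
TEXTS = [
    "the cat and the dog and the bird".split(),
    "the cat and the dog and the fish".split(),
    "a bird of a kind of a bird".split(),
]
VOCAB = most_frequent_words(TEXTS, 5)
Z = zscore_columns([relative_freqs(t, VOCAB) for t in TEXTS])

for i, j in [(0, 1), (0, 2), (1, 2)]:
    print(i, j, round(burrows_delta(Z[i], Z[j]), 3),
          round(cosine_delta(Z[i], Z[j]), 3))
```

The resulting pairwise distance matrix is what a hierarchical method such as Ward linkage, or a network built from nearest neighbours, would then take as input.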
39. Community detection
in networks
› Louvain (modularity)
› Infomap
› OSLOM
› …
MEJ Newman, The Structure and Function of Complex Networks, SIAM Review 45 (2003) 167–256
Maciej Eder. 2017. Visualization in stylometry: Cluster analysis using networks. Digital Scholarship in the Humanities 32, 1 (2017), 50–64.
J.K. Ochab, J. Byszuk, S. Pielström, M. Eder, "Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) versus Network Clustering (Community Detection)," DH Conference Abstracts 2019.
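Louvain searches for the partition that maximises modularity. The sketch below computes only that objective for a given partition of a weighted, undirected network, using Q = Σ_c [ L_c / m − (d_c / 2m)² ], where L_c is the edge weight inside community c, d_c the summed node strength of c, and m the total edge weight. It is not the Louvain algorithm itself, and the toy network and the function name `modularity` are assumptions made for illustration.

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman modularity of a partition.

    edges: dict mapping (u, v) to a positive weight; each undirected
           edge appears once, no self-loops.
    community: dict mapping node to a community label.
    """
    two_m = 2.0 * sum(edges.values())
    strength = defaultdict(float)   # weighted degree of each node
    internal = defaultdict(float)   # edge weight inside each community
    for (u, v), w in edges.items():
        strength[u] += w
        strength[v] += w
        if community[u] == community[v]:
            internal[community[u]] += w
    total = defaultdict(float)      # summed strength per community
    for node, k in strength.items():
        total[community[node]] += k
    return sum(2.0 * internal[c] / two_m - (total[c] / two_m) ** 2
               for c in total)

# Toy network: two triangles joined by a single bridge edge.
EDGES = {
    ("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0,
    ("d", "e"): 1.0, ("e", "f"): 1.0, ("d", "f"): 1.0,
    ("c", "d"): 1.0,
}
TWO_GROUPS = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
ONE_GROUP = {node: 0 for node in TWO_GROUPS}

print(modularity(EDGES, TWO_GROUPS))  # 5/14, about 0.357
print(modularity(EDGES, ONE_GROUP))   # 0.0
```

In a stylometric setting the nodes would be texts and the edge weights would come from the similarity measure, for example links to each text's nearest neighbours.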
43. Clustering quality measures
Many different indices:
› Jaccard, Dunn, silhouette, Davies-Bouldin, …
› Rand index » adjusted
› mutual information » normalised » adjusted » standardised
› some selection bias remaining (number and size of clusters)
Nguyen Xuan Vinh, Julien Epps, and James Bailey. 2009. Information theoretic measures for clusterings comparison: is a correction for chance necessary? In Proceedings of the 26th International Conference on Machine Learning. PMLR. 1073–1080.
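The idea of adjusting an index for chance can be illustrated with the adjusted Rand index, which subtracts the agreement expected for random labelings with the same cluster sizes. The sketch below is a plain-Python version under the usual pair-counting definition; the function name and the toy labels are assumptions, and the adjusted mutual information (AMI) follows the same correction principle but is not implemented here.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index of two labelings of the same items (n >= 2)."""
    n = len(labels_true)
    # Pairs of items placed together in both clusterings.
    together = sum(comb(c, 2)
                   for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs placed together in each clustering separately.
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = pairs_true * pairs_pred / comb(n, 2)   # chance level
    maximum = (pairs_true + pairs_pred) / 2
    if maximum == expected:
        return 1.0   # degenerate case: both partitions are trivial
    return (together - expected) / (maximum - expected)

# Same grouping with renamed labels: perfect agreement.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
# Grouping that splits every true cluster: worse than chance.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5
```

Values near 0 indicate agreement no better than chance, which is how low scores such as those reported for the papyri should be read.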
47. Results
› Best network clustering
» modularity optimisation: AMI = 0.22 (very low)
» number of clusters: 7
› Which similarity measure
» Burrows’s Delta: AMI < 0.1 (terrible)
» cosine Delta: AMI = 0.25 (very low, ~0.6 in novels)
» number of clusters: 15–25 (close)
Roger Guimera, Marta Sales-Pardo, and Luís A Nunes Amaral. 2004. Modularity from fluctuations in random graphs and complex networks. Physical Review E 70, 2 (2004), 025101.
51. Results
Maciej Eder. 2017. Visualization in stylometry: Cluster analysis using networks. Digital
Scholarship in the Humanities 32, 1 (2017), 50–64.
55. Conclusions
› Results
o clustering depends on text regularisation
o trade-off between sparseness and distinctiveness of features(?)
› Problems
o imbalanced data
o texts too small
› Outlook
o N-grams + SVD to circumvent sparseness
o augment texts preserved by medieval transmission
o supervised ML to narrow down: genre/text type, dates, places, …
o Documentary papyri
56. J Rybicki
Institute of English Studies
Jagiellonian University
Grants: 2017/26/E/HS2/01019
M Eder
J Byszuk
H Essler
S Pielström
References:
› J.K. Ochab, J. Byszuk, S. Pielström, M. Eder, "Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) versus Network Clustering (Community Detection)," DH Conference Abstracts 2019.
› computationalstylistics.github.io
› https://github.com/computationalstylistics/stylometry_of_papyri