This document proposes a methodology for discovering patterns in scientific literature using a case study of digital library evaluation. It involves:
1. Classifying documents to identify relevant papers using naive Bayes classification.
2. Semantically annotating papers with concepts from a Digital Library Evaluation Ontology using the GoNTogle annotation tool. Over 2,600 annotations were generated.
3. Clustering the annotated papers into coherent groups using k-means clustering.
4. Interpreting the clusters with the assistance of the ontology to discover patterns and trends in the literature. Benchmarking tests were performed to evaluate the effectiveness of the methodology.
Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology
Eleni Afiontzi,1 Giannis Kazadeis,1 Leonidas Papachristopoulos,2 Michalis Sfakakis,2 Giannis Tsakonas,2 Christos Papatheodorou2
13th ACM/IEEE Joint Conference on Digital Libraries, July 22-26, Indianapolis, IN, USA
1. Department of Informatics, Athens University of Economics & Business
2. Database & Information Systems Group, Department of Archives & Library Science, Ionian University
aim & scope of research
• To propose a methodology for discovering patterns in the scientific literature.
• Our case study is performed in the digital library evaluation domain and its conference literature.
• We question:
- how we select relevant studies,
- how we annotate them,
- how we discover these patterns,
in an effective, machine-operated way, in order to have reusable and interpretable data?
why
• Abundance of scientific information
• Limitations of existing tools (e.g., in reusability)
• Lack of contextualized analytic tools
• Supervised automated processes
panorama
1. Document classification to identify relevant papers
- We use a corpus of 1,824 papers from the JCDL and ECDL (now TPDL) conferences, covering 2001-2011.
2. Semantic annotation processes to mark up important concepts
- We use a schema for semantic annotation, the Digital Library Evaluation Ontology (DiLEO), and a semantic annotation tool, GoNTogle.
3. Clustering to form coherent groups (K=11)
4. Interpretation with the assistance of the ontology schema
• During this process we perform benchmarking tests to qualify specific components, so as to effectively automate the exploration of the literature and the discovery of research patterns.
training phase
• The aim was to train a classifier to identify relevant papers.
• Categorization
- two researchers categorized, a third one supervised
- descriptors: title, abstract & author keywords
- raters' agreement: 82.96% for JCDL, 78% for ECDL
- inter-rater agreement: moderate levels of Cohen's Kappa
- 12% positive vs. 88% negative
• Skewness of data addressed via resampling (see the sketch below):
- under-sampling (Tomek Links)
- over-sampling (random over-sampling)
corpus definition
• Classification algorithm: Naïve Bayes
• Two sub-sets: a development set (75%) and a test set (25%)
• Ten-fold validation: the development set was randomly divided into 10 equal parts; 9/10 used as the training set and 1/10 as the test set (a sketch follows the figure).
[Figure: ROC curve (tp rate vs. fp rate) for the Development and Test sets]
the schema - DiLEO
• DiLEO aims to conceptualize the DL evaluation domain by exploring its key entities, their attributes and their relationships.
• A two-layered ontology:
- Strategic level: consists of a set of classes related to the scope and aim of an evaluation.
- Procedural level: consists of classes dealing with practical issues.
the instrument - GoNTogle
• We used GoNTogle to generate an RDFS knowledge base.
• GoNTogle uses the weighted k-NN algorithm to support either manual or automated ontology-based annotation.
• http://bit.ly/12nlryh
the process - 1/3
• GoNTogle estimates a score for each class/subclass by calculating its presence in the k nearest neighbors (see the sketch below).
• We set a score threshold above which a class is assigned to a new instance (optimal score: 0.18).
• The user is presented with a ranked list of the suggested classes/subclasses and their scores, ranging from 0 to 1.
• 2,672 annotations were manually generated.
the process - 2/3
• RDFS statements were processed to construct a new data set (removal of stopwords and symbols, lowercasing, etc.)
• Experiments with both un-stemmed (4,880 features) and stemmed (3,257 features) words.
• Multi-label classification via the ML framework Meka (an illustrative sketch follows this list).
• Four methods
- binary representation
- label powersets
- RAkEL
- ML-kNN
• Four algorithms
- Naïve Bayes
- Multinomial Naïve Bayes
- k-Nearest Neighbors
- Support Vector Machines
• Four metrics
- Hamming Loss
- Accuracy
- One-error
- F1 macro
the process - 3/3
• Performance tests were repeated using GoNTogle.
• GoNTogle's algorithm achieves good results in relation to the tested multi-label classification algorithms.
[Figure: bar chart comparing GoNTogle and Meka on Hamming Loss, Accuracy, One-Error and F1 macro]
clustering - 1/3
• The final data set consists of 224 vectors of 53 features
- it represents the assignment of annotations from the DiLEO vocabulary to the document corpus.
• We represent the annotated documents by 2 vector models (see the sketch below):
- binary: fi has the value 1 if the subclass corresponding to fi is assigned to document m, otherwise 0.
- tf-idf: the feature frequency ffi of fi is equal to 1 when the respective subclass is annotated to document m; idfi is the inverse document frequency of feature i over the M documents.
clustering - 2/3
• We cluster the vector representations of the annotations by applying 2 clustering algorithms (a sketch follows):
- K-Means: partitions the M data points into K clusters. When the objective function (cost or error) was plotted for various values of K, its rate of decrease peaked for K near 11.
- Agglomerative Hierarchical Clustering: a 'bottom-up' built hierarchy of clusters.
clustering - 3/3
• We assess each feature of each cluster using the frequency increase metric (see the sketch below).
- it calculates the increase of the frequency of a feature fi in cluster k (cfi,k) compared to its document frequency dfi in the entire data set.
• We select the threshold a that maximizes the F1-measure, the harmonic mean of Coverage and the Dissimilarity mean.
- Coverage: the proportion of features participating in the clusters to the total number of features.
- Dissimilarity mean: the average of the distinctiveness of the clusters, defined in terms of the dissimilarity di,j between all possible pairs of clusters.
conclusions
• The patterns reflect and, up to a point, confirm the anecdotally evident research practices of DL researchers.
• Patterns have similar properties to a map.
- They can provide the main and the alternative routes one can follow to reach a destination, taking into account several practical parameters one might not know.
• By exploring previous profiles, one can weigh all the available options.
• This approach can extend other coding methodologies in terms of transparency, standardization and reusability.