The use of graph theory for analyzing network-like data has gained central importance with the rise of Web 2.0. However, many graph-based techniques are neither well disseminated nor explored to their full potential, which may depend on a complementary approach that combines multiple techniques. This paper describes the systematic use of graph-based techniques of different types (multimodal), combining the resulting analytical insights around a common domain, the Digital Bibliography & Library Project (DBLP). To do so, we introduce an analytical ensemble based on statistical (degree and weakly-connected components distribution), topological (average clustering coefficient and effective diameter evolution), algorithmic (link prediction/machine learning), and algebraic techniques to inspect non-evident features of DBLP, while interpreting the heterogeneous discoveries found along the way. As a result, we have put together a set of techniques demonstrating over DBLP what we call multimodal analysis, an innovative process of information understanding that demands wide technical knowledge and a deep understanding of the data domain. We expect that our methodology and findings will foster other multimodal analyses and shed light on Computer Science research.
Multimodal graph-based analysis over the DBLP repository: critical discoveries and hypotheses
1. Introduction Methodology Experiments Conclusions
Gabriel Perri Gimenes, Hugo Gualdron, Jose F Rodrigues Jr (1)
Mario Gazziro (2)
(1) University of Sao Paulo
Av Trab Sao-carlense, 400
Sao Carlos, SP, Brazil - 13566-590
{ggimenes,gualdron,junio}@icmc.usp.br
(2) Fed. University of Santo Andre
Av dos Estados, 500
Santo Andre, SP, Brazil - 09210-580
mario.gazziro@ufabc.edu.br
This work has financial support from Fapesp (2013/10026-7)
http://www.icmc.usp.br/pessoas/junio/Site/index.htm
The 30th ACM/SIGAPP Symposium On Applied Computing, 2015
Introduction
High demand for information about the behavior of
scientists: authors, editors, funding agencies and society
Combining analytical techniques - multimodal approach
Problem
Finding non-evident facts about DBLP is a non-trivial task
Single-technique approaches - limited analytical potential
Systematic process - can be applied to similar data from other
domains
Hypothesis
The use of multiple analytical techniques, through a well-defined
process, is capable of revealing important aspects of the scientific
community in computer science
Materials
Cardinality of the entities extracted from DBLP - XML
Entity        Number
Authors       1,060,221
Articles      1,801,576
Events        14,654
Publications  4,262
Data migration
Semi-structured format ⇒ Relational model
Specific software is needed for the migration
Definition of the entity-relationship model:
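The entity-relationship model itself is not shown in this text. As a rough illustration of moving semi-structured records into relational tables, the sketch below parses a tiny hand-written sample shaped like DBLP records and loads it into an in-memory SQLite database. The three-table schema, the table and column names, and the `migrate` function are all assumptions made for illustration, not the authors' actual model or software.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hand-written sample shaped like DBLP records; not real DBLP data.
SAMPLE_XML = """<dblp>
  <article key="journals/x/a1">
    <author>Alice</author><author>Bob</author>
    <title>Paper one</title><year>2001</year>
  </article>
  <inproceedings key="conf/y/b2">
    <author>Bob</author><author>Carol</author>
    <title>Paper two</title><year>2003</year>
  </inproceedings>
</dblp>"""

# Hypothetical schema: authors, articles, and the link between them.
SCHEMA = """
CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE article (id INTEGER PRIMARY KEY, dblp_key TEXT, title TEXT, year INTEGER);
CREATE TABLE authorship (author_id INTEGER, article_id INTEGER);
"""

def migrate(xml_text):
    """Load every record under the root element into the relational tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for record in ET.fromstring(xml_text):
        cur = conn.execute(
            "INSERT INTO article (dblp_key, title, year) VALUES (?, ?, ?)",
            (record.get("key"), record.findtext("title"), int(record.findtext("year"))),
        )
        article_id = cur.lastrowid
        for author in record.findall("author"):
            # Insert the author once, then look up its id for the link table.
            conn.execute("INSERT OR IGNORE INTO author (name) VALUES (?)", (author.text,))
            row = conn.execute("SELECT id FROM author WHERE name = ?", (author.text,)).fetchone()
            conn.execute("INSERT INTO authorship VALUES (?, ?)", (row[0], article_id))
    conn.commit()
    return conn

conn = migrate(SAMPLE_XML)
author_count = conn.execute("SELECT COUNT(*) FROM author").fetchone()[0]
article_count = conn.execute("SELECT COUNT(*) FROM article").fetchone()[0]
link_count = conn.execute("SELECT COUNT(*) FROM authorship").fetchone()[0]
```

The full DBLP dump relies on a DTD for character entities and is far larger than this sample, so a real migration would need a streaming parser with entity handling; neither is covered here.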
Extracted relationships
Relationship    Description
Co-authorship   Authors that published an article together.
Co-edition      Authors that appear as editors in the same event or journal.
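A minimal sketch of how a co-authorship relationship of this kind could be derived, assuming each article is available as a plain list of author names. The function name `coauthorship_edges` and the toy data are hypothetical. Co-edition edges could be built the same way from per-event or per-journal editor lists.

```python
from itertools import combinations

def coauthorship_edges(articles):
    """Map each unordered author pair to the number of articles they share."""
    weights = {}
    for authors in articles:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order.
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] = weights.get((a, b), 0) + 1
    return weights

# Toy input: three articles, the last one single-authored (produces no edge).
articles = [["Alice", "Bob", "Carol"], ["Bob", "Alice"], ["Dave"]]
edges = coauthorship_edges(articles)
```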
Multimodal Analysis - WCC
Weakly-connected components distribution - Co-authorship
13% small components with up to 30 nodes
Giant component with 87% of the authors
44,000 co-authorship sub-networks - occasional researchers,
industry white papers
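The component size distribution described above can be sketched with a breadth-first search over an undirected edge list. This is an illustrative sketch on toy data, not the tooling used in the study; `component_sizes` and its `isolated` parameter are hypothetical names.

```python
from collections import Counter, deque

def component_sizes(edges, isolated=()):
    """Return the size of every connected component of an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    for node in isolated:  # authors with no co-author at all
        adj.setdefault(node, set())
    seen, sizes = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        sizes.append(size)
    return sizes

# Toy graph: one component of 4 nodes, one of 2, and one isolated node.
sizes = component_sizes([(1, 2), (2, 3), (3, 4), (5, 6)], isolated=[7])
distribution = Counter(sizes)             # component size -> number of components
giant_fraction = max(sizes) / sum(sizes)  # share of nodes in the largest component
```

Applied to the co-authorship network, `giant_fraction` would correspond to the 87% figure, and `distribution` to the count of small components.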
Multimodal Analysis - ACC
Node degree × average clustering coefficient - Co-authorship
High coefficient values are found in nodes with degree < 10
Coefficient value decreases as the node degree increases - ACC ∝ degree^(-1.06)
Authors tend to collaborate with the co-authors of their co-authors - triangles
Young authors vs. older authors
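To make the degree versus clustering relationship concrete, the sketch below computes the local clustering coefficient of each node (the fraction of its neighbor pairs that are themselves connected) and averages it per degree. The function names and the toy graph are illustrative assumptions; on real data the per-degree averages would be the points to which the power law is fitted.

```python
from collections import defaultdict
from itertools import combinations

def local_clustering(adj, node):
    """Fraction of neighbor pairs of `node` that are connected to each other."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # no neighbor pair exists, so no triangle is possible
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

def average_clustering_by_degree(edges):
    """Map each degree to the mean local clustering of nodes with that degree."""
    adj = {}
    for u, v in edges:
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    buckets = defaultdict(list)
    for node in adj:
        buckets[len(adj[node])].append(local_clustering(adj, node))
    return {degree: sum(vals) / len(vals) for degree, vals in buckets.items()}

# Toy graph: triangle A-B-C plus a pendant node D attached to A.
acc = average_clustering_by_degree([("A", "B"), ("B", "C"), ("A", "C"), ("A", "D")])
```

In this toy graph the degree-2 nodes sit entirely inside the triangle, while the degree-3 node has only one of its three neighbor pairs connected, which mirrors the decreasing trend in miniature.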
Multimodal Analysis - Densification
Degree distribution - Co-authorship
As new authors appear, new edges also appear - e(t) ∝ n(t)^1.47 - densification
Edges appear exponentially vs. publication of elaborated articles
Master's and Ph.D. programs as regular courses
Funding agencies - numbers
More authors per paper
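A densification exponent such as the one in e(t) ∝ n(t)^1.47 is typically estimated as the slope of a straight line fitted in log-log space. The sketch below does this with ordinary least squares. The node and edge counts are synthetic values generated from the exponent itself, purely to show the procedure; they are not DBLP measurements, and `fit_exponent` is a hypothetical name.

```python
import math

def fit_exponent(node_counts, edge_counts):
    """Least-squares slope of log(edges) against log(nodes)."""
    xs = [math.log(n) for n in node_counts]
    ys = [math.log(e) for e in edge_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

# Synthetic snapshots (NOT real data): edges generated exactly as n ** 1.47.
node_counts = [100, 200, 400, 800, 1600]
edge_counts = [n ** 1.47 for n in node_counts]
exponent = fit_exponent(node_counts, edge_counts)
```

An exponent above 1 means the number of edges grows faster than the number of nodes, which is what the slide calls densification.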
Multimodal Analysis - Diameter
Effective diameter evolution - Co-edition
Peaked near 1995 - beginning of a shrink period
Before that - new editors/publication vehicles vs. after that - same editor/same
vehicles
Densification period: more new edges than new nodes - editorial committees rotate
among the same members
Editor: experience and expertise - limitations for new researchers
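The effective diameter is commonly taken as the smallest distance within which 90% of connected node pairs fall. The sketch below uses that definition without interpolation, which is an assumption since the slides do not state the exact variant, and evaluates it on cumulative yearly snapshots. The function names and the timestamped toy edges are hypothetical, and exhaustive breadth-first search is only practical for small graphs.

```python
from collections import defaultdict, deque

def effective_diameter(edges, quantile=0.9):
    """Smallest distance d such that `quantile` of connected pairs are within d."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    pairs_at = defaultdict(int)  # distance -> number of ordered reachable pairs
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    pairs_at[dist[v]] += 1
                    queue.append(v)
    total = sum(pairs_at.values())
    covered = 0
    for d in sorted(pairs_at):
        covered += pairs_at[d]
        if covered >= quantile * total:
            return d
    return 0  # graph without edges

def diameter_evolution(timed_edges):
    """Effective diameter of the cumulative graph at the end of each year."""
    years = sorted({year for _, _, year in timed_edges})
    return {
        y: effective_diameter([(u, v) for u, v, t in timed_edges if t <= y])
        for y in years
    }

# Toy timeline: a path grows for three years, then shortcut edges appear.
timed_edges = [("a", "b", 1990), ("b", "c", 1991), ("c", "d", 1992),
               ("a", "d", 1993), ("a", "c", 1993)]
evolution = diameter_evolution(timed_edges)
```

In the toy timeline the diameter grows while new nodes attach at the periphery and shrinks once edges start appearing among existing nodes, the same qualitative pattern reported for the co-edition network.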
Multimodal Analysis - Predictability
Predictability analysis - Co-authorship
Can we predict new interactions in the DBLP network?
Extraction of topological features → supervised learning
Figure: Results - Interval G[1995, 2005], G[2006, 2007]
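The slides do not list which topological features or which classifier were used. As a hedged sketch of the general setup, the code below builds a labeled dataset from a training interval and a later interval, using three widely used link-prediction features (common neighbors, Jaccard coefficient, Adamic-Adar) as stand-ins. All names and the toy edges are hypothetical; the resulting rows could be fed to any supervised learner.

```python
import math
from itertools import combinations

def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def pair_features(adj, u, v):
    """Topological features of a candidate pair (assumed feature set)."""
    common = adj[u] & adj[v]
    union = adj[u] | adj[v]
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        # A common neighbor has degree >= 2, so the logarithm is positive.
        "adamic_adar": sum(1.0 / math.log(len(adj[w])) for w in common),
    }

def training_examples(train_edges, future_edges):
    """One row per non-adjacent pair: (pair, features, label)."""
    adj = build_adjacency(train_edges)
    future = {frozenset(edge) for edge in future_edges}
    rows = []
    for u, v in combinations(sorted(adj), 2):
        if v in adj[u]:
            continue  # already linked in the training interval
        label = 1 if frozenset((u, v)) in future else 0
        rows.append(((u, v), pair_features(adj, u, v), label))
    return rows

# Toy intervals standing in for G[1995, 2005] (train) and G[2006, 2007] (future).
train = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
future = [("a", "d"), ("b", "e")]
rows = training_examples(train, future)
by_pair = {pair: (features, label) for pair, features, label in rows}
```

Enumerating all node pairs is quadratic, so at DBLP scale the candidate set would have to be restricted, for example to pairs at distance two.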
Multimodal Analysis - Counting and algebraic analysis
Counting - Bipartite author-article network with timestamps
Accomplishment: number of years with at least one
publication
Silence: number of consecutive years with no publications
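Both counts can be computed directly from timestamped author-article pairs. The slide does not say which run of inactive years Silence refers to, so the sketch below assumes the longest gap between two years with publications; that reading, the function names, and the toy records are all assumptions.

```python
from collections import defaultdict

def accomplishment(years):
    """Number of distinct years with at least one publication."""
    return len(set(years))

def silence(years):
    """Longest run of consecutive years without publications.

    Assumption: only gaps between the first and last active year are counted.
    """
    active = sorted(set(years))
    longest = 0
    for previous, following in zip(active, active[1:]):
        longest = max(longest, following - previous - 1)
    return longest

def per_author_counts(records):
    """records: iterable of (author, article_id, year) from the bipartite network."""
    years_by_author = defaultdict(list)
    for author, _article, year in records:
        years_by_author[author].append(year)
    return {
        author: (accomplishment(years), silence(years))
        for author, years in years_by_author.items()
    }

# Toy records: Alice publishes in 2000 (twice), 2001, 2005, 2006; Bob only in 2003.
records = [("Alice", 1, 2000), ("Alice", 2, 2000), ("Alice", 3, 2001),
           ("Alice", 4, 2005), ("Alice", 5, 2006), ("Bob", 6, 2003)]
counts = per_author_counts(records)
```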
Multimodal Analysis - Counting and algebraic analysis
Proposed metric
$$\text{Importance} = \frac{1}{\sqrt{\text{Silence} + 1}} \cdot \log(\text{Accomplishment})$$
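A small worked example of the metric, assuming the natural logarithm since the base is not stated on the slide. The function name `importance` and the sample values are hypothetical.

```python
import math

def importance(accomplishment, silence):
    """log(Accomplishment) / sqrt(Silence + 1), with an assumed natural log."""
    return math.log(accomplishment) / math.sqrt(silence + 1)

# Same number of active years, different gaps: longer silence lowers the score.
steady = importance(accomplishment=4, silence=0)   # log(4) / 1
gapped = importance(accomplishment=4, silence=3)   # log(4) / 2, i.e. log(2)
single = importance(accomplishment=1, silence=0)   # log(1) = 0
```

Under this form, an author with a single active year scores zero regardless of silence, and the penalty for silence grows only with its square root.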
Conclusions
Well-defined analytical process - combination of multiple
techniques
Non-trivial extraction of information from DBLP
Multi-perspective interpretations about the past and future of
the academic community in computer science
Application in the decision making process of funding agencies
and academic personnel