Benchmarking graph databases on the problem of community detection

Benchmarking graph databases on the problem
of community detection
Sotirios Beis, Symeon Papadopoulos, Yiannis Kompatsiaris
Centre for Research and Technology Hellas, Information Technologies Institute (CERTH-ITI)
ADBIS 2014 Conference
Ohrid, Republic of Macedonia, September 7-10, 2014

Important Note
• This is an update of the actual presentation given in
September 2014. After the presentation, we received
comments about the implementation of our benchmark,
which led us to update the implementation and rerun the
experiments. We would like to thank @lgarulli and
@OrientDB for their contribution.
• The updated paper is now available on:
http://mklab.iti.gr/files/beis_adbis2014_corrected.pdf
while the original is still available on the Springer site:
http://link.springer.com/chapter/10.1007/978-3-319-10518-5_1
#2

#3
Overview
• Problem formulation
• Related work
• Workload description
• Evaluation
• Conclusions

Motivation
• The rapid growth of Online Social Networks contributes
to the creation of high-volume and velocity graph data
– Twitter had 145M users in 2010 and 300M in 2011
http://dstevenwhite.com/2011/12/29/social-media-growth-2006-2011/
• The need for massive graph data management and
processing systems is constantly increasing
– Many methods and technologies available (RDBMS, OODBMS
and graph databases)
• Benchmarks to evaluate candidate solutions are
absolutely essential!
– We had to choose one in the context of the SocialSensor
research project
#4

Problem Formulation
• Given a set of database management systems assess
the performance of each system with respect to a
common mining task: community detection
• Many DBMS available to use
– Graph databases selected, as they are designed to store
and manage efficiently big graph data
– Benchmarked systems: Titan, OrientDB and Neo4j
• Many operations used in databases to benchmark
– Workloads that simulate common operations in OSNs
employed
#5

Related Work (1)
• Evaluation between different DBMS
– Giatsoglou et al. focused on the Social Tagging System use
case scenario
– Angles et al. and Armstrong et al. proposed a synthetic
graph generator with OSN characteristics and use this data
for evaluation
– Grossniklaus et al. used a workloads that cover a wide
variety of graph data use case
– Vicknair et al. utilized query workload that simulates
typical operations performed in provenance system
#6

Related Work (2)
• Evaluation between graph DBMS
– Bader et al. proposed a benchmark with four operations:
insert, h-hops node and edge selection, while Dominguez
et al. reports the implementation.
– A similar workload is proposed by Jouili et al. emphasizing
the effects of increasing multiple concurrent users.
– Ciglan et al. proposed a benchmark focusing on graph
traversal operations.
– Dayarathna et al. used similar workload, focusing on graph
databases server and cloud environments.
#7

Workloads Description (1)
Our Contribution
•Clustering Workload (CW)
– Very important due to its numerous applications, such as
topic detection, photo clustering and event detection.
– Until now most community detection algorithms used
main memory to store the graph and perform the
required operations  fast execution time, unable to
manage big graphs.
– We propose an implementation of Louvain Method on top
of graph databases employing cache techniques  take
advantage of both graph database capabilities and in-memory
execution speed.
#8

Workload Description (2)
Supplementary Workloads
•Massive Insertion Workload (MIW)
– Populates a graph with bulk load operations.
– Simulates batch creation of a graph.
•Single Insertion Workload (SIW)
– Every object insertion is committed directly.
– Simulates the growth of a OSN.
•Query Workload (QW)
– FindNeighbors query (FN)  Finds the neighbors of all nodes
• simulates the friends retrieval a Facebook user
– FindAdjacentNodes query (FA)  Finds the adjacent nodes of all edges
• used to find whether two user joined the same Facebook group
– FindShortestPath (FS)  Finds the shortest paths between 100 nodes
• used to find the level two LinkedIn users are connected
#9

Experimental Study
• Datasets
– Synthetic for CW, generated with LFR generator.
– Real for MIW, SIW & QW, from SNAP dataset collection.
• Evaluation Measure: execution time (second)
#10

Results: Comparison wrt CW
#11

Results: Cache size impact
#12
Titan
Neo4j
OrientDB

Results: comparison wrt MIW and QW
#13

Results: Comparison wrt SIW
YouTube LiveJournal
#14

Conclusions
SUMMARY
• OrientDB is the most efficient solution for CW, while Titan is
the winner in SIW
• Neo4j is the fastest graph database for MIW and QW
• Titan and OrientDB have competitive performance in MIW
and QW respectively  don't scale as good as Neo4j
FUTURE WORK
• Test with bigger graphs.
• Run the benchmark employing the distributed version of
Titan and OrientDB and other graph databases (Sparksee
already added).
• Improve performance of the implemented community
detection method.
#15

References (1)
Evaluation between different DBMS
Giatsoglou, M., Papadopoulos, S., Vakali, A.: Massive graph management for the web
and web 2.0. In Vakali, A., Jain, L., eds.: New Directions in Web Data
Management 1. Volume 331 of Studies in Computational Intelligence. Springer
Berlin Heidelberg (2011) 19-58
Angles, R., Prat-Perez, A., Dominguez-Sal, D., Larriba-Pey, J.L.: Benchmarking database
systems for social network applications. In: First International Workshop on
Graph Data Management Experiences and Systems. GRADES '13, New York, NY,
USA, ACM (2013) 15:1-15:7
Armstrong, T.G., Ponnekanti, V., Borthakur, D., Callaghan, M.: Linkbench: a database
benchmark based on the facebook social graph. (2013)
Grossniklaus, M., Leone, S., Zaschke, T.: Towards a benchmark for graph data
management and processing. (2013)
Vicknair, C., Macias, M., Zhao, Z., Nan, X., Chen, Y., Wilkins, D.: A comparison of a
graph database and a relational database: A data provenance perspective. In:
Proceedings of the 48th Annual Southeast Regional Conference. ACM SE '10,
New York, NY, USA, ACM (2010) 42:1-42:6
#16

References (2)
Evaluation between graph DBMS
Bader, D.A., Feo, J., Gilbert, J., Kepner, J., Koester, D., Loh, E., Madduri, K., Mann,
B., Meuse, T., Robinson, E.: Hpc scalable graph analysis benchmark (2009)
Dominguez-Sal, D., Urbn-Bayes, P., Gimnez-Va, A., Gmez-Villamor, S., Martnez-
Bazn, N., Larriba-Pey, J.: Survey of graph database performance on the hpc
scalable graph analysis benchmark. In Shen, H., Pei, J., zsu, M., Zou, L., Lu, J.,
Ling, T.W., Yu, G., Zhuang, Y., Shao, J., eds.:Web-Age Information Management.
Volume 6185 of Lecture Notes in Computer Science. Springer Berlin Heidelberg
(2010) 37-48
Ciglan, M., Averbuch, A., Hluchy, L.: Benchmarking traversal operations over graph
databases. In: Data Engineering Workshops (ICDEW), 2012 IEEE 28th
International Conference on. (April 2012) 186-189
Jouili, S., Vansteenberghe, V.: An empirical comparison of graph databases. In: Social
Computing (SocialCom), 2013 International Conference on. (Sept 2013) 708-
715
Dayarathna, M., Suzumura, T.: Xgdbench: A benchmarking platform for graph stores
in exascale clouds. In: Cloud Computing Technology and Science (Cloud-Com),
2012 IEEE 4th International Conference on. (Dec 2012) 363-370
#17

Source: https://github.com/socialsensor/graphdb-benchmarks
#18
Questions
Further contact:
sotbeis@iti.gr / papadop@iti.gr
Acknowledgement:
www.socialsensor.eu

Benchmarking graph databases on the problem of community detection

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Benchmarking graph databases on the problem of community detection

Similar to Benchmarking graph databases on the problem of community detection (20)

More from Symeon Papadopoulos

More from Symeon Papadopoulos (20)

Recently uploaded

Recently uploaded (20)

Benchmarking graph databases on the problem of community detection