In Cassandra Lunch #95, Obioma Anomnachi discusses the DSEGraphFrames library, which allows Spark to perform operations on graph databases. We also discuss the difference between transactional and analytical operations on DSE Graph.
Cassandra Lunch #95: Spark Graph Operations with DSEGraphFrames Scala API
1. Version 1.0
Spark Graph Operations with
DSEGraphFrames Scala API
Scala libraries for interacting with and processing data from
graph databases like DSE Graph.
Obioma Anomnachi
Engineer @ Anant
2. DSE Graph
● DSE Graph is a distributed graph database built on top of Cassandra that is part of DataStax
Enterprise (DSE)
○ It maintains many of the advantages of using Cassandra/DSE, including potentially global distribution, zero
downtime, and DSE security protection
○ It also gains many of the benefits of being a graph database, namely in storage and analysis of complex and
inter-related data sets
● Can be combined with DSE's included Search and Analytics capabilities
● Integrates with DSE support tools like OpsCenter and DataStax Studio
3. DSE Graph Analytics
● Most graph traversals (operations done using the adjacency of nodes and edges within a graph)
can be done in real time without making use of DSE Analytics (Spark) resources
○ Deep queries are traversals on a graph with extremely high density or branching factor (nodes are on average
connected to a large number of other nodes)
○ Scan queries traverse whole graphs or large parts of graphs
○ Either of these can require memory or computational resources beyond what the normal processing of graph
queries can provide
■ In these cases we can get better performance by having these queries run via DSE Analytics
● There are two methods for performing analytical queries on DSE Graph instances
○ OLAP queries use an alternate traversal source that uses the SparkGraphComputer to run queries on the
DSE Analytics nodes
○ The DSEGraphFrames library supports a subset of the Gremlin graph traversal language for use in Java and
Scala applications running on Spark
4. OLAP Queries
● Normal DSE Graph queries use Online Transactional Processing (OLTP)
○ Consists of a large number of short transactions for processing queries quickly
○ Used primarily for data entry and retrieval
○ Uses filters and subgraphs to speed up access to data in specific parts of the larger graph
● Online Analytical Processing (OLAP) is a Spark-backed method for performing multidimensional
data analysis
○ Takes longer than OLTP queries
○ Works by interpreting the graph as a sequence of “star graphs”, each centered on a single vertex
○ Intended for queries that process the entire graph or at least large portions of it
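To make the "star graph" idea concrete, here is a minimal conceptual sketch in plain Scala. It is not the TinkerPop or DSE implementation; the `Edge`, `StarGraph`, and `StarGraphs.decompose` names are hypothetical and exist only for illustration. It assumes a star graph means one center vertex together with every edge that touches it.

```scala
// A directed, labelled edge between two vertex ids (hypothetical type).
final case class Edge(src: Int, dst: Int, label: String)

// One "star": a center vertex plus all edges incident to it (hypothetical type).
final case class StarGraph(center: Int, incident: Seq[Edge])

object StarGraphs {
  // Split an edge list into one star per vertex. Each edge appears in
  // two stars: the one for its source and the one for its destination.
  def decompose(edges: Seq[Edge]): Map[Int, StarGraph] = {
    val vertices = edges.flatMap(e => Seq(e.src, e.dst)).distinct
    vertices.map { v =>
      v -> StarGraph(v, edges.filter(e => e.src == v || e.dst == v))
    }.toMap
  }

  // Small sample graph used for illustration only.
  val sample: Seq[Edge] = Seq(
    Edge(1, 2, "knows"),
    Edge(1, 3, "knows"),
    Edge(2, 3, "created")
  )
}
```

Because every star is self-contained, a distributed engine can hand stars to workers independently, which is why this representation suits whole-graph scans better than short transactional lookups.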
5. DSE GraphFrame
● Spark API for analytics operations on DSE Graph
○ Inspired by Databricks’ GraphFrame library
○ Supports a subset of the Gremlin graph traversal language
○ Faster than OLAP queries for doing filtering and counts
● Graph represented as two virtual tables
○ V() method for vertex dataframe
○ E() method for edge dataframe
● Can be used to import/export graphs
● Also supports a subset of Apache TinkerPop traversals
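The points above can be sketched as a small Spark application. This is a hedged sketch rather than a definitive implementation: it assumes it runs on a DSE Analytics node with the DSE GraphFrames classes on the classpath, the graph name `example_graph`, the vertex label `person`, and the output path are all hypothetical, and the exact method signatures should be checked against the DataStax documentation for your DSE version.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
// Assumed to add the dseGraph method to SparkSession on a DSE Analytics classpath.
import com.datastax.bdp.graph.spark.graphframe._

object GraphFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("dse-graphframe-sketch").getOrCreate()

    // "example_graph" is a hypothetical graph name.
    val g = spark.dseGraph("example_graph")

    // The two virtual tables: one DataFrame of vertices, one of edges.
    val vertices: DataFrame = g.V().df
    val edges: DataFrame = g.E().df
    vertices.printSchema()
    edges.printSchema()

    // Filter and count, the kind of query the slides describe as faster
    // here than through OLAP. "person" is a hypothetical vertex label.
    val people: DataFrame = g.V().hasLabel("person").df
    println(s"person vertices: ${people.count()}")

    // Export through the ordinary DataFrame writer; the path is hypothetical.
    edges.write.json("/tmp/example_graph_edges")

    spark.stop()
  }
}
```

Once a traversal is converted with `df`, everything downstream is the standard Spark DataFrame API, so existing Spark SQL code can be reused on graph data.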