Using Spark for
Timeseries Graph
Analytics
Sigmoid 12th Meetup
-Ved Mulkalwar
ved.mulkalwar@gmail.com
9564211606
Content
 Introduction to the sample problem statement
 Which graph database is used, and why
 Installing Titan
 Titan with Cassandra
 The Gremlin Cassandra script: a way to store data in Cassandra from the Titan
Gremlin shell
 Accessing Titan with Spark
An introduction to the kinds of time-series
problems that can be solved using
graph analytics, and to the
problem statement I have been
working on.
The dynamics of a complex system are usually
recorded in the form of time series. In recent
years, the visibility graph algorithm and the
horizontal visibility graph algorithm have been
introduced as mappings between time series
and complex networks. By transforming time
series into graphs, these algorithms allow
graph-theoretical tools to be applied to
characterize time series, opening the possibility
of building fruitful connections between time
series analysis, nonlinear dynamics, and graph
theory.
 The problem statement I have been working on:
 Our initial goal was to find anomalies in a given time series. We started
by slicing the data into small, meaningful parts, so that the data becomes
smaller and contiguous data points sharing the same property get clubbed
together.
 Later we created a graph from these parts and tried to find the parts with
similar properties. Doing this singles out the anomalies, which was the goal.
Which graph database I used and why,
i.e. the differences between GraphX, Neo4j,
and Titan, and why we used Titan
Titan vs GraphX
 The fundamental difference between Titan and GraphX lies in how they
persist data and how they process it. Titan, by default, persists data
(vertices, edges, and properties) to a distributed data store in the form
of tables bound to a specific schema. This schema can be stored in
Berkeley DB tables, Cassandra tables, or HBase tables.
 In Titan's case, the graph is stored as vertices in a vertex table and
edges in an edge table.
 GraphX has no real persistence layer (yet?); it can persist to HDFS
files, but it cannot persist to a distributed datastore in a common,
schema-like form as in Titan's case.
 A graph only exists in GraphX while it is loaded into memory from raw data
and interpreted as graph RDDs; Titan stores the graph permanently.
 The other major difference between the two solutions is how they process
graph data.
 By default, GraphX solves queries via distributed processing on many
nodes in parallel where possible, whereas Titan processes pipelines
on a single node. Titan can also take advantage of parallel processing via
Faunus/HDFS if necessary.
Neo4j vs Titan
 The primary difference between Titan and Neo4j is scalability: Titan can
distribute the graph across multiple machines (using either Cassandra or
HBase as the storage backend), which does three things:
 1) It allows Titan to store really, really large graphs by scaling out
 2) It allows Titan to handle massive write loads
 3) There is no single point of failure, giving very high availability. While
Neo4j's HA setup gives you redundancy through replication, the death of the
master in such a setup leads to a temporary service interruption while a new
master is elected.
 This allows Titan to scale to large deployments with thousands of
concurrent users, as demonstrated in this benchmark:
 http://thinkaurelius.com/2012/08/06/titan-provides-real-time-big-graph-data/
 Neo4j has been around much longer and is therefore more "mature" in
some regards:
 1) Integration with an external Lucene index gives it more sophisticated
search capabilities
 2) It has more integration points into other programming languages and
development frameworks (e.g. Spring)
Intro to the Titan graph database and how to use
it with Cassandra & Spark
Installing Titan
 Download the latest prebuilt version (0.9.0-M2) of Titan from
s3.thinkaurelius.com/downloads/titan/titan-0.9.0-M2-hadoop1.zip.
 Carry out the following steps to ensure that Titan is installed on each node
in the cluster:
 Now, use the Linux su (switch user) command to change to the root
account, and move the install to the /usr/local/ location. Change the file
and group membership of the install to the hadoop user, and create a
symbolic link called titan so that the current Titan release can be referred to
as the simplified path called /usr/local/titan:
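The steps just described might look like the following shell session (a sketch based on the text; the archive name, the /home/hadoop staging path, and the hadoop user and group are assumptions for a typical cluster):

```shell
# Switch to the root account and move the unpacked install to /usr/local
su -
cd /home/hadoop
unzip titan-0.9.0-M2-hadoop1.zip
mv titan-0.9.0-M2-hadoop1 /usr/local/
cd /usr/local

# Hand file and group ownership to the hadoop user
chown -R hadoop:hadoop titan-0.9.0-M2-hadoop1

# Symbolic link so the current release can be referred to as /usr/local/titan
ln -s titan-0.9.0-M2-hadoop1 titan
```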
Titan with Cassandra
 In this section, the Cassandra NoSQL database will be used as a storage
mechanism for Titan. Although it does not use Hadoop, it is a large-scale,
cluster-based database in its own right, and can scale to very large cluster
sizes. A graph will be created, and stored in Cassandra using the Titan
Gremlin shell. It will then be checked using Gremlin, and the stored data
will be checked in Cassandra. The raw Titan Cassandra graph-based data
will then be accessed from Spark. The first step then will be to install
Cassandra on each node in the cluster.
 Install Cassandra on all the nodes
 Set up the Cassandra configuration under /etc/cassandra/conf by altering
the cassandra.yaml file:
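The cassandra.yaml changes are typically along these lines (a sketch; the cluster name, node addresses, and seed list are placeholders for your own values):

```yaml
cluster_name: 'MyCluster'        # must match on every node
listen_address: 192.168.1.101    # this node's IP address
rpc_address: 192.168.1.101       # address clients connect to
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "192.168.1.101,192.168.1.102"
```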
 Log files can be found under /var/log/cassandra, and the data is stored
under /var/lib/cassandra. The nodetool command can be used on any
Cassandra node to check the status of the Cassandra cluster:
 The Cassandra CQL shell command called cqlsh can be used to access
the cluster, and create objects. The shell is invoked next, and it shows that
Cassandra version 2.0.13 is installed:
 The Cassandra Query Language next shows a keyspace called keyspace1
being created and used via the CQL shell:
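Creating and using the keyspace from cqlsh might look like this (a sketch; the replication strategy and factor are assumptions suitable for a small test cluster):

```sql
CREATE KEYSPACE keyspace1
  WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
USE keyspace1;
DESCRIBE KEYSPACE keyspace1;
```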
The Gremlin Cassandra script
 The interactive Titan Gremlin shell can be found within the bin directory of
the Titan install, as shown here. Once started, it offers a Gremlin prompt:
 The following script will be entered using the Gremlin shell. The first
section of the script defines the configuration in terms of the storage
(Cassandra), the port number, and the keyspace name that is to be used:
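In the Gremlin shell, that configuration section might look like the following (a sketch assuming Titan 0.9-era APIs; the host, port, and keyspace values are placeholders):

```groovy
import org.apache.commons.configuration.BaseConfiguration
import com.thinkaurelius.titan.core.TitanFactory

conf = new BaseConfiguration()
conf.setProperty("storage.backend", "cassandra")    // Cassandra as storage
conf.setProperty("storage.hostname", "127.0.0.1")   // Cassandra node address
conf.setProperty("storage.port", "9160")            // Thrift port
conf.setProperty("storage.cassandra.keyspace", "titan")
titanGraph = TitanFactory.open(conf)
```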
 Next, define the generic vertex properties name and age for the graph,
using the management system, and then commit the management system
changes:
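Defining the name and age property keys might be sketched as follows (assuming the Titan 0.9 openManagement() API; check method names against your version):

```groovy
manage = titanGraph.openManagement()
manage.makePropertyKey('name').dataType(String.class).make()
manage.makePropertyKey('age').dataType(Integer.class).make()
manage.commit()   // commit the schema changes
```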
 Now, six vertices are added to the graph. Each one is given a numeric
label to represent its identity. Each vertex is given an age and name value:
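Adding the six vertices could look like this (a sketch; apart from Mike and Flo, who appear later in the slides, the names and ages here are invented for illustration):

```groovy
v1 = titanGraph.addVertex('name', 'Mike',  'age', 48)
v2 = titanGraph.addVertex('name', 'Flo',   'age', 45)
v3 = titanGraph.addVertex('name', 'Sarah', 'age', 23)
v4 = titanGraph.addVertex('name', 'Rob',   'age', 25)
v5 = titanGraph.addVertex('name', 'Kim',   'age', 52)
v6 = titanGraph.addVertex('name', 'Pete',  'age', 27)
```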
 Finally, the graph edges are added to join the vertices together. Each edge
has a relationship value. Once created, the changes are committed to
store them to Titan, and therefore Cassandra:
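The edges with their relationship values, followed by the final commit, might be sketched as (the relationship labels are illustrative assumptions):

```groovy
v1.addEdge('wife',     v2)   // e.g. Mike -> Flo
v1.addEdge('daughter', v3)
v3.addEdge('husband',  v4)
v1.addEdge('sister',   v5)
v5.addEdge('son',      v6)
titanGraph.tx().commit()     // persist to Titan, and therefore Cassandra
```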
 This results in a simple person-based graph, shown in the following figure.
 This graph can then be tested in Titan via the Gremlin shell using a similar
script to the previous one. Just enter the following script at the gremlin>
prompt, as shown previously. It uses the same initial six lines to create
the titanGraph configuration, but then creates a graph traversal variable, g.
 The graph traversal variable can be used to check the graph contents.
Using the ValueMap option, it is possible to search for the graph nodes
called Mike and Flo. They have been successfully found here:
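The traversal source and the ValueMap lookups might be sketched as:

```groovy
g = titanGraph.traversal()           // graph traversal variable g
g.V().has('name', 'Mike').valueMap() // inspect Mike's properties
g.V().has('name', 'Flo').valueMap()  // inspect Flo's properties
```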
 Using the Cassandra CQL shell, and the Titan keyspace, it can be seen
that a number of Titan tables have been created in Cassandra:
 It can also be seen that the data exists in the edgestore table within
Cassandra:
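The Cassandra-side checks might look like this from cqlsh (a sketch; the table names follow Titan's defaults, with edgestore holding the raw graph data):

```sql
USE titan;
DESCRIBE TABLES;                    -- lists the tables Titan created
SELECT * FROM edgestore LIMIT 10;   -- raw Titan graph data
```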
 This assures us that a Titan graph has been created in the Gremlin shell,
and is stored in Cassandra. Now, I will try to access the data from Spark.
Accessing Titan with Spark
 So far, Titan 0.9.0-M2 has been installed, and graphs have been
successfully created using Cassandra as the backend storage option.
These graphs were created using Gremlin-based scripts. In this
section, a properties file will be used via a Gremlin script to process a
Titan-based graph using Apache Spark, again with Cassandra as the
backend storage option.
The following figure shows the
architecture used in this section.
 Let us examine a properties file that can be used to connect to Cassandra
as a storage backend for Titan. It contains sections for Cassandra, Apache
Spark, and the Hadoop Gremlin configuration. My Cassandra properties
file is called cassandra.properties, and it looks like this:
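A properties file of this shape might be used (a sketch; the class names follow TinkerPop 3.0-era packages and the values are placeholders, so check them against your Titan and Spark versions):

```properties
# Cassandra section
storage.backend=cassandra
storage.hostname=127.0.0.1
storage.cassandra.keyspace=titan

# Hadoop Gremlin section
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=com.thinkaurelius.titan.hadoop.formats.cassandra.CassandraInputFormat
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output

# Apache Spark section
spark.master=local[4]
spark.executor.memory=1g
spark.serializer=org.apache.spark.serializer.KryoSerializer
```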
Into Titan code
 The following are the necessary TinkerPop and Aurelius classes that will be used:
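These might include the following (a sketch; the package paths are assumptions based on Titan 0.9 with TinkerPop 3.0):

```groovy
import com.thinkaurelius.titan.core.TitanFactory
import org.apache.commons.configuration.BaseConfiguration
import org.apache.tinkerpop.gremlin.structure.util.GraphFactory
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer
```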
Scala code
Output:
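The Spark-based access described in this section can be approximated in the Gremlin console as follows (a rough sketch assuming TinkerPop 3.0-era OLAP syntax; the exact traversal-source form varies by version):

```groovy
// Open the Hadoop/Spark-enabled graph from the properties file
graph = GraphFactory.open('cassandra.properties')

// OLAP traversal source backed by Spark
g = graph.traversal(computer(SparkGraphComputer))

// Executed as a Spark job over the Titan/Cassandra graph data
g.V().count()
```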
Thank You
