Data visualization can be a tricky problem, even more so when the dataset consists of several billion 3-dimensional particles moving over time. This talk focuses on some simple indexing and data-thinning techniques, and on how (and how not) to implement them with Cassandra and Spark.
2. Who am I?
• Research Support Engineer in the Autonomic Systems and e-Business Platforms group since 2012
– Bachelor thesis on social network databases (2011)
– Master thesis: “Design and implementation of a Benchmarking Platform for Cassandra Data Base” (2013)
– Conference paper: “Aeneas: A tool to enable applications to effectively use non-relational databases”, C. Cugnasco, R. Hernandez, Y. Becerra, J. Torres, E. Ayguadé, ICCS 2013
– Aeneas: https://github.com/cugni/aeneas
4. Nice render, but how to work with it?
The simulation needs to be visualized, explored, and queried with a human-bearable response time. One can’t wait an hour to see what a trajectory looks like!
5. First approaches
• Trajectory size ~ 60 GB:
– MySQL:
• Days to load the data
• Queries are very slow
– `cat trajectory | awk '{ if ($12 > -0.2 …` was faster
– Impala on HDFS:
scales extremely well and runs at full CPU, but still reads all the data into memory for each query
– Cassandra + Solr:
some tricks for 2D, no true support for 3D
7. NoSQL databases
• Built from scratch to cope with Big Data by scaling linearly and always being available.
• How big is Big Data?
– Apple: over 75,000 nodes storing more than 10 petabytes
– Netflix: 2,500 nodes, 420 TB, over 1 trillion requests per day
– eBay: over 100 nodes, 250 TB
8. How did they scale up?
• Compared to relational databases, they offer a reduced set of functionalities:
– No distributed locks
• No isolation
• Limited atomicity
– Eventual consistency
– No memory-intensive operations:
• JOINs
• GROUP BYs
• Arbitrary filtering
10. Cassandra data model
• Essentially a HashMap where each entry contains a SortedMap.

CREATE TABLE particles (
  part_id int,   -- partition key
  time float,    -- clustering key
  x float,
  y float,
  z float,
  PRIMARY KEY (part_id, time)
);

HashMap<Integer, SortedMap<Float, Point>> particles = new HashMap<>();

An example of how to store the position of particles over time.
11. Queries
POSSIBLE:

SELECT * FROM particles
WHERE part_id=10
=> particles.get(10)

SELECT * FROM particles
WHERE part_id=10
AND time>=1.234
AND time<2.345
=> particles.get(10).subMap(1.234, 2.345)

IMPOSSIBLE:

SELECT * FROM particles
WHERE time=1.234
=> needs a different model

SELECT * FROM particles
WHERE x>=1.0 AND x<2.0
AND y>=1.0 AND y<2.0
AND z>=1.0 AND z<2.0
=> needs a multidimensional index
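The HashMap-of-SortedMaps behaviour can be sketched in plain Python (a toy stand-in for Cassandra's storage, not its implementation; `ParticleTable` and the `bisect`-based range lookup are illustrative choices):

```python
from bisect import bisect_left

class ParticleTable:
    """Toy model: outer dict = hash map keyed by the partition key
    (part_id); each inner list keeps rows sorted by the clustering
    key (time), like rows inside one Cassandra partition."""

    def __init__(self):
        self.partitions = {}  # part_id -> sorted list of (time, (x, y, z))

    def insert(self, part_id, time, xyz):
        rows = self.partitions.setdefault(part_id, [])
        rows.insert(bisect_left(rows, (time,)), (time, xyz))

    def get_partition(self, part_id):
        # SELECT * FROM particles WHERE part_id = ?
        return self.partitions.get(part_id, [])

    def get_range(self, part_id, t_from, t_to):
        # SELECT * WHERE part_id = ? AND time >= ? AND time < ?
        rows = self.partitions.get(part_id, [])
        lo = bisect_left(rows, (t_from,))  # first row with time >= t_from
        hi = bisect_left(rows, (t_to,))    # first row with time >= t_to
        return rows[lo:hi]
```

A partition lookup is a single hash probe, and the time-range query is a binary search inside one partition; that is why only clustering-key ranges are cheap, while a filter on x, y, z would have to scan everything.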
12. Wait! We have secondary indexes!
Cassandra allows multiple secondary indexes on a table's columns, but:
1. They work well only when indexing a few discrete values.

SELECT * FROM user
WHERE mail='bad@usage.com' => NO!

SELECT * FROM user
WHERE country='ES' => Better
13. Wait! We have secondary indexes!
2. You can create multiple secondary indexes and filter on them, but only the most selective index will be used; the other conditions are applied in memory => BAD!

SELECT * FROM user
WHERE state='UK'
AND sex='M'
AND month='April'

The query will read all the UK users from disk, then filter them in memory by sex and month. It will crash!
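A toy simulation of that behaviour (hypothetical code, not Cassandra's planner; `indexed_query` and `build_index` are made-up helpers):

```python
def build_index(rows, attr):
    """Secondary index as a posting list: value -> row ids."""
    idx = {}
    for i, r in enumerate(rows):
        idx.setdefault(r[attr], []).append(i)
    return idx

def indexed_query(rows, indexes, preds):
    """rows: list of dicts; indexes: {attr: posting lists};
    preds: {attr: wanted value}. Returns (result, rows_fetched)."""
    # only the most selective index (shortest posting list) drives the read
    best = min(preds, key=lambda a: len(indexes[a].get(preds[a], [])))
    fetched = [rows[i] for i in indexes[best].get(preds[best], [])]
    # every other predicate is applied in memory to the fetched rows
    result = [r for r in fetched
              if all(r[a] == v for a, v in preds.items())]
    return result, len(fetched)
```

In the test data below, the month index is the most selective, so only the month predicate limits what is read; state and sex are checked row by row in memory, which is exactly what blows up when the driving predicate matches millions of rows.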
14. Wait! We have secondary indexes!
3. They are indexed locally => a query must be sent to all the nodes of the cluster! Little scalability!
(Figure: throughput, 1M req/s with 1 server vs. 3M req/s with 3 servers)
15. Spark/Cassandra connector
Main idea: run a Spark cluster on top of a Cassandra cluster.
Small difference: Spark has a master, while Cassandra has only peers.
Each worker preferably reads the data stored locally on its Cassandra node.
16. Spark/Cassandra connector
The queries are partitioned using the Cassandra node tokens. The client's full scan:

SELECT *
FROM particles

is split into one token-range scan per node:

SELECT *
FROM particles
WHERE TOKEN(id) >= 1
AND TOKEN(id) < 2

SELECT *
FROM particles
WHERE TOKEN(id) >= 2
AND TOKEN(id) < 3

SELECT *
FROM particles
WHERE TOKEN(id) >= 3
AND TOKEN(id) < 1

Actual tokens are spread between 0 and 2^64.
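The splitting can be sketched as follows (a simplified model that ignores vnodes and replication; `split_full_scan` is an illustrative helper, not the connector's API):

```python
RING = 2 ** 64  # the token space of the slide, 0 .. 2^64

def split_full_scan(node_tokens, table="particles"):
    """Turn a full-table scan into one token-range query per node.
    Each node owns the ring slice from its token up to the next
    node's token; the last slice wraps around the ring."""
    toks = sorted(node_tokens)
    queries = []
    for i, start in enumerate(toks):
        end = toks[(i + 1) % len(toks)]  # wraps on the last range
        queries.append(
            f"SELECT * FROM {table} "
            f"WHERE TOKEN(id) >= {start} AND TOKEN(id) < {end}")
    return queries
```

With the three tokens from the figure this reproduces the three sub-queries above, including the wrap-around range `>= 3 AND < 1`. Each Spark worker then runs the sub-query for the range its local Cassandra node owns.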
17. Spark/Cassandra connector: benefits
• Push-down filtering
– Currently stable:
• select: vertical filtering (column projection)
• where("country = 'es'")
=> it uses C* secondary indexes; the predicate is appended to the token-filtering predicate
– Since 1.2, still in RC – not stable:
• joinWithCassandraTable && repartitionByCassandraReplica
You can use an RDD to access all the matching rows in Cassandra. You don't need a full table scan to perform the join, BUT you issue one request per row!
18. Spark/Cassandra connector: benefits
• Spark SQL integration!
Yes, you read that right: SQL on NoSQL!
• Spark Streaming integration
• Mapping between Cassandra rows and objects
• Implicit saves to Cassandra – saveToCassandra
19. Multidimensional indexes
• Hierarchical structures that allow efficient lookups when we set constraints based on two or more attributes.
• The most famous algorithms are:
• Quad-trees
• KD-trees
• R-trees
• What is important to take into consideration:
1. Each algorithm fits some use cases better than others.
2. They all organize data hierarchically in trees.
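As a concrete instance of such a hierarchical structure, here is a minimal 2-D point quadtree (an illustrative sketch; the talk's data is 3-D, where the same idea becomes an octree):

```python
class QuadTree:
    """Minimal point quadtree: a node stores up to `cap` points,
    then splits its box into four quadrants."""

    def __init__(self, x0, y0, x1, y1, cap=4):
        self.box = (x0, y0, x1, y1)
        self.cap = cap
        self.points = []
        self.kids = None

    def insert(self, x, y):
        x0, y0, x1, y1 = self.box
        if not (x0 <= x < x1 and y0 <= y < y1):
            return False  # point outside this node's box
        if self.kids is None:
            if len(self.points) < self.cap:
                self.points.append((x, y))
                return True
            self._split()
        return any(k.insert(x, y) for k in self.kids)

    def _split(self):
        x0, y0, x1, y1 = self.box
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.kids = [QuadTree(x0, y0, mx, my, self.cap),
                     QuadTree(mx, y0, x1, my, self.cap),
                     QuadTree(x0, my, mx, y1, self.cap),
                     QuadTree(mx, my, x1, y1, self.cap)]
        for p in self.points:            # push points down one level
            any(k.insert(*p) for k in self.kids)
        self.points = []

    def query(self, qx0, qy0, qx1, qy1):
        """Range query: prune whole subtrees whose box misses the query."""
        x0, y0, x1, y1 = self.box
        if qx1 <= x0 or qx0 >= x1 or qy1 <= y0 or qy0 >= y1:
            return []
        hits = [(x, y) for x, y in self.points
                if qx0 <= x < qx1 and qy0 <= y < qy1]
        if self.kids:
            for k in self.kids:
                hits += k.query(qx0, qy0, qx1, qy1)
        return hits
```

The pruning step in `query` is the whole point of the hierarchy: a box query touches only the branches of the tree that overlap it, instead of scanning every particle.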
22. Time for code
• Find some examples at
– https://github.com/cugni/meetupExamples
23. No shortcut: make our own index
We finally decided to create our own index on top of a key-value data store.
• We create the indexes with Spark
• We store the indexed data in Cassandra
• Queries:
– Low-latency ones: served by simply reading from Cassandra
– Aggregations and complex ones: executed with Spark
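One possible shape for such an index (an assumption for illustration, not necessarily the index used in this project) is to bucket particles into grid cells and use an interleaved Morton/Z-order key as the partition key in the key-value store, so that a 3-D box query becomes a handful of key lookups:

```python
def morton3(cx, cy, cz, bits=10):
    """Interleave the bits of three cell coordinates into one
    Z-order key, so nearby cells get nearby keys."""
    key = 0
    for i in range(bits):
        key |= ((cx >> i) & 1) << (3 * i)
        key |= ((cy >> i) & 1) << (3 * i + 1)
        key |= ((cz >> i) & 1) << (3 * i + 2)
    return key

def cell_key(x, y, z, cell=1.0):
    """Bucket a particle position into its grid cell's Morton key."""
    return morton3(int(x // cell), int(y // cell), int(z // cell))

def box_keys(lo, hi, cell=1.0):
    """Conservative superset of bucket keys covering the
    axis-aligned box [lo, hi) - each key is one partition lookup."""
    keys = []
    for cx in range(int(lo[0] // cell), int(hi[0] // cell) + 1):
        for cy in range(int(lo[1] // cell), int(hi[1] // cell) + 1):
            for cz in range(int(lo[2] // cell), int(hi[2] // cell) + 1):
                keys.append(morton3(cx, cy, cz))
    return keys
```

The bucket keys can be computed in parallel with Spark at load time and stored as Cassandra partition keys; at query time a low-latency box query reads only the partitions whose cells overlap the box.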
25. Lessons learnt
• The heap can be a problem with Cassandra and Spark on the same node
• Compaction can be a problem
• If your data is not uniformly distributed, neither will Spark's workload be
• The fact that the API allows it doesn't mean you have to do it!
26. Future work
• Spark SQL integration
– we can instruct Spark to create a query plan that uses our indexes; it must understand when using the index is worthwhile and when it is not
• Streaming indexing
– indexing and visualizing data while the simulations are being generated
27. Special thanks
• A special thanks to the people of CASE, especially Antoni Artigues, who is working with me on this project on the C/C++ ParaView side and on the simulations generated with Alya (http://www.bsc.es/alya)