Indexing 3-dimensional trajectories:
Apache Spark and Cassandra integration
Cesare Cugnasco: 8th of April 2015
Who am I?
• Research Support Engineer @
in the Autonomic Systems and e-Business Platforms group since 2012
– Bachelor thesis on social network databases in 2011
– Master thesis: “Design and implementation of a Benchmarking
Platform for Cassandra Data Base” in 2013
– Conference paper : “Aeneas: A tool to enable applications to
effectively use non-relational databases”, C. Cugnasco, R.
Hernandez, Y. Becerra, J. Torres, E. Ayguadé - ICCS
2013
– Aeneas: https://github.com/cugni/aeneas
Use case: Nose simulation
Nice render, but how to work with it?
Simulation needs to be visualized, explored,
queried with a human bearable response time.
One can’t wait 1 hour to see what a trajectory
looks like!
First approaches
• Trajectory size ~ 60GB:
– MySQL:
• Days to load the data
• Queries are very slow
– cat trajectory | awk ‘{ if ($12> -0.2…. was faster
– Impala on HDFS:
scales extremely well and runs at full CPU, but still reads all the data into memory for
each query
– Cassandra+SOLR:
some tricks for 2D, no true support for 3D
We have to find our own solution!
NoSQL databases
• Built from scratch to cope with Big-Data by scaling
linearly and always being available.
• How big is Big Data?
– Apple: over 75,000 nodes storing over 10 PetaBytes
– Netflix: 2,500 nodes, 420 TB, over 1 trillion requests per
day
– eBay: over 100 nodes, 250 TB.
How did they scale up?
• Compared to relational databases, they offer a reduced set of
functionalities:
– No distributed locks
• No isolation
• Limited atomicity
– Eventual consistency
– No memory intensive operations:
• JOINs
• GROUP BYs
• ARBITRARY FILTERING
Cassandra architecture
Cassandra data model
• Essentially a HashMap where each entry contains a
SortedMap.
CREATE TABLE particles(
part_id int,
time float,
x float,
y float,
z float,
PRIMARY KEY(part_id, time)
);
HashMap<Int,SortedMap<Float,Point>> particles = new ..
Partition key: part_id. Clustering key: time.
An example of how to store the
position of particles in time.
Queries
POSSIBLE
SELECT * FROM particles
WHERE part_id=10
=> particles.get(10)
SELECT * FROM particles
WHERE part_id=10
AND time>=1.234
AND time<2.345
=> particles.get(10).subMap(1.234,2.345)
IMPOSSIBLE
SELECT * FROM particles
WHERE time=1.234
=> Needs a different model
SELECT * FROM particles
WHERE x>=1.0 AND x<2.0
AND y>=1.0 AND y<2.0
AND z>=1.0 AND z<2.0
=> Needs a multidimensional index
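The HashMap-of-SortedMap picture can be tried out in a few lines of Python. This is only a toy model of the particles table, not how Cassandra is implemented; the class and method names (Particles, sub_map, scan_box) are made up for the illustration. It shows why the first two queries are cheap and why the box query on x, y, z has nothing to look up and must visit every row.

```python
from bisect import bisect_left, insort


class Particles:
    """Toy model of the particles table: partition key -> rows sorted by clustering key."""

    def __init__(self):
        # part_id -> sorted list of (time, (x, y, z)) tuples.
        # Unlike Cassandra, inserting the same (part_id, time) twice does not overwrite.
        self._partitions = {}

    def insert(self, part_id, time, x, y, z):
        rows = self._partitions.setdefault(part_id, [])
        insort(rows, (time, (x, y, z)))

    def get(self, part_id):
        """WHERE part_id = ? : a single hash lookup."""
        return list(self._partitions.get(part_id, []))

    def sub_map(self, part_id, t_from, t_to):
        """WHERE part_id = ? AND time >= t_from AND time < t_to : binary search in one partition."""
        rows = self._partitions.get(part_id, [])
        # A 1-tuple (t,) sorts before every row whose time equals t.
        lo = bisect_left(rows, (t_from,))
        hi = bisect_left(rows, (t_to,))
        return rows[lo:hi]

    def scan_box(self, lo, hi):
        """Box query on x, y, z: no key helps, so every row of every partition is visited."""
        hits, visited = [], 0
        for part_id, rows in self._partitions.items():
            for time, point in rows:
                visited += 1
                if all(l <= c < h for c, l, h in zip(point, lo, hi)):
                    hits.append((part_id, time, point))
        return hits, visited
```

The visited counter returned by scan_box always equals the total number of stored rows, whatever the size of the box. That cost is what a multidimensional index is meant to avoid.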
Wait! We have secondary indexes!
Cassandra lets you create multiple secondary indexes on the
columns of a table, but:
1. They work well only when the indexed column has few distinct values.
NO!
SELECT * FROM user
WHERE mail='bad@usage.com'
Better:
SELECT * FROM user
WHERE country='ES'
2. You can create multiple secondary indexes and filter on
several of them in one query, but only the most selective index is
used; the remaining conditions are filtered in memory => BAD!
SELECT * FROM user
WHERE state='UK'
AND sex='M'
AND month='April'
The query reads all the UK users from disk and then
filters them in memory by sex and month.
It will crash!
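A rough way to see the cost of point 2 is to count how many rows are read against how many are returned. The sketch below is a simplified simulation in plain Python, not Cassandra's query engine. The function names and the row layout (a list of dicts) are assumptions made for the example.

```python
from collections import defaultdict


def build_index(rows, column):
    """Map each value of `column` to the positions of the rows holding it."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[column]].append(pos)
    return index


def query(rows, indexes, predicates):
    """Answer an AND of equality predicates using one index, filtering the rest in memory.

    Returns (matching_rows, rows_read).
    """
    # Drive the query with the predicate whose index entry is the shortest (most selective).
    driving = min(predicates, key=lambda col: len(indexes[col].get(predicates[col], [])))
    candidates = indexes[driving].get(predicates[driving], [])
    # Every candidate row is read; the other predicates are only checked afterwards.
    matches = [rows[pos] for pos in candidates
               if all(rows[pos][col] == val for col, val in predicates.items())]
    return matches, len(candidates)
```

The gap between rows_read and the number of matches is the wasted work. It grows with the size of the least bad index entry, not with the size of the answer.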
3. They are indexed locally => a query must be sent to
all nodes of the cluster!
Little scalability!
[Figure: throughput comparison between 1 server and 3 servers; labels: 1M req/s, 3M req/s]
Spark/Cassandra connector
Main idea: run a Spark cluster on top of a
Cassandra cluster
Small difference:
Spark has a master,
Cassandra has only peers
Each worker preferably reads
the data stored locally in
Cassandra
The queries are partitioned using the Cassandra
node tokens
Query issued by the client:
SELECT *
FROM particles
Queries sent to the three nodes of the ring:
SELECT *
FROM particles
WHERE TOKEN(id)>= 1
AND TOKEN(id)< 2
SELECT *
FROM particles
WHERE TOKEN(id)>= 2
AND TOKEN(id)< 3
SELECT *
FROM particles
WHERE TOKEN(id)>= 3
AND TOKEN(id)< 1
(this last range wraps around the ring)
Actual tokens are spread between 0 and 2^64
(with the default Murmur3Partitioner: from -2^63 to 2^63-1)
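The splitting of one full scan into per-range queries can be sketched as follows. This is an illustration of the idea only, not the connector's actual code. The helper names are hypothetical, and the wrapping range is emitted as two separate scans, which is one simple way to handle it.

```python
def token_ranges(node_tokens):
    """Return one (start, end) token range per node; the last range wraps around the ring."""
    tokens = sorted(node_tokens)
    return [(tokens[i], tokens[(i + 1) % len(tokens)]) for i in range(len(tokens))]


def range_queries(table, key, start, end):
    """Build the CQL text for one token range, splitting a wrapping range into two scans."""
    base = f"SELECT * FROM {table} WHERE "
    if start < end:
        return [base + f"TOKEN({key}) >= {start} AND TOKEN({key}) < {end}"]
    # Wrapping range: everything from `start` to the top of the ring,
    # plus everything below `end`.
    return [base + f"TOKEN({key}) >= {start}", base + f"TOKEN({key}) < {end}"]


if __name__ == "__main__":
    for start, end in token_ranges([1, 2, 3]):
        for cql in range_queries("particles", "id", start, end):
            print(cql)
```

Each range can then be handed to the Spark worker that runs on the node owning it, which is what makes local reads possible.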
Spark/Cassandra connector: benefits
• Push down filtering
– Currently stable
• select: vertical filtering (only the requested columns are read)
• where("country = 'es'")
=> it uses C* secondary indexes; the predicate is appended to the
token-range predicate
– Since 1.2, still in RC – not stable
• joinWithCassandraTable and
repartitionByCassandraReplica
You can use an RDD to fetch all the matching rows from Cassandra.
The join does not need a full table scan, BUT it performs one
request per row!
• Spark SQL integration!
Yes, you read that right: SQL on NoSQL!
• Spark Streaming integration
• Mapping between Cassandra rows and
objects
• Saving to Cassandra through an implicit method: saveToCassandra
Multidimensional indexes
• Hierarchical structures that allow an efficient
lookup of information when we set constraints
based on two or more attributes.
• The best-known structures are:
• Quad-trees
• KD-trees
• R-trees
• What is important to keep in mind is
that:
1. Each structure fits some use cases better than others.
2. They all organize data hierarchically in trees.
Quad-tree
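As a reminder of how a quad-tree works, here is a minimal 2D point quad-tree in Python. It is a generic textbook-style sketch, not the code used in this project. The capacity and max_depth parameters are arbitrary choices for the example. A leaf splits into four quadrants when it holds too many points, and a range query skips every quadrant that does not overlap the query box.

```python
class QuadTree:
    """Point quad-tree over the half-open square [x0, x1) x [y0, y1)."""

    def __init__(self, x0, y0, x1, y1, capacity=4, depth=0, max_depth=16):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity    # points a leaf may hold before it splits
        self.depth = depth
        self.max_depth = max_depth  # stops endless splitting on duplicate points
        self.points = []            # points stored in this leaf
        self.children = None        # four sub-quadrants once the leaf has split

    def insert(self, x, y):
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= x < x1 and y0 <= y < y1):
            return False  # the point lies outside this node
        if self.children is None:
            self.points.append((x, y))
            if len(self.points) > self.capacity and self.depth < self.max_depth:
                self._split()
            return True
        return any(child.insert(x, y) for child in self.children)

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        args = (self.capacity, self.depth + 1, self.max_depth)
        self.children = [QuadTree(x0, y0, mx, my, *args), QuadTree(mx, y0, x1, my, *args),
                         QuadTree(x0, my, mx, y1, *args), QuadTree(mx, my, x1, y1, *args)]
        points, self.points = self.points, []
        for x, y in points:
            self.insert(x, y)  # now routed to the child that contains the point

    def query(self, xmin, ymin, xmax, ymax):
        """Return the points with xmin <= x < xmax and ymin <= y < ymax."""
        x0, y0, x1, y1 = self.bounds
        if xmax <= x0 or xmin >= x1 or ymax <= y0 or ymin >= y1:
            return []  # no overlap with the query box: prune the whole subtree
        if self.children is None:
            return [(x, y) for x, y in self.points
                    if xmin <= x < xmax and ymin <= y < ymax]
        return [p for child in self.children for p in child.query(xmin, ymin, xmax, ymax)]
```

The 3D analogue (an octree) splits each node into eight cubes instead of four squares. The pruning step in query is the same.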
Time for code
• Find some examples at
– https://github.com/cugni/meetupExamples
No shortcut: make our own index
We finally decided to create our own index on
top of a key-value data store.
• We create indexes with Spark
• We store indexed data on Cassandra
• Queries:
– Low latency ones: done by simply reading from
Cassandra
– Aggregation, complex ones: executed with Spark
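The slides do not say which index structure was built, so the sketch below uses a deliberately simple stand-in: a uniform 3D grid whose cell coordinates act as the key of a key-value store. Everything in it is an assumption made for illustration: the CELL size, the function names, and the dict that plays the role of the Cassandra table. The build step is the part that would run as a Spark job; the box query is the kind of low-latency read that goes straight to the store.

```python
from collections import defaultdict
from itertools import product
from math import floor

CELL = 1.0  # edge length of a grid cell (hypothetical choice)


def cell_of(x, y, z):
    """Key of the grid cell that contains the point."""
    return (floor(x / CELL), floor(y / CELL), floor(z / CELL))


def build_index(rows):
    """Group (part_id, time, x, y, z) rows by grid cell."""
    index = defaultdict(list)  # stands in for a table whose partition key is the cell
    for row in rows:
        index[cell_of(*row[2:])].append(row)
    return index


def query_box(index, lo, hi):
    """Return rows with lo <= (x, y, z) < hi, reading only cells the box may overlap."""
    first, last = cell_of(*lo), cell_of(*hi)
    hits = []
    for cell in product(*(range(a, b + 1) for a, b in zip(first, last))):
        for row in index.get(cell, []):
            # Cells on the border of the box can hold points outside it, so filter.
            if all(l <= c < h for c, l, h in zip(row[2:], lo, hi)):
                hits.append(row)
    return hits
```

A fixed grid handles unevenly distributed points poorly, which is one reason to prefer hierarchical structures such as the quad-tree family mentioned earlier.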
Application architecture
[Figure: application architecture. Labels: entry point; simple queries go directly to Cassandra; aggregations are sent to Spark; Thrift RPC connection]
Lessons learnt
• Heap can be a problem with Cassandra and
Spark on the same node
• Compaction can be a problem
• If your data is not uniformly distributed,
neither will Spark's workload be
• Just because the API allows something doesn’t mean
you have to do it!
Future work
• Spark SQL integration
– we can instruct Spark to create a query plan using
our indexes. It must understand when it’s useful
to use the index and when it is not
• Streaming indexing
– indexing and visualizing data while the
simulations are being created
Special thanks
• A special thanks to the people of CASE,
especially Antoni Artigues who is working
with me on this project on the C/C++ Paraview
side and on the simulations generated with
Alya (http://www.bsc.es/alya)

Indexing 3-dimensional trajectories: Apache Spark and Cassandra integration

  • 1. Indexing 3-dimensional trajectories: Apache Spark and Cassandra integration Cesare Cugnasco: 8th of April 2015
  • 2. Who am I ? • Research Support Engineer @ in the Autonomic Systems and e-Business Platforms group since 2012 – Bachelor thesis on social network databases in 2011 – Master thesis: “Design and implementation of a Benchmarking Platform for Cassandra Data Base” in 2013 – Conference paper : “Aeneas: A tool to enable applications to effectively use non-relational databases”, C. Cugnasco, R. Hernandez, Y. Becerra, J. Torres, E. Ayguadé - ICCS 2013 – Aeneas: https://github.com/cugni/aeneas 2
  • 3. Use case: Nose simulation
  • 4. Nice render, but how to work with it? Simulation needs to be visualized, explored, queried with a human bearable response time. One can’t wait 1 hour to see how a trajectory looks like!
  • 5. First approaches • Trajectory size ~ 60GB: – MySQL: • Days to load the data • Queries are very slow – cat tryectory|awk ‘{ if ($12> -0.2…. was faster – Impala on HDFS: scales extremely, run at top CPU: still, reads all data in memory for each query – Cassandra+SOLR: some trick for 2D, no true support for 3D
  • 6. We have to find our own solution!
  • 7. NoSQL databases • Built from scratch to cope with Big Data by scaling linearly and always being available. • How big is Big Data? – Apple: over 75,000 nodes storing over 10 petabytes – Netflix: 2,500 nodes, 420 TB, over 1 trillion requests per day – eBay: over 100 nodes, 250 TB.
  • 8. How did they scale up? • Compared to relational databases, they offer a reduced set of features: – No distributed locks • No isolation • Limited atomicity – Eventual consistency – No memory-intensive operations: • JOINs • GROUP BYs • Arbitrary filtering
  • 10. Cassandra data model • Essentially a HashMap where each entry contains a SortedMap.
      CREATE TABLE particles (
        part_id int,    -- partition key
        time float,     -- clustering key
        x float,
        y float,
        z float,
        PRIMARY KEY (part_id, time)
      );
      HashMap<Integer, SortedMap<Float, Point>> particles = new ..
    An example of how to store the position of particles over time.
  • 11. Queries
      POSSIBLE:
        SELECT * FROM particles WHERE part_id=10
          particles.get(10)
        SELECT * FROM particles WHERE part_id=10 AND time>=1.234 AND time<2.345
          particles.get(10).subMap(1.234, 2.345)
      IMPOSSIBLE:
        SELECT * FROM particles WHERE time=1.234
          Needs a different model
        SELECT * FROM particles WHERE x>=1.0 AND x<2.0 AND y>=1.0 AND y<2.0 AND z>=1.0 AND z<2.0
          Needs a multidimensional index
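The POSSIBLE/IMPOSSIBLE split on slide 11 follows directly from the "HashMap of SortedMaps" picture. The sketch below is only a toy in-memory model of that picture, not Cassandra itself: the dictionary, the sample rows, and the helper names (`by_partition`, `time_slice`, `box_scan`) are all invented for illustration. It shows that a lookup by partition key or a range on the clustering key is cheap, while a predicate on x, y, z has no structure to exploit and degenerates into a scan of every partition.

```python
from bisect import bisect_left

# Toy stand-in for the `particles` table (hypothetical data):
# partition key (part_id) -> rows (time, x, y, z) kept sorted by time,
# which plays the role of the clustering key.
particles = {
    10: [(1.0, 0.1, 0.2, 0.3), (1.5, 0.4, 0.5, 0.6),
         (2.0, 1.2, 1.1, 1.4), (3.0, 1.5, 1.6, 1.7)],
    11: [(1.0, 1.1, 1.2, 1.3), (1.5, 2.4, 0.5, 0.6)],
}


def by_partition(part_id):
    """Like `WHERE part_id = ?`: a single hash lookup."""
    return particles.get(part_id, [])


def time_slice(part_id, t_from, t_to):
    """Like `WHERE part_id = ? AND time >= ? AND time < ?`.

    Rows are sorted by time, so two binary searches bound the slice.
    A 1-tuple such as (t,) sorts before every row whose time equals t,
    so bisect_left returns the first row with time >= t.
    """
    rows = particles.get(part_id, [])
    lo = bisect_left(rows, (t_from,))
    hi = bisect_left(rows, (t_to,))
    return rows[lo:hi]


def box_scan(lo, hi):
    """Like the x/y/z box query: nothing to seek on, so every row
    of every partition has to be inspected."""
    return [
        (part_id, row)
        for part_id, rows in particles.items()
        for row in rows
        if all(lo <= coord < hi for coord in row[1:])
    ]
```

In this model `time_slice(10, 1.234, 2.345)` touches one partition and two binary searches, whereas `box_scan(1.0, 2.0)` has to visit all six rows to find the three that fall inside the box.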
  • 12. Wait! We have secondary indexes! Cassandra lets you create multiple secondary indexes on the columns of a table, but: 1. They work well only when indexing a few discrete values. SELECT * FROM user WHERE mail=‘bad@usage.com’ NO! SELECT * FROM user WHERE country=‘ES’ Better
  • 13. Wait! We have secondary indexes! 2. You can create multiple secondary indexes and use filtering conditions on them, but only the most selective index will be used; the other conditions are filtered in memory => BAD! SELECT * FROM user WHERE state=‘UK’ AND sex=‘M’ AND month=‘April’ The query will read all the UK users from disk and then filter them in memory by sex and month. It will crash!
  • 14. Wait! We have secondary indexes! 3. They are indexed locally => a query must be sent to all the nodes of the cluster! Little scalability! [Figure labels: 1M req/s, 3M req/s; 1 server, 3 servers]
  • 15. Spark/Cassandra connector Main idea: run a Spark cluster on top of a Cassandra cluster. Small difference: Spark has a master, Cassandra has only peers. Each worker preferably reads the data stored locally in Cassandra.
  • 16. Spark/Cassandra connector The queries are partitioned using the Cassandra node tokens. The client's query
      SELECT * FROM particles
    is split into one query per token range:
      SELECT * FROM particles WHERE TOKEN(id)>= 1 AND TOKEN(id)< 2
      SELECT * FROM particles WHERE TOKEN(id)>= 2 AND TOKEN(id)< 3
      SELECT * FROM particles WHERE TOKEN(id)>= 3 AND TOKEN(id)< 1
    Actual tokens are spread between 0 and 2^64.
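The token-range splitting on slide 16 can be imitated with a few lines of arithmetic. This is a minimal sketch of the idea only, not the connector's code: the function names are hypothetical, the ring bounds are the 0 to 2^64 figure quoted on the slide, and the ranges are cut evenly instead of being read from the cluster's real token ownership. The wrap-around range shown on the slide is not modelled.

```python
def split_token_ring(ring_start, ring_end, n_splits):
    """Cut [ring_start, ring_end) into n_splits contiguous half-open ranges.

    Integer arithmetic keeps the bounds exact even for 64-bit-sized rings.
    """
    width = ring_end - ring_start
    bounds = [ring_start + width * i // n_splits for i in range(n_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def range_queries(table, key, ranges):
    """Build one CQL string per token range, in the style shown on the slide."""
    return [
        f"SELECT * FROM {table} "
        f"WHERE TOKEN({key}) >= {lo} AND TOKEN({key}) < {hi}"
        for lo, hi in ranges
    ]


# Assumption for illustration: ring bounds taken from the slide, 3 workers.
ranges = split_token_ring(0, 2**64, 3)
queries = range_queries("particles", "id", ranges)
```

Each resulting query can be handed to a different worker, and together the ranges cover the whole ring exactly once, which is what lets the full table scan run in parallel.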
  • 17. Spark/Cassandra connector: benefits • Push-down filtering – Currently stable • select: vertical filtering • where(“country = ‘es’”) => it uses C* secondary indexes; the predicate is appended to the token filtering predicate – Since 1.2, still in RC, not stable • joinWithCassandraTable && repartitionByCassandraReplica: you can use an RDD to access all the matching rows in Cassandra. You don’t need a full table scan to do the join, BUT you perform a request for each row!
  • 18. Spark/Cassandra connector: benefits • Spark SQL integration! Yes, you read that right, SQL on NoSQL! • Spark Streaming integration • Mapping between Cassandra rows and objects • Implicit save to Cassandra: saveToCassandra
  • 19. Multidimensional indexes • Hierarchical structures that allow an efficient lookup of information when we set constraints on two or more attributes. • The best-known structures are: • Quad-trees • KD-trees • R-trees • What is important to take into consideration is that: 1. Each one fits some use cases better than others. 2. They all organize data hierarchically in trees.
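To make "hierarchical structures that allow an efficient lookup" concrete, here is a minimal KD-tree sketch for 3D points with a box query. It is a textbook-style illustration chosen from the three structures named on slide 19, not the index used in the talk; the dictionary-based node layout and function names are assumptions made for brevity.

```python
def build_kdtree(points, depth=0):
    """Recursively split the points on x, y, z in turn, at the median."""
    if not points:
        return None
    axis = depth % 3
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "point": pts[mid],
        "axis": axis,
        "left": build_kdtree(pts[:mid], depth + 1),       # coord <= split
        "right": build_kdtree(pts[mid + 1:], depth + 1),  # coord >= split
    }


def range_query(node, lo, hi, out=None):
    """Collect points p with lo[i] <= p[i] < hi[i] on every axis."""
    if out is None:
        out = []
    if node is None:
        return out
    p, axis = node["point"], node["axis"]
    if all(lo[i] <= p[i] < hi[i] for i in range(3)):
        out.append(p)
    # Descend only into subtrees that can still intersect the box.
    if lo[axis] <= p[axis]:
        range_query(node["left"], lo, hi, out)
    if p[axis] < hi[axis]:
        range_query(node["right"], lo, hi, out)
    return out
```

The pruning in the last two `if` statements is where the tree pays off: whole subtrees lying entirely outside the box on the current axis are never visited, which is exactly what the flat partition/clustering layout cannot offer for the x, y, z query.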
  • 22. Time for code • Find some examples at – https://github.com/cugni/meetupExamples
  • 23. No shortcut: make our own index We finally decided to create our own index on top of a key-value data store. • We create the indexes with Spark • We store the indexed data in Cassandra • Queries: – Low-latency ones: done by simply reading from Cassandra – Aggregations and complex ones: executed with Spark
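The slides do not describe how the custom index is laid out, so the following is only one possible shape for "an index on top of a key-value store", under loudly stated assumptions: a fixed-size grid cell is used as the key (the role a partition key would play), a plain dictionary stands in for the Cassandra table, and the grouping loop stands in for the Spark job that would build the index. `CELL`, `cell_of`, `build_index`, and `box_query` are all hypothetical names.

```python
from collections import defaultdict
from itertools import product
from math import floor

CELL = 0.5  # hypothetical cell edge length; a real index would tune this


def cell_of(point):
    """Map a 3D point to the integer coordinates of its grid cell."""
    return tuple(floor(c / CELL) for c in point)


def build_index(points):
    """Group points by cell (the part a batch job would do offline)."""
    index = defaultdict(list)
    for p in points:
        index[cell_of(p)].append(p)
    return dict(index)


def box_query(index, lo, hi):
    """Return points with lo[i] <= p[i] < hi[i], reading only the cells
    that overlap the box instead of scanning the whole store."""
    lo_cell, hi_cell = cell_of(lo), cell_of(hi)
    hits = []
    for cell in product(*(range(a, b + 1) for a, b in zip(lo_cell, hi_cell))):
        for p in index.get(cell, ()):  # one key lookup per candidate cell
            if all(l <= c < h for l, c, h in zip(lo, p, hi)):
                hits.append(p)
    return hits
```

The design point this illustrates is the division of labour named on the slide: building the grouping is a batch job, while answering a box query is a handful of key lookups, which is the kind of access a key-value store serves with low latency.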
  • 24. Application architecture [Diagram labels: entry point; simple queries go directly to Cassandra; aggregations are sent to Spark; Thrift RPC connection]
  • 25. Lessons learnt • Heap can be a problem with Cassandra and Spark on the same node • Compaction can be a problem • If your data is not uniformly distributed, neither will Spark's workload be • The fact that the API allows you to do something doesn’t mean you have to!
  • 26. Future work • Spark SQL integration – we can instruct Spark to create a query plan using our indexes. It must understand when it’s useful to use the index and when it is not • Streaming indexing – indexing and visualizing data while the simulations are being created
  • 27. Special thanks • A special thanks to the people of CASE, especially Antoni Artigues who is working with me on this project on the C/C++ Paraview side and on the simulations generated with Alya (http://www.bsc.es/alya)

Editor's notes

  1. Remind the audience what the partition key and the clustering key are
  2. ----- Meeting Notes (10/02/15 17:11) ----- slow