Relational model in Cassandra:
Will it fit?
Distributed Data Days SF, September 2018
Matija Gobec
matija.gobec@smartcat.io
@mad_max0204
Why this talk?
Agenda
Cassandra data model
Options and alternatives
UDT use case and Apache Spark
Who is using Cassandra?
What are the 3 major Cassandra issues?
Data model
Over-expectations
Poor resource planning
Cassandra data model
It’s simple
Map[k, Map[k, v]]
It sucks
Or not...
Data model
Cassandra data model
Slim row
  Primary key | DATA
Wide row
  Partition key | Clustering key: DATA | Clustering key: DATA | Clustering key: ...
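The Map[k, Map[k, v]] view of slim and wide rows can be sketched in plain Scala. This is only an illustration, not how Cassandra stores data internally: the names `Partition`, `Table`, `put` and `readPartition` are hypothetical, and keys and data are simplified to strings.

```scala
import scala.collection.immutable.SortedMap

object CassandraModelSketch {
  // Hypothetical aliases: a partition maps clustering keys (kept sorted) to row data.
  type Partition = SortedMap[String, String]
  type Table = Map[String, Partition]

  // Writes behave like upserts: an existing clustering key is simply overwritten.
  def put(table: Table, partitionKey: String, clusteringKey: String, data: String): Table = {
    val partition = table.getOrElse(partitionKey, SortedMap.empty[String, String])
    table.updated(partitionKey, partition.updated(clusteringKey, data))
  }

  // Reading a partition returns its rows in clustering-key order.
  def readPartition(table: Table, partitionKey: String): List[(String, String)] =
    table.get(partitionKey).map(_.toList).getOrElse(Nil)
}
```

A slim row is the special case where a partition holds a single entry; a wide row is a partition with many clustering keys.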
But my data model looks like this
Or even...
What are my options?
Denormalization
Query based data model
Relational model:
  Employee (EmployeeID, OrganizationID, Name)
  Organization (OrganizationID, Name)
Cassandra tables, one per query:
  1. Select all employees for a given organizationID
     OrganizationID | Name | Employee name | EmployeeID
  2. Select employee for a given employeeID
     EmployeeID | Name | OrganizationID
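A query based model means every write has to maintain one copy of the data per query. The sketch below models the two query tables as in-memory maps; `Tables`, `save`, `employeesOf` and `employee` are hypothetical names chosen for illustration, not part of any Cassandra driver.

```scala
object DenormalizationSketch {
  final case class Employee(employeeId: Long, organizationId: Long, name: String)

  // Two hypothetical query tables, one per access pattern.
  final case class Tables(
      employeesByOrganization: Map[Long, Map[Long, Employee]], // query 1
      employeeById: Map[Long, Employee]                         // query 2
  )

  val empty: Tables = Tables(Map.empty, Map.empty)

  // Each write touches both tables so that both queries stay answerable.
  def save(tables: Tables, e: Employee): Tables = {
    val inOrg = tables.employeesByOrganization.getOrElse(e.organizationId, Map.empty[Long, Employee])
    Tables(
      tables.employeesByOrganization.updated(e.organizationId, inOrg.updated(e.employeeId, e)),
      tables.employeeById.updated(e.employeeId, e)
    )
  }

  // Query 1: a single lookup by organization.
  def employeesOf(tables: Tables, organizationId: Long): Set[Employee] =
    tables.employeesByOrganization.getOrElse(organizationId, Map.empty[Long, Employee]).values.toSet

  // Query 2: a single lookup by employee.
  def employee(tables: Tables, employeeId: Long): Option[Employee] =
    tables.employeeById.get(employeeId)
}
```

Note that `save` does not clean up the old organization entry when an employee moves to another organization; that bookkeeping is the kind of data management cost denormalization brings.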
Denormalization
Application level joins
Relational model:
  Organization (OrganizationID, Name)
  Employee (EmployeeID, OrganizationID, Firstname, Lastname, Email, ...)
Cassandra tables:
  1. Select all employees for a given organization
     OrganizationID | Name | EmployeeID | EmployeeID | ...
  2. Select employee for a given employeeID
     EmployeeID | Name | OrganizationID
Results in multiple select statements
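Why an application level join results in multiple select statements can be shown with two in-memory lookups. This is a sketch under simplifying assumptions: the maps stand in for the two tables, and `employeesOf` is a hypothetical helper that also counts how many lookups it issued.

```scala
object ApplicationJoinSketch {
  final case class Organization(organizationId: Long, name: String, employeeIds: List[Long])
  final case class Employee(employeeId: Long, name: String, organizationId: Long)

  // Returns the employees of an organization plus the number of lookups performed:
  // one for the organization row, then one per referenced employee.
  def employeesOf(
      organizations: Map[Long, Organization],
      employees: Map[Long, Employee],
      organizationId: Long
  ): (List[Employee], Int) =
    organizations.get(organizationId) match {
      case None => (Nil, 1)
      case Some(org) =>
        val found = org.employeeIds.flatMap(id => employees.get(id).toList)
        (found, 1 + org.employeeIds.size)
    }
}
```

For an organization with N employees this issues 1 + N lookups, which is the cost the slide refers to.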
Denormalization
Secondary indexes
Relational model:
  Organization (OrganizationID, Name)
  Employee (EmployeeID, OrganizationID, Firstname, Lastname, Email, ...)
Cassandra table:
  EmployeeID | Name | OrganizationID | ...
  CREATE INDEX (secondary index)
  1. Select all employees for a given organization
  2. Select employee for a given employeeID
Performance impact
Denormalization
PROS
Fast reads
One query per request (usually)
Scalable (probably)
CONS
Complex data management
Can be extremely hard and complex on insert/update/delete
Need to know all queries upfront
UDTs
-- The type has to exist before the table that uses it
CREATE TYPE test.employee (
  employeeid bigint,
  firstname text,
  lastname text,
  email text
);
CREATE TABLE test.organization (
  organizationid bigint PRIMARY KEY,
  name text,
  employees list<frozen<employee>>
);
Row layout: OrganizationID | Name | Employees [Employee, Employee, ...]
UDTs
PROS
Fast(er) reads
One query per request
Scalable (should be!!)
Indexing?
CONS
Complex data management
No partial updates
Need to know all queries upfront
Indexing?
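"No partial updates" can be illustrated with immutable Scala values: because each employee in the list is a frozen UDT, changing one field means writing a complete replacement value. The sketch below is an in-memory analogy only; `changeEmail` is a hypothetical helper, not a driver call.

```scala
object FrozenUdtSketch {
  final case class Employee(employeeId: Long, firstname: String, lastname: String, email: String)
  final case class Organization(organizationId: Long, name: String, employees: List[Employee])

  // Read-modify-write: the whole Employee value is rebuilt even though only
  // the email changes, and the updated list is written back as a unit.
  def changeEmail(org: Organization, employeeId: Long, newEmail: String): Organization =
    org.copy(employees = org.employees.map { e =>
      if (e.employeeId == employeeId) e.copy(email = newEmail) else e
    })
}
```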
Blob data
CREATE TABLE test.organization (
  organizationid bigint PRIMARY KEY,
  name text,
  employees text  -- or blob
);
Row layout: OrganizationID | Name | Employees
Employees holds the employee list as JSON text or as a blob of serialized objects
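Storing the employee list as JSON text could look roughly like this. It is a minimal hand-rolled serializer for illustration, assuming the field names from the employee type above; it only escapes backslashes and double quotes, so a real application would more likely use a JSON library.

```scala
object JsonTextSketch {
  final case class Employee(employeeId: Long, firstname: String, lastname: String, email: String)

  // Minimal string escaping: backslash and double quote only (control characters are not handled).
  private def quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  def toJson(e: Employee): String =
    s"""{"employeeid":${e.employeeId},"firstname":${quote(e.firstname)},"lastname":${quote(e.lastname)},"email":${quote(e.email)}}"""

  // The value that would go into the single `employees text` column.
  def listToJson(employees: List[Employee]): String =
    employees.map(e => toJson(e)).mkString("[", ",", "]")
}
```

The database sees the result as one opaque string, which is why this option offers no partial updates and no indexing.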
Blob data
PROS
Fast reads
One query per request
No need to serialize into JSON
CONS
Complex data management
No partial updates
Need to know all queries upfront
No indexing option
Relational database
PROS
It’s made for relational data
CONS
Scaling
Availability
Fault tolerance
Performance
Other options
Cassandra+Indexing
Cassandra+RDB
...
Leveraging UDTs
Use case
Highly nested data model
Impossible to denormalize
Fairly simple access patterns
Top level (root) entity
Data model
Root entity
[Diagram: a tree of child entities nested several levels deep below the root entity]
How to insert data
INSERT ... JSON (Cassandra 2.2+)
Sent as a JSON string, stored in the column's own type
Easy to manage and debug
Keep track of the data size!!!
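Building an `INSERT ... JSON` statement while keeping track of the payload size might be sketched as follows. The `maxPayloadBytes` limit is a made-up number for illustration, not a Cassandra setting, and `insertJson` is a hypothetical helper; with a real driver a bind marker would usually be preferable to string concatenation.

```scala
import java.nio.charset.StandardCharsets

object InsertJsonSketch {
  // Hypothetical application-level limit, chosen only for this example.
  val maxPayloadBytes: Int = 1024 * 1024

  // Size of the JSON document as it would travel over the wire (UTF-8).
  def payloadBytes(json: String): Int =
    json.getBytes(StandardCharsets.UTF_8).length

  // Returns the CQL text, or an error message when the payload is too large.
  def insertJson(table: String, json: String): Either[String, String] = {
    val size = payloadBytes(json)
    if (size > maxPayloadBytes) {
      Left(s"payload is $size bytes, limit is $maxPayloadBytes")
    } else {
      // CQL string literals escape a single quote by doubling it.
      val escaped = json.replace("'", "''")
      Right(s"INSERT INTO $table JSON '$escaped'")
    }
  }
}
```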
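Spark dataframe UDT mapping
import org.apache.spark.sql.functions.{col, collect_list, struct}

// One to many: collect the child rows into a list of structs per join key
dataframe.as("parent").join(
  child.groupBy(seq.map(col): _*)
    .agg(collect_list(struct(columns.map(col): _*)).alias(alias)),
  seq, joinType
)

// One to one: wrap the child row in a single struct column
dataframe.join(
  child.withColumn(alias, struct(child.columns.map(col): _*))
    .select(joinColumn, alias),
  Seq(joinColumn), joinType
)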
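The one to many mapping is easier to follow on plain Scala collections. The sketch below is not Spark code: it mimics groupBy plus collect_list with `List.groupBy`, uses hypothetical row types, and returns an empty list for parents without children, where a Spark left join would produce null.

```scala
object NestingSketch {
  final case class EmployeeRow(employeeId: Long, organizationId: Long, firstname: String)
  final case class OrganizationRow(organizationId: Long, name: String)
  final case class NestedOrganization(organizationId: Long, name: String, employees: List[EmployeeRow])

  // Group the children by the join key, then attach each group to its parent.
  def nest(parents: List[OrganizationRow], children: List[EmployeeRow]): List[NestedOrganization] = {
    val byOrganization: Map[Long, List[EmployeeRow]] = children.groupBy(_.organizationId)
    parents.map { p =>
      NestedOrganization(p.organizationId, p.name, byOrganization.getOrElse(p.organizationId, Nil))
    }
  }
}
```

Applying the same step level by level, starting from the deepest children, builds up the nested structure that is then written as UDTs.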
Inserting from Spark
import org.apache.spark.sql.SaveMode

// Save to Cassandra
dataframe.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> keyspace, "table" -> table))
  .mode(SaveMode.Append)
  .save()
Indexing UDTs
Not possible with just Cassandra
Lucene/Solr based secondary index
Indexing of fields on nested UDTs
Field analyzers
Solr schema example
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="solrSchema" version="1.5">
  <types>
    <fieldType class="org.apache.solr.schema.TrieIntField" name="TrieIntField"/>
    <fieldType class="org.apache.solr.schema.TrieDateField" name="TrieDateField"/>
    ...
  </types>
  <fields>
    <field docValues="true" indexed="true" multiValued="false" name="partition_key" stored="true" type="TrieIntField"/>
    <field docValues="true" indexed="true" multiValued="false" name="clustering_key" stored="true" type="TrieDateField"/>
    <field docValues="true" indexed="true" multiValued="false" name="some_type.id" stored="true" type="TrieIntField"/>
    <field docValues="true" indexed="true" multiValued="false" name="some_type.some_other_type.id" stored="true" type="TrieIntField"/>
    ...
  </fields>
  <uniqueKey>(partition_key,clustering_key)</uniqueKey>
</schema>
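The dotted field names in the schema (`some_type.id`, `some_type.some_other_type.id`) follow the nesting of the UDTs. Generating them for a deeply nested model could be sketched as below; the `Field`, `Leaf` and `Udt` types are hypothetical and only model the shape of the schema, they are not part of Solr or Cassandra.

```scala
object SolrFieldPathSketch {
  // Hypothetical description of a column: either a plain field or a UDT with nested fields.
  sealed trait Field
  final case class Leaf(name: String) extends Field
  final case class Udt(name: String, fields: List[Field]) extends Field

  // Walk the nesting depth-first and join the names with dots.
  def paths(field: Field, prefix: String = ""): List[String] = field match {
    case Leaf(name)        => List(prefix + name)
    case Udt(name, fields) => fields.flatMap(f => paths(f, prefix + name + "."))
  }
}
```

For the two UDT fields in the example, `paths` applied to `some_type` containing `id` and a nested `some_other_type` with its own `id` yields exactly the two names used in the schema.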
But will it blend?
Closing notes
Cassandra data model supports a lot of use cases
Data modeling skills are required
Relational model is hard but not impossible
Additional tools in the ecosystem
Don’t be stubborn
Q&A
Matija Gobec
matija.gobec@smartcat.io
@mad_max0204
Thank you
