NoSQL database systems are designed for scalability. The downside is a primitive key-value
data model and, as the name suggests, no support for SQL. That might sound like a serious
limitation – how can I “select”, “join”, “group” and “sort” the data? This post explains how all
these operations can be implemented quite naturally and efficiently in one of the most famous
NoSQL systems – Cassandra.
To understand this post you need to know the Cassandra data model. You can find a quick
introduction in my previous post. The power of the Cassandra data model is that it extends a
basic key-value store with efficient data nesting (via columns and super columns). It means that
you can read/update a column (or a super column) without retrieving the whole record. Below I
describe how we can exploit data nesting to support various query operations.
Let’s consider a basic example: departments and employees, with a one-to-many relationship
between them. So we have two column families: Emps and Deps. In Emps employee IDs are used
as keys and there are Name, Birthdate, and City columns. In Deps keys are department IDs
and the single column is Name.
1) Select
For example: select * from Emps where Birthdate = '25/04/1975'
To support this query we need to add one more column family named Birthdate_Emps, in
which the key is a date and the column names are the IDs of the employees born on that date.
The values are not used here and can be an empty byte array (denoted “-”). Every time a
new employee is inserted into (or deleted from) Emps we need to update Birthdate_Emps. To
execute the query we just retrieve all the columns for the
key '25/04/1975' from Birthdate_Emps.
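For illustration, the index row might look like this after two matching employees are inserted (the employee IDs 101 and 102 are made up for the example):
Birthdate_Emps = {
    '25/04/1975': { 101: '-', 102: '-' }
}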
Notice that Birthdate_Emps is essentially an index that allows us to execute the query very
efficiently. And this index is scalable because it is distributed across Cassandra nodes. You can go
even further and speed up the query by redundantly storing information about employees (i.e. the
employee’s columns from Emps) in Birthdate_Emps. In this case employee IDs become the names
of super columns that contain the corresponding employee columns.
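In that denormalized variant the same row might look like this (the names and cities are again made up):
Birthdate_Emps = {
    '25/04/1975': {
        101: { Name: 'Alice', City: 'London' },
        102: { Name: 'Bob', City: 'Paris' }
    }
}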
2) Join
For example: select * from Emps e, Deps d where e.dep_id = d.dep_id
What does a join essentially mean? It constructs records that represent relationships between
entities. Such relationships can be easily (and even more naturally) represented via nesting. To
do that, add a column family Dep_Emps in which the key is a department ID and the column names
are the IDs of the corresponding employees.
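For example (the department and employee IDs are hypothetical):
Dep_Emps = {
    'dep_1': { 101: '-', 102: '-' },
    'dep_2': { 103: '-' }
}
Reading the row for a department and then the referenced rows from Emps reconstructs exactly what the join would return.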
3) Group By
For example: select count(*) from Emps group by City
From an implementation viewpoint, Group By is very similar to the select/indexing described above.
You just need to add a column family City_Emps with cities as keys and employee IDs as
column names. In this case you count the number of employees on retrieval. Alternatively, you can
have a single column named count whose value is the pre-calculated number of employees in
the city.
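Both variants, sketched with made-up data:
City_Emps = {
    'London': { 101: '-', 102: '-' }    (count the columns on retrieval)
}
City_Emps = {
    'London': { count: 2 }    (pre-calculated count, updated on insert/delete)
}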
4) Order By
To keep data sorted in Cassandra you can use two mechanisms: (a) records can be sorted by
keys using OrderPreservingPartitioner together with range queries (more on this in Cassandra:
RandomPartitioner vs OrderPreservingPartitioner); (b) nested data can be kept sorted via the
automatically maintained ordering of column names.
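For example, to read a department’s employees sorted by name, the employee name (suffixed with the ID for uniqueness) can serve as the column name – a sketch with made-up values:
Dep_Emps = {
    'dep_1': { 'Alice:101': '-', 'Bob:102': '-' }
}
Cassandra keeps these columns ordered by the configured comparator, so a slice returns them already sorted.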
To support all these operations we store redundant data optimized for each particular query. This
has two implications:
1) You must know the queries in advance (i.e. there is no support for ad-hoc queries). However, in
Web applications and enterprise OLTP applications the queries are typically well known in advance,
few in number, and do not change often. Read Mike Stonebraker arguing this point convincingly. BTW,
the Constraint Tree Schema, described in the latter paper, also exploits nesting to organize data for
predefined queries.
2) We shift the burden from querying to updating, because what we essentially maintain are
materialized views (i.e. pre-computed query results). But this makes a lot of sense with
Cassandra, since Cassandra is heavily optimized for updates (thanks to eventual
consistency and the “log-structured” storage borrowed from Google BigTable). So we can use fast
updates to speed up query execution. Moreover, the use cases typical for social applications have
proven to scale only with the push-on-change model (i.e. data propagated in advance via
updates, with simple queries – the approach taken in this post) as opposed to the pull-on-
demand model (i.e. data stored normalized and combined by queries on demand – the classical
relational approach). On push-on-change versus pull-on-demand, read WHY ARE FACEBOOK,
DIGG, AND TWITTER SO HARD TO SCALE?
Considerations for NoSQL
• Do you need a more flexible data model to manage data that goes beyond a rigid
RDBMS table/row structure and instead includes a combination of structured,
semi-structured, and unstructured data?
• Do you need continuous availability with redundancy in both data and function
across one or more locations, versus simple failover for the database?
• Do you need a database that runs over multiple data centers / cloud availability
zones?
• Do you need to handle high-velocity data coming in via sensors, mobile devices,
and the like, with extreme write speed and low-latency query speed?
• Do you need to go beyond single-machine limits for scale-up and instead use a
scale-out architecture that supports the easy addition of more processing power and
storage capacity?
• Do you need to run different workloads (e.g. online, analytics, search) on the same
data without manually ETL-ing the data to separate systems/machines?
• Do you need to manage a widely distributed system with minimal staff?
MIGRATING DATA
Moving data from an RDBMS or another database to Cassandra is generally quite easy. The
following options exist for migrating data to Cassandra:
• COPY command - CQL provides a COPY command (very similar to the one in Postgres) that
loads data from an operating system file into a Cassandra table (example after this list). Note
that this is not recommended for very large files.
• Bulk loader - this utility is designed for quickly loading a Cassandra table from a file
that is delimited in some way (e.g. comma- or tab-separated).
• Sqoop - Sqoop is a utility used in Hadoop to load data from RDBMSs into a
Hadoop cluster. DataStax supports pipelining data directly from an RDBMS table
into a Cassandra table.
• ETL tools - there are a variety of ETL tools (e.g. Informatica) that support
Cassandra as both a source and target data platform. Many of these tools not only
extract and load data but also provide transformation routines that can manipulate the
incoming data in many ways. A number of these tools are also free to use (e.g.
Pentaho, Jaspersoft, Talend).
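As an example of the COPY command, a minimal cqlsh invocation (the table, column, and file names are assumptions for illustration):
COPY emps (emp_id, name, birthdate, city) FROM 'emps.csv' WITH HEADER = true;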
Advanced Command Line Performance Monitoring Tools
The Performance Service maintains the following levels of performance information:
• System level - supplies general memory, network, and thread pool statistics.
• Cluster level - provides metrics at the cluster, data center, and node level.
• Database level - provides drill-down metrics at the keyspace, table, and table-per-node
level.
• Table histogram level - delivers histogram metrics for tables being accessed.
• Object I/O level - supplies metrics concerning 'hot objects'; data on what objects
are being accessed the most.
• User level - provides metrics concerning user activity, 'top users' (those consuming
the most resources on the cluster) and more.
• Statement level - captures queries that exceed a certain response time threshold
along with all their relevant metrics.
Once the service has been configured and is running, statistics are
populated in their associated tables, which are stored in a special keyspace (dse_perf). You
can then query the various performance tables to get statistics such as the I/O metrics
for certain objects.
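For example, a minimal query (assuming a DSE version where the Performance Service object I/O tables, such as object_io, are enabled):
SELECT * FROM dse_perf.object_io LIMIT 10;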
Finding and Troubleshooting Problem Queries
You can use the DataStax Enterprise Performance Service to automatically capture
long-running queries (based on response time thresholds you specify) and then query the
performance table that holds those statements.
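A minimal sketch, assuming the slow-query table name (node_slow_log) used by recent versions of the DSE Performance Service:
SELECT * FROM dse_perf.node_slow_log LIMIT 10;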
The trace information is stored in the system_traces keyspace, which holds two tables:
sessions and events.
You can also trace an individual query, much like an explain plan in an RDBMS.
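For example, in cqlsh (the query being traced is illustrative):
TRACING ON;
SELECT * FROM Emps WHERE emp_id = 101;
SELECT * FROM system_traces.sessions LIMIT 5;
SELECT * FROM system_traces.events LIMIT 5;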
Cassandra data modelling best practices:
1. Using composite types through low-level API clients is not recommended.
2. Super column family use is not recommended, as it deserializes all of the subcolumns on
access, as opposed to deserializing a single column.
3. We can create wide rows (many columns, relatively few rows) and skinny rows (few columns,
many rows).
4. Valueless columns: if the row ID is a composite {City + uid} and we only need to write/read
the City, the value can be empty – the column name itself carries the data (a valueless column).
5. Columns can expire based on a TTL set in seconds.
6. Counter columns store a number that incrementally counts the occurrences of a
particular event or process. For example, you might use a counter column to count the number
of times a page is viewed.
7. Keyspace: a cluster has one keyspace per application; it is the
top-level container for Column Families.
Column Family: a container for Row Keys and their Columns.
Row Key: the unique identifier for data stored within a Column Family.
Column: a name-value pair with one additional field: a timestamp.
Super Column: a column whose value is itself a dictionary of Columns.
8. Random Partitioner is the recommended partitioning scheme. It has the following advantages
over ordered partitioning with the Byte Ordered Partitioner (BOP), detailed in item 11 below.
Random Partitioner (RP): it uses a hash of the Row Key to determine which node in the cluster will be
responsible for the data. The hash value is generated by applying MD5 to the Row Key. Each node
in a data center is assigned a section of this range (a token) and is responsible for
storing the data whose Row Key hash falls within that range.
Token Range = (2^127) ÷ (# of nodes in the cluster)
If the cluster spans multiple data centers, tokens are created for each data
center individually, which gives better balance.
Byte Ordered Partitioner (BOP): it allows you to calculate your own tokens and assign them to nodes
yourself, as opposed to the Random Partitioner doing this for you automatically.
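For example, in a hypothetical four-node, single data center cluster, each node is assigned a contiguous range of 2^127 ÷ 4 ≈ 4.25 × 10^37 tokens, and the first copy of a row goes to the node whose range contains the MD5 hash of its Row Key.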
9. Partitioning => picking the node to store the first copy of the data on.
Replication => picking additional nodes to store more copies of the data.
Storage path: writes go to the commit log (durability) and to memtables (in-memory structures);
memtables are flushed to SSTables, which compaction later merges to remove stale data and
tombstones (markers indicating deleted data).
10. The binary protocol is faster than Thrift.
11. Why RP?
1. RP ensures that the data is evenly distributed across all nodes in the cluster and does not create
data hotspots, as BOP can.
2. When a new node is added to the cluster, RP can quickly assign it a new token range and
move the minimum amount of data from other nodes to the new node now responsible
for it. With BOP, this has to be done manually.
3. Multiple column families issue: BOP can cause uneven distribution of data if you have
multiple column families.
4. The only benefit BOP has over RP is that it allows row slices: you can obtain a
cursor, as in an RDBMS, and move over your rows in key order.
12. Think of a column family as a map of maps:
SortedMap<RowKey, SortedMap<ColumnKey, ColumnValue>>
A map gives efficient key lookup, and the sorted nature gives efficient scans. In Cassandra, we
can use row keys and column keys to do efficient lookups and range scans.
13. The number of column keys is unbounded; in other words, you can have wide rows.
A key can itself hold a value; in other words, you can have a valueless column.
14. You need to pass a timestamp with each column value, which Cassandra uses internally for
conflict resolution. However, the timestamp can safely be ignored during modeling.
15. Start with the query patterns and create an ER model; then start denormalizing and duplicating.
This helps to identify the most frequent query patterns and isolate the less frequent ones.
Query pattern:
Get user by user id
Get item by item id
Get all the items that a particular user likes
Get all the users who like a particular item
Option 1: Exact replica of relational model.
Option 2: Normalized entities with custom indexes
Option 3: Normalized entities with de-normalization into custom indexes
Option 4: Partially de-normalized entities
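As a sketch of Option 3 (normalized entities plus custom indexes carrying denormalized data) for the query patterns above – all table and column names here are assumptions for illustration:
create table items_by_user (
user_id int,
item_id int,
item_title text,   -- denormalized from the item entity
primary key (user_id, item_id));
create table users_by_item (
item_id int,
user_id int,
user_name text,    -- denormalized from the user entity
primary key (item_id, user_id));
Each table answers one query pattern with a single partition read, at the cost of extra writes whenever a "like" is added or removed.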
Keyspaces: a keyspace is the container for column families, and a cluster has one keyspace per application.
CREATE KEYSPACE keyspace_name WITH
strategy_class = 'SimpleStrategy'
AND strategy_options:replication_factor='2';
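The statement above uses the old CQL 2 syntax; the CQL 3 equivalent is:
CREATE KEYSPACE keyspace_name
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};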
Single device per row - Time Series Pattern 1
Partitioning to limit row size - Time Series Pattern 2
The solution is a pattern called row partitioning: add data to the row key to limit the
number of columns you get per device.
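A minimal sketch of this pattern, anticipating the weather-station table below (the added date column is an assumption for illustration):
CREATE TABLE temperature_by_day (
weatherstation_id text,
date text,
event_time timestamp,
temperature text,
PRIMARY KEY ((weatherstation_id, date), event_time));
Here (weatherstation_id, date) is a composite partition key, so each device gets a new row every day and no single row grows without bound.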
Reverse-order time series with expiring columns -
Time Series Pattern 3
Suppose a dashboard application only needs to show the last 10 temperature readings; with
a TTL (time to live) on each value this is possible.
CREATE TABLE latest_temperatures (
weatherstation_id text,
event_time timestamp,
temperature text,
PRIMARY KEY (weatherstation_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
INSERT INTO latest_temperatures(weatherstation_id,event_time,temperature) VALUES
('1234ABCD','2013-04-03 07:03:00','72F') USING TTL 20;
For comparison, here is an Inbound messages table as a relational (MySQL) schema:
create table Inbound (
InboundID int not null primary key auto_increment,
ParticipantID int not null,
FromParticipantID int not null,
Occurred date not null,
Subject varchar(50) not null,
Story text not null,
foreign key (ParticipantID) references Participant(ParticipantID),
foreign key (FromParticipantID) references Participant(ParticipantID));
And the same table remodeled for Cassandra, with Occurred as a timeuuid clustering column:
create table Inbound (
ParticipantID int,
Occurred timeuuid,
FromParticipantID int,
Subject text,
Story text,
primary key (ParticipantID, Occurred));
Finally, a step-by-step method for deriving and validating the data model:
1. Define the User Scenarios: this ensures User participation and commitment.
2. Define the Steps in each Scenario: clarify the User interaction.
3. Derive the Data Model: use a modelling tool, such as Data Architect or ERwin, to generate SQL.
4. Relate Data Entities to each Step: create a cross-reference matrix to check results.
5. Identify Transactions for each Entity: confirm that each Entity has Transactions to load and read data.
6. Prepare sample Data: in collaboration with the Users.
7. Prepare Test Scripts: agree sign-off with the Users.
8. Define a Load Sequence: Reference Data first, basics such as Products, any existing Users or Customers, etc.
9. Run the Test Scripts: get User sign-off to record progress.