Modeling Data In Cassandra
     Conceptual Differences Versus RDBMS
    Matthew F. Dennis, DataStax // @mdennis




June 27, 2012
Cassandra Is Not Relational
get out of the relational mindset when working
  with Cassandra (or really any NoSQL DB)
Work Backwards From Queries
   Think in terms of queries, not in terms of
normalizing the data; in fact, you often want to
  denormalize (already common in the data
    warehousing world, even in RDBMS)
OK great, but how do I do that?
Well, you need to know how Cassandra models
     data (its model is based on Google Bigtable)

   research.google.com/archive/bigtable-osdi06.pdf



   Go Read It!
In Cassandra:

➔ data is organized into Keyspaces (usually one per app)

➔ each Keyspace can have multiple Column Families

➔ each Column Family can have many Rows

➔ each Row has a Row Key and a variable number of Columns

➔ each Column consists of a Name, Value and Timestamp
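The hierarchy above can be pictured as nested maps. This is only a minimal in-memory sketch, not how Cassandra stores anything on disk; the keyspace name "my_app", the column family name "users" and the timestamps are made up for illustration.

```python
from collections import namedtuple

# A column is a (name, value, timestamp) triple.
Column = namedtuple("Column", ["name", "value", "timestamp"])

# keyspace -> column family -> row key -> {column name: Column}
# "my_app", "users" and the timestamp 1 are hypothetical.
keyspaces = {
    "my_app": {
        "users": {
            "thepaul": {
                "office": Column("office", "Austin", 1),
                "OS": Column("OS", "OSX", 1),
                "twitter": Column("twitter", "thepaul0", 1),
            },
            # Rows in the same column family need not have the same columns.
            "thobbs": {
                "office": Column("office", "Austin", 1),
                "twitter": Column("twitter", "tylhobbs", 1),
            },
        }
    }
}


def get_column(keyspace, column_family, row_key, column_name):
    """Walk the hierarchy top-down and return a single Column."""
    return keyspaces[keyspace][column_family][row_key][column_name]


print(get_column("my_app", "users", "thepaul", "office").value)
```

Note that the two rows carry a different number of columns, which is the "variable number of Columns" point from the list.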
In Cassandra, Keyspaces:

➔ are similar in concept to a “database” in some RDBMSs

➔ are stored in separate directories on disk

➔ are usually one-to-one with applications

➔ are usually the administrative unit for things related to ops

➔ contain multiple column families
In Cassandra, In Keyspaces, Column Families:
   ➔ are similar in concept to a “table” in most RDBMSs

   ➔ are stored in separate files on disk (many per CF)

   ➔ are usually approximately one-to-one with query type

   ➔ are usually the administrative unit for things related to your data

   ➔ can contain many (~billion* per node) rows




* for a good sized node
(you can always add nodes)
In Cassandra, In Keyspaces, In Column Families ...

 thepaul   office: Austin      OS: OSX          twitter: thepaul0

 mdennis   office: UA          OS: Linux        twitter: mdennis

 thobbs    office: Austin      twitter: tylhobbs

   ➔ Rows: each line above is one row

   ➔ Row Keys: thepaul, mdennis, thobbs

   ➔ Columns: each name: value pair within a row

   ➔ Column Names: office, OS, twitter

   ➔ Column Values: Austin, OSX, thepaul0, ...

   ➔ Rows Are Randomly Ordered
       (if using the RandomPartitioner)

   ➔ Columns Are Ordered by Name
       (by a configurable comparator)
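Why "randomly ordered"? The RandomPartitioner places rows by an MD5 hash of the row key rather than by the key itself. A rough sketch of the idea, assuming a simplified token (the raw MD5 digest read as an integer; the real token computation differs in detail):

```python
import hashlib


def token(row_key: str) -> int:
    """Simplified token: the MD5 digest of the row key read as an integer."""
    return int(hashlib.md5(row_key.encode("utf-8")).hexdigest(), 16)


row_keys = ["mdennis", "thepaul", "thobbs"]

# Order rows by token instead of by key: the result is stable for a given
# set of keys but bears no relation to alphabetical order.
by_token = sorted(row_keys, key=token)
print(by_token)
```

The practical consequence is that a range scan over row keys is not meaningful under this partitioner, while a range over column names inside one row is.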
Columns are ordered because
 doing so allows very efficient
implementations of useful and
     common operations

        (e.g. merge join)
In particular, within a row
columns with a given name can
    be located very quickly.
(ordered names => O(log n) binary search)
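A minimal sketch of that lookup, using Python's `bisect` module over a sorted list of column names. The parallel-list layout and the `find_column` helper are illustrative assumptions, not Cassandra's actual storage format.

```python
from bisect import bisect_left


def find_column(sorted_names, values, name):
    """Binary-search the sorted column names; return the value or None."""
    i = bisect_left(sorted_names, name)
    if i < len(sorted_names) and sorted_names[i] == name:
        return values[i]
    return None


# Column names kept sorted (plain string comparison stands in for the
# configurable comparator); values sit at the same index as their name.
names = ["OS", "office", "twitter"]
values = ["Linux", "UA", "mdennis"]

print(find_column(names, values, "office"))
print(find_column(names, values, "email"))
```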
More importantly, I can query for a
      slice between a start and end

                 Row Key

RK   ts0   ts1   ...   ...   tsM ...   ...   ...   ...   tsN ...   ...   ...   ...   ...


 start                                                                         end
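The slice in the picture can be sketched the same way: two binary searches find the start and end positions, and everything between them is contiguous. The `Row` class below is a hypothetical in-memory stand-in with integer column names playing the role of ts0 ... tsN.

```python
from bisect import bisect_left, bisect_right


class Row:
    """One row: column names kept sorted, values stored alongside them."""

    def __init__(self, columns):
        pairs = sorted(columns.items())
        self.names = [name for name, _ in pairs]
        self.values = [value for _, value in pairs]

    def slice(self, start, end):
        """Return (name, value) pairs with start <= name <= end."""
        lo = bisect_left(self.names, start)
        hi = bisect_right(self.names, end)
        return list(zip(self.names[lo:hi], self.values[lo:hi]))


# Column names are timestamps 0, 10, ..., 90 (made-up data).
row = Row({ts: f"event{ts}" for ts in range(0, 100, 10)})
print(row.slice(25, 60))
```

The start and end bounds do not have to be existing column names; the slice simply returns whatever falls between them.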
Why does that matter?
Because the columns within a row don’t have to be static!
    (and random disk seeks are teh evil)
The Column Name Can Be Part of Your Data

  INTC     ts0: $25.20         ts1: $25.25             ...


  AMR       ts0: $6.20          ts9: $0.26             ...


  CRDS      ts0: $1.05          ts5: $6.82             ...




                  Columns Are Ordered by Name
                   (in this case by a TimeUUID Comparator)
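A small sketch of this pattern, assuming integer timestamps as column names in place of TimeUUIDs and prices kept as strings. The `record_tick` and `ticks_between` helpers are hypothetical names for illustration.

```python
from bisect import bisect_left, insort

# ticker (row key) -> list of (timestamp, price), kept sorted by timestamp.
# The timestamp is the column name, so it is part of the data itself.
ticks = {}


def record_tick(ticker, timestamp, price):
    """Insert a tick in sorted position, even if it arrives out of order."""
    insort(ticks.setdefault(ticker, []), (timestamp, price))


def ticks_between(ticker, start, end):
    """All ticks for one ticker with start <= timestamp <= end."""
    row = ticks.get(ticker, [])
    lo = bisect_left(row, (start,))
    hi = bisect_left(row, (end + 1,))  # integer timestamps assumed
    return row[lo:hi]


record_tick("AMR", 9, "0.26")   # arrives before the earlier tick
record_tick("AMR", 0, "6.20")
record_tick("INTC", 0, "25.20")
record_tick("INTC", 1, "25.25")

print(ticks["AMR"])
print(ticks_between("INTC", 1, 5))
```

Each ticker's history lives in one row, already in time order, so "the last N ticks for AMR" is one contiguous read.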
Turns Out That Pattern Comes Up A Lot
  ➔ stock ticks
  ➔ event logs

  ➔ ad clicks/views

  ➔ sensor records

  ➔ access/error logs

  ➔ plane/truck/person/“entity” locations

  ➔…
OK, but I can do that in SQL
Not efficiently at scale, at least not easily ...
How it Looks In a RDBMS
                    ticker   timestamp   bid   ask   ...
                    AMR      ts0         ...   ...   ...
                    ...      ...         ...   ...   ...
                    CRDS     ts0         ...   ...   ...
                    ...      ...         ...   ...   ...
                    ...      ts0         ...   ...   ...
                    AMR      ts1         ...   ...   ...
                    ...      ...         ...   ...   ...
                    ...      ...         ...   ...   ...
                    ...      ts1         ...   ...   ...
                    AMR      ts2         ...   ...   ...
                    ...      ts2         ...   ...   ...

Data I Care About: the AMR rows, scattered among everything else
How it Looks In a RDBMS
             ticker     timestamp   bid   ask   ...
             AMR        ts0         ...   ...   ...



                      Larger Than Your Page Size
Disk Seeks
             AMR        ts1         ...   ...   ...


                      Larger Than Your Page Size

             AMR        ts2         ...   ...   ...
             ...        ts2         ...   ...   ...
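A toy model of the problem, assuming a table whose rows are laid out in arrival order and a made-up page size of 4 rows. It counts how many distinct pages hold at least one AMR row; counting pages is only a stand-in for seeks, not a measurement of any real database.

```python
def pages_touched(row_tickers, wanted, rows_per_page):
    """Number of distinct pages containing at least one row for `wanted`."""
    return len({
        position // rows_per_page
        for position, ticker in enumerate(row_tickers)
        if ticker == wanted
    })


tickers = ["AMR", "CRDS", "INTC"]
rows_per_page = 4  # hypothetical page size, in rows

# Arrival order: at every timestamp, one row per ticker.
interleaved = [ticker for _ in range(100) for ticker in tickers]

# Same rows, physically grouped by ticker.
clustered = sorted(interleaved)

print(pages_touched(interleaved, "AMR", rows_per_page))  # 75: every page
print(pages_touched(clustered, "AMR", rows_per_page))    # 25: a third of them
```

With more tickers between consecutive AMR rows, the interleaved layout approaches one page per row of interest, which is the "Larger Than Your Page Size" gap in the picture.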
OK, but what about ...

➔ PostgreSQL CLUSTER command (CLUSTER ... USING)?

➔ MySQL [InnoDB] clustered indexes?

➔ Oracle Index Organized Tables?

➔ SQLServer Clustered Index?

    Meh ...
The on-disk management of that
        clustering results in tons of IO ...

In the case of PostgreSQL:

➔ clustering is a one-time operation
    (implies you must periodically rewrite the entire table)

➔ new data is *not* written in clustered order
    (which is often the data you care most about)
OK, so just partition the tables ...
Not a bad idea, except in MySQL there is a limit of
 1024 partitions, and generally fewer if using NDB

 (you should probably still do it if using MySQL though)

  http://dev.mysql.com/doc/refman/5.5/en/partitioning-limitations.html
OK fine, I agree storing data that is queried
       together on disk together is a good thing but
          what's that have to do with modeling in
                        Cassandra?

        Seek To Here


 RK    ts0   ts1   ...   ...   tsM ...   ...   ...   ...   tsN ...   ...   ...   ...   ...



                                  Read Precisely My Data *



* more on some caveats later
Well, that's what is meant by “work backwards
from your queries” or “think in terms of queries”

(NB: this concept, in general, applies to RDBMS
 at scale as well; it is not specific to Cassandra)
An Example From Fraud Detection
  To calculate risk it is common to need to know all the
 emails, destinations, origins, devices, locations, phone
numbers, et cetera ever used for the account in question
In a normalized model that usually translates to a
          table for each type of entity being tracked

                id          name         ...           id          device         ...
                1           guy          ...           1000        0xdead         ...
                2           gal          ...           2000        0xb33f         ...
                ...         ...          ...           ...         ...            ...


id       dest         ...          id          email         ...            id          origin    ...
15       USA          ...          100         guy@          ...            150         USA       ...
25       Finland      ...          200         gal@          ...            250         Nigeria   ...
...      ...          ...          ...         ...           ...            ...         ...       ...
The problem is that at scale that also means
        a disk seek for each one …
    (even for perfect IOT et al if across multiple tables)




➔Previous emails? That's a seek …
➔Previous devices? That's a seek …

➔Previous destinations? That's a seek ...
But In Cassandra I Store The Data I Query
           Together On Disk Together
               (remember, column names need not be static)


  Data I Care About

acctY    ...          ...          ...       ...        ...      ...         ...
acctX    dest21       dev2         dev7        email3   email9   orig4       ...
acctZ    ...          ...          ...       ...        ...      ...         ...



                            email:cassandra@mailinator.com = dateEmailWasLastUsed




                            Column Name                                  Column Value
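A sketch of that layout, assuming column names of the form "kind:item" (as in the email example above) with the last-used date as the value. The `AccountRow` class, its method names and the dates are hypothetical.

```python
from bisect import bisect_left, insort


class AccountRow:
    """One account's row: sorted 'kind:item' column names -> last-used date."""

    def __init__(self):
        self.names = []    # sorted column names
        self.values = {}   # column name -> date last used

    def touch(self, kind, item, last_used):
        """Record that `item` of type `kind` was used on `last_used`."""
        name = f"{kind}:{item}"
        if name not in self.values:
            insort(self.names, name)
        self.values[name] = last_used

    def all_of(self, kind):
        """Slice out every column whose name starts with 'kind:'."""
        prefix = kind + ":"
        lo = bisect_left(self.names, prefix)
        hi = bisect_left(self.names, kind + ";")  # ';' sorts right after ':'
        return [(n[len(prefix):], self.values[n]) for n in self.names[lo:hi]]

    def everything(self):
        """The whole row in one pass: all entities ever used by the account."""
        return [(n, self.values[n]) for n in self.names]


acct = AccountRow()
acct.touch("email", "cassandra@mailinator.com", "2012-06-01")
acct.touch("dev", "0xdead", "2012-05-20")
acct.touch("dest", "Finland", "2012-04-02")
acct.touch("email", "guy@example.com", "2012-01-15")

print(acct.all_of("email"))
print(len(acct.everything()))
```

Everything needed to score the account sits in one row, so the risk calculation reads one contiguous range instead of visiting a table per entity type.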
Don't treat Cassandra (or any DB) as a black box
  ➔Understand how your DBs (and data structures) work

  ➔Understand the building blocks they provide

  ➔Understand the work complexity (“big O”) of queries

  ➔For data sets > memory, goal is to minimize seeks *




* on a related note, SSDs are awesome
Q?

Contenu connexe

Tendances

Guaranteeing Memory Safety in Rust
Guaranteeing Memory Safety in RustGuaranteeing Memory Safety in Rust
Guaranteeing Memory Safety in Rustnikomatsakis
 
Rust tutorial from Boston Meetup 2015-07-22
Rust tutorial from Boston Meetup 2015-07-22Rust tutorial from Boston Meetup 2015-07-22
Rust tutorial from Boston Meetup 2015-07-22nikomatsakis
 
Rust: Reach Further (from QCon Sao Paolo 2018)
Rust: Reach Further (from QCon Sao Paolo 2018)Rust: Reach Further (from QCon Sao Paolo 2018)
Rust: Reach Further (from QCon Sao Paolo 2018)nikomatsakis
 
Better Web Clients with Mantle and AFNetworking
Better Web Clients with Mantle and AFNetworkingBetter Web Clients with Mantle and AFNetworking
Better Web Clients with Mantle and AFNetworkingGuillermo Gonzalez
 
Windows 10 Nt Heap Exploitation (Chinese version)
Windows 10 Nt Heap Exploitation (Chinese version)Windows 10 Nt Heap Exploitation (Chinese version)
Windows 10 Nt Heap Exploitation (Chinese version)Angel Boy
 
Rust "Hot or Not" at Sioux
Rust "Hot or Not" at SiouxRust "Hot or Not" at Sioux
Rust "Hot or Not" at Siouxnikomatsakis
 
The State of NoSQL
The State of NoSQLThe State of NoSQL
The State of NoSQLBen Scofield
 
Look Ma, “update DB to HTML5 using C++”, no hands! 
Look Ma, “update DB to HTML5 using C++”, no hands! Look Ma, “update DB to HTML5 using C++”, no hands! 
Look Ma, “update DB to HTML5 using C++”, no hands! aleks-f
 
SICP_2.5 일반화된 연산시스템
SICP_2.5 일반화된 연산시스템SICP_2.5 일반화된 연산시스템
SICP_2.5 일반화된 연산시스템HyeonSeok Choi
 
Dynamic C++ ACCU 2013
Dynamic C++ ACCU 2013Dynamic C++ ACCU 2013
Dynamic C++ ACCU 2013aleks-f
 
MacOS memory allocator (libmalloc) Exploitation
MacOS memory allocator (libmalloc) ExploitationMacOS memory allocator (libmalloc) Exploitation
MacOS memory allocator (libmalloc) ExploitationAngel Boy
 
Clojure: The Art of Abstraction
Clojure: The Art of AbstractionClojure: The Art of Abstraction
Clojure: The Art of AbstractionAlex Miller
 
Apache Cassandra in Bangalore - Cassandra Internals and Performance
Apache Cassandra in Bangalore - Cassandra Internals and PerformanceApache Cassandra in Bangalore - Cassandra Internals and Performance
Apache Cassandra in Bangalore - Cassandra Internals and Performanceaaronmorton
 

Tendances (20)

Guaranteeing Memory Safety in Rust
Guaranteeing Memory Safety in RustGuaranteeing Memory Safety in Rust
Guaranteeing Memory Safety in Rust
 
Rust tutorial from Boston Meetup 2015-07-22
Rust tutorial from Boston Meetup 2015-07-22Rust tutorial from Boston Meetup 2015-07-22
Rust tutorial from Boston Meetup 2015-07-22
 
8 - OOP - Syntax & Messages
8 - OOP - Syntax & Messages8 - OOP - Syntax & Messages
8 - OOP - Syntax & Messages
 
Rust: Reach Further (from QCon Sao Paolo 2018)
Rust: Reach Further (from QCon Sao Paolo 2018)Rust: Reach Further (from QCon Sao Paolo 2018)
Rust: Reach Further (from QCon Sao Paolo 2018)
 
11 bytecode
11 bytecode11 bytecode
11 bytecode
 
Better Web Clients with Mantle and AFNetworking
Better Web Clients with Mantle and AFNetworkingBetter Web Clients with Mantle and AFNetworking
Better Web Clients with Mantle and AFNetworking
 
Senten500.c
Senten500.cSenten500.c
Senten500.c
 
Introduction to Rust
Introduction to RustIntroduction to Rust
Introduction to Rust
 
Windows 10 Nt Heap Exploitation (Chinese version)
Windows 10 Nt Heap Exploitation (Chinese version)Windows 10 Nt Heap Exploitation (Chinese version)
Windows 10 Nt Heap Exploitation (Chinese version)
 
Rust "Hot or Not" at Sioux
Rust "Hot or Not" at SiouxRust "Hot or Not" at Sioux
Rust "Hot or Not" at Sioux
 
12 virtualmachine
12 virtualmachine12 virtualmachine
12 virtualmachine
 
The State of NoSQL
The State of NoSQLThe State of NoSQL
The State of NoSQL
 
07 bestpractice
07 bestpractice07 bestpractice
07 bestpractice
 
Look Ma, “update DB to HTML5 using C++”, no hands! 
Look Ma, “update DB to HTML5 using C++”, no hands! Look Ma, “update DB to HTML5 using C++”, no hands! 
Look Ma, “update DB to HTML5 using C++”, no hands! 
 
SICP_2.5 일반화된 연산시스템
SICP_2.5 일반화된 연산시스템SICP_2.5 일반화된 연산시스템
SICP_2.5 일반화된 연산시스템
 
Dynamic C++ ACCU 2013
Dynamic C++ ACCU 2013Dynamic C++ ACCU 2013
Dynamic C++ ACCU 2013
 
MacOS memory allocator (libmalloc) Exploitation
MacOS memory allocator (libmalloc) ExploitationMacOS memory allocator (libmalloc) Exploitation
MacOS memory allocator (libmalloc) Exploitation
 
Clojure: The Art of Abstraction
Clojure: The Art of AbstractionClojure: The Art of Abstraction
Clojure: The Art of Abstraction
 
Python lec4
Python lec4Python lec4
Python lec4
 
Apache Cassandra in Bangalore - Cassandra Internals and Performance
Apache Cassandra in Bangalore - Cassandra Internals and PerformanceApache Cassandra in Bangalore - Cassandra Internals and Performance
Apache Cassandra in Bangalore - Cassandra Internals and Performance
 

En vedette

Cassandra Anti-Patterns
Cassandra Anti-PatternsCassandra Anti-Patterns
Cassandra Anti-PatternsMatthew Dennis
 
strangeloop 2012 apache cassandra anti patterns
strangeloop 2012 apache cassandra anti patternsstrangeloop 2012 apache cassandra anti patterns
strangeloop 2012 apache cassandra anti patternsMatthew Dennis
 
BigData as a Platform: Cassandra and Current Trends
BigData as a Platform: Cassandra and Current TrendsBigData as a Platform: Cassandra and Current Trends
BigData as a Platform: Cassandra and Current TrendsMatthew Dennis
 
Cassandra NYC 2011 Data Modeling
Cassandra NYC 2011 Data ModelingCassandra NYC 2011 Data Modeling
Cassandra NYC 2011 Data ModelingMatthew Dennis
 
Cassandra Data Modeling
Cassandra Data ModelingCassandra Data Modeling
Cassandra Data ModelingMatthew Dennis
 
Cassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patternsCassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patternsDave Gardner
 
The Future Of Big Data
The Future Of Big DataThe Future Of Big Data
The Future Of Big DataMatthew Dennis
 
Cassandra Data Model
Cassandra Data ModelCassandra Data Model
Cassandra Data Modelebenhewitt
 
Cassandra Day Chicago 2015: Top 5 Tips/Tricks with Apache Cassandra and DSE
Cassandra Day Chicago 2015: Top 5 Tips/Tricks with Apache Cassandra and DSECassandra Day Chicago 2015: Top 5 Tips/Tricks with Apache Cassandra and DSE
Cassandra Day Chicago 2015: Top 5 Tips/Tricks with Apache Cassandra and DSEDataStax Academy
 
Introduction to Data Modeling in Cassandra
Introduction to Data Modeling in CassandraIntroduction to Data Modeling in Cassandra
Introduction to Data Modeling in CassandraJim Hatcher
 
Massively Scalable NoSQL with Apache Cassandra
Massively Scalable NoSQL with Apache CassandraMassively Scalable NoSQL with Apache Cassandra
Massively Scalable NoSQL with Apache Cassandrajbellis
 
C*ollege Credit: An Introduction to Apache Cassandra
C*ollege Credit: An Introduction to Apache CassandraC*ollege Credit: An Introduction to Apache Cassandra
C*ollege Credit: An Introduction to Apache CassandraDataStax
 
Introduction to data modeling with apache cassandra
Introduction to data modeling with apache cassandraIntroduction to data modeling with apache cassandra
Introduction to data modeling with apache cassandraPatrick McFadin
 
Planning to Fail #phpuk13
Planning to Fail #phpuk13Planning to Fail #phpuk13
Planning to Fail #phpuk13Dave Gardner
 
Cabs, Cassandra, and Hailo (at Cassandra EU)
Cabs, Cassandra, and Hailo (at Cassandra EU)Cabs, Cassandra, and Hailo (at Cassandra EU)
Cabs, Cassandra, and Hailo (at Cassandra EU)Dave Gardner
 
Planning to Fail #phpne13
Planning to Fail #phpne13Planning to Fail #phpne13
Planning to Fail #phpne13Dave Gardner
 
Introduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and HadoopIntroduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and HadoopPatricia Gorla
 
From rdbms to cassandra without a hitch
From rdbms to cassandra without a hitchFrom rdbms to cassandra without a hitch
From rdbms to cassandra without a hitchDuyhai Doan
 
Cassandra nice use cases and worst anti patterns
Cassandra nice use cases and worst anti patternsCassandra nice use cases and worst anti patterns
Cassandra nice use cases and worst anti patternsDuyhai Doan
 

En vedette (20)

Cassandra Anti-Patterns
Cassandra Anti-PatternsCassandra Anti-Patterns
Cassandra Anti-Patterns
 
strangeloop 2012 apache cassandra anti patterns
strangeloop 2012 apache cassandra anti patternsstrangeloop 2012 apache cassandra anti patterns
strangeloop 2012 apache cassandra anti patterns
 
BigData as a Platform: Cassandra and Current Trends
BigData as a Platform: Cassandra and Current TrendsBigData as a Platform: Cassandra and Current Trends
BigData as a Platform: Cassandra and Current Trends
 
Cassandra NYC 2011 Data Modeling
Cassandra NYC 2011 Data ModelingCassandra NYC 2011 Data Modeling
Cassandra NYC 2011 Data Modeling
 
Cassandra Data Modeling
Cassandra Data ModelingCassandra Data Modeling
Cassandra Data Modeling
 
Cassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patternsCassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patterns
 
The Future Of Big Data
The Future Of Big DataThe Future Of Big Data
The Future Of Big Data
 
Cassandra Data Model
Cassandra Data ModelCassandra Data Model
Cassandra Data Model
 
Cassandra Day Chicago 2015: Top 5 Tips/Tricks with Apache Cassandra and DSE
Cassandra Day Chicago 2015: Top 5 Tips/Tricks with Apache Cassandra and DSECassandra Day Chicago 2015: Top 5 Tips/Tricks with Apache Cassandra and DSE
Cassandra Day Chicago 2015: Top 5 Tips/Tricks with Apache Cassandra and DSE
 
Introduction to Data Modeling in Cassandra
Introduction to Data Modeling in CassandraIntroduction to Data Modeling in Cassandra
Introduction to Data Modeling in Cassandra
 
Massively Scalable NoSQL with Apache Cassandra
Massively Scalable NoSQL with Apache CassandraMassively Scalable NoSQL with Apache Cassandra
Massively Scalable NoSQL with Apache Cassandra
 
C*ollege Credit: An Introduction to Apache Cassandra
C*ollege Credit: An Introduction to Apache CassandraC*ollege Credit: An Introduction to Apache Cassandra
C*ollege Credit: An Introduction to Apache Cassandra
 
Cassandra On EC2
Cassandra On EC2Cassandra On EC2
Cassandra On EC2
 
Introduction to data modeling with apache cassandra
Introduction to data modeling with apache cassandraIntroduction to data modeling with apache cassandra
Introduction to data modeling with apache cassandra
 
Planning to Fail #phpuk13
Planning to Fail #phpuk13Planning to Fail #phpuk13
Planning to Fail #phpuk13
 
Cabs, Cassandra, and Hailo (at Cassandra EU)
Cabs, Cassandra, and Hailo (at Cassandra EU)Cabs, Cassandra, and Hailo (at Cassandra EU)
Cabs, Cassandra, and Hailo (at Cassandra EU)
 
Planning to Fail #phpne13
Planning to Fail #phpne13Planning to Fail #phpne13
Planning to Fail #phpne13
 
Introduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and HadoopIntroduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and Hadoop
 
From rdbms to cassandra without a hitch
From rdbms to cassandra without a hitchFrom rdbms to cassandra without a hitch
From rdbms to cassandra without a hitch
 
Cassandra nice use cases and worst anti patterns
Cassandra nice use cases and worst anti patternsCassandra nice use cases and worst anti patterns
Cassandra nice use cases and worst anti patterns
 

Similaire à DZone Cassandra Data Modeling Webinar

Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++Mike Acton
 
Apache Cassandra Opinion and Fact
Apache Cassandra Opinion and FactApache Cassandra Opinion and Fact
Apache Cassandra Opinion and Factmediumdata
 
#GDC15 Code Clinic
#GDC15 Code Clinic#GDC15 Code Clinic
#GDC15 Code ClinicMike Acton
 
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...DataStax
 
Your first ClickHouse data warehouse
Your first ClickHouse data warehouseYour first ClickHouse data warehouse
Your first ClickHouse data warehouseAltinity Ltd
 
Querying federations 
of Triple Pattern Fragments
Querying federations 
of Triple Pattern FragmentsQuerying federations 
of Triple Pattern Fragments
Querying federations 
of Triple Pattern FragmentsRuben Verborgh
 
Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for SysadminsNathan Milford
 
Cassandra Community Webinar - Introduction To Apache Cassandra 1.2
Cassandra Community Webinar  - Introduction To Apache Cassandra 1.2Cassandra Community Webinar  - Introduction To Apache Cassandra 1.2
Cassandra Community Webinar - Introduction To Apache Cassandra 1.2aaronmorton
 
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2Cassandra Community Webinar | Introduction to Apache Cassandra 1.2
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2DataStax
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with CassandraJacek Lewandowski
 
Rob Sullivan at Heroku's Waza 2013: Your Database -- A Story of Indifference
Rob Sullivan at Heroku's Waza 2013: Your Database -- A Story of IndifferenceRob Sullivan at Heroku's Waza 2013: Your Database -- A Story of Indifference
Rob Sullivan at Heroku's Waza 2013: Your Database -- A Story of IndifferenceHeroku
 
Scaling with MongoDB
Scaling with MongoDBScaling with MongoDB
Scaling with MongoDBRick Copeland
 
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDBBuilding a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDBCody Ray
 
Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to Cassandraaaronmorton
 
Cassandra Client Tutorial
Cassandra Client TutorialCassandra Client Tutorial
Cassandra Client TutorialJoe McTee
 
Beyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the codeBeyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the codeWim Godden
 
Tokyo APAC Groundbreakers tour - The Complete Java Developer
Tokyo APAC Groundbreakers tour - The Complete Java DeveloperTokyo APAC Groundbreakers tour - The Complete Java Developer
Tokyo APAC Groundbreakers tour - The Complete Java DeveloperConnor McDonald
 

Similaire à DZone Cassandra Data Modeling Webinar (20)

Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++
 
Apache Cassandra Opinion and Fact
Apache Cassandra Opinion and FactApache Cassandra Opinion and Fact
Apache Cassandra Opinion and Fact
 
#GDC15 Code Clinic
#GDC15 Code Clinic#GDC15 Code Clinic
#GDC15 Code Clinic
 
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
 
Your first ClickHouse data warehouse
Your first ClickHouse data warehouseYour first ClickHouse data warehouse
Your first ClickHouse data warehouse
 
Querying federations 
of Triple Pattern Fragments
Querying federations 
of Triple Pattern FragmentsQuerying federations 
of Triple Pattern Fragments
Querying federations 
of Triple Pattern Fragments
 
Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for Sysadmins
 
Cassandra Community Webinar - Introduction To Apache Cassandra 1.2
Cassandra Community Webinar  - Introduction To Apache Cassandra 1.2Cassandra Community Webinar  - Introduction To Apache Cassandra 1.2
Cassandra Community Webinar - Introduction To Apache Cassandra 1.2
 
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2Cassandra Community Webinar | Introduction to Apache Cassandra 1.2
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2
 
Intro to riak
Intro to riakIntro to riak
Intro to riak
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with Cassandra
 
Rob Sullivan at Heroku's Waza 2013: Your Database -- A Story of Indifference
Rob Sullivan at Heroku's Waza 2013: Your Database -- A Story of IndifferenceRob Sullivan at Heroku's Waza 2013: Your Database -- A Story of Indifference
Rob Sullivan at Heroku's Waza 2013: Your Database -- A Story of Indifference
 
Scaling with MongoDB
Scaling with MongoDBScaling with MongoDB
Scaling with MongoDB
 
Taming Cassandra
Taming CassandraTaming Cassandra
Taming Cassandra
 
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDBBuilding a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
 
Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to Cassandra
 
Cassandra Client Tutorial
Cassandra Client TutorialCassandra Client Tutorial
Cassandra Client Tutorial
 
Beyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the codeBeyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the code
 
Tokyo APAC Groundbreakers tour - The Complete Java Developer
Tokyo APAC Groundbreakers tour - The Complete Java DeveloperTokyo APAC Groundbreakers tour - The Complete Java Developer
Tokyo APAC Groundbreakers tour - The Complete Java Developer
 
Web_Alg_Project
Web_Alg_ProjectWeb_Alg_Project
Web_Alg_Project
 

Dernier

Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
DZone Cassandra Data Modeling Webinar

  • 1. Modeling Data In Cassandra Conceptual Differences Versus RDBMS Matthew F. Dennis, DataStax // @mdennis June 27, 2012
  • 2. Cassandra Is Not Relational get out of the relational mindset when working with Cassandra (or really any NoSQL DB)
  • 3. Work Backwards From Queries Think in terms of queries, not in terms of normalizing the data; in fact, you often want to denormalize (already common in the data warehousing world, even in RDBMS)
  • 4. OK great, but how do I do that? Well, you need to know how Cassandra models data (e.g. Google Big Table) research.google.com/archive/bigtable-osdi06.pdf Go Read It!
  • 5. In Cassandra: ➔ data is organized into Keyspaces (usually one per app) ➔ each Keyspace can have multiple Column Families ➔ each Column Family can have many Rows ➔ each Row has a Row Key and a variable number of Columns ➔ each Column consists of a Name, Value and Timestamp
  • 6. In Cassandra, Keyspaces: ➔ are similar in concept to a “database” in some RDBMSs ➔ are stored in separate directories on disk ➔ are usually one-to-one with applications ➔ are usually the administrative unit for things related to ops ➔ contain multiple column families
  • 7. In Cassandra, In Keyspaces, Column Families: ➔ are similar in concept to a “table” in most RDBMSs ➔ are stored in separate files on disk (many per CF) ➔ are usually approximately one-to-one with query type ➔ are usually the administrative unit for things related to your data ➔ can contain many (~billion* per node) rows (* for a good sized node; you can always add nodes)
  • 8. In Cassandra, In Keyspaces, In Column Families ...
  • 9. Rows / Row Keys (the Row Key is the first field of each row):
        thepaul | office: Austin | OS: OSX | twitter: thepaul0
        mdennis | office: UA | OS: Linux | twitter: mdennis
        thobbs | office: Austin | twitter: tylhobbs
  • 10. Columns (same three rows as slide 9)
  • 11. Column Names (same three rows as slide 9)
  • 12. Column Values (same three rows as slide 9)
  • 13. Rows Are Randomly Ordered (if using the RandomPartitioner) (same three rows as slide 9)
  • 14. Columns Are Ordered by Name (by a configurable comparator) (same three rows as slide 9)
  • 15. Columns are ordered because doing so allows very efficient implementations of useful and common operations (e.g. merge join)
  • 16. In particular, within a row columns with a given name can be located very quickly. (ordered names => log(n) binary search)
  • 17. More importantly, I can query for a slice between a start and end. [diagram: a row with Row Key RK and columns ts0, ts1, ..., tsM, ..., tsN, ...; "start" and "end" mark the slice]
  • 18. Why does that matter? Because columns within a row don’t have to be static! (and random disk seeks are teh evil)
  • 19. The Column Name Can Be Part of Your Data. Columns Are Ordered by Name (in this case by a TimeUUID Comparator):
        INTC | ts0: $25.20 | ts1: $25.25 | ...
        AMR | ts0: $6.20 | ts9: $0.26 | ...
        CRDS | ts0: $1.05 | ts5: $6.82 | ...
  • 20. Turns Out That Pattern Comes Up A Lot ➔ stock ticks ➔ event logs ➔ ad clicks/views ➔ sensor records ➔ access/error logs ➔ plane/truck/person/“entity” locations ➔ …
  • 21. OK, but I can do that in SQL. Not efficiently at scale, at least not easily ...
  • 22. How it Looks In a RDBMS [table with columns ticker, timestamp, bid, ask, ...: the rows for the Data I Care About (AMR at ts0, ts1, ts2) are scattered among rows for other tickers such as CRDS]
  • 23. How it Looks In a RDBMS [same table: the AMR rows at ts0, ts1 and ts2 are separated by gaps Larger Than Your Page Size, so reading them means Disk Seeks]
  • 24. OK, but what about ... ➔ PostgreSQL Cluster Command? ➔ MySQL Cluster Indexes? ➔ Oracle Index Organized Tables? ➔ SQLServer Clustered Index?
  • 25. OK, but what about ... ➔ PostgreSQL Cluster Using? ➔ MySQL [InnoDB] Cluster Indexes? ➔ Oracle Index Organized Table? ➔ SQLServer Clustered Index? Meh ...
  • 26. The on-disk management of that clustering results in tons of IO … In the case of PostgreSQL: ➔ clustering is a one time operation (implies you must periodically rewrite the entire table) ➔ new data is *not* written in clustered order (which is often the data you care most about)
  • 27. OK, so just partition the tables ...
  • 28. Not a bad idea, except in MySQL there is a limit of 1024 partitions and generally less if using NDB (you should probably still do it if using MySQL though) http://dev.mysql.com/doc/refman/5.5/en/partitioning-limitations.html
  • 29. OK fine, I agree storing data that is queried together on disk together is a good thing, but what's that have to do with modeling in Cassandra? [diagram: row RK with columns ts0, ts1, ..., tsM, ..., tsN, ...; Seek To Here, then Read Precisely My Data*] (* more on some caveats later)
  • 30. Well, that's what is meant by “work backwards from your queries” or “think in terms of queries” (NB: this concept, in general, applies to RDBMS at scale as well; it is not specific to Cassandra)
  • 31. An Example From Fraud Detection To calculate risk it is common to need to know all the emails, destinations, origins, devices, locations, phone numbers, et cetera ever used for the account in question
  • 32. In a normalized model that usually translates to a table for each type of entity being tracked:
        (id, name): 1 guy; 2 gal; ...
        (id, device): 1000 0xdead; 2000 0xb33f; ...
        (id, dest): 15 USA; 25 Finland; ...
        (id, email): 100 guy@; 200 gal@; ...
        (id, origin): 150 USA; 250 Nigeria; ...
  • 33. The problem is that at scale that also means a disk seek for each one … (even for perfect IOT et al if across multiple tables) ➔ Previous emails? That's a seek … ➔ Previous devices? That's a seek … ➔ Previous destinations? That's a seek ...
  • 34. But In Cassandra I Store The Data I Query Together On Disk Together (remember, column names need not be static):
        acctY | ...
        acctX | dest21 | dev2 | dev7 | email3 | email9 | orig4 | ...   (Data I Care About)
        acctZ | ...
        Column Name = Column Value, e.g. email:cassandra@mailinator.com = dateEmailWasLastUsed
  • 35. Don't treat Cassandra (or any DB) as a black box ➔ Understand how your DBs (and data structures) work ➔ Understand the building blocks they provide ➔ Understand the work complexity (“big O”) of queries ➔ For data sets > memory, goal is to minimize seeks* (* on a related note, SSDs are awesome)
  • 36. Q? Modeling Data In Cassandra Conceptual Differences Versus RDBMS Matthew F. Dennis, DataStax // @mdennis
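
The ideas on slides 14 to 19 and 34 (columns ordered by name, log(n) lookup, slices between a start and an end, and column names that carry data) can be sketched in a few lines of Python. This is only a toy in-memory model, not Cassandra's storage engine or any client API: the `Row` class, its methods, the integer timestamps, the third tick and the dates are all made up for illustration, and the `email:` / `dev:` / `dest:` name prefixes simply follow the `email:cassandra@mailinator.com` example on slide 34.

```python
from bisect import bisect_left, bisect_right


class Row:
    """Toy stand-in for one row: columns are kept sorted by column name."""

    def __init__(self, key):
        self.key = key
        self._names = []   # column names, always in sorted order
        self._values = []  # column values, parallel to _names

    def insert(self, name, value):
        # Binary search for the position, then overwrite or insert in order.
        i = bisect_left(self._names, name)
        if i < len(self._names) and self._names[i] == name:
            self._values[i] = value
        else:
            self._names.insert(i, name)
            self._values.insert(i, value)

    def get(self, name):
        # Ordered names => O(log n) comparisons to locate one column.
        i = bisect_left(self._names, name)
        if i < len(self._names) and self._names[i] == name:
            return self._values[i]
        return None

    def slice(self, start, end):
        # One "seek" to start, then a contiguous read through end (inclusive).
        lo = bisect_left(self._names, start)
        hi = bisect_right(self._names, end)
        return list(zip(self._names[lo:hi], self._values[lo:hi]))

    def prefix_slice(self, prefix):
        # All string-named columns starting with prefix, as a single slice.
        return self.slice(prefix, prefix + "\uffff")


# Time-series row as on slide 19: the column name is the timestamp.
# Integers 0, 1, 2 stand in for ts0, ts1, ts2; the ts2 price is invented.
intc = Row("INTC")
for ts, price in [(2, 25.30), (0, 25.20), (1, 25.25)]:
    intc.insert(ts, price)
recent = intc.slice(1, 2)

# Fraud-detection row as on slide 34: everything about one account in one row.
# The last-used dates are hypothetical placeholder values.
acct = Row("acctX")
acct.insert("email:cassandra@mailinator.com", "2012-06-01")
acct.insert("dev:0xdead", "2012-05-20")
acct.insert("dest:USA", "2012-04-11")
emails = acct.prefix_slice("email:")
```

Because the names stay sorted, both the time-range read (`recent`) and the "all previous emails" read (`emails`) touch one contiguous run of columns in a single row, which is the property the slides rely on to avoid a seek per entity type.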