Five factors to consider when
choosing a big data solution!
Jonathan Ellis
CTO, DataStax
Project Chair, Apache Cassandra
how do I model my application?

©2012 DataStax
Popular options
  • Key/value
  • Tabular
  • Document
  • Graph?




Schema is your friend

{
    "id": "e451dd42-ece3-11e1-a0a3-34159e154f4c",
    "name": "jbellis",
    "state": "TX",
    "birthdate": "1/1/1976",
    "email_addresses": ["jbellis@gmail.com", "jbellis@datastax.com"]
}




SQL can be your friend too

 CREATE TABLE users (
    id uuid PRIMARY KEY,
    name text,
    state text,
    birth_date date
 );



 CREATE INDEX ON users(state);

 SELECT * FROM users
 WHERE state='Texas' AND birth_date > '1950-01-01';
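
A minimal sketch of the table, index, and query above, run against Python's built-in sqlite3 module instead of Cassandra so that it is self-contained. That substitution is an assumption of this example: SQLite has no uuid or date type, so both are stored as text, and the index name users_state_idx, the sample rows, and the 'TX' state code are invented for illustration.

```python
import sqlite3
import uuid


def build_demo_db():
    """Create an in-memory users table with a secondary index on state."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users ("
        " id TEXT PRIMARY KEY, name TEXT, state TEXT, birth_date TEXT)"
    )
    # Secondary index, mirroring CREATE INDEX ON users(state).
    conn.execute("CREATE INDEX users_state_idx ON users(state)")
    sample = [  # hypothetical rows
        ("jbellis", "TX", "1976-01-01"),
        ("alice", "TX", "1940-05-17"),
        ("bob", "CA", "1980-03-02"),
    ]
    conn.executemany(
        "INSERT INTO users VALUES (?, ?, ?, ?)",
        [(str(uuid.uuid4()), n, s, b) for n, s, b in sample],
    )
    return conn


def texans_born_after(conn, cutoff):
    """Names of users in TX born after `cutoff` (an ISO yyyy-mm-dd string)."""
    rows = conn.execute(
        "SELECT name FROM users"
        " WHERE state = ? AND birth_date > ? ORDER BY name",
        ("TX", cutoff),
    )
    return [name for (name,) in rows]
```

ISO-formatted date strings sort in chronological order, which is why the text comparison in the WHERE clause behaves like a date comparison here.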




Collections
 CREATE TABLE users (
    id uuid PRIMARY KEY,
    name text,
    state text,
    birth_date date
 );

 CREATE TABLE users_addresses (
    user_id uuid REFERENCES users,
    email text
 );

 SELECT *
 FROM users NATURAL JOIN users_addresses;




Collections
 [Same schema and NATURAL JOIN query as the previous slide, overlaid with a large X.]
Collections
 CREATE TABLE users (
    id uuid PRIMARY KEY,
    name text,
    state text,
    birth_date date,
    email_addresses set<text>
 );

 UPDATE users
 SET email_addresses = email_addresses + {'jbellis@gmail.com',
 'jbellis@datastax.com'};
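
A rough Python sketch of what the set<text> column buys you: the addresses live inside the user row, and `email_addresses + {...}` behaves like a set union, so no second table or join is needed. The dict-based row and the add_emails helper are hypothetical stand-ins, not Cassandra APIs.

```python
def add_emails(user, new_addresses):
    """Mimic `email_addresses = email_addresses + {...}` as a set union."""
    current = user.get("email_addresses", set())
    user["email_addresses"] = current | set(new_addresses)
    return user


# A user row modeled as a plain dict (illustrative only).
user = {"name": "jbellis", "state": "TX", "email_addresses": set()}

add_emails(user, {"jbellis@gmail.com", "jbellis@datastax.com"})
# Adding an address that is already present changes nothing.
add_emails(user, {"jbellis@gmail.com"})
```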




Joins don’t scale
  • No joins
  • No subqueries
  • No aggregation functions* or GROUP BY
  • ORDER BY?




SELECT * FROM tweets
WHERE user_id IN (SELECT follower FROM followers
                  WHERE user_id = 'driftx')

[Diagram: the followers table, a question mark, and the tweets table.]
Clustering in Cassandra
CREATE TABLE timeline (
  user_id uuid,
  tweet_id timeuuid,
  tweet_author uuid,
  tweet_body text,
  PRIMARY KEY (user_id, tweet_id)
);

user_id   tweet_id     _author    _body
jbellis   3290f9da..   rbranson   lorem
jbellis   3895411a..   tjake      ipsum
...       ...          ...        ...
driftx    3290f9da..   rbranson   lorem
driftx    71b46a84..   yzhang     dolor
...       ...          ...        ...
yukim     3290f9da..   rbranson   lorem
yukim     e451dd42..   tjake      amet
...       ...          ...        ...
Clustering in Cassandra
SELECT * FROM timeline
WHERE user_id = 'driftx';
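
A small Python sketch of the idea behind the timeline table: instead of joining tweets to followers at read time, each tweet is copied into every follower's timeline when it is written, and rows within a user's timeline stay ordered by tweet_id. The fan-out-on-write step, the follower graph, and the integer tweet ids (standing in for timeuuid values) are assumptions of this example, not something the slides spell out.

```python
from bisect import insort
from collections import defaultdict

# Hypothetical follower graph: author -> users who follow that author.
followers = {
    "rbranson": ["jbellis", "driftx", "yukim"],
    "tjake": ["jbellis", "yukim"],
    "yzhang": ["driftx"],
}

# user_id -> list of (tweet_id, author, body), kept sorted by tweet_id.
# Each list plays the role of one partition, clustered by tweet_id.
timeline = defaultdict(list)


def post_tweet(tweet_id, author, body):
    """Copy the tweet into every follower's timeline at write time."""
    for follower in followers.get(author, []):
        insort(timeline[follower], (tweet_id, author, body))


def read_timeline(user_id):
    """One lookup, no join: rows are already grouped and ordered."""
    return list(timeline[user_id])


# Writes arrive out of order; reads still come back sorted by tweet_id.
post_tweet(2, "yzhang", "dolor")
post_tweet(1, "rbranson", "lorem")
post_tweet(3, "tjake", "ipsum")
```

The trade-off in this sketch is more work per write in exchange for a read that touches a single user's rows.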



how does it perform?
Larger than memory datasets




Locking




Efficiency




 UPDATE users
 SET email_addresses = email_addresses + {...}
 WHERE user_id = 'jbellis';




Durability




C* storage engine very briefly
write( k1 , c1:v1 )

  Memory:      Memtable
  Hard drive:  Commit log
write( k1 , c1:v1 )

  Memory:      Memtable
                 k1 c1:v1
  Hard drive:  Commit log
                 k1 c1:v1
write( k1 , c2:v2 )

  Memory:      Memtable
                 k1 c1:v1 c2:v2
  Hard drive:  Commit log
                 k1 c1:v1
                 k1 c2:v2
write( k2 , c1:v1 c2:v2 )

  Memory:      Memtable
                 k1 c1:v1 c2:v2
                 k2 c1:v1 c2:v2
  Hard drive:  Commit log
                 k1 c1:v1
                 k1 c2:v2
                 k2 c1:v1 c2:v2
write( k1 , c1:v4 c3:v3 )

  Memory:      Memtable
                 k1 c1:v4 c2:v2 c3:v3
                 k2 c1:v1 c2:v2
  Hard drive:  Commit log
                 k1 c1:v1
                 k1 c2:v2
                 k2 c1:v1 c2:v2
                 k1 c1:v4 c3:v3
flush

  Memory:      Memtable (flushed)
  Hard drive:  SSTable, with index
                 k1 c1:v4 c2:v2 c3:v3
                 k2 c1:v1 c2:v2
               Commit log: cleanup
No random writes
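
The write path walked through above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Cassandra's implementation: TinyStore and its methods are hypothetical names, plain lists and dicts stand in for on-disk files, and the SSTable index is omitted. The point it illustrates is that every disk-side operation is either an append (commit log) or a one-shot sequential dump (flush), never an in-place update.

```python
class TinyStore:
    """Toy model of a commit log + memtable + SSTable write path."""

    def __init__(self):
        self.commit_log = []  # append-only; stands in for the on-disk log
        self.memtable = {}    # key -> {column: value}, held in memory
        self.sstables = []    # immutable sorted snapshots; stand in for files

    def write(self, key, columns):
        # 1. Append to the commit log (sequential I/O, for durability).
        self.commit_log.append((key, dict(columns)))
        # 2. Merge into the in-memory row; newer values overwrite older ones.
        self.memtable.setdefault(key, {}).update(columns)

    def flush(self):
        # Dump the memtable as one sorted, immutable SSTable.
        sstable = tuple(
            (key, tuple(sorted(cols.items())))
            for key, cols in sorted(self.memtable.items())
        )
        self.sstables.append(sstable)
        # Flushed data no longer needs the log, so it can be cleaned up.
        self.memtable = {}
        self.commit_log = []


# Replay the four writes from the slides.
store = TinyStore()
store.write("k1", {"c1": "v1"})
store.write("k1", {"c2": "v2"})
store.write("k2", {"c1": "v1", "c2": "v2"})
store.write("k1", {"c1": "v4", "c3": "v3"})
```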




[Chart: reads/s and writes/s, Cassandra 0.6 vs. Cassandra 1.0; y-axis from 0 to 35000.]
how does it handle failure?
Classic partitioning with SPOF
[Diagram: a client connects through a single router to partition 1, partition 2, partition 3, and partition 4.]
Availability
  • “High availability implies that a single fault will not bring
    down your system. Not ‘we’ll recover quickly.’”
    -- Ben Coverston: DataStax

  • “The biggest problem with failover is that you're almost
    never using it until it really hurts. It's like backups that
    you never test.”
    -- Rick Branson: Instagram




Fully distributed, no SPOF
[Diagram: a client and a ring of nodes; the labels p1, p3, and p6 mark partitions, with p1 appearing on several nodes.]
Multiple datacenters




how does it scale?
Scaling antipatterns
  • Metadata servers
  • Router bottlenecks
  • Overloading existing nodes when adding capacity
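
One common way to avoid all three antipatterns is a consistent-hash ring, sketched below in Python. This is an illustration under assumptions rather than Cassandra's actual partitioner: the Ring class, the MD5-based token function, and the choice of 64 tokens per node are all hypothetical. Any client holding the node list can compute a key's owner locally (no router or metadata server), and adding a node only moves keys onto the new node, so existing nodes never receive extra data from each other during the change.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a position on the ring (MD5 chosen for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, tokens_per_node=64):
        self.tokens_per_node = tokens_per_node
        self._ring = []  # sorted list of (token, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node claims several positions so load spreads more evenly.
        for i in range(self.tokens_per_node):
            bisect.insort(self._ring, (token(f"{node}#{i}"), node))

    def owner(self, key):
        # The first ring position at or after the key's token owns the key,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_left(self._ring, (token(key), ""))
        return self._ring[idx % len(self._ring)][1]


keys = [f"user{i}" for i in range(1000)]
ring = Ring(["node1", "node2", "node3"])
before = {k: ring.owner(k) for k in keys}

ring.add_node("node4")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```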




how flexible is it?
Data model: Realtime
  LiveStocks
          stock      last
          GOOG       $95.52
          AAPL       $186.10
          AMZN       $112.98

  Portfolios
          user       stock      shares
          jbellis    GOOG       80
          jbellis    LNKD       20
          yukim      AMZN       100

  StockHist
          stock      date          price
          GOOG       2011-01-01    $8.23
          GOOG       2011-01-02    $6.14
          GOOG       2011-01-03    $7.78
Data model: Analytics
  HistLoss
                       worst_date    loss
          Portfolio1   2011-07-23    -$34.81
          Portfolio2   2011-03-11    -$11432.24
          Portfolio3   2011-05-21    -$1476.93
Data model: Analytics
  10dayreturns
          stock      rdate     return
          GOOG    2011-07-25   $8.23
          GOOG    2011-07-24   $6.14
          GOOG    2011-07-23   $7.78
          AAPL    2011-07-25   $15.32
          AAPL    2011-07-24   $12.68


     INSERT OVERWRITE TABLE 10dayreturns
     SELECT a.stock,
            b.date as rdate,
            b.price - a.price
     FROM StockHist a
     JOIN StockHist b
     ON (a.stock = b.stock
         AND date_add(a.date, 10) = b.date);
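
To make the self-join above concrete, here is a rough Python equivalent. It assumes StockHist rows are (stock, date, price) tuples with at most one price per stock per day, and it uses integer cents to sidestep floating-point rounding; the function name and sample figures are invented for illustration.

```python
from datetime import date, timedelta


def ten_day_returns(stock_hist, days=10):
    """Return sorted (stock, rdate, return) rows, like the join on a.date + 10."""
    price_on = {(stock, d): price for stock, d, price in stock_hist}
    rows = []
    for (stock, d), start_price in price_on.items():
        rdate = d + timedelta(days=days)
        end_price = price_on.get((stock, rdate))
        if end_price is not None:  # inner join: skip days with no match
            rows.append((stock, rdate, end_price - start_price))
    return sorted(rows)


# Hypothetical history, prices in cents.
hist = [
    ("GOOG", date(2011, 1, 1), 100),
    ("GOOG", date(2011, 1, 11), 108),
    ("GOOG", date(2011, 1, 12), 90),
    ("AAPL", date(2011, 1, 1), 50),
]
returns = ten_day_returns(hist)
```

Only the GOOG pair ten days apart produces a row; dates without a partner drop out, just as they would in the inner join.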

Data model: Analytics
  portfolio_returns
            portfolio       rdate      preturn
            Portfolio1   2011-07-25    $118.21
            Portfolio1   2011-07-24     $60.78
            Portfolio1   2011-07-23    -$34.81
            Portfolio2   2011-07-25   $2143.92
            Portfolio3   2011-07-24    -$10.19


       INSERT OVERWRITE TABLE portfolio_returns
       SELECT portfolio,
              rdate,
              SUM(b.return)
       FROM portfolios a JOIN 10dayreturns b
       ON (a.stock = b.stock)
       GROUP BY portfolio, rdate;
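
The join plus GROUP BY above amounts to the following Python sketch. It mirrors the query as written, so each holding contributes its stock's return once and share counts are not applied. The (portfolio, stock) row shape, the function name, and the sample values in cents are assumptions for illustration.

```python
from collections import defaultdict


def portfolio_returns(portfolios, returns):
    """Sum per-stock returns into {(portfolio, rdate): total}."""
    by_stock = defaultdict(list)
    for stock, rdate, ret in returns:
        by_stock[stock].append((rdate, ret))

    totals = defaultdict(int)
    for portfolio, stock in portfolios:
        # Join on stock, then group by (portfolio, rdate).
        for rdate, ret in by_stock.get(stock, []):
            totals[(portfolio, rdate)] += ret
    return dict(totals)


# Hypothetical inputs, returns in cents.
holdings = [("P1", "GOOG"), ("P1", "AAPL"), ("P2", "GOOG")]
stock_returns = [
    ("GOOG", "2011-07-25", 8),
    ("AAPL", "2011-07-25", 15),
    ("GOOG", "2011-07-24", 6),
]
totals = portfolio_returns(holdings, stock_returns)
```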

Data model: Analytics
  HistLoss
                       worst_date    loss
          Portfolio1   2011-07-23    -$34.81
          Portfolio2   2011-03-11    -$11432.24
          Portfolio3   2011-05-21    -$1476.93



    INSERT OVERWRITE TABLE HistLoss
    SELECT a.portfolio, rdate, minp
    FROM (
      SELECT portfolio, min(preturn) as minp
      FROM portfolio_returns
      GROUP BY portfolio
    ) a
    JOIN portfolio_returns b
    ON (a.portfolio = b.portfolio and a.minp = b.preturn);
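
A short Python sketch of the same "worst day per portfolio" computation, assuming portfolio_returns rows are (portfolio, rdate, preturn) tuples with returns in integer cents. One difference from the query is worth flagging: if a portfolio hits its minimum on two dates, the join above would emit both rows, while this sketch keeps only the first one it sees.

```python
def worst_losses(rows):
    """Return {portfolio: (worst_date, loss)} using the minimum preturn."""
    worst = {}
    for portfolio, rdate, preturn in rows:
        if portfolio not in worst or preturn < worst[portfolio][1]:
            worst[portfolio] = (rdate, preturn)
    return worst


# Sample rows taken from the portfolio_returns slide, converted to cents.
sample_returns = [
    ("Portfolio1", "2011-07-25", 11821),
    ("Portfolio1", "2011-07-24", 6078),
    ("Portfolio1", "2011-07-23", -3481),
    ("Portfolio2", "2011-07-25", 214392),
]
hist_loss = worst_losses(sample_returns)
```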

Some Cassandra users




Questions?

Image credits
•    http://www.flickr.com/photos/26817893@N05/2573006312/

•    http://www.flickr.com/photos/rowanbank/7686239548

•    http://www.flickr.com/photos/mervtheswerve/6081933265

•    http://www.flickr.com/photos/dg_pics/2526208830

•    http://www.flickr.com/photos/wainwright/351684037

•    http://www.flickr.com/photos/mikeneilson/1606662529

•    http://www.flickr.com/photos/sbisson/3852905534

•    http://www.flickr.com/photos/breadnbadger/2674928517
