NoSQL
the new generation of database servers
KVIV IT - 3/6/2010




                                                                                http://www.flickr.com/photos/wolfgangstaudt/2215246206/



        IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
Who am I



» Steven Noels - stevenn@outerthought.org

» Outerthought : scalable content applications

» makers of Daisy and Lily open source CMS




An evolution driven by pain.

History
[Diagram: hierarchical databases, IMS, XMLDB, OODBMS ➙ RDBMS, with stages 1. standardization and 2. simplification]

History
[Diagram: RDBMS ➙ NOSQL, with stages 3. pain (caching, denormalisation, sharding, replication ...) and 4. rethinking the problem]

Scalability

Numbers of scale




                   http://qos.doubleclick.net/counters/

Types of scaling
» scaling for usage
 » volume of users
 » volume of data
» scaling types of ops
 » concurrent read
 » concurrent write

[Diagram: availability, replication, partitioning, consistency ➙ distribution]

Scaling
[Diagram: database, app server, users]

Scaling: vertical partitioning
[Diagram: database, app server, users]

Scaling: horizontal partitioning
[Diagram: databases, app servers, users; annotated "complexity"]

Scaling through architecture
[Diagram: data layer, app layer, users]

Distributed systems are hard!

8 fallacies of distributed computing
» The network is reliable.
» Latency is zero.
» Bandwidth is infinite.
» The network is secure.
» Topology doesn't change.
» There is one administrator.
» Transport cost is zero.
» The network is homogeneous.

(Peter Deutsch and James Gosling)

Scaling relational systems

» When scaling relational systems you lose their advantages but retain their overhead
» The pain is all about locking (i.e. writes)
» Caching alleviates the read pain at the cost of complexity

The Perspective of Cost




Enter NoSQL

Cambrian Explosion




Buzz-oriented development?

Cambrian Explosion




                               N-O-SQL




Common themes

» SCALE SCALE SCALE

» new datamodels

» devops

» N-O-SQL

» The Cloud :
 technology is of no interest anymore


New Data

» sparse structures

» weak schemas

» graphs

» semi-structured

» document-oriented



NoSQL

» Not a movement.

» Not ANSI NoSQL-2010.

» Not one-size-fits-all.

» Not (necessarily) anti-RDBMS.

» No silver bullet.



NoSQL = pro Choice




NoSQL = toolbox




The NOSQL footprint
[Chart with four poles: free-structured or sparse data; referential integrity, typed data; ACID, constraints, simple operational; highly scalable and available (complexity). Plotted: NOSQL (MongoDB, CouchDB, neo4j, Cassandra, HBase) and SQL.]

NOSQL, if you need ...


» horizontal scaling (out rather than up)

» unusually common data (aka free-structured)

» speed (especially for writes)

» the bleeding edge




SQL/RDBMS, if you need ...


» SQL

» ACID

» normalisation

» a defined liability




Theory

Academic background



» Amazon Dynamo

» Google BigTable

» Eric Brewer CAP theorem




Amazon Dynamo
» popularized the term ‘eventual consistency’

» consistent hashing




Consistent hashing



                                   - node C
                                   + node D




                                                             http://www.lexemetech.com/2007/11/consistent-hashing.html
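
The "- node C / + node D" picture can be sketched in a few lines. This is a minimal illustration, not Dynamo's implementation: `fnv1a` and `HashRing` are hypothetical names, and each node gets a single point on the ring, whereas production systems typically place many virtual points per node to even out the load.

```javascript
// Stand-in 32-bit hash (FNV-1a); any well-distributed hash would do.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Hypothetical ring: a sorted list of { point, node } entries.
class HashRing {
  constructor(nodes = []) {
    this.ring = [];
    nodes.forEach((n) => this.add(n));
  }
  add(node) {
    this.ring.push({ point: fnv1a(node), node });
    this.ring.sort((a, b) => a.point - b.point);
  }
  remove(node) {
    this.ring = this.ring.filter((e) => e.node !== node);
  }
  // A key belongs to the first node clockwise from its hash position.
  lookup(key) {
    const h = fnv1a(key);
    const entry = this.ring.find((e) => e.point >= h) || this.ring[0];
    return entry.node;
  }
}

const keys = Array.from({ length: 1000 }, (_, i) => `key-${i}`);
const placement = (ring) => new Map(keys.map((k) => [k, ring.lookup(k)]));

const ring = new HashRing(['A', 'B', 'C']);
const before = placement(ring);

ring.remove('C'); // - node C
const afterRemove = placement(ring);
const movedOnRemove = keys.filter((k) => before.get(k) !== afterRemove.get(k));

ring.add('D'); // + node D
const afterAdd = placement(ring);
const movedOnAdd = keys.filter((k) => afterRemove.get(k) !== afterAdd.get(k));

console.log(`moved after removing C: ${movedOnRemove.length} of ${keys.length}`);
console.log(`moved after adding D: ${movedOnAdd.length} of ${keys.length}`);
```

Only keys that lived on C move when C leaves, and only keys now owned by D move when D joins; with plain `hash(key) % nodeCount` placement, most keys would move on every membership change.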


Google BigTable
» multi-dimensional column-oriented database

» on top of GoogleFileSystem

» object versioning




CAP theorem
[Diagram: strong consistency, high availability, partition-tolerance]

CAP
» Strong Consistency: all clients see the
 same view, even in the presence of updates
» High Availability: all clients can find some
 replica of the data, even in the presence of
 failures
» Partition-tolerance: the system
 properties hold even when the system is
 partitioned

Culture Clash
» ACID
 » highest priority: strong consistency for transactions
 » availability less important
 » pessimistic
 » rigorous analysis
 » complex mechanisms
» BASE
 » availability and scaling highest priorities
 » weak consistency
 » optimistic
 » best effort
 » simple and fast

[Diagram label between the two columns: spectrum]

Availability ≠ total async!

✘ The Enterprise Service Bus

bus = congestion

Bus systems

» objects don’t fit in a pipe

» object ➙ message

» serialization / de-serialization cost

» message size

» queuing = cost



Use a mixture of both

» async + sync

stuff which matters!

Processing large datasets: Map/Reduce

Hadoop: HDFS + MapReduce
» single filesystem + single execution-space




MapReduce example: WordCount
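
The WordCount idea can be sketched in a single process. This is not Hadoop code: `mapPhase`, `shuffle`, `reducePhase` and `wordCount` are illustrative names, and the in-memory grouping stands in for the distributed shuffle that a MapReduce framework performs between the two phases.

```javascript
// Map: turn one input line into (word, 1) pairs.
function mapPhase(line) {
  return line
    .toLowerCase()
    .split(/\W+/)
    .filter((w) => w.length > 0)
    .map((w) => [w, 1]);
}

// Shuffle: group all emitted values by key (done by the framework in practice).
function shuffle(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return groups;
}

// Reduce: collapse the values of one key into a single count.
function reducePhase(key, values) {
  return [key, values.reduce((a, b) => a + b, 0)];
}

function wordCount(lines) {
  const pairs = lines.flatMap(mapPhase);
  const result = {};
  for (const [key, values] of shuffle(pairs)) {
    const [word, count] = reducePhase(key, values);
    result[word] = count;
  }
  return result;
}

const counts = wordCount(['the quick brown fox', 'the lazy dog', 'The end']);
console.log(counts);
```

Because each map call sees one line and each reduce call sees one key, both phases can be spread over many machines without changing the functions themselves.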




Hadoop ecosystem
» Hadoop Common
» Subprojects
  » Chukwa: A data collection system for managing large distributed systems.
  » HBase: A scalable, distributed database that supports structured data storage for large tables.
  » HDFS: A distributed file system that provides high throughput access to application data.
  » Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  » MapReduce: A software framework for distributed processing of large data sets on compute clusters.
  » Pig: A high-level data-flow language and execution framework for parallel computation.
  » ZooKeeper: A high-performance coordination service for distributed applications.
  » Mahout: machine learning libraries

Processing large datasets with MR

» Benefit from parallelisation
» Less modelling upfront (ad-hoc processing)
» Compartmentalized approach reduces operational risks
» AsterData et al. have SQL/MR hybrids for huge-scale BI

Market overview

Four Trends


» Trend 1 : Data Size

» Trend 2 : Connectedness

» Trend 3 : Semi-structure

» Trend 4 : Architecture




Trend 1: Data size
Each year more and more digital data is created. Over two years we create more digital data than all the data created in history before that.

Exabytes (10^18 bytes) of data stored per year (data source: IDC 2007):
 2006: 161
 2007: 253
 2008: 397
 2009: 623
 2010: 988

Trend 2: Connectedness
Over time data has evolved to be more and more interlinked and connected. Hypertext has links, blogs have pingback, tagging groups all related data.

[Chart: information connectivity over time (1990, 2000, 2010, 2020; web 1.0, web 2.0, “web 3.0”). In rising order: text documents, hypertext, RSS, blogs, wikis, user-generated content, tagging, folksonomies, RDF, ontologies, Giant Global Graph (GGG)]

Trend 3: Semi-structure
» Individualization of content
 » In the salary lists of the 1970s, all elements had exactly one job
 » In the salary lists of the 2000s, we need 5 job columns! Or 8? Or 15?
» All encompassing “entire world views”
 » Store more data about each entity
» Trend accelerated by the decentralization of content generation that is the hallmark of the age of participation (“web 2.0”)

Trend 4: Architecture
1980s: Mainframe applications
[Diagram: one Application on top of one DB]

Trend 4: Architecture
1990s: Database as integration hub
[Diagram: three Applications sharing one DB]

Trend 4: Architecture
2000s: (moving towards) Decoupled services with their own backend
[Diagram: three Applications, each with its own DB]

Products overview

We welcome the Polyglot Persistence overlords.

Categories


» key-value stores

» column stores

» document stores

» graph databases




Key-value stores


» Focus on scaling to huge amounts of data

» Designed to handle big loads

» Often: cf. Amazon Dynamo
 » ring partitioning and replication

» Data model: key/value pairs



Key-value stores



» Redis

» Voldemort

» Tokyo Cabinet




Redis



» REmote DIctionary Server

» http://code.google.com/p/redis/

» vmware




Redis Features
» persisted memcache, ‘awesome’

» RAM-based + persistable

» key ➙ values: string, list, set

» higher-level ops
 » e.g. push/pop and sort for lists
» fast (very)

» configurable durability

» client-managed sharding
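
"Client-managed sharding" means the client, not the server, decides which instance holds a key. A minimal sketch under stated assumptions: the host names are hypothetical, plain `Map` objects stand in for Redis connections so the example runs without a server, and `hashKey`, `shardFor`, `setKey` and `getKey` are illustrative names rather than any Redis client API.

```javascript
// Each shard would be a separate Redis instance; Maps stand in for connections.
const shards = [
  { host: 'redis-0.example.internal', store: new Map() },
  { host: 'redis-1.example.internal', store: new Map() },
  { host: 'redis-2.example.internal', store: new Map() },
];

// Simple deterministic string hash (assumption: any stable hash works here).
function hashKey(key) {
  let h = 0;
  for (let i = 0; i < key.length; i++) {
    h = (Math.imul(h, 31) + key.charCodeAt(i)) >>> 0;
  }
  return h;
}

// The client picks the shard; the servers know nothing about each other.
function shardFor(key) {
  return shards[hashKey(key) % shards.length];
}

function setKey(key, value) {
  shardFor(key).store.set(key, value);
}

function getKey(key) {
  return shardFor(key).store.get(key);
}

setKey('user:42', 'steven');
setKey('user:43', 'lily');
setKey('session:abc', 'active');

console.log(getKey('user:42'), 'lives on', shardFor('user:42').host);
```

Modulo placement is the simplest choice but reshuffles most keys when the number of shards changes, which is one reason Dynamo-style stores use consistent hashing instead.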

Voldemort




» http://project-voldemort.com/

» LinkedIn




Voldemort


» persistent

» distributed

» fault-tolerant

» hash table




Voldemort


                                                        API: GET, PUT,
                                                        DELETE




Voldemort




                    routing logic moving up the stack,
                    smaller latency

Column stores


» BigTable clones

» Sparseness!

» Data model: columns ➙ column families ➙ ACL
 » Datums keyed by: row, column, time, index
 » Row-range ➙ tablet ➙ distribution




Column stores



» BigTable

» HBase

» Cassandra




BigTable



» http://labs.google.com/papers/bigtable.html

» Google

» layered on top of GFS




HBase




» http://hadoop.apache.org/hbase/

» StumbleUpon / Adobe / Cloudera




HBase
» sorted
» distributed
» column-oriented
» multi-dimensional
» highly-available
» high-performance
» persisted
» storage system
» adds random access reads and writes atop HDFS

HBase data model
» Distributed multi-dimensional sparse map

» Multi-dimensional keys:
 (table, row, family:column, timestamp) → value




» Keys are arbitrary strings

» Access to row data is atomic
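
The multi-dimensional key can be pictured as nested maps. The sketch below is an in-memory illustration only: `SparseTable`, `put` and `get` are hypothetical names, not the HBase client API, and a single table is assumed so the table dimension is left out.

```javascript
// Sparse, versioned cell map: row -> "family:column" -> timestamp -> value.
class SparseTable {
  constructor() {
    this.rows = new Map();
  }

  put(row, column, timestamp, value) {
    if (!this.rows.has(row)) this.rows.set(row, new Map());
    const columns = this.rows.get(row);
    if (!columns.has(column)) columns.set(column, new Map());
    columns.get(column).set(timestamp, value);
  }

  // Returns the newest value at or before `timestamp` (default: the latest).
  get(row, column, timestamp = Infinity) {
    const columns = this.rows.get(row);
    const versions = columns ? columns.get(column) : undefined;
    if (!versions) return undefined; // absent cells cost nothing: sparseness
    let bestTs = -Infinity;
    let bestValue;
    for (const [ts, value] of versions) {
      if (ts <= timestamp && ts > bestTs) {
        bestTs = ts;
        bestValue = value;
      }
    }
    return bestValue;
  }
}

const table = new SparseTable();
table.put('row1', 'info:name', 1, 'Anakin Skywalker');
table.put('row1', 'info:name', 2, 'Darth Vader');
table.put('row2', 'info:age', 1, 63);

console.log(table.get('row1', 'info:name'));    // latest version
console.log(table.get('row1', 'info:name', 1)); // version as of timestamp 1
```

Rows only store the cells they actually have, which is why wide, mostly empty tables are cheap in this model.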
Storage architecture




                                                                                © lars george

Cassandra




» http://cassandra.apache.org/

» Rackspace / Facebook




Cassandra
» Key-value store (with added structure)
» Reliability (identical nodes)
» Eventually consistent
» Distributed
» Tunable
 » Partitioning
 » Replication

[Diagram: CAP triangle with corners C, A and P]

Cassandra applicability
» FIT
 » Scalable reliability (through identical nodes)
 » Linear scaling
 » Write throughput
 » Large Data Sets
» NO FIT
 » Flexible indexing
 » Only PK-based querying
 » Big Binary Data
 » 1 Row must fit in RAM entirely

Document databases


» ≈ K/V stores, but DB knows what the Value is

» Lotus Notes heritage

» Data model: collections of K/V collections

» Documents often versioned




Document stores


» CouchDB

» MongoDB

» Riak

» MarkLogic




CouchDB




» http://couchdb.apache.org/

» couch.io




CouchDB


» fault-tolerant

» schema-free

» document-oriented

» accessible via a RESTful HTTP/JSON API




CouchDB documents

{
    "_id": "BCCD12CBB",
    "_rev": "AB764C",
    "type": "person",
    "name": "Darth Vader",
    "age": 63,
    "headware": ["Helmet", "Sombrero"],
    "dark_side": true
}


CouchDB REST API


» HTTP
 » PUT /db/docid
 » GET /db/docid
 » POST /db
 » DELETE /db/docid




CouchDB Views
» MapReduce-based

» Filter, Collate, Aggregate

» Javascript

map:
 function (doc) {
   for (var i in doc.tags)
     emit(doc.tags[i], 1);
 }

reduce:
 function (Key, Values) {
   var sum = 0;
   for (var i in Values)
     sum += Values[i];
   return sum;
 }
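
To see what this view computes, the two functions can be run over a few in-memory documents. The harness below is an assumption made for illustration: CouchDB itself supplies `emit`, groups and sorts the emitted keys, and updates views incrementally; the `emitted`, `grouped` and `tagCounts` variables and the sample documents are hypothetical.

```javascript
// Stand-in for the emit() function that CouchDB provides to map functions.
const emitted = [];
function emit(key, value) {
  emitted.push([key, value]);
}

// The view functions from the slide: count documents per tag.
function map(doc) {
  for (var i in doc.tags)
    emit(doc.tags[i], 1);
}

function reduce(key, values) {
  var sum = 0;
  for (var i in values)
    sum += values[i];
  return sum;
}

// Hypothetical sample documents; the last one has no tags and emits nothing.
const docs = [
  { _id: 'a', tags: ['nosql', 'couchdb'] },
  { _id: 'b', tags: ['nosql'] },
  { _id: 'c' },
];
docs.forEach((doc) => map(doc));

// Group emitted values by key, then reduce each group.
const grouped = {};
for (const [key, value] of emitted) {
  (grouped[key] = grouped[key] || []).push(value);
}
const tagCounts = {};
for (const key of Object.keys(grouped)) {
  tagCounts[key] = reduce(key, grouped[key]);
}

console.log(tagCounts);
```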



CouchDB


» be careful about semantics
 » replication ≠ partitioning/sharding !
 » distributed database = distributable database
» sharded / distributed deployment requires proxy layer



MongoDB




» http://www.mongodb.org/

» 10gen




MongoDB
» cf. CouchDB, really

» except for:
 » C++
 » performance focus
 » runtime queries (mapreduce still available)
 » native drivers (no REST/HTTP layering)
 » no MVCC: update-in-place
 » auto sharding (alpha)

Graph databases


» Focus on modeling structure of data -
 interconnectivity
» Scale, but only to the complexity of data

» Data model: property graphs




Graph databases




» Neo4j

» AllegroGraph (RDF)




Neo4j




» http://neo4j.org/

» Neo Technology




Neo4j
» data = nodes + relationships + key/value properties
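
The property-graph model (nodes, relationships, and key/value properties on both) can be sketched as a small in-memory structure. `PropertyGraph`, `friendsOfFriends` and the sample people are hypothetical and only illustrate the data model; this is not the Neo4j API.

```javascript
// Nodes and relationships both carry key/value properties.
class PropertyGraph {
  constructor() {
    this.nodes = new Map(); // id -> properties
    this.relationships = []; // { from, type, to, properties }
  }

  addNode(id, properties = {}) {
    this.nodes.set(id, properties);
  }

  relate(from, type, to, properties = {}) {
    this.relationships.push({ from, type, to, properties });
  }

  // Follow outgoing relationships of one type from a node.
  neighbours(id, type) {
    return this.relationships
      .filter((r) => r.from === id && r.type === type)
      .map((r) => r.to);
  }
}

// Two-hop traversal: people known by the people `id` knows.
function friendsOfFriends(graph, id) {
  const direct = new Set(graph.neighbours(id, 'KNOWS'));
  const result = new Set();
  for (const friend of direct) {
    for (const fof of graph.neighbours(friend, 'KNOWS')) {
      if (fof !== id && !direct.has(fof)) result.add(fof);
    }
  }
  return [...result];
}

const graph = new PropertyGraph();
['alice', 'bob', 'carol', 'dave'].forEach((id) => graph.addNode(id, { name: id }));
graph.relate('alice', 'KNOWS', 'bob', { since: 2005 });
graph.relate('alice', 'KNOWS', 'dave', { since: 2009 });
graph.relate('bob', 'KNOWS', 'carol', { since: 2008 });
graph.relate('bob', 'KNOWS', 'dave', { since: 2007 });

console.log(friendsOfFriends(graph, 'alice'));
```

In a relational schema the same question needs a self-join per hop; in a graph store it is a traversal along stored relationships.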




Neo4j
» many language bindings, little remoting

» ‘whiteboard’ friendly

» scaling to complexity (rather than volume?)

» lots of focus on domain modelling

» SPARQL/SAIL impl for triple geeks

» mostly RAM centric (with disk swapping &
 persistence)

Market maturation

Rise of integrators


» Cloudera (H-stack)

» Riptano (Cassandra)

» Cloudant (hosted CouchDB)

» (Outerthought: HBase)




Venture capital

» Cloudera

» couch.io

» Neo

» 10gen

» many others



Experiences & (h)in(d)sights

the fireside conversations




http://www.flickr.com/photos/52641994@N00/516394238/



NOSQL applicability

» Horizontal scaling

» Multi-Master

» Data representation
 » search for simplicity
 » data that doesn’t fit the E-R model
   (graphs, trees, versions)
» Speed


Tool selection
» be wary of marketing-speak: beware of smoke and mirrors!
» monitor dev list, IRC, Twitter, blogs
» monitor project ‘sponsors’
» mix-and-match: polyglot persistence
» DON’T NOSQL WITHOUT INTERNAL SYS ARCHS & DEV(OP)S !

Our Context: Lily



» cloud-scalable content store and search
 repository
» successor (in many ways) of Daisy




[Chart: aptness plotted against complexity. The internet / enterprise range is bracketed as NOSQL; the corporate / community range is bracketed as SQL.]

Lily essentials

» (open source)

» scalable store (Apache HBase)

» and search (Apache SOLR)

» content repository

» alpha release due mid-2010

» www.lilycms.org or @outerthought


Choosing a NoSQL store for Lily: step I
» automatic scaling to large data sets

» fault-tolerance

» flexible datamodel with sparse data

» commodity hardware

» efficient random access

» community-based open source

» Java if possible

Choosing a NoSQL store for Lily: step II




» need for consistency

» atomic single-row updates

» M/R for index regeneration




Choosing a NoSQL store for Lily: step III


 HBase
» datamodel with column families and cell
 versioning
» ordered tables with range scans

» HDFS for blob storage

» Apache
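
"Ordered tables with range scans" is the property that makes key design matter: rows with a common key prefix sit next to each other and can be read in one pass. A minimal sketch, assuming an in-memory sorted array in place of a table; `lowerBound`, `scan` and the row keys are hypothetical, and the start-inclusive, stop-exclusive convention is chosen here for illustration.

```javascript
// Binary search: index of the first key >= target in a sorted array.
function lowerBound(sortedKeys, target) {
  let lo = 0;
  let hi = sortedKeys.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (sortedKeys[mid] < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Range scan: start row inclusive, stop row exclusive.
function scan(sortedKeys, startRow, stopRow) {
  return sortedKeys.slice(
    lowerBound(sortedKeys, startRow),
    lowerBound(sortedKeys, stopRow)
  );
}

// Hypothetical row keys; a "type:" prefix keeps related rows adjacent.
const rowKeys = ['img:0002', 'doc:0010', 'doc:0001', 'img:0001', 'doc:0002'].sort();

// ';' is the character right after ':', so 'doc;' bounds every 'doc:' key.
const docRows = scan(rowKeys, 'doc:', 'doc;');
console.log(docRows);
```

The same lookup on a hash-partitioned store would have to ask every node, which is the trade-off between ordered tables and ring-style key-value stores.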

Lily simplified architecture
[Diagram: clients connect to Lily store nodes. Layers shown:
 » distributed process coordination and configuration (ZooKeeper)
 » Lily Store Server: query, update, indexer; WAL, MQ, M/R
 » HBase Region Server: documents, 2ary indexes, WAL / MQ
 » Hadoop DFS
 » SOLR (via REST): inverted index, index replicas]

lily architecture
lily distributed architecture

Distribution, durability, and availability are everywhere

When combining store and search, make sure your (search) index doesn’t become the store.

Key lessons learned
» unlearning normalization is very difficult

» integrity checking in code = not so bad

» doing joins in code can be very liberating

» importance of keyspace design
 » secondary indexing
» data de-normalization = size! (x3)

» schema vs. code flexibility?

» distribution is everywhere
 and you shouldn’t forget about it
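
"Joins in code" and "integrity checking in code" can look as simple as the sketch below. The `authors` and `articles` collections and the `joinArticles` and `danglingReferences` helpers are hypothetical; in a real application the lookups would be reads against the store rather than an in-memory `Map`.

```javascript
// Two "tables" held by key; the store itself knows nothing about the link.
const authors = new Map([
  ['a1', { name: 'Ada' }],
  ['a2', { name: 'Grace' }],
]);

const articles = [
  { id: 'p1', title: 'Intro to NoSQL', authorId: 'a1' },
  { id: 'p2', title: 'Sharding notes', authorId: 'a2' },
  { id: 'p3', title: 'Orphaned draft', authorId: 'a9' }, // dangling reference
];

// Join in code: resolve each reference with a key lookup.
function joinArticles(articleList, authorIndex) {
  return articleList.map((article) => {
    const author = authorIndex.get(article.authorId);
    return { title: article.title, author: author ? author.name : null };
  });
}

// Integrity check in code: report references the store would not catch.
function danglingReferences(articleList, authorIndex) {
  return articleList
    .filter((article) => !authorIndex.has(article.authorId))
    .map((article) => article.id);
}

const joined = joinArticles(articles, authors);
const dangling = danglingReferences(articles, authors);
console.log(joined);
console.log('dangling:', dangling);
```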
Mis-use cases
» SQL (or ORM) is a prerequisite

» Deeply hierarchical datasets (unless graph)

» Data integrity is listed in the DBA job description

» High-security apps (enforced in DB)

» Transactional data (banking)

» Usage is highly unpredictable, combinatorial, or likely to change suddenly

Reading material

» Amazon Dynamo, Google BigTable, CAP

» http://nosql.mypopescu.com/

» http://nosql-database.org/

» http://twitter.com/nosqlupdate

» http://highscalability.com/



Questions?

http://www.flickr.com/photos/leehaywood/4237636853/

Thanks for your attention!

» stevenn@outerthought.org
» @stevenn

KVIV / NoSQL : the new generation of database servers

  • 1. NoSQL de nieuwe generatie van database servers KVIV IT - 3/6/2010 http://www.flickr.com/photos/wolfgangstaudt/2215246206/ IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 2. Who am I » Steven Noels - stevenn@outerthought.org » Outerthought : scalable content applications » makers of Daisy and Lily open source CMS IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 2
  • 3. An evolution driven by pain. IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 4. History 2. simplification 1. standardization hierarchical databases IMS XMLDB RDBMS OODBMS IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 4
  • 5. History 4. rethinking the problem RDBMS NOSQL caching denormalisation sharding replication ... 3. pain IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 5
  • 6. Scalability IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 7. Numbers of scale http://qos.doubleclick.net/counters/ IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 7
  • 8. Types of scaling » scaling for usage » scaling types of ops » volume of users » concurrent read » volume of data » concurrent write availability partioning replication consistency distribution IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 8
  • 9. Scaling database app server users IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 9
  • 10. Scaling database app server users IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 10
  • 11. Scaling: vertical partitioning database app server users IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 11
  • 12. Scaling: horizontal partitioning c o m p l e x i t y databases app servers users IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 12
  • 13. Scaling through architecture data layer app layer users IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 13
  • 14. Distributed systems are hard ! IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 15. 8 fallacies of distributed computing » The network is reliable. » Latency is zero. » Bandwidth is infinite. Peter Deutsch and James Gosling » The network is secure. » Topology doesn't change. » There is one administrator. » Transport cost is zero. » The network is homogeneous. IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 15
  • 16. Scaling relational systems » When scaling relational systems you loose their advantages but retain their overhead » The pain is all about locking (i.e. writes) » Caching alleviates the read pain to the cost of complexity IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 16
  • 17. The Perspective of Cost IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 17
  • 18. Enter NoSQL IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 19. Cambrian Explosion IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 19
  • 20. Buzz-oriented development ? IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 20
  • 21. Cambrian Explosion N-O-SQL IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 21
  • 22. IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 22
  • 23. Common themes » SCALE SCALE SCALE » new datamodels » devops » N-O-SQL » The Cloud : technology is of no interest anymore IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 23
  • 24. New Data » sparse structures » weak schemas » graphs » semi-structured » document-oriented IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 24
  • 25. NoSQL » Not a movement. » Not ANSI NoSQL-2010. » Not one-size-fits-all. » Not (necessarily) anti-RDBMS. » No silver bullet. IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 25
  • 26. NoSQL = pro Choice IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 26
  • 27. NoSQL = toolbox IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 27
  • 28. The NOSQL footprint free-structured or sparse data NOSQL MongoDB CouchDB neo4j Cassandra available (complexity) simple operational HBase highly scalable and constraints ACID, SQL referential integrity, typed data IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 28
  • 29. NOSQL, if you need ... » horizontal scaling (out rather than up) » unusually common data (aka free-structured) » speed (especially for writes) » the bleeding edge IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 29
  • 30. SQL/RDBMS, if you need ... » SQL » ACID » normalisation » a defined liability IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 30
  • 31. Theory IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 32. Academic background » Amazon Dynamo » Google BigTable » Eric Brewer CAP theorem IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 32
  • 33. Amazon Dynamo » coined the term ‘eventual consistency’ » consistent hashing IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 33
  • 34. Consistent hashing - node C + node D http://www.lexemetech.com/2007/11/consistent-hashing.html IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 34
  • 35. Google BigTable » multi-dimensional column-oriented database » on top of GoogleFileSystem » object versioning IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 35
  • 36. CAP theorem strong high consistency availability partition- tolerance IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 36
  • 37. CAP » Strong Consistency: all clients see the same view, even in the presence of updates » High Availability: all clients can find some replica of the data, even in the presence of failures » Partition-tolerance: the system properties hold even when the system is partitioned IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 37
  • 38. Culture Clash » ACID » BASE » highest priority: strong » availability and scaling consistency for highest priorities transactions » weak consistency » availability less important » optimistic » pessimistic » best effort » rigorous analysis » simple and fast » complex mechanisms spectrum IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 38
  • 39. Availability ≠ total async ! IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 40. ✘ The Enterprise Service Bus bus = congestion IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 40
  • 41. Bus systems » objects don’t fit in a pipe » object ➙ message » serialization / de-serialization cost » message size » queuing = cost IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 41
  • 42. Use a mixture of both »async + sync stuff which matters ! IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 42
  • 43. Processing large datasets : Map/Reduce IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 44. Hadoop: HDFS + MapReduce » single filesystem + single execution-space IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 44
  • 45. MapReduce example: WordCount IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 45
  • 46. Hadoop ecosystem » Hadoop Common » Subprojects » Chukwa: A data collection system for managing large distributed systems. » HBase: A scalable, distributed database that supports structured data storage for large tables. » HDFS: A distributed file system that provides high throughput access to application data. » Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying. » MapReduce: A software framework for distributed processing of large data sets on compute clusters. » Pig: A high-level data-flow language and execution framework for parallel computation. » ZooKeeper: A high-performance coordination service for distributed applications. » Mahout: machine learning libaries IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 46
  • 47. Processing large datasets with MR » Benefit from parallellisation » Less modelling upfront (ad-hoc processing) » Compartmentalized approach reduces operational risks » AsterData et al. have SQL/MR hybrids for huge-scale BI IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 47
  • 48. Market overview IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 49. Four Trends » Trend 1 : Data Size » Trend 2 : Connectedness » Trend 3 : Semi-structure » Trend 4 : Architecture IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 49
  • 50. Trend 1: Data size ExaBytes (10!") of data stored per year 988 1000 Each year more and more digital data is created. Over t wo 750 years we create more digital data than all 623 the data created in history before that. 500 397 253 250 161 0 2006 2007 2008 2009 2010 Data source: IDC 2007 3 IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 50
  • 51. Trend 2: Connectedness Giant Global Graph (GGG) Over time data has evolved to Ontologies be more and more interlinked and connected. RDF Hypertext has links, Blogs have pingback, Tagging groups all related data Folksonomies Information connectivity Tagging Wikis User-generated content Blogs RSS Hypertext Text documents web 1.0 web 2.0 “web 3.0” 1990 2000 2010 2020 4 IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 51
  • 52. Trend 3: Semi-structure ! Individualization of content • In the salary lists of the 1970s, all elements had exactly one job • In Or 15? lists of the 2000s, we need 5 job columns! Or 8? the salary ! All encompassing “entire world views” • Store more data about each entity ! Trend accelerated by the decentralization of content generation that is the hallmark of the age of participation (“web 2.0”) 5 IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 52
  • 53. Trend 4: Architecture 1980s: Mainframe applications Application DB 6 IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 53
  • 54. Trend 4: Architecture 1990s: Database as integration hub Application Application Application DB 7 IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 54
  • 55. Trend 4: Architecture 2000s: (moving towards) Decoupled services with their own backend Application Application Application DB DB DB 8 IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 55
  • 56. Products overview IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 57. We welcome the Polyglot Persistence overlords. IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 58. Categories » key-value stores » column stores » document stores » graph databases IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 58
  • 59. Key-value stores » Focus on scaling to huge amounts of data » Designed to handle big loads » Often: cfr. Amazon Dynamo » ring partitioning and replication » Data model: key/value pairs IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 59
  • 60. Key-value stores » Redis » Voldemort » Tokyo Cabinet IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 60
  • 61. Redis » REmote DIctionary Server » http://code.google.com/p/redis/ » vmware IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 61
  • 62. Redis Features » persisted memcache, ‘awesome’ » RAM-based + persistable » key ➙ values: string, list, set » higher-level ops » i.e. push/pop and sort for lists » fast (very) » configurable durability » client-managed sharding IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 62
  • 63. Voldemort » http://project-voldemort.com/ » LinkedIn IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 63
  • 64. Voldemort » persistent » distributed » fault-tolerant » hash table IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 64
  • 65. Voldemort API: GET, PUT, DELETE IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 65
  • 66. Voldemort routing logic moving up the stack, smaller latency IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 66
  • 67. Column stores » BigTable clones » Sparseness! » Data model: columns ➙ column families ➙ ACL » Datums keyed by: row, column, time, index » Row-range ➙ tablet ➙ distribution
  • 68. Column stores » BigTable » HBase » Cassandra
  • 69. BigTable » http://labs.google.com/papers/bigtable.html » Google » layered on top of GFS
  • 70. HBase » http://hadoop.apache.org/hbase/ » StumbleUpon / Adobe / Cloudera
  • 71. HBase » sorted » persisted » distributed » storage system » column-oriented » multi-dimensional » highly-available » adds random access » high-performance reads and writes atop HDFS
  • 72. HBase data model » Distributed multi-dimensional sparse map » Multi-dimensional keys: (table, row, family:column, timestamp) → value » Keys are arbitrary strings » Access to row data is atomic
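The multi-dimensional sparse map can be pictured as nested dictionaries. The sketch below is a single-process toy, not the HBase client API; the class `SparseTable`, its methods, and the sample row keys are hypothetical names used for illustration, and one instance stands in for one table.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of a (row, family:column, timestamp) -> value map."""

    def __init__(self):
        # row key -> "family:qualifier" -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (newest overall if None)."""
        versions = self._rows.get(row, {}).get(column, {})
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(candidates)] if candidates else None

    def row(self, row):
        """Return the latest value of each cell that exists; absent columns store nothing."""
        cells = self._rows.get(row, {})
        return {col: versions[max(versions)] for col, versions in cells.items()}


table = SparseTable()
table.put("com.example/index", "content:html", 1, "<html>v1</html>")
table.put("com.example/index", "content:html", 2, "<html>v2</html>")
table.put("com.example/index", "meta:lang", 1, "en")
table.put("org.example/about", "meta:lang", 1, "nl")

print(table.get("com.example/index", "content:html"))     # newest version
print(table.get("com.example/index", "content:html", 1))  # version as of timestamp 1
print(table.row("org.example/about"))                     # only the cells that exist
```

Sparseness falls out of the representation: a row that never wrote `content:html` simply has no entry for it.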
  • 73. Storage architecture © Lars George
  • 74. Cassandra » http://cassandra.apache.org/ » Rackspace / Facebook
  • 75. Cassandra » Key-value store (with added structure) » Reliability (identical nodes) » Eventually consistent » Distributed » Tunable » Partitioning » Replication » [diagram: CAP triangle with corners A, C, P]
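The slide only says "Tunable". One common reading in Dynamo-style stores is that read and write replica counts are chosen per operation, and that a read is guaranteed to see the latest acknowledged write when R + W > N. The brute-force check below is an illustrative sketch of that overlap argument; the function name and the level-to-count mapping for N = 3 are assumptions, not Cassandra code.

```python
from itertools import combinations


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """True if every set of w written replicas shares a node with every set of r read replicas."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


# Replica counts for N = 3 (QUORUM is floor(N / 2) + 1).
N = 3
levels = {"ONE": 1, "QUORUM": N // 2 + 1, "ALL": N}

for write_level, w in levels.items():
    for read_level, r in levels.items():
        verdict = "overlap" if quorums_always_overlap(N, w, r) else "may miss latest write"
        print(f"write={write_level:<6} read={read_level:<6} -> {verdict}")
```

Lower levels trade that guarantee for latency and availability, which is the tuning knob.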
  • 76. Cassandra applicability » FIT: scalable reliability (through identical nodes), linear scaling, write throughput, large data sets » NO FIT: flexible indexing, only PK-based querying, big binary data, 1 row must fit in RAM entirely
  • 79. Document databases » ≈ K/V stores, but DB knows what the Value is » Lotus Notes heritage » Data model: collections of K/V collections » Documents often versioned
  • 80. Document stores » CouchDB » MongoDB » Riak » MarkLogic
  • 81. CouchDB » http://couchdb.apache.org/ » couch.io
  • 82. CouchDB » fault-tolerant » schema-free » document-oriented » accessible via a RESTful HTTP/JSON API
  • 83. CouchDB documents { "_id": "BCCD12CBB", "_rev": "AB764C", "type": "person", "name": "Darth Vader", "age": 63, "headware": ["Helmet", "Sombrero"], "dark_side": true }
  • 84. CouchDB REST API » HTTP » PUT /db/docid » GET /db/docid » POST /db (server-assigned id) » DELETE /db/docid
  • 85. CouchDB Views » MapReduce-based » Filter, Collate, Aggregate » Javascript » map: function (doc) { for (var i in doc.tags) emit(doc.tags[i], 1); } » reduce: function (Key, Values) { var sum = 0; for (var i in Values) sum += Values[i]; return sum; }
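To see what the tag-counting view on this slide computes, here is a small in-memory simulation in Python. It mirrors the map and reduce functions but is only a sketch: `run_view`, `map_doc`, `reduce_values`, and the sample documents are hypothetical, and a real CouchDB view is built incrementally and stored as a B-tree rather than recomputed like this.

```python
from collections import defaultdict


def map_doc(doc):
    """Emit (tag, 1) for every tag of a document, like the slide's map function."""
    for tag in doc.get("tags", []):
        yield tag, 1


def reduce_values(key, values):
    """Sum the emitted values for one key, like the slide's reduce function."""
    return sum(values)


def run_view(docs):
    """Group emitted pairs by key (collation), then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    return {key: reduce_values(key, values) for key, values in sorted(grouped.items())}


docs = [
    {"_id": "a", "tags": ["nosql", "couchdb"]},
    {"_id": "b", "tags": ["nosql", "hbase"]},
    {"_id": "c"},  # documents without tags emit nothing
]
tag_counts = run_view(docs)
print(tag_counts)
```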
  • 86. CouchDB » be careful with the semantics » replication ≠ partitioning/sharding! » distributed database = distributable database » sharded / distributed deployment requires a proxy layer
  • 87. MongoDB » http://www.mongodb.org/ » 10gen
  • 88. MongoDB » cf. CouchDB, really » except for: » C++ » performance focus » runtime queries (mapreduce still available) » native drivers (no REST/HTTP layering) » no MVCC: update-in-place » auto sharding (alpha)
  • 89. Graph databases » Focus on modeling structure of data - interconnectivity » Scale, but only to the complexity of data » Data model: property graphs
  • 90. Graph databases » Neo4j » AllegroGraph (RDF)
  • 91. Neo4j » http://neo4j.org/ » Neo Technology
  • 92. Neo4j » data = nodes + relationships + key/value properties
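A property graph (nodes + relationships + key/value properties) can be sketched in a few lines. This is an in-memory illustration, not the Neo4j API; the classes `Node`, `Relationship`, `PropertyGraph` and the sample data are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: int
    end: int
    rel_type: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Toy property graph: nodes and typed, directed relationships, both with properties."""

    def __init__(self):
        self.nodes = {}
        self.relationships = []

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)
        return self.nodes[node_id]

    def relate(self, start, end, rel_type, **properties):
        rel = Relationship(start, end, rel_type, properties)
        self.relationships.append(rel)
        return rel

    def neighbours(self, node_id, rel_type):
        """Follow outgoing relationships of one type from a node."""
        return [
            self.nodes[rel.end]
            for rel in self.relationships
            if rel.start == node_id and rel.rel_type == rel_type
        ]


graph = PropertyGraph()
graph.add_node(1, name="Anakin")
graph.add_node(2, name="Luke")
graph.add_node(3, name="Leia")
graph.relate(1, 2, "FATHER_OF", since=19)
graph.relate(1, 3, "FATHER_OF", since=19)
graph.relate(2, 3, "SIBLING_OF")

print([node.properties["name"] for node in graph.neighbours(1, "FATHER_OF")])
```

Queries are traversals over relationships rather than joins, which is why such a model is described as scaling with the complexity of the data.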
  • 93. Neo4j » many language bindings, little remoting » ‘whiteboard’ friendly » scaling to complexity (rather than volume?) » lots of focus on domain modelling » SPARQL/SAIL impl for triple geeks » mostly RAM centric (with disk swapping & persistence)
  • 94. Market maturation
  • 95. Rise of integrators » Cloudera (H-stack) » Riptano (Cassandra) » Cloudant (hosted CouchDB) » (Outerthought: HBase)
  • 96. Venture capital » Cloudera » couch.io » Neo » 10gen » many others
  • 97. Experiences & (h)in(d)sights
  • 98. the fireside conversations http://www.flickr.com/photos/52641994@N00/516394238/
  • 99. NOSQL applicability » Horizontal scaling » Multi-Master » Data representation » search for simplicity » data that doesn’t fit the E-R model (graphs, trees, versions) » Speed
  • 100. Tool selection » beware of the marketese: smoke and mirrors! » monitor dev list, IRC, Twitter, blogs » monitor project ‘sponsors’ » mix-and-match: polyglot persistence » DON’T NOSQL WITHOUT INTERNAL SYS ARCHS & DEV(OP)S!
  • 101. Our Context: Lily » cloud-scalable content store and search repository » successor (in many ways) of Daisy
  • 102. [Figure: NOSQL vs. SQL; axis labels: aptness, complexity; other labels: internet, enterprise, corporate, community]
  • 103. Lily essentials » (open source) » scalable store (Apache HBase) » and search (Apache Solr) » content repository » alpha due mid-2010 » www.lilycms.org or @outerthought
  • 104. Choosing a NoSQL store for Lily: step I » automatic scaling to large data sets » fault-tolerance » flexible datamodel with sparse data » commodity hardware » efficient random access » community-based open source » Java if possible
  • 105. Choosing a NoSQL store for Lily: step II » need for consistency » atomic single-row updates » M/R for index regeneration
  • 106. Choosing a NoSQL store for Lily: step III: HBase » datamodel with column families and cell versioning » ordered tables with range scans » HDFS for blob storage » Apache
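"Ordered tables with range scans" means rows are kept sorted by key, so all rows sharing a key prefix can be read in one contiguous scan. The sketch below imitates that with a sorted key list; `OrderedTable`, its methods, and the key layout `doc:<id>|<version>` are hypothetical, not Lily's or HBase's actual schema.

```python
import bisect


class OrderedTable:
    """Toy table that keeps row keys sorted so range scans are contiguous."""

    def __init__(self):
        self._keys = []    # sorted row keys
        self._values = {}  # row key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start, stop):
        """Return rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(key, self._values[key]) for key in self._keys[lo:hi]]


table = OrderedTable()
table.put("doc:0002|v1", "other document")
table.put("doc:0001|v2", "second version")
table.put("user:7", "a user row")
table.put("doc:0001|v1", "first version")

# All versions of document 0001 sit next to each other in key order.
versions = table.scan("doc:0001", "doc:0002")
print(versions)
```

Which rows end up adjacent is decided entirely by how the key is composed, so the key layout determines which queries are cheap.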
  • 107. Lily simplified architecture [diagram labels: distributed process coordination and configuration (ZooKeeper); Lily Server; Lily Store; indexer; query; update; store client; MQ client; M/R client; WAL / MQ; 2ary WAL; HBase Region Server store nodes (documents, indexes); Hadoop DFS replicas; SOLR inverted index replicas; REST]
  • 108. lily architecture
  • 109. lily distributed architecture
  • 110. Distribution, durability, and availability are everywhere
  • 111. When combining store and search, make sure your (search) index doesn’t become the store.
  • 112. Key lessons learned » unlearning normalization is very difficult » integrity checking in code = not so bad » doing joins in code can be very liberating » importance of keyspace design » secondary indexing » data de-normalization = size! (x3) » schema vs. code flexibility? » distribution is everywhere and you shouldn’t forget about it
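"Joins in code" and "integrity checking in code" can be as simple as a keyed lookup per row, with the missing-reference case handled explicitly by the application. A minimal sketch, assuming hypothetical `orders` and `customers_by_id` collections already fetched from the store:

```python
def join_in_code(orders, customers_by_id):
    """Attach the customer name to each order; dangling references become None."""
    joined = []
    for order in orders:
        customer = customers_by_id.get(order["customer_id"])
        joined.append({
            **order,
            # Integrity check in code: the store will not reject a dangling reference.
            "customer_name": customer["name"] if customer else None,
        })
    return joined


customers_by_id = {
    "c1": {"name": "Ada"},
    "c2": {"name": "Grace"},
}
orders = [
    {"order_id": 1, "customer_id": "c1"},
    {"order_id": 2, "customer_id": "c2"},
    {"order_id": 3, "customer_id": "c9"},  # dangling reference
]

joined = join_in_code(orders, customers_by_id)
dangling = [row["order_id"] for row in joined if row["customer_name"] is None]
print(joined)
print(dangling)
```

The application decides what a dangling reference means (skip, repair, or report), which is the flexibility and the burden of moving these checks out of the database.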
  • 113. Mis-use cases » SQL (or ORM) is a prerequisite » Deeply hierarchical datasets (unless graph) » Data integrity is listed on DBA job description » High-security apps (enforced in DB) » Transactional data (banking) » Usage is highly unpredictable, combinatorial, or likely to change suddenly
  • 114. Reading material » Amazon Dynamo, Google BigTable, CAP » http://nosql.mypopescu.com/ » http://nosql-database.org/ » http://twitter.com/nosqlupdate » http://highscalability.com/
  • 115. Questions? http://www.flickr.com/photos/leehaywood/4237636853/
  • 116. Thanks for your attention! » stevenn@outerthought.org » @stevenn