NOSQL
BIG DATA
NEW DATABASES
Stuff.
  IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
Hey!

I won’t do a demo.

I’m @stevenn or stevenn@outerthought.org

Houston, we have a problem.

We’re drowning.

Drowning in a Sea of Data.

Mountains of Metadata.

The firehose of UGC.

Still, we can’t make much sense of it.

... and we throw a lot of it away.

We regard content as cost.

But data is an opportunity.

Think about it.

» advertisements
» recommendations
» anything that sells
» profile harvesting

The future is for datanerds.

Nerdy enough?

This is what Big Data is about: new insights, new business.

But first, some history.

How did NOSQL happen?

[Diagram 1: hierarchical databases, IMS, XMLDB, OODBMS → RDBMS, labelled "1. standardization" and "2. simplification"]

[Diagram 2: RDBMS → NOSQL, labelled "3. pain" (caching, denormalisation, sharding, replication ...) and "4. rethinking the problem"]

Numbers of scale

» http://qos.doubleclick.net/counters/

» Twitter does 12 M tweet displays ... per second.

Types of scaling

» scaling for usage
 » volume of users
 » volume of data
» scaling types of ops
 » concurrent read
 » concurrent write

» availability, replication, partitioning, consistency: distributed systems

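The partitioning and replication named on this slide can be sketched as hash-based sharding. This is a minimal illustration only, not how any particular store does it: the function names, the choice of MD5 as a stable hash, and the "next shards in ring order" replica placement are all assumptions made for the example.

```python
import hashlib

def shard_for(key, num_shards):
    """Map a key to a shard index using a stable hash.

    Python's built-in hash() is salted per process, so a digest is used instead.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def replicas_for(key, num_shards, replication_factor):
    """Primary shard plus the following shards (ring order) as replicas.

    Assumes replication_factor <= num_shards (hypothetical placement rule).
    """
    primary = shard_for(key, num_shards)
    return [(primary + i) % num_shards for i in range(replication_factor)]

placement = replicas_for("user:42", num_shards=5, replication_factor=3)
```

Every client that hashes the same key lands on the same shard, which spreads both data volume and concurrent writes. Note that changing `num_shards` remaps most keys, which is one reason real systems reach for more elaborate schemes.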
... and distributed systems are HARD.

8 fallacies of distributed computing

» The network is reliable.

» Latency is zero.

» Bandwidth is infinite.

» The network is secure.

» Topology doesn't change.

» There is one administrator.

» Transport cost is zero.

» The network is homogeneous.

Peter Deutsch and James Gosling

Data.

Trend 1: Data size

Each year more and more digital data is created. Over two years we create more digital data than all the data created in history before that.

Exabytes (10^18 bytes) of data stored per year (data source: IDC 2007):

Year    Exabytes
2006    161
2007    253
2008    397
2009    623
2010    988

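The growth rate behind the figures above can be worked out in a few lines. The numbers are the five values shown on the slide; the check on "the last two years versus everything before" only covers the charted years, since the slide gives no figures before 2006.

```python
# Exabytes stored per year, as shown on the slide (data source: IDC 2007).
exabytes = {2006: 161, 2007: 253, 2008: 397, 2009: 623, 2010: 988}

def yearly_growth(series):
    """Return the year-on-year growth factor for each year after the first."""
    years = sorted(series)
    return {year: series[year] / series[year - 1] for year in years[1:]}

growth = yearly_growth(exabytes)
for year, factor in growth.items():
    print(f"{year}: x{factor:.2f}")

# Compare the last two charted years with the charted years before them.
last_two = exabytes[2009] + exabytes[2010]
earlier = sum(v for y, v in exabytes.items() if y < 2009)
```

Each year comes out at a bit under 1.6 times the previous one, and within the chart the last two years together exceed the three years before them.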
Trend 2: Connectedness

Over time data has evolved to be more and more interlinked and connected. Hypertext has links, blogs have pingback, tagging groups all related data.

[Chart: information connectivity over time, 1990 to 2020, from web 1.0 through web 2.0 to “web 3.0”. In rising order: text documents, hypertext, RSS, blogs, wikis, user-generated content, tagging, folksonomies, RDF, ontologies, Giant Global Graph (GGG)]

Trend 3: Semi-structure

» Individualization of content
   • In the salary lists of the 1970s, all elements had exactly one job
   • In the salary lists of the 2000s, we need 5 job columns! Or 8? Or 15?

» All encompassing “entire world views”
   • Store more data about each entity

» Trend accelerated by the decentralization of content generation that is the hallmark of the age of participation (“web 2.0”)

Trend 4: Architecture

» 1980s: Mainframe applications (one application on one DB)

» 1990s: Database as integration hub (several applications sharing one DB)

» 2000s: (moving towards) Decoupled services with their own backend (each application on its own DB, the DBs together forming the data tier)

For years, we tried to squeeze data into a one-size-fits-all container.

Also, the cost perspective

NOSQL, the Data Liberation Front (or: Polyglot Persistency)

Cambrian Explosion

» N-O-SQL

The NOSQL footprint

[Diagram: data stores placed between four edges. Top: free-structured or sparse data. Bottom: referential integrity, typed data. Left: ACID, constraints, simple operational. Right: highly scalable and available (complexity). SQL sits in the lower part; NOSQL (MongoDB, CouchDB, neo4j, Cassandra, HBase) in the upper right.]

Other axes of classification
» Data Model

» Consistency

» Atomic test-and-set

» Secondary indexes

» Manageability

» Latency vs. Durability

» Read vs. Write Performance

» Dynamic Scaling

» Auto failover

» Compression Support

» Range Scanning

» Failure Scenarios


Data Model

» Key/Value

» Document

» Row Stores with Column Families

» Graphs

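The four data models listed above can be contrasted by writing the same tiny dataset in each shape. This is a sketch with plain Python structures, not the API of any store; the "user follows user" data, the key names and the `followers_of` helper are all made up for the illustration.

```python
# One hypothetical "user follows user" dataset in each of the four models.

# Key/Value: opaque values, looked up by key only.
key_value = {
    "user:1": b'{"name": "Ann"}',
    "user:2": b'{"name": "Bob"}',
}

# Document: the store understands the nested structure of each value.
documents = {
    "user:1": {"name": "Ann", "follows": ["user:2"]},
    "user:2": {"name": "Bob", "follows": []},
}

# Row store with column families: row key -> family -> column -> value.
# Rows are sparse: a row only stores the columns it actually has.
rows = {
    "user:1": {"profile": {"name": "Ann"}, "follows": {"user:2": "1"}},
    "user:2": {"profile": {"name": "Bob"}},
}

# Graph: nodes plus explicit, typed edges between them.
nodes = {"user:1": {"name": "Ann"}, "user:2": {"name": "Bob"}}
edges = [("user:1", "FOLLOWS", "user:2")]

def followers_of(target):
    """Answer a graph-style question by walking the edge list."""
    return [src for src, kind, dst in edges if kind == "FOLLOWS" and dst == target]
```

The further down the list, the more of the data's structure the store itself can see and query; the further up, the more of that work is left to the application.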
Other axes of classification

» http://huanliu.wordpress.com/2011/01/21/dimensions-to-use-to-compare-nosql-data-stores/

Hire a good consultant. (or become one, like Xebia, SFEIR, Cloudera ...)

Data Processing

Map + Reduce

Hadoop: HDFS + MapReduce

» single filesystem + single execution-space

Processing large datasets with MR

» Benefit from parallelisation

» Less modelling upfront (ad-hoc processing)

» Compartmentalized approach reduces operational risks (aka robustness)

» AsterData et al. have SQL/MR hybrids for huge-scale BI

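The Map + Reduce idea can be shown with the usual word-count example. This is a single-process sketch in plain Python, not Hadoop code: the function names and the explicit `shuffle` step are choices made for the illustration, and nothing here is distributed.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

counts = word_count(["big data", "Big Data new databases"])
```

Each `map_phase` call looks at one line only and each `reduce_phase` call at one key only, which is what lets a framework spread both phases over many machines.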
LILY

Cloud-scale content storage & search

Lily

» provides scalable storage

» and scalable search

» with a fault-tolerant, distributed architecture

» automated index maintenance

» versioning, rich data types, Java+REST API

» based on HBase (NOSQL) and SOLR (Lucene)


Choosing a NoSQL store for Lily: step I
» automatic scaling to large data sets

» fault-tolerance

» flexible datamodel with sparse data

» commodity hardware

» efficient random access

» community-based open source

» Java if possible

Choosing a NoSQL store for Lily: step II

» need for consistency

» atomic single-row updates

» M/R for index regeneration

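What an atomic single-row update buys you can be sketched with a test-and-set operation: write only if the row still holds the value you read. The class below is a toy in-memory stand-in, not the HBase API; `RowStore`, `check_and_put` and the row contents are hypothetical names for the illustration.

```python
import threading

class RowStore:
    """Toy in-memory table with an atomic check-and-put on a single row."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def get(self, row_key):
        with self._lock:
            return self._rows.get(row_key)

    def check_and_put(self, row_key, expected, new_value):
        """Write new_value only if the row still holds `expected`.

        Returns True when the write happened, False when the row had changed.
        """
        with self._lock:
            if self._rows.get(row_key) != expected:
                return False
            self._rows[row_key] = new_value
            return True

store = RowStore()
first = store.check_and_put("doc:1", None, {"version": 1})   # row absent: succeeds
stale = store.check_and_put("doc:1", None, {"version": 1})   # row now exists: refused
```

A writer whose update is refused re-reads the row and retries, so two concurrent writers cannot silently overwrite each other.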
Choosing a NoSQL store for Lily: step III

HBase

» datamodel with column families and cell versioning

» ordered tables with range scans

» HDFS for blob storage

» Apache

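The first two properties above (column families with cell versioning, ordered tables with range scans) can be sketched with a small in-memory table. This is a toy model of the idea, not HBase code: `SortedTable`, its method names and the row keys are assumptions for the example.

```python
import bisect

class SortedTable:
    """Toy table: row keys kept in sorted order, cells keep every version."""

    def __init__(self):
        self._keys = []   # sorted row keys
        self._rows = {}   # row key -> {(family, column): [(timestamp, value), ...]}

    def put(self, row_key, family, column, value, timestamp):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        versions = self._rows[row_key].setdefault((family, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest first

    def get(self, row_key, family, column, max_versions=1):
        """Return up to max_versions (timestamp, value) pairs, newest first."""
        versions = self._rows.get(row_key, {}).get((family, column), [])
        return versions[:max_versions]

    def scan(self, start_key, stop_key):
        """Return the row keys in [start_key, stop_key), in sorted order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return self._keys[lo:hi]

table = SortedTable()
table.put("doc:002", "fields", "title", "draft", timestamp=1)
table.put("doc:002", "fields", "title", "final", timestamp=2)
table.put("doc:001", "fields", "title", "intro", timestamp=1)
table.put("img:001", "fields", "title", "logo", timestamp=1)
```

Because rows are ordered by key, a scan over a key prefix such as `doc:` touches only neighbouring rows, which is why the design of the row key matters so much.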
Lily

» scales to infinity, and beyond

» open source
 » Apache license (no strings attached)
 » Java and REST API

» www.lilyproject.org

» subscription- and partnership-based business model

Lily simplified architecture

[Diagram: Lily clients connect to Lily store nodes, with distributed process coordination and configuration (ZooKeeper). Layers shown: Lily Store Server (query, update, indexer, WAL, MQ, M/R); HBase Region Server (documents, 2ary indexes, WAL / MQ); Hadoop DFS; REST; SOLR (inverted index, index replicas).]

Key lessons learned

» unlearning normalization is very difficult

» integrity checking in code = not so bad

» doing joins in code can be very liberating

» importance of keyspace design
 » secondary indexing

» data de-normalization = size! (x3)

» schema vs. code flexibility?

» distribution is everywhere and you shouldn’t forget about it

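"Doing joins in code" and "secondary indexing" from the list above can be sketched together. This is an illustration with plain dictionaries, not Lily's implementation: the records, field names and helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical records, each table keyed by its primary key.
authors = {"a1": {"name": "Ann"}, "a2": {"name": "Bob"}}
articles = {
    "p1": {"title": "NOSQL intro", "author_id": "a1"},
    "p2": {"title": "Big Data", "author_id": "a1"},
    "p3": {"title": "Lily", "author_id": "a2"},
}

def build_secondary_index(records, field):
    """Secondary index kept by the application: field value -> primary keys."""
    index = defaultdict(list)
    for key, record in records.items():
        index[record[field]].append(key)
    return index

def articles_with_author(index, author_id):
    """A join done in code: find keys via the index, then fetch both sides."""
    author = authors[author_id]
    return [
        {"title": articles[key]["title"], "author": author["name"]}
        for key in index.get(author_id, [])
    ]

by_author = build_secondary_index(articles, "author_id")
joined = articles_with_author(by_author, "a1")
```

The index is extra data that the application has to keep in step with the primary records, which is part of why de-normalization costs size.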
Pssst. :-)
If you absolutely, positively want to see a demo, go check http://outerthought.blip.tv/

Reading material

» Amazon Dynamo, Google BigTable, CAP

» http://nosql.mypopescu.com/

» http://nosql-database.org/

» http://twitter.com/nosqlupdate

» http://highscalability.com/



We’re growing
We’re hiring

Thank you!
for your attention
for your questions
» stevenn@outerthought.org
» @stevenn

NoSQL intro for YaJUG / NoSQL UG Luxembourg

  • 1. NOSQL BIG DATA NEW DATABASES Stuff. IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 2. Hey! IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 3. I won’t do a demo. IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 4. I’m @stevenn or stevenn@outerthought.org IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 5. Houston, we have a problem. IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 6. We’re drowning. IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 7. Drowning in a Sea of Data. IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 8. Mountains of Metadata. IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 9. The firehose of UGC. IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 10. Still, we can’t make much sense of it. IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 11. ... and we throw a lot of it away. IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 12. We regard content as cost. IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 13. But data is an opportunity. IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 14. Think about it. IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 15. advertisements IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 15
  • 16. recommendations IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 16
  • 17. anything that sells IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 17
  • 18. profile harvesting IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 18
  • 19. The future is for datanerds. IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 20. Nerdy enough? IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 20
  • 22. This is what Big Data is about: new insights, new business.
  • 23. But first, some history.
  • 24. How did NOSQL happen? 1. standardization 2. simplification [Diagram labels: hierarchical databases, IMS, XMLDB, RDBMS, OODBMS]
  • 25. How did NOSQL happen? 3. pain 4. rethinking the problem [Diagram labels: RDBMS, NOSQL, caching, denormalisation, sharding, replication, ...]
  • 26. Numbers of scale http://qos.doubleclick.net/counters/
  • 27. Numbers of scale » Twitter does 12 M tweet displays » ... per second.
  • 28. Types of scaling » scaling for usage: volume of users, volume of data » scaling types of ops: concurrent read, concurrent write [Keywords: availability, partitioning, replication, consistency, distributed systems]
  • 29. ... and distributed systems are HARD.
  • 30. 8 fallacies of distributed computing (Peter Deutsch and James Gosling) » The network is reliable. » Latency is zero. » Bandwidth is infinite. » The network is secure. » Topology doesn't change. » There is one administrator. » Transport cost is zero. » The network is homogeneous.
  • 31. Data.
  • 32. Trend 1: Data size » Each year more and more digital data is created. Over two years we create more digital data than all the data created in history before that. [Chart: exabytes (10^18 bytes) of data stored per year: 2006: 161, 2007: 253, 2008: 397, 2009: 623, 2010: 988. Data source: IDC 2007]
  • 33. Trend 2: Connectedness » Over time data has evolved to be more and more interlinked and connected. Hypertext has links, blogs have pingback, tagging groups all related data. [Chart: information connectivity over time, 1990 to 2020 (web 1.0, web 2.0, “web 3.0”): text documents, hypertext, RSS, blogs, user-generated content, wikis, tagging, folksonomies, RDF, ontologies, Giant Global Graph (GGG)]
  • 34. Trend 3: Semi-structure » Individualization of content: in the salary lists of the 1970s, all elements had exactly one job; in the salary lists of the 2000s, we need 5 job columns! Or 8? Or 15? » All encompassing “entire world views”: store more data about each entity » Trend accelerated by the decentralization of content generation that is the hallmark of the age of participation (“web 2.0”)
  • 35. Trend 4: Architecture » 1980s: Mainframe applications [Diagram: one Application, one DB]
  • 36. Trend 4: Architecture » 1990s: Database as integration hub [Diagram: three Applications, one shared DB]
  • 37. Trend 4: Architecture » 2000s: (moving towards) Decoupled services with their own backend [Diagram: three Applications, each with its own DB]
  • 38. Trend 4: Architecture » 2000s: (moving towards) Decoupled services with their own backend [Diagram: three Applications, each with its own DB, labelled DATA TIER]
  • 39. For years, we tried to squeeze data into a one-size-fits-all container.
  • 40. Also, the cost perspective
  • 41. NOSQL, the Data Liberation Front (or: Polyglot Persistency)
  • 42. Cambrian Explosion
  • 43. Cambrian Explosion N-O-SQL
  • 45. The NOSQL footprint [Diagram: NOSQL stores (MongoDB, CouchDB, neo4j, Cassandra, HBase) positioned against these labels: free-structured or sparse data; highly scalable and available; simple operational constraints (complexity); ACID, SQL, referential integrity, typed data]
  • 46. Other axes of classification » Data Model » Consistency » Atomic test-and-set » Secondary indexes » Manageability » Latency vs. Durability » Read vs. Write Performance » Dynamic Scaling » Auto failover » Compression Support » Range Scanning » Failure Scenarios
  • 47. Data Model » Key/Value » Document » Row Stores with Column Families » Graphs
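The four data models on slide 47 can be pictured with plain Python structures. This is only an illustrative sketch: the keys, field names, and the `neighbours` helper are made up for the example, and nothing here is the API of any real store.

```python
# Toy illustration only: plain Python structures standing in for the four
# data models named on the slide. None of this is a real store's API.

# Key/Value: an opaque value looked up by key.
key_value = {"user:42": b'{"name": "Ann"}'}

# Document: the store understands the nested structure of each value.
documents = {
    "user:42": {"name": "Ann", "jobs": ["editor", "speaker"]},
}

# Row store with column families: row key -> family -> qualifier -> value.
# Rows are sparse: each row carries only the columns it actually has.
rows = {
    "user:42": {"info": {"name": "Ann"}, "jobs": {"1": "editor", "2": "speaker"}},
    "user:43": {"info": {"name": "Bob"}},
}

# Graph: nodes plus labelled edges between them.
nodes = {"ann", "bob"}
edges = [("ann", "KNOWS", "bob")]


def neighbours(node, label):
    """Return the nodes reachable from `node` over edges with `label`."""
    return [dst for src, lbl, dst in edges if src == node and lbl == label]
```

The sparse `rows` example echoes the semi-structure trend from slide 34: the second row simply has no `jobs` family instead of a set of empty job columns.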
  • 48. Other axes of classification » Data Model » Consistency » Atomic test-and-set » Secondary indexes » Manageability » Latency vs. Durability » Read vs. Write Performance » Dynamic Scaling » Auto failover » Compression Support » Range Scanning » Failure Scenarios (source: http://huanliu.wordpress.com/2011/01/21/dimensions-to-use-to-compare-nosql-data-stores/)
  • 49. Hire a good consultant. (or become one, like Xebia, SFEIR, Cloudera ...)
  • 50. Data Processing
  • 53. Map + Reduce
  • 54. Hadoop: HDFS + MapReduce » single filesystem + single execution-space
  • 55. Processing large datasets with MR » Benefit from parallelisation » Less modelling upfront (ad-hoc processing) » Compartmentalized approach reduces operational risks (aka robustness) » AsterData et al. have SQL/MR hybrids for huge-scale BI
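As a rough sketch of the Map + Reduce idea from slides 53 to 55, here is a word count in plain Python. It runs in a single process and does not use Hadoop; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are invented for this illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line could be mapped on a different machine; here it is a loop.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because `map_phase` looks at one line at a time and `reduce_phase` at one key at a time, both steps can be spread over many machines, which is where the parallelisation benefit on slide 55 comes from.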
  • 56. LILY
  • 57. Cloud-scale content storage & search
  • 58. LILY +
  • 59. Lily » provides scalable storage » and scalable search » with a fault-tolerant, distributed architecture » automated index maintenance » versioning, rich data types, Java+REST API » based on HBase (NOSQL) and SOLR (Lucene)
  • 60. Choosing a NoSQL store for Lily: step I » automatic scaling to large data sets » fault-tolerance » flexible datamodel with sparse data » commodity hardware » efficient random access » community-based open source » Java if possible
  • 61. Choosing a NoSQL store for Lily: step II » need for consistency » atomic single-row updates » M/R for index regeneration
  • 62. Choosing a NoSQL store for Lily: step III: HBase » datamodel with column families and cell versioning » ordered tables with range scans » HDFS for blob storage » Apache
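To make the step III criteria concrete, the following toy class mimics three of the properties listed on slide 62: column families, cell versioning, and ordered tables with range scans. It is an in-memory sketch under simplifying assumptions, not the HBase client API; `ToyTable` and its methods are hypothetical names.

```python
import bisect


class ToyTable:
    """In-memory stand-in for an ordered, versioned, column-family table."""

    def __init__(self):
        self._keys = []    # row keys, kept sorted to allow range scans
        self._cells = {}   # row key -> {(family, qualifier): [(ts, value), ...]}

    def put(self, row, family, qualifier, value, ts):
        if row not in self._cells:
            bisect.insort(self._keys, row)
            self._cells[row] = {}
        versions = self._cells[row].setdefault((family, qualifier), [])
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)  # newest version first

    def get(self, row, family, qualifier, max_ts=None):
        """Return the newest value at or before max_ts, or None."""
        for ts, value in self._cells.get(row, {}).get((family, qualifier), []):
            if max_ts is None or ts <= max_ts:
                return value
        return None

    def scan(self, start, stop):
        """Return row keys in [start, stop), in sorted order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return self._keys[lo:hi]
```

Since rows are sorted by key, a call such as `scan("doc:", "doc;")` returns every row whose key starts with `doc:` without touching the rest of the table, which is why row key design matters so much in this kind of store.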
  • 63. Lily » scales to infinity, and beyond » open source » Apache license (no strings attached) » Java and REST API » www.lilyproject.org » subscription- and partnership-based business model
  • 64. Lily simplified architecture [Diagram; recoverable labels: distributed process coordination and configuration (ZooKeeper); query, update, indexer; Lily client, Lily server, store nodes; WAL / MQ; M/R client; REST; HBase Region Server (documents, secondary indexes); Hadoop DFS (replicas); SOLR (inverted index, index replicas)]
  • 65. Key lessons learned » unlearning normalization is very difficult » integrity checking in code = not so bad » doing joins in code can be very liberating » importance of keyspace design » secondary indexing » data de-normalization = size! (x3) » schema vs. code flexibility? » distribution is everywhere and you shouldn’t forget about it
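Three of the lessons on slide 65 (integrity checking in code, joins in code, secondary indexing) can be sketched together in a few lines. This is a hypothetical in-memory example, not Lily's implementation: the `authors`, `articles`, and `articles_by_author` dictionaries and both functions are invented for illustration.

```python
# Hypothetical in-memory "tables"; keys and field names are made up.
authors = {"a1": {"name": "Ann"}, "a2": {"name": "Bob"}}
articles = {}
articles_by_author = {}   # secondary index: author id -> set of article ids


def put_article(article_id, title, author_id):
    """Write the article and keep the secondary index in step, in code."""
    if author_id not in authors:          # integrity check done by the app
        raise KeyError(f"unknown author {author_id}")
    old = articles.get(article_id)
    if old is not None:                   # drop the stale index entry
        articles_by_author[old["author_id"]].discard(article_id)
    articles[article_id] = {"title": title, "author_id": author_id}
    articles_by_author.setdefault(author_id, set()).add(article_id)


def titles_with_author(author_id):
    """Join in code: index lookup, then one fetch per matching article."""
    name = authors[author_id]["name"]
    ids = sorted(articles_by_author.get(author_id, ()))
    return [(articles[i]["title"], name) for i in ids]
```

The index is a second copy of part of the data, which is one way the size growth noted on the slide arises; in a distributed store the two writes in `put_article` would also need some way to stay consistent when one of them fails.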
  • 66. Pssst. :-) If you absolutely, positively want to see a demo, go check http://outerthought.blip.tv/
  • 67. Reading material » Amazon Dynamo, Google BigTable, CAP » http://nosql.mypopescu.com/ » http://nosql-database.org/ » http://twitter.com/nosqlupdate » http://highscalability.com/
  • 68. We’re growing. We’re hiring.
  • 69. Thank you! » for your attention » for your questions » stevenn@outerthought.org » @stevenn