SlideShare a Scribd company logo
1 of 51
Download to read offline
Learning
Lessons
Building a content repository
on top of NoSQL Technologies


      IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
hello,
I’m @stevenn from @outerthought




  IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   2
This story is about




     IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   3
Complexity
complexity




                            software architecture
                                                                      3.0




                                             2.0
             1.0


                                                                                         age




             IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org         4
Complexity
 complexity
user interest




                                                                         3.0




                                                2.0
            1.0


                                                                                            age




                IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org         5
We Prefer Sophistication




» the challenge for us was to scale ...
 without dropping features




       IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   6
The typical CMS ‘architecture’




  database (+opt. filesystem) (+ opt. full-text indexes)


         IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   7
The typical CMS ‘architecture’




  application                                  cache

  database (+opt. filesystem) (+ opt. full-text indexes)


         IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   8
The typical CMS ‘architecture’




  more cache

  application                                  cache

  database (+opt. filesystem) (+ opt. full-text indexes)


         IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   9
The typical CMS ‘architecture’


  even more cache

  more cache

  application                                  cache

  database (+opt. filesystem) (+ opt. full-text indexes)


         IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   10
The typical CMS ‘architecture’

  client

  even more cache

  more cache

  application                                  cache

  database (+opt. filesystem) (+ opt. full-text indexes)


         IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   11
The typical CMS ‘architecture’

  client (+cache)

  even more cache

  more cache

  application                                  cache

  database (+opt. filesystem) (+ opt. full-text indexes)


         IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   12
What we found hard to scale
» access control

» facet browsing

» all the nifty stuff people were using our
 software for


» ... anything that required random access
 to in-memory-cache data for computations

       IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   13
Beyond the ‘scaling’ problem
» three-prong data layer



                                                                      fs




 » result set merging (between MySQL & Lucene)
   » happened in appcode/memory

 » ‘transactions’, set operations = hard


       IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   14
Beyond the three-prong problem




» errrr..... “Failover” ..... ?




         IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   15
If we would be able to add more nodes ...


                                                   scalability


» True Distribution                                                  availability


                                                 performance

                   ... in the line of fire

       IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org    16
Solution 1




» do MORE inside the database




      IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   17
Functional




    IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   18
Functional




    IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   19
Infrastructural



                                                                 e !
                                                         a s
                                             ta b
                                    d a
                     o re
              m


     IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   20
e !
                                                            a s
                                               tab
                                d            a
                           o re
             n m
  eve



IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   21
s !
                                                u s se
                                             e b
                                         sa g
                           m es
                 d d
    ’s a
l et


  IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org         22
f f!
                                                                 ! s tu
                                                       B C
                  J D
                r
               e !
             ov 0t
            S 0
         J M w
  M   I!
R


  IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org    23
http://bigdatamatters.com/bigdatamatters/2010/04/high-availability-with-oracle.html



             IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   24
Business Development 101
user interest




                                                                                            budget




                IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org            25
Solution II
sophistication
ability to cope




                                                                       3.0
           mysql                                                                          nosql?

                                              2.0
              1.0




              IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org            26
Enter The Cambrian Explosion

                                       Cassandra




                                  NoSQL
                                                                            neo4j




    IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org       27
Requirements, phase I
» automatic scaling to large data sets

» fault-tolerance: replication, automatic handling of failing nodes

» a flexible data model supporting sparse data

» runs on commodity hardware

» efficient random access to data

» open source, ability to participate in the development thus
  drive the direction of the project
» some preference for a Java-based solution


          IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   28
Requirements, phase II

» After careful consideration, we realized the
 important choices were also:
 » consistency: no chance of having two conflicting
   versions of a row
 » atomic updates of a single row, single-row
   transactions
 » bonus points for MapReduce integration
   » e.g. full-text index rebuilding



        IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   29
That brought us to HBase, which bought us:
» a datamodel where you can have column
 families which keep all versions and others
 which do not, which fits very well on our
 CMS document model
» ordered tables with the ability to do range
 scans on them, which allows to build
 scalable indexes on top of it
» HDFS, a convenient place to store large blobs

» Apache license and community, a familiar
 environment for us

        IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   30
» OK, so now we had a data store !




        IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   31
» However, content repository =
 store + search                          !
                                    u ch
                                o



       IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   32
a s
                                                        w
                                                      at !
                                                    h sy
                                                  T a         .. .)
                                                      e ver
                                                       we
                                                    h o
                                                  (
IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   33
Search ponderings

» CMS = two types of search
 » structured search
  » numbers, strings
  » based on logic         (SQL, anyone?)

 » information retrieval (or: full-text search)
  » text
  » based on statistics



       IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   34
Search ponderings




» All of that, at scale




       IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   35
Structured Search
» HBase Indexing Library
 » idea from Google App Engine datastore indexes
 » http://code.google.com/appengine/articles/
  index_building.html

    rowkey             col              col                             rowkey          col



                                                          order
      A               val3             foo6                              val2-B

      B               val2             foo7                              val3-A

                 content table                                              index table A


          IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org           36
Full-text / IR search


» Lucene?
 » no sharding (for scale)
 » no replication (for availability)
 » batched index updates (not real-time)




        IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   37
Beyond Lucene
» Katta
  » scalable architecture, however only search, no indexing

» Elastic Search
  » very young (sorry)

» hbasene et al.
  » stores inverted index in HBase, might not scale all features

» SOLR
  » widely used, schema, facets, query syntax, cloud branch



More info: http://lilycms.org/lily/prerelease/technology.html

          IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   38
?
                             +
                         =
                                                r ?
                                      ! O
                     a sy
                    E
IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   39
Remember distribution ?
Remember secondary indexes ?




 ➙ Need for reliable queuing

    IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   40
Connecting things
» we needed a reliable bridge between our
 main storage (HBase) and our index/search
 server(s) (SOLR)
 » indexing, reindexing, mass reindexing (M/R)

» we need a reliable method of updating
 HBase secondary indexes
» all of that eventually to run distributed

» distribution means coping with failure

       IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   41
Solution


» ACMEMessageQueue ? Bzzzzzt.
 We wanted fault-safe HBase persistence for
 the queues.
 Also for ease of administration.
» ➙ WAL  & Queue implemented on top of
 HBase tables


      IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   42
WAL / Queue
» WAL                                                 » Queue
 » guaranteed execution                                  » triggering of async
   of synchronous actions                                    actions
 » call doesn’t return before                            » e.g. (re)index (updated)
   secondary action finishes                                  record with SOLR back-end
 » e.g. update secondary actions                         » size depends on speed of
 » if all goes well,                                         back-end process
   size = #concurrent ops
 » will be useful/made available
   outside of Lily context as
   well!


             IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   43
The Sum
» Lily model (records & fields)

» mapped onto HBase (=storage)

» indexed and searchable through
 SOLR
» using a WAL/Queue mechanism
 implemented in HBase
» runtime based on Kauri

» with client/server comms via
 Avro
        IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   44
Architecture
IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   45
Architecture
IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   46
Roadmap

» Today = release of learning material
 (architecture, model, API, Javadoc)
 ➥ www.lilycms.org
 ➥ bit.ly/lilyprerelease

» Mid July = ‘proof of architecture’ release                                                e re!
                                                                                          th
                                                                                    early
                                                                                   N
» from there on, ca. 3-monthly releases
 leading up to Lily 1.0


       IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org               47
bit.ly/lilyprerelease




       IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   48
License




» Apache




      IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   49
Business model
» Consulting, mentoring, turn-key projects

» Strong focus on partner relations
 » targeting vertical markets
 » geographic coverage
 » SaaS offerings

» Markets: media, finance, insurance, govt,
 heritage ... LOTS of semi-structured data
» Not: OLAP

       IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   50
More ?



» @outerthought

» www.lilycms.org/lily/prerelease.html




       IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org   51

More Related Content

Similar to Learning Lessons: Building a CMS on top of NoSQL technologies

Lily @ Work Webinar
Lily @ Work WebinarLily @ Work Webinar
Lily @ Work Webinar
NGDATA
 
NoSQL intro for YaJUG / NoSQL UG Luxembourg
NoSQL intro for YaJUG / NoSQL UG LuxembourgNoSQL intro for YaJUG / NoSQL UG Luxembourg
NoSQL intro for YaJUG / NoSQL UG Luxembourg
NGDATA
 
From Content Storage to Scaling Smart Data
From Content Storage to Scaling Smart DataFrom Content Storage to Scaling Smart Data
From Content Storage to Scaling Smart Data
NGDATA
 
The world is the computer and the programmer is you
The world is the computer and the programmer is youThe world is the computer and the programmer is you
The world is the computer and the programmer is you
Davide Carboni
 
มโนทัศน์เทคโนโลยีการศึกษา
มโนทัศน์เทคโนโลยีการศึกษามโนทัศน์เทคโนโลยีการศึกษา
มโนทัศน์เทคโนโลยีการศึกษา
onumaphimviset
 

Similar to Learning Lessons: Building a CMS on top of NoSQL technologies (18)

Hadoop World 2011: Lily: Smart Data at Scale, Made Easy
Hadoop World 2011: Lily: Smart Data at Scale, Made EasyHadoop World 2011: Lily: Smart Data at Scale, Made Easy
Hadoop World 2011: Lily: Smart Data at Scale, Made Easy
 
Lily @ Work Webinar
Lily @ Work WebinarLily @ Work Webinar
Lily @ Work Webinar
 
NoSQL intro for YaJUG / NoSQL UG Luxembourg
NoSQL intro for YaJUG / NoSQL UG LuxembourgNoSQL intro for YaJUG / NoSQL UG Luxembourg
NoSQL intro for YaJUG / NoSQL UG Luxembourg
 
Sirris innovate2011 - Lily, Smart Data at scale made easy, Steven Noels, Oute...
Sirris innovate2011 - Lily, Smart Data at scale made easy, Steven Noels, Oute...Sirris innovate2011 - Lily, Smart Data at scale made easy, Steven Noels, Oute...
Sirris innovate2011 - Lily, Smart Data at scale made easy, Steven Noels, Oute...
 
Devoxx 2010 | Tools In Action : Kauri and Lily
Devoxx 2010 | Tools In Action : Kauri and LilyDevoxx 2010 | Tools In Action : Kauri and Lily
Devoxx 2010 | Tools In Action : Kauri and Lily
 
The Lily RowLog library
The Lily RowLog libraryThe Lily RowLog library
The Lily RowLog library
 
Devoxx 2010 | LAB : ReST in Java
Devoxx 2010 | LAB : ReST in JavaDevoxx 2010 | LAB : ReST in Java
Devoxx 2010 | LAB : ReST in Java
 
From Content Storage to Scaling Smart Data
From Content Storage to Scaling Smart DataFrom Content Storage to Scaling Smart Data
From Content Storage to Scaling Smart Data
 
Lily at HUG UK
Lily at HUG UKLily at HUG UK
Lily at HUG UK
 
Huguk lily
Huguk lilyHuguk lily
Huguk lily
 
Binary Analysis - Luxembourg
Binary Analysis - LuxembourgBinary Analysis - Luxembourg
Binary Analysis - Luxembourg
 
Optimisation of Industrial Processes SimQRi - A Query-oriented Tool for the E...
Optimisation of Industrial Processes SimQRi - A Query-oriented Tool for the E...Optimisation of Industrial Processes SimQRi - A Query-oriented Tool for the E...
Optimisation of Industrial Processes SimQRi - A Query-oriented Tool for the E...
 
The world is the computer and the programmer is you
The world is the computer and the programmer is youThe world is the computer and the programmer is you
The world is the computer and the programmer is you
 
Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW
Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRWKubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW
Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW
 
MongoDB and the Internet of Things
MongoDB and the Internet of ThingsMongoDB and the Internet of Things
MongoDB and the Internet of Things
 
มโนทัศน์เทคโนโลยีการศึกษา
มโนทัศน์เทคโนโลยีการศึกษามโนทัศน์เทคโนโลยีการศึกษา
มโนทัศน์เทคโนโลยีการศึกษา
 
Revitalizing Aging Architectures with Microservices
Revitalizing Aging Architectures with MicroservicesRevitalizing Aging Architectures with Microservices
Revitalizing Aging Architectures with Microservices
 
2006 — Technology Adoption: emerging technologies and their likely impact
2006 — Technology Adoption: emerging technologies and their likely impact2006 — Technology Adoption: emerging technologies and their likely impact
2006 — Technology Adoption: emerging technologies and their likely impact
 

More from NGDATA

More from NGDATA (6)

NGDATA Corporate Presentation
NGDATA Corporate PresentationNGDATA Corporate Presentation
NGDATA Corporate Presentation
 
20110514 appsforghent
20110514 appsforghent20110514 appsforghent
20110514 appsforghent
 
Big Data
Big DataBig Data
Big Data
 
Devoxx 2010 | Tools In Action : Kauri and Lily
Devoxx 2010 | Tools In Action : Kauri and LilyDevoxx 2010 | Tools In Action : Kauri and Lily
Devoxx 2010 | Tools In Action : Kauri and Lily
 
NoSQL BOF at Devoxx
NoSQL BOF at DevoxxNoSQL BOF at Devoxx
NoSQL BOF at Devoxx
 
NoSQL "Tools in Action" talk at Devoxx
NoSQL "Tools in Action" talk at DevoxxNoSQL "Tools in Action" talk at Devoxx
NoSQL "Tools in Action" talk at Devoxx
 

Recently uploaded

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Recently uploaded (20)

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Learning Lessons: Building a CMS on top of NoSQL technologies

  • 1. Learning Lessons Building a content repository on top of NoSQL Technologies IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 2. hello, I’m @stevenn from @outerthought IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 2
  • 3. This story is about IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 3
  • 4. Complexity complexity software architecture 3.0 2.0 1.0 age IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 4
  • 5. Complexity complexity user interest 3.0 2.0 1.0 age IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 5
  • 6. We Prefer Sophistication » the challenge for us was to scale ... without dropping features IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 6
  • 7. The typical CMS ‘architecture’ database (+opt. filesystem) (+ opt. full-text indexes) IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 7
  • 8. The typical CMS ‘architecture’ application cache database (+opt. filesystem) (+ opt. full-text indexes) IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 8
  • 9. The typical CMS ‘architecture’ more cache application cache database (+opt. filesystem) (+ opt. full-text indexes) IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 9
  • 10. The typical CMS ‘architecture’ even more cache more cache application cache database (+opt. filesystem) (+ opt. full-text indexes) IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 10
  • 11. The typical CMS ‘architecture’ client even more cache more cache application cache database (+opt. filesystem) (+ opt. full-text indexes) IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 11
  • 12. The typical CMS ‘architecture’ client (+cache) even more cache more cache application cache database (+opt. filesystem) (+ opt. full-text indexes) IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 12
  • 13. What we found hard to scale » access control » facet browsing » all the nifty stuff people were using our software for » ... anything that required random access to in-memory-cache data for computations IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 13
  • 14. Beyond the ‘scaling’ problem » three-prong data layer fs » result set merging (between MySQL & Lucene) » happened in appcode/memory » ‘transactions’, set operations = hard IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 14
  • 15. Beyond the three-prong problem » errrr..... “Failover” ..... ? IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 15
  • 16. If we would be able to add more nodes ... scalability » True Distribution availability performance ... in the line of fire IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 16
  • 17. Solution 1 » do MORE inside the database IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 17
  • 18. Functional IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 18
  • 19. Functional IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 19
  • 20. Infrastructural e ! a s ta b d a o re m IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 20
  • 21. e ! a s tab d a o re n m eve IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 21
  • 22. s ! u s se e b sa g m es d d ’s a l et IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 22
  • 23. f f! ! s tu B C J D r e ! ov 0t S 0 J M w M I! R IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 23
  • 24. http://bigdatamatters.com/bigdatamatters/2010/04/high-availability-with-oracle.html IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 24
  • 25. Business Development 101 user interest budget IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 25
  • 26. Solution II sophistication ability to cope 3.0 mysql nosql? 2.0 1.0 IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 26
  • 27. Enter The Cambrian Explosion Cassandra NoSQL neo4j IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 27
  • 28. Requirements, phase I » automatic scaling to large data sets » fault-tolerance: replication, automatic handling of failing nodes » a flexible data model supporting sparse data » runs on commodity hardware » efficient random access to data » open source, ability to participate in the development thus drive the direction of the project » some preference for a Java-based solution IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 28
  • 29. Requirements, phase II » After careful consideration, we realized the important choices were also: » consistency: no chance of having two conflicting versions of a row » atomic updates of a single row, single-row transactions » bonus points for MapReduce integration » e.g. full-text index rebuilding IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 29
  • 30. That brought us to HBase, which bought us: » a datamodel where you can have column families which keep all versions and others which do not, which fits very well on our CMS document model » ordered tables with the ability to do range scans on them, which allows to build scalable indexes on top of it » HDFS, a convenient place to store large blobs » Apache license and community, a familiar environment for us IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 30
  • 31. » OK, so now we had a data store ! IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 31
  • 32. » However, content repository = store + search ! u ch o IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 32
  • 33. a s w at ! h sy T a .. .) e ver we h o ( IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 33
  • 34. Search ponderings » CMS = two types of search » structured search » numbers, strings » based on logic (SQL, anyone?) » information retrieval (or: full-text search) » text » based on statistics IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 34
  • 35. Search ponderings » All of that, at scale IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 35
  • 36. Structured Search » HBase Indexing Library » idea from Google App Engine datastore indexes » http://code.google.com/appengine/articles/ index_building.html rowkey col col rowkey col order A val3 foo6 val2-B B val2 foo7 val3-A content table index table A IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 36
  • 37. Full-text / IR search » Lucene? » no sharding (for scale) » no replication (for availability) » batched index updates (not real-time) IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 37
  • 38. Beyond Lucene » Katta » scalable architecture, however only search, no indexing » Elastic Search » very young (sorry) » hbasene et al. » stores inverted index in HBase, might not scale all features » SOLR » widely used, schema, facets, query syntax, cloud branch More info: http://lilycms.org/lily/prerelease/technology.html IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 38
  • 39. ? + = r ? ! O a sy E IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 39
  • 40. Remember distribution ? Remember secondary indexes ? ➙ Need for reliable queuing IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 40
  • 41. Connecting things » we needed a reliable bridge between our main storage (HBase) and our index/search server(s) (SOLR) » indexing, reindexing, mass reindexing (M/R) » we need a reliable method of updating HBase secondary indexes » all of that eventually to run distributed » distribution means coping with failure IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 41
  • 42. Solution » ACMEMessageQueue ? Bzzzzzt. We wanted fault-safe HBase persistence for the queues. Also for ease of administration. » ➙ WAL & Queue implemented on top of HBase tables IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 42
  • 43. WAL / Queue » WAL » Queue » guaranteed execution » triggering of async of synchronous actions actions » call doesn’t return before » e.g. (re)index (updated) secondary action finishes record with SOLR back-end » e.g. update secondary actions » size depends on speed of » if all goes well, back-end process size = #concurrent ops » will be useful/made available outside of Lily context as well! IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 43
  • 44. The Sum » Lily model (records & fields) » mapped onto HBase (=storage) » indexed and searchable through SOLR » using a WAL/Queue mechanism implemented in HBase » runtime based on Kauri » with client/server comms via Avro IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 44
  • 45. Architecture IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 45
  • 46. Architecture IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 46
  • 47. Roadmap » Today = release of learning material (architecture, model, API, Javadoc) ➥ www.lilycms.org ➥ bit.ly/lilyprerelease » Mid July = ‘proof of architecture’ release e re! th early N » from there on, ca. 3-monthly releases leading up to Lily 1.0 IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 47
  • 48. bit.ly/lilyprerelease IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 48
  • 49. License » Apache IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 49
  • 50. Business model » Consulting, mentoring, turn-key projects » Strong focus on partner relations » targeting vertical markets » geographic coverage » SaaS offerings » Markets: media, finance, insurance, govt, heritage ... LOTS of semi-structured data » Not: OLAP IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 50
  • 51. More ? » @outerthought » www.lilycms.org/lily/prerelease.html IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org 51