NoSQL - Life beyond the Outer Join
Glen Smith (glen@bytecode.com.au)
Objectives


   Survey the landscape of NoSQL offerings
   Learn some of the terminology
   Look at some of the Java offerings in the space
   Take away source to play with
   Be able to ask questions (but you may not get
    answers)
What is NoSQL?


   (N)ot (O)nly SQL, not “Anti SQL”
   Movement more than “one” technology
   Distributed Storage System
   Much weaker queries
   Scale across many machines
   Much larger data, much faster queries
Why NoSQL?


 Inspired by Distributed Data Storage problems
 Scale easily by adding servers
 Not suited to all problem types, but super-suited to
  certain large problem types
 High-write situations (eg activity tracking or timeline
  rendering for millions of users)
 A lot of relational uses are really dumbed down (eg
  fetch by PK with update)
What’s wrong with RDBMS?


 Nothing ;-)
 To scale RDBMS, your approach is typically:
   Shard your datasource
   Put in a bunch of read replicas
   Put memcached in front of those
 What could possibly go wrong? 
   Complex. Custom caching. Partitioning. Migrating
    shards. Tons of moving parts.
How can I live w/o ACID?


   Atomic (it happens or not, no partial completes)
   Consistent (DB internals, referential integrity, field validation)
   Isolated (can’t see or modify another transaction’s uncommitted data)
   Durable (written to disk/transaction log)

 But in a distributed db, life is not so simple...
The CAP theorem


In a distributed system, when you have state on more
  than one machine, pick any two:
 Consistency (easy in read-only states – copy!)
 Availability (can you get at your data? Is it up?)
 Partition Tolerance (3 machines on one net, 3 on the
  other, with a broken link. How do you take updates
  when you can’t keep everyone up to date? What if you
  don’t agree on what’s up?)
How do these NoSQL things work?


 Basically big distributed hashtables
 Push all logic into the write (update two lists – one for
  userId, one for email)
 Things don’t happen transactionally. These are two
  writes.
 There is no free lunch. The programmer is now
  handling consistency problems.
 You were thinking about query optimisation before,
  and now even more so.
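
A minimal sketch of "push all logic into the write", assuming a plain in-memory Map stands in for the distributed store. The key prefixes and the function names (saveUser, findById, findByEmail) are made up for illustration and are not taken from any particular product.

```javascript
// Hypothetical in-memory stand-in for a distributed key/value store.
const store = new Map();

// Write path: maintain one entry per lookup we want to serve later.
// These are two independent writes, not one transaction.
function saveUser(user) {
  store.set("user:id:" + user.id, user);          // write 1: lookup by userId
  store.set("user:email:" + user.email, user.id); // write 2: lookup by email
}

// Read path: every lookup is a plain key fetch, with no query planner involved.
function findById(id) {
  return store.get("user:id:" + id);
}

function findByEmail(email) {
  const id = store.get("user:email:" + email);
  return id === undefined ? undefined : findById(id);
}

saveUser({ id: 42, email: "glen@example.com", name: "Glen" });
console.log(findByEmail("glen@example.com").name);
```

If the process dies between write 1 and write 2, the email entry is simply missing, and it is the application's job to notice and repair that.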
How big are we talking?


   Digg – 3 TB
   Facebook Inbox – 50 TB
   eBay – 2 PB
   Think about Twitter’s issues... Billions of queries a
    second over terabytes of data.
The NoSQL Taxonomy


 Key-Value In-Memory stores (Memcached, Redis)
 Key-Value “Eventually Consistent” stores (“Dynamo
  Clones” like Cassandra, Voldemort, Riak)
 Document stores (CouchDB, MongoDB, JCR)
 Graph Databases (Neo4j)
 Tabular (“BigTable clones” like Hadoop/HBase)
Memcached


   Developed for the original LiveJournal site
   LRU, distributed hashtable
   Logic is in both client and server
   Used in Google App Engine, Facebook, Twitter
   Ehcache now has a similar service
   Good for things that outlive an app server
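
To make the "LRU" bullet concrete, here is a small sketch of least-recently-used eviction. It is an illustration only, not Memcached's actual implementation; the class name LruCache and the capacity of 2 are assumptions chosen for the demo.

```javascript
// Minimal LRU cache sketch. A JavaScript Map remembers insertion order,
// so the first key in the Map is always the least recently used one.
class LruCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.items = new Map();
  }

  get(key) {
    if (!this.items.has(key)) return undefined;
    const value = this.items.get(key);
    // Re-insert to mark the key as most recently used.
    this.items.delete(key);
    this.items.set(key, value);
    return value;
  }

  set(key, value) {
    if (this.items.has(key)) this.items.delete(key);
    this.items.set(key, value);
    if (this.items.size > this.capacity) {
      // Evict the least recently used entry (the oldest key in the Map).
      const oldest = this.items.keys().next().value;
      this.items.delete(oldest);
    }
  }
}

const cache = new LruCache(2);
cache.set("a", 1);
cache.set("b", 2);
cache.get("a");    // touch "a" so "b" becomes the eviction candidate
cache.set("c", 3); // over capacity: evicts "b"
```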
How does it work?


 Clients know how to:
     Send items to servers (consistent hashing)
     React when a server fails
     Fetch keys from servers
     “Weight” servers by capacity
 Servers know how to:
   Store items they receive
   Expire them from the cache
   No inter-server comms – servers are unaware of each other
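
The client-side logic above can be sketched as a weighted consistent-hash ring. This is a simplified illustration in the spirit of ketama-style clients, not the algorithm of any specific Memcached client; the server names, the FNV-1a hash and the 100 points per unit of weight are all assumptions.

```javascript
// FNV-1a, a simple non-cryptographic 32-bit string hash.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

class HashRing {
  // servers: { name: weight }. A heavier server gets more points on the ring.
  constructor(servers, pointsPerWeight = 100) {
    this.ring = [];
    for (const [name, weight] of Object.entries(servers)) {
      for (let i = 0; i < weight * pointsPerWeight; i++) {
        this.ring.push({ point: fnv1a(name + "#" + i), name });
      }
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  // Binary search for the first ring point at or after the key's hash,
  // wrapping around to the start of the ring if we run off the end.
  serverFor(key) {
    const target = fnv1a(key);
    let lo = 0;
    let hi = this.ring.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.ring[mid].point < target) lo = mid + 1;
      else hi = mid;
    }
    return this.ring[lo % this.ring.length].name;
  }
}

// "cache3" is weighted as having twice the capacity of the others.
const fullRing = new HashRing({ cache1: 1, cache2: 1, cache3: 2 });
// The same ring after "cache2" has failed and been dropped by the client.
const degradedRing = new HashRing({ cache1: 1, cache3: 2 });
console.log(fullRing.serverFor("user:42"), degradedRing.serverFor("user:42"));
```

The point of the ring is that dropping cache2 only remaps the keys that lived on cache2; keys on the surviving servers stay where they were, which a plain `hash % serverCount` scheme does not give you.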
Sample Code
Voldemort


   Less than Memcached, but also more!
   Not a cache, but a distributed key/value store
   Developed by LinkedIn
   Works on distributed hashmap w/failover
   Logic can be in client/server or just server
   Pluggable storage (MySQL, BDB, mock)
   Pluggable serialization (JSON, Google Protocol Buffers, etc)
“Relaxed” Consistency


 Eventual consistency – data will come into sync but
  not immediately on the write. In practice “pretty
  soon” is milliseconds later
 We are actually used to this – eg Google indexes
  update every so often.
 Guarantees to read your own writes (eg your profile
  on LinkedIn)
 Tunable: trade weaker consistency for better performance
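
One common way Dynamo-style stores expose this tuning is with three numbers: N replicas, W replicas that must accept a write, and R replicas consulted on a read. The sketch below is a toy model under that assumption; the function names and the "write to the first W, read from the last R" worst-case layout are invented for the demo and are not Voldemort's API.

```javascript
// Hypothetical model: each replica is a Map of key -> { version, value }.
function makeReplicas(n) {
  return Array.from({ length: n }, () => new Map());
}

// Write to the first `w` replicas only; the rest catch up "eventually".
function write(replicas, w, key, value, version) {
  for (let i = 0; i < w; i++) replicas[i].set(key, { version, value });
}

// Read from the last `r` replicas (the worst case for overlap with the
// write above) and keep the newest version seen.
function read(replicas, r, key) {
  let newest;
  for (let i = replicas.length - r; i < replicas.length; i++) {
    const found = replicas[i].get(key);
    if (found && (!newest || found.version > newest.version)) newest = found;
  }
  return newest ? newest.value : undefined;
}

// Background sync: copy the newest version of a key to every replica.
function sync(replicas, key) {
  let newest;
  for (const replica of replicas) {
    const found = replica.get(key);
    if (found && (!newest || found.version > newest.version)) newest = found;
  }
  if (newest) for (const replica of replicas) replica.set(key, newest);
}

// N = 3, W = 2, R = 2: R + W > N, so every read overlaps the latest write.
const strict = makeReplicas(3);
write(strict, 3, "profile", "old", 1);
write(strict, 2, "profile", "new", 2);
const strictRead = read(strict, 2, "profile");

// N = 3, W = 1, R = 1: faster, but a read can miss the latest write.
const relaxed = makeReplicas(3);
write(relaxed, 3, "profile", "old", 1);
write(relaxed, 1, "profile", "new", 2);
const staleRead = read(relaxed, 1, "profile");
sync(relaxed, "profile"); // replicas converge
const syncedRead = read(relaxed, 1, "profile");
```

Here strictRead sees the new value straight away, staleRead still sees the old one, and syncedRead sees the new value once the replicas have caught up.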
What’s attractive?


   Data is automatically replicated
   Partitioning ensures each server holds only a subset of the data
   Server failure is handled transparently
   Data is rebalanced when servers are added/removed
   Serialization is pluggable
   Apache License
Impressive Performance


 “We were able to move applications that needed to
  handle hundreds of millions of reads and writes per day
  from over 400ms to under 10ms while simultaneously
  increasing the amount of data we store.”
Performance Info




http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
Sample Script


 Starting the server (or deploy as a .war)
bin\voldemort-server.bat config\single_node_cluster
 Starting the console
bin\voldemort-shell.bat test tcp://localhost:6666
 Run some queries
put "hello" "world"
get "hello"
put "hello" "world 2.0"
delete "hello"
Sample Code
CouchDB


 Document-Oriented DB – No Schema
 Written in Erlang (!) by a Lotus Notes Dev (!!!)
 Everything is stored in JSON, RESTful API
 Clever replication concepts – works in disconnected
  settings
 Every write creates a new version (revision) of the document
 Map/Reduce baked in
 Apache License
What’s attractive?


 Schemaless operation – ad hoc data
 Incremental replication (great for disconnected
  settings)
 Great fault-tolerance (with versioned conflicts)
 Fast query with flexibility (MapReduce)
So what is this Map/Reduce thing?


  Popularized by Google’s MapReduce paper
  In CouchDB, map functions collect documents matching criteria
   and build a B-tree index
  Reduce functions operate on that B-tree
  Everything happens in parallel on many machines
  Example: distributed grep
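
The distributed grep example can be sketched on one machine as follows. The chunk layout, the file names and the helper names (mapGrep, reduceCounts) are assumptions for illustration; in a real cluster each chunk would be mapped on a different node.

```javascript
// Map step: runs independently on each chunk of input and emits one
// { key, value } pair per matching line.
function mapGrep(chunk, pattern) {
  return chunk.lines
    .filter((line) => pattern.test(line))
    .map((line) => ({ key: chunk.file, value: line }));
}

// Reduce step: group the emitted pairs by key and count matches per file.
function reduceCounts(pairs) {
  const counts = {};
  for (const { key } of pairs) counts[key] = (counts[key] || 0) + 1;
  return counts;
}

const chunks = [
  { file: "app.log", lines: ["ERROR disk full", "INFO started"] },
  { file: "db.log", lines: ["ERROR timeout", "ERROR retry failed", "INFO ok"] },
];

// Each chunk is mapped on its own; the results are merged before the reduce.
const emitted = chunks.flatMap((chunk) => mapGrep(chunk, /ERROR/));
const matchesPerFile = reduceCounts(emitted);
console.log(matchesPerFile);
```

Because mapGrep only looks at its own chunk, the map calls need no coordination, which is what lets the work spread across many machines.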
The Naked Couch


   http://127.0.0.1:5984/
   http://127.0.0.1:5984/_all_dbs
   http://127.0.0.1:5984/mydb (PUT)
   http://127.0.0.1:5984/_utils/ (Futon)
Mapping Couch with Ektorp


 You lose some of the joy of schema-less
 But you do get lots of boilerplate ;-)
 Oh, and strong typing.
Writing a Couch MapReduce


 You write a map function to extract data
 You emit key/value pairs (zero or more per document)

function(doc) {
  // Guard against documents that have no title before searching it.
  if (doc.title && doc.title.indexOf("Hi!") > -1) {
    emit(doc.title, doc);
  }
}
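
To see what such a map function produces without a running CouchDB, here is a tiny stand-in for the view engine. It is an assumption-laden sketch: runView is a made-up helper, emit is passed in as a second argument (in CouchDB it is a global), and rows are sorted with plain string comparison, which is simpler than CouchDB's real collation.

```javascript
// Call the map function once per document, collect whatever it emits,
// and return the rows sorted by key, roughly as a view would.
function runView(mapFn, docs) {
  const rows = [];
  for (const doc of docs) {
    mapFn(doc, (key, value) => rows.push({ id: doc._id, key, value }));
  }
  return rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

// The map function from the slide, with emit supplied by the harness.
function titlesWithHi(doc, emit) {
  if (doc.title && doc.title.indexOf("Hi!") > -1) {
    emit(doc.title, doc);
  }
}

const sampleDocs = [
  { _id: "1", title: "Oh, Hi! there" },
  { _id: "2", title: "Goodbye" },
  { _id: "3" }, // no title at all: the guard skips it
  { _id: "4", title: "Hi! again" },
];

const viewRows = runView(titlesWithHi, sampleDocs);
console.log(viewRows.map((row) => row.key));
```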
Neo4j


   Stores data in a graph of nodes and relationships
   Can handle billions of nodes per machine
   Means you can query on relationships!
   Supports ACID transactions
   One 500 KB jar (!)
   Dual-licensed GPL/Commercial
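
"Query on relationships" can be illustrated with a toy in-memory graph. This is not Neo4j's API; the Graph class, the KNOWS and WROTE relationship types and the node ids are all invented for the sketch.

```javascript
class Graph {
  constructor() {
    this.nodes = new Map();  // id -> properties
    this.relationships = []; // { from, type, to }
  }

  addNode(id, props) {
    this.nodes.set(id, props);
  }

  relate(from, type, to) {
    this.relationships.push({ from, type, to });
  }

  // Follow outgoing relationships of one type from a node.
  outgoing(id, type) {
    return this.relationships
      .filter((rel) => rel.from === id && rel.type === type)
      .map((rel) => rel.to);
  }

  // Breadth-first traversal: every node reachable within maxDepth hops
  // over relationships of the given type, excluding the start node.
  traverse(start, type, maxDepth) {
    const seen = new Set([start]);
    const found = [];
    let frontier = [start];
    for (let depth = 0; depth < maxDepth; depth++) {
      const next = [];
      for (const id of frontier) {
        for (const neighbour of this.outgoing(id, type)) {
          if (!seen.has(neighbour)) {
            seen.add(neighbour);
            found.push(neighbour);
            next.push(neighbour);
          }
        }
      }
      frontier = next;
    }
    return found;
  }
}

const graph = new Graph();
for (const id of ["glen", "anna", "bob", "carl", "post1"]) graph.addNode(id, {});
graph.relate("glen", "KNOWS", "anna");
graph.relate("anna", "KNOWS", "bob");
graph.relate("bob", "KNOWS", "carl");
graph.relate("glen", "WROTE", "post1");

// Friends and friends-of-friends of glen: two hops over KNOWS.
const withinTwoHops = graph.traverse("glen", "KNOWS", 2);
console.log(withinTwoHops);
```

The same question in a relational schema would need a self-join per hop, which is where a graph store earns its keep.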
Sample Code
Blogvertising


 http://blogs.bytecode.com.au/glen
 http://twitter.com/glen_a_smith
 http://grailspodcast.com/


 Download all the source from today:
 http://bitbucket.org/glen_a_smith/cjug-nosql-examples
Q&A


 Looking for a good book?
