© COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.
Introduction to NoSQL
Jim Driscoll, MarkLogic
SLIDE: 2
Agenda
 History of NoSQL
 NoSQL Terminology
 Types of NoSQL Databases (with examples of each)
SLIDE: 3
HISTORY OF NOSQL
SLIDE: 4
A Short History of Data
 Application Specific Databases
 Size is paramount
 Relational Databases
 Size matters…
 …but break from the application silo
 … and provide data integrity
 NoSQL Databases
 Agility
 Scalability
 Speed
SLIDE: 5
RELATIONAL DOESN’T MEAN WHAT YOU THINK IT DOES
SLIDE: 6
What's wrong with Relational?
 Nothing, it's perfect for square data
 ...where you know the relationships in advance
 ...where the schema doesn't change often
 ...where all the data can fit on one machine
 ...where a separate disk seek for every join isn't an issue (or can be cached)
SLIDE: 7
SEEK AND YOU WILL FIND …IN ABOUT 10MS
SLIDE: 8
The Rise of NoSQL
 1998: Carlo Strozzi coins the term
 2001: MarkLogic founded
 2003: Google File System paper
 2004: Google MapReduce paper
 2006: Google Bigtable paper
 2007: Amazon Dynamo paper
 2009: Eric Evans popularizes the term
SLIDE: 9
MEMCACHED
SLIDE: 10
Memcached
 Developed at LiveJournal as a frontend cache for websites
 First released in 2003
 Keep disk access at a minimum, pool memory on many machines
 So useful it found wide popularity, still under active development
SLIDE: 11
Memcached
 High Performance, Distributed Memory Object Caching System
 Distributed – runs across many computers
 Memory – runs without touching disk
 Object cache – designed to hold small lumps of data
 High performance – because it never touches disk, and the objects are small, it’s optimized for speed
SLIDE: 12
Memcached
 Client server system
 Servers are unaware of each other
 Clients determine server to use via hashing
 Servers keep content as an LRU cache
 So all data is transitory
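The client-side hashing and LRU eviction described on this slide can be sketched in a few lines of Python. This is an illustration only, not the memcached implementation: `CacheServer`, `pick_server`, the capacity of two items, and the use of simple modulo hashing are all assumptions made for the example.

```python
import hashlib
from collections import OrderedDict


class CacheServer:
    """Hypothetical in-memory server that evicts the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def set(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)          # mark as most recently used
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used

    def get(self, key):
        if key not in self.items:
            return None                      # a miss: the caller falls back to the real store
        self.items.move_to_end(key)
        return self.items[key]


def pick_server(key, servers):
    """The client, not the servers, decides where a key lives."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]


# Servers never talk to each other; the same key always hashes to the same one.
servers = [CacheServer(capacity=2) for _ in range(3)]
pick_server("user:42", servers).set("user:42", "Alice")
cached = pick_server("user:42", servers).get("user:42")

# LRU eviction on a single server: "a" is the oldest entry and is dropped.
lru = CacheServer(capacity=2)
for k in ("a", "b", "c"):
    lru.set(k, k.upper())
evicted = lru.get("a")
```

Because any entry may be evicted at any time, callers have to treat a `None` result as normal and reload the value from the system of record.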
SLIDE: 13
SHARDING
SLIDE: 14
Sharding to Scale Out
SLIDE: 15
BIGTABLE
SLIDE: 16
Bigtable
 Created by Google in 2004
 …to store massive amounts of data
 Made public in famous 2006 paper
 Used throughout Google
 Gmail, Google Maps, YouTube, Web Indexing, etc.
 Reportedly over 100 internal projects
 Never shipped externally as a product
 … but available for public use as part of the App Engine hosting API
SLIDE: 17
Bigtable
 Rows are composed of columns, which in turn belong to column families
 Column families are essentially typing, validation and expiration info
 It’s helpful to think of them as the “tables”
 Lookups are done via a Row Key
 Every cell is versioned via timestamp, and sparsely stored
 System is robust and crash resistant
 Can survive the crash of any machine, including the master
 Scale out architecture
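One way to picture the data model on this slide is a sparse, nested map: row key, then column (family plus qualifier), then timestamp. The Python sketch below is a toy under that assumption; `SparseTable`, its methods, and the sample row and column names are hypothetical, and nothing here reflects how Bigtable actually stores data on disk.

```python
from collections import defaultdict


class SparseTable:
    """Hypothetical toy model: row key -> "family:qualifier" -> {timestamp: value}."""

    def __init__(self, families):
        self.families = set(families)                     # declared up front
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, value, timestamp):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][column][timestamp] = value     # keep every version

    def get(self, row_key, column, timestamp=None):
        """Return the newest version at or before timestamp (newest overall if None)."""
        versions = self.rows.get(row_key, {}).get(column, {})
        eligible = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(eligible)] if eligible else None


table = SparseTable(families=["contents", "anchor"])
table.put("com.example.www", "contents:html", "<html>v1</html>", timestamp=1)
table.put("com.example.www", "contents:html", "<html>v2</html>", timestamp=2)
table.put("com.example.www", "anchor:news.example.org", "Example", timestamp=1)

latest = table.get("com.example.www", "contents:html")
older = table.get("com.example.www", "contents:html", timestamp=1)
missing = table.get("org.example.other", "contents:html")
```

Cells that were never written take up no space at all, which is what "sparsely stored" means in practice.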
SLIDE: 18
MAP / REDUCE
SLIDE: 19
Map / Reduce
 Massively Distributed Processes
 Map - sort, filter, transform data
 Reduce - summarize data (iteratively)
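The two phases can be illustrated with the classic word count, run here in a single Python process. The function names and sample documents are made up for the example; a real framework would distribute the map and reduce calls across many machines and handle the grouping step itself.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: transform one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key; a real framework does this between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Reduce: fold the values for one key into a summary, one step at a time."""
    return reduce(lambda total, value: total + value, values, 0)


documents = ["NoSQL scales out", "SQL scales up", "NoSQL is not only SQL"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = {word: reduce_phase(values) for word, values in shuffle(pairs).items()}
```

Each call to `map_phase` and `reduce_phase` depends only on its own input, which is what makes the work easy to spread across a cluster.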
SLIDE: 20
HADOOP
SLIDE: 21
Hadoop
 First envisioned as “Nutch” at the Internet Archive in 2002
 There were hundreds of millions of webpages to index
 Early versions heavily influenced by the Google File System and MapReduce papers
 Goal: Perform work on large datasets using commodity machines
 Development moved to Yahoo in 2006
 Open Sourced to Apache, as Hadoop
 A File system (HDFS)
 A Task Runner (MapReduce)
 A Resource Manager (YARN)
 Note: Not a database
SLIDE: 22
Hadoop
 Really good at…
 Batch Processing
 … on incredibly large data sets
 Not so good at…
 Latency
 Updates
 Usability
SLIDE: 23
DYNAMO
SLIDE: 24
Amazon Dynamo
 Created to power Amazon’s Web store
 Writing with low latency more important than consistency
 Techniques first made public in 2007 paper
 Never externally shipped…
 …but huge influence on market
 Used for a variety of critical portions of Amazon’s site
 Shopping cart
 User Session
 Succeeded by DynamoDB
 Similar name, but whole new architecture (with better consistency)
SLIDE: 25
Amazon Dynamo
 Distributed Key Value store
 “always writable”
 low latency reads and writes, at the expense of consistency
 asynchronous replication on put() operations
 …means that get() may return a stale value
 updates during a network partition can result in conflicts
 …and the application must handle them
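The stale read and the conflict described above can be reproduced with two toy replicas. This Python sketch uses vector clocks as one common way to detect concurrent writes; the slide does not say which mechanism is used, and `Replica`, `descends`, and the shopping-cart values are hypothetical names chosen for the illustration.

```python
def descends(a, b):
    """True if vector clock a has seen everything recorded in vector clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


class Replica:
    def __init__(self, name):
        self.name = name
        self.versions = {}     # key -> list of (clock, value) siblings

    def put(self, key, value, context=None):
        """Always accept the write; context is the clock the client last read."""
        clock = dict(context or {})
        clock[self.name] = clock.get(self.name, 0) + 1
        self.receive(key, clock, value)
        return clock

    def receive(self, key, clock, value):
        """Apply a local or replicated version, dropping versions it supersedes."""
        siblings = self.versions.get(key, [])
        if any(descends(c, clock) for c, _ in siblings):
            return                                   # already have it, or something newer
        kept = [(c, v) for c, v in siblings if not descends(clock, c)]
        self.versions[key] = kept + [(clock, value)]

    def get(self, key):
        return [value for _, value in self.versions.get(key, [])]


a, b = Replica("a"), Replica("b")

# Asynchronous replication: b has not heard about the write yet.
clock1 = a.put("cart", {"book"})
stale = b.get("cart")                    # empty until replication arrives
b.receive("cart", clock1, {"book"})

# Network partition: both sides accept a write based on the same version.
clock_a = a.put("cart", {"book", "pen"}, context=clock1)
clock_b = b.put("cart", {"book", "mug"}, context=clock1)

# Partition heals: neither version supersedes the other, so both are kept.
a.receive("cart", clock_b, {"book", "mug"})
siblings = a.get("cart")

# The application resolves the conflict; for a cart, a union is one option.
merged = set().union(*siblings)
```

A union works for adding items to a cart but can resurrect items that were removed on one side, which is why conflict handling is left to the application.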
SLIDE: 26
TERMINOLOGY
SLIDE: 27
What is(n't) NoSQL
 No SQL
 Schema-less
 Open Source
 BASE (Eventually Consistent)
SLIDE: 28
ACID
 Atomicity
 Everything in a transaction either succeeds or fails as a unit
 Consistency
 Nothing is saved unless it passes consistency rules
 Isolation
 No two processes can interfere with each other
 Durability
 Once saved, data cannot be lost due to system failure
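Atomicity and consistency are easiest to see in a small transaction. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the `accounts` table, the `CHECK` rule, and the `transfer` helper are assumptions invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money in one transaction: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected the transfer


ok = transfer(conn, "alice", "bob", 60)        # succeeds
rejected = transfer(conn, "alice", "bob", 60)  # would overdraw alice, so it is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the rejected transfer the credit to `bob` has already been applied when the debit fails; the rollback undoes it, so no partial update is ever visible.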
SLIDE: 29
BASE
 Basically Available
 Soft state
 Eventually consistent
SLIDE: 30
What happens without consistency?
 Absolute fastest performance at lowest hardware cost
 Highest global data availability at lowest hardware cost
 Working with one document or row at a time
 Writing advanced code to create your own consistency model
 Eventually consistent data
 Some inconsistent data that can’t be reconciled
 Some missing data that can’t be recovered
 Some inconsistent query results
SLIDE: 31
What is NoSQL?
 Database
 Non-relational
 Schema on read
 Scale out architecture
 Cluster friendly / Cloud Ready
SLIDE: 32
TYPES OF NOSQL DATABASES
SLIDE: 33
Types of NoSQL Databases
 Graph Databases
 Wide Column Databases
 Key Value Databases
 Document Databases
SLIDE: 34
KEY VALUE STORES
SLIDE: 35
MemcacheDB
 Very early KV implementation (2008)
 KV Store based on Memcached source, with Berkeley DB persistent store
 Speaks the memcached protocol
 Development stopped (2009), but still quite popular
 For when you like Memcached, but want persistence
SLIDE: 36
REDIS
SLIDE: 37
Redis
 First released in 2009
 Sponsored by VMware, then Pivotal
 Name means Remote Dictionary Server
 Fully in memory key value store
 Whole db must reside in memory of one machine
 Limits scalability, at the benefit of performance
 Often used as a front end cache for other NoSQL databases
SLIDE: 38
Redis
 Not just strings as values:
 Lists of strings
 Sets of strings (collections of non-repeating unsorted elements)
 Sorted sets of strings (collections of non-repeating elements ordered by a floating-point number called score)
 Hashes where keys and values are strings
SLIDE: 39
Redis
 Master / slave replication - slave may be master to another slave
 allowing tree replication
 also publish/subscribe API
 slaves may be written to separately from the master, which allows inconsistencies (!)
 Persistent store
 Append only journal
 Flushed to disk once per second by default
SLIDE: 40
DOCUMENT STORES
SLIDE: 41
Document vs Key Value Stores
 Extension of Key Value - the value is a document
 but also Structurally aware
 Indexed searches
 Self-describing document formats
 CouchDB – JSON
 MongoDB – BSON
 MarkLogic - JSON, XML
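The difference between an opaque value and a structurally aware document can be sketched as follows. This is a toy in Python, not the API of any of the products listed: `DocumentStore`, its `insert` and `find` methods, and the order documents are hypothetical, and only flat documents with hashable values are handled.

```python
import json
from collections import defaultdict

# Key value store: the value is an opaque blob, so the only access path is the key.
kv_store = {"order:1": json.dumps({"customer": "alice", "total": 30})}
by_key = json.loads(kv_store["order:1"])


class DocumentStore:
    """Hypothetical store that understands document structure and indexes every field."""

    def __init__(self):
        self.docs = {}
        self.index = defaultdict(set)   # (field, value) -> set of document ids

    def insert(self, doc_id, doc):
        self.docs[doc_id] = doc
        for field, value in doc.items():
            self.index[(field, value)].add(doc_id)

    def find(self, field, value):
        """Indexed search: no scan over the stored documents is needed."""
        return [self.docs[i] for i in sorted(self.index.get((field, value), ()))]


store = DocumentStore()
store.insert("order:1", {"customer": "alice", "total": 30})
store.insert("order:2", {"customer": "bob", "total": 75})
store.insert("order:3", {"customer": "alice", "total": 75})

alice_orders = store.find("customer", "alice")
big_orders = store.find("total", 75)
```

With the key value store, answering "which orders belong to alice?" would mean fetching and parsing every value; the document store answers it from the index.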
SLIDE: 42
MARKLOGIC
SLIDE: 43
MarkLogic
 Founded in 2001, founders were search engine experts
 Document centric database with search engine features
 Stores and indexes XML, JSON, text and binaries
 Enterprise NoSQL
 ACID transactions (including XA)
 HA/DR
 Government grade security
SLIDE: 44
MarkLogic
 Universal Index – index all the things
 Index words, elements, the relationships of words and elements
 Many indexes (automatically) used at once, resolving queries without touching disk
 Search on ranges, free text, field values, more…
 Shared nothing architecture, transactions via MVCC
 Automatic partitioning and balancing
 Hadoop support (works on HDFS, and with Map/Reduce jobs)
 Includes a webserver for building RESTful applications
SLIDE: 45
MarkLogic
 More than a document store
 Range indexes allow in-memory column operations
 Triple store, supporting RDF Triples and SPARQL
 High Availability – multiple copies of updates saved transactionally
 Disaster Recovery – copies sent to a remote site within a time window
 Free to download and try out with a developer license
SLIDE: 46
MONGODB
SLIDE: 47
MongoDB
 Development began at 10gen in 2007
 Name from “humongous”
 10gen originally wanted to create a system like Google App Engine
 1.4 considered first “production ready” release, 2010
 Stores and retrieves BSON documents
 Horizontally scaling
SLIDE: 48
MongoDB
 Stores data in its own binary format
 BSON, similar to JSON with more data types
 Search on field, on range, or on regex
 Single index per query (secondary index optional)
 Replication of databases as master/slave, with (tunable) eventual consistency
 Sharding handled via a shard key, splitting by range
 Be sure the key is evenly distributed
 Client APIs in many languages
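The advice about an evenly distributed shard key follows directly from how range partitioning works. The Python sketch below is a generic illustration, not MongoDB code: the split points and the `shard_for` helper are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical split points: shard 0 holds keys below 1000, shard 1 holds 1000..1999, ...
SPLIT_POINTS = [1000, 2000, 3000]


def shard_for(key):
    """Range partitioning: find the range that contains the shard key."""
    return bisect_right(SPLIT_POINTS, key)


# Keys spread across the whole range land on every shard.
spread_keys = range(0, 4000, 100)
spread = Counter(shard_for(k) for k in spread_keys)

# Monotonically increasing keys (for example, a timestamp or counter) pile up
# on the last shard.
increasing_keys = range(3000, 3040)
hot = Counter(shard_for(k) for k in increasing_keys)
```

In the first case each of the four shards receives ten keys; in the second, all forty new keys go to the same shard, which then takes the entire write load.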
SLIDE: 49
WIDE COLUMN STORES
SLIDE: 50
Column Stores
 Descended from the Bigtable approach
 Excellent for sparse data
 Column families need to be specified up front
 But still stored sparsely
 No way to list all the columns in the database
 Append only
 Updates via timestamp
 Deletes via tombstone marker
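The append-only behaviour can be sketched for a single cell: an update is a newer entry, a delete is a tombstone entry, and a read picks the entry with the highest timestamp. `AppendOnlyColumn`, `TOMBSTONE`, and the `compact` step are hypothetical names for this Python illustration and do not mirror any particular product.

```python
TOMBSTONE = object()   # marker meaning "deleted as of this timestamp"


class AppendOnlyColumn:
    """Hypothetical single cell: writes are only ever appended, never changed in place."""

    def __init__(self):
        self.log = []                          # list of (timestamp, value)

    def write(self, timestamp, value):
        self.log.append((timestamp, value))    # an update is just a newer entry

    def delete(self, timestamp):
        self.log.append((timestamp, TOMBSTONE))

    def read(self):
        """The entry with the highest timestamp wins, whatever order it arrived in."""
        if not self.log:
            return None
        _, value = max(self.log, key=lambda entry: entry[0])
        return None if value is TOMBSTONE else value

    def compact(self):
        """Keep only the newest entry; drop it as well if it is a tombstone."""
        if self.log:
            newest = max(self.log, key=lambda entry: entry[0])
            self.log = [] if newest[1] is TOMBSTONE else [newest]


cell = AppendOnlyColumn()
cell.write(1, "a@example.com")
cell.write(3, "b@example.com")
cell.write(2, "late@example.com")      # arrives late, but carries an older timestamp
current = cell.read()

cell.delete(4)
after_delete = cell.read()
entries_before_compaction = len(cell.log)
cell.compact()
```

Nothing is removed at delete time; the space is only reclaimed when a later compaction pass discards superseded entries.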
SLIDE: 51
CASSANDRA
SLIDE: 52
Cassandra
 Developed at Facebook, 2008, donated to Apache
 Descended from Bigtable and Dynamo
 One of the primary Dynamo developers helped create Cassandra
 Focused on maximum throughput
 Write lots of data, fast
 But at the expense of consistency (tunable)
 Used by Twitter, Reddit, Netflix
 …but not Facebook
SLIDE: 53
Cassandra
 Partitioned via hash (multiple strategies)
 Be careful choosing your Row Key!
 Async masterless replication
 Tunable Consistency
 from “writes never fail” to “wait until persisted on all replicas”
 Query with range queries, column family, CQL
 Hadoop support (replaces HDFS)
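Tunable consistency is often reasoned about with a simple rule of thumb: with N replicas, a read that waits for R replies is guaranteed to overlap the latest write acknowledged by W replicas when R + W > N. The Python helpers below encode only that arithmetic; the function names are hypothetical and this is not Cassandra client code.

```python
def is_strongly_consistent(replicas, write_acks, read_acks):
    """Reads overlap the latest acknowledged write when R + W > N."""
    if not (1 <= write_acks <= replicas and 1 <= read_acks <= replicas):
        raise ValueError("acknowledgement counts must be between 1 and the replica count")
    return read_acks + write_acks > replicas


def quorum(replicas):
    """A majority of the replicas."""
    return replicas // 2 + 1


n = 3
fast = is_strongly_consistent(n, write_acks=1, read_acks=1)
quorum_both = is_strongly_consistent(n, write_acks=quorum(n), read_acks=quorum(n))
write_all = is_strongly_consistent(n, write_acks=n, read_acks=1)
```

With three replicas, writing and reading at one replica each is the fastest setting but can return stale data; quorum reads and writes, or writing to all replicas and reading from one, both satisfy the rule.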
SLIDE: 54
GRAPH DATABASES
SLIDE: 55
Nodes and Edges
SLIDE: 56
NEO4J
SLIDE: 57
Neo4J
 Released in 2010
 Written in Java, APIs are Java centric
 Most popular Graph Database
 Powers the recommendation engines of Glassdoor, Walmart
SLIDE: 58
Neo4J
 Whole graph in memory – scales to millions of relationships
 But does persist to disk
 Transactional
 Replicated for performance and robustness, master/slave
 Proprietary Graph query language (Cypher)
 Enterprise version adds clustering, sharding
SLIDE: 59
SEMANTIC WEB
SLIDE: 60
Semantics: A New Way to Organize Data
Data is stored in Triples, expressed as: Subject : Predicate : Object
John Smith : livesIn : London
London : isIn : England
Query with SPARQL, which gives us simple lookups… and more
Find people who live in (a place that's in) England
[Diagram: "John Smith" -livesIn-> "London" -isIn-> "England"; a second livesIn edge links "John Smith" to "England"]
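The two-step query on this slide can be mimicked with a set of triples and a pattern matcher. The Python sketch below is an illustration, not a SPARQL engine: `match` and `people_living_in` are hypothetical helpers, and the Jane Doe and Paris facts are invented for contrast.

```python
triples = {
    ("John Smith", "livesIn", "London"),
    ("London", "isIn", "England"),
    ("Jane Doe", "livesIn", "Paris"),      # hypothetical extra facts for contrast
    ("Paris", "isIn", "France"),
}


def match(subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def people_living_in(region):
    """Join two patterns, roughly: ?person livesIn ?place . ?place isIn region"""
    places = {s for (s, _, _) in match(predicate="isIn", obj=region)}
    return sorted(s for (s, _, o) in match(predicate="livesIn") if o in places)


simple_lookup = match(subject="John Smith", predicate="livesIn")
in_england = people_living_in("England")
```

The simple lookup returns the single stored fact about John Smith, while the join finds him as a resident of England even though no triple states that directly.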
SLIDE: 61
Context from the World at Large
“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”
Linked Open Data
 Facts that are freely available
 In a form that’s easily consumed
DBpedia (Wikipedia as structured information)
 Einstein was born in Germany
 Ireland’s currency is the Euro
GeoNames
 Doha is the capital of Qatar
 Doha has these lat/long coordinates
SLIDE: 62
IN CONCLUSION…
SLIDE: 63
Don't Design Your System Like It's 1979
SLIDE: 64
ANY QUESTIONS?
@MARKLOGIC

Contenu connexe

Tendances

RHTE2015_CloudForms_Containers
RHTE2015_CloudForms_ContainersRHTE2015_CloudForms_Containers
RHTE2015_CloudForms_Containers
Jerome Marc
 

Tendances (20)

Cloud Computing
Cloud Computing Cloud Computing
Cloud Computing
 
Introduction to MANTL Data Platform
Introduction to MANTL Data PlatformIntroduction to MANTL Data Platform
Introduction to MANTL Data Platform
 
RHTE2015_CloudForms_Containers
RHTE2015_CloudForms_ContainersRHTE2015_CloudForms_Containers
RHTE2015_CloudForms_Containers
 
Delivering Agile Data Science on Openshift - Red Hat Summit 2019
Delivering Agile Data Science on Openshift  - Red Hat Summit 2019Delivering Agile Data Science on Openshift  - Red Hat Summit 2019
Delivering Agile Data Science on Openshift - Red Hat Summit 2019
 
IBM Power leading Cognitive Systems
IBM Power leading Cognitive SystemsIBM Power leading Cognitive Systems
IBM Power leading Cognitive Systems
 
Destination Marketing Open Source and Cloud Presentation
Destination Marketing Open Source and Cloud PresentationDestination Marketing Open Source and Cloud Presentation
Destination Marketing Open Source and Cloud Presentation
 
Openstack Benelux Conference 2014 Red Hat Keynote
Openstack Benelux Conference 2014  Red Hat KeynoteOpenstack Benelux Conference 2014  Red Hat Keynote
Openstack Benelux Conference 2014 Red Hat Keynote
 
VMware - Openstack e VMware: la strana coppia
VMware - Openstack e VMware: la strana coppia VMware - Openstack e VMware: la strana coppia
VMware - Openstack e VMware: la strana coppia
 
Liberate Your Files with a Private Cloud Storage Solution powered by Open Source
Liberate Your Files with a Private Cloud Storage Solution powered by Open SourceLiberate Your Files with a Private Cloud Storage Solution powered by Open Source
Liberate Your Files with a Private Cloud Storage Solution powered by Open Source
 
CNCF Live Webinar: Kubernetes 1.23
CNCF Live Webinar: Kubernetes 1.23CNCF Live Webinar: Kubernetes 1.23
CNCF Live Webinar: Kubernetes 1.23
 
Building Cloud-Native Applications with a Container-Native SQL Database in th...
Building Cloud-Native Applications with a Container-Native SQL Database in th...Building Cloud-Native Applications with a Container-Native SQL Database in th...
Building Cloud-Native Applications with a Container-Native SQL Database in th...
 
Its Finally Here! Building Complex Streaming Analytics Apps in under 10 min w...
Its Finally Here! Building Complex Streaming Analytics Apps in under 10 min w...Its Finally Here! Building Complex Streaming Analytics Apps in under 10 min w...
Its Finally Here! Building Complex Streaming Analytics Apps in under 10 min w...
 
MOUG17 Keynote: Oracle OpenWorld Major Announcements
MOUG17 Keynote: Oracle OpenWorld Major AnnouncementsMOUG17 Keynote: Oracle OpenWorld Major Announcements
MOUG17 Keynote: Oracle OpenWorld Major Announcements
 
OpenShift on OpenStack
OpenShift on OpenStackOpenShift on OpenStack
OpenShift on OpenStack
 
Docker based Hadoop provisioning - anywhere
Docker based Hadoop provisioning - anywhereDocker based Hadoop provisioning - anywhere
Docker based Hadoop provisioning - anywhere
 
HP Helion OpenStack step by step
HP Helion OpenStack step by stepHP Helion OpenStack step by step
HP Helion OpenStack step by step
 
Journey to the Cloud with Red Hat
Journey to the Cloud with Red HatJourney to the Cloud with Red Hat
Journey to the Cloud with Red Hat
 
PaaS is dead, Long live PaaS - Defrag 2016
PaaS is dead, Long live PaaS - Defrag 2016PaaS is dead, Long live PaaS - Defrag 2016
PaaS is dead, Long live PaaS - Defrag 2016
 
Introduction to Apache Mesos and DC/OS
Introduction to Apache Mesos and DC/OSIntroduction to Apache Mesos and DC/OS
Introduction to Apache Mesos and DC/OS
 
How to get started with Oracle Cloud Infrastructure
How to get started with Oracle Cloud InfrastructureHow to get started with Oracle Cloud Infrastructure
How to get started with Oracle Cloud Infrastructure
 

Similaire à Intro to NoSQL

Data-Centric Infrastructure for Agile Development
Data-Centric Infrastructure for Agile DevelopmentData-Centric Infrastructure for Agile Development
Data-Centric Infrastructure for Agile Development
DATAVERSITY
 
Oracle NoSQL Database release 3.0 overview
Oracle NoSQL Database release 3.0 overviewOracle NoSQL Database release 3.0 overview
Oracle NoSQL Database release 3.0 overview
Paulo Fagundes
 

Similaire à Intro to NoSQL (20)

Data-Centric Infrastructure for Agile Development
Data-Centric Infrastructure for Agile DevelopmentData-Centric Infrastructure for Agile Development
Data-Centric Infrastructure for Agile Development
 
OUG Scotland 2014 - NoSQL and MySQL - The best of both worlds
OUG Scotland 2014 - NoSQL and MySQL - The best of both worldsOUG Scotland 2014 - NoSQL and MySQL - The best of both worlds
OUG Scotland 2014 - NoSQL and MySQL - The best of both worlds
 
NoSQL and MySQL
NoSQL and MySQLNoSQL and MySQL
NoSQL and MySQL
 
HBaseCon2017 Splice Machine as a Service: Multi-tenant HBase using DCOS (Meso...
HBaseCon2017 Splice Machine as a Service: Multi-tenant HBase using DCOS (Meso...HBaseCon2017 Splice Machine as a Service: Multi-tenant HBase using DCOS (Meso...
HBaseCon2017 Splice Machine as a Service: Multi-tenant HBase using DCOS (Meso...
 
MySQL's NoSQL -- Texas Linuxfest August 22nd 2015
MySQL's NoSQL  -- Texas Linuxfest August 22nd 2015MySQL's NoSQL  -- Texas Linuxfest August 22nd 2015
MySQL's NoSQL -- Texas Linuxfest August 22nd 2015
 
Couchbase and Apache Spark
Couchbase and Apache SparkCouchbase and Apache Spark
Couchbase and Apache Spark
 
Introducing Mache
Introducing MacheIntroducing Mache
Introducing Mache
 
Solution Use Case Demo: The Power of Relationships in Your Big Data
Solution Use Case Demo: The Power of Relationships in Your Big DataSolution Use Case Demo: The Power of Relationships in Your Big Data
Solution Use Case Demo: The Power of Relationships in Your Big Data
 
Oracle NoSQL Database release 3.0 overview
Oracle NoSQL Database release 3.0 overviewOracle NoSQL Database release 3.0 overview
Oracle NoSQL Database release 3.0 overview
 
01282016 Aerospike-Docker webinar
01282016 Aerospike-Docker webinar01282016 Aerospike-Docker webinar
01282016 Aerospike-Docker webinar
 
Why does Microsoft care about NoSQL, SQL and Polyglot Persistence?
Why does Microsoft care about NoSQL, SQL and Polyglot Persistence?Why does Microsoft care about NoSQL, SQL and Polyglot Persistence?
Why does Microsoft care about NoSQL, SQL and Polyglot Persistence?
 
Spark 101
Spark 101Spark 101
Spark 101
 
Oracle NoSQL Database -- Big Data Bellevue Meetup - 02-18-15
Oracle NoSQL Database -- Big Data Bellevue Meetup - 02-18-15Oracle NoSQL Database -- Big Data Bellevue Meetup - 02-18-15
Oracle NoSQL Database -- Big Data Bellevue Meetup - 02-18-15
 
Idera live 2021: Managing Databases in the Cloud - the First Step, a Succes...
Idera live 2021:   Managing Databases in the Cloud - the First Step, a Succes...Idera live 2021:   Managing Databases in the Cloud - the First Step, a Succes...
Idera live 2021: Managing Databases in the Cloud - the First Step, a Succes...
 
Geode Meetup Apachecon
Geode Meetup ApacheconGeode Meetup Apachecon
Geode Meetup Apachecon
 
The New Database Frontier: Harnessing the Cloud
The New Database Frontier: Harnessing the CloudThe New Database Frontier: Harnessing the Cloud
The New Database Frontier: Harnessing the Cloud
 
Viridians on Rails
Viridians on RailsViridians on Rails
Viridians on Rails
 
EDB Postgres with Containers
EDB Postgres with ContainersEDB Postgres with Containers
EDB Postgres with Containers
 
Rails Concept
Rails ConceptRails Concept
Rails Concept
 
Manuel Hurtado. Couchbase paradigma4oct
Manuel Hurtado. Couchbase paradigma4octManuel Hurtado. Couchbase paradigma4oct
Manuel Hurtado. Couchbase paradigma4oct
 

Dernier

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Dernier (20)

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 

Intro to NoSQL

  • 1. © COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED. Introduction to NoSQL Jim Driscoll, MarkLogic
  • 2. © COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.SLIDE: 2 Agenda  History of NoSQL  NoSQL Terminology  Types of NoSQL Databases (with examples of each)
  • 3. © COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.SLIDE: 3 HISTORY OF NOSQL
  • 4. © COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.SLIDE: 4 A Short History of Data  Application Specific Databases  Size is paramount  Relational Databases  Size matters…  …but break from the application silo  … and provide data integrity  NoSQL Databases  Agility  Scalability  Speed
  • 5. © COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.SLIDE: 5 RELATIONAL DOESN’T MEAN WHAT YOU THINK IT DOES
  • 6. © COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.SLIDE: 6 What's wrong with Relational?  Nothing, it's perfect for square data  ...where you know the relationships in advance  ...where the schema doesn't change often  ...where all the data can fit on one machine  ...where a separate disk seek for every join isn't an issue (or can be cached)
  • 7. © COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.SLIDE: 7 SEEK AND YOU WILL FIND …IN ABOUT 10MS
  • 8. © COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.SLIDE: 8 The Rise of NoSQL 1998 2001 2003 2004 2006 2007 2009 Google FileSystem paper Carlo Strozzi coins term Google BigTable paper Eric Evans popularizes term MarkLogic founded Google MapReduce paper Amazon Dynamo paper
  • 9. © COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.SLIDE: 9 MEMCACHED
  • 10. © COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.SLIDE: 10 Memcached  Developed at LiveJournal as a frontend cache for websites  First released in 2003  Keep disk access at a minimum, pool memory on many machines  So useful it found wide popularity, still under active development
  • 11. © COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.SLIDE: 11 Memcached  High Performance, Distributed Memory Object Caching System  Distributed – runs across many computers  Memory – runs without touching disk  Object cache – designed to hold small lumps of data  High performance – because it never touches disk, and the objects are small, it’s optimized for speed
  • 12. © COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.SLIDE: 12 Memcached  Client server system  Servers are unaware of each other  Clients determine server to use via hashing  Servers keep content as an LRU cache  So all data transitory
  • 13. © COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.SLIDE: 13 SHARDING
  • 14. © COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.SLIDE: 14 Sharding to Scale Out
  • 15. © COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.SLIDE: 15 BIGTABLE
  • 16. © COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.SLIDE: 16 Bigtable  Created by Google in 2004  …to store massive amounts of data  Made public in famous 2006 paper  Used throughout Google  GMail, Google Maps, YouTube, Web Indexing, etc  Reportedly over 100 internal projects  Never shipped externally as a product  … but available for public use as part the AppEngine hosting API
  • 17. © COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.SLIDE: 17 Bigtable  Rows are composed of columns, which in turn belong to column families  Column families are essentially typing, validation and expiration info  It’s helpful to think of them as the “tables”  Lookups are done via a Row Key  Every cell is versioned via timestamp, and sparsely stored  System is robust and crash resistant  Can survive the crash of any machine, including the master  Scale out architecture
  • 18. © COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.SLIDE: 18 MAP / REDUCE
  • 19. © COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.SLIDE: 19 Map / Reduce  Massively Distributed Processes  Map - sort, filter, transform data  Reduce - summarize data (iteratively)
  • 20. HADOOP
  • 21. Hadoop
      - First envisioned as “Nutch” at the Internet Archive in 2002
      - There were hundreds of millions of webpages to index
      - Early versions heavily influenced by Google File System, Map Reduce papers
      - Goal: Perform work on large datasets using commodity machines
      - Development moved to Yahoo in 2006
      - Open Sourced to Apache, as Hadoop
      - A File system (HDFS)
      - A Task Runner (MapReduce)
      - A Task Manager (YARN)
      - Note: Not a database
  • 22. Hadoop
      - Really good at… Batch Processing … on incredibly large data sets
      - Not so good parts: Latency, Updates, Usability
  • 23. DYNAMO
  • 24. Amazon Dynamo
      - Created to power Amazon’s Web store
      - Writing with low latency more important than consistency
      - Techniques first made public in 2007 paper
      - Never externally shipped…
      - …but huge influence on market
      - Used for a variety of critical portions of Amazon’s site: Shopping cart, User Session
      - Succeeded by DynamoDB
      - Similar name, but whole new architecture (with better consistency)
  • 25. Amazon Dynamo
      - Distributed Key Value store
      - “always writable”
      - low latency reads and writes, at the expense of consistency
      - asynchronous replication on put() operations
      - …means that get() may return a stale value
      - updates during a network partition can result in conflicts
      - …and the application must handle them
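To make "the application must handle them" concrete, here is a minimal sketch of conflict detection with vector clocks, the versioning scheme the Dynamo paper describes. The helper names `descends` and `reconcile`, the node names and the cart contents are all assumptions for illustration; merging a shopping cart by set union is one possible application-level policy, not the only one.

```python
def descends(a, b):
    """True if version clock `a` has seen everything that `b` has."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def reconcile(versions):
    """Drop versions that are ancestors of another; keep the rest as siblings."""
    survivors = []
    for clock, value in versions:
        dominated = any(
            descends(other, clock) and other != clock for other, _ in versions
        )
        if not dominated:
            survivors.append((clock, value))
    return survivors


base = ({"A": 1}, ["book"])
left = ({"A": 2}, ["book", "pen"])            # update accepted by node A
right = ({"A": 1, "B": 1}, ["book", "lamp"])  # update accepted by node B during a partition

siblings = reconcile([base, left, right])
print(len(siblings))  # 2: base is superseded, left and right conflict

# Application-level merge: keep every item that appears in any sibling.
merged = sorted(set().union(*(value for _, value in siblings)))
print(merged)  # ['book', 'lamp', 'pen']
```

The store can only tell the application that two versions diverged; deciding what the merged value should be is business logic.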
  • 26. TERMINOLOGY
  • 27. What is(n't) NoSQL
      - No SQL
      - Schema-less
      - Open Source
      - BASE (Eventually Consistent)
  • 28. ACID
      - Atomicity: Everything either succeeds or fails
      - Consistency: Nothing is saved unless it passes consistency rules
      - Isolation: No two processes can interfere with each other
      - Durability: Once saved, data cannot be lost due to system failure
  • 29. BASE
      - Basically Available
      - Soft state
      - Eventually consistent
  • 30. What happens without consistency?
      - Absolute fastest performance at lowest hardware cost
      - Highest global data availability at lowest hardware cost
      - Working with one document or row at a time
      - Writing advanced code to create your own consistency model
      - Eventually consistent data
      - Some inconsistent data that can’t be reconciled
      - Some missing data that can’t be recovered
      - Some inconsistent query results
  • 31. What is NoSQL?
      - Database
      - Non-relational
      - Schema on read
      - Scale out architecture
      - Cluster friendly / Cloud Ready
  • 32. TYPES OF NOSQL DATABASES
  • 33. Types of NoSQL Databases
      - Graph Databases
      - Wide Column Databases
      - Key Value Databases
      - Document Databases
  • 34. KEY VALUE STORES
  • 35. MemcacheDB
      - Very early KV implementation (2008)
      - KV Store based on Memcached source, with Berkeley DB persistent store
      - Speaks the memcached protocol
      - Development stopped (2009), but still quite popular
      - For when you like Memcached, but want persistence
  • 36. REDIS
  • 37. Redis
      - First released in 2009
      - Sponsored by VMware, then Pivotal
      - Name means Remote Dictionary Server
      - Fully in memory key value store
      - Whole db must reside in memory of one machine
      - Limits scalability, at the benefit of performance
      - Often used as a front end cache for other NoSQL databases
  • 38. Redis
      - Not just strings as values:
      - Lists of strings
      - Sets of strings (collections of non-repeating unsorted elements)
      - Sorted sets of strings (collections of non-repeating elements ordered by a floating-point number called score)
      - Hashes where keys and values are strings
  • 39. Redis
      - Master / slave replication - slave may be master to another slave
      - allowing tree replication
      - also publish/subscribe API
      - slaves may be updated separately from master, allows inconsistencies (!)
      - Persistent store
      - Append only journal
      - Flushed every 2 seconds by default
  • 40. DOCUMENT STORES
  • 41. Document vs Key Value Stores
      - Extension of Key Value - the value is a document
      - but also Structurally aware
      - Indexed searches
      - Self-describing document formats
      - CouchDB – JSON
      - MongoDB – BSON
      - MarkLogic - JSON, XML
  • 42. MARKLOGIC
  • 43. MarkLogic
      - Founded in 2001, founders were search engine experts
      - Document centric database with search engine features
      - Stores and indexes XML, JSON, text and binaries
      - Enterprise NoSQL
      - ACID transactions (including XA)
      - HA/DR
      - Government grade security
  • 44. MarkLogic
      - Universal Index – index all the things
      - Index words, elements, the relationships of words and elements
      - Many indexes (automatically) used at once, resolving queries without touching disk
      - Search on ranges, free text, field values, more…
      - Shared nothing architecture, transactions via MVCC
      - Automatic partitioning and balancing
      - Hadoop support (works on HDFS, and with Map/Reduce jobs)
      - Includes a webserver for building RESTful applications
  • 45. MarkLogic
      - More than a document store
      - Range indexes allow in-memory column operations
      - Triple store, supporting RDF Triples and SPARQL
      - High Availability – multiple copies of updates saved transactionally
      - Disaster Recovery – copies sent to remote site with a window
      - Free to download and try out with a developer license
  • 46. MONGODB
  • 47. MongoDB
      - Development began in 2007 by 10gen
      - Name from “humongous”
      - Originally wanted to create a Google App Engine system
      - 1.4 considered first “production ready” release, 2010
      - Stores and retrieves BSON documents
      - Horizontally scaling
  • 48. MongoDB
      - Stores data in proprietary format
      - BSON, similar to JSON with more data types
      - Search on field, on range, or on regex
      - Single index per query (secondary index optional)
      - Replication of databases as master/slave, with (tunable) eventual consistency
      - Sharding handled via a shard key, splitting by range
      - Be sure the key is evenly distributed
      - Client APIs in many languages
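The warning about evenly distributed shard keys is easier to see with numbers. The sketch below models range-based sharding in general, not MongoDB's actual balancer: `SPLIT_POINTS` and `shard_for` are hypothetical, and the key values are invented for the example.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical chunk boundaries over keys 0..999:
# shard 0 holds keys below 250, shard 1 holds 250..499, and so on.
SPLIT_POINTS = [250, 500, 750]


def shard_for(shard_key):
    """Return the index of the range that contains this key."""
    return bisect_right(SPLIT_POINTS, shard_key)


# Keys spread across the whole range land on every shard.
spread_keys = [17, 260, 512, 999, 120, 700, 801, 333]
print(Counter(shard_for(k) for k in spread_keys))

# An ever-increasing key (a counter or a timestamp) always falls in the
# last range, so every new write goes to the same shard.
recent_keys = [990, 991, 992, 993]
print(Counter(shard_for(k) for k in recent_keys))
```

With the spread keys each of the four shards receives two writes; with the increasing keys all four writes land on shard 3, which is the hot spot the slide warns about.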
  • 49. WIDE COLUMN STORES
  • 50. Column Stores
      - Descended from Bigtable approach
      - Excellent for sparse data
      - Column families need to be specified up front
      - But still stored sparsely
      - No way to list all the columns in the database
      - Append only
      - Updates via timestamp
      - Deletes via tombstone marker
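"Append only", "updates via timestamp" and "deletes via tombstone marker" fit together as follows. This is a minimal sketch under simplifying assumptions: `AppendOnlyStore` and `TOMBSTONE` are hypothetical names, everything lives in one in-memory list, and there is no compaction step to eventually discard old versions.

```python
TOMBSTONE = object()  # marker meaning "this cell was deleted"


class AppendOnlyStore:
    def __init__(self):
        self.log = []  # (timestamp, key, value) records; never modified in place

    def put(self, key, value, timestamp):
        self.log.append((timestamp, key, value))

    def delete(self, key, timestamp):
        # A delete is just another write, carrying the tombstone as its value.
        self.log.append((timestamp, key, TOMBSTONE))

    def get(self, key):
        matches = [(ts, value) for ts, k, value in self.log if k == key]
        if not matches:
            return None
        _, value = max(matches, key=lambda m: m[0])  # latest timestamp wins
        return None if value is TOMBSTONE else value


store = AppendOnlyStore()
store.put("user:1/email", "a@example.com", timestamp=1)
store.put("user:1/email", "b@example.com", timestamp=2)  # an "update"
print(store.get("user:1/email"))                         # b@example.com
store.delete("user:1/email", timestamp=3)                # a "delete"
print(store.get("user:1/email"), len(store.log))         # None 3
```

Nothing is ever overwritten, so writes stay cheap; the cost moves to reads, which have to pick the newest version, and to background cleanup.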
  • 51. CASSANDRA
  • 52. Cassandra
      - Developed at Facebook, 2008, donated to Apache
      - Descended from Bigtable and Dynamo
      - One of the primary Dynamo developers helped create Cassandra
      - Focused on maximum throughput
      - Write lots of data, fast
      - But at the expense of consistency (tunable)
      - Used by Twitter, Reddit, Netflix
      - …but not Facebook
  • 53. Cassandra
      - Partitioned via hash (multiple strategies)
      - Be careful choosing your Row Key!
      - Async masterless replication
      - Tunable Consistency
      - from “writes never fail” to “wait until persisted on all replicas”
      - Query with range queries, column family, CQL
      - Hadoop support (replaces HDFS)
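One common way to reason about tunable consistency is the quorum overlap rule: with N replicas, a read that consults R of them is guaranteed to touch at least one replica that acknowledged the latest write whenever R + W > N. The sketch below only evaluates that inequality; the function name is made up, the labels borrow Cassandra's ONE, QUORUM and ALL level names, and the model ignores node failures and repair mechanisms.

```python
def read_sees_latest_write(n, w, r):
    """True when any r replicas must overlap any w replicas out of n."""
    return r + w > n


N = 3  # replication factor assumed for the example
for name, w, r in [("ONE/ONE", 1, 1), ("QUORUM/QUORUM", 2, 2), ("ALL/ONE", 3, 1)]:
    print(name, read_sees_latest_write(N, w, r))
```

Writing and reading at ONE is the fastest combination but can return stale data; the other two combinations trade latency or availability for reads that overlap the latest write.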
  • 54. GRAPH DATABASES
  • 55. Nodes and Edges
  • 56. NEO4J
  • 57. Neo4J
      - Released in 2010
      - Written in Java, APIs are Java centric
      - Most popular Graph Database
      - Powers the recommendation engines of Glassdoor, Walmart
  • 58. Neo4J
      - Whole graph in memory – scales to millions of relationships
      - But does persist to disk
      - Transactional
      - Replicated for performance and robustness, master/slave
      - Proprietary Graph query language (Cypher)
      - Enterprise version adds clustering, sharding
  • 59. SEMANTIC WEB
  • 60. Semantics: A New Way to Organize Data
      - Data is stored in Triples, expressed as: Subject : Predicate : Object
      - John Smith : livesIn : London
      - London : isIn : England
      - Query with SPARQL, gives us simple lookup .. and more
      - Find people who live in (a place that's in) England
      - [Diagram: "John Smith", "London" and "England" as nodes, connected by livesIn and isIn edges]
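The query on this slide is a join across two triple patterns. As a rough sketch of what a triple store does with it, the Python below keeps triples in a list and matches patterns with wildcards. The `match` and `people_living_in` helpers and the two extra facts are assumptions for illustration; a real store would answer the same question through SPARQL and indexes rather than a linear scan.

```python
TRIPLES = [
    ("John Smith", "livesIn", "London"),
    ("London", "isIn", "England"),
    ("Mary Jones", "livesIn", "Paris"),  # hypothetical extra facts
    ("Paris", "isIn", "France"),
]


def match(subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in TRIPLES
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def people_living_in(region):
    """Join two patterns: ?person livesIn ?place, and ?place isIn region."""
    places = {s for s, _, _ in match(predicate="isIn", obj=region)}
    return [s for s, _, o in match(predicate="livesIn") if o in places]


print(people_living_in("England"))  # ['John Smith']
```

No triple says that John Smith lives in England; the answer falls out of following the shared `?place` variable from one pattern to the other.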
  • 61. Context from the World at Large
      - “Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”
      - Linked Open Data: facts that are freely available, in a form that’s easily consumed
      - DBpedia (Wikipedia as structured information): Einstein was born in Germany; Ireland’s currency is the Euro
      - GeoNames: Doha is the capital of Qatar; Doha has these lat/long coordinates
  • 62. IN CONCLUSION…
  • 63. Don't Design Your System Like It's 1979
  • 64. ANY QUESTIONS? @MARKLOGIC

Editor's notes

  1. In formal database theory, tables are relations, rows are tuples, and fields are attributes. Relational databases aren’t about “relations” between tables; the name refers to the tables (relations) that make up the database. Relational databases actually have problems dealing with “relationships” in the informal sense: joins need to be planned for in advance, and schema design can be a multi-week (or even multi-month!) process that has to be complete before you can start building your application.
  2. Disk read latency on a spinning 7200 RPM platter is 4.17ms for avg rotational latency, plus 8 or so ms for average seek time. You want as few seeks as possible, and normalizing data increases the seeks. This made complete sense when disks only held 10MB of data. Now? Modern relational databases spend much of their optimizing effort on combating this multi-seek problem… but can be limited by the constraints of memory on commodity hardware, which means you end up buying specialist hardware. Have I mentioned Larry Ellison owns an island? An entire Hawaiian island.
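The 4.17ms figure in the note above comes from simple arithmetic: on average the platter has to turn half a revolution before the wanted sector is under the head. The sketch below reproduces it; the 8ms seek time is the note's own rough figure, and the five-seek join is a hypothetical worst case with nothing cached.

```python
RPM = 7200
AVG_SEEK_MS = 8.0  # "8 or so ms", taken from the note above

ms_per_rotation = 60_000 / RPM           # one full revolution, about 8.33 ms
avg_rotational_ms = ms_per_rotation / 2  # on average, wait half a revolution
avg_access_ms = avg_rotational_ms + AVG_SEEK_MS

print(round(avg_rotational_ms, 2), round(avg_access_ms, 2))  # 4.17 12.17

joins = 5  # hypothetical: one uncached seek per joined table
print(round(joins * avg_access_ms, 1))  # 60.8
```

At roughly 12ms per uncached access, a single disk manages only about 80 random reads per second, which is why every extra seek matters.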
  3. The turn of the millennium saw XML and object databases, like MarkLogic and Objectivity, but the real explosion in interest began in the middle of the decade, as the needs of data storage and retrieval really started to change. Eric Evans popularized the term as the title of meetups to discuss this new technology trend in San Francisco, my home town. Next, we’ll go over some of these historical developments.
  4. One of the earliest developments was the creation of Memcached
  5. Developed at LiveJournal in 2003 as a way to speed up web applications, memcached has proved so useful that it’s still in wide use and under active development.
  6. Described as: high-performance, distributed memory object caching system. Let’s unpack:
      - Distributed – runs across many computers
      - Memory – runs without touching disk
      - Object cache – designed to hold small lumps of data
      - High performance – because it never touches disk, and the objects are small, it’s optimized for speed
  7. Advantage? Scale out architecture
  8. With a single server, as in most relational systems, all you can do is buy a bigger machine – scale up. But this quickly gets ruinously expensive. NoSQL offers another way to scale – scale out. With Memcached, there’s no connection between the machines; where the data lives is determined by the client hash. That lets you set up multiple machines. [click] But other systems are possible. The servers can communicate among themselves, and decide who keeps what data. Mongo, for instance, does this by setting a key, so where data lives depends on its value. MarkLogic does this automatically without setting a key. This lets you scale to an effectively unlimited number of hosts. [click]
  9. From 2006 Google paper: Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes. Bigtable was never shipped outside of Google, but it’s considered a seminal paper for the NoSQL movement, and the ideas behind it are the basis for a family of databases called wide column stores. It’s also integral to many Google projects, and is the data storage method exposed by App Engine, so you can still use it today. Bigtable uses MVCC for writes, and as a result is able to do fast writes which scale well. It also supports indexing for queries.
  10. History
  11. Map and reduce functions need to be order independent. Another scale-out architecture.
  12. Started by Doug Cutting of the Internet Archive and Mike Cafarella of the University of Washington. Cutting went to Yahoo in 2006. http://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/
  13. Another influential system that never shipped publicly was Amazon’s Dynamo. Presented at SOSP 2007 and hosted on the All Things Distributed blog, the Amazon Dynamo paper was every bit as exciting as the Bigtable paper. From the paper, Dynamo is “a highly available key-value storage system that some of Amazon’s core services use to provide an ‘always-on’ experience. To achieve this level of availability, Dynamo sacrifices consistency”. Paper is at http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
  14. As in the previous note, Dynamo itself never shipped as a product. Paper is at http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf and info on its successor, DynamoDB, is at http://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html
  15. From the paper: Updates in the presence of network partitions and node failures can potentially result in an object having distinct version sub-histories, which the system will need to reconcile in the future. This requires us to design applications that explicitly acknowledge the possibility of multiple versions of the same data (in order to never lose any updates).
  16. A somewhat artificial acronym for a very real thing. We’ll cover this in more detail in just a moment.
  17. A very made up acronym for a much more vague idea than ACID. It’s another way of saying “not ACID”. It’s also appropriate for some use cases… but beware using it where it’s not. The most famous problem with this approach came from two different Bitcoin exchanges that went out of business because they relied on eventual consistency. So, for things like survey data, cat pictures, forum postings (for some forums, but not others, like in finance), BASE is fine. For anything having to do with money, or regulatory compliance, or inventory, etc, use ACID.
  18. Consistency is a function of the other three properties, Durability, Isolation and Atomicity. So, this is essentially a summary of the preceding slides. As a side note, Eventually Consistent is really just marketing speak: if you’re only consistent eventually, you’re Essentially Inconsistent. Slide originally from Mike Bowers (but since modified), presented at MarkLogic World 2013
  19. Database, not a filesystem. Not a cache (without a store). So, not Hadoop, not memcache (but memcachedb). Cluster friendly is about more than just running in an AMI - it means running on commodity hardware.
  20. There are easily over 200 different NoSQL database systems, and they vary wildly in features and design centers.
  21. Key value stores are “Hashtables in the sky”.
  22. Redis is an open-source, networked, in-memory, key-value data store with optional durability. The most popular KV store. As mentioned previously, it’s also considered a “data structure server”.
  23. The ability to do a clustered shared-nothing distribution of data is currently in Beta
  24. Like key value stores, but by also allowing for additional structure in the value stored, new possibilities open up for things like indexing, search and aggregation. Mongo is the most used Document DB, while MarkLogic is the largest NoSQL database company from a revenue perspective (according to independent web site estimates)
  25. Binary JSON (BSON) oriented document database, with sharding and eventual consistency. First stable release in 2010.
  26. Bigtable, Cassandra, Hadoop HBase, Apache Accumulo: all from the Bigtable starting point, and share that general architecture. But really, it’s almost all Cassandra, from a market share perspective.
  27. Data model like Bigtable; distribution model like Dynamo; built by Facebook in 2008; Apache project in 2010.
  28. Great for: recommendations, social network analysis, shortest path, asset management. Examples: Neo4J, Allegro, Titan, Objectivity. These are databases where the primary things tracked are nodes (also called vertices) and the connections between those nodes, called edges. Neo4J dominates the market from a share perspective.
  29. The Semantic Web and Linked Open Data are really just a special case of Graph Databases.
  30. Semantics is a new way of organizing and searching information. Data are modeled as triples: the combination of a subject, predicate and object, or fact. For example, “John Smith lives in London” is a fact, and so is “London is in England”. Each of those facts can be modeled as a triple. Any human would look at those two facts and immediately know that John Smith lives in England. With rules, MarkLogic Semantics can achieve the same result. [CLICK] Even though we never explicitly say that John Smith lives in England, we can query MarkLogic and find that it’s true.
  31. There are a large and growing number of Linked Open Data sets available, and more are coming every day. These data sets are in a form that makes them easily consumed. That’s really important, and we’ll describe what that form looks like in a minute. Examples:
      - DBpedia (Wikipedia as triples): Einstein was born in Germany; Ireland's currency is the Euro
      - GeoNames: Doha is the capital of Qatar; Doha has these lat/long coords
      - Others: Data.gov, data.gov.uk, Legislation, Where the money goes, World Bank Linked Data, Patents.data.gov, reference.data.gov, BBC Programmes, BBC Music, BBC Wildlife