  1. 1. NoSQL Technologies HBase | Cassandra | MongoDB | Redis Girish Khanzode
  2. 2. Contents • NoSQL – Horizontal Scalability – CAP Theorem – Gossip Protocol & Hinted Handoffs • HBase – HBase Data Model – HBase Regions – Column Families – HBase API • Redis NoSQL Database • Cassandra – Architecture Overview – Partitioning – Write Properties – Gossip Protocols – Accrual Failure Detector – Data Model – Tunable Consistency • CQL • Memcached Database • MongoDB • References
  3. 3. NoSQL • Not Only SQL • Class of non-relational data storage systems • Usually no fixed table schema • No concept of joins • Relax one or more of the ACID properties
  4. 4. NoSQL • Column Store Type – Each storage block contains data from only one column – More efficient than row (or document) store if • Multiple rows/records/documents are inserted at the same time so updates of column blocks can be aggregated • Retrievals access only some of the columns in a row/record/document • Document Store Type – Stores documents made up of tagged elements • Key-Value Store Type – Hash table of keys • Graph Database Type
  5. 5. Categories • Key-Value Store – Big hash table of keys & values • Products – Memcached – Membase – Redis – Data structure server – Riak – Amazon Dynamo based – Amazon S3 (Dynamo)
  6. 6. Categories • Schema-less - column-based, document-based, graph-based • Document-based Store - Stores documents made up of tagged elements (CouchDB, MongoDB) • Column-based Store - Each storage block contains data from only one column – Google BigTable – Cassandra – HBase • Graph-based - A network database that uses edges and nodes to represent and store data (Neo4J)
  7. 7. NoSQL Types Comparison
  8. 8. RDBMS Scaling - Master-Slave • All writes are written to the master • All reads performed against the replicated slave databases • Critical reads may be incorrect as writes may not have been propagated down • Large data sets can pose problems as master needs to duplicate data to slaves
  9. 9. RDBMS Scaling • Partition or Sharding – Scales well for both reads and writes – Not transparent, application needs to be partition-aware – Can no longer have relationships/joins across partitions – Loss of referential integrity across shards • Multi-Master replication • INSERT only, not UPDATES/DELETES • No JOINs, thereby reducing query time – Requires de-normalizing data • In-memory databases
  10. 10. RDBMS Limitations • One size does not fit all • Impedance mismatch • Rigid schema design • Harder to scale • Replication • Difficult to join across multiple nodes • Cannot easily handle data growth • Need a DBA
  11. 11. RDBMS Limitations • Many issues while scaling up for massive datasets • Not designed for distributed computing • Expensive specialized hardware • Multi-node databases considered as solutions - Known as ‘scaling out’ or ‘horizontal scaling’ – Master-slave – Sharding
  12. 12. Horizontal Scalability • Scale out • Easily add servers to existing system - Elastically scalable – Bugs, hardware errors, things fail all the time – Cost efficient • Shared-nothing architecture • Use commodity/cheap hardware • Heterogeneous systems
  13. 13. Horizontal Scalability • Controlled concurrency (avoids locks) • Service Oriented Architecture – Local states – Decentralized to reduce bottlenecks – Avoids single point of failures • Asynchronous • All nodes are symmetric
  14. 14. Horizontal Scalability
  15. 15. NoSQL Database Features • Large data volumes • Scalable replication and distribution – Potentially thousands of machines – Potentially distributed around the world • Queries need to return answers quickly • CAP Theorem • Open source development • Key/Value
  16. 16. NoSQL Database Features • Mostly query, few updates • Asynchronous Inserts & Updates • Schema-less • ACID transaction properties not needed – BASE • Schema-Less Stores – Richer model than key/value pairs – Eventual consistency – Distributed – Excellent performance and scalability – Downside - typically no ACID transactions or joins
  17. 17. Key-Value Store • A simple Hash table • Read and write values using a key – Get(key), returns the value associated with the provided key – Put(key, value), associates the value with the key – Multi-get(key1, key2, .., keyN), returns the list of values associated with the list of keys – Delete(key), removes the entry for the key from the data store
  18. 18. Key-Value Store • Pros – Very fast – Scalable – Simple model – Distribute horizontally • Cons – Many data structures (objects) not easily modeled – As data volume rises, maintaining unique values as keys is difficult
  19. 19. Document Store • The data is a collection of key-value pairs compressed into a document, so a document store is similar to a key-value store • Difference is that the values stored (documents) provide some structure and encoding of the managed data • XML, JSON (JavaScript Object Notation), BSON (binary JSON objects) are some common standard encodings
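The difference from a plain key-value store can be sketched as follows: because values are encoded documents (JSON here), the store can inspect their structure and query by field. The class and method names are illustrative assumptions:

```python
import json

# Toy document store: values are JSON documents, so unlike an opaque
# key-value blob the store can query inside them by field.
class DocumentStore:
    def __init__(self):
        self._docs = {}

    def put(self, key, document):
        self._docs[key] = json.dumps(document)  # store encoded document

    def get(self, key):
        raw = self._docs.get(key)
        return json.loads(raw) if raw is not None else None

    def find(self, field, value):
        # The encoding gives the store enough structure to match a field,
        # something a pure key-value store cannot do.
        return [json.loads(raw) for raw in self._docs.values()
                if json.loads(raw).get(field) == value]
```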
  20. 20. Column Store • Data stored in cells grouped in columns of data rather than rows • Columns logically grouped into column families • Families can contain a virtually unlimited number of columns that can be created at runtime or in the definition of the schema • Read and write is done using columns rather than rows • Benefit of storing data in columns is fast search/access and data aggregation • Stores all the cells corresponding to a column as a continuous disk entry, which makes search/access faster
  21. 21. Column Store - Data Model • ColumnFamily - A single structure that can group Columns and SuperColumns • Key - permanent name of the record. Keys have different numbers of columns, so the database can scale in an irregular way • Key-space - Defines the outermost level of an organization, typically the name of the application • Column - Ordered list of elements - Tuple with a name and a value defined
  22. 22. ACID Transactions - Atomic • Either the whole process is done or none • If transaction successful – commit • System responsible for saving all changes to database • If transaction unsuccessful - abort • System responsible for rollback of all changes
  23. 23. ACID Transactions - Consistent • Database constraints preserved • Enterprise rules limit occurrence of some real-world events • Customer cannot withdraw if balance less than minimum • These limitations are integrity constraints: assertions that must be satisfied by all database states (state invariants) • Isolated - User sees as if only one process executes at a time - two concurrent transactions will not see one another's transactions while “in flight”
  24. 24. ACID Transactions - Durable • Effects of a process not lost if the system crashes • System ensures that once a transaction commits, its effect on the database state is not lost despite subsequent failures • Database stored redundantly on mass storage devices to protect against media failure • Related to Availability - extent to which a (possibly distributed) system can provide service despite failure – Non-stop DBMS (mirrored disks) – Recovery based DBMS (log)
  25. 25. CAP Theorem • Brewer's Theorem by Prof. Eric Brewer, presented in 2000 at the University of California, Berkeley • Consistency: Every node in the system contains the same data • Replicas never out of date • Availability - Every request to a non-failing node in the system returns a response – System available during software and hardware upgrades and node failures – Traditionally thought of as server/process available for five 9's (99.999%) – For a large node system, at any point there's a good chance that a node is either down or there is a network disruption among the nodes • Need system resilience during network disruption
  26. 26. CAP Theorem
  27. 27. CAP Theorem • Partition Tolerance - System properties (consistency and/or availability) hold even when the system is partitioned (communication lost) and data is lost (node lost) • A system can continue to operate in the presence of network partitions • At most two of these three properties can be supported by any shared-data system • Scaling out requires partitioning • That leaves either consistency or availability to choose from • In almost all cases, availability is chosen over consistency
  28. 28. Eventual Consistency • BASE (Basically Available, Soft-state, Eventual consistency) • BASE is an alternative to ACID • Weak consistency – stale data OK • When no updates occur for a long period of time, eventually all updates propagate through the system and all the nodes are consistent • For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service • Availability first • Approximate answers
  29. 29. Eventual Consistency • Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system and all the replicas will be consistent • Conflict resolution – Read repair - The correction is done when a read finds an inconsistency. This slows down the read operation – Write repair - The correction takes place during a write operation, if an inconsistency has been found, slowing down the write operation – Asynchronous repair - The correction is not part of a read or write operation
  30. 30. NoSQL Advantages • Cheap - open source • Easy to implement • Data replicated to multiple nodes (identical and fault-tolerant) • Partitioned – Down nodes easily replaced – No single point of failure • Easy to distribute • No predefined schema • Scale up and down • Relax the data consistency requirement (CAP)
  31. 31. NoSQL Downsides • Joins • Group by • Order by • ACID transactions • SQL frustrating but still a powerful query language • Easy integration with other applications that support SQL
  32. 32. Gossip Protocol & Hinted Handoffs • Most preferred communication protocol in a distributed environment • All the nodes talk to each other pairwise • No global state • No single point of coordination • If one node goes down and there is a quorum, load for the down node is shared by others • Self-managing system • If a new node joins, load is also redistributed • Requests coming to node F are handled by node C. When F becomes available, it will get this information from C • Self-healing property
  33. 33. Gossip Protocol & Hinted Handoffs
  34. 34. HBASE
  35. 35. HBase • An open-source, distributed, column-oriented database built on top of HDFS based on BigTable • A distributed data store scalable horizontally to 1,000’s of commodity servers and petabytes of indexed storage • Designed to operate on top of the Hadoop distributed file system (HDFS) or Kosmos File System (KFS - Cloudstore) for scalability, fault tolerance and high availability
  36. 36. HBase History • Started by Chad Walters and Jim Kellerman • 2006.11 - Google releases paper on BigTable • 2007.2 - Initial HBase prototype created as Hadoop contribution • 2007.10 - First usable HBase • 2008.1 - Hadoop becomes Apache top-level project and HBase becomes its subproject • 2008.10 - HBase 0.18, 0.19 released
  37. 37. A Big Map • Row Key + Column Key + Timestamp => Value • Example: (1, Info:name, 1273516197868) => Sakis; (1, Info:age, 1273871824184) => 21; (1, Info:sex, 1273746281432) => Male; (2, Info:name, 1273863723227) => Themis; (2, Info:name, 1273973134238) => Andreas
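The "big map" keyed by (row key, column key, timestamp) can be sketched in a few lines; reads without an explicit timestamp return the most recent version of a cell, matching the versioned-cell behavior described on the surrounding slides (class and method names are illustrative):

```python
# Sketch of the big map: (row key, column key, timestamp) -> value.
# Each cell keeps multiple timestamped versions; a read without a
# timestamp returns the most recent one.
class BigMap:
    def __init__(self):
        self._cells = {}  # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        versions = self._cells.get((row, column), {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)  # latest version wins
        return versions.get(timestamp)
```

For the example above, row 2 has two versions of Info:name, so a plain read returns "Andreas" while a read at the older timestamp still sees "Themis".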
  38. 38. Why BigTable? • RDBMS performance good for transaction processing • Very large scale analytic processing solutions are commercial, expensive, and specialized • Very large scale analytic processing – Big queries – typically range or table scans – Big databases (100s of TB)
  39. 39. Why BigTable? • Map reduce on Bigtable with optional cascading on top to support some relational algebras - a cost effective solution • Sharding not a solution to scale open source RDBMS platforms – Application specific – Labor intensive (re)partitioning
  40. 40. HBase as Hadoop Component • HBase built on top of HDFS • HBase files internally stored in HDFS
  41. 41. HBase Data Model • Based on Google's Bigtable model - Key-Value pairs • HBase schema consists of several tables • Each table consists of a set of column families – Columns not part of schema • Tables sorted by row key
  42. 42. HBase Data Model • Dynamic Columns – Because column names are encoded inside the cells – Different cells can have different columns • Table schema only defines its column families – Each family has any number of columns – Each column consists of any number of versions – Columns only exist when inserted, NULLs are free – Columns within a family sorted and stored together • Everything except table names are byte[] • (Row, Family:Column, Timestamp) => Value
  43. 43. Components • Region – A subset of a table rows, like horizontal range partitioning – Automatic • RegionServer (many slaves) – Manages data regions – Serves data for reads and writes (using a log) • Master – Responsible for coordinating the slaves – Assigns regions, detects failures – Admin functions
  44. 44. HBase Members • Master – Monitors region servers – Load balancing for regions – Redirects clients to correct region servers – Current SPOF – Assigns regions, detects failures of Region Servers – Controls admin functions • Slaves – Region Servers – Region - A subset of table's rows – Serve data for reads and writes – Send heartbeat to Master
  45. 45. HBase Regions • Each HTable (column family) is partitioned horizontally into regions – Regions are counterpart to HDFS blocks
  46. 46. Regions • Contain an in-memory data store (MemStore) and a persistent data store (HFile) • All regions on a region server share a reference to the write-ahead log (WAL) which is used to store new data that hasn't yet been persisted to permanent storage and to recover from region server crashes • Each region holds a specific range of row keys, and when a region exceeds a configurable size, HBase automatically splits the region into two child regions, which is the key to scaling HBase
  47. 47. Regions
  48. 48. Logical View
  49. 49. Column Families • Each row has a Key • Each record is divided into Column Families • Each column family consists of one or more Columns
  50. 50. Column Families
  51. 51. HBase vs. HDFS • Both are distributed systems that scale to hundreds or thousands of nodes • HDFS is good for batch processing (scans over big files) – Not good for record lookup – Not good for incremental addition of small batches – Not good for updates
  52. 52. HBase vs. HDFS • HBase is designed to efficiently address the above points – Fast record lookup – Support for record-level insertion – Support for updates (not in place) • HBase updates are done by creating new versions of values
  53. 53. HBase vs. HDFS • If application has neither random reads nor writes, stick to HDFS
  54. 54. HBase vs. RDBMS
  55. 55. When to Use HBase • Random read, write or both are required • Need to do many thousands of operations per second on multiple TB of data • Access patterns are well-known and simple
  56. 56. [Table: rows "com.apache.www" and "com.cnn.www" with column families "contents:" and "anchor:", each cell versioned by timestamps t3-t15] • Key – Byte array – Serves as the primary key for the table – Indexed for fast lookup • Column Family – Has a name (string) – Contains one or more related columns • Column – Belongs to one column family – Included inside the row – familyName:columnName
  57. 57. [Table: same example, highlighting the version number stored with each row value] • Version Number – Unique within each key – By default the system timestamp – Data type is Long • Value (Cell) – Byte array
  58. 58. Data Model • Version number can be user-supplied – Does not even have to be inserted in increasing order – Version numbers are unique within each key • Table can be very sparse – Many cells are empty • Keys are indexed as the primary key
  59. 59. Physical Model • Each column family is stored in a separate file (called HTables) • Key & version numbers are replicated with each column family • Empty cells are not stored HBase maintains a multi-level index on values: <key, column family, column name, timestamp>
  60. 60. Architecture
  61. 61. Zookeeper and HBase • HBase depends on Zookeeper • Zookeeper is used to manage master election and server availability • Sets up a cluster, provides distributed coordination primitives • A tool for building cluster management systems
  62. 62. Connecting to HBase • Java client – get(byte [] row, byte [] column, long timestamp, int versions); • Non-Java clients – Thrift server hosting HBase client instance • Sample ruby, C++, & java (via thrift) clients – REST server hosts HBase client • TableInput / OutputFormat for MapReduce – HBase as MR source or sink • HBase Shell – JRuby IRB with “DSL” to add get, scan, and admin – ./bin/hbase shell YOUR_SCRIPT
  63. 63. Apache Thrift • $ hbase-daemon.sh start thrift • $ hbase-daemon.sh stop thrift • High performance, scalable, cross-language serialization and RPC framework • Created at Facebook along with Cassandra • A cross-language, service-generation framework • Binary Protocol (like Google Protocol Buffers) • Compiles to: C++, Java, Python, PHP, Ruby, Perl, …
  64. 64. HBase API • get(key) – Extract value given a key – get(row) • put(key, value) - Create or update the value given its key – put(row, Map<column, value>) • delete(key) -- Remove the key and its associated value • execute(key, operation, parameters) – operate on value given a key – List, Set, Map…
  65. 65. Hive HBase Integration • Reasons to use Hive on HBase – Large data in HBase for use in a real-time environment, but never used for analysis – Give access to data in HBase usually only queried through MapReduce to people that don't code (business analysts) – When needing a more flexible storage solution, so that rows can be updated live by either a Hive job or an application and be immediately visible to the other • Reasons not to do it – Running SQL queries on HBase to answer live user requests (it's still an MR job) – Hoping to see interoperability with other SQL analytics systems
  66. 66. Hive HBase Integration
  67. 67. HBase - Benefits • Distributed storage • Table-like in data structure - Multi-dimensional map • High scalability, availability and performance • No real indexes • Automatic partitioning • Scale linearly and automatically with new nodes • Commodity hardware • Fault tolerance • Batch processing
  68. 68. HBase Limitations • Tables have one primary index/key, the row key • Each row can have any number of columns • Table schema only defines column families (column family can have any number of columns) • Each cell value has a timestamp • No join operators • Scans and queries can select a subset of available columns using a wildcard
  69. 69. HBase Limitations • Lookups – Fast lookup using row key and optional timestamp – Full table scan – Range scan from region start to end • Limited atomicity and transaction support – Supports multiple batched mutations of single rows only – Data is unstructured and un-typed • No access via SQL – Programmatic access - Java,Thrift(Ruby, Php, Python, Perl, C++,..), Hbase Shell
  70. 70. REDIS
  71. 71. Redis NoSQL Database • Redis is an open source, advanced key-value data store • Often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets • Redis works with an in-memory dataset • It is possible to persist dataset either by – dumping the dataset to disk every once in a while – or by appending each command to a log
  72. 72. Redis NoSQL Database • Distributed data structure server • Consistent hashing at client • Non-blocking I/O, single threaded • Values are binary safe strings: byte strings • String: Key/Value pair, set/get. O(1) for many string operations • Lists: lpush, lpop, rpush, rpop. Use as a stack or queue. O(1)
  73. 73. Redis NoSQL Database • Publisher/Subscriber model • Set: collection of unique elements - add, pop, union, intersection - set operations. • Sorted set: unique elements sorted by scores. O(logn). Range operations • Hash: multiple key/value pairs – HMSET user 1 username foo password bar age 30 – HGET user 1 age
  74. 74. Architecture
  75. 75. Redis Keys • Keys are binary safe - it is possible to use any binary sequence as a key • The empty string is also a valid key • Too long keys are not a good idea • Too short keys are often also not a good idea ("u:1000:pwd" versus "user:1000:password") • Nice idea is to use some kind of schema, like: "object-type:id:field"
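The "object-type:id:field" schema above is only a naming convention, so a pair of tiny helpers is enough to keep it consistent across an application (the helper names here are illustrative, not part of Redis):

```python
# Illustrative helpers enforcing the "object-type:id:field" key schema
# suggested above. Redis itself treats keys as opaque byte strings;
# the schema lives entirely in application code.
def make_key(object_type, object_id, field):
    return f"{object_type}:{object_id}:{field}"

def parse_key(key):
    # Split on the first two colons only, so the field name may
    # itself contain colons.
    object_type, object_id, field = key.split(":", 2)
    return object_type, object_id, field
```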
  76. 76. Redis DataTypes • Redis is often referred to as a data structure server since keys can contain – Strings – Lists – Sets – Hashes – Sorted Sets
  77. 77. Redis Strings • Most basic kind of Redis value • Binary safe - can contain any kind of data, for instance a JPEG image or a serialized Ruby object • Max 512 Megabytes in length • Can be used as atomic counters using commands in the INCR family • Can be appended with the APPEND command
  78. 78. Redis Strings - Example
  79. 79. Redis Lists • Lists of strings, sorted by insertion order • Add elements to a Redis List pushing new elements on the head (on the left) or on the tail (on the right) of the list • Max length: (2^32 - 1) elements • Model a timeline in a social network, using LPUSH to add new elements, and using LRANGE in order to retrieve recent items • Use LPUSH together with LTRIM to create a list that never exceeds a given number of elements
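The LPUSH + LTRIM timeline pattern above can be sketched in pure Python (the class is a local illustration of the semantics, not a Redis client): new items go on the head, and trimming keeps only the newest N.

```python
from collections import deque

# Pure-Python sketch of the LPUSH + LTRIM capped-timeline pattern:
# push new items on the head, trim the tail so the list never
# exceeds a fixed number of elements.
class CappedTimeline:
    def __init__(self, max_items):
        # deque(maxlen=...) drops from the tail on appendleft,
        # which is exactly what LTRIM 0 max_items-1 would do.
        self._items = deque(maxlen=max_items)

    def lpush(self, item):
        self._items.appendleft(item)

    def lrange(self, start, stop):
        # Like Redis LRANGE, stop is inclusive (non-negative indexes only).
        return list(self._items)[start:stop + 1]
```

With `max_items=3`, pushing 1..5 leaves only the three most recent items, newest first.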
  80. 80. Redis Lists - Example
  81. 81. Redis Sorted Sets • Every member of a Sorted Set is associated with a score, which is used to keep the sorted set ordered, from the smallest to the greatest score • You can do a lot of tasks with great performance that are really hard to model in other kinds of databases • Probably the most advanced Redis data type
  82. 82. Redis Hashes • Map between string fields and string values • Perfect data type to represent objects HMSET user:1000 username antirez password P1pp0 age 34 HGETALL user:1000 HSET user:1000 password 12345 HGETALL user:1000
  83. 83. Redis Operations • It is possible to run atomic operations on data types: • Appending to a string • Incrementing the value in a hash • Pushing to a list • Computing set intersection, union and difference • Getting the member with highest ranking in a sorted set
  84. 84. CASSANDRA
  85. 85. Cassandra • Structured Storage System over a P2P Network • Was created to power the Facebook Inbox Search • Facebook open-sourced Cassandra in 2008 and it became an Apache Incubator project • In 2010, Cassandra graduated to a top-level project; regular updates and releases followed
  86. 86. Cassandra • High availability • Designed to handle large amount of data across multiple servers • Eventual consistency - trade-off strong consistency in favor of high availability • Incremental scalability • Optimistic Replication
  87. 87. Cassandra • “Knobs” to tune tradeoffs between consistency, durability and latency • Low total cost of ownership • Minimal administration • Tunable consistency • Decentralized - No single point of failure • Writes faster than reads • Uses consistent hashing (logical partitioning) when clustered.
  88. 88. Cassandra • Hinted handoffs • Peer-to-peer routing (ring) • Thrift API • Multi data center support • Mimics traditional relational database systems, but with triggers and lightweight transactions • Raw, simple data structures
  89. 89. Features • Emphasis on performance over analysis – Still supports analysis tools like Hadoop • Organization – Rows are organized into tables – First component of a table’s primary key is the partition key – Rows clustered by the remaining columns of the key – Columns may be indexed separately from the primary key – Tables may be created, dropped, altered at runtime without blocking queries
  90. 90. Features • Language – CQL (Cassandra Query Language) introduced, similar to SQL (flattened learning curve) • Peer-to-Peer cluster – Decentralized design • Each node has the same role – No single point of failure • Avoids issues of master-slave DBMS’s – No bottlenecking
  91. 91. Comparisons • Storage Type – Apache Cassandra: Column; Google BigTable: Column; Amazon DynamoDB: Key-Value • Best Use – Cassandra: write often, read less; BigTable: designed for large scalability; DynamoDB: large database solution • Concurrency Control – Cassandra: MVCC; BigTable: Locks; DynamoDB: ACID • Characteristics – Cassandra: high availability, partition tolerance, persistence; BigTable: consistency, high availability, partition tolerance, persistence; DynamoDB: consistency, high availability • Key Point – Cassandra offers a healthy cross between BigTable and Dynamo
  92. 92. Cassandra History • Google Bigtable (2006) – consistency model: strong – data model: sparse map – clones: HBase, Hypertable • Amazon Dynamo (2007) – O(1) DHT – consistency model: client-tunable – clones: Riak, Voldemort • Cassandra ~= Bigtable + Dynamo
  93. 93. Architecture Overview • Cassandra was designed with the understanding that system/hardware failures can and do occur • Peer-to-peer, distributed system • All nodes are the same • Data partitioned among all nodes in the cluster • Custom data replication to ensure fault tolerance • Read/Write-anywhere design
  94. 94. Architecture Overview
  95. 95. Architecture Overview • Google BigTable - data model – Column Families – Memtables – SSTables • Amazon Dynamo - distributed systems technologies – Consistent hashing – Partitioning – Replication – One-hop routing
  96. 96. Architecture
  97. 97. Transparent Elasticity • Nodes can be added and removed from Cassandra online, with no downtime being experienced
  98. 98. Transparent Scalability • Addition of Cassandra nodes increases performance linearly and the ability to manage TBs-PBs of data • Doubling the number of nodes doubles throughput (N => N x 2)
  99. 99. High Availability • Cassandra has no single point of failure due to peer-to-peer architecture
  100. 100. Multi-Geography - Zone Aware Cassandra allows a single logical database to span 1-N datacenters that are geographically dispersed. Also supports a hybrid on-premise/Cloud implementation
  101. 101. Partitioning • Nodes are logically structured in a Ring Topology • Hashed value of the key associated with a data partition is used to assign it to a node in the ring • Hashing rounds off after a certain value to support the ring structure • Lightly loaded nodes move position to alleviate highly loaded nodes
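The ring assignment above is consistent hashing, which can be sketched as: hash each node onto the ring, then route a key to the first node clockwise from the key's hash. This is a simplified illustration (real Cassandra uses tokens and virtual nodes; the hash choice here is arbitrary):

```python
import bisect
import hashlib

def _hash(value):
    # Any stable hash works for the sketch; md5 is used for determinism.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    """Toy consistent-hash ring: keys go to the first node clockwise."""

    def __init__(self, nodes):
        self._ring = sorted((_hash(n), n) for n in nodes)

    def node_for(self, key):
        positions = [pos for pos, _ in self._ring]
        # First node position past the key's hash, wrapping around the ring.
        i = bisect.bisect_right(positions, _hash(key)) % len(self._ring)
        return self._ring[i][1]
```

The useful property is that adding or removing one node only remaps the keys on that node's arc, instead of reshuffling everything as `hash(key) % N` would.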
  102. 102. Partitioning
  103. 103. Data Redundancy • Cassandra allows for customizable data redundancy so that data is completely protected • Supports rack awareness (data can be replicated between different racks to guard against machine/rack failures) • Uses Zookeeper to choose a leader which tells nodes the range they are replicas for
  104. 104. Data Redundancy
  105. 105. Operations • A client issues a write request to a random node in the Cassandra cluster • Partitioner determines the nodes responsible for the data • Locally, write operations are logged and then applied to an in-memory version • Commit log is stored on a dedicated disk local to the machine • Relies on local file system for data persistency
  106. 106. Operations • Write operations happen in 2 steps – Write to commit log in local disk of the node – Update in-memory data structure – Why 2 steps, and is there any preference in order of execution? • Read operation – Looks up the in-memory data structure first before looking up files on disk – Uses a Bloom Filter (summarization of keys in a file, stored in memory) to avoid looking up files that do not contain the key
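The Bloom filter mentioned above is a small bit array plus a few hash functions; a read can skip any on-disk file whose filter says the key is definitely absent, while "maybe present" can occasionally false-positive. A minimal sketch (sizes and hash scheme are illustrative):

```python
import hashlib

# Minimal Bloom filter sketch: "definitely absent" is exact,
# "might contain" can false-positive but never false-negative.
class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions per key by salting the hash.
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))
```

This is why the filter is kept in memory per SSTable: a few bit probes replace a disk seek for keys the file cannot contain.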
  107. 107. Consistency • Read Consistency – Number of nodes that must agree before a read request returns – ONE to ALL • Write Consistency – Number of nodes that must be updated before a write is considered successful – ANY to ALL – At ANY, a hinted handoff is all that is needed to return • QUORUM – Commonly used middle-ground consistency level – Defined as (replication_factor / 2) + 1
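The QUORUM formula uses integer division, which is worth spelling out: with a replication factor of 3 a quorum is 2 nodes, so one replica can be down while reads and writes still succeed, and any read quorum overlaps any write quorum.

```python
# QUORUM as defined above: (replication_factor / 2) + 1,
# using integer division.
def quorum(replication_factor):
    return replication_factor // 2 + 1
```

Because `quorum(rf) + quorum(rf) > rf`, a read quorum is guaranteed to include at least one node that saw the latest quorum write.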
  108. 108. Hinted Handoff Write • Write intended for a node that is offline • An online node, processing the request, makes a note to carry out the write once the node comes back online
  109. 109. Write Properties • No locks in the critical path • Sequential disk access • Behaves like a write-back cache • Append support without read-ahead • Atomicity guarantee for a key • Always Writable – accepts writes during failure scenarios
  110. 110. Write Operations • Stages – Logging data in the commit log – Writing data to the memtable – Flushing data from the memtable – Storing data on disk in SSTables • Commit Log – First place a write is recorded – Crash recovery mechanism – Write not successful until recorded in commit log – Once recorded in commit log, data is written to Memtable
  111. 111. Write Operations • Memtable – Data structure in memory – Once memtable size reaches a threshold, it is flushed (appended) to SSTable – Several may exist at once (1 current, any others waiting to be flushed) – First place read operations look for data • SSTable – Kept on disk – Immutable once written – Periodically compacted for performance
  112. 112. Write Operations
  113. 113. Read Repair • On read, nodes are queried until the number of nodes which respond with the most recent value meet a specified consistency level from ONE to ALL • If the consistency level is not met, nodes are updated with the most recent value which is then returned • If the consistency level is met, the value is returned and any nodes that reported old values are then updated
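The repair step described above can be sketched as: query the replicas, keep the value with the newest timestamp, and push that value back to any replica that returned stale data. The replica layout and (value, timestamp) representation here are illustrative assumptions:

```python
# Sketch of read repair: the newest-timestamped value wins and is
# written back to stale replicas as a side effect of the read.
class Replica:
    def __init__(self):
        self.data = {}  # key -> (value, timestamp)

def read_with_repair(replicas, key):
    # Collect each replica's (value, timestamp); missing keys count as oldest.
    responses = [(node, node.data.get(key, (None, 0))) for node in replicas]
    winner_value, winner_ts = max(
        (resp for _, resp in responses), key=lambda resp: resp[1]
    )
    for node, (_, ts) in responses:
        if ts < winner_ts:
            node.data[key] = (winner_value, winner_ts)  # repair stale replica
    return winner_value
```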
  114. 114. Read Repair
  115. 115. Delete Operations • Tombstones – On delete request, records are marked for deletion – Similar to Recycle Bin – Data is actually deleted on major compaction or configurable timer
  116. 116. Gossip Protocols • Used to discover location and state information about the other nodes participating in a Cassandra cluster • Network communication protocols inspired by real-life rumor spreading • Periodic, pairwise, inter-node communication • Low frequency communication ensures low cost
  117. 117. Gossip Protocols • Random selection of peers • Example – Node A wishes to search for a pattern in data – Round 1 – Node A searches locally and then gossips with node B – Round 2 – Nodes A, B gossip with C and D – Round 3 – Nodes A, B, C and D gossip with 4 other nodes … • Round by round doubling makes the protocol very robust
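The round-by-round doubling in the example can be simulated in the best case: if every informed node gossips with one uninformed peer per round, coverage doubles each round and the whole cluster is reached in O(log N) rounds (this idealized model is an assumption; real gossip picks peers randomly, so some rounds overlap).

```python
# Best-case simulation of gossip doubling: each informed node tells
# one new peer per round, so informed count doubles until the whole
# cluster of N nodes is reached in ceil(log2(N)) rounds.
def rounds_to_inform(cluster_size):
    informed = 1
    rounds = 0
    while informed < cluster_size:
        informed = min(informed * 2, cluster_size)
        rounds += 1
    return rounds
```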
  118. 118. Failure Detection • Gossip process tracks heartbeats from other nodes both directly and indirectly • Node fail state is given by variable Φ – tells how likely a node might fail (suspicion level) instead of a simple binary value (up/down) • This type of system is known as an Accrual Failure Detector • Takes into account network conditions, workload, or other conditions that might affect perceived heartbeat rate • A threshold for Φ is used to decide if a node is dead – If a node is correct, Φ will be a constant set by the application. Generally Φ(t) = 0
  119. 119. Failure Detection • Uses Scuttlebutt (a Gossip protocol) to manage nodes • Uses gossip for node membership and to transmit system control state • Lightweight with mathematically provable properties • State disseminated in O(log N) rounds where N is the number of nodes in the cluster • Every T seconds each member increments its heartbeat counter and selects one other member to send its list to • A member merges the received list with its own list
120. 120. Accrual Failure Detector • Valuable for system management, replication, load balancing, etc. • Node fail state is given by a variable 'phi' which tells how likely a node is to fail (a suspicion level) instead of a simple binary value (up/down) • Defined as a failure detector that outputs a value, PHI, associated with each process • Also known as Adaptive Failure Detectors - designed to adapt to changing network conditions
121. 121. Accrual Failure Detector • The value output, PHI, represents a suspicion level • Applications set an appropriate threshold, trigger suspicions and perform appropriate actions • In Cassandra the average time taken to detect a failure is 10-15 seconds with the PHI threshold set at 5
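The PHI mechanism above can be sketched in Python. This is a simplified model, not Cassandra's actual detector: it assumes heartbeat inter-arrival times are exponentially distributed with the observed mean, so PHI is just the elapsed time since the last heartbeat scaled by that mean (class and method names are illustrative):

```python
import math

# Sketch of a phi accrual failure detector. phi grows continuously the
# longer a heartbeat is overdue; a node is suspected once phi crosses
# the configured threshold (Cassandra's default is discussed as 5).
class AccrualDetector:
    def __init__(self, threshold=5.0):
        self.threshold = threshold
        self.intervals = []          # observed heartbeat inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # Under the exponential model, P(no heartbeat yet) = exp(-elapsed/mean),
        # and phi = -log10 of that probability.
        return elapsed / (mean * math.log(10))

    def suspected(self, now):
        return self.phi(now) > self.threshold
```

With heartbeats arriving every second, phi stays near zero; once heartbeats stop, phi climbs past the threshold and the node becomes suspect without ever flipping a hard up/down bit.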
122. 122. Performance Benchmark • Loading of data - limited by network bandwidth • Read performance for Inbox Search in production:
Latency | Search Interactions | Term Search
Min | 7.69 ms | 7.78 ms
Median | 15.69 ms | 18.27 ms
Average | 26.13 ms | 44.41 ms
  123. 123. Throughput Benchmark
124. 124. Data Model • Column: the smallest data element, a tuple with a name and a value. Querying (:Rockets, '1') might return: { 'name' => 'Rocket-Powered Roller Skates', 'toon' => 'Ready Set Zoom', 'inventoryQty' => '5', 'productUrl' => 'rockets1.gif' }
125. 125. Data Model • ColumnFamily - there is a single structure used to group both Columns and SuperColumns. Called a ColumnFamily (think table), it has two types, Standard & Super – Column families must be defined at startup • Key - the permanent name of the record • Keyspace - the outermost level of organization. This is usually the name of the application, for example 'Acme' (think database name)
126. 126. Data Model • Optional super column: a named list. A super column contains standard columns, stored in recent order • Suppose OtherProducts has inventory in categories • Querying (:OtherProducts, '174927') might return – {'OtherProducts' => {'name' => 'Acme Instant Girl', ..}, 'foods': {...}, 'martian': {...}, 'animals': {...}} • In the example, foods, martian, and animals are all super column names • They are defined on the fly, and there can be any number of them per row. :OtherProducts would be the name of the super column family
127. 127. Data Model • Columns and SuperColumns are both tuples with a name & value. The key difference is that a standard Column's value is a "string" while in a SuperColumn the value is a Map of Columns • Columns are always sorted by their name. Sorting supports: – BytesType – UTF8Type – LexicalUUIDType – TimeUUIDType – AsciiType – LongType • Each of these options treats the Columns' names as a different data type
128. 128. Tunable Consistency • Cassandra has tunable read/write consistency • Any - ensure that the write is written to at least 1 node • One - ensure that the write is written to at least 1 node's commit log and memory table before acknowledgement to the client • Quorum - ensure that the write goes to (replicas / 2) + 1 nodes • All - ensure that writes go to all nodes. An unresponsive node would fail the write
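The quorum arithmetic behind these levels is worth making explicit. A minimal sketch (illustrative helper names): quorum is a strict majority of replicas, and a read level R combined with a write level W is strongly consistent whenever R + W exceeds the replication factor, because the read set and write set must then overlap in at least one replica.

```python
# Quorum size for a given replication factor: a strict majority.
def quorum(replication_factor):
    return replication_factor // 2 + 1

# Classic overlap check: a read is guaranteed to see the latest
# acknowledged write when R + W > RF.
def strongly_consistent(read_level, write_level, replication_factor):
    return read_level + write_level > replication_factor
```

For example, with a replication factor of 3, QUORUM is 2 nodes, and QUORUM reads plus QUORUM writes (2 + 2 > 3) give read-your-writes behavior, while ONE + ONE (1 + 1 = 2) does not.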
129. 129. Consistent Hashing • Partition using consistent hashing – Keys hash to a point on a fixed circular space – The ring is partitioned into a set of ordered slots, and servers and keys are hashed over these slots • Nodes take positions on the circle • Suppose A, B, and D exist – B is responsible for the AB range – D is responsible for the BD range – A is responsible for the DA range • C joins – B and D split their ranges – C takes over the BC range from D
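The ring behavior described above can be sketched in Python (an illustrative toy, not Cassandra's partitioner; MD5 is used here only to place points deterministically on the ring). The key property to notice: when a node joins, only the keys in the slice it takes over change owner.

```python
import bisect
import hashlib

# Toy consistent-hash ring: nodes and keys hash to points on a circle,
# and a key belongs to the first node clockwise from its hash point.
class HashRing:
    def __init__(self, nodes=()):
        self.ring = []                       # sorted (point, node) pairs
        for node in nodes:
            self.add(node)

    def _point(self, item):
        return int(hashlib.md5(item.encode()).hexdigest(), 16)

    def add(self, node):
        bisect.insort(self.ring, (self._point(node), node))

    def lookup(self, key):
        point = self._point(key)
        index = bisect.bisect(self.ring, (point,)) % len(self.ring)
        return self.ring[index][1]
```

Adding node C after mapping a batch of keys shows the point of the slide: every key that changes owner moves to C, and nothing else is disturbed.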
130. 130. Key-Value Model • Cassandra is a column-oriented NoSQL system • Column families: sets of key-value pairs – think of a column family as a table and key-value pairs as a row (using a relational database analogy) • A row is a collection of columns labeled with a name
131. 131. Cassandra Row • The value of a row is itself a sequence of key-value pairs • Such nested key-value pairs are the columns • The key is the column name • A row must contain at least 1 column
  132. 132. Example of Columns
  133. 133. Column Names StoringValues • key: User ID • column names store tweet ID values • values of all column names are set to “-” (empty byte array) as they are not used
  134. 134. Key Space • A Key Space is a group of column families together. It is only a logical grouping of column families and provides an isolated scope for names
135. 135. Comparison with RDBMS • With an RDBMS, a normalized data model is created without considering the exact queries – SQL can return almost anything through joins • With C*, the data model is designed for specific queries – the schema is adjusted as new queries are introduced • C*: NO joins, relationships, or foreign keys – a separate table is leveraged per query – data required by multiple tables is denormalized across those tables
  136. 136. Compaction • Compaction runs periodically to merge multiple SSTables – Reclaims space – Creates new index – Merges keys – Combines columns – Discards tombstones – Improves performance by minimizing disk seeks • Types – Major – Read-only
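Logically, a major compaction is a merge of sorted tables that keeps only the newest version of each key and drops tombstoned keys. A minimal sketch (not Cassandra's SSTable format; tables are modeled here as dicts of key → (timestamp, value), with a sentinel standing in for a tombstone):

```python
# Sentinel marking a deleted value, standing in for a Cassandra tombstone.
TOMBSTONE = object()

# Merge several "SSTables": newest timestamp wins per key, and keys whose
# newest version is a tombstone are discarded during the major compaction.
def compact(sstables):
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in merged or merged[key][0] < ts:
                merged[key] = (ts, value)
    return {k: v for k, v in merged.items() if v[1] is not TOMBSTONE}
```

A key that was overwritten keeps only its newest value, and a key whose latest write is a delete disappears entirely, which is how compaction reclaims space.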
137. 137. Anti-Entropy • Replica synchronization mechanism • Ensures synchronization of data across nodes • Compares data checksums against neighboring nodes • Uses Merkle trees (hash trees) – Snapshot of data sent to neighboring nodes – Created and broadcast on every major compaction – If two nodes take snapshots within TREE_STORE_TIMEOUT of each other, the snapshots are compared and data is synced
  138. 138. Anti-Entropy
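The Merkle-tree comparison behind anti-entropy can be sketched in Python. This is a simplified illustration (assumes a power-of-two number of leaves and compares only root vs. leaf level, where a real implementation descends the tree level by level): if the roots match, the replicas are in sync and no data crosses the wire; otherwise only the differing ranges need to be exchanged.

```python
import hashlib

# Build a hash tree over a list of byte-string data ranges.
# Returns a list of levels, leaves first, root level last.
def merkle(leaves):
    level = [hashlib.sha256(x).digest() for x in leaves]
    tree = [level]
    while len(level) > 1:
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
        tree.append(level)
    return tree

# Compare two trees over the same ranges: identical roots mean nothing
# to sync; otherwise report the leaf indices (ranges) that differ.
def diff_ranges(tree_a, tree_b):
    if tree_a[-1] == tree_b[-1]:
        return []
    return [i for i, (a, b) in enumerate(zip(tree_a[0], tree_b[0])) if a != b]
```

Comparing a replica against itself yields no work, while a single corrupted range is pinpointed by index, which is exactly why checksum trees keep repair traffic small.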
139. 139. Cassandra Query Language - CQL • Creating a keyspace - a namespace of tables CREATE KEYSPACE demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}; • To use the namespace: USE demo;
140. 140. CQL – Create Table
CREATE TABLE users (
  email varchar,
  bio varchar,
  birthday timestamp,
  active boolean,
  PRIMARY KEY (email));
CREATE TABLE tweets (
  email varchar,
  time_posted timestamp,
  tweet varchar,
  PRIMARY KEY (email, time_posted));
141. 141. CQL • Insert – INSERT INTO users (email, bio, birthday, active) VALUES ('Tom.Stok@btx.com', 'Star Teammate', 516513612220, true); – Timestamp fields are specified in milliseconds since epoch • Query tables – A SELECT expression reads one or more records from a Cassandra column family and returns a result-set of rows – SELECT * FROM users; – SELECT email FROM users WHERE active = true;
  142. 142. Cassandra Advantages • Perfect for time-series data • High performance • Decentralization • Nearly linear scalability • Replication support • No single points of failure • MapReduce support
  143. 143. CassandraWeaknesses • No referential integrity – no concept of JOIN • Querying options for retrieving data are limited • Sorting data is a design decision – no GROUP BY • No support for atomic operations – if operation fails, changes can still occur • First think about queries, then about data model
  144. 144. Key Points • Cassandra is designed as a distributed database management system – use it when you have a lot of data spread across multiple servers • Cassandra write performance is always excellent, but read performance depends on write patterns – it is important to spend enough time to design proper schema around the query pattern • having a high-level understanding of some internals is a plus – ensures a design of a strong application built atop Cassandra
145. 145. Hector – Java API for Cassandra • Sits on top of Thrift • Load balancing • JMX monitoring • Connection pooling • Failover • JNDI integration with application servers • Additional methods on top of the standard get, update, delete methods • Under discussion – hooks into Spring declarative transactions
  146. 146. Memcached Database • Key-Value Store • Very easy to setup and use • Consistent hashing • Scales very well • In memory caching, no persistence • LRU eviction policy • O(1) to set/get/delete • Atomic operations set/get/delete • No iterators or very difficult
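The O(1) set/get/delete with LRU eviction described above can be sketched with an ordered dictionary. This is a toy model of memcached's caching semantics, not its implementation (class name is illustrative): every access moves the entry to the most-recently-used end, and when capacity is exceeded the least-recently-used entry is evicted.

```python
from collections import OrderedDict

# Toy memcached-style store: fixed capacity, O(1) operations, LRU eviction.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None                     # cache miss
        self.data.move_to_end(key)          # mark as most recently used
        return self.data[key]

    def set(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict least recently used

    def delete(self, key):
        self.data.pop(key, None)
```

Note that, as with memcached itself, nothing persists: a "miss" after eviction simply returns None and the application must refetch from the backing store.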
  147. 147. MONGODB
  148. 148. MongoDB • Publicly released in 2009 • Allows data to persist in a nested state • Query that nested data in an ad hoc fashion • Enforces no schema • Documents can optionally contain fields or types that no other document in the collection contains • NoSQL
  149. 149. MongoDB
  150. 150. MongoDB • Document-oriented database • Uses BSON format – Binary JSON • An instance may have zero or more databases • A database may have zero or more collections • A collection may have zero or more documents • A document may have one or more fields • Indexes function like RDBMS counterparts
151. 151. MongoDB • Data types: bool, int, double, string, object (BSON), oid, array, null, date • Databases and collections are created automatically • Language drivers • Capped collections are fixed-size collections: buffers, very fast, FIFO, good for logs. No indexes • Object ids are generated by the client, 12 bytes of packed data - 4-byte time, 3-byte machine, 2-byte pid, 3-byte counter
152. 152. MongoDB • Possible to refer to other documents in different collections, but more efficient to embed documents • Replication is easy to set up. Read from slaves • Supports aggregation – Map Reduce with JavaScript • Indexes are B-Trees. Ids are always indexed
153. 153. MongoDB • Updates are atomic. Low-contention locks • Querying mongo is done with a document – Lazy, returns a cursor – Reducible to SQL: select, insert, update, limit, sort - upsert (either inserts or updates) – Operators - $ne, $and, $or, $lt, $gt, $inc • Repository Pattern for easy development
  154. 154. MongoDB • Full Index Support • Replication & High Availability • Auto-Sharding • Querying • Fast In-Place Updates • Map/Reduce
  155. 155. Architecture
156. 156. Comparison
RDBMS | MongoDB
Database | Database
Table, View | Collection
Row | Document (JSON, BSON)
Column | Field
Index | Index
Join | Embedded Document
Foreign Key | Reference
Partition | Shard
  157. 157. CRUD • Create – db.collection.insert( <document> ) – db.collection.save( <document> ) – db.collection.update( <query>, <update>, { upsert: true } ) • Read – db.collection.find( <query>, <projection> ) – db.collection.findOne( <query>, <projection> ) • Update – db.collection.update( <query>, <update>, <options> ) • Delete – db.collection.remove( <query>, <justOne> )
158. 158. Commands
# create a doc and save into a collection
p = {firstname:"Dave", lastname:"Ho"}
db.person.save(p)
db.person.insert({firstname:"Ricky", lastname:"Ho"})
# Show all docs within a collection
db.person.find()
# Iterate the result using a cursor
var c = db.person.find()
p1 = c.next()
p2 = c.next()
159. 159. Commands
# Query
p3 = db.person.findOne({lastname:"Ho"})
# Return a subset of fields (i.e. projection)
db.person.find({lastname:"Ho"}, {firstname:true})
# Delete some records
db.person.remove({firstname:"Ricky"})
# To build an index for a collection
db.person.ensureIndex({firstname:1})
160. 160. Commands
# To show all existing indexes
db.person.getIndexes()
# To remove an index
db.person.dropIndex({firstname:1})
# An index can be built on a path of the doc
db.person.ensureIndex({"address.city":1})
# A composite key can be used to build an index
db.person.ensureIndex({lastname:1, firstname:1})
161. 161. Commands
# Data update and transaction: to update an existing doc, we can do the following
var p1 = db.person.findOne({lastname:"Ho"})
p1["address"] = "San Jose"
db.person.save(p1)
# Do the same in one command
db.person.update({lastname:"Ho"}, {$set:{address:"San Jose"}}, false, true)
  162. 162. MongoDB Sharding • Config servers: Keeps mapping • Mongos: Routing servers • Mongod: master-slave replicas
163. 163. References • NoSQL - Your Ultimate Guide to the Non-Relational Universe! http://nosql-database.org/links.html • NoSQL, Wikipedia: http://en.wikipedia.org/wiki/NoSQL • PODC Keynote, July 19, 2000. Towards Robust Distributed Systems. Dr. Eric A. Brewer, Professor, UC Berkeley. Co-Founder & Chief Scientist, Inktomi. www.eecs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf • http://planetcassandra.org/functional-use-cases/ • http://marsmedia.info/en/cassandra-pros-cons-and-model.php • http://www.slideshare.net/adrianco/migrating-netflix-from-oracle-to-global-cassandra • http://wiki.apache.org/cassandra/CassandraLimitations • "Brewer's CAP Theorem", Julian Browne, January 11, 2009. http://www.julianbrowne.com/article/viewer/brewers-cap-theorem • "Scalable SQL", ACM Queue, Michael Rys, April 19, 2011. http://queue.acm.org/detail.cfm?id=1971597
164. 164. Thank You Check out my LinkedIn profile at https://in.linkedin.com/in/girishkhanzode