2. Data Store
Super Set
Relational Databases
Key Value Stores
Document Stores
Column Family Stores
Tuesday, September 21, 2010
3. Design This Schema
(Entity diagram: Student, Course, Address, Score, and the relationships among them)
4. Scalable huh??
Use Case: this schema has to serve the entire student community in the world
One big server? How big?
More than one server? How will that work?
5. WHY NOSQL ?
Scalability : Horizontal
Relational databases fare poorly when distributed
NOSQL : Distributed, Flexible Schema, Relaxing Consistency
6. Issues with Relational DB
Scalability
Replication : Scaling by duplication
Partitioning(Sharding) : Scaling by division
7. Replication
Master-Slave
1 logical write = N physical writes (N is the number of slaves)
Faster reads (can read from any of N nodes)
Critical reads go to the master (application must be aware)
Limited at high volumes of data
8. Replication
Multi-Master
Adding more masters
Conflict resolution cost grows fast with the number of masters: O(n^2) or even O(n^3)
9. Partitioning(Sharding)
Scales reads as well as writes
Application needs to be partition aware
Broken relationships: how do you take Cartesian products (joins) across shards?
Referential integrity is no more
Rebalancing
10. Consistent Hashing
Hash Ring (Or Clock Face)
Balanced Distribution After Adding a new Node
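The hash-ring idea can be sketched in a few lines of Python (node and key names here are made up for illustration): each node hashes to a point on the ring, each key is owned by the first node clockwise from it, and adding a node only moves the keys that land on the new node's arc.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Map a name to a point on the ring [0, 2^32).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    """Minimal hash ring: each node owns the arc ending at its position."""

    def __init__(self, nodes=()):
        self._ring = []  # sorted list of (point, node)
        for n in nodes:
            self.add_node(n)

    def add_node(self, node: str):
        bisect.insort(self._ring, (_hash(node), node))

    def get_node(self, key: str) -> str:
        # First node clockwise from the key's point (wrapping around).
        points = [p for p, _ in self._ring]
        i = bisect.bisect(points, _hash(key)) % len(self._ring)
        return self._ring[i][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.get_node("student:42")
ring.add_node("node-d")  # only keys on node-d's arc change owners
```

The payoff is the last line: unlike `hash(key) % n`, growing the ring does not reshuffle every key.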
11. Common Sharding Schemes
Vertical Partitioning
Range Based Partitioning
Hash Based Partitioning
Directory Based Partitioning
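As a quick sketch of three of these schemes (shard names, id ranges, and the directory entry are all hypothetical):

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]  # hypothetical shard names

def range_shard(student_id: int) -> str:
    # Range-based: contiguous id ranges per shard.
    # Range scans are cheap, but sequential ids create hot spots.
    if student_id < 1_000_000:
        return SHARDS[0]
    if student_id < 2_000_000:
        return SHARDS[1]
    return SHARDS[2]

def hash_shard(student_id: int) -> str:
    # Hash-based: uniform spread, but a range query must hit every shard.
    h = int(hashlib.md5(str(student_id).encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

# Directory-based: an explicit lookup table owns the key -> shard mapping.
# Flexible, but the directory itself must be kept highly available.
directory = {"student:42": "shard-1"}
```

Vertical partitioning is the remaining scheme: splitting by column/feature (e.g. scores on one cluster, addresses on another) rather than by key.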
12. Can live without !!
UPDATE and DELETE
Loss of Information
Can be modeled as INSERT with versioning
Filter out inactive records
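A minimal sketch of this pattern (pure Python, names illustrative): every UPDATE and DELETE becomes an INSERT carrying a version, and reads filter down to the newest active record.

```python
import itertools

_version = itertools.count(1)
records = []  # append-only log of {key, version, value, active}

def upsert(key, value):
    # An "update" is just a new insert with a higher version.
    records.append({"key": key, "version": next(_version),
                    "value": value, "active": True})

def delete(key):
    # A "delete" is an insert of an inactive (tombstone) record.
    records.append({"key": key, "version": next(_version),
                    "value": None, "active": False})

def latest(key):
    # Filter to this key; the newest version wins; tombstones read as absent.
    versions = [r for r in records if r["key"] == key]
    if not versions:
        return None
    newest = max(versions, key=lambda r: r["version"])
    return newest["value"] if newest["active"] else None
```

Nothing is ever lost: the full history stays in `records`, and old versions can be compacted away later.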
13. Avoid JOINS
Expensive; breaks down across partitions
How to avoid?
De-normalize
Storage is cheap now
The burden of consistency shifts to the application
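A toy illustration of that trade (entity names are made up): course details get embedded into each student document so a single read needs no join, and the application code now carries the duplicate-update burden.

```python
# Normalized: separate "tables", joined at query time.
courses = {"CS101": {"title": "Intro to CS"}}

# De-normalized: course data duplicated into each student document,
# so "what is Ann enrolled in?" is a single-document read.
students = {
    1: {"name": "Ann",
        "courses": [{"id": "CS101", "title": "Intro to CS"}]},
}

def rename_course(course_id, new_title):
    # Consistency is now the application's job: the rename must be
    # propagated to every student document that embeds the course.
    courses[course_id]["title"] = new_title
    for student in students.values():
        for c in student["courses"]:
            if c["id"] == course_id:
                c["title"] = new_title
```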
14. Still need ACID ??
Atomicity: single-key atomicity is often enough
Consistency: the CAP theorem says you can only get two of
Consistency, Availability, Partition Tolerance
Isolation: no more than Read-Committed (single key)
Durability: survive node failures via peer replication
15. Fixed Schema
Schema comes before Data
Modifying Schema is essential
Adding new features
Modifying the schema is hard
Locking of rows (add/modify a column)
Locking of the table (add/remove an index)
16. Model this!!
Hierarchical Data
Graphs
17. Desired Characteristics
High Scalability
Add nodes incrementally
No Diminishing Returns
High Availability
No single point of failure
Agnostic to node failures
18. Desired Characteristics
High Performance
Fast operations
Non-Blocking Writes
Consistency
Strong consistency is not always needed
Eventual Consistency, Read-Your-Writes Consistency
19. Desired Characteristics
Deployment Flexibility
Add/Remove node automatically
NO DFS or shared storage
Should work with commodity, heterogeneous hardware
Modeling Flexibility
Key-Value Pairs, Hierarchical and Graph Data
20. Desired Characteristics
Query Flexibility
Multi Gets
Range Queries
Upserts
21. Inspiration
Memcached
In-memory Key Value
Blazing Fast
Infinite Horizontal Scalability
22. Key Value Stores
Simple Data Model
Amazon Dynamo
Amazon S3
Project Voldemort
Redis
Scalaris, and a lot of others
23. Amazon Dynamo
Internal to Amazon
Distributed K-V store
Opaque Values
Partitioning
A variant of consistent hashing
Hash Ring division
24. Amazon Dynamo
Partitioning
Mappings communicated via a gossip protocol
Eventually consistent view of the mappings
Replication
Each key is replicated on N nodes
Preference List
25. Amazon Dynamo
Replication
Read/Write through Coordinator nodes
Configurations
N = number of replicas
W = min. nodes that must ACK the receipt of a WRITE
R = min. nodes contacted for a READ
R+W > N will ensure Quorum
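The quorum condition is easy to verify by brute force; a small sketch comparing the R+W > N rule against an exhaustive check over all possible read/write sets:

```python
from itertools import combinations

def is_quorum(n: int, r: int, w: int) -> bool:
    # R + W > N guarantees every read set overlaps every write set,
    # so a read always touches at least one up-to-date replica.
    return r + w > n

def sets_always_intersect(n: int, r: int, w: int) -> bool:
    # Exhaustive check: every size-W write set shares at least one
    # replica with every size-R read set.
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))
```

With (N,R,W) = (3,2,2), any two acknowledging writers and any two contacted readers must overlap; with (3,1,1) they may miss each other entirely.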
26. Amazon Dynamo
Tuning (N,R,W)
Increased W means more nodes must acknowledge each write (more durable, slower writes)
Increased R means higher consistency, lower read performance
Typical values for Amazon apps: (N,R,W) = (3,2,2)
27. Amazon Dynamo
Consistency
Eventually consistent
Uses Object versioning via Vector Clocks
Consistency Protocol
Return all versions on read
Reconcile divergent versions
The reconciled version, superseding the ones it merges, is written back
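Vector clocks boil down to per-node counters; a minimal sketch (dict-of-counters, node names hypothetical) of the comparisons the protocol needs:

```python
def vc_descends(a: dict, b: dict) -> bool:
    # a supersedes b if a's counter is >= b's for every node.
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a: dict, b: dict) -> bool:
    # Neither version supersedes the other: a conflict to reconcile.
    return not vc_descends(a, b) and not vc_descends(b, a)

def vc_merge(a: dict, b: dict) -> dict:
    # Element-wise max gives a clock that supersedes both inputs,
    # i.e. the clock of the reconciled version.
    return {node: max(a.get(node, 0), b.get(node, 0))
            for node in set(a) | set(b)}
```

Two replicas that each accepted a write independently produce concurrent clocks; the reconciled version's merged clock then supersedes both.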
28. Amazon Dynamo
Handling Temporary Failures
Hinted Handoff
Handling Permanent Failures
Node Sync
29. Amazon Dynamo
Ring membership
Add/Remove node needs rebalancing
Failure Detection
Gossip about failures
Check periodically about availability and gossip
30. Other K-V Stores
Check out the others too; worth a read and a try:
S3, Voldemort, Redis, Scalaris
31. Document Stores
A step further from K-V stores
The value is a full-blown record (document)
The document is not opaque (it exposes a structure to perform operations on)
Each document can have a different schema, e.g. JSON
Relations are possible
One-to-Many and Many-to-Many
32. Document Stores
Mostly similar to a relational DB (except no upfront schema)
Amazon Simple DB
Apache CouchDB
Riak
Mongo DB
33. Mongo DB
We use Mongo in a large automated-translation system
Data Model
Key-Value, the value being binary-serialized JSON (BSON)
4 MB limit per BSON document
For larger objects use GridFS
Collections: more or less like a table
B-trees used for indexes
34. Mongo DB
Storage
Uses memory-mapped files (cache controlled by the OS VMM)
Writes
In place updates
partial updates
Single Document Atomic updates
35. Mongo DB
Queries
JSON-style query syntax (powered by a JS engine)
Support for conditional operators, regex, etc.
Cursor support
Query optimizers
Map-Reduce over a collection
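A real query would go through a MongoDB driver against a running server; as a driver-free sketch of the JSON-style conditional-operator syntax the slide refers to, here is a tiny matcher for a subset of it ($gt, $lt, $in; the field names and documents are made up):

```python
def matches(doc: dict, query: dict) -> bool:
    """Tiny subset of Mongo-style queries: plain equality plus the
    $gt / $lt / $in conditional operators."""
    for field, cond in query.items():
        value = doc.get(field)
        if isinstance(cond, dict):  # conditional-operator clause
            for op, arg in cond.items():
                if op == "$gt" and not (value is not None and value > arg):
                    return False
                elif op == "$lt" and not (value is not None and value < arg):
                    return False
                elif op == "$in" and value not in arg:
                    return False
        elif value != cond:  # plain equality match
            return False
    return True

students = [{"name": "Ann", "score": 91}, {"name": "Bob", "score": 68}]
passed = [d for d in students if matches(d, {"score": {"$gt": 70}})]
```

In the real system the same `{"score": {"$gt": 70}}` document is what you hand to the driver's `find` call.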
36. Mongo DB
Replication
Master Slave
Replica Pairs
Master - Master
37. Mongo DB
Partitioning
Auto-sharding done through chunks (50 MB max)
Easy node addition
Auto balancing
ZERO single point of failure
Automatic Failover
38. Column Family Stores
Sparse, Distributed, Persistent, Multi-Dimensional sorted Map
Column Keys are grouped into sets called column-families
BigTable
HBase
Cassandra
40. Cassandra
Combines the distributed architecture of Dynamo with the column-family data model of BigTable
41. Cassandra
Data Model: a multi-dimensional map indexed by a key
Each app has its own keyspace
The key can be an arbitrary string, indexed by Cassandra
Column: an attribute of a record, timestamped
Column Family: a grouping of columns, similar to a relational table
Super Columns: a list of columns
42. Cassandra
Data Model
A column family can contain either columns or super columns
KeySpace.ColumnFamily.Key.[SuperColumn].Column
Sorting
Data is sorted at write time
Columns are sorted within their row by column name (pluggable sorting providers)
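As a toy picture of this addressing scheme, nested Python dicts line up with KeySpace.ColumnFamily.Key.Column (the names "School", "Students", and the timestamps are invented for illustration):

```python
keyspace = {
    "School": {                                # keyspace (one per app)
        "Students": {                          # column family (like a table)
            "student:42": {                    # row key
                "name": ("Ann", 1285027200),   # column -> (value, timestamp)
                "grade": ("A", 1285027200),
            }
        }
    }
}

def get_column(ks: str, cf: str, key: str, col: str):
    # Walk the four levels of the address; drop the timestamp.
    value, _ts = keyspace[ks][cf][key][col]
    return value

def row_columns(ks: str, cf: str, key: str):
    # Cassandra keeps columns sorted by name within a row.
    return sorted(keyspace[ks][cf][key])
```

A super column would add one more dict level between the row key and the columns.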
43. Cassandra
Partitioning : Mostly Like Dynamo
Consistent hashing with an order-preserving hash function
Uses the Chord approach to load balancing (Dynamo used virtual nodes)
44. Cassandra
Replication
Coordinator nodes and a preference list, as in Dynamo
Data-center aware, rack aware, or rack-unaware placement
Rack-aware placement uses ZooKeeper
Membership based on Scuttlebutt, an anti-entropy gossip protocol
45. Cassandra
Failure Detection
A modified version of the Accrual failure detector
Failure Handling
Same as hinted handoff in Dynamo
46. Cassandra
Write
Write to the commit log, followed by an update to the memtable
A dedicated disk for the commit log (makes writes sequential)
No seeks, always sequential, so blazing fast
Atomic within a column family
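The log-then-memtable write path can be sketched in a few lines (row keys and timestamps are illustrative; on-disk SSTables and flushing are omitted): every write is a sequential append to the commit log, the sorted in-memory memtable takes the newest-timestamped value, and crash recovery replays the log.

```python
commit_log = []  # append-only, sequential: the durable record on disk
memtable = {}    # in memory: row key -> {column: (value, timestamp)}

def write(row_key, column, value, ts):
    commit_log.append((row_key, column, value, ts))  # sequential append first
    row = memtable.setdefault(row_key, {})
    current = row.get(column)
    if current is None or ts >= current[1]:
        row[column] = (value, ts)  # newest timestamp wins

def recover():
    # After a crash the memtable is rebuilt by replaying the commit log.
    rebuilt = {}
    for row_key, column, value, ts in commit_log:
        row = rebuilt.setdefault(row_key, {})
        current = row.get(column)
        if current is None or ts >= current[1]:
            row[column] = (value, ts)
    return rebuilt
```

Because the only disk I/O on the write path is the sequential append, there are no seeks, which is where the speed comes from.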
47. Cassandra
Read
Similar to Dynamo for figuring out which nodes will serve the read
Similar to BigTable at the storage level
48. Thanks!!!
Due regards to Reddy Raja for this invite.