My attempt to demystify datastores.
How do you choose a store that fits your needs? What questions do you need to ask?
HBase, Hadoop, MySQL, Cassandra, Vertica, etc.
13. Base 0 : The Data itself
● Row having columns
● Key - Value
○ Key - Blob (think: object)
○ Key - Document (think: JSON / XML)
● Graph (nodes/edges, somewhat like key-value)
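A minimal sketch of the data shapes listed above, using one order expressed four ways. The field names and the "PLACED" edge label are assumptions made for illustration, not something any particular store prescribes.

```python
import json

# 1. Row having columns: a fixed, ordered set of attributes.
columns = ("order_id", "customer", "bill_amount")
row = ("order-id-123", "customer-1", 5)

# 2a. Key - Blob: the store only sees opaque bytes;
#     the application has to decode them to read a field.
kv_blob = {
    "order-id-123": json.dumps(
        {"customer": "customer-1", "bill_amount": 5}
    ).encode("utf-8")
}

# 2b. Key - Document: the value has structure the store can look into.
kv_document = {
    "order-id-123": {"customer": "customer-1", "bill_amount": 5}
}

# 3. Graph: nodes plus labelled edges, kept here as an adjacency map.
graph = {"customer-1": [("PLACED", "order-id-123")]}


def bill_from_blob(key):
    """Decode the blob in the application before reading a field."""
    return json.loads(kv_blob[key].decode("utf-8"))["bill_amount"]
```

The blob and the document hold the same information; the difference is whether the store itself can see inside the value.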
16. Base 1 : How is the Data Stored ?
Let’s consider a Sample Data Record/Row
order-id-123 | customer-1 | 5$ bill amount | Bugis Street | 1$ tax | 3 items
● Each field is a Column / Attribute
● order-id-123 is a possible PrimaryKey column
18. Base 1 : How is the Data Stored ?
Approach 1
● Store all columns of the Row side by side (i.e. TOGETHER) on disk.
● This is generally referred to as a ROW based DataStore.
20. Base 1 : How is the Data Stored ?
Approach 1
● Useful for use cases like “showing ENTIRE Order on UI”
● The entire row can typically be fetched in one disk access, because its columns sit next to each other
order-id-123 | customer-1 | 5$ bill amount | Bugis Street | 1$ tax | 3 items
22. Base 1 : How is the Data Stored ?
Approach 2
● Store Columns SEPARATELY, so that they can be accessed
independently.
● This is generally referred to as a COLUMN based DataStore.
24. Base 1 : How is the Data Stored ?
Approach 2
● Useful for aggregates like Avg(billing_amount) or Sum(Items)
● Instead of fetching the entire row, fetch only the columns needed for the computation
○ i.e. less data fetched from disk = REDUCED IO
order-id-123 | customer-1 | 5$ bill amount | Bugis Street | 1$ tax | 3 items
order-id-121 | customer-1 | 2$ bill amount | Bugis Street | 2$ tax | 1 item
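A rough sketch of Approach 1 versus Approach 2 for the two sample orders, done in memory. It counts "values touched" as a stand-in for disk IO, which is a simplification; the column names are assumptions made for this example.

```python
# The two sample orders:
# (order_id, customer, bill_amount, street, tax, items)
orders = [
    ("order-id-123", "customer-1", 5, "Bugis Street", 1, 3),
    ("order-id-121", "customer-1", 2, "Bugis Street", 2, 1),
]
names = ("order_id", "customer", "bill_amount", "street", "tax", "items")

# Row layout: all values of one record sit together.
row_store = list(orders)

# Column layout: all values of one attribute sit together.
column_store = {
    name: [row[i] for row in orders] for i, name in enumerate(names)
}


def avg_bill_row_store():
    """Reads whole rows even though only one column is needed."""
    values_read = sum(len(row) for row in row_store)
    total = sum(row[2] for row in row_store)
    return total / len(row_store), values_read


def avg_bill_column_store():
    """Reads only the bill_amount column."""
    bills = column_store["bill_amount"]
    return sum(bills) / len(bills), len(bills)
```

Both functions return the same average, but the row layout touches 12 values and the column layout touches 2. The gap grows with the number of columns in the table.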
25. Base 1 : How is the Data Stored ?
Approach 2
● What are the other optimisations for a column store ?
○ Imagine 4 rows with a column, say ‘age’
■ Row 1 - 28
■ Row 2 - 30
■ Row 3 - 28
■ Row 4 - 28
26. Base 1 : How is the Data Stored ?
Approach 2
● If you SORT the values before storing them on disk, you can
also compress them:
28,28,28,30 (sorted -> good for search now)
28(3),30 (now compressed -> 28 stored once)
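The sort-then-compress idea above is essentially run-length encoding. A minimal sketch on the ‘age’ column follows; real column stores use their own encodings, so treat this only as an illustration of the principle.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


ages = [28, 30, 28, 28]           # Row 1..4 as stored originally
sorted_ages = sorted(ages)        # 28, 28, 28, 30
runs = rle_encode(sorted_ages)    # 28 stored once with a count of 3
unsorted_runs = rle_encode(ages)  # more runs, so less compression
```

Sorting first gives 2 runs instead of 3 here. Note that sorting one column on its own loses the link back to the original rows, so a store has to keep the other columns in the same order or keep a row mapping.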
27. Base 1 : How is the Data Stored ?
Typically :
● MySQL / Postgres = ROW based
● Vertica / Infobright / Druid = COLUMN based
29. Base 1 : How is the Data Stored ?
Approach 2.5
● Store a Group of Columns TOGETHER, but store each group separately.
● This is generally referred to as a COLUMN-family based DataStore.
31. Base 1 : How is the Data Stored ?
Approach 2.5
Logically group the columns.
Typically: HBase / Cassandra
(Figure: the columns of the sample record split into logical groups)
order-id-123 | customer-1 | 5$ bill amount | Bugis Street | 1$ tax | 3 items
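A small sketch of the column-family idea for the sample record. Which columns belong to which family is an assumption made here for illustration (the grouping is not stated in the text), and the nested-dict layout is only a mental model, not how HBase or Cassandra store bytes on disk.

```python
row_key = "order-id-123"

# Hypothetical grouping of the sample record's columns into two families.
families = {
    "customer": {"customer": "customer-1", "street": "Bugis Street"},
    "billing": {"bill_amount": 5, "tax": 1, "items": 3},
}

# Each family is stored separately: family -> row key -> columns.
store = {family: {row_key: cols} for family, cols in families.items()}


def read_family(family, key):
    """Fetch one family of a row without touching the other families."""
    return store[family][key]
```

Reading `read_family("billing", "order-id-123")` returns the billing columns together, like a row store would, while the customer columns are never touched, like a column store would.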
36. Base 2 : The Indexing
● What kind of Data Structure is used ?
○ B-tree, Inverted Index, Fractal Tree, Clustered Key, BitMap, No Index ?
● Certain types of queries favour certain indexes
○ Range queries suit a B-tree; insert-heavy workloads suit a Fractal Tree.
● What's the index loading mechanism ?
○ e.g. Redis is memory bound: the dataset has to fit in RAM.
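To make "certain queries favour certain indexes" concrete, here is a sketch with two toy indexes over the same orders. A sorted list searched with `bisect` stands in for a B-tree (it is not one, but it shares the property that keys are kept in order), and a word-to-ids map plays the inverted index. The third order is a made-up record added so the queries have something to exclude.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

orders = {
    "order-id-121": {"bill_amount": 2, "street": "Bugis Street"},
    "order-id-123": {"bill_amount": 5, "street": "Bugis Street"},
    "order-id-130": {"bill_amount": 9, "street": "Orchard Road"},  # hypothetical
}

# Ordered index on bill_amount: good for range queries.
sorted_index = sorted((o["bill_amount"], oid) for oid, o in orders.items())
keys = [amount for amount, _ in sorted_index]


def range_query(low, high):
    """Order ids with low <= bill_amount <= high, found by binary search."""
    lo = bisect_left(keys, low)
    hi = bisect_right(keys, high)
    return [oid for _, oid in sorted_index[lo:hi]]


# Inverted index on street words: good for "which orders mention X".
inverted = defaultdict(set)
for oid, o in orders.items():
    for word in o["street"].lower().split():
        inverted[word].add(oid)


def search(word):
    """Order ids whose street contains the given word."""
    return sorted(inverted.get(word.lower(), set()))
```

The ordered index answers `range_query(2, 5)` without scanning every order, but it is no help for the word search; the inverted index is the opposite.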
38. Base 3 : The Theorem
● Most Datastores do
○ Horizontal scaling
○ Sharding
● So here is the catch: in the event of a Network Partition (the P in CAP),
○ How are Consistency / Availability handled ?
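A toy illustration of the trade-off during a partition, assuming a cluster of just two replicas. The class names and the "CP" / "AP" mode strings are made up for this sketch; real datastores are far more nuanced and often tunable per request.

```python
class Replica:
    def __init__(self):
        self.data = {}


class TinyCluster:
    """Two replicas; mode is "CP" or "AP" (labels used only in this sketch)."""

    def __init__(self, mode):
        self.mode = mode
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False  # True = a and b cannot talk to each other

    def write(self, key, value):
        """Write arrives at replica a. Returns True if it was accepted."""
        if not self.partitioned:
            self.a.data[key] = value
            self.b.data[key] = value
            return True
        if self.mode == "CP":
            # Prefer consistency: refuse rather than let replicas diverge.
            return False
        # Prefer availability: accept locally, replica b becomes stale.
        self.a.data[key] = value
        return True

    def read_from_b(self, key):
        return self.b.data.get(key)
```

With the partition on, the "CP" cluster rejects the write and both replicas still agree, while the "AP" cluster accepts it and a read from replica b returns the old value until the replicas can sync again.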
43. Base 4 : Apart from the CAP theorem
● ACID ?
○ Transaction commit/rollback support
● BASE ?
○ Basically Available, Soft State, Eventual Consistency ?
● Can I do joins if data is sharded ?
○ What about Distribution awareness ?
● The Query Interface (major concern ?)
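On the question of joins over sharded data: one common answer is to shard both tables on the join key, so matching rows land on the same shard and each shard can join locally. A minimal sketch follows, with a hypothetical second customer and order, two shards, and CRC32 as an arbitrary stable hash.

```python
import zlib

NUM_SHARDS = 2


def shard_for(key):
    """Stable hash placement (Python's built-in hash() varies per process)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


# Hypothetical tables for illustration.
customers = [("customer-1", "Bugis Street"), ("customer-2", "Orchard Road")]
orders = [
    ("order-id-123", "customer-1", 5),
    ("order-id-121", "customer-1", 2),
    ("order-id-200", "customer-2", 7),
]

customer_shards = [dict() for _ in range(NUM_SHARDS)]
order_shards = [[] for _ in range(NUM_SHARDS)]

for cid, street in customers:
    customer_shards[shard_for(cid)][cid] = street
for oid, cid, amount in orders:
    # Distribution awareness: orders are placed by customer id, not order id,
    # so every order lives on the same shard as its customer.
    order_shards[shard_for(cid)].append((oid, cid, amount))


def local_join(shard):
    """Join orders to customers using only data on one shard."""
    return [
        (oid, cid, customer_shards[shard][cid], amount)
        for oid, cid, amount in order_shards[shard]
    ]


def joined():
    return sorted(row for s in range(NUM_SHARDS) for row in local_join(s))
```

If orders were sharded by order id instead, `local_join` could hit a customer that lives on another shard, and the join would need a network hop or a broadcast of one table.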