Demystifying datastores

Vishnu Rao
MySQL Enthusiast
Doodle maker
Senior Data Engineer @ DataSpark
Formerly @ flipkart.com

The comma separated list ...
● Hadoop, HBase, RocksDB
● MySQL, MariaDB, Postgres
● Cassandra, MongoDB
● Druid, Redis, MemSQL
● Elasticsearch, Solr
● CockroachDB, CouchDB
● Vertica, Infobright
● Redshift, DynamoDB
● S3, OpenStack Swift ...

The FUN-damental question:
Which one should I use?

Let's try to look at the problem from
the view of the database

First, let's play some baseball ...

Base 0 : The Data itself
● Row having columns
● Key - Value
○ Key - Blob (think object)
○ Key - Document (think JSON / XML)
● Graph (nodes/edges, kind of like key-value)
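
A minimal sketch of the shapes listed above, written as plain Python values. The field names and the blob bytes are made up for illustration; only the order id, customer and street values come from the sample record used later in the deck.

```python
# Row having columns: a fixed, ordered set of fields.
row = ("order-id-123", "customer-1", 5, "Bugis Street", 1, 3)

# Key - Blob: the value is opaque bytes, the store cannot look inside it.
key_blob = {"order-id-123": b"\x00\x01opaque-object-bytes"}

# Key - Document: the value has structure (think JSON) that can be queried.
key_document = {
    "order-id-123": {"customer": "customer-1", "bill_amount": 5, "items": 3}
}

# Graph: nodes and edges, which can itself be held as key -> neighbours.
graph = {
    "customer-1": ["order-id-123"],
    "order-id-123": ["Bugis Street"],
}

def neighbours(node):
    """Return the nodes directly connected to `node` (empty if unknown)."""
    return graph.get(node, [])
```

The point is only that each model answers "what does one record look like?" differently; it says nothing yet about how the bytes land on disk.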

Base 1 : How is the Data Stored ?

Let's consider a sample data record/row:
order-id-123 | customer-1 | 5$ bill amount | Bugis Street | 1$ tax | 3 items
● Each value is a column / attribute
● One column (e.g. the order id) is a possible primary key column

Approach 1
● Store all columns of the row side by side (i.e. TOGETHER) on disk.
● This is generally referred to as a ROW based datastore.
● Useful for use cases like "showing the ENTIRE order on the UI"
● The entire row is fetched in one disk access
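
A rough sketch of the row-based idea, assuming a hypothetical fixed-width layout of five integers per row (order id, customer id, bill amount, tax, items). The street is left out to keep the record fixed-width. Real engines use pages and variable-length fields, so treat this only as an illustration of "columns side by side".

```python
import struct

# Hypothetical layout: five little-endian 4-byte integers per row.
ROW_FORMAT = "<iiiii"
ROW_SIZE = struct.calcsize(ROW_FORMAT)  # 5 * 4 bytes

rows = [
    (123, 1, 5, 1, 3),  # order 123, customer 1, 5$ bill, 1$ tax, 3 items
    (121, 1, 2, 2, 1),  # order 121, customer 1, 2$ bill, 2$ tax, 1 item
]

# Row store: every column of a row is written next to the others.
row_file = b"".join(struct.pack(ROW_FORMAT, *r) for r in rows)

def read_row(data, row_number):
    """Fetch one whole row with a single contiguous read."""
    offset = row_number * ROW_SIZE
    return struct.unpack(ROW_FORMAT, data[offset:offset + ROW_SIZE])
```

Showing a full order means one slice of `row_file`; computing an average over one column would still have to walk every row.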

Approach 2
● Store columns SEPARATELY, so that they can be accessed independently.
● This is generally referred to as a COLUMN based datastore.
● Useful for aggregates like Avg(billing_amount) or Sum(items)
● Instead of fetching the entire row, fetch only the columns needed for the computation
○ i.e. less data fetched from disk = REDUCED IO
order-id-123 | customer-1 | 5$ bill amount | Bugis Street | 1$ tax | 3 items
order-id-121 | customer-1 | 2$ bill amount | Bugis Street | 2$ tax | 1 item
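
A small sketch of the column-based idea, using the two sample orders. Each column lives in its own list, standing in for its own file on disk; the column names are assumptions made for this example.

```python
# Column store: one container per column (think: one file per column).
columns = {
    "order_id": [123, 121],
    "customer_id": [1, 1],
    "bill_amount": [5, 2],
    "tax": [1, 2],
    "items": [3, 1],
}

def average(column_name):
    """Avg over one column; no other column is read."""
    values = columns[column_name]
    return sum(values) / len(values)

def total(column_name):
    """Sum over one column; no other column is read."""
    return sum(columns[column_name])

avg_bill = average("bill_amount")
total_items = total("items")
```

The flip side: rebuilding one full order now needs a lookup in every column, which is why row stores suit "show the entire order" better.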

Approach 2
● What are the other optimisations for a column store?
○ Imagine 4 rows with a column, say 'age'
■ Row 1 - 28
■ Row 2 - 30
■ Row 3 - 28
■ Row 4 - 28

Approach 2
● While storing on disk, if you SORT and store, you can also think of compression:
28,28,28,30 (sorted -> good for search now)
28(3),30 (now compressed -> 28 stored once)
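
The sort-then-compress step above is essentially run-length encoding. A minimal sketch with the four 'age' values from the slide (function names are my own):

```python
from itertools import groupby

def run_length_encode(values):
    """Sort the column, then collapse each run of equal values to (value, count)."""
    return [(value, len(list(run))) for value, run in groupby(sorted(values))]

def run_length_decode(pairs):
    """Expand (value, count) pairs back into the sorted column."""
    return [value for value, count in pairs for _ in range(count)]

ages = [28, 30, 28, 28]
encoded = run_length_encode(ages)
decoded = run_length_decode(encoded)
```

Note that sorting throws away the original row order, so a real column store also needs some way to map values back to rows; this sketch ignores that.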

Typically:
● MySQL / Postgres = ROW based
● Vertica / Infobright / Druid = COLUMN based

Approach 2.5
● Store groups of columns TOGETHER, but store each group separately.
● This is generally referred to as a COLUMN-family based datastore.
● Logically group the columns.
● Typically: HBase / Cassandra
order-id-123
customer-1
5$ bill amount
Bugis Street
1$ tax 3 items
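
A sketch of the column-family idea. The grouping below ("customer" vs "billing") is an assumption for illustration; the slides do not say which columns belong together, and this is not the HBase or Cassandra API.

```python
# Hypothetical grouping of columns into families.
COLUMN_FAMILIES = {
    "customer": ["customer_id", "street"],
    "billing": ["bill_amount", "tax", "items"],
}

# One separate store per family, each keyed by the row key.
store = {family: {} for family in COLUMN_FAMILIES}

def put(row_key, row):
    """Split a row into its families and store each group separately."""
    for family, names in COLUMN_FAMILIES.items():
        store[family][row_key] = {name: row[name] for name in names}

def get(row_key, family):
    """Read only one family of a row; the other families are not touched."""
    return store[family][row_key]

put("order-id-123", {
    "customer_id": "customer-1",
    "street": "Bugis Street",
    "bill_amount": 5,
    "tax": 1,
    "items": 3,
})
billing = get("order-id-123", "billing")
```

Columns that are usually read together share a family, so a billing query reads one group instead of the whole row.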

Base 2 : The Indexing
● What kind of data structure is used?
○ B-tree, Inverted Index, Fractal Tree, Clustered Key, Bitmap, No Index?
● Certain types of queries like certain indexes
○ Range queries like B-trees, inserts like Fractal Trees.
● What's the index loading mechanism?
○ Redis is memory bound.
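
Why do range queries like an ordered index? A sketch using a sorted list plus binary search as a stand-in for a B-tree (it is not a B-tree, it only shares the "keys kept in order" property). The third order, order-id-130, is made up for the example.

```python
from bisect import bisect_left, bisect_right

# (bill_amount, order_id) pairs kept sorted by bill_amount.
bill_index = sorted([
    (5, "order-id-123"),
    (2, "order-id-121"),
    (9, "order-id-130"),  # hypothetical extra order
])
keys = [amount for amount, _ in bill_index]

def range_query(low, high):
    """Return order ids with low <= bill_amount <= high, cheapest first."""
    start = bisect_left(keys, low)    # first position with key >= low
    end = bisect_right(keys, high)    # first position with key > high
    return [order_id for _, order_id in bill_index[start:end]]

cheap_orders = range_query(2, 5)
```

Because the keys are ordered, the matching entries sit next to each other; a hash-style index would have to check every key for the same question.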

Base 3 : The (CAP) Theorem
● Most datastores do
○ Horizontal scaling
○ Sharding
● So here is the catch: in the event of a network partition,
○ How is Consistency / Availability handled?
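
A toy model of that choice, with a made-up `Replica` class and two made-up modes: "CP" refuses writes while it cannot reach its peers, "AP" accepts them locally and risks diverging. Real systems are far more nuanced; this only illustrates the trade-off.

```python
class Replica:
    """Toy replica; `mode` is 'CP' or 'AP' (labels chosen for this sketch)."""

    def __init__(self, mode):
        self.mode = mode
        self.value = None
        self.partitioned = False  # True when peers are unreachable

    def write(self, value):
        if self.partitioned and self.mode == "CP":
            # Prefer consistency: reject rather than diverge from peers.
            raise RuntimeError("unavailable during network partition")
        # Prefer availability (or no partition): accept the write locally.
        self.value = value
        return "ok"

cp = Replica("CP")
ap = Replica("AP")
cp.partitioned = True
ap.partitioned = True

ap_result = ap.write("order-id-123")
try:
    cp.write("order-id-123")
    cp_rejected = False
except RuntimeError:
    cp_rejected = True
```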

Base 4 : Apart from CAP theorem
● ACID ?
○ Transaction commit/rollback support
● BASE ?
○ Basically Available, Soft state, Eventual consistency ?
● Can I do joins if data is sharded ?
○ What about distribution awareness ?
● The query interface (major concern ?)
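
What "transaction commit/rollback support" looks like in practice, sketched with Python's built-in sqlite3 module. SQLite is not on the list in this deck; it is used here only because it needs no setup. The table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, bill_amount INTEGER)")
conn.execute("INSERT INTO orders VALUES ('order-id-123', 5)")
conn.commit()

try:
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE orders SET bill_amount = 50 WHERE order_id = 'order-id-123'"
        )
        # Duplicate primary key: raises IntegrityError, undoing the UPDATE too.
        conn.execute("INSERT INTO orders VALUES ('order-id-123', 7)")
except sqlite3.IntegrityError:
    pass

amount = conn.execute(
    "SELECT bill_amount FROM orders WHERE order_id = 'order-id-123'"
).fetchone()[0]
```

Both statements succeed together or neither does; that all-or-nothing behaviour is the "A" in ACID.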

So, try to cover the bases & decide if you need it.
PS: There is no silver bullet

Thank you.
Vishnu Rao
jaihind213
sweetweet213
mash213.wordpress.com
linkedin.com/in/213vishnu
