This document discusses strategies for storing and processing data at scale. It covers using replicas to scale read throughput and shards to scale write throughput; the key challenges are eventual consistency across replicas and the limited write throughput of a single node. Different sharding techniques (range, hash, and consistent hashing) are explained. Parallelizing data processing involves sharding the data among workers, making the process fault tolerant through lineage graphs, and optimizing parallelism through techniques such as filtering early and broadcasting small data. Worker management distributes tasks among nodes through frameworks like YARN and Mesos.
Scalable data systems at Traveloka
1. Scalable Data Systems
How to store and process data at scale, and what the challenges are
Rendy B. Junior - Data System Architect @ traveloka
So my friends… our data is growing; we need to scale where we store our data.
We all hope it is not an afterthought question...
3. Start simple - replicas
Replicas: the same set of data on multiple nodes.
Usually used to separate read load by its characteristics, e.g. production
traffic vs. analytical queries (sketched below).
Challenges: replicas are eventually consistent, and… write
throughput does not increase...
[Diagram: two nodes, each holding A, B, C; production writes go to one node while analytical queries read from its replica]
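A minimal Python sketch of this read/write split (the Node class and query strings are made up for illustration; a real client would hold actual database connections):

import random

class Node:
    """Stand-in for a database connection (hypothetical)."""
    def __init__(self, name):
        self.name = name
    def execute(self, query):
        return f"{self.name} ran: {query}"

class ReplicatedDB:
    """Route writes to the primary and reads to replicas."""
    def __init__(self, primary, replicas):
        self.primary = primary    # the single writable node
        self.replicas = replicas  # read-only copies of the same data

    def write(self, query):
        # All writes hit the primary; replicas catch up asynchronously,
        # which is why reads can be eventually consistent.
        return self.primary.execute(query)

    def read(self, query):
        # Spread read load (e.g. analytical queries) across replicas.
        return random.choice(self.replicas).execute(query)

db = ReplicatedDB(Node("primary"), [Node("replica-1"), Node("replica-2")])
db.write("INSERT ...")
print(db.read("SELECT ..."))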
4. Let’s scale write throughput - shard
Shards: a different subset of the data on each node.
Since you have more nodes to write to, you get
more write throughput.
[Diagram: one shard node holding A, B and another holding C; production reads and writes are spread across both]
6. Sharding Techniques
● Range Sharding
○ Efficient scans. Requires a coordinator. Eg. MongoDB
● Hash Sharding
○ Efficient key-based retrieval. No coordinator needed.
○ Must reassign all rows when rebalancing (see the sketch after this list)
● Consistent Hashing
○ Assigns keys to logical partitions rather than physical servers
○ No need to reassign all rows; just move the logical partitions
○ Eg. Cassandra, DynamoDB
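A rough Python sketch of the difference, assuming a stable hash and a fixed number of logical partitions (both illustrative; real systems such as Cassandra use token rings with virtual nodes):

import hashlib

def stable_hash(key: str) -> int:
    # Stable across processes (Python's built-in hash() is seeded per run).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

# Hash sharding: a key maps straight to a physical server.
# Changing num_servers moves almost every key, hence "reassign all rows".
def hash_shard(key: str, num_servers: int) -> int:
    return stable_hash(key) % num_servers

# Consistent hashing (simplified): a key maps to one of many fixed logical
# partitions; rebalancing only moves partition-to-server assignments.
NUM_PARTITIONS = 1024

def logical_partition(key: str) -> int:
    return stable_hash(key) % NUM_PARTITIONS

def owner(key: str, partition_map: dict) -> str:
    # partition_map: logical partition -> server; rebalancing edits this map
    # instead of rehashing the rows themselves.
    return partition_map[logical_partition(key)]

partition_map = {p: f"server-{p % 3}" for p in range(NUM_PARTITIONS)}
print(hash_shard("user:42", 3), owner("user:42", partition_map))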
7. Other names...
Usually in managed databases, where the sharding method is determined by default, shards go by other names:
● Distribution key, eg. Redshift
● Partition key, eg. DynamoDB
● And other names...
8. It’s simple!
1. Shard : distribute write load
2. Replica : distribute read (even more), high availability, durability
Congrats!
You have understood the fundamental concept of any scalable data system!!!
It’s all about shards and replicas
And in most cases, it is configurable...
10. Something people love, but distributed systems hate
Hotspot!
Hotspot is injustice…
"Injustice anywhere is a threat to justice everywhere." - Martin Luther King Jr.
How to solve: rebalance, or fix sharding config
Sometimes it is inevitable from the database's point of view, e.g. a single-row hotspot
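A toy sketch of the instrumentation that surfaces a hotspot: count requests per shard key and flag any key taking a disproportionate share (the 20% threshold and the key names are invented for illustration):

from collections import Counter

def find_hotspots(request_keys, threshold=0.2):
    """Flag shard keys that receive more than `threshold` of all traffic."""
    counts = Counter(request_keys)
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items() if n / total > threshold}

# A single hot row dominating the load:
traffic = ["user:42"] * 80 + ["user:7"] * 10 + ["user:9"] * 10
print(find_hotspots(traffic))  # {'user:42': 0.8}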
11. Scaling from Different View - Distributed File System
Storage and compute scale out
separately - loosely coupled.
Rebalancing compute is a lot faster,
since it only redistributes metadata.
Eg. BigTable / HBase
12. Serverless?
Fewer things to worry about, but less configurable
● Usually sharding is determined by default
● Usually rebalancing is a background process we don't see
● Usually the replica set is magic; we only know the SLA
That doesn't mean there are no problems at all...
There are limitations, or fewer features.
Also, serverless won't fix a bad design; e.g. a hotspot will happen anywhere
if you choose a bad key...
13. Some tips
● Always test how scalable your database is
○ Sometimes it works in theory, but not in practice
● Make sure you design it correctly before scaling up
○ Money doesn’t solve everything...
● Cloud? Vertical scale before horizontal
○ Why? Usually, the price of two 1-core machines ~ one 2-core machine
○ Why? Horizontal = more network overhead
● Beware of injustice!
○ I mean hotspots… put proper instrumentation
● Read more
○ New types of databases keep coming...
14. So how do we scale the data processing part...
It’s a distributed process, but it’s stateful...
15. Similar but different
Similar: a “shard” relates to the unit of parallelism - how many parts the data
is split into among workers.
But for fault tolerance, make the process resilient instead of adding more replicas.
Eg. Spark builds a lineage graph, so if one node fails it can recompute the
partitions lost to the failure.
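For instance, a PySpark job like the sketch below (assuming PySpark is installed and an events.txt input exists) builds its lineage graph lazily; only the final action runs it:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()

# Each transformation extends the lineage graph; nothing runs yet.
lines = spark.sparkContext.textFile("events.txt")
errors = lines.filter(lambda line: "ERROR" in line)
pairs = errors.map(lambda line: (line.split()[0], 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# The action triggers execution. If a node dies, Spark replays
# filter -> map -> reduceByKey from the source to rebuild only the
# partitions that were lost.
print(counts.take(10))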
16. More on Parallelism
You need to learn how to do this efficiently in your tools. Usually:
● Determine how many files & size you should have
● Determine how many CPUs for your job
Sample rules: x cores → x*n partitions, ~200MB per file (see the sketch below)
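Applied literally, those sample rules look like this in Python (the numbers are rules of thumb, not hard limits):

def num_partitions(total_cores: int, n: int = 3) -> int:
    # "x cores -> x*n partitions": keep every core busy across task waves.
    return total_cores * n

def num_files(dataset_bytes: int, target_mb: int = 200) -> int:
    # "~200MB per file": avoid many tiny files or huge unsplittable ones.
    return max(1, round(dataset_bytes / (target_mb * 1024 * 1024)))

print(num_partitions(16))        # 48 partitions for 16 cores (n = 3)
print(num_files(100 * 1024**3))  # ~512 files for a 100GB dataset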
Some tools inspect the data source to decide the number of workers (autoscaling).
Eg. Dataflow, using BoundedSource.estimate_size in the SDK
17. Another concept in processing - Shuffling
Shuffling redistributes data between workers by key (e.g. for joins and
aggregations), and it is usually the expensive part.
Some tips (the first two are sketched below):
- Filter early
- Broadcast small data
- Learn engine-specific optimizations...
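In PySpark, the first two tips could look like this (table names, paths, and columns are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.appName("shuffle-tips").getOrCreate()
events = spark.read.parquet("events/")        # assumed: a large fact table
countries = spark.read.parquet("countries/")  # assumed: a small lookup table

# Filter early: drop rows before the join so less data gets shuffled.
recent = events.filter(col("event_date") >= "2024-01-01")

# Broadcast small data: ship the lookup table to every worker instead of
# shuffling the large table by join key.
joined = recent.join(broadcast(countries), "country_code")
joined.show()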
18. Another concept (again) - Worker Management
How to distribute tasks among nodes.
The concept: usually there's an agent to which we can submit a job request with
a certain specification (CPU, RAM, …)
Eg. YARN, Mesos
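A hedged sketch of that idea (the ResourceManager below is hypothetical; YARN and Mesos each have their own real APIs, this only illustrates matching a job spec to node capacity):

from dataclasses import dataclass

@dataclass
class TaskRequest:
    # The specification we hand to the cluster's agent (cf. YARN, Mesos).
    command: str
    cpus: int
    memory_mb: int

class ResourceManager:
    """Hypothetical agent: places task requests on nodes with free capacity."""
    def __init__(self, nodes):
        self.nodes = nodes  # {name: {"cpus": free_cpus, "memory_mb": free_mem}}

    def submit(self, task: TaskRequest) -> str:
        for name, free in self.nodes.items():
            if free["cpus"] >= task.cpus and free["memory_mb"] >= task.memory_mb:
                free["cpus"] -= task.cpus
                free["memory_mb"] -= task.memory_mb
                return name  # task placed on this node
        raise RuntimeError("no node can satisfy the request")

rm = ResourceManager({"node-1": {"cpus": 8, "memory_mb": 32768}})
print(rm.submit(TaskRequest("python job.py", cpus=4, memory_mb=8192)))  # node-1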
Editor's Notes
I hope this is not an afterthought
Read replicas are usually the first strategy on an RDBMS to increase read throughput, by manually separating the read load. In Traveloka we use this pattern for ingestion from production databases into the Data Lake.
Once upon a time, when our data was still ~14TB, we used Mongo sharding
Hash Sharding
Rows are stored based on the hash value of the primary key - efficient key-value retrieval (get)
Example: which server = hash(id) % servers
No coordinator needed; data is evenly distributed
All rows must be reassigned when rebalancing