This document discusses strategies for storing and processing data at scale. It covers using replicas to scale read throughput and shards to scale write throughput; the key challenges are eventual consistency across replicas and the limited write throughput of a single node. Different sharding techniques (range, hash, and consistent hashing) are explained. Parallelizing data processing involves sharding the data among workers, making the process fault tolerant through lineage graphs, and optimizing parallelism through techniques such as filtering early and broadcasting small data. Worker management distributes tasks among nodes through frameworks like YARN and Mesos.
Scalable data systems at Traveloka
1. Scalable Data Systems
How to store and process data at scale, and what the challenges are
Rendy B. Junior - Data System Architect @ traveloka
So my friends… our data is growing; we need to scale where we store our data.
We all hope it is not an afterthought question...
3. Start simple - replicas
Replicas: the same set of data on multiple nodes.
Usually used to separate read load by its characteristics, e.g. production
traffic vs. analytical queries (sketched below).
Challenges: replicas are eventually consistent, and… write
throughput does not increase...
[Diagram: two nodes, each holding A, B, C; production writes go to one node while analytical queries read from its replica]
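A minimal Python sketch of this read/write split (the Node class and query strings are made up for illustration; a real client would hold actual database connections):

import random

class Node:
    """Stand-in for a database connection (hypothetical)."""
    def __init__(self, name):
        self.name = name
    def execute(self, query):
        return f"{self.name} ran: {query}"

class ReplicatedDB:
    """Route writes to the primary and reads to replicas."""
    def __init__(self, primary, replicas):
        self.primary = primary    # the single writable node
        self.replicas = replicas  # read-only copies of the same data

    def write(self, query):
        # All writes hit the primary; replicas catch up asynchronously,
        # which is why reads can be eventually consistent.
        return self.primary.execute(query)

    def read(self, query):
        # Spread read load (e.g. analytical queries) across replicas.
        return random.choice(self.replicas).execute(query)

db = ReplicatedDB(Node("primary"), [Node("replica-1"), Node("replica-2")])
db.write("INSERT ...")
print(db.read("SELECT ..."))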
4. Let’s scale write throughput - shard
Shards: a different subset of the data on each node.
Since you have more nodes to write to, you get
more write throughput.
[Diagram: one shard node holding A, B and another holding C; production reads and writes are spread across both]
6. Sharding Techniques
● Range Sharding
○ Efficient scans. Requires a coordinator. Eg. MongoDB
● Hash Sharding
○ Efficient key-based retrieval. No coordinator needed.
○ Must reassign all rows when rebalancing (see the sketch after this list)
● Consistent Hashing
○ Assigns keys to logical partitions rather than physical servers
○ No need to reassign all rows; just move the logical partitions
○ Eg. Cassandra, DynamoDB
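A rough Python sketch of the difference, assuming a stable hash and a fixed number of logical partitions (both illustrative; real systems such as Cassandra use token rings with virtual nodes):

import hashlib

def stable_hash(key: str) -> int:
    # Stable across processes (Python's built-in hash() is seeded per run).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

# Hash sharding: a key maps straight to a physical server.
# Changing num_servers moves almost every key, hence "reassign all rows".
def hash_shard(key: str, num_servers: int) -> int:
    return stable_hash(key) % num_servers

# Consistent hashing (simplified): a key maps to one of many fixed logical
# partitions; rebalancing only moves partition-to-server assignments.
NUM_PARTITIONS = 1024

def logical_partition(key: str) -> int:
    return stable_hash(key) % NUM_PARTITIONS

def owner(key: str, partition_map: dict) -> str:
    # partition_map: logical partition -> server; rebalancing edits this map
    # instead of rehashing the rows themselves.
    return partition_map[logical_partition(key)]

partition_map = {p: f"server-{p % 3}" for p in range(NUM_PARTITIONS)}
print(hash_shard("user:42", 3), owner("user:42", partition_map))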
7. Other names...
Usually in managed databases, where the sharding method is determined by default, shards go by other names:
● Distribution key, eg. Redshift
● Partition key, eg. DynamoDB
● And other names...
8. It’s simple!
1. Shard : distribute write load
2. Replica : distribute read (even more), high availability, durability
Congrats!
You have understood the fundamental concept of any scalable data system!!!
It’s all about shards and replicas
And in most cases, it is configurable...
10. Something people love, but distributed systems hate
Hotspot!
Hotspot is injustice…
"Injustice anywhere is a threat to justice everywhere." - Martin Luther King Jr.
How to solve: rebalance, or fix sharding config
Sometimes it is inevitable from the database's point of view, e.g. a single-row hotspot
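A toy sketch of the instrumentation that surfaces a hotspot: count requests per shard key and flag any key taking a disproportionate share (the 20% threshold and the key names are invented for illustration):

from collections import Counter

def find_hotspots(request_keys, threshold=0.2):
    """Flag shard keys that receive more than `threshold` of all traffic."""
    counts = Counter(request_keys)
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items() if n / total > threshold}

# A single hot row dominating the load:
traffic = ["user:42"] * 80 + ["user:7"] * 10 + ["user:9"] * 10
print(find_hotspots(traffic))  # {'user:42': 0.8}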
11. Scaling from Different View - Distributed File System
Storage and compute scale out
separately - loosely coupled.
Rebalancing compute is a lot faster,
since it only redistributes metadata.
Eg. BigTable / HBase
12. Serverless?
Fewer things to worry about, but less configurable
● Usually sharding is determined by default
● Usually rebalancing is a background process we don't see
● Usually the replica set is magic; we only know the SLA
That doesn't mean there are no problems at all...
There are limitations, or fewer features.
Also, serverless won't fix a bad design; e.g. a hotspot will happen anywhere
if you choose a bad key...
13. Some tips
● Always test how scalable your database is
○ Sometimes it works in theory, but not in practice
● Make sure you design it correctly before scaling up
○ Money doesn’t solve everything...
● Cloud? Vertical scale before horizontal
○ Why? Usually, the price of two 1-core machines ~ one 2-core machine
○ Why? Horizontal = more network overhead
● Beware of injustice!
○ I mean hotspots… put proper instrumentation
● Read more
○ New types of databases keep coming...
14. So how do we scale the data processing part...
It’s a distributed process, but it’s stateful...
15. Similar but different
Similar: a “shard” relates to the unit of parallelism - how many parts the data
is split into among workers.
But for fault tolerance, make the process resilient instead of adding more replicas.
Eg. Spark builds a lineage graph, so if one node fails it can recompute the
partitions lost to the failure.
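For instance, a PySpark job like the sketch below (assuming PySpark is installed and an events.txt input exists) builds its lineage graph lazily; only the final action runs it:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()

# Each transformation extends the lineage graph; nothing runs yet.
lines = spark.sparkContext.textFile("events.txt")
errors = lines.filter(lambda line: "ERROR" in line)
pairs = errors.map(lambda line: (line.split()[0], 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# The action triggers execution. If a node dies, Spark replays
# filter -> map -> reduceByKey from the source to rebuild only the
# partitions that were lost.
print(counts.take(10))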
16. More on Parallelism
You need to learn how to do this efficiently in your tools. Usually:
● Determine how many files & size you should have
● Determine how many CPUs for your job
Sample rules: x cores → x*n partitions, ~200MB per file (see the sketch below)
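Applied literally, those sample rules look like this in Python (the numbers are rules of thumb, not hard limits):

def num_partitions(total_cores: int, n: int = 3) -> int:
    # "x cores -> x*n partitions": keep every core busy across task waves.
    return total_cores * n

def num_files(dataset_bytes: int, target_mb: int = 200) -> int:
    # "~200MB per file": avoid many tiny files or huge unsplittable ones.
    return max(1, round(dataset_bytes / (target_mb * 1024 * 1024)))

print(num_partitions(16))        # 48 partitions for 16 cores (n = 3)
print(num_files(100 * 1024**3))  # ~512 files for a 100GB dataset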
Some tools inspect the data source to decide the number of workers (autoscaling).
Eg. Dataflow, using BoundedSource.estimate_size in the SDK
17. Another concept in processing - Shuffling
Shuffling redistributes data between workers by key (e.g. for joins and
aggregations), and it is usually the expensive part.
Some tips (the first two are sketched below):
- Filter early
- Broadcast small data
- Learn engine-specific optimizations...
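In PySpark, the first two tips could look like this (table names, paths, and columns are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.appName("shuffle-tips").getOrCreate()
events = spark.read.parquet("events/")        # assumed: a large fact table
countries = spark.read.parquet("countries/")  # assumed: a small lookup table

# Filter early: drop rows before the join so less data gets shuffled.
recent = events.filter(col("event_date") >= "2024-01-01")

# Broadcast small data: ship the lookup table to every worker instead of
# shuffling the large table by join key.
joined = recent.join(broadcast(countries), "country_code")
joined.show()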
18. Another concept (again) - Worker Management
How to distribute tasks among nodes.
The concept: usually there's an agent to which we can submit a job request with
a certain specification (CPU, RAM, …)
Eg. YARN, Mesos
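A hedged sketch of that idea (the ResourceManager below is hypothetical; YARN and Mesos each have their own real APIs, this only illustrates matching a job spec to node capacity):

from dataclasses import dataclass

@dataclass
class TaskRequest:
    # The specification we hand to the cluster's agent (cf. YARN, Mesos).
    command: str
    cpus: int
    memory_mb: int

class ResourceManager:
    """Hypothetical agent: places task requests on nodes with free capacity."""
    def __init__(self, nodes):
        self.nodes = nodes  # {name: {"cpus": free_cpus, "memory_mb": free_mem}}

    def submit(self, task: TaskRequest) -> str:
        for name, free in self.nodes.items():
            if free["cpus"] >= task.cpus and free["memory_mb"] >= task.memory_mb:
                free["cpus"] -= task.cpus
                free["memory_mb"] -= task.memory_mb
                return name  # task placed on this node
        raise RuntimeError("no node can satisfy the request")

rm = ResourceManager({"node-1": {"cpus": 8, "memory_mb": 32768}})
print(rm.submit(TaskRequest("python job.py", cpus=4, memory_mb=8192)))  # node-1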
Editor's Notes
I hope this is not an afterthought
Read replicas are usually the first strategy on an RDBMS to increase read throughput, by manually separating the read load. In Traveloka we use this pattern for ingestion from production databases into the Data Lake.
Once upon a time, when our data was still ~14TB, we used Mongo sharding
Hash Sharding
Rows are stored based on the hash value of the primary key - efficient key-value retrieval (get)
Example: which server = hash(id) % servers
No coordinator needed; data is evenly distributed
All rows must be reassigned when rebalancing