1. TREASURE DATA
INFRASTRUCTURE FOR AUTO SCALING
DISTRIBUTED SYSTEMS
Auto scaling of a distributed processing engine is not common,
so here is what we did.
Kai Sasaki
Software Engineer at Treasure Data
2. ABOUT ME
- Kai Sasaki (@Lewuathe)
- Software Engineer at Treasure Data since 2015
- PTD migration
- Remerge on MapReduce
- Presto Auto Scaling System
- Improving testing environment of query engine
Working in the Query Engine team (managing Hive and Presto at Treasure Data)
- Open source contributor to Hadoop, Hivemall, Spark, Presto, and TensorFlow
3. TOPICS
Distributed Processing Engine
The general architecture of distributed processing
engines and the characteristics of specific platforms
used in Treasure Data, such as Hadoop and Presto.
Solution with Cloud
To achieve a highly scalable distributed system, we
make use of existing cloud features such as Auto
Scaling Groups and deployment management
services.
Painful Scalable System
Auto scaling of a distributed processing engine is not
common. We will describe the real pain points of
scaling our distributed system.
4. AGENDA
• Distributed System in TD
• Presto and Hive
• Painful points to scale out distributed processing engine
• What we’ve done
• Decoupling storage layer
• Packaging and deployment on CodeDeploy
• Capacity resizing with Auto Scaling Group
• Real Auto Scaling
• CPU metrics
• Cost reduction
10. WHY IS SCALING EFFECTIVE
• A distributed processing engine splits a job into multiple task fragments which can be run in
parallel. These task fragments are distributed to multiple worker nodes, so having more worker
nodes lets us assign smaller tasks to each worker.
SELECT
  t1.user,
  t1.path,
  t2.email,
  t2.address,
  t2.age
FROM
  www_access t1
JOIN
  user_table t2
ON
  t1.user = t2.user
WHERE
  t1.path = '/register'
(Query plan stages: scan, scan, filter, join, output)
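The point above can be sketched in a few lines (the function name and fragment counts are illustrative, not part of any engine's API): splitting a scan into fixed-size fragments means every added worker shrinks the per-worker share.

```python
import math

def fragments_per_worker(total_fragments: int, workers: int) -> int:
    """Upper bound on the number of task fragments each worker handles
    when fragments are spread evenly across the cluster."""
    return math.ceil(total_fragments / workers)

# A scan split into 120 fragments:
print(fragments_per_worker(120, 10))  # -> 12 fragments per worker
print(fragments_per_worker(120, 40))  # -> 3 fragments per worker
```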
11. WHY IS SCALING EFFECTIVE
• JOIN is a common operation in OLAP processing and generally consumes a lot of
computing resources. Efficient parallel join algorithms are well known, and a distributed
processing engine can fully leverage them.
12. WHY IS SCALING EFFECTIVE
• Reading data from backend storage (e.g. S3) is a high-latency operation that can consume
a lot of memory and network bandwidth. Distributing table scan fragments makes it possible to
spread the network load evenly.
13. PAIN POINTS
• In theory, horizontal scaling of worker nodes in these systems gives us an easy way to achieve
competitive performance without spending much time or money.
• But there are several pain points to overcome in daily releases:
• Stateful datasource
• Bootstrapping an instance takes time.
• Manually tracking the deployment process is unreliable.
• The configuration override mechanism is complex (and so is package versioning).
• Deployment verification (smoke tests)
• Graceful shutdown to keep running queries reliable
• Specifying the deployment target in a multi-cloud environment
• Capacity estimation
15. DECOUPLING STORAGE LAYER
(Diagram: PlazmaDB and Amazon S3 as the decoupled storage layer)
• Only intermediate/temporary data is stored in the processing engine, so we can discard any
instance in the cluster at any time. We replaced the input/output classes with Plazma-specific ones.
16. CODEDEPLOY + AUTO SCALING GROUP
• We migrated our Presto infrastructure to AWS CodeDeploy and EC2 Auto Scaling Groups,
which make it easy to provision clusters and to scale cluster capacity out and in.
(Diagram: AWS CodeDeploy deploys a package to an Auto Scaling group)
17. CODEDEPLOY
• CodeDeploy is an AWS service that lets us automate the deployment flow. CodeDeploy
manages the steps of deploying a specific package as a whole.
• We can define a group where the package should be deployed.
• A CodeDeploy package is just a zip file containing everything needed to run the application:
• Configs
• Binary package
• Hook scripts
• Package versioning, application health checks, and configuration rendering can all be done
via CodeDeploy.
18. DEPLOYMENT TARGET
• Specifying the deployment target is troublesome in a multi-cloud environment.
• A deployment target is a definition that identifies the unique resource where the package
should be deployed:
• site
• stage
• service
• cluster
• For example, a Presto cluster resource can be specified like this:
• presto-coordinator-default-production-aws
• presto-worker-default-production-aws
• A deployment specifies:
• the CodeDeploy package version to be deployed (a Git tag)
• the deployment target
19. DEPLOYMENT TARGET
• Configuration has an override hierarchy based on the deployment target.
• You can define default configuration values at each layer of the deployment target.
• For example:
• site=aws: -Xmx4G -Xms1G
• stage=production: -Xmx192G -Xms160G
• service=presto-worker: N/A
• cluster=one: -Xmx127G
• You can control the configuration in a fine-grained manner without losing deployment
flexibility.
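A minimal sketch of such layered overrides, assuming a simple "more specific layer wins" merge (the merge logic is ours; the JVM flag values come from the example above):

```python
LAYERS = ["site", "stage", "service", "cluster"]  # least -> most specific

def resolve(config_by_layer):
    """Merge per-layer settings; more specific layers override earlier ones."""
    resolved = {}
    for layer in LAYERS:
        resolved.update(config_by_layer.get(layer, {}))
    return resolved

jvm = resolve({
    "site":    {"Xmx": "4G",   "Xms": "1G"},    # site=aws defaults
    "stage":   {"Xmx": "192G", "Xms": "160G"},  # stage=production
    "cluster": {"Xmx": "127G"},                 # cluster=one
})
print(jvm)  # -> {'Xmx': '127G', 'Xms': '160G'}
```

The cluster layer overrides only `Xmx`, so `Xms` falls through from the stage layer, which is exactly the fine-grained control described above.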
20. CODEDEPLOY + AUTO SCALING GROUP
• Bootstrapping EC2 instances is delegated to the Auto Scaling Group, and application
provisioning is done through AWS CodeDeploy.
• The CodeDeploy package includes all application configurations.
(Diagram: a master node and an Auto Scaling group of workers)
21. CODEDEPLOY + AUTO SCALING GROUP
• If you increase the total capacity of the ASG, it automatically bootstraps EC2 instances and
deploys the same CodeDeploy package that the other workers have.
• Provisioning a cluster now takes far less time (from over an hour down to about 10 minutes
for 100 instances in total).
22. CODEDEPLOY + AUTO SCALING GROUP
• We can scale in the cluster capacity in a similar manner. One thing we need to take care of is
the shutdown process: to avoid failing in-flight queries, instances must be shut down
gracefully. We use a lifecycle hook to notify the ASG when an instance may be terminated.
(Diagram: a worker completes its lifecycle hook before the ASG terminates it)
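The drain-then-terminate flow can be sketched as below. This is an illustration, not Treasure Data's implementation: both callables are injected so the sketch stays cloud-agnostic, and in production the completion callback would invoke the EC2 Auto Scaling `CompleteLifecycleAction` API.

```python
import time

def graceful_shutdown(active_query_count, complete_lifecycle_action,
                      poll_interval=5.0, timeout=3600):
    """Poll until this worker has no running queries (or a timeout
    expires), then signal the ASG lifecycle hook that termination
    can proceed."""
    deadline = time.time() + timeout
    while active_query_count() > 0 and time.time() < deadline:
        time.sleep(poll_interval)
    complete_lifecycle_action()

# Simulated usage: a worker draining two in-flight queries.
remaining = [2, 1, 0]
completed = []
graceful_shutdown(lambda: remaining.pop(0),
                  lambda: completed.append("CONTINUE"),
                  poll_interval=0)
# completed is now ["CONTINUE"]
```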
23. DEPLOYMENT HOOK
• CodeDeploy provides a way to run hook scripts at each stage of a deployment:
• ApplicationStop
• BeforeInstall
• AfterInstall
• ApplicationStart
• ValidateService
• It is easy to validate that the application process is running properly before it joins the
cluster by running a smoke test in the ValidateService hook.
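These hooks are declared in the package's appspec.yml. The sketch below is hypothetical: the file names, paths, and timeouts are illustrative, not Treasure Data's actual ones.

```yaml
version: 0.0
os: linux
files:
  - source: /
    destination: /opt/presto        # illustrative install path
hooks:
  ApplicationStop:
    - location: hooks/stop_presto.sh
      timeout: 300
  AfterInstall:
    - location: hooks/render_config.sh   # apply the layered config overrides
  ApplicationStart:
    - location: hooks/start_presto.sh
  ValidateService:
    - location: hooks/smoke_test.sh      # run a trivial query before joining the cluster
      timeout: 600
```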
25. REAL “AUTO” SCALING…
• But we still need to specify the desired capacity manually…
• It is based on our past history and experience.
(Diagram: guessing the Auto Scaling capacity: 40? 50? or more?)
26. REAL AUTO SCALING
• The auto scaling policy feature provides an easy way to automatically scale the cluster based
on a specific metric:
• Simple Scaling Policy
• Step Scaling Policy
• Target Tracking Scaling Policy
• A target tracking scaling policy enables us to adjust the cluster capacity in a fine-grained
manner. It calculates the necessary capacity based on the gap between the current metric
value and the target value.
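For a utilization metric such as average CPU, the capacity a target tracking policy converges on is roughly proportional to how far the metric sits from the target. A minimal sketch (the function is ours, not an AWS API):

```python
import math

def desired_capacity(current_capacity: int, current_metric: float,
                     target_metric: float) -> int:
    """Roughly how target tracking resizes for a utilization metric:
    scale capacity in proportion to the current/target metric ratio."""
    return math.ceil(current_capacity * current_metric / target_metric)

print(desired_capacity(40, 60.0, 40.0))  # 40 workers at 60% CPU, target 40% -> 60
print(desired_capacity(40, 20.0, 40.0))  # at 20% CPU the policy would shrink -> 20
```

The real policy also applies warm-up periods and is deliberately conservative when scaling in, which matters for the results below.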
27. TARGET TRACKING OF CPU
• Workload simulation with 10-30 concurrent queries.
• Production-size cluster (around 40 worker instances).
28. TARGET TRACKING OF CPU
• Expected cost reduction by target value of average CPU usage.
29. TARGET TRACKING OF CPU
• Actual capacity transition with the target tracking policy.
• Targeting 40% CPU usage, with scale-in enabled.
30. FINALLY…
• It did not work properly because:
• The scale-in transition is conservative compared to scale-out, so cluster capacity tends to
stay high for a long time. -> Cost increase
• Graceful shutdown also blocks scale-in transitions, because long-running queries can
delay instance termination.
31. FUTURE WORK
• Real auto scaling without using a target tracking policy
• Detect instances that will be able to enter the shutdown process soon
• Estimate capacity based on application-specific metrics
• Fine-grained tests to ensure query result consistency
• Automatic query migration in case of an outage
32. WE HUMANS ARE NOT AUTO SCALABLE!
https://www.treasuredata.com/company/careers/