1. TREASURE DATA
INFRASTRUCTURE FOR AUTO SCALING
DISTRIBUTED SYSTEMS
Auto scaling of a distributed processing engine is not common,
so here is what we did.
Kai Sasaki
Software Engineer at Treasure Data
2. ABOUT ME
- Kai Sasaki (@Lewuathe)
- Software Engineer at Treasure Data since 2015
- PTD migration
- Remerge on MapReduce
- Presto Auto Scaling System
- Improving testing environment of query engine
Working in the Query Engine team (managing Hive and Presto at Treasure Data)
- Open source contributor to Hadoop, Hivemall, Spark, Presto, and TensorFlow
3. TOPICS
Distributed Processing Engine
The general architecture of distributed processing
engines and the characteristics of specific platforms
used in Treasure Data, such as Hadoop and Presto.
Solution with Cloud
To achieve a highly scalable distributed system, we
make use of existing cloud features such as Auto
Scaling Groups and deployment management
services.
Painful Scalable System
Auto scaling of a distributed processing engine is not
common. We will describe the real pain points of
scaling our distributed system.
4. AGENDA
• Distributed System in TD
• Presto and Hive
• Painful points to scale out distributed processing engine
• What we’ve done
• Decoupling storage layer
• Packaging and deployment on CodeDeploy
• Capacity resizing with Auto Scaling Group
• Real Auto Scaling
• CPU metrics
• Cost reduction
10. WHY IS SCALING EFFECTIVE
• A distributed processing engine splits a job into multiple task fragments which can be run in
parallel. These task fragments are distributed to multiple worker nodes, so having more worker
nodes lets us assign smaller tasks to each worker.
SELECT
  t1.user,
  t1.path,
  t2.email,
  t2.address,
  t2.age
FROM
  www_access t1
JOIN
  user_table t2
ON
  t1.user = t2.user
WHERE
  t1.path = '/register'
(Query plan stages: scan, scan, filter, join, output)
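The point above can be sketched in a few lines (the function name and fragment counts are illustrative, not part of any engine's API): splitting a scan into fixed-size fragments means every added worker shrinks the per-worker share.

```python
import math

def fragments_per_worker(total_fragments: int, workers: int) -> int:
    """Upper bound on the number of task fragments each worker handles
    when fragments are spread evenly across the cluster."""
    return math.ceil(total_fragments / workers)

# A scan split into 120 fragments:
print(fragments_per_worker(120, 10))  # -> 12 fragments per worker
print(fragments_per_worker(120, 40))  # -> 3 fragments per worker
```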
11. WHY IS SCALING EFFECTIVE
• JOIN is a common operation in OLAP processing and generally consumes a lot of
computing resources. Efficient parallel join algorithms are well known, and a distributed
processing engine can fully leverage them.
12. WHY IS SCALING EFFECTIVE
• Reading data from backend storage (e.g. S3) is a high-latency operation that can consume
a lot of memory and network bandwidth. Distributing table scan fragments makes it possible to
spread the network load evenly.
13. PAIN POINTS
• In theory, horizontal scaling of worker nodes in these systems gives us an easy way to achieve
competitive performance without spending much time or money.
• But there are several pain points to overcome in daily releases:
• Stateful datasource
• Bootstrapping an instance takes time.
• Manually tracking the deployment process is unreliable.
• The configuration override mechanism is complex (and so is package versioning).
• Deployment verification (smoke tests)
• Graceful shutdown to keep running queries reliable
• Specifying the deployment target in a multi-cloud environment
• Capacity estimation
15. DECOUPLING STORAGE LAYER
(Diagram: PlazmaDB and Amazon S3 as the decoupled storage layer)
• Only intermediate/temporary data is stored in the processing engine, so we can discard any
instance in the cluster at any time. We replaced the input/output classes with Plazma-specific ones.
16. CODEDEPLOY + AUTO SCALING GROUP
• We migrated our Presto infrastructure to AWS CodeDeploy and EC2 Auto Scaling Groups,
which make it easy to provision clusters and to scale cluster capacity out and in.
(Diagram: AWS CodeDeploy deploys a package to an Auto Scaling group)
17. CODEDEPLOY
• CodeDeploy is an AWS service that lets us automate the deployment flow. CodeDeploy
manages the steps of deploying a specific package as a whole.
• We can define a group where the package should be deployed.
• A CodeDeploy package is just a zip file containing everything needed to run the application:
• Configs
• Binary package
• Hook scripts
• Package versioning, application health checks, and configuration rendering can all be done
via CodeDeploy.
18. DEPLOYMENT TARGET
• Specifying the deployment target is troublesome in a multi-cloud environment.
• A deployment target is a definition that identifies the unique resource where the package
should be deployed:
• site
• stage
• service
• cluster
• For example, a Presto cluster resource can be specified like this:
• presto-coordinator-default-production-aws
• presto-worker-default-production-aws
• A deployment specifies:
• the CodeDeploy package version to be deployed (a Git tag)
• the deployment target
19. DEPLOYMENT TARGET
• Configuration has an override hierarchy based on the deployment target.
• You can define default configuration values at each layer of the deployment target.
• For example:
• site=aws: -Xmx4G -Xms1G
• stage=production: -Xmx192G -Xms160G
• service=presto-worker: N/A
• cluster=one: -Xmx127G
• You can control the configuration in a fine-grained manner without losing deployment
flexibility.
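A minimal sketch of such layered overrides, assuming a simple "more specific layer wins" merge (the merge logic is ours; the JVM flag values come from the example above):

```python
LAYERS = ["site", "stage", "service", "cluster"]  # least -> most specific

def resolve(config_by_layer):
    """Merge per-layer settings; more specific layers override earlier ones."""
    resolved = {}
    for layer in LAYERS:
        resolved.update(config_by_layer.get(layer, {}))
    return resolved

jvm = resolve({
    "site":    {"Xmx": "4G",   "Xms": "1G"},    # site=aws defaults
    "stage":   {"Xmx": "192G", "Xms": "160G"},  # stage=production
    "cluster": {"Xmx": "127G"},                 # cluster=one
})
print(jvm)  # -> {'Xmx': '127G', 'Xms': '160G'}
```

The cluster layer overrides only `Xmx`, so `Xms` falls through from the stage layer, which is exactly the fine-grained control described above.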
20. CODEDEPLOY + AUTO SCALING GROUP
• Bootstrapping EC2 instances is delegated to the Auto Scaling Group, and application
provisioning is done through AWS CodeDeploy.
• The CodeDeploy package includes all application configurations.
(Diagram: a master node and an Auto Scaling group of workers)
21. CODEDEPLOY + AUTO SCALING GROUP
• If you increase the total capacity of the ASG, it automatically bootstraps EC2 instances and
deploys the same CodeDeploy package that the other workers have.
• Provisioning a cluster now takes far less time (from over an hour down to about 10 minutes
for 100 instances in total).
22. CODEDEPLOY + AUTO SCALING GROUP
• We can scale in the cluster capacity in a similar manner. One thing we need to take care of is
the shutdown process: to avoid failing in-flight queries, instances must be shut down
gracefully. We use a lifecycle hook to notify the ASG when an instance may be terminated.
(Diagram: a worker completes its lifecycle hook before the ASG terminates it)
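The drain-then-terminate flow can be sketched as below. This is an illustration, not Treasure Data's implementation: both callables are injected so the sketch stays cloud-agnostic, and in production the completion callback would invoke the EC2 Auto Scaling `CompleteLifecycleAction` API.

```python
import time

def graceful_shutdown(active_query_count, complete_lifecycle_action,
                      poll_interval=5.0, timeout=3600):
    """Poll until this worker has no running queries (or a timeout
    expires), then signal the ASG lifecycle hook that termination
    can proceed."""
    deadline = time.time() + timeout
    while active_query_count() > 0 and time.time() < deadline:
        time.sleep(poll_interval)
    complete_lifecycle_action()

# Simulated usage: a worker draining two in-flight queries.
remaining = [2, 1, 0]
completed = []
graceful_shutdown(lambda: remaining.pop(0),
                  lambda: completed.append("CONTINUE"),
                  poll_interval=0)
# completed is now ["CONTINUE"]
```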
23. DEPLOYMENT HOOK
• CodeDeploy provides a way to run hook scripts at each stage of a deployment:
• ApplicationStop
• BeforeInstall
• AfterInstall
• ApplicationStart
• ValidateService
• It is easy to validate that the application process is running properly before it joins the
cluster by running a smoke test in the ValidateService hook.
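These hooks are declared in the package's appspec.yml. The sketch below is hypothetical: the file names, paths, and timeouts are illustrative, not Treasure Data's actual ones.

```yaml
version: 0.0
os: linux
files:
  - source: /
    destination: /opt/presto        # illustrative install path
hooks:
  ApplicationStop:
    - location: hooks/stop_presto.sh
      timeout: 300
  AfterInstall:
    - location: hooks/render_config.sh   # apply the layered config overrides
  ApplicationStart:
    - location: hooks/start_presto.sh
  ValidateService:
    - location: hooks/smoke_test.sh      # run a trivial query before joining the cluster
      timeout: 600
```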
25. REAL “AUTO” SCALING…
• But we still need to specify the desired capacity manually…
• It is based on our past history and experience.
(Diagram: guessing the Auto Scaling capacity: 40? 50? or more?)
26. REAL AUTO SCALING
• The auto scaling policy feature provides an easy way to automatically scale the cluster based
on a specific metric:
• Simple Scaling Policy
• Step Scaling Policy
• Target Tracking Scaling Policy
• A target tracking scaling policy enables us to adjust the cluster capacity in a fine-grained
manner. It calculates the necessary capacity based on the gap between the current metric
value and the target value.
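For a utilization metric such as average CPU, the capacity a target tracking policy converges on is roughly proportional to how far the metric sits from the target. A minimal sketch (the function is ours, not an AWS API):

```python
import math

def desired_capacity(current_capacity: int, current_metric: float,
                     target_metric: float) -> int:
    """Roughly how target tracking resizes for a utilization metric:
    scale capacity in proportion to the current/target metric ratio."""
    return math.ceil(current_capacity * current_metric / target_metric)

print(desired_capacity(40, 60.0, 40.0))  # 40 workers at 60% CPU, target 40% -> 60
print(desired_capacity(40, 20.0, 40.0))  # at 20% CPU the policy would shrink -> 20
```

The real policy also applies warm-up periods and is deliberately conservative when scaling in, which matters for the results below.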
27. TARGET TRACKING OF CPU
• Workload simulation with 10-30 concurrent queries.
• Production-size cluster (around 40 worker instances).
28. TARGET TRACKING OF CPU
• Expected cost reduction by target value of average CPU usage.
29. TARGET TRACKING OF CPU
• Actual capacity transition with the target tracking policy.
• Targeting 40% CPU usage, with scale-in enabled.
30. FINALLY…
• It did not work properly because:
• The scale-in transition is conservative compared to scale-out, so cluster capacity tends to
stay high for a long time. -> Cost increase
• Graceful shutdown also blocks scale-in transitions, because long-running queries can
delay instance termination.
31. FUTURE WORK
• Real auto scaling without using a target tracking policy
• Detect instances that will be able to enter the shutdown process soon
• Estimate capacity based on application-specific metrics
• Fine-grained tests to ensure query result consistency
• Automatic query migration in case of an outage
32. WE HUMANS ARE NOT AUTO SCALABLE!
https://www.treasuredata.com/company/careers/