1. Confidential © Arm 2017
Automation of Hadoop cluster operations
in Arm Treasure Data
Yan Wang
Arm Treasure Data
March 14, 2019
2. Who am I?
● Yan Wang (王岩)
● May 2018 〜 Arm Treasure Data
Hadoop team, Software Engineer
● Contributing to Hadoop
● Likes Japanese Mahjong
● Blog: https://tiana528.github.io/
3. Agenda
● Hadoop in Arm Treasure Data
● Hadoop Cluster Operation Automation
○ Reduce hadoop cluster creation time significantly
○ Simplify hadoop cluster recreation
○ Modernize instance type of slaves
○ Create patches to fast fail jobs consuming too much disk
○ Simplify incident handling
○ Make it easy to know when to scale out
○ Simplify shutting down nodes
○ Replace Chef with Debian packaging and CodeDeploy
● Future roadmap
● Summary
4. Arm Treasure Data Product
Customers don’t
need to operate
hadoop clusters.
We do.
5. Hadoop Usage
● Clusters: multi-cloud, highly multi-tenant, permanent storage, HA
● Cluster structure: one master (M) and many slaves (S)
● Patched hadoop: PTD-2.7.3-xxx
● Operation tool: self-developed (not CDH or HDP)
● Operation point of view
○ Recreate cluster on incident
○ The self-developed operation tool is the key point for operation; it was improved in the past year
6. Agenda
7. Reduce hadoop cluster creation time significantly
-- by making use of AWS Auto Scaling Group
● Before: environment setup, then launch the 100 nodes of the cluster one by one
○ Too slow
■ Client side: 1 hour
■ Cluster ready: 1 hour
● After: environment setup, then create an AWS Auto Scaling Group
○ Much faster
■ Client side: 3 minutes
■ Cluster ready: 15 minutes
(9 months ago)
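The speed-up comes from letting the Auto Scaling Group launch the whole fleet in parallel rather than node by node. A minimal sketch of the idea, building the parameters that would be passed to boto3's `autoscaling` client as `create_auto_scaling_group(**params)`; the group name, tag, and launch-template id are illustrative, not Treasure Data's real configuration:

```python
# Illustrative sketch (not the actual tooling): build the request for
# boto3.client("autoscaling").create_auto_scaling_group(**params).
def asg_request(cluster_name: str, node_count: int, launch_template_id: str) -> dict:
    return {
        "AutoScalingGroupName": f"hadoop-slaves-{cluster_name}",
        "LaunchTemplate": {"LaunchTemplateId": launch_template_id},
        # the ASG launches all slaves in parallel instead of one by one
        "MinSize": node_count,
        "MaxSize": node_count,
        "DesiredCapacity": node_count,
        "Tags": [{"Key": "cluster", "Value": cluster_name}],
    }

params = asg_request("ClusterB", 100, "lt-0123example")
# boto3.client("autoscaling").create_auto_scaling_group(**params)  # real call
```

With fixed Min/Max/Desired sizes the group holds the cluster at exactly 100 nodes and replaces any instance that dies.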
8. Agenda
9. General flow of how to recreate a hadoop cluster
● No downtime: A/B switch
1. Create the new cluster (ClusterB) alongside the running ClusterA
2. Switch job server traffic from ClusterA to ClusterB
3. Shut down the old cluster (ClusterA)
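The three-step A/B flow above can be sketched as a small function; the router dict and the create/delete callbacks are illustrative stand-ins for the real operation tool, not its actual API:

```python
# Minimal sketch of the no-downtime A/B recreation flow.
def recreate(router: dict, create, delete, new_cluster: str) -> dict:
    """router maps a route key to the currently active cluster name."""
    old = router["default"]
    create(new_cluster)                           # 1. create the new cluster
    router = {**router, "default": new_cluster}   # 2. switch job-server traffic
    delete(old)                                   # 3. shut down the old cluster
    return router

events = []
router = recreate(
    {"default": "ClusterA"},
    create=lambda c: events.append(("create", c)),
    delete=lambda c: events.append(("delete", c)),
    new_cluster="ClusterB",
)
```

The ordering matters: the old cluster is only deleted after traffic points at the new one, so jobs never see a missing cluster.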
10. Simplify hadoop cluster recreation
-- by creating our own wrapper script around the SRE tool (7 months ago)
● Before: use the SRE team's tool directly
service create -S aws -s development -c ClusterB ...
service delete -S aws -s development -c ClusterA ...
○ Issues: too many parameters; stressful to shut down
● After: use our wrapper script (= SRE tool + verification + config)
cluster create ClusterB
cluster delete ClusterA
○ Improved: 1 parameter; stress-free to shut down
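A sketch of what such a wrapper does: expand the one-parameter command into the full SRE-tool invocation, filling the provider and stage from config instead of flags. The config source and function names here are hypothetical:

```python
# Hypothetical wrapper: "cluster create ClusterB" expands into the full
# SRE-tool command line, with fixed flags supplied from configuration.
CONFIG = {"provider": "aws", "stage": "development"}  # assumed config source

def wrap(action: str, cluster: str, config: dict = CONFIG) -> list:
    return [
        "service", action,
        "-S", config["provider"],
        "-s", config["stage"],
        "-c", cluster,
    ]

cmd = wrap("create", "ClusterB")
```

Baking the environment into config is what reduces the surface from many parameters to one, and a verification step can run before the command is ever executed.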
11. Agenda
12. Gained many benefits by changing the instance type of slaves
● Before: c3.8xlarge (very old model)
● After: m5d.12xlarge (latest model), 6 months ago
● Improved
○ Larger per-container memory
○ Larger & faster local disk
○ Lower cost
○ ...
● But …
13. But… a new issue occurred
● New issue
○ Amazon doesn't have that many m5d instances available for on-demand allocation
○ Insufficient instances to do an A/B switch in one availability zone when recreating a cluster
● Asked Amazon support for help
○ They suggested buying more reserved instances or temporarily using other instance types
● Other approaches?
14. Handle insufficient instances in one AZ
-- by supporting a cross-AZ environment
● Cross-AZ environment: the job server reaches clusters in both AZ_1 and AZ_2 through a REST API, so recreation follows the same A/B flow across zones
1. Create the new cluster in the other AZ (AZ_2)
2. Switch job server traffic to it
3. Shut down the old cluster in AZ_1
● Key point: no large network traffic between AZs, which can be expensive
15. Agenda
16. Create patches to fast fail jobs consuming too much disk
● Before (task timeline, 0h〜40h): the task fails around 10h, is retried, fails again, is retried… the job only fails around 40h. Retrying is meaningless.
● After: fail fast. The job fails around 10h, at the first failure.
● We created two patches (4 months ago)
○ For local disk: MAPREDUCE-7022 Fast fail rogue jobs based on task scratch dir size
○ For HDFS (when a disk quota is configured): MAPREDUCE-7148 Fast fail jobs when exceeds dfs quota limitation
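Both patches share the same core idea: measure disk usage against a limit and kill the job immediately instead of letting retries burn another 30 hours. A sketch of that check in Python rather than the actual Java patch code; the function names and limit are illustrative:

```python
# Sketch of the fast-fail idea behind MAPREDUCE-7022 (names are illustrative,
# not the real patch): measure the task scratch directory, and if it exceeds
# the configured limit, fail the job now instead of retrying.
import os

def scratch_dir_bytes(path: str) -> int:
    """Total size of all files under a task's scratch directory."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for f in files:
            total += os.path.getsize(os.path.join(root, f))
    return total

def check_scratch_limit(used_bytes: int, limit_bytes: int) -> str:
    # "FAIL_FAST" kills the job immediately; retrying a rogue job is pointless
    return "FAIL_FAST" if used_bytes > limit_bytes else "OK"
```

MAPREDUCE-7148 applies the same principle to HDFS: a quota violation is treated as a permanent failure, not a retriable one.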
17. Agenda
18. Simplify incident handling by creating health check scripts (4 months ago)
● Before: runbook (check A, run command B, check C, if … else…, open URL ...)
○ When an incident happens, follow the complex runbook; info must be collected first
● After: health check script, installed on all nodes, checking very detailed status; Datadog metrics trigger alerts
○ When an incident happens, run the health check and know where the issue is
● Future
○ Integrate with the Auto Scaling Group health check
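A minimal sketch of such a health-check runner: run a series of named checks and report the failing ones, so on-call can skip the runbook triage. The individual checks here are placeholders, not the real ones:

```python
# Sketch of a node health-check script; the checks are placeholders.
def run_health_checks(checks: dict) -> list:
    """checks: name -> zero-arg callable returning True when healthy."""
    failures = []
    for name, check in checks.items():
        try:
            if not check():
                failures.append(name)
        except Exception:
            failures.append(name)  # a crashing check counts as unhealthy
    return failures

failures = run_health_checks({
    "datanode_process": lambda: True,   # e.g. is the DataNode process running?
    "disk_free": lambda: 5 < 10,        # e.g. free space above a threshold?
    "namenode_rpc": lambda: False,      # e.g. a probe RPC; failing here
})
```

Returning a list of named failures is also what makes the future ASG integration natural: a non-empty list can map directly to an "Unhealthy" instance status.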
19. Agenda
20. Make it easy to know when to scale out
-- by creating capacity metrics based on machine learning (ongoing, POC)
● Before: an alert comes; manually scale out if there is a performance issue
○ Issues: a little late…; hard for juniors to understand
● After: feed HDFS put/get latency, price plan & slots in use, probe queries, HDFS usage, and CPU I/O wait into a linear regression that produces a capacity metric
● Expected improvement: know when to scale out immediately and easily
● Future plan: use it for auto scaling
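A toy sketch of the capacity-metric idea, reduced to a single feature with made-up numbers (the real POC combines the several signals listed above): fit a line from an observable signal to a capacity score, then alert when the predicted score crosses a threshold.

```python
# Toy single-feature linear regression for a capacity metric.
# Data and threshold are invented for illustration.
def fit_line(xs, ys):
    """Least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def needs_scale_out(slope, intercept, x, threshold=0.9):
    # alert when the predicted capacity score crosses the threshold
    return slope * x + intercept > threshold

# e.g. x = HDFS usage fraction, y = observed capacity score
slope, intercept = fit_line([0.2, 0.4, 0.6, 0.8], [0.25, 0.45, 0.65, 0.85])
```

A single explainable number ("capacity score, alert above 0.9") is also easier for juniors to act on than a wall of raw dashboards.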
21. Agenda
22. Simplify shutting down slaves
-- by using the Auto Scaling Group shutdown hook (ongoing)
● Before: shut down 2 nodes at a time, wait for block replication to finish, then shut down 2 more…
○ Issues: tedious operation; potential job retries
● After: hadoop node decommission script + AWS Auto Scaling Group shutdown hook
○ Expected improvement: safe & fast
● Future plan: find a “proper” node to kill (e.g. one running only short tasks)
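A sketch of the lifecycle-hook flow: when the ASG picks an instance to terminate, the hook pauses termination until the Hadoop decommission finishes, then completes the lifecycle action so the ASG can proceed. The AWS calls are abstracted behind callbacks here, and all names are illustrative:

```python
# Sketch of handling an ASG termination lifecycle hook for a Hadoop slave.
def on_termination_notice(instance_id, decommission, complete_lifecycle):
    # drain blocks and stop accepting tasks before the instance goes away
    decommission(instance_id)
    # tell the ASG it may now terminate the instance
    complete_lifecycle(instance_id, result="CONTINUE")

log = []
on_termination_notice(
    "i-0abc",
    decommission=lambda i: log.append(("decommission", i)),
    complete_lifecycle=lambda i, result: log.append((result, i)),
)
```

In a real setup `complete_lifecycle` would call the Auto Scaling `CompleteLifecycleAction` API; the key property is that decommission always finishes first, which is what removes the job-retry risk.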
23. Agenda
24. Replace Chef with Debian packaging and CodeDeploy
● Before: Chef. We met many issues using it
○ Ruby only
○ Unnecessarily complicated
○ Stateful
○ 15 override rules for attributes
○ Slow
○ Fails silently
○ Dependent on another team's release cycle
○ Two-pass model
○ 5 years of additions, little by little
○ ...
● After (ongoing)
○ Debian packaging: the standard way on Linux
○ AWS CodeDeploy: fast, easy to maintain, and usable in other clouds
● Expected improvement
○ Much easier to maintain
○ Cluster creation: 15 minutes => 5 minutes
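For flavor, a minimal appspec.yml sketch of what a CodeDeploy deployment of pre-built packages could look like; the destination path and hook scripts are hypothetical, not the actual deployment:

```yaml
version: 0.0
os: linux
files:
  - source: /
    destination: /opt/td-hadoop        # hypothetical install path
hooks:
  AfterInstall:
    - location: scripts/install_debs.sh   # e.g. dpkg -i the built .deb packages
      timeout: 300
  ApplicationStart:
    - location: scripts/restart_hadoop.sh
      timeout: 300
```

Because the packages are built once and only installed at deploy time, the per-node work shrinks to copy-and-install, which is where the 15-to-5-minute gain comes from.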
25. Agenda
● Hadoop in Arm Treasure Data
● Hadoop Cluster Operation Automation
● Future roadmap
○ API-based routing and workflow-based hadoop recreation
○ Usage history based account routing
● Summary
26. API-based routing and workflow-based hadoop recreation
● Before: change routing manually
○ Submit a git pull request, review, merge, upload the databag, run chef-client on all nodes
○ Issues: very manual; depends on manual validation
● After: API-based routing; change routing with one REST API call
curl -X PUT .../hadoop_routes -d '{"default":"ClusterB"}'
● Expected improvement
○ Totally automate hadoop cluster recreation through a workflow
○ Server-side validation
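A sketch of the server-side validation such an API enables: reject a routing change unless the target cluster exists and is healthy. The cluster registry, payload shape, and status codes here are illustrative:

```python
# Sketch of validating a PUT to a hypothetical /hadoop_routes endpoint.
KNOWN_CLUSTERS = {
    "ClusterA": {"healthy": True},
    "ClusterB": {"healthy": True},
}

def put_hadoop_routes(payload: dict, clusters: dict = KNOWN_CLUSTERS):
    """Return (status_code, message) for a routing-change request."""
    target = payload.get("default")
    if target not in clusters:
        return 400, f"unknown cluster: {target}"
    if not clusters[target]["healthy"]:
        return 409, f"cluster not healthy: {target}"
    return 200, f"routing default -> {target}"

status, _msg = put_hadoop_routes({"default": "ClusterB"})
```

This is the check that a git-PR workflow left to human reviewers: a workflow engine calling the API can rely on the server refusing a switch to a missing or unhealthy cluster.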
27. Agenda
28. Usage history based account routing
● Before: fixed routing to fixed-size clusters
○ Busy clusters and idle clusters coexist, so resources are not fully utilized
○ One big cluster makes it easy to hit the insufficient-instance issue when recreating it
● After: dynamic account routing
○ Route more accounts to the idle cluster, so resource utilization increases
○ Easy to split a big cluster into smaller clusters (cluster1, cluster2, even across AZ_1 / AZ_2) when instances are insufficient
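At its simplest, the routing decision is just picking the least-utilized cluster for an account's jobs; a sketch with made-up utilization numbers (the real plan would weigh usage history, not a single snapshot):

```python
# Sketch of dynamic account routing: instead of a fixed account->cluster map,
# send work to the cluster with the lowest current utilization.
def pick_cluster(utilization: dict) -> str:
    """utilization: cluster name -> fraction of slots in use (0.0-1.0)."""
    return min(utilization, key=utilization.get)

target = pick_cluster({"busy-cluster": 0.92, "idle-cluster": 0.31})
```

Because no cluster needs to be sized for the whole load, clusters can stay small, which also sidesteps the insufficient-instance problem at creation time.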
29. Agenda
30. Summary
● Common ideas
○ Use a modernized, cloud-based approach
○ API-based operation
○ Start small: many small changes lead to a large impact
31. We are hiring
https://www.treasuredata.com/company/careers/jobs/positions/?job=f6fd040b-c843-4991-bd49-bc674aab9a9e&team=Engineering
32. Thank You!
Danke!
Merci!
谢谢!
ありがとう!
Gracias!
Kiitos!