Confidential © Arm 2017
Automation of Hadoop cluster operations
in Arm Treasure Data
Yan Wang
Arm Treasure Data
March 14, 2019
Who am I?
● Yan Wang (王岩)
● Since May 2018: Arm Treasure Data
Hadoop team, Software Engineer
● Contributing to Hadoop
● Likes Japanese Mahjong
● Blog https://tiana528.github.io/
LukaMe
Agenda
● Hadoop in Arm Treasure Data
● Hadoop Cluster Operation Automation
○ Reduce hadoop cluster creation time significantly
○ Simplify hadoop cluster recreation
○ Modernize instance type of slaves
○ Create patches to fast fail jobs consuming too much disk
○ Simplify incident handling
○ Make it easy to know when to scale out
○ Simplify shutting down nodes
○ Replace chef by debian packaging and Codedeploy
● Future roadmap
● Summary
Arm Treasure Data Product
Customers don’t need to operate hadoop clusters. We do.
Hadoop Usage
[Figure: cluster labelled multi-clouds, very multi-tenancy, permanent storage, HA; cluster structure M, S, S, S; patched hadoop PTD-2.7.3-xxx; operation tool CDH, HDP, Self-developed]
Operation point of view
● Recreate cluster on incident
● The self-developed operation tool is the key point for operation
Improved in the past year
Reduce hadoop cluster creation time significantly
-- by making use of AWS Auto Scaling Group
9 months ago
● Before
○ Environment setup, then create cluster of 100 nodes: launch nodes one by one
○ Too slow
■ Client side: 1 hour
■ Cluster ready: 1 hour
● After
○ Environment setup, then create cluster of 100 nodes: create AWS Auto Scaling Group
○ Much faster
■ Client side: 3 minutes
■ Cluster ready: 15 minutes
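The change from launching nodes one by one to a single Auto Scaling Group request can be sketched as follows. This is a minimal illustration, not the team's tool: the group name suffix, launch template name, and subnet IDs are hypothetical. The dictionary keys follow the parameters of the AWS `CreateAutoScalingGroup` API.

```python
def build_asg_request(cluster_name, node_count, launch_template_name, subnet_ids):
    """Build keyword arguments for one Auto Scaling Group creation call."""
    if node_count < 1:
        raise ValueError("node_count must be positive")
    return {
        # Hypothetical naming scheme, not taken from the slides.
        "AutoScalingGroupName": f"{cluster_name}-slaves",
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template_name,
            "Version": "$Latest",
        },
        # A fixed-size group: AWS launches all nodes in parallel.
        "MinSize": node_count,
        "MaxSize": node_count,
        "DesiredCapacity": node_count,
        # Comma-separated subnet IDs the group may place instances in.
        "VPCZoneIdentifier": ",".join(subnet_ids),
    }


request = build_asg_request(
    "ClusterB", 100, "hadoop-slave", ["subnet-aaa", "subnet-bbb"]
)
print(request["AutoScalingGroupName"], request["DesiredCapacity"])
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("autoscaling").create_auto_scaling_group(**request)`; the client then returns quickly while AWS brings up the nodes.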
General flow of how to recreate a hadoop cluster
● No downtime : A/B switch
○ Start: job server sends jobs to ClusterA
○ create new cluster: ClusterB is created, job server still sends jobs to ClusterA
○ switch traffic: job server sends jobs to ClusterB, ClusterA still exists
○ shutdown old cluster: only ClusterB remains
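The A/B switch can be sketched as a small state model. The class and method names are hypothetical; the point is only the ordering of the three steps and the guard that the cluster receiving traffic is never shut down.

```python
from dataclasses import dataclass, field


@dataclass
class JobServer:
    """Tracks which clusters exist and which one receives job traffic."""

    active: str
    clusters: set = field(default_factory=set)

    def create_cluster(self, name):
        self.clusters.add(name)

    def switch_traffic(self, name):
        if name not in self.clusters:
            raise ValueError(f"{name} does not exist")
        self.active = name

    def shutdown_cluster(self, name):
        # Guard: the old cluster may go away only after traffic has moved.
        if name == self.active:
            raise ValueError(f"{name} still receives traffic")
        self.clusters.discard(name)


server = JobServer(active="ClusterA", clusters={"ClusterA"})
server.create_cluster("ClusterB")    # create new cluster
server.switch_traffic("ClusterB")    # switch traffic
server.shutdown_cluster("ClusterA")  # shutdown old cluster
print(server.active, sorted(server.clusters))
```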
Simplify hadoop cluster recreation
-- by creating our wrapper script of SRE tool
7 months ago
● Before: use SRE team tool directly
service create -S aws -s development -c ClusterB ...
service delete -S aws -s development -c ClusterA ...
○ Issues
■ Too many parameters
■ Stressful to shut down
● After: use our wrapper script (= SRE tool + verification + config)
cluster create ClusterB
cluster delete ClusterA
○ Improved
■ 1 parameter
■ Stress-free to shut down
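A wrapper of this kind might look like the sketch below. The config table and the verification rule are assumptions made for illustration; only the `service` subcommands and the `-S`, `-s`, `-c` flags come from the slide, and the parameters elided there with `...` are left out here as well.

```python
# Hypothetical per-cluster settings; the real config is not shown in the slides.
CLUSTER_CONFIG = {
    "ClusterA": {"cloud": "aws", "stage": "development"},
    "ClusterB": {"cloud": "aws", "stage": "development"},
}


def build_sre_command(action, cluster, active_cluster):
    """Translate `cluster <action> <name>` into the longer SRE tool command."""
    if action not in ("create", "delete"):
        raise ValueError(f"unknown action: {action}")
    if cluster not in CLUSTER_CONFIG:
        raise ValueError(f"unknown cluster: {cluster}")
    # Verification (assumed rule): never delete the cluster serving traffic.
    if action == "delete" and cluster == active_cluster:
        raise ValueError(f"{cluster} is still active")
    cfg = CLUSTER_CONFIG[cluster]
    return ["service", action, "-S", cfg["cloud"], "-s", cfg["stage"], "-c", cluster]


print(" ".join(build_sre_command("create", "ClusterB", active_cluster="ClusterA")))
```

The operator supplies one parameter, the cluster name; everything else is looked up or checked by the wrapper.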
Gained many benefits by changing the instance type of slaves
6 months ago
● Before: c3.8xlarge (very old model)
● After: m5d.12xlarge (latest model)
● Improved
○ Larger per-container memory
○ Larger & faster local disk
○ Lower cost
○ ...
● But …
But… a new issue occurred
● New issue happened
○ Amazon does not have enough m5d instances for on-demand allocation
○ Insufficient instances to do the A/B switch in one availability zone when recreating a cluster.
● Asked Amazon support for help
○ They suggested buying more reserved instances or using other instance types in the interim.
● Other approaches?
Handle the situation of insufficient instances in one AZ
-- by supporting cross AZ environment
● Cross AZ environment
● Keypoint : avoid large network traffic between AZs, which can be expensive.
[Figure: job servers and workers in AZ_1 and AZ_2 with clusters A, B and C, connected by REST API, shown across three steps: create new cluster, switch traffic, shutdown old cluster]
Create patches to fast fail jobs consuming too much disk
4 months ago
● Before
○ Task timeline (0h to 40h): failed, retried, failed, retried, failed, retried, failed; the job fails only after the last attempt
○ Retry is meaningless
● After
○ Task timeline (0h to 40h): the job fails at the first failure
○ Fail fast
We created two patches
For local : MAPREDUCE-7022 Fast fail rogue jobs based on task scratch dir size
For HDFS : MAPREDUCE-7148 Fast fail jobs when exceeds dfs quota limitation
(Disk quota configured)
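The idea behind the local-disk patch can be illustrated with a short sketch. This is not the code of MAPREDUCE-7022, which lives inside Hadoop MapReduce; it only shows the check itself, with a hypothetical limit: measure the task scratch directory and stop retrying once it is over the limit.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def should_fast_fail(scratch_dir, limit_bytes):
    """Return True when the scratch directory has grown past the limit.

    A job failing for this reason would fail again on retry, so the caller
    would fail the whole job instead of scheduling another attempt.
    """
    return dir_size_bytes(scratch_dir) > limit_bytes
```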
Simplify incident handling by creating health check scripts
4 months ago
● Before: runbook
○ Check A, Run command B, Check C, If … else…, Open URL ...
○ When an incident happens: follow a complex runbook during the incident. Need to collect info first.
● After: health check script
○ installed on all nodes
○ check very detailed status
○ datadog metrics, trigger alerts
○ When an incident happens: run the health check during the incident, and know where the issue is.
● Future
○ integrate with Auto Scaling Group health check.
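A health check script in this spirit could be organised as a set of named checks run in one pass. The sketch below is an assumption about structure only; the disk check and its threshold are hypothetical examples, and the actual checks used by the team are not listed in the slides.

```python
import shutil


def check_disk(path="/", min_free_ratio=0.1):
    """Pass when the filesystem holding path has enough free space."""
    usage = shutil.disk_usage(path)
    ratio = usage.free / usage.total
    return ratio >= min_free_ratio, f"free={ratio:.0%}"


def run_health_checks(checks):
    """Run every check and return (name, detail) for each failure."""
    failures = []
    for name, check in checks.items():
        try:
            ok, detail = check()
        except Exception as exc:  # a crashing check is itself a finding
            ok, detail = False, f"check raised {exc!r}"
        if not ok:
            failures.append((name, detail))
    return failures
```

On a node, `run_health_checks({"disk": check_disk})` returns an empty list when everything passes, so the operator sees only the failing items instead of walking through a runbook.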
Easy to know when to scale out
-- by creating capacity metrics based on machine learning
on going (POC)
● Before
○ alert comes
○ manually scale out if having performance issue
○ Issue
■ A little late…
■ Hard for junior engineers to understand
● After
○ Inputs: HDFS put/get latency, price plan & using slots, probe query, HDFS usage, CPU I/O wait
○ linear regression produces capacity metrics
○ Expected improvement
■ Know when to scale out immediately and easily.
● Future plan : use it for auto scale.
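As a rough illustration of a regression-based capacity metric, the sketch below fits a single-feature least-squares line. The slides do not say which inputs feed the model or what it predicts, so the pairing used here (slots in use as the feature, HDFS latency as the target) and the latency limit are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("xs must not all be equal")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def needs_scale_out(slope, intercept, used_slots, latency_limit):
    """Predict latency at the given load and compare it with the limit."""
    return slope * used_slots + intercept > latency_limit


# Hypothetical history: slots in use versus observed latency.
slope, intercept = fit_line([1, 2, 3], [3, 5, 7])
print(needs_scale_out(slope, intercept, used_slots=10, latency_limit=20))
```

A single yes/no signal of this kind is easier to act on than several separate alerts, which matches the slide's goal of making the decision immediate.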
Simplify shutting down slaves
-- by using Auto Scaling Group shutdown hook
on going
● Before
○ shutdown 2 at a time
○ wait for block replication to finish
○ then shutdown 2 more…
○ Issue
■ boring operation
■ potential job retry
● After
○ AWS Auto Scaling Group shutdown hook
○ hadoop node decommission script
○ Expected improvement
■ safe & fast
● Future plan : find a “proper” node to kill
○ e.g. short running tasks
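The hook-driven shutdown can be sketched as: start decommissioning, wait until the node has drained, then let the termination proceed. All callables below are hypothetical stand-ins. On AWS the final step would presumably correspond to completing the lifecycle action of a termination hook, and the decommission step to the team's own script.

```python
import time


def drain_then_terminate(node, start_decommission, is_decommissioned,
                         complete_hook, poll_seconds=30,
                         timeout_seconds=3600, sleep=time.sleep):
    """Decommission a node, wait until done, then release the termination.

    Returns True when the hook was completed, False on timeout (the caller
    then decides what to do with the node).
    """
    start_decommission(node)
    waited = 0
    while not is_decommissioned(node):
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
    complete_hook(node)
    return True
```

Because the wait happens inside the hook, nobody has to shut nodes down in small batches by hand and watch block replication between batches.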
Replace Chef by debian packaging and Codedeploy
on going
● Before: we met many issues using Chef
○ Only ruby
○ Unnecessarily complicated
○ Stateful
○ 15 override rules of attributes
○ Slow
○ Fails silently
○ Dependent on other team’s release cycle
○ two pass model
○ 5 years of adding little by little
○ ...
● After
○ Debian packaging
■ Standard way in Linux
○ AWS Codedeploy
■ Fast and easy to maintain
■ Can be used in other clouds
● Expected improvement
○ Much easier to maintain
○ cluster creation 15 minutes => 5 minutes
Agenda
● Hadoop in Arm Treasure Data
● Hadoop Cluster Operation Automation
● Future roadmap
○ API-based routing and workflow-based hadoop recreation
○ Usage history based account routing
● Summary
API-based routing and workflow-based hadoop recreation
● Before: change routing
○ submit git pull request, review, merge, upload databag, run chef-client on all nodes
○ Issue
■ Very manual
■ depends on manual validation
● After: API-based routing
○ change routing with a REST API call
curl -X PUT .../hadoop_routes -d '{"defauls":"ClusterB"}'
○ Expected improvement
■ Fully automate hadoop cluster recreation through workflow
■ server side validation
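A workflow step that changes routing through the API might be sketched as below. The base URL, the payload key `default`, and the validation rule are all hypothetical; the slide elides the URL prefix and does not describe what the server validates. The request is built but not sent.

```python
import json
import urllib.request


def build_route_change(base_url, payload, known_clusters):
    """Check the routing payload and build (but do not send) the PUT request."""
    # Assumed validation: every routed-to cluster must be a known cluster.
    unknown = [c for c in payload.values() if c not in known_clusters]
    if unknown:
        raise ValueError(f"unknown clusters: {unknown}")
    return urllib.request.Request(
        url=f"{base_url}/hadoop_routes",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_route_change(
    "https://jobserver.example",      # placeholder host
    {"default": "ClusterB"},          # placeholder payload key
    {"ClusterA", "ClusterB"},
)
print(request.get_method(), request.full_url)
```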
Usage history based account routing
● Before
○ Fixed routing: job server routes accounts to a busy cluster and an idle cluster; resource not fully utilized
○ Fixed size: big cluster; easy to meet insufficient instance issue when creating big cluster
● After
○ Dynamic routing: more accounts to idle cluster; resource utilization increase
○ Dynamic account routing: easy to split cluster when instances are insufficient (smaller cluster1 and smaller cluster2 across AZ_1 and AZ_2)
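One simple way to route accounts by usage history is greedy balancing: take the heaviest accounts first and send each to the currently least-loaded cluster. This is only an illustrative heuristic under assumed inputs (one usage number per account); the slides do not describe the actual routing algorithm.

```python
def route_accounts(usage_by_account, clusters):
    """Assign accounts to clusters, heaviest first, least-loaded cluster wins."""
    if not clusters:
        raise ValueError("need at least one cluster")
    load = {c: 0.0 for c in clusters}
    routes = {}
    # Sort by descending usage; account name breaks ties deterministically.
    ordered = sorted(usage_by_account.items(), key=lambda kv: (-kv[1], kv[0]))
    for account, usage in ordered:
        target = min(clusters, key=lambda c: load[c])
        routes[account] = target
        load[target] += usage
    return routes, load


routes, load = route_accounts(
    {"a": 8, "b": 5, "c": 4, "d": 3}, ["cluster1", "cluster2"]
)
print(routes, load)
```

The same function works unchanged when a big cluster is split into several smaller ones: only the `clusters` list grows.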
Summary
● Common idea
○ Use a modern cloud-based approach
○ API-based operation
○ Start small: many small changes lead to a large impact
We are hiring
https://www.treasuredata.com/company/careers/jobs/positions/?job=f6fd040b-c843-4991-bd49-bc674aab9a9e&team=Engineering
Thank You!
Danke!
Merci!
谢谢!
ありがとう!
Gracias!
Kiitos!

Contenu connexe

Tendances

Rigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance MeasurementRigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance Measurement
DataWorks Summit
 
Hug Hbase Presentation.
Hug Hbase Presentation.Hug Hbase Presentation.
Hug Hbase Presentation.
Jack Levin
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
larsgeorge
 

Tendances (20)

Improving HDFS Availability with Hadoop RPC Quality of Service
Improving HDFS Availability with Hadoop RPC Quality of ServiceImproving HDFS Availability with Hadoop RPC Quality of Service
Improving HDFS Availability with Hadoop RPC Quality of Service
 
Improving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationImproving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux Configuration
 
Zero-downtime Hadoop/HBase Cross-datacenter Migration
Zero-downtime Hadoop/HBase Cross-datacenter MigrationZero-downtime Hadoop/HBase Cross-datacenter Migration
Zero-downtime Hadoop/HBase Cross-datacenter Migration
 
Maintainable cloud architecture_of_hadoop
Maintainable cloud architecture_of_hadoopMaintainable cloud architecture_of_hadoop
Maintainable cloud architecture_of_hadoop
 
Postgres in Amazon RDS
Postgres in Amazon RDSPostgres in Amazon RDS
Postgres in Amazon RDS
 
Rigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance MeasurementRigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance Measurement
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ Pinterest
 
HBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and SparkHBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and Spark
 
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
 
Improving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxImproving Hadoop Performance via Linux
Improving Hadoop Performance via Linux
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
 
Interactive Hadoop via Flash and Memory
Interactive Hadoop via Flash and MemoryInteractive Hadoop via Flash and Memory
Interactive Hadoop via Flash and Memory
 
HBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ Salesforce
 
Hug Hbase Presentation.
Hug Hbase Presentation.Hug Hbase Presentation.
Hug Hbase Presentation.
 
Apache HBase, Accelerated: In-Memory Flush and Compaction
Apache HBase, Accelerated: In-Memory Flush and Compaction Apache HBase, Accelerated: In-Memory Flush and Compaction
Apache HBase, Accelerated: In-Memory Flush and Compaction
 
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaHBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
 
Time-Series Apache HBase
Time-Series Apache HBaseTime-Series Apache HBase
Time-Series Apache HBase
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
 

Similaire à Automation of Hadoop cluster operations in Arm Treasure Data

Flume-based Independent News Aggregator
Flume-based Independent News AggregatorFlume-based Independent News Aggregator
Flume-based Independent News Aggregator
Mário Almeida
 

Similaire à Automation of Hadoop cluster operations in Arm Treasure Data (20)

StripeEu Twistedbytes Presentation
StripeEu Twistedbytes PresentationStripeEu Twistedbytes Presentation
StripeEu Twistedbytes Presentation
 
Running Dataproc At Scale in production - Searce Talk at GDG Delhi
Running Dataproc At Scale in production - Searce Talk at GDG DelhiRunning Dataproc At Scale in production - Searce Talk at GDG Delhi
Running Dataproc At Scale in production - Searce Talk at GDG Delhi
 
State of serverless
State of serverlessState of serverless
State of serverless
 
Writing and deploying serverless python applications
Writing and deploying serverless python applicationsWriting and deploying serverless python applications
Writing and deploying serverless python applications
 
Flume-based Independent News Aggregator
Flume-based Independent News AggregatorFlume-based Independent News Aggregator
Flume-based Independent News Aggregator
 
PyConIE 2017 Writing and deploying serverless python applications
PyConIE 2017 Writing and deploying serverless python applicationsPyConIE 2017 Writing and deploying serverless python applications
PyConIE 2017 Writing and deploying serverless python applications
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Ceph data services in a multi- and hybrid cloud world
Ceph data services in a multi- and hybrid cloud worldCeph data services in a multi- and hybrid cloud world
Ceph data services in a multi- and hybrid cloud world
 
Serverless Apps on Google Cloud: more dev, less ops
Serverless Apps on Google Cloud:  more dev, less opsServerless Apps on Google Cloud:  more dev, less ops
Serverless Apps on Google Cloud: more dev, less ops
 
Serverless Apps on Google Cloud: more dev, less ops
Serverless Apps on Google Cloud: more dev, less opsServerless Apps on Google Cloud: more dev, less ops
Serverless Apps on Google Cloud: more dev, less ops
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
[BarCamp2018][20180915][Tips for Virtual Hosting on Kubernetes]
[BarCamp2018][20180915][Tips for Virtual Hosting on Kubernetes][BarCamp2018][20180915][Tips for Virtual Hosting on Kubernetes]
[BarCamp2018][20180915][Tips for Virtual Hosting on Kubernetes]
 
Scaling Redis: Dmitry Polyakovsky
Scaling Redis: Dmitry PolyakovskyScaling Redis: Dmitry Polyakovsky
Scaling Redis: Dmitry Polyakovsky
 
Truemotion Adventures in Containerization
Truemotion Adventures in ContainerizationTruemotion Adventures in Containerization
Truemotion Adventures in Containerization
 
Effectively deploying hadoop to the cloud
Effectively  deploying hadoop to the cloudEffectively  deploying hadoop to the cloud
Effectively deploying hadoop to the cloud
 
Embracing Serverless with Google
Embracing Serverless with GoogleEmbracing Serverless with Google
Embracing Serverless with Google
 
Embracing Serverless with Google
Embracing Serverless with GoogleEmbracing Serverless with Google
Embracing Serverless with Google
 
Scalable complex event processing on samza @UBER
Scalable complex event processing on samza @UBERScalable complex event processing on samza @UBER
Scalable complex event processing on samza @UBER
 
LINE's Private Cloud - Meet Cloud Native World
LINE's Private Cloud - Meet Cloud Native WorldLINE's Private Cloud - Meet Cloud Native World
LINE's Private Cloud - Meet Cloud Native World
 
Webinar: Building a multi-cloud Kubernetes storage on GitLab
Webinar: Building a multi-cloud Kubernetes storage on GitLabWebinar: Building a multi-cloud Kubernetes storage on GitLab
Webinar: Building a multi-cloud Kubernetes storage on GitLab
 

Dernier

Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
amilabibi1
 
If this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaIf this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New Nigeria
Kayode Fayemi
 
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
raffaeleoman
 
Uncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoUncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac Folorunso
Kayode Fayemi
 

Dernier (18)

Digital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of DrupalDigital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of Drupal
 
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
 
Busty Desi⚡Call Girls in Sector 51 Noida Escorts >༒8448380779 Escort Service-...
Busty Desi⚡Call Girls in Sector 51 Noida Escorts >༒8448380779 Escort Service-...Busty Desi⚡Call Girls in Sector 51 Noida Escorts >༒8448380779 Escort Service-...
Busty Desi⚡Call Girls in Sector 51 Noida Escorts >༒8448380779 Escort Service-...
 
Dreaming Marissa Sánchez Music Video Treatment
Dreaming Marissa Sánchez Music Video TreatmentDreaming Marissa Sánchez Music Video Treatment
Dreaming Marissa Sánchez Music Video Treatment
 
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdfThe workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
 
If this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaIf this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New Nigeria
 
Dreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio IIIDreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio III
 
Report Writing Webinar Training
Report Writing Webinar TrainingReport Writing Webinar Training
Report Writing Webinar Training
 
Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510
 
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
 
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdfAWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
 
lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.
 
Uncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoUncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac Folorunso
 
Sector 62, Noida Call girls :8448380779 Noida Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Noida Escorts | 100% verifiedSector 62, Noida Call girls :8448380779 Noida Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Noida Escorts | 100% verified
 
My Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle BaileyMy Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle Bailey
 
ICT role in 21st century education and it's challenges.pdf
ICT role in 21st century education and it's challenges.pdfICT role in 21st century education and it's challenges.pdf
ICT role in 21st century education and it's challenges.pdf
 
Causes of poverty in France presentation.pptx
Causes of poverty in France presentation.pptxCauses of poverty in France presentation.pptx
Causes of poverty in France presentation.pptx
 
Aesthetic Colaba Mumbai Cst Call girls 📞 7738631006 Grant road Call Girls ❤️-...
Aesthetic Colaba Mumbai Cst Call girls 📞 7738631006 Grant road Call Girls ❤️-...Aesthetic Colaba Mumbai Cst Call girls 📞 7738631006 Grant road Call Girls ❤️-...
Aesthetic Colaba Mumbai Cst Call girls 📞 7738631006 Grant road Call Girls ❤️-...
 

Automation of Hadoop cluster operations in Arm Treasure Data

  • 1. Confidential © Arm 2017 Automation of Hadoop cluster operations in Arm Treasure Data Yan Wang Arm Treasure Data March 14, 2019
  • 2. Confidential © Arm 20172 Who am I? ● Yan Wang (王岩) ● May 2018 〜 Arm Treasure Data Hadoop team, Software Engineer ● Contributing hadoop ● Like Japanese Mahjong ● Blog https://tiana528.github.io/ LukaMe
  • 3. Confidential © Arm 20173 Agenda ● Hadoop in Arm Treasure Data ● Hadoop Cluster Operation Automation ○ Reduce hadoop cluster creation time significantly ○ Simplify hadoop cluster recreation ○ Modernize instance type of slaves ○ Create patches to fast fail jobs consuming too much disk ○ Simplify incident handling ○ Make it easy to know when to scale out ○ Simplify shutting down nodes ○ Replace chef by debian packaging and Codedeploy ● Future roadmap ● Summary
  • 4. Confidential © Arm 20174 Arm Treasure Data Product Customers don’t need to operate hadoop clusters. We do.
  • 5. Confidential © Arm 20175 Hadoop Usage multi-clouds Cluster very multi-tenancy permanent storage HA M S S S cluster structure patched hadoop PTD-2.7.3-xxx operation tool CDH HDP Self-developed Operation point of view ● Recreate cluster on incident ● Self-developed operation tool is key point for operation Improved in the past year
  • 6. Confidential © Arm 20176 Agenda ● Hadoop in Arm Treasure Data ● Hadoop Cluster Operation Automation ○ Reduce hadoop cluster creation time significantly ○ Simplify hadoop cluster recreation ○ Modernize instance type of slaves ○ Create patches to fast fail jobs consuming too much disk ○ Simplify incident handling ○ Make it easy to know when to scale out ○ Simplify shutting down nodes ○ Replace chef by debian packaging and Codedeploy ● Future roadmap ● Summary
  • 7. Confidential © Arm 20177 Reduce hadoop cluster creation time significantly -- by making use of AWS Auto Scaling Group ● Before Environment Setup Create cluster of 100 nodes launch nodes one by one ● Too slow ○ Client side ■ 1 hour ○ Cluster ready ■ 1 hour Environment Setup create AWS Auto Scaling Group ● Much faster ○ Client side ■ 3 minutes ○ Cluster ready ■ 15 minutes ● After Create cluster of 100 nodes 9 months ago
  • 8. Confidential © Arm 20178 Agenda ● Hadoop in Arm Treasure Data ● Hadoop Cluster Operation Automation ○ Reduce hadoop cluster creation time significantly ○ Simplify hadoop cluster recreation ○ Modernize instance type of slaves ○ Create patches to fast fail jobs consuming too much disk ○ Simplify incident handling ○ Make it easy to know when to scale out ○ Simplify shutting down nodes ○ Replace chef by debian packaging and Codedeploy ● Future roadmap ● Summary
  • 9. Confidential © Arm 20179 General flow of how to recreate a hadoop cluster ● No downtime : A/B switch ClusterA job server ClusterA job server ClusterB ClusterA job server ClusterB ClusterB job server create new cluster switch traffic shutdown old cluster
  • 10. Confidential © Arm 201710 Simplify hadoop cluster recreation -- by creating our wrapper script of SRE tool ● Issues ○ Too many parameters ○ Stressful to shutdown 7 months ago ● Before ● After service create -S aws -s development -c ClusterB ... service delete -S aws -s development -c ClusterA ... cluster create ClusterB cluster delete ClusterA ● Improved ○ 1 parameter ○ Stressless to shutdown Use SRE team tool directly Use our wrapper script = SRE tool + verification + config
  • 11. Confidential © Arm 201711 Agenda ● Hadoop in Arm Treasure Data ● Hadoop Cluster Operation Automation ○ Reduce hadoop cluster creation time significantly ○ Simplify hadoop cluster recreation ○ Modernize instance type of slaves ○ Create patches to fast fail jobs consuming too much disk ○ Simplify incident handling ○ Make it easy to know when to scale out ○ Simplify shutting down nodes ○ Replace chef by debian packaging and Codedeploy ● Future roadmap ● Summary
  • 12. Confidential © Arm 201712 Gained a lot of merits by changing instance type of slaves c3.8xlarge Very old model 6 months ago ● Before ● After m5d.12xlarge Latest model ● Improved ○ Larger per container memory ○ Larger & faster local disk ○ Lower cost ○ ... ● But …
  • 13. Confidential © Arm 201713 But… new issue occured ● New issue happened ○ Amazon don’t have so many m5d instances for on-demand allocation ○ Insufficient instances to do A/B switch in one availability zone when recreate a cluster. ● Ask Amazon support for help ○ They suggest us buying more reserved instances or use other instance types intermediately. ● Other approaches?
  • 14. Confidential © Arm 201714 Handle the situation of insufficient instances in one AZ -- by supporting cross AZ environment ● Cross AZ environment C job server ● Keypoint : no large network traffic between AZs which can be expensive. worker AZ_1 AZ_2 job server worker AZ_1 AZ_2 job server job server worker AZ_1 AZ_2 job server job server worker AZ_1 AZ_2 job server A CA B CA B C B REST API REST API create new cluster switch traffic shutdown old cluster
  • 15. Confidential © Arm 201715 Agenda ● Hadoop in Arm Treasure Data ● Hadoop Cluster Operation Automation ○ Reduce hadoop cluster creation time significantly ○ Simplify hadoop cluster recreation ○ Modernize instance type of slaves ○ Create patches to fast fail jobs consuming too much disk ○ Simplify incident handling ○ Make it easy to know when to scale out ○ Simplify shutting down nodes ○ Replace chef by debian packaging and Codedeploy ● Future roadmap ● Summary
  • 16. Confidential © Arm 201716 Create patches to fast fail jobs consuming too much disk task timeline 0h 10h 20h 30h 40h job fail here ● Before ● After failed retried We created two patches For local : MAPREDUCE-7022 Fast fail rogue jobs based on task scratch dir size For HDFS : MAPREDUCE-7148 Fast fail jobs when exceeds dfs quota limitation (Disk quota configured) failed retried failed retried failed Retry is meaningless task timeline 0h 10h 20h 30h 40h job fail here failed 4 months ago Fail fast
  • 17. Confidential © Arm 201717 Agenda ● Hadoop in Arm Treasure Data ● Hadoop Cluster Operation Automation ○ Reduce hadoop cluster creation time significantly ○ Simplify hadoop cluster recreation ○ Modernize instance type of slaves ○ Create patches to fast fail jobs consuming too much disk ○ Simplify incident handling ○ Make it easy to know when to scale out ○ Simplify shutting down nodes ○ Replace chef by debian packaging and Codedeploy ● Future roadmap ● Summary
  • 18. Simplify incident handling by creating health check scripts ● Before: runbook (Check A, Run command B, Check C, If … else…, Open URL ...) ● When an incident happens ○ Follow a complex runbook during the incident. Need to collect info first. ● After (4 months ago): health check script ○ installed on all nodes ○ checks very detailed status ○ datadog metrics trigger alerts ● When an incident happens ○ Run the health check during the incident and see where the issue is. ● Future ○ Integrate with Auto Scaling Group health check.
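A health check script of this kind can be sketched as a list of named checks that are all run and reported together, replacing the runbook's step-by-step "check A, then run B". The talk does not show the actual script; the function names and the disk check below are illustrative assumptions.

```python
import shutil
from typing import Callable, Dict, List, Tuple

# A check returns (ok, detail). The names used as keys are hypothetical.
Check = Callable[[], Tuple[bool, str]]


def run_health_checks(checks: Dict[str, Check]) -> List[str]:
    """Run every check and return one 'name: detail' line per failing check."""
    problems = []
    for name, check in checks.items():
        try:
            ok, detail = check()
        except Exception as exc:
            # A check that crashes is itself reported as a problem.
            ok, detail = False, f"check raised {exc!r}"
        if not ok:
            problems.append(f"{name}: {detail}")
    return problems


def disk_usage_check(path: str = "/", max_used_ratio: float = 0.9) -> Tuple[bool, str]:
    """Example check: fail when the filesystem holding path is too full."""
    usage = shutil.disk_usage(path)
    ratio = usage.used / usage.total
    return ratio <= max_used_ratio, f"{ratio:.0%} used on {path}"
```

Calling `run_health_checks({"disk": disk_usage_check})` on a node returns an empty list when everything passes, so the operator sees only what is wrong.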
  • 20. Easy to know when to scale out -- by creating capacity metrics based on machine learning (ongoing, POC) ● Before: an alert comes, then manually scale out if there is a performance issue ● Issue ○ A little late… ○ Hard for junior engineers to understand ● After: HDFS put/get latency, price plan & slots in use, probe query, HDFS usage and CPU I/O wait are fed into a linear regression that produces capacity metrics ● Expected improvement ○ Know when to scale out immediately and easily. ● Future plan: use it for auto scaling.
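To make the linear regression step concrete, here is a minimal sketch that fits a straight line to one signal and estimates how long until it crosses a threshold. The slide lists several inputs and does not describe the actual model, so the single-variable fit, the function names and the threshold are all assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_until_threshold(days, usage, threshold):
    """Estimate days from the last sample until usage reaches threshold.

    Returns None when the trend is flat or decreasing.
    """
    slope, intercept = fit_line(days, usage)
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - days[-1])


# Hypothetical HDFS usage ratios sampled once per day.
remaining = days_until_threshold([0, 1, 2, 3], [0.50, 0.55, 0.60, 0.65], 0.80)
```

A capacity metric in this spirit turns "an alert fired, is it serious?" into a single number such as "about three days of headroom left".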
  • 22. Simplify shutting down slaves -- by using Auto Scaling Group shutdown hook ● Before: shut down 2 at a time, wait for block replication to finish, then shut down 2 more… ● Issue ○ boring operation ○ potential job retry ● After (ongoing): AWS Auto Scaling Group shutdown hook runs the hadoop node decommission script ● Expected improvement ○ safe & fast ● Future plan: find a “proper” node to kill ○ e.g. a node with short running tasks
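The hook-driven shutdown can be sketched as follows, assuming the "shutdown hook" on the slide refers to an Auto Scaling lifecycle hook on terminating instances. The `decommission` callable stands in for the hadoop node decommission script and is hypothetical; `asg_client` is expected to behave like a boto3 Auto Scaling client, whose `complete_lifecycle_action` call is used with its documented keyword arguments.

```python
def handle_termination(instance_id, decommission, asg_client, group_name, hook_name):
    """Decommission a node, then let Auto Scaling finish terminating it.

    decommission: hypothetical callable that blocks until the node has been
        drained (for example, until block replication has finished).
    asg_client: an object exposing complete_lifecycle_action, such as a
        boto3 "autoscaling" client.
    """
    # While the lifecycle hook is pending the instance is kept alive,
    # so the drain can run to completion before anything is killed.
    decommission(instance_id)

    # Tell Auto Scaling the hook's work is done so termination proceeds.
    asg_client.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=group_name,
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )
    return "CONTINUE"
```

If `decommission` raises, the hook is never completed here and Auto Scaling falls back to the hook's own timeout and default result, which is one possible design choice rather than the one the team necessarily made.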
  • 24. Replace Chef by debian packaging and Codedeploy ● Before: we met many issues using Chef ○ Ruby only ○ Unnecessarily complicated ○ Stateful ○ 15 override rules for attributes ○ Slow ○ Fails silently ○ Dependent on another team’s release cycle ○ Two-pass model ○ 5 years of additions, little by little ○ ... ● After (ongoing): Debian packaging ○ Standard way in Linux; AWS Codedeploy ○ Fast and easy to maintain ○ Can be used in other clouds ● Expected improvement ○ Much easier to maintain ○ cluster creation 15 minutes => 5 minutes
  • 25. Agenda ● Hadoop in Arm Treasure Data ● Hadoop Cluster Operation Automation ● Future roadmap ○ API-based routing and workflow-based hadoop recreation ○ Usage history based account routing ● Summary
  • 26. API-based routing and workflow-based hadoop recreation ● Before: change routing by: submit git pull request, review, merge, upload databag, run chef-client on all nodes ● Issue ○ Very manual ○ Depends on manual validation ● After: API-based routing; change routing with a REST API call: curl -X PUT .../hadoop_routes -d '{"defauls":"ClusterB"}' ● Expected improvement ○ Fully automate hadoop cluster recreation through a workflow ○ Server side validation
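A workflow built on the routing API might look like the sketch below. The talk does not show the workflow itself, so the step functions (`create`, `validate`, `switch`, `shutdown`) are hypothetical callables, and the endpoint URL and payload key are left to the caller because the slide elides the URL.

```python
import json
import urllib.request


def build_route_request(routes_url, payload):
    """Build (but do not send) a PUT request like the slide's curl call."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(routes_url, data=body, method="PUT")


def recreate_cluster(create, validate, switch, shutdown, old, new):
    """Create a new cluster, validate it, switch routing, retire the old one.

    Returns the name of the cluster that serves traffic afterwards.
    """
    create(new)
    if not validate(new):
        # Validation failed: keep traffic on the old cluster and clean up.
        shutdown(new)
        return old
    switch(new)    # for example, send build_route_request(...) with urlopen
    shutdown(old)
    return new
```

Keeping the routing change as a single API call is what lets the whole sequence run unattended, with validation deciding whether the switch happens at all.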
  • 28. Usage history based account routing ● Before: fixed routing, fixed size ○ The job server routes to a busy cluster and an idle cluster; resources are not fully utilized ○ Creating a big cluster easily runs into the insufficient instance issue ● After: dynamic account routing ○ The job server routes more accounts to the idle cluster; resource utilization increases ○ Easy to split a cluster into smaller clusters when instances are insufficient (smaller cluster1, smaller cluster2; AZ_1, AZ_2)
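One way to picture dynamic account routing is a greedy assignment that places each account on the currently least loaded cluster, using past usage as the load estimate. The talk does not describe the actual algorithm; the function name, the usage numbers and the capacities below are illustrative assumptions.

```python
def route_accounts(account_usage, clusters):
    """Assign accounts to clusters based on historical usage.

    account_usage: {account: historical usage, in any consistent unit}
    clusters: {cluster name: capacity in the same unit}
    Heaviest accounts are placed first, each on the cluster with the
    lowest load-to-capacity ratio (ties broken by cluster name).
    """
    load = {name: 0.0 for name in clusters}
    routing = {}
    by_usage = sorted(account_usage.items(), key=lambda kv: (-kv[1], kv[0]))
    for account, usage in by_usage:
        target = min(clusters, key=lambda name: (load[name] / clusters[name], name))
        routing[account] = target
        load[target] += usage
    return routing


# Hypothetical usage history and two equally sized clusters.
routing = route_accounts(
    {"a": 50, "b": 30, "c": 20, "d": 10},
    {"cluster1": 100, "cluster2": 100},
)
```

Because capacity is a parameter, the same routine also covers the split case on the slide: two smaller clusters simply appear as two entries with smaller capacities.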
  • 30. Summary ● Common ideas ○ Use a modernized, cloud-based approach ○ API-based operation ○ Start small: many small changes lead to a large impact
  • 31. We are hiring https://www.treasuredata.com/company/careers/jobs/positions/?job=f6fd040b-c843-4991-bd49-bc674aab9a9e&team=Engineering
  • 32. Thank You! Danke! Merci! 谢谢! ありがとう! Gracias! Kiitos!