Debugging Apache Hadoop YARN Cluster in Production
Jian He, Junping Du and Xuan Gong
Hortonworks YARN Team
06/30/2016
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Who We Are
 Junping Du
– Apache Hadoop Committer and PMC Member
– Dev Lead on the Hortonworks YARN team
 Xuan Gong
– Apache Hadoop Committer and PMC Member
– Software Engineer
 Jian He
– Apache Hadoop Committer and PMC Member
– Staff Software Engineer
Today’s Agenda
 YARN in a Nutshell
 Trouble-shooting Process and Tools
 Case Study
 Enhanced YARN Log Tool Demo
 Summary and Future
Agenda
YARN in a Nutshell
YARN Architecture
 ResourceManager
 NodeManager
 ApplicationMaster
 Other daemons:
– Application History/Timeline Server
– Job History Server (for MR only)
– Proxy Server
– Etc.
RM and NM in a nutshell
Agenda
YARN in a Nutshell
Trouble-shooting Process and Tools
“Troubles” that trigger a troubleshooting effort on a YARN cluster
 Applications failed
 Applications hang or run slowly
 YARN configuration doesn’t work as expected
 YARN APIs (CLI, web services, etc.) don’t work
 YARN daemons crashed (OOM, etc.)
 YARN daemon logs contain errors/warnings
 YARN cluster monitoring tools (like Ambari) raise alerts
[Chart: Problem Type Distribution, with categories Configuration, Executing Jobs, Cluster Administration, Installation, Application Development, Performance]
Process: Phenomenon -> Root Cause -> Solution
 Phenomenon:
– Application failed
 Root cause:
– Container launch failures
• Classpath issue
• Resource localization failures
– Too many attempt failures
• Network connection issue
• NM disk issues
• AM failure caused by a node restart
– Application logic issue
• Container failed with OOM, etc.
– Security issue
• Token-related issues
 Solution:
– Infrastructure/hardware issue
• Replace disks
• Fix network
– Misconfiguration
• Fix configuration
• Enhance documentation
– Setup issue
• Fix setup
• Restart services
– Application issue
• Update application
• Work around it
– A YARN bug
• Report/fix it in the Apache community!
Iceberg of troubleshooting – Case Study
 "java.lang.RuntimeException: java.io.FileNotFoundException: /etc/hadoop/2.3.4.0-3485/0/core-site.xml (Too many open files in system)"
 This was actually caused by too many open TCP connections
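One way to check whether a "Too many open files" error is really a connection problem is to count established TCP connections by remote port. The sketch below parses text in the shape printed by `ss -tn`; the helper name `count_by_peer_port` and the sample addresses are illustrative assumptions, not data from the cluster in this case. Port 50010 is the default DataNode data-transfer port in Hadoop 2.x.

```python
from collections import Counter

def count_by_peer_port(ss_output: str) -> Counter:
    """Count TCP connections per remote port from `ss -tn`-style text."""
    counts = Counter()
    for line in ss_output.splitlines():
        parts = line.split()
        # Skip blank lines and the header row (which starts with "State").
        if len(parts) < 5 or parts[0] == "State":
            continue
        peer = parts[4]                    # "address:port" of the remote end
        port = peer.rsplit(":", 1)[-1]
        counts[port] += 1
    return counts

# Made-up sample: three connections to port 50010 and one to port 8031.
SAMPLE = """\
State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
ESTAB  0       0       10.0.0.5:41000      10.0.0.9:50010
ESTAB  0       0       10.0.0.5:41002      10.0.0.8:50010
ESTAB  0       0       10.0.0.5:41004      10.0.0.7:50010
ESTAB  0       0       10.0.0.5:41006      10.0.0.2:8031
"""

print(count_by_peer_port(SAMPLE).most_common())
```

A count that keeps growing for one remote port between two samples points at the service holding those connections.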
Iceberg of troubleshooting – Digging Deeper
 Most connections are from the local NM to DNs
– LogAggregationService
– ResourceLocalizationService
 We found that the root cause was a thread leak in the NM LogAggregationService:
– YARN-4697: NM aggregation thread pool is not bound by limits
– YARN-4325: Purge app state from NM state-store should cover more LOG_HANDLING cases
– YARN-4984: LogAggregationService shouldn't swallow exception in handling createAppDir() which cause thread leak.
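A thread leak like this usually shows up in a jstack dump as one thread-name prefix repeated many times. The sketch below groups thread headers by name with the trailing number stripped; the helper name and the thread names in the sample are made up for illustration and are not taken from a real NodeManager dump.

```python
import re
from collections import Counter

# A jstack thread header line starts with the quoted thread name.
THREAD_HEADER = re.compile(r'^"([^"]+)"')

def count_thread_groups(jstack_text: str) -> Counter:
    """Group threads by name with trailing numbers stripped, and count them."""
    groups = Counter()
    for line in jstack_text.splitlines():
        match = THREAD_HEADER.match(line)
        if match:
            # "Worker #12" and "Worker #13" both fall into the group "Worker".
            group = re.sub(r"[\s#-]*\d+$", "", match.group(1))
            groups[group] += 1
    return groups

# Made-up sample with hypothetical thread names.
SAMPLE_DUMP = '''\
"main" #1 prio=5 os_prio=0 tid=0x01 nid=0x10 runnable
"LogAggregationService #0" #41 prio=5 os_prio=0 tid=0x02 nid=0x11 waiting on condition
"LogAggregationService #1" #42 prio=5 os_prio=0 tid=0x03 nid=0x12 waiting on condition
"LogAggregationService #2" #43 prio=5 os_prio=0 tid=0x04 nid=0x13 waiting on condition
'''

print(count_thread_groups(SAMPLE_DUMP).most_common(5))
```

Comparing the top groups from two dumps taken some time apart shows which pool is growing.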
Lessons Learned for Troubleshooting on a Production Cluster
 What is meant by a “production” cluster?
– Cannot afford to stop/restart the cluster for troubleshooting
– Most operations on the cluster are “read only”
– In a fenced network, debugging is done remotely through the local cluster admin
 Lessons learned:
1. Gather as much related info as you can (screenshots, log files, jstack output, heap dumps, etc.)
2. Work closely with the end user to gain an understanding of the issue and symptoms
3. Set up a knowledge base to compare against previous cases
4. If possible, reproduce the issue on a test/backup cluster, where it is easier to troubleshoot and verify
5. Version your configuration!
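Once configuration is versioned, two revisions of a file such as yarn-site.xml can be compared property by property. This is a minimal sketch that assumes the standard Hadoop `<configuration>/<property>/<name>/<value>` layout; the helper names and the sample values are illustrative.

```python
import xml.etree.ElementTree as ET

def load_properties(xml_text: str) -> dict:
    """Read name/value pairs from Hadoop-style <configuration> XML."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props

def diff_properties(old: dict, new: dict) -> dict:
    """Return {name: (old_value, new_value)} for every property that differs."""
    changed = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changed[name] = (old.get(name), new.get(name))
    return changed

# Made-up revisions of a yarn-site.xml fragment.
OLD = """<configuration>
  <property><name>yarn.nodemanager.resource.memory-mb</name><value>8192</value></property>
  <property><name>yarn.log-aggregation-enable</name><value>true</value></property>
</configuration>"""
NEW = """<configuration>
  <property><name>yarn.nodemanager.resource.memory-mb</name><value>16384</value></property>
  <property><name>yarn.log-aggregation-enable</name><value>true</value></property>
</configuration>"""

print(diff_properties(load_properties(OLD), load_properties(NEW)))
```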
Handy Tools for YARN Troubleshooting
 Log
 UI
 Historic Info
– JobHistoryServer (for MR only)
– Application Timeline Service (v1, v1.5, v2.0)
 Monitoring tools, such as Ambari
 Runtime info
– Memory dump
– jstack
– System Metrics
Log
 Log CLI
– yarn logs -applicationId <application ID> [OPTIONS]
– Discussed in more detail later
 Enable debug logging
– When daemons are NOT running
• Put log level settings such as export YARN_ROOT_LOGGER="DEBUG,console" in yarn-env.sh
• Start the daemons
– When daemons are running
• Change the log level dynamically via the daemon’s logLevel UI/CLI
• CLI:
– yarn daemonlog [-getlevel <host:httpPort> <classname>]
– yarn daemonlog [-setlevel <host:httpPort> <classname> <level>]
– For the YARN client side
• Similar settings as when daemons are not running
Runtime Log Level settings in YARN UI
 RM: http://<rm_addr>:8088/logLevel
 NM: http://<nm_addr>:8042/logLevel
 ATS: http://<ats_addr>:8188/logLevel
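The logLevel pages can also be addressed directly with query parameters, which is handy for scripting. The sketch below only builds the URL; it assumes the servlet reads the parameters `log` (logger name) and `level`, and the host name is a placeholder. Verify the parameter names against your Hadoop version before relying on them.

```python
from urllib.parse import urlencode

def log_level_url(host, http_port, logger=None, level=None):
    """Build a logLevel URL; with no logger it is just the UI page."""
    base = f"http://{host}:{http_port}/logLevel"
    params = {}
    if logger:
        params["log"] = logger      # assumed parameter name for the logger
    if level:
        params["level"] = level     # assumed parameter name for the new level
    return f"{base}?{urlencode(params)}" if params else base

# Placeholder host; 8088 is the RM web port used on this slide.
print(log_level_url("rm.example.com", 8088,
                    "org.apache.hadoop.yarn.server.resourcemanager", "DEBUG"))
```

Passing only a logger name asks for the current level; adding a level changes it.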
UI (Ambari and YARN)
Job History Server
Memory dump analysis
Hadoop metrics
 RPC metrics
– RpcQueueTimeAvgTime
– ReceivedBytes
…
 JVM metrics
– MemHeapUsedM
– ThreadsBlocked
…
 Documentation:
– http://s.apache.org/UwSu
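Hadoop daemons expose these metrics as JSON under the `/jmx` path of their HTTP port, as a list of beans. The sketch below picks selected metrics out of such a document; the helper name, the bean name in the sample, and the values are illustrative assumptions, so check the bean names your daemons actually report.

```python
import json

def pick_metrics(jmx_json: str, bean_suffix: str, names: list) -> dict:
    """Return the requested metrics from the first bean whose name ends with bean_suffix."""
    doc = json.loads(jmx_json)
    for bean in doc.get("beans", []):
        if bean.get("name", "").endswith(bean_suffix):
            return {n: bean[n] for n in names if n in bean}
    return {}

# Made-up sample in the shape of a /jmx response.
SAMPLE_JMX = json.dumps({
    "beans": [
        {"name": "Hadoop:service=ResourceManager,name=JvmMetrics",
         "MemHeapUsedM": 512.5, "ThreadsBlocked": 3},
    ]
})

print(pick_metrics(SAMPLE_JMX, "name=JvmMetrics", ["MemHeapUsedM", "ThreadsBlocked"]))
```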
YARN top
 A top-like command-line view of application stats and queue stats
Agenda
YARN in a Nutshell
Trouble-shooting Process and Tools
Case Study
Why is my job hung?
 A job can be stuck in one of three states:
NEW_SAVING: waiting for the app to be persisted in the state-store
- Connection error with the state-store (ZooKeeper, etc.)
ACCEPTED: waiting for the ApplicationMaster container to be allocated
- Low maximum-am-resource-percent config (yarn.scheduler.capacity.maximum-am-resource-percent)
RUNNING: waiting for containers to be allocated?
- Are there resources available for the app?
- Otherwise, an application-land issue, e.g. stuck on socket read/write
[Diagram: app states]
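The three states can be turned into a quick triage pass over the application list. The sketch below takes a parsed response in the shape returned by the ResourceManager REST endpoint `/ws/v1/cluster/apps` and attaches a first thing to check to each app in one of those states; the helper name, the hint wording, and the sample applications are illustrative.

```python
# First thing to check for each state named on the slide.
HINTS = {
    "NEW_SAVING": "check connectivity to the state-store",
    "ACCEPTED": "check the AM resource limit of the queue",
    "RUNNING": "check whether resources are available for the app",
}

def stuck_app_hints(apps_response: dict) -> list:
    """Return (app id, state, hint) for apps in a state where jobs can hang."""
    # The "apps" value can be null when no application matches.
    apps = (apps_response.get("apps") or {}).get("app", [])
    return [(a["id"], a["state"], HINTS[a["state"]])
            for a in apps if a.get("state") in HINTS]

# Made-up sample response.
SAMPLE_APPS = {"apps": {"app": [
    {"id": "application_1_0001", "state": "ACCEPTED", "queue": "default"},
    {"id": "application_1_0002", "state": "FINISHED", "queue": "default"},
    {"id": "application_1_0003", "state": "RUNNING", "queue": "default"},
]}}

for row in stuck_app_hints(SAMPLE_APPS):
    print(row)
```

A RUNNING app is not necessarily hung; the hint only says where to look if it makes no progress.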
Case Study
 Friday evening, the customer experiences cluster outages.
 A large number of jobs get stuck.
 There are resources available in the cluster.
 Restarting the ResourceManager resolves the issue temporarily.
 But after several hours, the cluster goes back to the bad state.
Case Study
 Are there any resources available in the queue?
Case Study
 Are there any resources available for the app?
– Sometimes, even if the cluster has resources, users may still not be able to run their applications because they hit the user-limit.
– User-limit controls how many resources a single user can use
– Check user-limit info on the scheduler UI
– Check application headroom on the application UI
Case study
 Not a problem of resource contention.
 Use the yarn logs command to get the hung application's logs.
– Found the app waiting for containers to be allocated.
 Problem: the cluster has free resources, but the app is not able to use them.
Case study
 May be a scheduling issue.
 Analyze the scheduler log. (Most difficult)
– Users are not very familiar with the scheduler log.
– The RM log is huge, which makes text searching hard.
– It gets worse with debug logging enabled.
 Dump the scheduler log into a separate file
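When the RM log has already been collected, scheduler lines can also be separated offline. The sketch below keeps only lines whose logger name falls under the ResourceManager scheduler package; it is a plain text filter, not the log4j-level separation the slide may have in mind, and it drops continuation lines such as stack traces. The sample log messages are made up.

```python
SCHEDULER_PACKAGE = "org.apache.hadoop.yarn.server.resourcemanager.scheduler"

def scheduler_lines(rm_log_lines):
    """Yield only the lines logged by classes under the scheduler package."""
    for line in rm_log_lines:
        if SCHEDULER_PACKAGE + "." in line:
            yield line

# Made-up sample lines in the default "date level logger: message" layout.
SAMPLE_LOG = [
    "2016-06-30 10:00:00,001 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: (made-up message)",
    "2016-06-30 10:00:00,002 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: (made-up message)",
]

for line in scheduler_lines(SAMPLE_LOG):
    print(line)
```

For a real file, iterate over the open file object and write the yielded lines to a second file.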
Case study
 The scheduler log shows that several apps are skipped for scheduling.
 Pick one of the applications and go to the application attempt UI.
 Check the resource requests table (see below): the application is asking for billions of containers.
[Screenshot: resource requests table, with the value 8912124 visible]
Case study
 Killed the misbehaving jobs, and the cluster went back to normal.
 Find the user who submitted those jobs and stop him/her from doing that.
 Big achievement so far: the cluster is unblocked.
 Debugged offline and found a product bug.
 Surprisingly, the scheduler used int to measure memory size.
 The misbehaving app asked for too many resources, which caused an integer overflow in the scheduler.
 YARN-4844: replace int with long in the resource memory API.
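The overflow is easy to reproduce on paper. The sketch below simulates 32-bit signed arithmetic in Python; treating the figure 8912124 from the earlier slide as a container count and assuming 1024 MB per container are both illustrative assumptions, not values confirmed by the case.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def wrap_int32(value: int) -> int:
    """Wrap an integer the way a 32-bit signed int would on overflow."""
    return (value - INT32_MIN) % 2**32 + INT32_MIN

containers = 8912124       # assumption: read as a container count
memory_mb_each = 1024      # assumption: memory requested per container
total_mb = containers * memory_mb_each

print("true total (MB):   ", total_mb)
print("exceeds int range: ", total_mb > INT32_MAX)
print("value seen as int: ", wrap_int32(total_mb))
```

The wrapped value is a plausible-looking but wrong number, which is why such bugs are hard to spot from logs alone.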
What we learned
 Restarting a service can solve many problems. :)
– Thanks to work-preserving RM and NM restart (YARN-556 & YARN-1336).
 Denial of service: poorly written workloads, or accidental misconfiguration of workloads, can cause component outages.
– Code carefully against DoS scenarios.
– Example: a user RPC method (getQueueInfo) holds the scheduler lock
 UI enhancement
– Small change, big impact.
– Example: the resource requests table on the application page was very useful in this case.
 Alerts
– Alert users when they ask for too many containers.
Case study 2
 10% of the jobs fail every day.
 When re-run, the jobs sometimes finish successfully.
 No resource contention when jobs are running
 Logs contain a lot of mysterious connection errors (unable to read call parameters)
Case study 2
 Initial attempt:
– Dug deeper into the code to see under what conditions this exception may be thrown.
– Not able to figure it out.
Case study 2
 Requested more logs from failed applications
 Looked for a pattern across these applications
 Finally, we realized that all the apps failed on a certain set of nodes.
 Asked the customer to exclude those nodes. Jobs ran fine after that.
 The customer checked “/var/log/messages” and found disk issues on those nodes.
When dealing with mysterious connection failures or hung problems, try to find a correlation between failed apps and nodes.
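The correlation step can be as simple as counting distinct failed applications per node. The sketch below assumes the (application, node) pairs have already been extracted from the collected logs; the helper name and the sample IDs and host names are made up.

```python
from collections import defaultdict

def failed_apps_per_node(failures):
    """Count distinct failed apps per node, most affected node first."""
    apps_by_node = defaultdict(set)
    for app_id, node in failures:
        apps_by_node[node].add(app_id)
    return sorted(((node, len(apps)) for node, apps in apps_by_node.items()),
                  key=lambda item: (-item[1], item[0]))

# Made-up sample: (failed application, node where a container failed).
SAMPLE_FAILURES = [
    ("app_1", "node-a"),
    ("app_2", "node-a"),
    ("app_2", "node-a"),   # repeated failure of the same app counts once
    ("app_3", "node-b"),
]

print(failed_apps_per_node(SAMPLE_FAILURES))
```

Nodes at the top of the list are the first candidates to exclude and to inspect for hardware problems.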
Agenda
YARN in a Nutshell
Trouble-shooting Process and Tools
Case Study
Enhanced YARN Log Tool Demo
Enhanced YARN Log CLI (YARN-4904)
 Useful Log CLIs
– Get container logs for running apps
• yarn logs -applicationId ${appId}
– Get a specific container log
• yarn logs -applicationId ${appId} -containerId ${containerId}
– Get AM container logs
• yarn logs -applicationId ${appId} -am 1
– Get a specific log file
• yarn logs -applicationId ${appId} -logFiles syslog
• Supports Java regular expressions
– Get the log file's first 'n' bytes or the last 'n' bytes
• yarn logs -applicationId ${appId} -size 100
– Dump the application/container logs
• yarn logs -applicationId ${appId} -out ${local_dir}
– List application/container log information
• yarn logs -applicationId ${appId} -show_application_log_info
• yarn logs -applicationId ${appId} -containerId ${containerId} -show_container_log_info
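When these commands are run from scripts, building the argument list in one place avoids quoting mistakes. The sketch below uses only the options shown on this slide; the helper name is hypothetical, and option names may differ between Hadoop releases, so check `yarn logs -help` on the target cluster.

```python
def yarn_logs_command(app_id, container_id=None, am=None,
                      log_files=None, size=None, out_dir=None):
    """Build the argument list for a `yarn logs` call."""
    cmd = ["yarn", "logs", "-applicationId", app_id]
    if container_id:
        cmd += ["-containerId", container_id]
    if am is not None:
        cmd += ["-am", str(am)]
    if log_files:
        cmd += ["-logFiles", log_files]
    if size is not None:
        cmd += ["-size", str(size)]      # negative value: last n bytes
    if out_dir:
        cmd += ["-out", out_dir]
    return cmd

# Last 1000 bytes of the first AM attempt's stderr.
print(" ".join(yarn_logs_command("application_1467090861129_0001",
                                 am=1, log_files="stderr", size=-1000)))
```

The returned list can be passed as-is to `subprocess.run(cmd, check=True)` on a host where the `yarn` client is installed.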
Agenda
YARN in a Nutshell
Trouble-shooting Process and Tools
Case Study
Enhanced YARN Log Tool Demo
Summary and Future
Summary and Future
 Summary
– Methodology and Tools for trouble-shooting on YARN
– Case Study
– Enhanced YARN Log CLI
• YARN-4904
 Future Enhancement
– ATS (Application Timeline Service) v2
• YARN-2928
• #hs16sj “How YARN Timeline Service v.2 Unlocks 360-Degree Platform Insights at Scale”
– New ResourceManager UI
• YARN-3368
New RM UI (YARN-3368)
Thank You
Backup Slides
YARN Log Command Examples (screenshots omitted)
yarn logs -applicationId application_1467090861129_0001 -am 1
yarn logs -applicationId application_1467090861129_0001 -containerId container_1467090861129_0001_01_000002
yarn logs -applicationId application_1467090861129_0001 -am 1 -logFiles stderr
yarn logs -applicationId application_1467090861129_0001 -am 1 -logFiles stderr -size -1000
yarn logs -applicationId application_1467090861129_0001 -out ${localDir}
yarn logs -applicationId application_1467090861129_0001 -show_application_log_info
yarn logs -applicationId application_1467090861129_0001 -show_container_log_info


Debugging Apache Hadoop YARN Cluster in Production

  • 1. Debugging Apache Hadoop YARN Cluster in Production Jian He, Junping Du and Xuan Gong Hortonworks YARN Team 06/30/2016
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Who are We  Junping Du – Apache Hadoop Committer and PMC Member – Dev Lead in Hortonworks YARN team  Xuan Gong – Apache Hadoop Committer and PMC Member – Software Engineer  Jian He – Apache Hadoop Committer and PMC Member – Staff Software Engineer
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Today’s Agenda  YARN in a Nutshell  Trouble-shooting Process and Tools  Case Study  Enhanced YARN Log Tool Demo  Summary and Future
  • 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda YARN in a Nutshell
  • 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved YARN Architecture  ResourceManager  NodeManager  ApplicationMaster  Other daemons: – Application History/Timeline Server – Job History Server (for MR only) – Proxy Server – Etc.
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved RM and NM in a nutshell
  • 7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda YARN in a Nutshell Trouble-shooting Process and Tools
  • 8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved “Troubles” to start troubleshooting effort on a YARN cluster  Applications Failed  Applications Hang/Slow  YARN configuration doesn’t work  YARN APIs (CLI, WebService, etc.) doesn’t work  YARN daemons crashed (OOM issue, etc.)  YARN daemons’ log has error/warnings  YARN cluster monitoring tools (like Ambari) alert Problem Type Distribution Configuration Executing Jobs Cluster Administration Installation Application Development Performance
  • 9. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Process: Phenomenon -> Root Cause -> Solution  Solution: – Infrastructure/Hardware issue • Replace disks • Fix network – Mis-configuration • Fix configuration • Enhance documentation – Setup issue • Fix setup • Restart services – Application issue • Update application • Workaround – A YARN Bug • Report/fix it in Apache community!  Phenomenon: – Application Failed  Root cause: – Container Launch failures • Classpath issue • Resource localization failures – Too many attempt failures • Network connection issue • NM disk issues • AM failed caused by node restarted – Application logic issue • Container failed with OOM, etc. – Security issue • Token related issues
  • 10. Iceberg of troubleshooting – Case Study: "java.lang.RuntimeException: java.io.FileNotFoundException: /etc/hadoop/2.3.4.0-3485/0/core-site.xml (Too many open files in system)". This was actually caused by too many open TCP connections
  • 11. Iceberg of troubleshooting – Dig Deeply: Most connections are from the local NM to DNs (LogAggregationService, ResourceLocalizationService). We found the root cause to be a thread leak in the NM LogAggregationService: YARN-4697 (NM aggregation thread pool is not bound by limits); YARN-4325 (Purge app state from NM state-store should cover more LOG_HANDLING cases); YARN-4984 (LogAggregationService shouldn't swallow exception in handling createAppDir() which cause thread leak)
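A thread leak like the one on slide 11 usually shows up as one thread-name prefix dominating a jstack dump. The following is a minimal sketch, not the presenters' tooling: it counts threads per name prefix in a dump read from stdin. The function name `count_threads_by_name` is hypothetical, and the sketch assumes the usual HotSpot dump layout in which each thread header line starts with the quoted thread name.

```shell
# Count threads per name prefix in a jstack dump read from stdin.
# Assumption: thread header lines start with the quoted thread name,
# e.g. "LogAggregationService #12" #45 prio=5 ...
count_threads_by_name() {
  grep '^"' \
    | sed -e 's/^"\([^"]*\)".*/\1/' -e 's/[ #-]*[0-9][0-9]*$//' \
    | sort | uniq -c | sort -rn \
    | awk '{ n = $1; $1 = ""; sub(/^ /, ""); print n, $0 }'
}

# Typical use on the NodeManager host (the pid is a placeholder):
#   jstack <NM pid> | count_threads_by_name | head
# On Linux, open descriptors of the same process can be counted with:
#   ls /proc/<NM pid>/fd | wc -l
```

The trailing counter is stripped from each name so that pooled threads collapse into a single row; a prefix whose count keeps growing between two dumps is a candidate for a leak.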
  • 12. Lesson Learned for Trouble-shooting on a production cluster. What is meant by a “Production” cluster? Cannot afford to stop/restart the cluster for troubleshooting; most operations on the cluster are “Read Only”; the cluster sits in a fenced network, so debugging is done remotely with the local cluster admin. Lessons learned: 1. Get as much related info (screenshots, log files, jstack, memory heap dump, etc.) as you can. 2. Work closely with the end user to gain an understanding of the issue and symptoms. 3. Set up a knowledge base to compare against previous cases. 4. If possible, reproduce the issue on a test/backup cluster, where it is easier to troubleshoot and verify. 5. Version your configuration!
  • 13. Handy Tools for YARN Troubleshooting: Log; UI; Historic Info: JobHistoryServer (for MR only), Application Timeline Service (v1, v1.5, v2.0); monitoring tools, like Ambari; Runtime info: Memory Dump, Jstack, System Metrics
  • 14. Log. Log CLI: yarn logs -applicationId <application ID> [OPTIONS] (discussed more later). Enable debug log: when daemons are NOT running, put a log level setting such as export YARN_ROOT_LOGGER="DEBUG,console" in yarn-env.sh, then start the daemons; when daemons are running, change the log level dynamically via the daemon’s logLevel UI/CLI: yarn daemonlog -getlevel <host:httpPort> <classname> and yarn daemonlog -setlevel <host:httpPort> <classname> <level>; for the YARN client side, use a setting similar to the one for daemons that are not running
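On a production cluster, debug logging is best switched on for one logger only and switched off again afterwards. The sketch below wraps the `yarn daemonlog` commands from slide 14 in small helpers. The helper names (`run`, `debug_on`, `debug_off`, `show_level`), the `DRY_RUN` variable, and the host and logger names in the comments are all assumptions for illustration; only the `yarn daemonlog -getlevel/-setlevel` syntax comes from the slide.

```shell
# Print the command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# $1 = <host:httpPort> of the daemon, $2 = logger (class or package) name.
show_level() { run yarn daemonlog -getlevel "$1" "$2"; }
debug_on()   { run yarn daemonlog -setlevel "$1" "$2" DEBUG; }
# $3 = level to restore; INFO is an assumed default, check show_level first.
debug_off()  { run yarn daemonlog -setlevel "$1" "$2" "${3:-INFO}"; }

# Example session (hypothetical host and logger):
#   show_level rm.example.com:8088 org.apache.hadoop.yarn.server.resourcemanager.scheduler
#   DRY_RUN=0 debug_on  rm.example.com:8088 org.apache.hadoop.yarn.server.resourcemanager.scheduler
#   DRY_RUN=0 debug_off rm.example.com:8088 org.apache.hadoop.yarn.server.resourcemanager.scheduler
```

Defaulting to a dry run keeps the helpers in line with the "read only" habit described on slide 12: nothing changes on the daemon until the operator opts in.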
  • 15. Runtime Log Level settings in YARN UI: RM: http://<rm_addr>:8088/logLevel; NM: http://<nm_addr>:8042/logLevel; ATS: http://<ats_addr>:8188/logLevel
  • 16. UI (Ambari and YARN)
  • 17. Job History Server
  • 18. Memory dump analysis
  • 19. Hadoop metrics: RPC metrics (RpcQueueTimeAvgTime, ReceivedBytes, …); JVM metrics (MemHeapUsedM, ThreadsBlocked, …). Documentation: http://s.apache.org/UwSu
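The metrics named on slide 19 can also be read without a monitoring stack, because Hadoop daemons expose them as JSON through their `/jmx` HTTP endpoint. This is a rough sketch under assumptions: the function name `jmx_metric` is hypothetical, the bean name in the comment is the one commonly seen for the ResourceManager's JVM metrics, and the parsing relies on the metric appearing as a `"Name" : number` pair on one line.

```shell
# Print the numeric value of one metric from /jmx JSON read on stdin.
# Assumption: the metric appears as  "Name" : <number>  on a single line.
jmx_metric() {
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*\([-0-9.eE+]*\).*/\1/p' | head -n 1
}

# Typical use against a running RM (address is a placeholder):
#   curl -s 'http://<rm_addr>:8088/jmx?qry=Hadoop:service=ResourceManager,name=JvmMetrics' \
#     | jmx_metric MemHeapUsedM
```

Sampling the same metric every few seconds and comparing values is often enough to tell a steadily growing heap or blocked-thread count from a one-off spike.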
  • 20. YARN top: a top-like command line view of application stats and queue stats
  • 21. Agenda: YARN in a Nutshell; Trouble-shooting Process and Tools; Case Study
  • 22. Why is my job hung? A job can be stuck in one of 3 states. NEW_SAVING: waiting for the app to be persisted in the state-store (possible cause: connection error with the state-store, e.g. ZooKeeper). ACCEPTED: waiting to allocate the ApplicationMaster container (possible cause: low max-AM-resource-percentage config). RUNNING: waiting for containers to be allocated; are there resources available for the app? Otherwise it is an application-land issue, e.g. stuck on socket read/write. (Diagram: App states)
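The three-state checklist on slide 22 can be turned into a first triage step. The sketch below is illustrative only: `app_state` and `hint_for_state` are hypothetical helper names, and `app_state` assumes that `yarn application -status <appId>` prints a line of the form `State : <STATE>`, which may differ between versions.

```shell
# Extract the application state from `yarn application -status` output on stdin.
# Assumption: the report contains a line like "    State : RUNNING".
app_state() {
  sed -n 's/^[[:space:]]*State[[:space:]]*:[[:space:]]*//p' | head -n 1
}

# Map a state to the first thing to check, following slide 22.
hint_for_state() {
  case "$1" in
    NEW_SAVING) echo "check RM connectivity to the state-store (e.g. ZooKeeper)" ;;
    ACCEPTED)   echo "check queue capacity and the AM resource limit" ;;
    RUNNING)    echo "check resources available to the app, then the app itself" ;;
    *)          echo "state not covered by this checklist" ;;
  esac
}

# Typical use (application id is a placeholder):
#   hint_for_state "$(yarn application -status <appId> | app_state)"
```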
  • 23. Case Study: Friday evening, a customer experiences cluster outages. A large number of jobs get stuck. There are resources available in the cluster. Restarting the ResourceManager resolves the issue temporarily, but after several hours the cluster goes back to the bad state
  • 24. Case Study: Are there any resources available in the queue?
  • 25. Case Study: Are there any resources available for the app? Sometimes, even if the cluster has resources, users may still not be able to run their applications because they hit the user-limit. The user-limit controls how many resources a single user can use. Check user-limit info on the scheduler UI. Check application headroom on the application UI
  • 27. Case study: Not a problem of resource contention. Use the yarn logs command to get the hung application's logs; the app was found waiting for containers to be allocated. Problem: the cluster has free resources, but the app is not able to use them
  • 28. Case study: May be a scheduling issue. Analyze the scheduler log (most difficult): users are not very familiar with the scheduler log; the RM log is huge and hard to search as text; it gets worse when debug logging is enabled. Dump the scheduling log into a separate file
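One low-risk way to get a scheduler-only view is to filter a copy of the RM log rather than change logging on the live daemon. This is a minimal sketch, not necessarily how the presenters separated the log: `extract_scheduler_log` is a hypothetical name, and the filter assumes the log pattern prints full logger names, so that scheduler lines contain the `resourcemanager.scheduler.` package fragment.

```shell
# Keep only lines logged by classes under the RM scheduler package.
# Assumption: the log4j pattern prints the full logger name; if it is
# abbreviated in your logs, adjust the fixed string accordingly.
extract_scheduler_log() {
  grep -F 'resourcemanager.scheduler.'
}

# Typical use on a copied log file (file names are placeholders):
#   extract_scheduler_log < yarn-resourcemanager.log > scheduler-only.log
```

Working on a copy keeps the operation read-only with respect to the cluster, and the smaller file is much easier to search for skipped or starved applications.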
  • 29. Case study: The scheduler log shows that several apps are skipped for scheduling. Pick one of the applications, go to the application attempt UI, and check the resource requests table (screenshot; visible value: 8912124): the application is asking for billions of containers
  • 30. Case study: Killed those misbehaving jobs, and the cluster went back to normal. Find the user who submitted those jobs and stop him/her from doing that. Big achievement so far: the cluster is unblocked. Offline debugging then found a product bug. Surprisingly, we used int for measuring memory size in the scheduler. The misbehaving app asked for too many resources, which caused an integer overflow in the scheduler. YARN-4844: replace int with long for resource memory API
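The overflow on slide 30 is easy to reproduce with arithmetic alone. The sketch below uses made-up numbers (3,000,000 containers of 1024 MB each), since the slide does not say which computation overflowed, and a hypothetical helper `to_int32` that wraps a value the way a signed 32-bit int would. It assumes a shell with 64-bit integer arithmetic.

```shell
# Wrap an integer to the value a signed 32-bit int would hold.
to_int32() {
  v=$(( $1 & 4294967295 ))          # keep the low 32 bits
  if [ "$v" -ge 2147483648 ]; then  # high bit set: negative in two's complement
    v=$(( v - 4294967296 ))
  fi
  echo "$v"
}

containers=3000000   # hypothetical request size
mb_each=1024         # hypothetical memory per container, in MB
total=$(( containers * mb_each ))

echo "total as long: $total MB"
echo "total as int:  $(to_int32 "$total") MB"
```

With these numbers the true total exceeds the largest signed 32-bit value (2147483647), so the int version turns negative. A scheduler comparing such a value against available memory would reach wrong conclusions, which is consistent with apps being skipped even though the cluster had free resources.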
  • 31. What we learn: Rebooting the service can solve many problems, thanks to work-preserving RM and NM recovery (YARN-556 & YARN-1336). Denial of Service: poorly written workloads or accidental configuration can cause component outages; code carefully against DoS scenarios; example: a user RPC method (getQueueInfo) holds the scheduler lock. UI enhancement: small change, big impact; example: the resource requests table on the application UI was very useful in this case. Alerts: alert users when an app asks for too many containers
  • 32. Case study 2: 10% of the jobs fail every day. When re-run, the jobs sometimes finish successfully. There is no resource contention while the jobs are running. Logs contain a lot of mysterious connection errors (unable to read call parameters)
  • 33. Case study 2: Initial attempt: dig deeper into the code to see under what conditions this exception may be thrown. Not able to figure it out
  • 34. Case study 2: Requested more failed application logs. Identified a pattern for these applications. Finally, we realized that all apps failed on a certain set of nodes. Asked the customer to exclude those nodes; jobs ran fine after that. The customer checked “/var/log/messages” and found disk issues on those nodes. When dealing with mysterious connection failures or hung problems, try to find a correlation between failed apps and nodes
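The correlation step from slide 34 can be as simple as a per-node tally. The sketch below assumes the failed attempts have already been collected as `<appId> <host>` pairs, one per line (how those pairs are gathered from the logs is left open), and `failures_per_node` is a hypothetical name.

```shell
# stdin: one "<appId> <host>" pair per failed container or attempt.
# stdout: "<count> <host>", most frequently failing host first.
failures_per_node() {
  awk '{ count[$2]++ } END { for (h in count) print count[h], h }' \
    | sort -k1,1nr -k2,2
}

# Typical use (the input file is a placeholder):
#   failures_per_node < failed-attempts.txt | head
```

If a handful of hosts account for most failures, excluding them and checking their system logs, as the customer did with /var/log/messages, is a cheap next step.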
  • 35. Agenda: YARN in a Nutshell; Trouble-shooting Process and Tools; Case Study; Enhanced YARN Log Tool Demo
  • 36. Enhanced YARN Log CLI (YARN-4904). Useful Log CLIs: Get container logs for running apps: yarn logs -applicationId ${appId}. Get a specific container log: yarn logs -applicationId ${appId} -containerId ${containerId}. Get AM container logs: yarn logs -applicationId ${appId} -am 1. Get a specific log file (Java regular expressions are supported): yarn logs -applicationId ${appId} -logFiles syslog. Get the log file's first 'n' bytes or last 'n' bytes: yarn logs -applicationId ${appId} -size 100. Dump the application/container logs: yarn logs -applicationId ${appId} -out ${local_dir}. List application/container log information: yarn logs -applicationId ${appId} -show_application_log_info; yarn logs -applicationId ${appId} -containerId ${containerId} -show_container_log_info
  • 38. Agenda: YARN in a Nutshell; Trouble-shooting Process and Tools; Case Study; Enhanced YARN Log Tool Demo; Summary and Future
  • 39. Summary and Future. Summary: methodology and tools for trouble-shooting on YARN; case study; enhanced YARN Log CLI (YARN-4904). Future enhancement: ATS (Application Timeline Service) v2 (YARN-2928; #hs16sj “How YARN Timeline Service v.2 Unlocks 360-Degree Platform Insights at Scale”); new ResourceManager UI (YARN-3368)
  • 40. New RM UI (YARN-3368)
  • 41. Thank You
  • 42. Backup Slides
  • 43. YARN Log Command example (screenshot): yarn logs -applicationId application_1467090861129_0001 -am 1
  • 44. YARN Log Command example (screenshot): yarn logs -applicationId application_1467090861129_0001 -containerId container_1467090861129_0001_01_000002
  • 45. YARN Log Command example (screenshot): yarn logs -applicationId application_1467090861129_0001 -am 1 -logFiles stderr
  • 46. YARN Log Command example (screenshot): yarn logs -applicationId application_1467090861129_0001 -am 1 -logFiles stderr -size -1000
  • 47. YARN Log Command example (screenshot): yarn logs -applicationId application_1467090861129_0001 -out ${localDir}
  • 48. YARN Log Command example (screenshot): yarn logs -applicationId application_1467090861129_0001 -show_application_log_info; yarn logs -applicationId application_1467090861129_0001 -show_container_log_info