Debugging Apache Hadoop YARN Cluster in Production
DataWorks Summit/Hadoop Summit
1.
Debugging Apache Hadoop YARN Cluster in Production
Jian He, Junping Du and Xuan Gong
Hortonworks YARN Team
06/30/2016
2.
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Who are We
• Junping Du
  – Apache Hadoop Committer and PMC Member
  – Dev Lead in Hortonworks YARN team
• Xuan Gong
  – Apache Hadoop Committer and PMC Member
  – Software Engineer
• Jian He
  – Apache Hadoop Committer and PMC Member
  – Staff Software Engineer
3.
Today’s Agenda
• YARN in a Nutshell
• Trouble-shooting Process and Tools
• Case Study
• Enhanced YARN Log Tool Demo
• Summary and Future
4.
Agenda
• YARN in a Nutshell
5.
YARN Architecture
• ResourceManager
• NodeManager
• ApplicationMaster
• Other daemons:
  – Application History/Timeline Server
  – Job History Server (for MR only)
  – Proxy Server
  – Etc.
6.
RM and NM in a nutshell
7.
Agenda
• YARN in a Nutshell
• Trouble-shooting Process and Tools
8.
“Troubles” that start a troubleshooting effort on a YARN cluster
• Applications failed
• Applications hang or run slowly
• YARN configuration doesn’t work
• YARN APIs (CLI, WebService, etc.) don’t work
• YARN daemons crashed (OOM issue, etc.)
• YARN daemons’ logs have errors/warnings
• YARN cluster monitoring tools (like Ambari) raise alerts
[Chart: Problem Type Distribution. Categories: Configuration, Executing Jobs, Cluster Administration, Installation, Application Development, Performance]
9.
Process: Phenomenon -> Root Cause -> Solution
• Phenomenon:
  – Application failed
• Root cause:
  – Container launch failures
    • Classpath issue
    • Resource localization failures
  – Too many attempt failures
    • Network connection issue
    • NM disk issues
    • AM failed because its node restarted
  – Application logic issue
    • Container failed with OOM, etc.
  – Security issue
    • Token-related issues
• Solution:
  – Infrastructure/hardware issue
    • Replace disks
    • Fix network
  – Misconfiguration
    • Fix configuration
    • Enhance documentation
  – Setup issue
    • Fix setup
    • Restart services
  – Application issue
    • Update application
    • Workaround
  – A YARN bug
    • Report/fix it in the Apache community!
10.
Iceberg of troubleshooting – Case Study
• "java.lang.RuntimeException: java.io.FileNotFoundException: /etc/hadoop/2.3.4.0-3485/0/core-site.xml (Too many open files in system)"
• The error was actually caused by too many open TCP connections.
11.
Iceberg of troubleshooting – Dig Deeply
• Most connections are from the local NM to DNs
  – LogAggregationService
  – ResourceLocalizationService
• We found the root cause to be a thread leak in the NM LogAggregationService:
  – YARN-4697: NM aggregation thread pool is not bound by limits
  – YARN-4325: Purge app state from NM state-store should cover more LOG_HANDLING cases
  – YARN-4984: LogAggregationService shouldn't swallow exception in handling createAppDir() which cause thread leak.
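One way to see where a node's connections are going is to group them by peer port. The sketch below is only an illustration: it assumes the input is text in the column layout printed by `ss -tn` (State, Recv-Q, Send-Q, local address, peer address), and the addresses and ports in SAMPLE are hypothetical (50010 is used as a stand-in for a DataNode transfer port).

```python
from collections import Counter

def connections_by_peer_port(ss_output: str) -> Counter:
    """Count established TCP connections per peer port from `ss -tn`-style text."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Assumed columns: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue  # skips the header row and non-established sockets
        peer_port = fields[4].rsplit(":", 1)[-1]
        counts[peer_port] += 1
    return counts

# Hypothetical sample, not taken from the cluster in the case study.
SAMPLE = """\
State  Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB  0      0      10.0.0.5:41000      10.0.0.7:50010
ESTAB  0      0      10.0.0.5:41002      10.0.0.8:50010
ESTAB  0      0      10.0.0.5:41004      10.0.0.9:8031
"""

if __name__ == "__main__":
    for port, count in connections_by_peer_port(SAMPLE).most_common():
        print(port, count)
```

A port that dominates the count points at the service worth inspecting next, for example with a thread dump of the process that owns those sockets.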
12.
Lessons Learned for Troubleshooting on a Production Cluster
• What is meant by a “Production” cluster?
  – Cannot afford to stop/restart the cluster for troubleshooting
  – Most operations on the cluster are “read only”
  – The network is fenced, so debugging is done remotely with the local cluster admin
• Lessons learned:
  1. Get as much related info as you can (screenshots, log files, jstack output, heap dumps, etc.)
  2. Work closely with the end user to gain an understanding of the issue and symptoms
  3. Set up a knowledge base for comparing against previous cases
  4. If possible, reproduce the issue on a test/backup cluster, where it is easier to troubleshoot and verify
  5. Version your configuration!
13.
Handy Tools for YARN Troubleshooting
• Log
• UI
• Historic info
  – JobHistoryServer (for MR only)
  – Application Timeline Service (v1, v1.5, v2.0)
• Monitoring tools, like Ambari
• Runtime info
  – Memory dump
  – jstack
  – System metrics
14.
Log
• Log CLI
  – yarn logs -applicationId <application ID> [OPTIONS]
  – Discussed in more detail later
• Enable debug log
  – When daemons are NOT running
    • Put a log level setting such as export YARN_ROOT_LOGGER="DEBUG,console" into yarn-env.sh
    • Start the daemons
  – When daemons are running
    • Change the log level dynamically via the daemon’s logLevel UI/CLI
    • CLI:
      – yarn daemonlog [-getlevel <host:httpPort> <classname>]
      – yarn daemonlog [-setlevel <host:httpPort> <classname> <level>]
  – For the YARN client side
    • Same setting as when daemons are not running
15.
Runtime Log Level settings in YARN UI
• RM: http://<rm_addr>:8088/logLevel
• NM: http://<nm_addr>:8042/logLevel
• ATS: http://<ats_addr>:8188/logLevel
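The logLevel pages above can also be addressed by URL. The sketch below only builds such a URL; it assumes the servlet accepts `log` and `level` query parameters, which should be verified against the Hadoop version in use, and the host name in the usage line is hypothetical.

```python
from urllib.parse import urlencode

def log_level_url(host, http_port, logger, level=None):
    """Build a logLevel URL; without `level` it addresses a read of the current level."""
    # Assumption: the servlet reads the `log` and `level` query parameters.
    params = {"log": logger}
    if level is not None:
        params["level"] = level
    return f"http://{host}:{http_port}/logLevel?{urlencode(params)}"

if __name__ == "__main__":
    # rm.example.com is a placeholder host; 8088 is the RM port from the slide.
    print(log_level_url("rm.example.com", 8088, "org.apache.hadoop.yarn", "DEBUG"))
```

Remember to set the level back after collecting the logs, since debug output makes the already large daemon logs grow much faster.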
16.
UI (Ambari and YARN)
17.
Job History Server
18.
Memory dump analysis
19.
Hadoop metrics
• RPC metrics
  – RpcQueueTimeAvgTime
  – ReceivedBytes
  – …
• JVM metrics
  – MemHeapUsedM
  – ThreadsBlocked
  – …
• Documentation:
  – http://s.apache.org/UwSu
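A minimal sketch of pulling a few of these metrics out of a JSON metrics dump. It assumes the dump has the shape served by a Hadoop daemon's `/jmx` HTTP endpoint (a top-level `beans` list of flat objects); the bean names and values in SAMPLE_JMX are hypothetical.

```python
import json

def pick_metrics(jmx_json: str, wanted: set) -> dict:
    """Return the first value found for each wanted attribute across all beans."""
    found = {}
    for bean in json.loads(jmx_json).get("beans", []):
        for key, value in bean.items():
            if key in wanted and key not in found:
                found[key] = value
    return found

# Hypothetical dump; real output contains many more beans and attributes.
SAMPLE_JMX = """
{"beans": [
  {"name": "Hadoop:service=ResourceManager,name=JvmMetrics",
   "MemHeapUsedM": 512.5, "ThreadsBlocked": 3},
  {"name": "Hadoop:service=ResourceManager,name=RpcActivityForPort8032",
   "RpcQueueTimeAvgTime": 0.25, "ReceivedBytes": 1048576}
]}
"""

if __name__ == "__main__":
    print(pick_metrics(SAMPLE_JMX, {"MemHeapUsedM", "ThreadsBlocked", "RpcQueueTimeAvgTime"}))
```

Sampling the same attributes at intervals and comparing the values is usually more telling than a single reading, for instance a ThreadsBlocked count that keeps growing.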
20.
YARN top
• A top-like command-line view of application and queue stats
21.
Agenda
• YARN in a Nutshell
• Trouble-shooting Process and Tools
• Case Study
22.
Why is my job hung?
A job can be stuck in one of 3 states:
• NEW_SAVING: waiting for the app to be persisted in the state-store
  – Connection error with the state-store (ZooKeeper, etc.)
• ACCEPTED: waiting to allocate the ApplicationMaster container
  – Low max-AM-resource-percentage config
• RUNNING: waiting for containers to be allocated
  – Are there resources available for the app?
  – Otherwise, an application-land issue, e.g. stuck on socket read/write
[Figure: App states]
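The three states above can be kept as a small lookup for a runbook or triage script. This is only a sketch: the check texts paraphrase the slide, and the names STUCK_STATE_CHECKS and checks_for are made up for illustration, not part of any YARN API.

```python
# Checklist keyed by the application state a job is stuck in.
# The wording paraphrases the slide; nothing here queries a cluster.
STUCK_STATE_CHECKS = {
    "NEW_SAVING": ["Check connectivity to the state-store (ZooKeeper, etc.)"],
    "ACCEPTED": ["Check the max-AM-resource-percentage setting of the queue"],
    "RUNNING": [
        "Check whether resources are available for the app",
        "Check for application-side hangs, e.g. blocked socket read/write",
    ],
}

def checks_for(state: str) -> list:
    """Return the first things to check for a job stuck in `state`."""
    return STUCK_STATE_CHECKS.get(state.upper(), ["State not covered by this checklist"])

if __name__ == "__main__":
    for item in checks_for("accepted"):
        print(item)
```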
23.
Case Study
• Friday evening, the customer experiences cluster outages.
• A large number of jobs are getting stuck.
• There are resources available in the cluster.
• Restarting the ResourceManager resolves the issue temporarily.
• But after several hours, the cluster goes back to the bad state.
24.
Case Study
• Are there any resources available in the queue?
25.
Case Study
• Are there any resources available for the app?
  – Sometimes, even if the cluster has resources, users may still be unable to run their applications because they hit the user-limit.
  – The user-limit controls how many resources a single user can use
  – Check user-limit info on the scheduler UI
  – Check application headroom on the application UI
27.
Case study
• Not a problem of resource contention.
• Use the yarn logs command to get the hung application's logs.
  – Found the app waiting for containers to be allocated.
• Problem: the cluster has free resources, but the app is not able to use them.
28.
Case study
• May be a scheduling issue.
• Analyze the scheduler log. (Most difficult)
  – Users are not very familiar with the scheduler log.
  – The RM log is huge, which makes text searching hard.
  – It gets worse when debug logging is enabled.
• Dump the scheduling log into a separate file
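One low-risk way to get scheduler messages into their own file is to filter a copy of the RM log offline. The sketch below is not the method used in the talk; it assumes each log line has the layout "date time level logger: message" and treats any logger whose name contains ".scheduler." as scheduler output. The sample messages are hypothetical.

```python
def scheduler_lines(rm_log_lines):
    """Yield RM log lines whose logger name belongs to a scheduler package."""
    for line in rm_log_lines:
        parts = line.split(" ", 4)
        # Assumed layout: date time level logger: message
        if len(parts) >= 4 and ".scheduler." in parts[3]:
            yield line

def write_scheduler_log(src_path, dst_path):
    """Copy only the scheduler lines from src_path into dst_path."""
    with open(src_path, encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(scheduler_lines(src))

# Hypothetical log lines for illustration.
SAMPLE_LOG = [
    "2016-06-30 10:00:00,123 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: skipping app\n",
    "2016-06-30 10:00:01,000 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: app submitted\n",
]

if __name__ == "__main__":
    for line in scheduler_lines(SAMPLE_LOG):
        print(line, end="")
```

Continuation lines such as stack traces do not match the assumed layout and are dropped by this filter, so keep the original log at hand.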
29.
Case study
• The scheduler log shows several apps being skipped for scheduling.
• Pick one of the applications, go to the application attempt UI, and check the resource requests table: the application is asking for billions of containers.
[Screenshot: resource requests table; visible value 8912124]
30.
Case study
• Killed those misbehaving jobs, and the cluster recovered.
• Find the user who submitted those jobs and stop him/her from doing that.
• Big achievement so far: the cluster is unblocked.
• Offline debugging found a product bug.
  – Surprisingly, the scheduler used int to measure memory size.
  – The misbehaving app asked for too many resources, which caused integer overflow in the scheduler.
  – YARN-4844: replace int with long for resource memory API.
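To illustrate how a memory total can overflow a signed 32-bit int, the sketch below emulates 32-bit wraparound in Python. The container count and the per-container size are hypothetical, and the scheduler's actual arithmetic is not shown in the slides; this only demonstrates the failure mode that a 64-bit long avoids.

```python
def to_int32(value: int) -> int:
    """Wrap an arbitrary Python int the way a signed 32-bit int would."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value

def total_memory_mb_int32(containers: int, mb_per_container: int) -> int:
    """Requested memory in MB as a 32-bit int would store it."""
    return to_int32(containers * mb_per_container)

if __name__ == "__main__":
    containers = 3_000_000      # hypothetical request count
    mb_per_container = 1024     # hypothetical container size
    print("exact total (MB): ", containers * mb_per_container)
    print("32-bit total (MB):", total_memory_mb_int32(containers, mb_per_container))
```

With these numbers the exact total exceeds the largest 32-bit signed value (2,147,483,647), so the wrapped result turns negative, and any comparison against available memory then gives a wrong answer.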
31.
What we learned
• Restarting a service can solve many problems.
  – Thanks to work-preserving RM and NM recovery (YARN-556 & YARN-1336).
• Denial of service: poorly written or accidentally misconfigured workloads can cause component outages.
  – Carefully code against DoS scenarios.
  – Example: a user RPC method (getQueueInfo) holds the scheduler lock
• UI enhancement
  – Small change, big impact.
  – Example: the resource requests table on the application UI was very useful in this case.
• Alerts
  – Alert users when an application asks for too many containers.
32.
Case study 2
• 10% of the jobs are failing every day.
• When re-run, the jobs sometimes finish successfully.
• No resource contention while the jobs are running
• Logs contain a lot of mysterious connection errors (unable to read call parameters)
33.
Case study 2
• Initial attempt:
  – Dig deeper into the code to see under what conditions this exception may be thrown.
  – Not able to figure it out.
34.
Case study 2
• Requested more failed application logs
• Identified a pattern across these applications
• Finally, we realized all apps failed on a certain set of nodes.
• Asked the customer to exclude those nodes. Jobs ran fine after that.
• The customer checked “/var/log/messages” and found disk issues on those nodes.
• When dealing with mysterious connection failures or hung jobs, try to find a correlation between failed apps and nodes.
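The correlation step can be sketched as a simple count of failed attempts per node. This assumes you have already extracted (application ID, node) pairs for the failed attempts, for example from the collected application logs; the IDs, node names, and the 30% threshold below are all hypothetical.

```python
from collections import Counter

def suspicious_nodes(failed_attempts, min_share=0.3):
    """Return nodes hosting at least `min_share` of all failed attempts, sorted by name."""
    per_node = Counter(node for _app_id, node in failed_attempts)
    total = sum(per_node.values())
    return sorted(n for n, c in per_node.items() if total and c / total >= min_share)

# Hypothetical (application ID, node) pairs for failed attempts.
SAMPLE_FAILURES = [
    ("app_1", "node07"),
    ("app_2", "node07"),
    ("app_3", "node12"),
    ("app_4", "node07"),
    ("app_5", "node31"),
]

if __name__ == "__main__":
    print(suspicious_nodes(SAMPLE_FAILURES))
```

A node that stands out here is a candidate for exclusion and for a look at its system logs, as in the case above.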
35.
Agenda
• YARN in a Nutshell
• Trouble-shooting Process and Tools
• Case Study
• Enhanced YARN Log Tool Demo
36.
Enhanced YARN Log CLI (YARN-4904)
• Useful Log CLIs
  – Get container logs for running apps
    • yarn logs -applicationId ${appId}
  – Get a specific container log
    • yarn logs -applicationId ${appId} -containerId ${containerId}
  – Get AM container logs
    • yarn logs -applicationId ${appId} -am 1
  – Get a specific log file
    • yarn logs -applicationId ${appId} -logFiles syslog
    • Supports Java regular expressions
  – Get the log file's first 'n' bytes or last 'n' bytes
    • yarn logs -applicationId ${appId} -size 100
  – Dump the application/container logs
    • yarn logs -applicationId ${appId} -out ${local_dir}
  – List application/container log information
    • yarn logs -applicationId ${appId} -show_application_log_info
    • yarn logs -applicationId ${appId} -containerId ${containerId} -show_container_log_info
38.
Agenda
• YARN in a Nutshell
• Trouble-shooting Process and Tools
• Case Study
• Enhanced YARN Log Tool Demo
• Summary and Future
39.
Summary and Future
• Summary
  – Methodology and tools for troubleshooting on YARN
  – Case study
  – Enhanced YARN Log CLI
    • YARN-4904
• Future enhancement
  – ATS (Application Timeline Service) v2
    • YARN-2928
    • #hs16sj “How YARN Timeline Service v.2 Unlocks 360-Degree Platform Insights at Scale”
  – New ResourceManager UI
    • YARN-3368
40.
New RM UI (YARN-3368)
41.
Thank You
42.
Backup Slides
43.
YARN Log Command example (screenshot)
yarn logs -applicationId application_1467090861129_0001 -am 1
44.
YARN Log Command example (screenshot)
yarn logs -applicationId application_1467090861129_0001 -containerId container_1467090861129_0001_01_000002
45.
YARN Log Command example (screenshot)
yarn logs -applicationId application_1467090861129_0001 -am 1 -logFiles stderr
46.
YARN Log Command example (screenshot)
yarn logs -applicationId application_1467090861129_0001 -am 1 -logFiles stderr -size -1000
47.
YARN Log Command example (screenshot)
yarn logs -applicationId application_1467090861129_0001 -out ${localDir}
48.
YARN Log Command example (screenshot)
yarn logs -applicationId application_1467090861129_0001 -show_application_log_info
yarn logs -applicationId application_1467090861129_0001 -show_container_log_info