You’ve successfully deployed Hadoop, but are you taking advantage of all of Hadoop’s features to operate a stable and effective cluster? In the first part of the talk, we will cover issues seen over the last two years on hundreds of production clusters, with a detailed breakdown covering the number of occurrences, severity, and root cause. We will cover best practices and many new tools and features added to Hadoop over the last year to help system administrators monitor, diagnose, and address such incidents.
The second part of our talk discusses new features for making daily operations easier. This includes features such as ACLs for simplified permission control, snapshots for data protection, and more. We will also cover tuning configuration and features that improve cluster utilization, such as short-circuit reads and DataNode caching.
1. Hadoop Operations –
Best Practices from the Field
October 17, 2014
Chris Nauroth
email: cnauroth@hortonworks.com
twitter: @cnauroth
Suresh Srinivas
email: suresh@hortonworks.com
twitter: @suresh_m_s
There is often a lot of overlap between the two forums. We moderate each forum to learn what needs to be improved.
Preliminary analysis suggested that we focus deeper analysis on core Hadoop, defined as HDFS, YARN, and MapReduce. This chart shows the count of support cases per month. One interesting observation is a spike in support case activity centered on May 2014.
Instead of a raw count, this chart shows the proportion of support cases attributed to core Hadoop (HDFS, YARN, or MapReduce). The gray line at the top covers the 26 other components. Here we see the trend stabilizing around 30% of support cases driven by core. This was another validation that focusing on core for this study would likely help the most users.
This chart shows root cause analysis of the core issues during the time period. We use ~40 different root cause categories, but I’ve limited this view to the most prominent root causes. Explain each category.
Investment in operations at the core helps the most users. We need to keep revisiting the code to make constant improvements.
A cluster with fewer nodes is less resilient than one with many nodes. Failure of a DataNode that carries more storage causes more re-replication activity, and MapReduce jobs may need to rerun more tasks. Commodity != poor quality.
Compressed ordinary object pointers (compressed oops) are a technique the JVM uses to represent managed pointers as 32-bit values, which saves the space taken by 64-bit native pointers. Setting -Xmx different from -Xms can cause a big, expensive malloc when the heap grows, with surprising results if you run out of memory late in the process lifetime.
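The heap advice above can be sketched as a hadoop-env.sh fragment; the 4g size and the NameNode as the target daemon are illustrative assumptions, not recommendations from the talk:

```shell
# hadoop-env.sh (hypothetical sizes): pin -Xms equal to -Xmx so the full
# heap is allocated up front, avoiding a large, expensive allocation
# later in the process lifetime. Heaps under ~32 GB let the JVM use
# compressed oops (32-bit managed pointers).
export HADOOP_NAMENODE_OPTS="-Xms4g -Xmx4g -XX:+UseCompressedOops ${HADOOP_NAMENODE_OPTS}"
```

-XX:+UseCompressedOops is on by default for qualifying heaps in modern JVMs; it is shown here only to make the connection to the slide explicit.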
The NFS soft mount option is important for returning control to the caller after timeouts.
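As a sketch, a soft mount for a NameNode metadata directory might look like this in /etc/fstab; the server name, export path, and mount point are hypothetical, and the timeout values are examples rather than tuned recommendations:

```
# "soft" returns an error to the caller after "retrans" retries instead
# of blocking forever; "timeo" is the per-request timeout in tenths of
# a second (here 5 seconds, retried 3 times).
nfs-server:/export/namenode  /mnt/namenode  nfs  soft,timeo=50,retrans=3  0  0
```

Without "soft", a hung NFS server can block the NameNode indefinitely on a metadata write.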
If you’ve used POSIX ACLs on a Linux file system, then you already know how they work in HDFS too.
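A minimal sketch of the CLI, mirroring the Linux setfacl/getfacl commands; the user name and path are hypothetical:

```shell
# Grant user "bob" read and traverse access to a directory beyond
# what the owner/group/other permission bits allow.
hdfs dfs -setfacl -m user:bob:r-x /data/sales

# Inspect the resulting ACL; entries added beyond the base
# permissions show up as named user/group entries.
hdfs dfs -getfacl /data/sales
```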
By convention, snapshots can be referenced as a file system path under the sub-directory “.snapshot”.
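For example, a directory must first be marked snapshottable by an administrator, after which snapshots can be created and read back through the ".snapshot" path; the directory, snapshot name, and file name below are hypothetical:

```shell
# Mark the directory as snapshottable (admin operation).
hdfs dfsadmin -allowSnapshot /data/sales

# Take a snapshot named "s0".
hdfs dfs -createSnapshot /data/sales s0

# Read a file as it existed at snapshot time, even if the live
# copy has since been modified or deleted.
hdfs dfs -cat /data/sales/.snapshot/s0/part-00000
```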
The data is also easily consumed by other clients if you want to roll your own UI. Initial integration was done with Tez. Tez is a framework for modeling distributed computations as a directed acyclic graph of tasks. Tez code is instrumented to publish information about DAG execution to the Timeline Server.
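A custom client can hit the Timeline Server's REST API directly; a minimal sketch, assuming the hypothetical host "timelinehost" and the default web port 8188:

```shell
# Fetch recent Tez DAG entities as JSON from the Timeline Server.
# TEZ_DAG_ID is the entity type Tez publishes for DAG executions.
curl "http://timelinehost:8188/ws/v1/timeline/TEZ_DAG_ID?limit=10"
```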
This demo shows integration with Map Reduce, which is still a work in progress. The patch is available in Apache. This view is simplified.