1. HDFS: Optimization, Stabilization and Supportability
June 28, 2016
Chris Nauroth – email: twitter: @cnauroth
Arpit Agarwal – email: twitter: @aagarw
2. © Hortonworks Inc. 2011
About Us
Chris Nauroth
• Member of Technical Staff, Hortonworks
  – Apache Hadoop committer, PMC member, and Apache Software Foundation member
  – Major contributor to HDFS ACLs, Windows compatibility, and operability improvements
• Hadoop user since 2010
  – Prior employment experience deploying, maintaining and using Hadoop clusters
Arpit Agarwal
• Member of Technical Staff, Hortonworks
  – Apache Hadoop committer, PMC member
  – Major contributor to HDFS heterogeneous storage support and Windows compatibility
Page 2 Architecting the Future of Big Data
3. Motivation
• HDFS engineers are on the front line for operational support of Hadoop.
  – HDFS is the foundational storage layer for typical Hadoop deployments.
  – Therefore, challenges in HDFS have the potential to impact the entire Hadoop ecosystem.
  – Conversely, application problems can become visible at the layer of HDFS operations.
• Analysis of Hadoop support cases
  – Support case trends reveal common patterns of HDFS operational challenges.
  – Those challenges inform what needs to improve in the software.
• Software improvements
  – Optimization: identify and mitigate bottlenecks.
  – Stabilization: prevent unusual circumstances from harming cluster uptime.
  – Supportability: when something goes wrong, provide visibility and tools to fix it.
Thank you to the entire community of Apache contributors.
4. Performance
• Garbage collection
  – The NameNode heap must scale up in relation to the number of file system objects (files, directories, blocks, etc.).
  – Recent hardware trends can require larger DataNode heaps too: nodes have more disks, and those disks are larger, so the memory footprint for tracking block state has increased.
  – Much has been written about garbage collection tuning for large-heap JVM processes.
  – In addition to recommending configuration best practices, we can optimize the codebase to reduce garbage collection pressure.
5. Performance
• Block reporting
  – The process by which DataNodes report information about their stored blocks to the NameNode.
  – Full block report: a complete catalog of all of the node’s blocks, sent infrequently.
  – Incremental block report: partial information about recently added or deleted blocks, sent more frequently.
  – All block reporting occurs asynchronously from user-facing operations, so it does not impact end-user latency directly.
  – However, inefficiencies in block reporting can overwhelm a cluster to the point that it can no longer serve end-user operations adequately.
6. HDFS-7435: PB encoding of block reports is very inefficient
• Block report RPC message encoding can cause memory allocation inefficiency and garbage collection churn.
  – HDFS RPC messages are encoded using Protocol Buffers.
  – Block reports encode each block as a sequence of three 64-bit long fields.
  – Behind the scenes, this becomes an ArrayList<Long> with a default capacity of 10.
  – DataNodes almost always send a larger block report than this, so array reallocation churn is almost guaranteed.
  – Boxing and unboxing cause additional allocation requirements.
• Solution: a more GC-friendly encoding of block reports.
  – Take over serialization directly.
  – Manually encode the number of longs, followed by the list of primitive longs.
  – Eliminates ArrayList reallocation costs.
  – Eliminates boxing and unboxing costs by deserializing straight to primitive long.
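The count-then-primitive-longs idea can be sketched as follows. This is a hedged illustration of the encoding style, not the actual HDFS-7435 patch; the class and field layout (blockId, length, generation stamp as three longs) are assumptions for the example.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch of a GC-friendly block report encoding: write the
// number of longs, then the primitive longs themselves. No ArrayList, no
// boxed Long objects, and decoding allocates one exact-size long[].
public class BlockReportCodec {

    // Each block is reported as 3 longs (e.g. id, length, generation stamp).
    static byte[] encode(long[] blockFields) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 8 * blockFields.length);
        buf.putInt(blockFields.length);      // number of longs
        for (long v : blockFields) {
            buf.putLong(v);                  // primitive write, no boxing
        }
        return buf.array();
    }

    static long[] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long[] out = new long[buf.getInt()]; // exact size, one allocation
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getLong();          // straight to primitive long
        }
        return out;
    }

    public static void main(String[] args) {
        long[] report = {1001L, 134217728L, 17L};
        assert Arrays.equals(report, decode(encode(report)));
        System.out.println("round-trip ok");
    }
}
```

Compared with a List<Long>, this avoids both the repeated array reallocation as the list grows past its default capacity and the per-element boxing garbage.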
7. HDFS-9710: Change DN to send block receipt IBRs in batches
• Incremental block reports trigger multiple RPC calls.
  – When a DataNode receives a block, it immediately sends an incremental block report RPC to the NameNode.
  – Multiple block receipts therefore translate into multiple individual incremental block report RPCs.
  – Across all DataNodes in a large cluster, this can become a huge number of RPC messages for the NameNode to process.
• Solution: batch multiple block receipt events into a single RPC message.
  – Reduces the RPC overhead of sending multiple messages.
  – Scales better with respect to the number of nodes and number of blocks in a cluster.
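The batching idea can be modeled with a few lines of Java. This is an illustrative sketch only; the class and method names are hypothetical, not the DataNode's actual IBR plumbing, and the real patch also flushes on a timer rather than purely on batch size.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of HDFS-9710's batching: buffer block-receipt events and send
// one RPC per batch instead of one RPC per block.
public class IbrBatcher {
    private final List<Long> pendingBlockIds = new ArrayList<>();
    private final int batchSize;
    int rpcsSent = 0;  // counts simulated RPCs to the NameNode

    IbrBatcher(int batchSize) {
        this.batchSize = batchSize;
    }

    void blockReceived(long blockId) {
        pendingBlockIds.add(blockId);
        if (pendingBlockIds.size() >= batchSize) {
            flush();
        }
    }

    void flush() {
        if (pendingBlockIds.isEmpty()) {
            return;
        }
        rpcsSent++;               // one RPC carries the whole batch
        pendingBlockIds.clear();
    }
}
```

With a batch size of 10, receiving 100 blocks costs 10 RPCs instead of 100, which is exactly the NameNode-side RPC reduction the slide describes.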
8. Liveness
• "...make progress despite the fact that its concurrently executing components ("processes") may have to "take turns" in critical sections..." –Wikipedia
• DataNode heartbeats
  – Responsible for reporting the health of a DataNode to the NameNode.
  – Operational problems of managing load and performance can block timely heartbeat processing.
  – Heartbeat processing at the NameNode can be surprisingly costly due to contention on a global lock and asynchronous dispatch of commands (e.g. delete block).
• Blocked heartbeat processing can cause cascading failure and downtime.
  – Blocked heartbeat processing looks the same as a DataNode not running at all.
  – DataNodes not running are flagged by the NameNode as stale, then dead.
  – Multiple stale DataNodes: reduced cluster capacity.
  – Multiple dead DataNodes: a storm of wasteful re-replication activity.
9. HDFS-9239: DataNode Lifeline Protocol: an alternative protocol for reporting DataNode health
• The lifeline keeps the DataNode alive, despite conditions of unusually high load.
  – Optionally run a separate RPC server within the NameNode dedicated to processing lifeline messages sent by DataNodes.
  – Lifeline messages are a simplified form of heartbeat messages, but they do not have the same costly requirements for asynchronous command dispatch, and therefore do not need to contend on a shared lock.
  – Even if the main NameNode RPC queue is overwhelmed, the lifeline still keeps the DataNode alive.
  – Prevents erroneous and costly re-replication activity.
10. HDFS-9311: Support optional offload of NameNode HA service health checks to a separate RPC server
• RPC offload of HA health check and failover messages.
  – Similar to the problem of timely heartbeat message delivery.
  – NameNode HA requires messages sent from the ZKFC (ZooKeeper Failover Controller) process to the NameNode.
  – These messages handle periodic health checks and initiate shutdown and failover if necessary.
  – A NameNode overwhelmed with unusually high load cannot process these messages.
  – Delayed processing of these messages slows down NameNode failover, and thus creates a visibly prolonged outage period.
  – The lifeline RPC server can be used to offload HA messages, and similarly keeps processing them even under unusually high load.
11. Optimizing Applications
• HDFS utilization patterns
  – Sometimes it’s helpful to look a layer higher and assess what applications are doing with HDFS.
  – Unfortunately, the FileSystem API can make it too easy to implement inefficient call patterns.
12. HIVE-10223: Consolidate several redundant FileSystem API calls
• The Hadoop FileSystem API can cause applications to make redundant RPC calls.
• Before:
    if (fs.isFile(file)) {                  // RPC #1
      ...
    } else if (fs.isDirectory(file)) {      // RPC #2
      ...
    }
• After:
    FileStatus fileStatus = fs.getFileStatus(file);  // Just 1 RPC
    if (fileStatus.isFile()) {              // Local, no RPC
      ...
    } else if (fileStatus.isDirectory()) {  // Local, no RPC
      ...
    }
• Good for Hive, because it reduces latency associated with NameNode RPCs.
• Good for the whole ecosystem, because it reduces load on the NameNode, a shared service.
13. PIG-4442: Eliminate redundant RPC call to get file information in HPath
• A similar story of redundant RPC within Pig code.
• Before:
    long blockSize = fs.getHFS().getFileStatus(path).getBlockSize();       // RPC #1
    short replication = fs.getHFS().getFileStatus(path).getReplication();  // RPC #2
• After:
    FileStatus fileStatus = fs.getHFS().getFileStatus(path);  // Just 1 RPC
    long blockSize = fileStatus.getBlockSize();               // Local, no RPC
    short replication = fileStatus.getReplication();          // Local, no RPC
• Revealed by inspection of the HDFS audit log.
  – The HDFS audit log records each file system operation executed against the NameNode.
  – It continues to be one of the most significant sources of HDFS troubleshooting information.
  – In this case, manual inspection revealed a suspicious pattern of multiple getfileinfo calls for the same path from a Pig job submission.
14. Managing NameNode Load
• The NameNode is no longer a single point of failure.
  – However, NameNode performance can still be a bottleneck.
• HDFS assumes that applications will be well-behaved.
• A single inefficient job can easily overwhelm the NameNode with too much RPC load.
15. Hadoop RPC Architecture
• Hadoop RPC admits incoming calls into a shared queue.
• Worker threads consume incoming calls from that shared queue and process them.
• In an overloaded situation, calls spend longer waiting in the queue for a worker thread to become available.
• If the RPC queue overflows, requests queue up in the OS socket buffers.
  – More buffering leads to higher RPC latencies and potentially client-side timeouts.
  – Timeouts often result in job failures and restarts.
  – Restarted jobs cause more work: a positive feedback loop.
• This affects all callers, not just the caller that triggered the unusually high load.
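The shared-queue design above can be modeled with a bounded queue from the JDK. This is a minimal stand-in, not Hadoop's actual ipc.Server internals: a non-blocking offer() returning false plays the role of the overflow condition that pushes waiting requests back into OS socket buffers.

```java
import java.util.concurrent.ArrayBlockingQueue;

// Minimal model of a shared RPC call queue with capacity 2: calls are
// admitted until the queue is full, and a worker draining one call frees
// a slot for the next.
public class RpcQueueModel {
    public static void main(String[] args) throws InterruptedException {
        ArrayBlockingQueue<String> callQueue = new ArrayBlockingQueue<>(2);

        System.out.println(callQueue.offer("call-1")); // admitted: true
        System.out.println(callQueue.offer("call-2")); // admitted: true
        System.out.println(callQueue.offer("call-3")); // queue full: false

        callQueue.take();                              // worker consumes a call
        System.out.println(callQueue.offer("call-3")); // admitted now: true
    }
}
```

In the real server the rejected call does not simply fail; it waits in kernel socket buffers, which is where the extra latency and eventual client timeouts come from.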
16. HADOOP-10597: RPC Server signals backoff to clients when all request queues are full
• If an RPC server’s queue is full, it responds to new requests with a backoff signal.
• Clients react by performing exponential backoff before retrying the call.
  – Reduces job failures by avoiding client timeouts.
• Improves QoS for clients when the server is under heavy load.
• RPC calls that would have timed out instead succeed, but with longer latency.
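The client-side reaction can be sketched as a capped exponential delay schedule. This is only an illustration of the growth pattern; the real retry behavior lives in Hadoop's RPC retry policies, and the base delay, cap, and method names here are assumptions.

```java
// Sketch of capped exponential backoff: the wait before retry attempt n
// doubles each time (base * 2^n) until it hits a configured ceiling.
public class ExponentialBackoff {

    static long delayMillis(int attempt, long baseMillis, long capMillis) {
        // Clamp the shift so base << attempt cannot overflow a long.
        long delay = baseMillis << Math.min(attempt, 20);
        return Math.min(delay, capMillis);
    }

    public static void main(String[] args) {
        // Waits for the first six attempts: 100, 200, 400, 800, 1600, 3200 ms.
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println("attempt " + attempt + ": wait "
                + delayMillis(attempt, 100, 10_000) + " ms");
        }
    }
}
```

The doubling spreads retries out over time, so a momentarily overloaded server is not immediately hit again by every backed-off client at once.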
17. HADOOP-10282: FairCallQueue
• Replaces the single RPC queue with multiple prioritized queues.
• The server maintains a sliding window of RPC request counts, by user.
• New RPC calls are placed into queues with priority based on the calling user’s history.
• Calls are de-queued and processed with higher probability from higher-priority queues.
• De-prioritizes heavy users under high load and prevents starvation of other jobs.
• Complements RPC congestion control.
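The multi-queue scheduling idea can be shown with a toy weighted round-robin. This is not the actual FairCallQueue implementation: the real one picks queues probabilistically via a multiplexer, while this deterministic sketch (hypothetical class and weights) just makes the "higher-priority queues are drained more often" behavior visible.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;

// Toy prioritized call queues: with weights {3, 1}, three calls are drained
// from queue 0 for every one drained from queue 1, yet queue 1 is never
// starved as long as it has calls waiting.
public class PrioritizedQueues {
    final Queue<String>[] queues;
    final int[] weights;
    private int turn = 0;

    @SuppressWarnings("unchecked")
    PrioritizedQueues(int[] weights) {
        this.weights = weights;
        this.queues = new Queue[weights.length];
        for (int i = 0; i < weights.length; i++) {
            queues[i] = new ArrayDeque<>();
        }
    }

    void enqueue(int priority, String call) {
        queues[priority].add(call);
    }

    // Deterministic weighted round-robin stand-in for the probabilistic pick.
    String dequeue() {
        int total = Arrays.stream(weights).sum();
        for (int step = 0; step < total; step++) {
            int slot = (turn + step) % total;
            int q = slotToQueue(slot);
            if (!queues[q].isEmpty()) {
                turn = (slot + 1) % total;
                return queues[q].poll();
            }
        }
        return null;  // all queues empty
    }

    private int slotToQueue(int slot) {
        for (int q = 0, acc = 0; q < weights.length; q++) {
            acc += weights[q];
            if (slot < acc) {
                return q;
            }
        }
        throw new IllegalStateException("slot out of range");
    }
}
```

A heavy user demoted to the low-weight queue still makes progress, just more slowly, which is the starvation-prevention property the slide highlights.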
18. HADOOP-12916: Allow RPC scheduler/CallQueue backoff using response times
• Flexible backoff policies.
  – Triggering backoff when the queue is full is often too late.
  – Clients may already be experiencing timeouts before the RPC queue overflows.
• Instead, track call response time and trigger backoff when response time exceeds bounds.
• Further reduces the probability of client timeouts and hence reduces job failures.
19. HADOOP-13128: Manage Hadoop RPC resource usage via resource coupon (proposed feature)
• Multi-tenancy is a key challenge in large enterprise deployments.
• Allows HDFS and the YARN ResourceManager to coordinate allocation of RPC resources to multiple applications running concurrently in a multi-tenant deployment.
• FairCallQueue can lead to priority inversion.
  – The NameNode is not aware of the relative priorities of YARN jobs.
  – Requests from a high-priority application can be demoted to a lower-priority RPC call queue.
• A resource coupon is presented by incoming RPC requests.
  – Allows the ResourceManager to request a slice of NameNode capacity via a coupon.
20. Logging
• Logging requires a careful balance.
• Too much logging causes:
  – Information overload.
  – Increased system load: rendering strings is expensive and creates garbage.
• Too little logging hides valuable operational information.
21. Too much logging
• Benign errors can confuse administrators:
  – INFO ipc.Server ( - IPC Server handler 32 on 8021, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from Call#9371 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
  – ERROR datanode.DataNode ( – error processing unknown operation src: / dst: /
22. Logging Pitfalls
• Forgotten guard logic.
    if (LOG.isDebugEnabled()) {
      LOG.debug("Processing block: " + block);  // expensive toString() implementation!
    }
• Switching the logging API to SLF4J can eliminate the need for log-level guards in most cases.
    LOG.debug("Processing block: {}", block);   // calls toString() only if debug enabled
• Logging in a tight loop.
• Logging while holding a shared resource, such as a mutually exclusive lock.
23. HDFS-9434: Recommission a datanode with 500k blocks may pause NN for 30 seconds
• Logging is too verbose.
  – Summary of patch: don’t log too much!
  – Move detailed logging to debug or trace level.
• Before:
    LOG.info("BLOCK* processOverReplicatedBlock: " +
        "Postponing processing of over-replicated " + block +
        " since storage + " + storage +
        "datanode " + cur + " does not yet have up-to-date " +
        "block information.");
• After:
    LOG.trace("BLOCK* processOverReplicatedBlock: Postponing {}" +
        " since storage {} does not yet have up-to-date information.",
        block, storage);
24. Troubleshooting
• Metrics are vital for diagnosing most operational problems.
  – Metrics must be capable of showing that there is a problem (e.g. an RPC call volume spike).
  – Metrics must also be capable of identifying the source of that problem (e.g. the user issuing the RPC calls).
25. HDFS-6982: nntop
• Find activity trends of HDFS operations.
  – The HDFS audit log contains a record of each file system operation against the NameNode.
      2015-11-16 21:00:00,109 INFO FSNamesystem.audit: allowed=true ugi=bob (auth:SIMPLE) ip=/ cmd=listStatus src=/app-logs/pcd_batch/application_1431545431771/ dst=null perm=null
  – However, identifying sources of load from the audit log requires ad-hoc scripting.
• nntop: HDFS operation counts aggregated per operation and per user within time windows.
  – TopUserOpCounts – default time windows of 1 minute, 5 minutes, 25 minutes
  – curl ',name=FSNamesystemState'
26. nnTop Sample Output
    "windowLenMs": 60000,
    "ops": [ {
      "opType": "create",
      "topUsers": [
        { "user": "alice@EXAMPLE.COM", "count": 4632 },
        { "user": "bob@EXAMPLE.COM", "count": 1387 }
      ],
      "totalCount": 6019
    } ...
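The aggregation behind output like the above can be sketched in a few lines. This is a simplified, hypothetical model of nntop-style counting, not the TopUserOpCounts implementation; it keeps per-(operation, user) counts for a single window and omits the sliding-window rollover.

```java
import java.util.HashMap;
import java.util.Map;

// Toy nntop-style aggregator: count operations per (opType, user) within one
// time window, so heavy users of a given operation can be ranked.
public class TopUserOps {
    private final Map<String, Map<String, Integer>> countsByOp = new HashMap<>();

    // Record one audit-log event, e.g. record("create", "alice@EXAMPLE.COM").
    void record(String op, String user) {
        countsByOp.computeIfAbsent(op, k -> new HashMap<>())
                  .merge(user, 1, Integer::sum);
    }

    int count(String op, String user) {
        return countsByOp.getOrDefault(op, Map.of()).getOrDefault(user, 0);
    }

    int totalCount(String op) {
        return countsByOp.getOrDefault(op, Map.of())
                         .values().stream().mapToInt(Integer::intValue).sum();
    }
}
```

Sorting each operation's user map by count and serializing it per window would yield exactly the kind of topUsers/totalCount report shown in the sample output.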
27. Troubleshooting Kerberos
• Kerberos is hard.
  – Many moving parts: KDC, DNS, principals, keytabs and Hadoop configuration.
  – Management tools like Apache Ambari automate initial provisioning of principals, keytabs and configuration.
  – When it doesn’t work, finding the root cause is challenging.
28. HADOOP-12426: kdiag
• Kerberos misconfiguration diagnosis.
  – DNS
  – Hadoop configuration files
  – KDC configuration
• kdiag: a command-line tool for diagnosing Kerberos problems.
  – Prints various environment variables, Java system properties and Hadoop configuration options related to security.
  – Attempts a login.
  – If a keytab is used, prints principal information from the keytab.
  – Prints krb5.conf.
  – Validates the kinit executable (used for ticket renewals).
29. kdiag Sample Output – misconfigured DNS
    [hdfs@c6401 ~]$ hadoop
    == Kerberos Diagnostics scan at Mon Jun 27 23:13:40 UTC 2016 ==
    16/06/27 23:13:40 ERROR security.KDiag: unknown error
        (stack trace frames elided)
30. Summary
• A variety of recent enhancements have improved the ability of HDFS to serve as the foundational storage layer of the Hadoop ecosystem.
• Optimization
  – Performance
  – Optimizing applications
• Stabilization
  – Liveness
  – Managing load
• Supportability
  – Logging
  – Troubleshooting
31. Thank you! Q&A
• A few recommended best practices while we address questions…
  – Enable HDFS audit logs and periodically monitor audit logs/nnTop for unexpected patterns.
  – Configure service heap settings correctly.
  – Use dedicated disks for NN metadata directories/journal node directories.
  – Run balancer (and soon disk-balancer) periodically.
  – Monitor for LDAP group lookup performance issues.
  – Use SmartSense for proactive analysis of potential issues and recommended fixes.