Konstantin Shvachko and Chen Liang of LinkedIn team up with Chao Sun of Uber to present the current state of, and future plans for, HDFS scalability, with an extended discussion of the newly introduced read-from-standby feature.
This is taken from the Apache Hadoop Contributors Meetup on January 30, hosted by LinkedIn in Mountain View.
5. Motivation
• 2x growth per year in workloads and size
• Rapidly approaching the active NameNode's performance limits
• We need a scalability solution
• Key insights:
• Reads comprise 95% of all metadata operations in our practice
• Another source of truth for reads: Standby Nodes
• Standby Nodes serving read requests
• Can substantially decrease the active NameNode's workload
• Allowing the cluster to scale further!
6. Architecture
ROLE OF STANDBY NODES
[Diagram: DataNodes, Active NameNode, Standby NameNodes, JournalNodes; write and read paths]
• Standby nodes have the same copy of all metadata (with some delay)
• A Standby Node syncs edits from the Active NameNode
• Standby nodes can potentially serve read requests
• All reads can go to Standby nodes
• OR, time-critical applications can still choose to read from the Active only
8. Fast Journaling
DELAY REDUCTION
• Fast edit tailing HDFS-13150
• The current JN is slow: it serves whole segments of edits from disk
• Optimizations on the JN and SbNN
o The JN caches recent edits in memory; only applied edits are served
o The SbNN requests only recent edits through RPC calls
o Fall back to the existing mechanism on error
• Significantly reduces SbNN delay
o Reduced from ~1 minute to 2–50 milliseconds
• Standby node delay is no more than a few ms in most cases
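The caching optimization above can be sketched in plain Java. `EditsCache` and its methods are illustrative names for this sketch, not the actual Hadoop classes from HDFS-13150: the JournalNode keeps recently applied edits in memory and signals a fall back to disk segments when the requested range has been evicted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Illustrative in-memory cache of recent edits, keyed by transaction id. */
class EditsCache {
    private final int capacity;
    private final Deque<long[]> edits = new ArrayDeque<>(); // {txid, payload}

    EditsCache(int capacity) { this.capacity = capacity; }

    /** JournalNode appends an applied edit; the oldest entry is evicted when full. */
    void append(long txid, long payload) {
        if (edits.size() == capacity) edits.removeFirst();
        edits.addLast(new long[]{txid, payload});
    }

    /**
     * Serve edits with txid > sinceTxId from memory, or return null to
     * signal the caller to fall back to reading segments from disk.
     */
    List<long[]> editsSince(long sinceTxId) {
        if (edits.isEmpty() || edits.peekFirst()[0] > sinceTxId + 1) {
            return null; // requested range already evicted: fall back
        }
        List<long[]> out = new ArrayList<>();
        for (long[] e : edits) if (e[0] > sinceTxId) out.add(e);
        return out;
    }
}
```

The SbNN would call `editsSince` over RPC with its last applied txid; a null result corresponds to the "fall back to existing mechanism" bullet above.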
9. Consistency Model
• Consistency principle:
• If client c1 modifies an object's state to id1 at time t1, then at any future time t2 > t1, c1 will see the state of that object at id2 >= id1
• Read-your-own-writes:
• A client writes to the Active NameNode
• Then reads from the Standby Node
• The read should reflect the write
[Diagram: client with lastSeenStateId = 100; Active NameNode at txnid = 100; Standby NameNodes at txnid = 99, catching up via JournalNodes]
10. Consistency Model
• Consistency principle:
• If client c1 modifies an object's state to id1 at time t1, then at any future time t2 > t1, c1 will see the state of that object at id2 >= id1
• LastSeenStateId
• A monotonically increasing id of the ANN namespace state (txnid)
• Kept on the client side: the client's most recently seen ANN state
• Sent to the SbNN; the SbNN replies only after it has caught up to this state
[Diagram: client with lastSeenStateId = 100; Active NameNode at txnid = 100; Standby NameNodes syncing via JournalNodes]
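The LastSeenStateId handshake can be sketched in plain Java. `StandbyState` is an illustrative class for this sketch, not Hadoop's implementation: the standby blocks a read until its applied txid has reached the client's `lastSeenStateId`.

```java
/** Illustrative sketch of the LastSeenStateId protocol (not Hadoop's classes). */
class StandbyState {
    private long appliedTxId; // last edit applied on this standby

    /** Edit tailing advances the standby's state and wakes waiting readers. */
    synchronized void applyEditsUpTo(long txid) {
        appliedTxId = txid;
        notifyAll();
    }

    /** Serve a read only once this standby has caught up to the client's state. */
    synchronized long awaitState(long lastSeenStateId, long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (appliedTxId < lastSeenStateId) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) throw new IllegalStateException("standby too far behind");
            try {
                wait(remaining);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting");
            }
        }
        return appliedTxId; // the read now reflects id2 >= id1
    }
}
```

The timeout path corresponds to the back-off behavior discussed later: rather than wait forever, the standby rejects the request so the client can retry elsewhere.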
11. Corner Case: Stale Reads
• Stale read cases
• Case 1: multiple client instances
• DFSClient#1 writes to the ANN, DFSClient#2 reads from the SbNN
• DFSClient#2's state is older than DFSClient#1's, so the read is out of sync
• Case 2: out-of-band communication
• Client#1 writes to the ANN, then informs Client#2
• Client#2 reads from the SbNN and does not see the write
[Diagram: "Read your own writes" — DFSClient#1 writes to the Active NameNode, DFSClient#2 reads from the Standby NameNode]
[Diagram: "Third-party communication" — DFSClient#1 writes to the Active NameNode, DFSClient#2 reads from the Standby NameNode]
12. msync API
• Dealing with stale reads: FileSystem.msync()
• Syncs between existing client instances
• Forces the DFSClient to sync up to the most recent state of the ANN
• Multiple client instances: call msync on DFSClient#2
• Out-of-band communication: Client#2 calls msync before reading
• “Always msync” mode HDFS-14211
[Diagram: "Read your own writes" and "Third-party communication" cases, as on the previous slide]
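FileSystem.msync() is the real HDFS API named above; the plain-Java simulation below only illustrates its semantics, and all class names here (`SimClient`, `SimActive`) are hypothetical. msync asks the Active for its latest state id, without reading any data, so subsequent standby reads carry an up-to-date lastSeenStateId.

```java
/** Plain-Java simulation of msync() semantics (not the Hadoop implementation). */
class SimActive {
    private long txnid;
    long write() { return ++txnid; }        // each write advances the namespace state
    long currentStateId() { return txnid; } // what msync fetches
}

class SimClient {
    private long lastSeenStateId; // the client's known most recent ANN state

    long lastSeenStateId() { return lastSeenStateId; }

    /** A write through this client naturally updates its state id. */
    void writeToActive(SimActive ann) { lastSeenStateId = ann.write(); }

    /** msync: sync to the most recent ANN state without transferring data. */
    void msync(SimActive ann) { lastSeenStateId = ann.currentStateId(); }
}
```

In the out-of-band case, Client#2 calls msync before its read; its lastSeenStateId then covers Client#1's write, so the standby cannot reply with an older state.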
13. Robustness Optimization: Standby Node Back-off
REDIRECT WHEN TOO FAR BEHIND
• In case a Standby node's state is too far behind, the client may retry another node
• e.g. the Standby node's machine is running slow
• Standby Node back-off
• 1: Upon receiving a request, if the Standby node finds itself too far behind the requested state, it rejects the request, throwing a retry exception
• 2: If a request has been queued for too long and the Standby has still not caught up, the Standby rejects the request, throwing a retry exception
• Client retry
• Upon a retry exception, the client tries a different Standby node, or simply falls back to the ANN
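The two rejection rules above can be sketched as a small policy class; the names and thresholds are illustrative, not Hadoop's:

```java
/** Illustrative back-off policy a standby might apply (not Hadoop's classes). */
class BackoffPolicy {
    private final long maxLagTxns; // rule 1: reject immediately beyond this lag
    private final long maxQueueMs; // rule 2: reject if queued longer while behind

    BackoffPolicy(long maxLagTxns, long maxQueueMs) {
        this.maxLagTxns = maxLagTxns;
        this.maxQueueMs = maxQueueMs;
    }

    /**
     * Decide whether to reject with a retry exception, letting the client
     * try a different standby or fall back to the Active NameNode.
     */
    boolean shouldReject(long requestedStateId, long appliedTxId, long queuedMs) {
        long lag = requestedStateId - appliedTxId;
        if (lag <= 0) return false;         // already caught up: serve the read
        if (lag > maxLagTxns) return true;  // rule 1: too far behind the request
        return queuedMs > maxQueueMs;       // rule 2: queued too long, still behind
    }
}
```

A request that is only slightly behind is held briefly (the standby usually catches up in milliseconds); rejection is reserved for the genuinely slow-node case.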
14. Configuration and Startup Process
• Configuring NameNodes
• Configure NameNodes via haadmin
• Observer mode is similar to Standby, but serves reads and does not perform checkpointing
• All NameNodes start as checkpointing Standby; a Standby can be transitioned to Active or Observer
• Configuring the client
• Configure it to use ObserverReadProxyProvider
• If not, the client still works but only talks to the ANN
• ObserverReadProxyProvider will discover the state of all NNs
[Diagram: NameNode state transitions among Active, Checkpointing Standby, and Read-Serving Standby (Observer)]
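As a hedged sketch of the client-side configuration mentioned above: the proxy provider class is the one shipped in Apache Hadoop, while the nameservice name `mycluster` is a placeholder that must match your cluster's `dfs.nameservices` entry.

```xml
<!-- hdfs-site.xml on the client; "mycluster" is a placeholder nameservice -->
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider</value>
</property>
```

Clients that keep the default failover proxy provider continue to work unchanged, but direct all reads to the ANN.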
15. Current Status
• Tested and benchmarked
• With YARN applications, e.g. TeraSort
• With HDFS benchmarks, e.g. DFSIO
• Run on a cluster with >100 nodes, with Kerberos and delegation tokens enabled
• Merged to trunk (3.3.0)
• Being backported to branch-2
• Active work on further improvements/optimizations
• Has been running in production at Uber
16. Background
● Back in 2017, Uber’s HDFS clusters were in bad shape
○ Rapid growth in the number of jobs accessing HDFS
○ Ingestion & ad-hoc jobs co-located on the same cluster
○ Lots of listing calls on very large directories (esp. Hudi)
● HDFS traffic composition: 96% reads, 4% writes
● Presto is very sensitive to HDFS latency
○ Occupies ~20% of HDFS traffic
○ Only reads from HDFS, no writes
17. Implementation & Timeline
● Implementation (compared to the open source version)
○ No msync or fast edit log tailing
■ Only eventual consistency, with max staleness of 10s
○ Observer was NOT eligible for NN failover
○ Batched edits loading to reduce NN lock time when tailing edits
● Timeline
○ 08/2017 - finished the POC and basic testing in dev clusters
○ 12/2017 - started collaborating with the HDFS open source community (e.g., LinkedIn, Paypal)
○ 12/2018 - fully rolled out to Presto in production
○ Took multiple retries along the way
■ Disable access time (dfs.namenode.accesstime.precision)
■ HDFS-13898, HDFS-13924
18. Impact
Compared to traffic going to the active NameNode, the Observer NameNode improves overall throughput by ~20% (with roughly the same throughput from Presto), while RPC queue time has dropped ~30x.
21. Three-Stage Scalability Plan
2X GROWTH / YEAR IN WORKLOADS AND SIZE
• Stage I. Consistent reads from standby
• Optimize for reads: 95% of all operations
• Consistent reading is a coordination problem
• Stage II. In-memory Partitioned Namespace
• Optimize write operations
• Eliminate NameNode’s global lock – fine-grained locking
• Stage III. Dynamically Distributed Namespace Service
• Linear scaling to accommodate increases in RPC load and metadata growth
HDFS-12943
22. NameNode Current State
NAMENODE’S GLOBAL LOCK – PERFORMANCE BOTTLENECK
• Three main data structures
• INodeMap: id -> INode
• BlocksMap: key -> BlockInfo
• DatanodeMap: not split
• GSet – an efficient HashMap implementation
• Hash(key) -> Value
• Global lock to update INodes and blocks
[Diagram: NameNode (FSNamesystem) — INodeMap / directory tree (GSet: id -> INode), BlocksMap / Block Manager (GSet: Block -> BlockInfo), DataNode Manager]
23. Stage II. In-memory Partitioned Namespace
ELIMINATE NAMENODE’S GLOBAL LOCK
• PartitionedGSet:
• Two-level mapping
1. RangeMap: keyRange -> GSet
2. RangeGSet: key -> INode
• Fine-grained locking
• Individual locks per range
• Different ranges are accessed in parallel
[Diagram: NameNode with INodeMap and BlocksMap each as a partitioned GSet (GSet-1 ... GSet-n), plus the DataNode Manager]
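The two-level mapping with per-range locks might be sketched as follows; this is a simplified illustration under assumed fixed-size key ranges, not the actual HDFS PartitionedGSet:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative two-level map with individual locks per range. */
class PartitionedGSet<V> {
    private static final long RANGE_SIZE = 1_000_000L; // assumed fixed-size ranges

    private static final class Partition<V> {
        final Map<Long, V> gset = new HashMap<>();      // level 2: key -> INode
        final ReentrantLock lock = new ReentrantLock(); // individual lock per range
    }

    // Level 1: keyRange -> GSet (TreeMap keeps ranges ordered by start key)
    private final TreeMap<Long, Partition<V>> rangeMap = new TreeMap<>();

    private Partition<V> partitionFor(long key) {
        synchronized (rangeMap) {
            return rangeMap.computeIfAbsent(key / RANGE_SIZE, r -> new Partition<>());
        }
    }

    /** Writers touching different ranges proceed in parallel. */
    public void put(long key, V value) {
        Partition<V> p = partitionFor(key);
        p.lock.lock();
        try { p.gset.put(key, value); } finally { p.lock.unlock(); }
    }

    public V get(long key) {
        Partition<V> p = partitionFor(key);
        p.lock.lock();
        try { return p.gset.get(key); } finally { p.lock.unlock(); }
    }
}
```

Two threads writing keys in different ranges contend only on the brief level-1 lookup, not on each other's partition locks; this is the fine-grained locking the slide describes.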
24. Stage II. In-memory Partitioned Namespace
EARLY POC RESULTS
• PartitionedGSet: two-level mapping
• LatchLock: swap the RangeMap lock for the GSet locks corresponding to the inode keys
• Ran NNThroughputBenchmark creating 10 million directories
• 30% throughput gain
• Large batches of edits
• Why not 100%?
• The key is inodeId – an incremental number generator
• Contention on the last partition
• Expect MORE
25. Stage III. Dynamically Distributed Namespace
SCALABLE DATA AND METADATA
• Split the NameNode state into multiple servers based on ranges
• Each NameNode
• Serves a designated range of INode keys
• Keeps metadata in a PartitionedGSet
• Can reassign certain subranges to adjacent nodes
• Coordination Service (Ratis)
• Changes the ranges served by NNs
• Handles renames/moves and quotas
[Diagram: NameNode 1 ... NameNode n, each with an INodeMap (partitioned GSet), a BlocksMap (partitioned GSet), and a DataNode Manager]
26. Thank You!
Konstantin V Shvachko – Sr. Staff Software Engineer @LinkedIn
Chen Liang – Senior Software Engineer @LinkedIn
Chao Sun – Software Engineer @Uber
Consistent Reads from Standby Node
Editor's notes
State transition diagram
Winter is coming!
See Appendix. The SlideShare version will have more details about the Satellite Cluster configuration and operational solutions