Consistent Reads from Standby Node
Konstantin V Shvachko
Sr. Staff Software Engineer
@LinkedIn
Chen Liang
Senior Software Engineer
@LinkedIn
Chao Sun
Software Engineer
@Uber
Agenda
HDFS CONSISTENT READ FROM STANDBY
• Motivation
• Consistent Reads from Standby
• Challenges
• Design and Implementation
• Next steps
The Team
• Konstantin Shvachko (LinkedIn)
• Chen Liang (LinkedIn)
• Erik Krogen (LinkedIn)
• Chao Sun (Uber)
• Plamen Jeliazkov (PayPal)
Consistent Reads From
Standby Nodes
Motivation
• 2x Growth/Year In Workloads and Size
• Rapidly approaching active NameNode performance limits
• We need a scalability solution
• Key Insights:
• Reads comprise 95% of all metadata operations in our practice
• Another source of truth for read: Standby Nodes
• Standby Nodes Serving Read Requests
• Can substantially decrease active Name Node workload
• Allowing cluster to scale further!
Architecture
ROLE OF STANDBY NODES
[Diagram: Active NameNode, Standby NameNodes, JournalNodes, DataNodes; write and read paths]
• Standby nodes have the same copy of all metadata (with some delay)
• Standby Node syncs edits from Active
NameNode
• Standby nodes can potentially serve read
requests
• All reads can go to Standby nodes
• OR, time-critical applications can still choose to read from the Active only
Challenges
[Diagram: Active NameNode, Standby NameNodes, JournalNodes, DataNodes; write and read paths]
• Standby Node delay
• ANN writes edits to the JNs, then the SbNN applies the edits from the JNs
• Delay is on the order of minutes
• Consistency
• If a client performs a read after a write, it expects to see the state change
Fast Journaling
DELAY REDUCTION
• Fast Edit Tailing HDFS-13150
• Current JN is slow: serves whole segments of edits from disk
• Optimizations on JN and SbNN (sketched below)
o JN caches recent edits in memory; only applied edits are served
o SbNN requests only recent edits through RPC calls
o Fall back to the existing mechanism on error
• Significantly reduces SbNN delay
o Reduced from ~1 minute to 2–50 milliseconds
• Standby node delay is no more than a few ms in most cases
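To make the fast-tailing idea concrete, here is a minimal JournalNode-side sketch (the class and method names are hypothetical; the actual work is HDFS-13150): recently applied edits sit in an in-memory map keyed by transaction id, the Standby asks for "everything since txid X" over RPC, and a null answer tells it to fall back to the segment-based, on-disk path.

```java
// Hypothetical, simplified sketch -- not the actual JournalNode code.
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

class InMemoryEditCache {
  private final ConcurrentSkipListMap<Long, byte[]> cache = new ConcurrentSkipListMap<>();
  private final int maxEntries;

  InMemoryEditCache(int maxEntries) { this.maxEntries = maxEntries; }

  // Called as edits are applied on the JournalNode.
  synchronized void put(long txId, byte[] serializedOp) {
    cache.put(txId, serializedOp);
    while (cache.size() > maxEntries) {
      cache.pollFirstEntry();            // evict the oldest cached edits
    }
  }

  // RPC handler: return edits with txId >= sinceTxId, or null to signal the
  // caller to fall back to the existing segment-based mechanism.
  List<byte[]> getEditsSince(long sinceTxId) {
    Map.Entry<Long, byte[]> oldest = cache.firstEntry();
    if (oldest == null || sinceTxId < oldest.getKey()) {
      return null;                       // requested edits are no longer cached
    }
    return List.copyOf(cache.tailMap(sinceTxId, true).values());
  }
}
```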
Consistency Model
• Consistency Principle:
• If client c1 modifies an object's state at id1 at time t1, then at any future time t2 > t1, c1 will see the state of that object at some id2 >= id1
• Read-Your-Own-Write
• Client writes to the Active NameNode
• Then reads from the Standby Node
• The read should reflect the write
[Diagram: client lastSeenStateId = 100; Active NameNode at txnid = 100; Standby NameNodes at txnid = 99, tailing edits from the JournalNodes]
Consistency Model
• Consistency Principle:
• If client c1 modifies an object's state at id1 at time t1, then at any future time t2 > t1, c1 will see the state of that object at some id2 >= id1
• LastSeenStateID
• Monotonically increasing ID of the ANN namespace state (txnid)
• Kept on the client side; the client's most recently seen ANN state
• Sent to the SbNN; the SbNN replies only after it has caught up to this state (see the sketch below)
[Diagram: client lastSeenStateId = 100; Active and Standby NameNodes at txnid = 100, with edits flowing through the JournalNodes]
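Below is a minimal sketch of the lastSeenStateId handshake. The class names are hypothetical; in HDFS the id travels in the RPC request/response headers, and the Observer re-queues the call instead of sleeping.

```java
// Client side: remember the highest state id seen from the Active NameNode
// and attach it to reads sent to an Observer.
class StateIdClient {
  private final java.util.concurrent.atomic.AtomicLong lastSeenStateId =
      new java.util.concurrent.atomic.AtomicLong(0);

  void onResponseFromActive(long serverStateId) {
    lastSeenStateId.accumulateAndGet(serverStateId, Math::max);
  }

  long stateIdForReadRequest() {
    return lastSeenStateId.get();
  }
}

// Observer side: defer the read until the locally applied txid has caught up
// to the state the client has already seen, so the client never goes back in time.
class ObserverSide {
  private volatile long lastAppliedTxId;   // advanced as edits are tailed

  void waitUntilCaughtUp(long clientStateId) throws InterruptedException {
    while (lastAppliedTxId < clientStateId) {
      Thread.sleep(1);                     // real code re-queues the RPC instead
    }
  }
}
```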
Corner Case: Stale Reads
• Stale Read Cases
• Case 1: Multiple client instances
• DFSClient#1 writes to the ANN, DFSClient#2 reads from the SbNN
• DFSClient#2's state is older than DFSClient#1's, so its read is out of sync
• Case 2: Out-of-band communication
• Client#1 writes to the ANN, then informs Client#2
• Client#2 reads from the SbNN and does not see the write
[Diagrams: DFSClient#1 writes to the Active NameNode while DFSClient#2 reads from the Standby NameNode; shown for the read-your-own-writes and third-party-communication cases]
msync API
• Dealing with Stale Reads: FileSystem.msync()
• Syncs between existing client instances
• Forces the DFSClient to sync up to the most recent state of the ANN
• Multiple client instances: call msync on DFSClient#2
• Out-of-band communication: Client#2 calls msync before reading (see the example below)
• “Always msync” mode HDFS-14211
[Diagrams: DFSClient#1 writes to the Active NameNode while DFSClient#2 reads from the Standby NameNode; shown for the read-your-own-writes and third-party-communication cases]
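A short usage sketch of msync() for the out-of-band case. The path is a placeholder, and it assumes fs.defaultFS points at an HA HDFS cluster with Observer reads enabled.

```java
// Client#2 learns out of band that Client#1 created a file, and calls msync()
// so its next read through the Observer reflects that write.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MsyncExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Catch this client's view up to the Active NameNode's latest state
    // before reading from an Observer / Standby node.
    fs.msync();

    boolean exists = fs.exists(new Path("/data/events/part-00000"));
    System.out.println("exists after msync: " + exists);
  }
}
```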
Robustness Optimization: Standby Node Back-off
REDIRECT WHEN TOO FAR BEHIND
• If a Standby node's state is too far behind, the client may retry another node
• e.g., the Standby node's machine is running slow
• Standby Node Back-off
• 1: Upon receiving a request, if the Standby node finds itself too far behind the requested state, it rejects the request, throwing a retry exception
• 2: If a request has been in the queue for too long and the Standby has still not caught up, the Standby rejects the request, throwing a retry exception
• Client Retry
• Upon a retry exception, the client tries a different Standby node, or simply falls back to the ANN (a simplified sketch follows)
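A simplified sketch of the two back-off checks above. The class name, exception type, and thresholds are hypothetical; real HDFS signals the client through its own retriable exception so the proxy provider can move on to another node.

```java
// Illustrative back-off policy, not the actual HDFS classes or defaults.
class RetryToAnotherNodeException extends RuntimeException {}

class ObserverBackoff {
  private volatile long lastAppliedTxId;  // advanced as edits are tailed
  private final long maxLagTxns;          // reject immediately if lag exceeds this
  private final long maxWaitMillis;       // reject if queued longer than this

  ObserverBackoff(long maxLagTxns, long maxWaitMillis) {
    this.maxLagTxns = maxLagTxns;
    this.maxWaitMillis = maxWaitMillis;
  }

  // Check 1: on arrival, reject if the Observer is too far behind the client.
  void checkOnArrival(long clientStateId) {
    if (clientStateId - lastAppliedTxId > maxLagTxns) {
      throw new RetryToAnotherNodeException();
    }
  }

  // Check 2: while queued, reject if the wait is too long and we are still behind.
  void checkWhileQueued(long clientStateId, long enqueueTimeMillis) {
    boolean stillBehind = lastAppliedTxId < clientStateId;
    boolean waitedTooLong =
        System.currentTimeMillis() - enqueueTimeMillis > maxWaitMillis;
    if (stillBehind && waitedTooLong) {
      throw new RetryToAnotherNodeException();
    }
  }
}
```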
Configuration and Startup Process
• Configuring NameNodes
• Configure NameNodes via haadmin
• Observer mode is similar to Standby, but serves reads and does not perform checkpointing
• All NameNodes start as checkpointing Standby; a Standby can be transitioned to Active or Observer
• Configuring the Client
• Configure it to use ObserverReadProxyProvider (see the example below)
• If not, the client still works but only talks to the ANN
• ObserverReadProxyProvider discovers the state of all NNs
[Diagram: NameNode state transitions between Active, Checkpointing Standby, and Read-Serving Standby (Observer)]
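An illustrative client-side configuration sketch. The nameservice id "mycluster" is a placeholder, and the usual HA settings (dfs.nameservices, per-NameNode RPC addresses) are assumed to already be present in hdfs-site.xml; the NameNode roles themselves are switched with haadmin as described above.

```java
// Point the client's failover proxy provider at ObserverReadProxyProvider
// so reads can be served by Observer nodes.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ObserverReadClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider");

    // Reads issued through this FileSystem may go to Observers;
    // writes and msync() still go to the Active NameNode.
    FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
    System.out.println(fs.getFileStatus(new Path("/")));
  }
}
```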
Current Status
• Test and benchmark
• With YARN applications, e.g. TeraSort
• With HDFS benchmarks, e.g. DFSIO
• Run on a cluster with >100 nodes, with Kerberos and delegation tokens enabled
• Merged to trunk (3.3.0)
• Being backported to branch-2
• Active work on further improvement/optimization
• Has been running at Uber in production
Background
● Back in 2017, Uber’s HDFS clusters were in bad shape
○ Rapid growth in # of jobs accessing HDFS
○ Ingestion & ad-hoc jobs co-located on the same cluster
○ Lots of listing calls on very large directories (esp. Hudi)
● HDFS traffic composition: 96% reads, 4% writes
● Presto is very sensitive to HDFS latency
○ Occupies ~20% of HDFS traffic
○ Only reads from HDFS, no writes
Implementation & Timeline
● Implementation (compared to the open-source version)
○ No msync or fast edit log tailing
■ Only eventual consistency with max staleness of 10s
○ Observer was NOT eligible for NN failover
○ Batched edits loading to reduce NN lock time when tailing edits
● Timeline
○ 08/2017 - finished the POC and basic testing in dev clusters
○ 12/2017 - started collaborating with the HDFS open source community (e.g., LinkedIn, PayPal)
○ 12/2018 - fully rolled out to Presto in production
○ Took multiple retries along the process
■ Disable access time (dfs.namenode.accesstime.precision)
■ HDFS-13898, HDFS-13924
Impact
Compared to sending traffic to the active NameNode, the Observer NameNode improves the overall throughput by ~20% (roughly the same throughput from Presto), while RPC queue time has dropped ~30x.
Impact (cont.)
Presto listing-status call latency has dropped 8-10x after migrating to the Observer.
Next Steps
Three-Stage Scalability Plan
2X GROWTH / YEAR IN WORKLOADS AND SIZE
• Stage I. Consistent reads from standby
• Optimize for reads: 95% of all operations
• Consistent reading is a coordination problem
• Stage II. In-memory Partitioned Namespace
• Optimize write operations
• Eliminate NameNode’s global lock – fine-grained locking
• Stage III. Dynamically Distributed Namespace Service
• Linear scaling to accommodate increases in RPC load and metadata growth
HDFS-12943
NameNode Current State
NAMENODE’S GLOBAL LOCK – PERFORMANCE BOTTLENECK
• Three main data structures
• INodeMap: id -> INode
• BlocksMap: key -> BlockInfo
• DatanodeMap: don’t split
• GSet – an efficient HashMap
implementation
• Hash(key) -> Value
• Global lock to update INodes and blocks (see the sketch below)
[Diagram: NameNode (FSNamesystem) contains INodeMap – directory tree (GSet: Id -> INode), BlocksMap – block manager (GSet: Block -> BlockInfo), and DataNode Manager]
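For contrast with Stage II, a minimal sketch of the current single global-lock pattern (simplified; the real NameNode guards its GSet-backed maps inside FSNamesystem with the namespace lock):

```java
// Every read and every update serializes on one namespace-wide lock.
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class GlobalLockNamespace {
  private final Map<Long, String> inodeMap = new HashMap<>();   // id -> INode (simplified)
  private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();

  String getINode(long id) {
    fsLock.readLock().lock();           // every read takes the one global lock
    try {
      return inodeMap.get(id);
    } finally {
      fsLock.readLock().unlock();
    }
  }

  void putINode(long id, String inode) {
    fsLock.writeLock().lock();          // every update serializes on the same lock
    try {
      inodeMap.put(id, inode);
    } finally {
      fsLock.writeLock().unlock();
    }
  }
}
```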
Stage II. In-memory Partitioned Namespace
ELIMINATE NAMENODE’S GLOBAL LOCK
• PartitionedGSet: two-level mapping (see the sketch below)
1. RangeMap: keyRange -> GSet
2. RangeGSet: key -> INode
• Fine-grained locking
• Individual locks per range
• Different ranges are accessed in parallel
[Diagram: NameNode with INodeMap and BlocksMap each backed by a Partitioned GSet (GSet-1 … GSet-n), plus DataNode Manager]
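A hypothetical sketch of the two-level PartitionedGSet idea with per-range locks (simplified, not the actual HDFS data structure): a range map picks the partition for a key, and each partition locks independently, so operations on different key ranges proceed in parallel.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class PartitionedINodeMap {
  static final class Partition {
    final Map<Long, String> inodes = new HashMap<>();           // key -> INode (simplified)
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  }

  // Level 1: start-of-range -> partition; level 2: key -> INode inside it.
  private final TreeMap<Long, Partition> ranges = new TreeMap<>();

  PartitionedINodeMap(long maxKey, int numPartitions) {
    long step = Math.max(1, maxKey / numPartitions);
    for (long start = 0; start < maxKey; start += step) {
      ranges.put(start, new Partition());
    }
  }

  private Partition partitionFor(long key) {
    return ranges.floorEntry(key).getValue();
  }

  String get(long key) {
    Partition p = partitionFor(key);
    p.lock.readLock().lock();            // lock only this range, not the whole namespace
    try {
      return p.inodes.get(key);
    } finally {
      p.lock.readLock().unlock();
    }
  }

  void put(long key, String inode) {
    Partition p = partitionFor(key);
    p.lock.writeLock().lock();
    try {
      p.inodes.put(key, inode);
    } finally {
      p.lock.writeLock().unlock();
    }
  }
}
```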
Stage II. In-memory Partitioned Namespace
EARLY POC RESULTS
• PartitionedGSet: two level mapping
• LatchLock: swap RangeMap lock for GSet locks corresponding to inode keys
• Run NNThroughputBenchmark creating 10 million directories
• 30% throughput gain
• Large batches of edits
• Why not 100%?
• Key is inodeId – incremental number generator
• Contention on the last partition
• Expect MORE
Stage III. Dynamically Distributed Namespace
SCALABLE DATA AND METADATA
• Split NameNode state into
multiple servers based on ranges
• Each NameNode
• Serves a designated range of
INode keys
• Metadata in PartitionedGSet
• Can reassign certain
subranges to adjacent nodes
• Coordination Service (Ratis)
• Change ranges served by NNs
• Renames / moves, Quotas
[Diagram: NameNode 1 through NameNode n, each with its own INodeMap (Part-GSet), BlocksMap (Part-GSet), and DataNode Manager]
Thank You!
Konstantin V Shvachko, Sr. Staff Software Engineer @LinkedIn
Chen Liang, Senior Software Engineer @LinkedIn
Chao Sun, Software Engineer @Uber
Consistent Reads from Standby Node