Scaling Hadoop at LinkedIn
Konstantin V Shvachko
Sr. Staff Software Engineer
Zhe Zhang
Engineering Manager
Erik Krogen
Senior Software Engineer
Continuous Scalability as a Service
LINKEDIN RUNS ITS BIG DATA ANALYTICS ON HADOOP
1
More Members → More Social Activity → More Data & Analytics
Hadoop Infrastructure
ECOSYSTEM OF TOOLS FOR LINKEDIN ANALYTICS
• Hadoop Core
• Storage: HDFS
• Compute: YARN and MapReduce
• Azkaban – workflow scheduler
• Dali – data abstraction and access layer for Hadoop
• ETL: Gobblin
• Dr. Elephant – artificially intelligent, polite, but uncompromising bot
• Presto – distributed SQL engine for interactive analytics
• TensorFlow
2
Growth Spiral
THE INFRASTRUCTURE IS UNDER CONSTANT GROWTH PRESSURE
More Data → More Compute → More Nodes → More Data …
3
Cluster Growth 2015 - 2017
THE INFRASTRUCTURE IS UNDER CONSTANT GROWTH PRESSURE
4
[Chart: Space Used (PB), Objects (Millions), and Tasks (Millions/Day), Dec 2015 – Dec 2017]
HDFS Cluster
STANDARD HDFS ARCHITECTURE
• HDFS metadata is decoupled from data
• NameNode keeps the directory tree in RAM
• Thousands of DataNodes store data blocks
• HDFS clients request metadata from the active NameNode and stream data to/from DataNodes (a minimal client sketch follows below)
[Diagram: Active NameNode, Standby NameNode, JournalNodes, and DataNodes]
5
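To make the metadata/data split concrete, here is a minimal client-side sketch using the standard Hadoop FileSystem API: create() and open() go to the active NameNode for metadata, while the returned streams carry block data to and from DataNodes. The fs.defaultFS URI below is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:9000"); // placeholder cluster URI
    try (FileSystem fs = FileSystem.get(conf)) {
      Path file = new Path("/tmp/hello.txt");
      // Metadata (file creation, block allocation) is handled by the active NameNode;
      // the stream writes data directly to a DataNode pipeline.
      try (FSDataOutputStream out = fs.create(file, true)) {
        out.writeUTF("hello hdfs");
      }
      // Block locations come from the NameNode; bytes are streamed from DataNodes.
      try (FSDataInputStream in = fs.open(file)) {
        System.out.println(in.readUTF());
      }
    }
  }
}
```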
Cluster Heterogeneity & Balancing
Homogeneous Hardware
ELUSIVE STARTING POINT OF A CLUSTER
7
Heterogeneous Hardware
PERIODIC CLUSTER EXPANSION ADDS A VARIETY OF HARDWARE
8
Balancer
MAINTAINS EVEN DISTRIBUTION OF DATA AMONG DATANODES
• DataNodes should be filled uniformly by % used space
• Locality principle: more data => more tasks => more data generated
• Balancer iterates until the balancing target is achieved
• Each iteration moves some blocks from overutilized to underutilized nodes
• Highly multithreaded: spawns many dispatcher threads. In each thread (sketched after this slide):
• getBlocks() returns a list of blocks of total size S to move out of sourceDN
• Choose targetDN
• Schedule transfers from sourceDN to targetDN
9
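The iteration above can be pictured with the following sketch. It is illustrative only, not the actual org.apache.hadoop.hdfs.server.balancer code: getBlocks(), chooseTarget(), and scheduleMove() are hypothetical stand-ins for the corresponding Balancer and NameNode RPCs.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BalancerSketch {
  record Block(long id, long numBytes) {}
  record Node(String name) {}

  static final long BYTES_PER_SOURCE = 10L * 1024 * 1024 * 1024; // move ~10 GB per source per iteration

  public static void main(String[] args) {
    ExecutorService dispatchers = Executors.newFixedThreadPool(4);
    for (Node source : overUtilizedNodes()) {
      dispatchers.submit(() -> {
        // Ask the NameNode for a list of blocks of total size S stored on the source
        List<Block> blocks = getBlocks(source, BYTES_PER_SOURCE);
        for (Block b : blocks) {
          Node target = chooseTarget(b);     // an under-utilized DataNode
          scheduleMove(b, source, target);   // copy to target, then delete on source
        }
      });
    }
    dispatchers.shutdown();
  }

  // Stubs standing in for RPCs to the NameNode / DataNodes.
  static List<Node> overUtilizedNodes() { return List.of(new Node("dn-001")); }
  static List<Block> getBlocks(Node source, long size) { return List.of(new Block(1L, 128L << 20)); }
  static Node chooseTarget(Block b) { return new Node("dn-042"); }
  static void scheduleMove(Block b, Node from, Node to) {
    System.out.printf("move block %d (%d bytes) %s -> %s%n", b.id(), b.numBytes(), from.name(), to.name());
  }
}
```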
Balancer Optimizations
BALANCER STARTUP CAUSES JOB TIMEOUT
• Problem 1. At the start of each Balancer iteration threads hit NameNode at once
• Increase RPC CallQueue length => user jobs timeout
• Solution. Disperse Balancer calls at startup over a 10-second period, restricting Balancer RPCs to the NameNode to 20 calls per second (see the sketch after this slide)
• Problem 2. Inefficient block iterator in getBlocks()
• NameNode iterates blocks from a randomly selected startBlock index
• Scans all blocks from the beginning, instead of jumping to the startBlock position
• Fix: jump directly to startBlock – 4x reduction in exec time for getBlocks(), from 40ms down to 9ms
• Result: Balancer overhead on NameNode performance is negligible
HDFS-11384
HDFS-11634
10
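A toy illustration of the startup-dispersal idea behind HDFS-11384 (not the actual patch): each dispatcher thread sleeps a random delay within a 10-second window before its first NameNode call, so with 200 threads roughly 20 first calls land per second instead of all at once.

```java
import java.util.concurrent.ThreadLocalRandom;

public class DispersedStartup {
  static final long WINDOW_MS = 10_000; // spread first RPCs over a 10-second window

  public static void main(String[] args) {
    int threads = 200; // 200 dispatchers / 10 s ≈ 20 calls per second
    for (int i = 0; i < threads; i++) {
      new Thread(() -> {
        try {
          Thread.sleep(ThreadLocalRandom.current().nextLong(WINDOW_MS));
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        // The thread's first call to the NameNode (e.g. getBlocks) would go here.
        System.out.println(Thread.currentThread().getName() + " issuing first getBlocks()");
      }).start();
    }
  }
}
```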
Block Report Processing
POP QUIZ
What do Hadoop clusters and particle colliders have in common?
[Images: Hadoop vs. the Large Hadron Collider]
12
Improbable events happen all the time!
13
Optimization of Block Report Processing
DATANODES REPORT BLOCKS TO THE NAMENODE VIA BLOCK REPORTS
• DataNodes send periodic block reports to the NameNode (6 hours)
• Block report lists all the block replicas on the DataNode
• The list contains block id, generation stamp, and length for each replica (sketched after this slide)
• Found a rare race condition in processing block reports on the NameNode
• Race with repeated reports from the same node
• The error recovers itself on the next successful report after six hours
• HDFS-10301, fixed the bug and simplified the processing logic
• Block reports are expensive as they hold the global lock for a long time
• Designed segmented block report proposal HDFS-11313
14
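Conceptually a block report is a list of (block id, generation stamp, length) triples per DataNode, which the sketch below models with a plain record purely for illustration; the real report travels as an encoded long array rather than Java objects.

```java
import java.util.List;

public class BlockReportSketch {
  // One reported replica: block id, generation stamp, and length.
  record ReplicaEntry(long blockId, long generationStamp, long numBytes) {}

  public static void main(String[] args) {
    List<ReplicaEntry> report = List.of(
        new ReplicaEntry(1073741825L, 1001L, 134217728L), // a full 128 MB block
        new ReplicaEntry(1073741826L, 1002L, 4096L));     // a small file's only block
    // NameNode side: reconcile each reported replica with its blocks map entry.
    for (ReplicaEntry r : report) {
      System.out.printf("block %d gs=%d len=%d%n", r.blockId(), r.generationStamp(), r.numBytes());
    }
  }
}
```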
Cluster Versioning
Upgrade to Hadoop 2.7
COMMUNITY RELEASE PROCESS
• Motivation: chose 2.7 as the most stable branch compared to 2.8, 2.9, 3.0
• The team contributed the majority of commits to 2.7 since 2.7.3
• Release process
• Worked with the community to lead the release
• Apache Hadoop release v2.7.4 (Aug 2017)
• Testing, performance benchmarks
• Community testing: automated and manual testing tools
• Apache BigTop integration and testing
• Performance testing with Dynamometer
• Benchmarks: TestDFSIO, Slive, GridMix
16
Upgrade to Hadoop 2.7
INTEGRATION INTO LINKEDIN ENVIRONMENT
• Comprehensive testing plan
• Rolling upgrade
• Balancer
• OrgQueue
• Per-component testing
• Pig, Hive, Spark, Presto
• Azkaban, Dr. Elephant, Gobblin
• Production Jobs
17
• What went wrong?
• DDoS of InGraphs with new metrics
• The upgrade introduced new metrics turned ON by default
• Large scale was required to trigger the issue
Satellite Cluster Project
Small File Problem
“Keeping the mice away from the elephants”
19
Small File Problem
• A “small file” is smaller than one block
• Each file requires at least two objects: a block & an inode
• Small files bloat the memory usage of the NameNode & lead to numerous RPC calls to it
• Block-to-inode ratio is steadily decreasing; now at 1.11
• 90% of our files are small! (a small-file audit sketch follows below)
20
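As a rough way to reproduce the small-file measurement on any subtree, the sketch below walks a directory with the standard FileSystem API and counts files shorter than their own block size; the path argument is whatever subtree you want to audit.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class SmallFileAudit {
  public static void main(String[] args) throws Exception {
    try (FileSystem fs = FileSystem.get(new Configuration())) {
      long total = 0, small = 0;
      RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path(args[0]), true); // recursive listing
      while (it.hasNext()) {
        FileStatus st = it.next();
        total++;
        if (st.getLen() < st.getBlockSize()) {
          small++; // "small file": shorter than one block
        }
      }
      System.out.printf("small files: %d / %d (%.1f%%)%n", small, total, 100.0 * small / Math.max(total, 1));
    }
  }
}
```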
Satellite Cluster Project
IDENTIFY THE MICE: SYSTEM LOGS
• Realization: many of these small files were logs (YARN, MR, Spark…)
• Average size of log files: 100KB!
• Only accessed by framework/system daemons
• NodeManager
• MapReduce / Spark AppMaster
• MapReduce / Spark History Server
• Dr. Elephant
[Charts: logs account for 0.07% of data volume but 43.5% of objects]
21
Satellite Cluster Project
GIVE THE MICE THEIR OWN CLUSTER
• Two separate clusters, one with ~100x more nodes than the other
• Operational challenge: how to bootstrap? > 100M files to copy
• Write new logs to the new cluster
• Custom FileSystem presents a combined read view of both clusters’ files
• Log retention policy eventually deletes all files on the old cluster
* Additional details in appendix
22
Satellite Cluster Project
ARCHITECTURE
[Diagram: Primary Cluster and Satellite Cluster; log files are stored on the satellite]
• Bulk data transfer stays internal to the primary cluster
• The satellite cluster has the same namespace capacity (one active NameNode) but is much cheaper due to smaller data capacity (fewer DataNodes)
23
Testing: Dynamometer
Dynamometer
• Realistic performance benchmark & stress test for HDFS
• Open sourced on LinkedIn GitHub; hope to contribute to Apache
• Evaluate scalability limits
• Provide confidence before new feature/config deployment
25
Dynamometer
SIMULATED HDFS CLUSTER RUNS ON YARN
• Real NameNode, fake DataNodes to run on ~5% of the hardware
• Replay real traces from production cluster audit logs (see the replay sketch after this slide)
[Diagram: Dynamometer Driver and a Workload MapReduce Job of simulated clients run on a host YARN cluster, driving a real NameNode backed by simulated DataNodes]
26
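The replay idea can be sketched as follows: extract the cmd= and src= fields from each audit-log line and turn them back into client operations against the NameNode under test. The sample line is an approximation of the FSNamesystem audit-log format, and the dispatch step is left as a comment; the real logic lives in the Dynamometer workload job.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AuditReplaySketch {
  private static final Pattern CMD_AND_SRC = Pattern.compile("cmd=(\\S+)\\s+src=(\\S+)");

  public static void main(String[] args) {
    String line = "2018-01-01 00:00:00,000 INFO FSNamesystem.audit: allowed=true "
        + "ugi=analytics ip=/10.0.0.7 cmd=getfileinfo src=/data/events dst=null perm=null";
    Matcher m = CMD_AND_SRC.matcher(line);
    if (m.find()) {
      String cmd = m.group(1), src = m.group(2);
      // A simulated client would issue the matching call here,
      // e.g. fs.getFileStatus(new Path(src)) for "getfileinfo".
      System.out.println("replay " + cmd + " on " + src);
    }
  }
}
```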
Namespace Locking
Nonfair Locking
• NameNode uses a global read-write lock which supports two modes:
• Fair: locks are acquired in FIFO order (HDFS default)
• Nonfair: locks can be acquired out of order
• Nonfair locking discriminates against writes, but benefits reads via increased read parallelism (see the sketch after this slide)
• NameNode operations are > 95% reads
28
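The fair/nonfair choice is the constructor flag of java.util.concurrent.locks.ReentrantReadWriteLock, which the NameNode's global namespace lock is built on. A minimal sketch of the pattern:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

public class NamespaceLockSketch {
  private final ReentrantReadWriteLock lock;

  NamespaceLockSketch(boolean fair) {
    // true = fair, FIFO ordering (HDFS default); false = nonfair, readers may barge ahead of queued writers
    this.lock = new ReentrantReadWriteLock(fair);
  }

  <T> T readOp(Supplier<T> op) {          // e.g. getFileInfo, listStatus
    lock.readLock().lock();
    try { return op.get(); } finally { lock.readLock().unlock(); }
  }

  void writeOp(Runnable op) {             // e.g. create, delete, rename
    lock.writeLock().lock();
    try { op.run(); } finally { lock.writeLock().unlock(); }
  }

  public static void main(String[] args) {
    NamespaceLockSketch ns = new NamespaceLockSketch(false); // nonfair
    ns.writeOp(() -> System.out.println("create /tmp/f"));
    System.out.println(ns.readOp(() -> "getFileInfo /tmp/f"));
  }
}
```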
Dynamometer
EXAMPLE: PREDICTION OF FAIR VS NONFAIR NAMENODE LOCKING PERFORMANCE
[Chart: request wait time vs. requests per second, for Dynamometer and production under fair and nonfair locking]
Dynamometer predictions closely mirror observations post-deployment
29
Nonfair Locking
IN ACTION
[Chart: RPCQueueTimeAvgTime drops after nonfair locking is deployed]
30
Optimal Journaling
Optimal Journaling Device
BENCHMARK METHODOLOGY
• Persistent state of NameNode
• Latest checkpoint FSImage, taken at periodic 8-hour intervals
• Journal EditsLog – latest updates to the namespace
• Journal IO workload (see the micro-benchmark sketch after this slide)
• Sequential writes of 1KB chunks, no reads
• Flush and sync to disk after every write
• NNThroughputBenchmark tuned for efficient use of CPU
• Ensure throughput is bottlenecked mostly by IOs while NameNode is journaling
• Operations: mkdir, create, rename
32
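A stand-alone approximation of that journal write pattern (not NNThroughputBenchmark itself): append 1KB chunks sequentially, force each one to the device, and report synced writes per second. The file path argument is arbitrary.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class JournalIoBench {
  public static void main(String[] args) throws IOException {
    Path edits = Path.of(args.length > 0 ? args[0] : "edits.bench");
    ByteBuffer chunk = ByteBuffer.allocate(1024); // one 1KB "edit record"
    int writes = 10_000;
    long start = System.nanoTime();
    try (FileChannel ch = FileChannel.open(edits,
        StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
      for (int i = 0; i < writes; i++) {
        chunk.clear();
        ch.write(chunk);
        ch.force(false); // flush and sync after every write, like the NameNode journal
      }
    }
    double secs = (System.nanoTime() - start) / 1e9;
    System.out.printf("%d synced 1KB writes in %.2fs (%.0f ops/sec)%n", writes, secs, writes / secs);
  }
}
```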
Optimal Journaling Device
HARDWARE RAID CONTROLLER WITH MEMORY CACHE
33
[Chart: mkdirs, create, and rename ops/sec (0–25,000) for SATA-HD, SAS-HD, SSD, SW-RAID, and HW-RAID]
Optimal Journaling Device
SUMMARY
• SATA vs SAS vs SSD vs software RAID vs hardware RAID
• SAS is 15-25% better than SATA
• SSD is on par with SATA
• SW-RAID doesn’t improve performance compared to a single SATA drive
• HW-RAID provides a 2x performance gain vs SATA drives
34
Open Source Community
Open Source Community
KEEPING INTERNAL BRANCH IN SYNC WITH UPSTREAM
• The commit rule:
• Backport to all upstream branches before internal
• Ensures future upgrades will have all historical changes
• Release management: 2.7.4, 2.7.5, 2.7.6
• Open sourcing of OrgQueue with Hortonworks
• 2.9+ GPU support with Microsoft
• StandbyNode reads with Uber & PayPal
36
Next Steps
What’s Next?
2X GROWTH IN 2018 IS IMMINENT
• Stage I. Consistent reads from standby
• Optimize for reads: 95% of all operations
• Consistent reading is a coordination problem
• Stage II. Eliminate NameNode’s global lock
• Implement namespace as a KV-store
• Stage III. Partitioned namespace
• Linear scaling to accommodate increased RPC load
HDFS-12943
HDFS-10419
38
Consistent Reads from Standby Node
ARCHITECTURE
[Diagram: Active NameNode (read/write) and Standby NameNodes (read only), sharing JournalNodes and DataNodes]
• Stale Read Problem
• Standby Node syncs edits from the Active NameNode via the Quorum Journal Manager
• Standby state is always behind the Active
• Consistent Reads Requirements (see the sketch after this slide)
• Read your own writes
• Third-party communication
39
HDFS-12943
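A conceptual sketch of the read-your-own-writes requirement, with illustrative names only (the actual design is tracked in HDFS-12943): the client remembers the last transaction id acknowledged by the Active, and a Standby serves the read only once it has applied edits at least that far.

```java
import java.util.concurrent.atomic.AtomicLong;

public class ConsistentReadSketch {
  static class Client {
    long lastSeenTxId; // updated from every write acknowledged by the Active NameNode
  }

  static class StandbyNode {
    final AtomicLong appliedTxId = new AtomicLong(); // advanced as journal edits are tailed

    String read(String path, long requiredTxId) throws InterruptedException {
      while (appliedTxId.get() < requiredTxId) {
        Thread.sleep(1); // wait until the standby namespace has caught up to the client's writes
      }
      return "contents of " + path;
    }
  }

  public static void main(String[] args) throws InterruptedException {
    Client client = new Client();
    StandbyNode standby = new StandbyNode();
    client.lastSeenTxId = 42;                                // write acknowledged at txid 42
    new Thread(() -> standby.appliedTxId.set(100)).start();  // edit tailing catches up
    System.out.println(standby.read("/user/c/file", client.lastSeenTxId));
  }
}
```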
Thank You!
Konstantin V Shvachko, Sr. Staff Software Engineer
Zhe Zhang, Engineering Manager
Erik Krogen, Senior Software Engineer
40
Appendix
Satellite Cluster Project
IMPLEMENTATION
mapreduce.job.hdfs-servers =
hdfs://li-satellite-02.grid.linkedin.com:9000
mapreduce.jobhistory.intermediate-done-dir =
hdfs://li-satellite-02.grid.linkedin.com:9000/system/mr-history/intermediate
mapreduce.jobhistory.done-dir =
hdfs://li-satellite-02.grid.linkedin.com:9000/system/mr-history/finished
yarn.nodemanager.remote-app-log-dir =
hdfs://li-satellite-02.grid.linkedin.com:9000/system/app-logs
spark.eventLog.dir =
hdfs://li-satellite-02.grid.linkedin.com:9000/system/spark-history
• Two clusters li-satellite-01 (thousands of nodes), li-satellite-02 (32 nodes)
• FailoverFileSystem – transparent view of the /system directory during migration
• Access li-satellite-02 first; if the file is not there, go to li-satellite-01. listStatus() merges results from both (see the sketch after this slide)
• Configuration change for Hadoop-owned services/frameworks:
• NodeManager, MapReduce / Spark AppMaster & History Server, Azkaban, Dr. Elephant
42
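A rough sketch of the read path such a FailoverFileSystem could use (the real class is internal to LinkedIn, so the shape below is assumed): try li-satellite-02 first, fall back to li-satellite-01, and merge listStatus() results from both clusters.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FailoverReadView {
  private final FileSystem satellite; // li-satellite-02, where new logs are written
  private final FileSystem primary;   // li-satellite-01, holds old logs until retention removes them

  FailoverReadView(FileSystem satellite, FileSystem primary) {
    this.satellite = satellite;
    this.primary = primary;
  }

  FSDataInputStream open(Path p) throws IOException {
    try {
      return satellite.open(p);
    } catch (FileNotFoundException e) {
      return primary.open(p); // fall back to the old cluster
    }
  }

  FileStatus[] listStatus(Path p) throws IOException {
    Map<String, FileStatus> merged = new LinkedHashMap<>();
    for (FileSystem fs : new FileSystem[] {primary, satellite}) {
      try {
        for (FileStatus st : fs.listStatus(p)) {
          merged.put(st.getPath().getName(), st); // satellite entries override primary on name collisions
        }
      } catch (FileNotFoundException e) {
        // the directory may exist on only one of the clusters
      }
    }
    return merged.values().toArray(new FileStatus[0]);
  }
}
```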
Satellite Cluster Project
CHALLENGES
• Copy >100 million existing files (< 100TB) from li-satellite-01 to li-satellite-02
• The copy can take > 12 hours, and it saturated the NameNode before copying a single byte
• Solution: FailoverFileSystem created new history files on li-satellite-02
• Removed /system from li-satellite-01 after log retention period passed
• Very large block reports
• 32 DataNodes each holds 9 million block replicas (vs 200K on normal cluster)
• Takes forever to process on NameNode; long lock hold times
• Solution: Virtual partitioning of each drive into 10 storages
• With 6 drives, block report split into 60 storages per DataNode
• 150K blocks per storage – good report size
43