Apache Hadoop 0.22
and Other Versions
Konstantin V Shvachko
Principal Hadoop Architect, eBay
IBM Karmasphere Twitter
February – March, 2012
eBay Inc. confidential
Apache Hadoop Ecosystem
• Hadoop Core
– Common – communication and user facing APIs
– HDFS – distributed file system
– MapReduce – distributed computation framework
• Pig – dataflow language
• Hive – data warehouse, SQL
• Zookeeper – distributed coordination service
• HBase – columnar store
• Oozie – complex job workflow
• eBay Specific
– Cascading
– Lzo compression
2
Hadoop Versioning
• Straight line from 0.1 to 0.20
• Fanned out starting from 0.20.2
• Multiple distributions in 2010 based on 0.20
– Apache, Y, CDH, FB
– More today
• Focus on Apache Releases
– Release 0.20.2 2010-02-16
– Release 0.21.0 2010-08-13
– Release 0.20.203.0 2011-05-11 Security Stable
– Release 0.20.204.0 2011-09-05 Improvements
– Release 0.20.205.0 2011-10-17 HBase support
• Genealogy of elephants
3
4
Major Branches
• Hadoop 1.0.0 (security branch) 2011-12-27
– Rename of 0.20.205
– Beta
• Hadoop 0.22.0 2011-12-10
– Continuation of 0.21.0
– Beta
• Hadoop 0.23.0 2011-11-11
– Federation – static partitioning of HDFS namespace
– Yarn – new implementation of MapReduce
– Scalability
– Alpha
• 2011 – record number of major releases!
• No unifying release, containing all the good features
5
Hadoop 0.22 Branch
• Branched 2010-11-17
• Released 2011-12-10
• Many events in-between
• RM role – started in August 2011
• Stabilization
– Hadoop Platform team, eBay
– Many contributors from the community
6
Features HDFS - 0.22
• New implementation of file append
• HBase support with hflush and hsync
• Symbolic links
• BackupNode and CheckpointNode
• DataNodes tolerate single disk failure. Disk-fail-in-place
• File concatenation
• SLive test
• Sticky bit
• Offline Image Viewer
7
Features MapReduce - 0.22
• Hierarchical job queues
• Job limits per queue / pool
• Dynamically stop / start job queues
• Advances in new MapReduce API
– Input/Output formats, ChainMapper / ChainReducer
• TaskTracker blacklisting
• DistributedCache sharing
8
Features not Supported in Hadoop 0.22.0
Compared to Hadoop 1.0
• Security
– LinuxTaskController removed MAPREDUCE-2767
• Optimizations (operability) of the MapReduce framework
introduced in the Hadoop 0.20.security line of releases
– Limits on per-job JobConf, Counters, StatusReport, Split-Sizes
– User / queue limits on tasks / jobs in the CapacityScheduler
• Disk-fail-in-place – MapReduce part
• JMX-based metrics v2
• Jetty workaround
• CapacityScheduler should assign multiple tasks per heartbeat
• User's task logs filling up local disks on the TaskTrackers
• FairScheduler back-port from trunk
9
Not in Hadoop 0.22.0 HDFS Part
• Short-circuit local reads: a local client reads DataNode files directly
– Important HBase optimization
– Porting is in progress
• WebHDFS: accessing HDFS over HTTP
– New experimental feature, back-ported from trunk
• NameNode startup time
– Handling block reports and missed heartbeats from DataNodes
– The rest is forward ported from 1.0
– More startup improvements in 0.22
10
Hadoop 0.23 Features
• HDFS Federation
– Independent NameNodes sharing a common pool of DataNodes
– Cluster is a family of volumes with shared block storage layer
– User sees volumes as isolated file systems
– ViewFS: the client-side mount table
– Federated approach provides a static partitioning of the federated namespace
• Yarn: Scalability for MapReduce framework
– Separation of JobTracker functions
1. Job scheduling and resource allocation:
• Fundamentally centralized
2. Job monitoring and job life-cycle coordination
• Delegate coordination of different jobs to other nodes
– Dynamic partitioning of cluster resources: no fixed slots
• “Apache Hadoop: The scalability update” USENIX ;login: June, 2011
11
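The client-side mount table idea behind ViewFS can be illustrated with a small toy model. This is only a sketch under stated assumptions: the mount points, host names, and the `resolve` function are hypothetical and are not the ViewFS API. It shows how a client could map a path in its own view onto one of several independent NameNodes.

```python
# Toy model of a client-side mount table. All mount points and URIs below
# are made-up examples, not real cluster addresses or ViewFS configuration.
MOUNT_TABLE = {
    "/user": "hdfs://nn1.example.com:8020/user",
    "/data": "hdfs://nn2.example.com:8020/data",
    "/tmp": "hdfs://nn3.example.com:8020/tmp",
}


def resolve(path, table=MOUNT_TABLE):
    """Map a client-visible path to a target URI on one NameNode.

    A mount point matches if the path equals it or lies underneath it.
    If several match, the longest mount point wins.
    """
    best = None
    for mount in table:
        if path == mount or path.startswith(mount + "/"):
            if best is None or len(mount) > len(best):
                best = mount
    if best is None:
        raise KeyError(f"no mount point covers {path}")
    # Keep the part of the path below the mount point unchanged.
    return table[best] + path[len(best):]


if __name__ == "__main__":
    print(resolve("/user/alice/report.txt"))
    print(resolve("/data/2012/02/events.log"))
```

Because the table is fixed on the client, moving a directory between volumes means changing the table, which is one way to read the "static partitioning" remark above.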
Append and HBase
• Append means
– Reopening of existing files for appending new data
– Replica synchronization after failure
– Consistent view of file data during writing by different clients
– hflush, hsync – guarantee data delivered to DNs and persisted on NN
• First implementation of append in 0.19 HADOOP-1700
– 0.20-append branch
• Redesign of append in 0.21 HDFS-265
• HBase needs hflush and hsync only
• Hadoop 1.0 - HBase support via hflush, hsync
• Hadoop 0.22 – fully functional append, including HBase support
12
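The difference between writing data and making it visible can be sketched with an in-memory toy model. Everything here is an assumption for illustration: `ToyReplica`, `ToyWriter`, and `visible_length` are hypothetical names, and the model ignores pipelines, failures, and replica recovery. It only shows that bytes buffered by a writer become readable after an hflush-style call, and that a file can be reopened to append more data.

```python
class ToyReplica:
    """Stands in for one DataNode's copy of a file (illustrative only)."""

    def __init__(self):
        self.data = bytearray()


class ToyWriter:
    """Buffers bytes on the client; hflush pushes them to every replica."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.buffer = bytearray()

    def write(self, payload: bytes):
        # Data stays in the client buffer until it is flushed.
        self.buffer.extend(payload)

    def hflush(self):
        # Deliver buffered bytes to all replicas, then empty the buffer.
        for replica in self.replicas:
            replica.data.extend(self.buffer)
        self.buffer.clear()


def visible_length(replicas):
    """A reader in this model sees only bytes present on every replica."""
    return min(len(r.data) for r in replicas)


if __name__ == "__main__":
    replicas = [ToyReplica() for _ in range(3)]
    writer = ToyWriter(replicas)
    writer.write(b"row-1")
    print(visible_length(replicas))  # nothing flushed yet
    writer.hflush()
    print(visible_length(replicas))  # the flushed bytes are now visible

    # "Append" in this model: a new writer reopens the same replicas.
    appender = ToyWriter(replicas)
    appender.write(b"row-2")
    appender.hflush()
    print(visible_length(replicas))
```

A write-ahead log, as used by HBase, relies on the flush step rather than on reopening files, which matches the note above that HBase needs hflush and hsync only.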
BackupNode
• BackupNode – a read-only NameNode
– Contains all file system metadata: files and directories
excluding block locations
– Can perform NameNode operations that don’t modify namespace
• BN maintains up-to-date in-memory image of file system namespace
always synchronized with the NameNode state
– NameNode streams journal to BackupNode
• BackupNode can create a checkpoint without downloading
checkpoint and journal files from active NameNode
• Intended to evolve into hot HA HDFS-2064
13
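The journal-streaming relationship described above can be sketched as a toy model. The classes `ToyNamespace`, `ToyNameNode`, and `ToyBackupNode` are hypothetical and do not mirror the HDFS implementation; the namespace is reduced to a set of paths and the journal to create/delete operations. The point is that a node which applies every streamed edit can produce a checkpoint from its own memory.

```python
class ToyNamespace:
    """A namespace reduced to a set of paths (illustrative only)."""

    def __init__(self):
        self.entries = set()

    def apply(self, op, path):
        if op == "create":
            self.entries.add(path)
        elif op == "delete":
            self.entries.discard(path)
        else:
            raise ValueError(f"unknown journal operation: {op}")


class ToyBackupNode:
    """Applies every streamed edit to its own in-memory namespace."""

    def __init__(self):
        self.namespace = ToyNamespace()

    def receive(self, op, path):
        self.namespace.apply(op, path)

    def checkpoint(self):
        # Built from local memory; nothing is fetched from the NameNode.
        return sorted(self.namespace.entries)


class ToyNameNode:
    """Applies an edit locally and streams the same edit to the backup."""

    def __init__(self, backup):
        self.namespace = ToyNamespace()
        self.backup = backup

    def edit(self, op, path):
        self.namespace.apply(op, path)
        self.backup.receive(op, path)


if __name__ == "__main__":
    backup = ToyBackupNode()
    active = ToyNameNode(backup)
    active.edit("create", "/user/alice")
    active.edit("create", "/tmp/scratch")
    active.edit("delete", "/tmp/scratch")
    print(backup.checkpoint())
```

In this model the backup's state equals the active node's state after every edit, which is the property that makes it a candidate starting point for a hot standby.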
Hadoop at eBay
• 2011 started with 532-node 5 PB cluster running CDH2
• eBay 0.20.203-based build (Wilma)
– Hadoop 0.20.203 – latest stable Apache release
• HDFS, MapReduce, Pig, Hive, Cascading, Mobius, lzo
– 500+ users; 2000 jobs per day
• Runs on 1000-node cluster
– 24 PB – capacity, 72 GB RAM / node
• Many smaller clusters
• Stabilization of Hadoop platform based on 0.22
14
Testing
• One year of testing by different groups in Hadoop ecosystem
• Extensive testing of append by HBase community
• Fully automated build and certification with BigTop
• Hadoop platform team at eBay
– Extensive stabilization effort starting September
– Most bugs found in 0.22 are also in trunk and 0.23
– All new features tested
– Stress testing
– Reliability testing
• Works with:
– Pig 0.8, Hive 0.7, HBase 0.92, Oozie – custom changes, open sourced
– Zookeeper, Cascading – no changes needed
15
Testing Tools, Examples
• TeraSort, TestDFSIO, DistCp
• GridMix, Rumen – production job traces
• SLive – adjustable mix of HDFS operations, permanent load
• Upgrade / rollback from 0.20.? and 0.20.203 to 0.22
• Oversubscribed cluster running out of memory
• Losing racks with running jobs and HBase
– Cluster survived consecutive loss of 4 racks, shrinking to single rack
with HBase still alive and MR jobs completing
• Disk-fail-in-place helps identify bad drives during hardware burn-in
16
Benchmarking
• TestDFSIO: 10 GB files (same as 100 GB)
• TeraSort: -5% (scheduler to blame)
• YCSB - same
• Internal eBay applications – same or better
• Lots of tuning: Hadoop, Java, OS, HW
– Gradual improvement of results
Throughput (MB/sec)   Read   Write   Append
Hadoop-0.22            100      84       83
0.20 breed              96      66      n/a
17
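The throughput figures in the table can be turned into relative differences with a little arithmetic. The sketch below uses only the numbers from the table; the dictionary layout and the `relative_change` helper are illustrative choices, not part of any benchmark tool.

```python
# Throughput in MB/sec, copied from the table above.
THROUGHPUT = {
    "Hadoop-0.22": {"read": 100, "write": 84},
    "0.20 breed": {"read": 96, "write": 66},
}


def relative_change(new, old):
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100


if __name__ == "__main__":
    for op in ("read", "write"):
        new = THROUGHPUT["Hadoop-0.22"][op]
        old = THROUGHPUT["0.20 breed"][op]
        print(f"{op}: {relative_change(new, old):+.1f}%")
    # read:  (100 - 96) / 96 is roughly +4.2%
    # write: (84 - 66) / 66 is roughly +27.3%
```

Append has no baseline in the 0.20 row, so no relative figure can be computed for it.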
Good to have for 0.22.1
• Restore Security
• Disk Fail in place for MapReduce
• Optimizations
– Multiple tasks per heartbeat for CapacityScheduler
– CapacityScheduler preemption
• MR job and task limits
• Cluster startup time
• Add HA?
• Merge MR-1.0 into Hadoop 0.22?
18
Important
• Works but not 0.20
– Good new features
– Reliability is the first concern
– Performance and missing functionality can be reconstructed
• Community release
– Not distributed / advertised by commercial distributors
– Community involvement important
• Don’t try to upgrade from Hadoop 0.21 to Hadoop 1.0
It’s the other way around
– Go to Hadoop 0.22 instead
• Forward-going release progress
– Stop porting new features, start releasing them
19
Thank you
20
Hadoop 0.22 Contributions Accepted

Contenu connexe

Tendances

Architecture of Hadoop
Architecture of HadoopArchitecture of Hadoop
Architecture of HadoopKnoldus Inc.
 
Meethadoop
MeethadoopMeethadoop
MeethadoopIIIT-H
 
Hadoop operations basic
Hadoop operations basicHadoop operations basic
Hadoop operations basicHafizur Rahman
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Simplilearn
 
Apache Hadoop YARN 3.x in Alibaba
Apache Hadoop YARN 3.x in AlibabaApache Hadoop YARN 3.x in Alibaba
Apache Hadoop YARN 3.x in AlibabaDataWorks Summit
 
Hadoop architecture meetup
Hadoop architecture meetupHadoop architecture meetup
Hadoop architecture meetupvmoorthy
 
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter SlidesJuly 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter Slidesryancox
 
Hug Hbase Presentation.
Hug Hbase Presentation.Hug Hbase Presentation.
Hug Hbase Presentation.Jack Levin
 
In-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great TasteIn-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great TasteDataWorks Summit
 
Ambari Meetup: NameNode HA
Ambari Meetup: NameNode HAAmbari Meetup: NameNode HA
Ambari Meetup: NameNode HAHortonworks
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Designsudhakara st
 
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage EfficiencyHDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage EfficiencyDataWorks Summit
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfsshrey mehrotra
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Junping Du
 

Tendances (20)

Architecture of Hadoop
Architecture of HadoopArchitecture of Hadoop
Architecture of Hadoop
 
HDFS Internals
HDFS InternalsHDFS Internals
HDFS Internals
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
 
Hadoop operations basic
Hadoop operations basicHadoop operations basic
Hadoop operations basic
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
 
Gfs vs hdfs
Gfs vs hdfsGfs vs hdfs
Gfs vs hdfs
 
Apache Hadoop YARN 3.x in Alibaba
Apache Hadoop YARN 3.x in AlibabaApache Hadoop YARN 3.x in Alibaba
Apache Hadoop YARN 3.x in Alibaba
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop architecture meetup
Hadoop architecture meetupHadoop architecture meetup
Hadoop architecture meetup
 
Hadoop architecture by ajay
Hadoop architecture by ajayHadoop architecture by ajay
Hadoop architecture by ajay
 
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter SlidesJuly 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
 
Hug Hbase Presentation.
Hug Hbase Presentation.Hug Hbase Presentation.
Hug Hbase Presentation.
 
In-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great TasteIn-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great Taste
 
Ambari Meetup: NameNode HA
Ambari Meetup: NameNode HAAmbari Meetup: NameNode HA
Ambari Meetup: NameNode HA
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage EfficiencyHDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
 
Cross-DC Fault-Tolerant ViewFileSystem @ Twitter
Cross-DC Fault-Tolerant ViewFileSystem @ TwitterCross-DC Fault-Tolerant ViewFileSystem @ Twitter
Cross-DC Fault-Tolerant ViewFileSystem @ Twitter
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
 

Similaire à Apache Hadoop 0.22 and Other Versions

Hadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and FutureHadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and FutureRyan Hennig
 
Hbase status quo apache-con europe - nov 2012
Hbase status quo   apache-con europe - nov 2012Hbase status quo   apache-con europe - nov 2012
Hbase status quo apache-con europe - nov 2012Chris Huang
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete informationbhargavi804095
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem pptsunera pathan
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 
hadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptxhadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptxraghavanand36
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop OverviewBrian Enochson
 
Geo-based content processing using hbase
Geo-based content processing using hbaseGeo-based content processing using hbase
Geo-based content processing using hbaseRavi Veeramachaneni
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Andrew Brust
 
Intro to HBase - Lars George
Intro to HBase - Lars GeorgeIntro to HBase - Lars George
Intro to HBase - Lars GeorgeJAX London
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big DataAndrew Brust
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloudgluent.
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsEsther Kundin
 

Similaire à Apache Hadoop 0.22 and Other Versions (20)

Hadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and FutureHadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and Future
 
Hbase status quo apache-con europe - nov 2012
Hbase status quo   apache-con europe - nov 2012Hbase status quo   apache-con europe - nov 2012
Hbase status quo apache-con europe - nov 2012
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete information
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
hadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptxhadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptx
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop Overview
 
Geo-based content processing using hbase
Geo-based content processing using hbaseGeo-based content processing using hbase
Geo-based content processing using hbase
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
 
Intro to HBase - Lars George
Intro to HBase - Lars GeorgeIntro to HBase - Lars George
Intro to HBase - Lars George
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
 

Plus de Konstantin V. Shvachko

HDFS for Geographically Distributed File System
HDFS for Geographically Distributed File SystemHDFS for Geographically Distributed File System
HDFS for Geographically Distributed File SystemKonstantin V. Shvachko
 
Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.Konstantin V. Shvachko
 
Coordinating Metadata Replication: Survival Strategy for Distributed Systems
Coordinating Metadata Replication: Survival Strategy for Distributed SystemsCoordinating Metadata Replication: Survival Strategy for Distributed Systems
Coordinating Metadata Replication: Survival Strategy for Distributed SystemsKonstantin V. Shvachko
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewKonstantin V. Shvachko
 

Plus de Konstantin V. Shvachko (6)

HDFS Selective Wire Encryption
HDFS Selective Wire EncryptionHDFS Selective Wire Encryption
HDFS Selective Wire Encryption
 
HDFS for Geographically Distributed File System
HDFS for Geographically Distributed File SystemHDFS for Geographically Distributed File System
HDFS for Geographically Distributed File System
 
Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.
 
Coordinating Metadata Replication: Survival Strategy for Distributed Systems
Coordinating Metadata Replication: Survival Strategy for Distributed SystemsCoordinating Metadata Replication: Survival Strategy for Distributed Systems
Coordinating Metadata Replication: Survival Strategy for Distributed Systems
 
HDFS Design Principles
HDFS Design PrinciplesHDFS Design Principles
HDFS Design Principles
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
 

Dernier

unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 

Dernier (20)

unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 

Apache Hadoop 0.22 and Other Versions

  • 1. Apache Hadoop 0.22 and Other Versions Konstantin V Shvachko Principal Hadoop Architect, eBay IBM Karmasphere Twitter February – March, 2012
  • 2. eBay Inc. confidential Apache Hadoop Ecosystem • Hadoop Core – Common – communication and user facing APIs – HDFS – distributed file system – MapReduce – distributed computation framework • Pig – dataflow language • Hive – data warehouse, SQL • Zookeeper – distributed coordination service • HBase – columnar store • Oozie – complex job workflow • eBay Specific – Cascading – Lzo compression 2
  • 3. eBay Inc. confidential Hadoop Versioning • Straight line from 0.1 to 0.20 • Fanned out starting from 0.20.2 • Multiple distributions in 2010 based on 0.20 – Apache, Y, CDH, FB – More today • Focus on Apache Releases – Release 0.20.2 2010-02-16 – Release 0.21.0 2010-08-13 – Release 0.20.203.0 2011-05-11 Security Stable – Release 0.20.204.0 2011-09-05 Improvements – Release 0.20.205.0 2011-10-17 HBase support • Genealogy of elephants 3
  • 5. eBay Inc. confidential Major Branches • Hadoop 1.0.0 (security branch) 2011-12-27 – Rename of 0.20.205 – Beta • Hadoop 0.22.0 2011-12-10 – Continuation of 0.21.0 – Beta • Hadoop 0.23.0 2011-11-11 – Fedaration – static partitioning of HDFS namespace – Yarn – new implementation of MapReduce – Scalability – Alpha • 2011 – record number of major releases! • No unifying release, containing all the good features 5
  • 6. eBay Inc. confidential Hadoop 0.22 Branch • Branched 2010-11-17 • Released 2011-12-10 • Many events in-between • RM role – started in August 2011 • Stabilization –Hadoop Platform team, eBay –Many contributors from the community 6
  • 7. eBay Inc. confidential Features HDFS - 0.22 • New implementation of file append • HBase support with hflush and hsync • Symbolic links • BackupNode and CheckpointNode • DataNodes tolerate single disk failure. Disk-fail-in-place • File concatenation • SLive test • Sticky bit • Offline Image Viewer 7
  • 8. Features MapReduce - 0.22
      • Hierarchical job queues
      • Job limits per queue / pool
      • Dynamically stop / start job queues
      • Advances in new MapReduce API
        – Input/Output formats, ChainMapper / ChainReducer
      • TaskTracker blacklisting
      • DistributedCache sharing
  • 9. Features not Supported in Hadoop 0.22.0 Compared to Hadoop 1.0
      • Security
        – LinuxTaskController removed (MAPREDUCE-2767)
      • Optimizations (operability) of the MapReduce framework introduced in the Hadoop 0.20.security line of releases
        – Limits on per-job JobConf, Counters, StatusReport, Split-Sizes
        – User / queue limits on tasks / jobs in the CapacityScheduler
      • Disk-fail-in-place – MapReduce part
      • JMX-based metrics v2
      • Jetty workaround
      • CapacityScheduler should assign multiple tasks per heartbeat
      • User's task logs filling up local disks on the TaskTrackers
      • FairScheduler back-port from trunk
  • 10. Not in Hadoop 0.22.0: HDFS Part
      • Short-circuit local reads: a local client reads the DataNode's files directly
        – Important HBase optimization
        – Porting is in progress
      • WebHDFS: accessing HDFS over HTTP
        – New experimental feature, back-ported from trunk
      • NameNode startup time
        – Handling block reports and missed heartbeats from DataNodes
        – The rest is forward ported from 1.0
        – More startup improvements in 0.22
  • 11. Hadoop 0.23 Features
      • HDFS Federation
        – Independent NameNodes sharing a common pool of DataNodes
        – Cluster is a family of volumes with shared block storage layer
        – User sees volumes as isolated file systems
        – ViewFS: the client-side mount table
        – Federated approach provides a static partitioning of the federated namespace
      • Yarn: Scalability for MapReduce framework
        – Separation of JobTracker functions
          1. Job scheduling and resource allocation
             • Fundamentally centralized
          2. Job monitoring and job life-cycle coordination
             • Delegate coordination of different jobs to other nodes
        – Dynamic partitioning of cluster resources: no fixed slots
      • “Apache Hadoop: The scalability update”, USENIX ;login: June, 2011
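The "client-side mount table" idea on slide 11 can be illustrated with a small longest-prefix lookup. This is only a sketch of the concept, not the ViewFS implementation: the mount points, the NameNode host names, and the `resolve` helper are all hypothetical, and real mount-table links can also rewrite the target path, which is omitted here.

```python
from typing import Dict, Tuple

# Hypothetical static mount table: mount point -> NameNode URI.
# Host names are invented for illustration only.
MOUNT_TABLE: Dict[str, str] = {
    "/user": "hdfs://nn1.example.com:8020",
    "/data": "hdfs://nn2.example.com:8020",
    "/data/logs": "hdfs://nn3.example.com:8020",
}


def resolve(path: str, table: Dict[str, str] = MOUNT_TABLE) -> Tuple[str, str]:
    """Return (namenode_uri, path) for the longest mount point covering path."""
    best = None
    for mount in table:
        # Match whole path components, so "/database" is not under "/data".
        if path == mount or path.startswith(mount.rstrip("/") + "/"):
            if best is None or len(mount) > len(best):
                best = mount
    if best is None:
        raise KeyError(f"no mount point covers {path}")
    return table[best], path


if __name__ == "__main__":
    for p in ("/user/alice/job.xml", "/data/logs/2012/01", "/data/tables/t1"):
        print(p, "->", resolve(p)[0])
```

Because the table is fixed on the client, moving a subtree to another NameNode means editing the table, which is the sense in which the partitioning is static.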
  • 12. Append and HBase
      • Append means
        – Reopening of existing files for appending new data
        – Replica synchronization after failure
        – Consistent view of file data during writing by different clients
        – hflush, hsync – guarantee data delivered to DNs and persisted on NN
      • First implementation of append in 0.19 (HADOOP-1700)
        – 0.20-append branch
      • Redesign of append in 0.21 (HDFS-265)
      • HBase needs hflush and hsync only
      • Hadoop 1.0 – HBase support via hflush, hsync
      • Hadoop 0.22 – fully functional append, including HBase support
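The visibility guarantee behind hflush on slide 12 can be shown with a toy model. It assumes only the general contract that data written by a client becomes readable by others once hflush returns. `ToyFile`, `ToyWriter` and `read_all` are hypothetical names, and replication, the write pipeline, and hsync are deliberately left out.

```python
class ToyFile:
    """Stand-in for the bytes of a file that readers can currently see."""

    def __init__(self) -> None:
        self.visible = bytearray()


class ToyWriter:
    """Toy writer that buffers on the client until hflush is called."""

    def __init__(self, target: ToyFile) -> None:
        self.target = target
        self.buffer = bytearray()  # client-side buffer, invisible to readers

    def write(self, data: bytes) -> None:
        self.buffer += data

    def hflush(self) -> None:
        # Push buffered bytes out so that readers can see them.
        self.target.visible += self.buffer
        self.buffer.clear()


def read_all(f: ToyFile) -> bytes:
    """What a reader opening the file right now would get."""
    return bytes(f.visible)


if __name__ == "__main__":
    f = ToyFile()
    w = ToyWriter(f)
    w.write(b"edit-1;")
    print(read_all(f))  # nothing visible yet
    w.hflush()
    print(read_all(f))  # the flushed edit is now visible
```

In this model, an edit that was written but never flushed is lost if the writer dies, which is why a log-structured client such as HBase cares about the flush call rather than about reopening files for append.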
  • 13. BackupNode
      • BackupNode is a read-only NameNode
        – Contains all file system metadata: files and directories, excluding block locations
        – Can perform NameNode operations that don’t modify the namespace
      • BN maintains an up-to-date in-memory image of the file system namespace, always synchronized with the NameNode state
        – NameNode streams journal to BackupNode
      • BackupNode can create a checkpoint without downloading checkpoint and journal files from the active NameNode
      • Intended to evolve into hot HA (HDFS-2064)
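The journal-streaming relationship on slide 13 can be sketched as a toy model: every namespace change is applied on the NameNode and forwarded to the BackupNode, which can then write a checkpoint from its own memory. The class names mirror the slide, but the operations (`mkdir`, `create`, `delete`), the record format, and the JSON checkpoint are assumptions made for illustration, not HDFS internals.

```python
import json
from typing import Dict, List, Tuple

Op = Tuple[str, str]  # hypothetical journal record: (operation, path)


def apply_op(namespace: Dict[str, str], op: Op) -> None:
    """Apply one journal record to an in-memory namespace image."""
    kind, path = op
    if kind == "mkdir":
        namespace[path] = "dir"
    elif kind == "create":
        namespace[path] = "file"
    elif kind == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation {kind!r}")


class BackupNode:
    def __init__(self) -> None:
        self.namespace: Dict[str, str] = {}

    def receive(self, op: Op) -> None:
        # Replay the streamed record so the image tracks the NameNode.
        apply_op(self.namespace, op)

    def checkpoint(self) -> str:
        # Serialize the local image; nothing is fetched from the NameNode.
        return json.dumps(self.namespace, sort_keys=True)


class NameNode:
    def __init__(self, backup: BackupNode) -> None:
        self.namespace: Dict[str, str] = {}
        self.journal: List[Op] = []
        self.backup = backup

    def execute(self, op: Op) -> None:
        apply_op(self.namespace, op)
        self.journal.append(op)
        self.backup.receive(op)  # stream the journal record to the BackupNode


if __name__ == "__main__":
    bn = BackupNode()
    nn = NameNode(bn)
    for op in [("mkdir", "/a"), ("create", "/a/f1"), ("delete", "/a/f1")]:
        nn.execute(op)
    print(bn.checkpoint())
```

The point of the design is visible in `checkpoint`: because the BackupNode already holds a synchronized image, producing a checkpoint requires no transfer of image or journal files.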
  • 14. Hadoop at eBay
      • 2011 started with 532-node 5 PB cluster running CDH2
      • eBay 0.20.203-based build (Wilma)
        – Hadoop 0.20.203 – latest stable Apache release
      • HDFS, MapReduce, Pig, Hive, Cascading, Mobius, lzo
        – 500+ users; 2000 jobs per day
      • Runs on 1000-node cluster
        – 24 PB capacity, 72 GB RAM / node
      • Many smaller clusters
      • Stabilization of Hadoop platform based on 0.22
  • 15. Testing
      • One year of testing by different groups in Hadoop ecosystem
      • Extensive testing of append by HBase community
      • Fully automated build and certification with BigTop
      • Hadoop platform team at eBay
        – Extensive stabilization effort starting September
        – Most bugs found in 0.22 are also in trunk and 0.23
        – All new features tested
        – Stress testing
        – Reliability testing
      • Works with:
        – Pig 0.8, Hive 0.7, custom changes
        – HBase 0.92, Oozie, open sourced
        – Zookeeper, Cascading, no changes needed
  • 16. Testing Tools, Examples
      • TeraSort, TestDFSIO, DistCp
      • GridMix, Rumen – production job traces
      • SLive – adjustable mix of HDFS operations, permanent load
      • Upgrade / rollback from 0.20.? and 0.20.203 to 0.22
      • Oversubscribed cluster running out of memory
      • Losing racks with running jobs and HBase
        – Cluster survived consecutive loss of 4 racks, shrinking to a single rack with HBase still alive and MR jobs completing
      • Disk-fail-in-place helps identify bad drives during hardware burn-in
  • 17. Benchmarking
      • TestDFSIO: 10 GB files (same as 100 GB)
      • TeraSort: -5% (scheduler to blame)
      • YCSB – same
      • Internal eBay applications – same or better
      • Lots of tuning: Hadoop, Java, OS, HW
        – Gradual improvement of results

      Throughput (MB/sec)   Read   Write   Append
      Hadoop-0.22           100    84      83
      0.20 breed            96     66      n/a
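The throughput figures on the benchmarking slide are easier to compare as relative changes. The short calculation below uses only the numbers from the slide; the dictionary layout and the `relative_change` helper are illustrative choices. It gives roughly +4% for reads and +27% for writes, while append has no 0.20 baseline to compare against.

```python
# Throughput in MB/sec, copied from the benchmarking slide.
THROUGHPUT = {
    "Hadoop-0.22": {"read": 100, "write": 84, "append": 83},
    "0.20 breed": {"read": 96, "write": 66, "append": None},  # append: n/a
}


def relative_change(new, old):
    """Percentage change of new over old, or None when a value is missing."""
    if new is None or old is None:
        return None
    return (new - old) / old * 100.0


changes = {
    op: relative_change(THROUGHPUT["Hadoop-0.22"][op], THROUGHPUT["0.20 breed"][op])
    for op in ("read", "write", "append")
}

if __name__ == "__main__":
    for op, pct in changes.items():
        print(op, "n/a" if pct is None else f"{pct:+.1f}%")
```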
  • 18. Good to have for 0.22.1
      • Restore Security
      • Disk-fail-in-place for MapReduce
      • Optimizations
        – Multiple tasks per heartbeat for CapacityScheduler
        – CapacityScheduler preemption
      • MR job and task limits
      • Cluster startup time
      • Add HA?
      • Merge MR-1.0 into Hadoop 0.22?
  • 19. Important
      • Works but not 0.20
        – Good new features
        – Reliability is the first concern
        – Performance and missing functionality can be reconstructed
      • Community release
        – Not distributed / advertised by commercial distributors
        – Community involvement important
      • Don’t try to upgrade from Hadoop 0.21 to Hadoop 1.0: it’s the other way around
        – Go to Hadoop 0.22 instead
      • Forward-going release progress
        – Stop porting new features, start releasing them
  • 20. Thank you
      Hadoop 0.22 Contributions Accepted