Next-Generation Hadoop
Operations
What’s ahead in the next 12 months
for Hadoop cluster administration

Andrew Ryan
AppOps Engineer
Feb 16, 2011
Agenda
1   Hadoop operations @Facebook: an overview
2   Existing operational best practices
3   The challenges ahead: new directions in Hadoop
4   Emerging operational best practices
5   Conclusions and next steps
Hadoop Operations @Facebook
▪    Lean staffing, fast moving, highly leveraged
▪    Basic oncall structure:
     ▪    Level 1: 24x7 sysadmin team (“SRO”) for whole site
     ▪    Level 2: 2 people (“AppOps”) trading 1-week oncall shifts
     ▪    Level 3: 4 different Hadoop dev subteams with 1-week rotations
     ▪    Plus oncalls from other adjunct teams: SiteOps for machine
          repairs, NetEng for network, etc.
▪    Every engineer @FB is issued a cell phone and is expected to be available in emergencies and whenever they change a production system or its code.
Operational gaps in Hadoop
Our best practices address all these gaps
▪    Hardware selection, preparation, and configuration
▪    Installation/packaging
▪    Upgrades
▪    Autostart/start/stop/restart/status as correct UNIX user
▪    Node-level application and system monitoring
▪    Cluster-level and job-level monitoring
▪    Integrated log viewing/tailing/grepping
▪    Fast, reliable, centrally logged cluster-level shell ( != slaves.sh)
Existing operational best practices (1)
Sysadmin
▪    All the stuff you would do for a large distributed system but especially…
▪    Failed/failing hardware is your biggest enemy. FIND IT AND FIX IT, OR GET IT OUT OF YOUR CLUSTERS! (the ‘excludes’ file is your friend; see the sketch below)
▪    Regularly run every possible diagnostic to safely scan for bad hardware
▪    Identify and remove “repeat offender” hardware
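
As a minimal sketch of the ‘excludes’ mechanism referenced above (this is stock Hadoop configuration, not Facebook's internal tooling; the file path and hostname are illustrative):

# hdfs-site.xml (and mapred-site.xml, same idea for the JobTracker):
#   <property><name>dfs.hosts.exclude</name><value>/etc/hadoop/conf/excludes</value></property>
#   <property><name>mapred.hosts.exclude</name><value>/etc/hadoop/conf/excludes</value></property>

# Add the suspect node to the exclude file, then have the NameNode re-read it
$ echo badnode042.example.com >> /etc/hadoop/conf/excludes
$ hadoop dfsadmin -refreshNodes     # NameNode begins decommissioning that DataNode
# (On versions that support it, `hadoop mradmin -refreshNodes` does the same for TaskTrackers)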

▪    Fail fast, recover quickly, small things add up in big clusters:
     ▪    RHEL/CentOS kickstart steals your disk space (1.5%-3%+ per disk)
     ▪    No swap + vm.panic_on_oom=1 + kernel.kdb=0 for “fast auto reboot on OOM” (see the sysctl sketch below)
     ▪    Never fsck ext3 data drives unless Hadoop says you have to
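
A sketch of the sysctl settings behind “fast auto reboot on OOM” (kernel.panic is not on the slide but is what actually triggers the reboot after a panic; 10 seconds is an arbitrary choice, and kernel.kdb applies only to KDB-enabled kernels):

# /etc/sysctl.conf -- apply with `sysctl -p`
vm.panic_on_oom = 1    # panic instead of letting the OOM killer pick victims
kernel.panic = 10      # assumed: auto-reboot 10 seconds after a panic
kernel.kdb = 0         # don't drop into the kernel debugger on panic (KDB kernels only)

# and run the node with no swap configured
$ swapoff -a           # plus remove any swap entries from /etc/fstab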
Sysadmin example
Identifying your “America’s Most Wanted” pays off
Existing operational best practices (2)
Tooling
▪    Maintain a central registry of clusters, nodes, and each node’s role in
     the cluster, integrated with your service/asset management platform
▪    Build centrally maintained tools to:
     ▪    Start/stop/restart/autostart daemons on hosts (hadoopctl)
     ▪    View/grep/tail daemon logs on hosts (hadooplog)
     ▪    Start/stop, or execute commands on entire clusters (clusterctl)
     ▪    Manage excludes files based on repair status (excluderator)
     ▪    Deploy any arbitrary version of software to clusters
     ▪    Monitor daemon health and collect statistics
Tooling example
Deploy & upgrade clusters
# Deploy an HDFS/MapReduce cluster pair: 2 to 4000 nodes via torrent
$ deploy-hadoop-release.py --clusterdeploy=DFS1,SILVER branch@rev
$ clusterctl restart DFS1 SILVER

# “Refresh deploy” on 10 clusters, and then restart just the datanodes
$ deploy-hadoop-release.py --poddeploy=DFSSCRIBE-ALL redeploy
$ clusterctl restart DFSSCRIBE-ALL:datanode
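
hadoopctl and clusterctl are internal Facebook tools, so the commands above are illustrative. A rough, hypothetical sketch of what a minimal hadoopctl-style node wrapper might do (the service account, install path, and daemon names are assumptions based on stock Hadoop 0.20-era scripts):

#!/bin/bash
# hadoopctl-style sketch: run a Hadoop daemon action as the correct UNIX user
# Usage: hadoopctl {start|stop|restart} {namenode|datanode|jobtracker|tasktracker}
ACTION="$1"; DAEMON="$2"
HADOOP_USER=hadoop                    # assumed service account
HADOOP_HOME=/usr/local/hadoop         # assumed install path

case "$ACTION" in
  start|stop)
    sudo -u "$HADOOP_USER" "$HADOOP_HOME/bin/hadoop-daemon.sh" "$ACTION" "$DAEMON" ;;
  restart)
    sudo -u "$HADOOP_USER" "$HADOOP_HOME/bin/hadoop-daemon.sh" stop  "$DAEMON"
    sudo -u "$HADOOP_USER" "$HADOOP_HOME/bin/hadoop-daemon.sh" start "$DAEMON" ;;
  *)
    echo "Usage: $0 {start|stop|restart} <daemon>" >&2; exit 1 ;;
esac

A clusterctl-style tool would then read the host list and each node's role from the central registry and run a wrapper like this over ssh in parallel, logging results centrally.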
Existing operational best practices (3)
Process
▪    Document everything
▪    Segregate different classes of users on different clusters, with
     appropriate service levels and capacities
▪    Graph user-visible metrics like HDFS and job latency (a canary sketch follows this list)
▪    Build “least destructive” procedures for getting hardware back in
     service
▪    Developers and Ops should use the same procedures and tools
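
One generic way to produce a user-visible HDFS latency number to graph, as referenced in the list above (a simple canary, not Facebook's actual measurement pipeline; the paths and metric name are made up):

#!/bin/bash
# HDFS canary: time a small end-to-end write/read/delete and print a number to graph
CANARY=/tmp/hdfs-canary.$$            # assumed scratch location in HDFS
dd if=/dev/urandom of=/tmp/canary.local bs=1M count=1 2>/dev/null
start=$(date +%s)
hadoop fs -put /tmp/canary.local "$CANARY" &&
  hadoop fs -cat "$CANARY" > /dev/null &&
  hadoop fs -rm -skipTrash "$CANARY" > /dev/null
echo "hdfs_canary_latency_seconds=$(( $(date +%s) - start ))"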
Process example
Graphing our users’ experience on the cluster
A Hadoop cluster admin’s worst enemies
▪    The “X-Files”: machines which fail in strange ways, undetected by your
     monitoring systems
     ▪    Get your basics under control, then you’ll have more time for these
▪    “America’s Most Wanted”: machines which keep failing, again and
     again
     ▪    Our data: 1% of our machines accounted for 30% of our repair
          tickets
New directions for Hadoop
▪    HBase (Facebook Messages, real-time click logs)
▪    Zero-downtime upgrades (AvatarNode, rolling upgrades)
▪    “Megadatanodes” and Hadoop RAID
▪    HDFS as an “appliance”


     See also: http://www.facebook.com/notes/facebook-engineering/looking-at-the-code-behind-our-three-uses-of-apache-hadoop/468211193919
HBase and Hadoop
▪    Very new technology with emerging operational characteristics
▪    Applications using HBase are also new, with their own usage quirks
▪    Aiming for a large number of small clusters (~100 nodes)
▪    Slow/dead nodes are a big problem: these are real-time, user-facing
▪    Region failover is slow; no speculative execution
▪    Full-downtime restarts must be avoided


View the Messages tech talk here: http://fb.me/95OQ8YaD2rkb3r
Zero-downtime upgrades
▪    HDFS upgrades mean 1-2 hours of downtime
▪    JobTracker upgrades are quick (5 min), but kill all currently running jobs
▪    Rolling upgrades work today, but are too slow for large clusters
▪    Must be able to be both strict and lenient about multiple versions of client and server software installed and running in the cluster
“Megadatanodes” and Hadoop RAID
▪    Storage requirements continue to increase rapidly, as do CPU and RAM
▪    9X increase in datanode density from 2009-2011 (4TB -> 36TB)
▪    Hadoop RAID with XOR and Reed-Solomon brings tremendous cost savings along with management challenges (an illustrative overhead calculation follows this list):
     ▪    Losing one node is a big deal (200k-600k blocks/node?). A rack? Ouch!
     ▪    Tools and admin capabilities are not ready yet
     ▪    Tools and admin capabilities are not ready yet
▪    Will HDFS administration in 2012 be “like administering a cluster of 4,000 NetApps”?
▪    Host/rack level network will be a bottleneck
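
To make the cost savings above concrete, an illustrative effective-storage calculation (the stripe and replication parameters are assumptions in line with commonly described HDFS RAID setups, not figures from this talk):

# Storage cost per logical byte, 10-block stripes (illustrative):
#   Plain HDFS, 3 replicas:                                   30/10 = 3.0x
#   XOR RAID, data + 1 parity block each kept at 2 replicas:  22/10 = 2.2x
#   Reed-Solomon (10,4), single replica of data and parity:   14/10 = 1.4x
$ echo "scale=1; (10 + 4) / 10" | bc    # -> 1.4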
HDFS as an “appliance”
▪    Use HDFS cluster instead of commercial storage appliance
     ▪    Requires commercial-grade support & features
     ▪    Must be price-competitive




Emerging operational best practices
▪    More careful selection of hardware and network designs to
     accommodate new uses of Hadoop
▪    Find and deal with slowness at a node/rack/segment level
▪    Auto-healing at a finer granularity than “reboot” or “restart”
▪    Node-level version detection and installation
▪    Rolling, zero-downtime upgrades (AvatarNode + new JobTracker)


…and do all this without making Hadoop any harder to set up and run
Next steps
▪    Are we trying to do too much?
     ▪    Facebook needs an enormous data warehouse
     ▪    Facebook needs a large distributed filesystem
     ▪    Facebook needs a database alternative to MySQL
     ▪    Facebook is always looking to spend less money
     ▪    …and all that other stuff too
▪    Failure is not an option
▪    Never a dull moment!
(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0
