High Availability Hadoop

Steve Loughran
 @steveloughran

Sanjay Radia
 @srr

© Hortonworks Inc. 2012
Questions Hadoop users ask

• Can Hadoop keep my data safe?

• Can Hadoop keep my data available?

• What happens when things go wrong?

• What about applications that use Hadoop?

• What's next?


Hadoop HDFS: replication is the key
[Diagram: a core switch feeds three top-of-rack (ToR) switches. One rack holds the NameNode, the Secondary NameNode and the (JobTracker); the other two hold DataNodes. The NameNode maps a file to its blocks (block1, block2, block3, …).]

Replication handles data integrity
• CRC32 checksum per 512 bytes
• Verified across datanodes on write
• Verified on all reads
• Background verification of all blocks (~weekly)
• Corrupt blocks re-replicated
• All replicas corrupt → operations team
  intervention
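
The per-chunk checksum idea can be sketched in a few lines. This is only an illustration using Python's standard zlib CRC-32, not HDFS's on-disk checksum format; the function names are hypothetical, and only the 512-byte chunk size comes from the slide.

```python
import zlib
from typing import List

BYTES_PER_CHECKSUM = 512  # chunk size quoted on the slide


def chunk_checksums(data: bytes, chunk_size: int = BYTES_PER_CHECKSUM) -> List[int]:
    """Return one CRC32 value per chunk_size-byte chunk of data."""
    return [
        zlib.crc32(data[offset:offset + chunk_size])
        for offset in range(0, len(data), chunk_size)
    ]


def corrupt_chunks(data: bytes, expected: List[int],
                   chunk_size: int = BYTES_PER_CHECKSUM) -> List[int]:
    """Return the indexes of chunks whose CRC32 no longer matches expected.

    Assumes data and expected describe a block of the same length.
    """
    actual = chunk_checksums(data, chunk_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
```

A reader that finds a non-empty result would treat that replica as corrupt and fetch the block from another DataNode.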

2009: Yahoo! lost 19 out of 329M blocks on 20K
servers – bugs now fixed
Rack/Switch failure
[Diagram: the cluster layout again (core switch, three ToR switches, NameNode, Secondary NameNode, JobTracker, DataNodes), here illustrating the loss of a rack or its ToR switch.]

Reducing the impact of a Rack Failure
• Is this a real threat?
  –Google: failures are rack-correlated
  –Microsoft: ToR switches fail the most
  –Facebook: router mis-configuration → cascade failure


• Reduce the impact
  –Dual bonded Ethernet with 2x Top of Rack Switch
  –10 GbE for more replication bandwidth
  –Bigger clusters
  –Facebook: sub-cluster block placement


NameNode failure rare but visible
  1. Try to reboot/restart
  2. Bring up new NameNode server
     -with same IP
     -or restart DataNodes

[Diagram: behind a ToR switch, the NameNode is reached at the NN IP and keeps its filesystem image and journal ("edit log") on shared storage. The Secondary NN receives the streamed journal and checkpoints the filesystem image.]

Yahoo!: 22 NameNode failures on 25 clusters in 18 months = .99999 availability

NameNode failure costs
• Manual intervention
• Time to bring up service by hand
• Remote applications receive errors and can fail.
• MapReduce layer can fail.
• Cost of spare server
• Cost of 7x24 ops team




What to improve

• Address costs of NameNode failure in Hadoop 1

• Add Live NN failover (HDFS 2.0)

• Eliminate shared storage (HDFS 2.x)

• Add resilience to the entire stack




Full Stack HA
add resilience to planned/unplanned outages of
layers underneath




Hadoop Full Stack HA Architecture

[Diagram: the worker nodes of the Hadoop cluster each run a DataNode (DN) and a TaskTracker (TT) with their tasks. Beneath them sits an HA cluster for the master daemons: two NN servers and a JT server. When the NN fails over between its servers, the JT goes into safe mode. Apps running outside the cluster and job-submitting clients are shown on either side.]

HA in Hadoop 1 (HDP1)
Use existing HA clustering technologies to add
cold failover of key manager services:
   VMware vSphere HA
   RedHat HA Linux




vSphere: VMs are managed




“Canary” monitoring
• In-VM monitor daemon for each service
  –NameNode: process, ports, http, DFS operations
  –JobTracker: process, ports, http, JT status
   + special handling of HDFS state on startup.
• Probe failures reported to vSphere
• Triggers VM kill and restart
• Differentiating a hung service from a GC pause is hard




Defining liveness
• kill -0 `cat <path to .pid file>`
• Port open <hostname, port>
• HTTP GET <URL, response code range>
• DFS list <filesystem, path>
• DFS out of safe mode (filesystem)
• JT cluster status <hostname, port>
• JT out of safe mode <hostname, port>
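
The first three probes need nothing beyond the operating system and can be sketched with the Python standard library. The function names and default timeouts are hypothetical; the DFS and JT probes are left out because they need a Hadoop client.

```python
import os
import socket
import urllib.error
import urllib.request


def pid_alive(pid_file: str) -> bool:
    """Equivalent of: kill -0 `cat <path to .pid file>`."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
        os.kill(pid, 0)  # signal 0 checks existence and permission, sends nothing
        return True
    except (OSError, ValueError):
        return False


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to <hostname, port> can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_ok(url: str, low: int = 200, high: int = 299, timeout: float = 2.0) -> bool:
    """True if an HTTP GET of url returns a status code within [low, high]."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return low <= resp.status <= high
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx; the code may still be in the accepted range
        return low <= err.code <= high
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

A service would be reported live only when every configured probe returns True.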




RedHat HA Linux
[Diagram: behind a ToR switch, four master servers with addresses IP1 to IP4 sit alongside the DataNodes. The service addresses are layered on top: NN IP appears on the NameNode servers at IP1 and IP2, 2NN IP on the Secondary NameNode at IP3, and JT IP on the (JobTracker) at IP4.]

    HA Linux: heartbeats & failover

Linux HA Implementation
• Replace init.d script with “Resource Agent” script
• Hook up status probes as for vSphere
• Detection & handling of hung processes is hard
• Test in virtual + physical environments
• Testing with physical clusters
• ARP trouble on cross-switch failover: don't




Yes, but does it work?

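public void testHungNN() {
  // The test passes only if the HA layer notices the hang and restarts HDFS.
  assertRestartsHDFS {
    // Signal 19 is SIGSTOP on Linux: the NameNode is suspended, not terminated.
    nnServer.kill(19,
      "/var/run/hadoop/hadoop-namenode.pid")
  }
}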

Groovy JUnit tests + “chaos” library

Need a home for this: Bigtop?
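
Why a hung NameNode is a harder test case than a dead one can be shown on any POSIX machine. The sketch below is an illustration only, not part of the “chaos” library: it suspends a throwaway child process with SIGSTOP and then runs the same check as `kill -0`. The function name is hypothetical.

```python
import os
import signal
import subprocess
import sys
import time


def stopped_process_passes_pid_check() -> bool:
    """Suspend a child with SIGSTOP and report whether kill -0 still succeeds."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    try:
        os.kill(child.pid, signal.SIGSTOP)  # hang the process without killing it
        time.sleep(0.2)
        try:
            os.kill(child.pid, 0)           # the "process exists" liveness probe
            return True
        except OSError:
            return False
    finally:
        child.kill()                        # SIGKILL still reaches a stopped process
        child.wait()
```

The function returns True: the pid probe cannot see the hang, which is why the monitors also exercise ports, HTTP and real DFS operations.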
And how long does it take?

60 nodes, 60K files, 6 million blocks, 300 TB raw storage: 1-3 minutes
   – Failure detection and failover: 0.5 to 2 minutes
   – NameNode startup (exit safe mode): 30 sec
180 nodes, 200K files, 18 million blocks, 900 TB raw storage: 2-4 minutes
   – Failure detection and failover: 0.5 to 2 minutes
   – NameNode startup (exit safe mode): 110 sec
vSphere: add 60s for OS bootup.
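
The totals can be checked with a little arithmetic, assuming the two phases run one after the other and simply add up (an assumption; the slide does not say how its rounded totals were derived). The function name is hypothetical.

```python
def failover_window_minutes(detect_s, startup_s, os_boot_s=0):
    """Return the (best, worst) total outage in minutes.

    detect_s is a (min, max) pair of seconds for failure detection and failover,
    startup_s is the NameNode startup time in seconds, and os_boot_s is any
    extra OS boot time in seconds.
    """
    low, high = detect_s
    return ((low + startup_s + os_boot_s) / 60,
            (high + startup_s + os_boot_s) / 60)


DETECT = (30, 120)  # 0.5 to 2 minutes, from the slide

small = failover_window_minutes(DETECT, startup_s=30)    # 60-node cluster
large = failover_window_minutes(DETECT, startup_s=110)   # 180-node cluster
small_vsphere = failover_window_minutes(DETECT, startup_s=30, os_boot_s=60)
```

This gives 1.0 to 2.5 minutes for the small cluster and roughly 2.3 to 3.8 minutes for the large one, in line with the rounded figures quoted.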




Cold Failover is good enough for small/medium clusters
IPC client resilience
Configurable retry & time to block
  ipc.client.connect.max.retries
  dfs.client.retry.policy.enabled


1. Set in core-site.xml for automatic pickup.

2. Failure-aware applications can tune/disable
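
The blocking behaviour can be pictured as a bounded retry loop around each remote call. This is a generic sketch, not Hadoop's IPC client: `call_with_retries`, `max_retries` and `delay_s` are hypothetical names and do not map one-to-one onto the properties above.

```python
import time


def call_with_retries(operation, max_retries=10, delay_s=1.0, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError up to max_retries times.

    The caller blocks for at most roughly max_retries * delay_s seconds
    before the last error is re-raised.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except ConnectionError:
            if attempt >= max_retries:
                raise  # give up: the outage outlasted the retry budget
            attempt += 1
            sleep(delay_s)
```

A failure-aware application would pass `max_retries=0` to see the error at once and handle the outage itself.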




                    Blocking works for most clients
Job Tracker Resilience
• “Safe Mode”
  –rejects new submissions
  –does not fail ongoing jobs or blacklist TaskTrackers
• FS monitoring and automatic safe mode entry:
  mapreduce.jt.hdfs.monitor.enable
• Queue rebuild on restart
  mapreduce.jt.hdfs.monitor.enable
• Job Submission Client to retry:
 mapreduce.jobclient.retry.policy.enabled
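
The safe-mode rules can be modelled in a few lines. This is a toy model with hypothetical names, not JobTracker code; it also assumes that safe mode is left automatically once HDFS is back, which the slide does not state.

```python
class JobTrackerModel:
    """Toy model: safe mode follows HDFS health; running jobs are left alone."""

    def __init__(self, hdfs_probe):
        self.hdfs_probe = hdfs_probe  # callable returning True when HDFS is usable
        self.safe_mode = False
        self.running = []

    def check_hdfs(self):
        """Enter safe mode when the probe fails; leave it when HDFS is back."""
        self.safe_mode = not self.hdfs_probe()
        return self.safe_mode

    def submit(self, job):
        """Accept a job unless in safe mode; jobs already running are untouched."""
        if self.safe_mode:
            raise RuntimeError("JobTracker in safe mode: submission rejected")
        self.running.append(job)
        return job
```

A submitting client with retries enabled would catch the rejection and try again until the JobTracker leaves safe mode.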




Testing full stack functionality
1. Run existing test suites against a cluster being
   killed repeatedly: MapReduce, HBase, Pig

2. Specific test jobs performing DFS operations
   from inside MapReduce jobs to stress MR layer

Results:
  NN outages did not cause failures
  Two unrelated bugs in HDFS were found


Putting it all together: Demo




HA in Hadoop HDFS 2




Hadoop 2.0 HA




[Diagram: three ZooKeeper nodes; two NameNodes, at IP1 and IP2, each paired with a Failure Controller; DataNodes. One NN is active and the other standby, and the labels swap between them on failover.]

Block reports to both NNs; Failure Controllers & ZooKeeper co-ordinate

Hadoop 2.1 HA




[Diagram: three ZooKeeper nodes; an active NN at IP1 and a standby NN at IP2, each paired with a Failure Controller; DataNodes. Three Journal Nodes sit between the two NNs.]

Quorum Journal Manager replaces shared storage
BookKeeper Journal Manager uses Apache BookKeeper
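
The quorum idea behind the Quorum Journal Manager is that an edit counts as written once a majority of the Journal Nodes have acknowledged it. The arithmetic is sketched below; the function names are hypothetical and the code models only the majority rule, not the journal protocol itself.

```python
def quorum_size(journal_nodes: int) -> int:
    """Smallest majority of journal_nodes."""
    return journal_nodes // 2 + 1


def edit_is_durable(acks: int, journal_nodes: int = 3) -> bool:
    """An edit is committed once a majority of Journal Nodes acknowledge it."""
    return acks >= quorum_size(journal_nodes)


def tolerated_failures(journal_nodes: int) -> int:
    """Journal Nodes that can be down while writes still reach a majority."""
    return journal_nodes - quorum_size(journal_nodes)
```

With the three Journal Nodes drawn on the slide, two acknowledgements are enough and one node can be lost; adding a fourth node does not raise the number of tolerated failures.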

When will HDFS 2 be ready?
CDH: shipping
Apache: beta-phase




Moving forward
• Retry policies for all remote client
  protocols/libraries in the stack.

• Monitor/restart for all services

• ZooKeeper service lookup everywhere

• YARN needs HA of Resource Manager



Single Points of Failure
There's always a SPOF

Q. How do you find it?

A. It finds you


Questions?




CFP for H Summit EU open!



Monitoring Lifecycle
States and transitions:

dependencies: block for upstream services, e.g. JT on HDFS
  –all dependencies live → booting
  –halt → halted
booting: wait for probes to go live; fail fast if a live probe then fails
  –all probes live → live
  –boot timeout, or failure of a live probe → failed
  –halt → halted
live: check all probes until halted or probe failure
  –probe failure or probe timeout → failed
  –halt → halted
failed: immediate process exit; 60s after heartbeats stop, vSphere restarts the VM
halted: unhook from vSphere
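
The lifecycle on this slide can be written down as a small transition table. The sketch below transcribes the states and events from the diagram; the names `State`, `TRANSITIONS`, `next_state` and `run` are hypothetical, and ignoring unknown events is an assumption.

```python
from enum import Enum


class State(Enum):
    DEPENDENCIES = "dependencies"
    BOOTING = "booting"
    LIVE = "live"
    FAILED = "failed"
    HALTED = "halted"


# (state, event) -> next state, transcribed from the diagram
TRANSITIONS = {
    (State.DEPENDENCIES, "all dependencies live"): State.BOOTING,
    (State.BOOTING, "all probes live"): State.LIVE,
    (State.BOOTING, "boot timeout"): State.FAILED,
    (State.BOOTING, "failure of live probe"): State.FAILED,
    (State.LIVE, "probe failure"): State.FAILED,
    (State.LIVE, "probe timeout"): State.FAILED,
}


def next_state(state, event):
    """Return the state reached from state on event; unknown events are ignored."""
    if event == "halt" and state in (State.DEPENDENCIES, State.BOOTING, State.LIVE):
        return State.HALTED
    return TRANSITIONS.get((state, event), state)


def run(events, state=State.DEPENDENCIES):
    """Feed a sequence of events through the state machine."""
    for event in events:
        state = next_state(state, event)
    return state
```

Reaching `State.FAILED` corresponds to the immediate process exit, after which vSphere restarts the VM once the heartbeats have stopped for 60s.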


vSphere versus RedHat HA
vSphere ideal for small clusters
  –one VM/service
  –fewer physical hosts than service VMs
  –Obvious choice for sites with vSphere HA

RedHat HA great for large clusters
  –turns a series of master nodes into a pool of servers
   with floating services & IP Addresses
  –downgrades “outages” to “transient events”





@Dissidentbot: dissent will be automated!
 
PUT is the new rename()
PUT is the new rename()PUT is the new rename()
PUT is the new rename()
 
Extreme Programming Deployed
Extreme Programming DeployedExtreme Programming Deployed
Extreme Programming Deployed
 
Testing
TestingTesting
Testing
 
I hate mocking
I hate mockingI hate mocking
I hate mocking
 
What does rename() do?
What does rename() do?What does rename() do?
What does rename() do?
 
Dancing Elephants: Working with Object Storage in Apache Spark and Hive
Dancing Elephants: Working with Object Storage in Apache Spark and HiveDancing Elephants: Working with Object Storage in Apache Spark and Hive
Dancing Elephants: Working with Object Storage in Apache Spark and Hive
 
Apache Spark and Object Stores —for London Spark User Group
Apache Spark and Object Stores —for London Spark User GroupApache Spark and Object Stores —for London Spark User Group
Apache Spark and Object Stores —for London Spark User Group
 
Hadoop, Hive, Spark and Object Stores
Hadoop, Hive, Spark and Object StoresHadoop, Hive, Spark and Object Stores
Hadoop, Hive, Spark and Object Stores
 
Hadoop and Kerberos: the Madness Beyond the Gate
Hadoop and Kerberos: the Madness Beyond the GateHadoop and Kerberos: the Madness Beyond the Gate
Hadoop and Kerberos: the Madness Beyond the Gate
 
Slider: Applications on YARN
Slider: Applications on YARNSlider: Applications on YARN
Slider: Applications on YARN
 
YARN Services
YARN ServicesYARN Services
YARN Services
 
Datacentre stack
Datacentre stackDatacentre stack
Datacentre stack
 
Overview of slider project
Overview of slider projectOverview of slider project
Overview of slider project
 
2014 01-02-patching-workflow
2014 01-02-patching-workflow2014 01-02-patching-workflow
2014 01-02-patching-workflow
 
2013 11-19-hoya-status
2013 11-19-hoya-status2013 11-19-hoya-status
2013 11-19-hoya-status
 

Dernier

How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 

Dernier (20)

How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 

HA Hadoop -ApacheCon talk

  • 1. High Availability Hadoop Steve Loughran @steveloughran Sanjay Radia @srr © Hortonworks Inc. 2012
  • 2. Questions Hadoop users ask
      • Can Hadoop keep my data safe?
      • Can Hadoop keep my data available?
      • What happens when things go wrong?
      • What about applications that use Hadoop?
      • What's next?
  • 3. Hadoop HDFS: replication is the key
      [Diagram: a switch above three ToR switches; NameNode, 2ary NameNode and (Job Tracker) under the first; DataNodes under the other two; a file shown as block1, block2, block3, …]
  • 4. Replication handles data integrity
      • CRC32 checksum per 512 bytes
      • Verified across datanodes on write
      • Verified on all reads
      • Background verification of all blocks (~weekly)
      • Corrupt blocks re-replicated
      • All replicas corrupt → operations team intervention
      2009: Yahoo! lost 19 out of 329M blocks on 20K servers; bugs now fixed
  • 5. Rack/Switch failure
      [Diagram: the same cluster topology as slide 3, illustrating a rack/switch failure]
  • 6. Reducing the impact of a Rack Failure
      • Is this a real threat?
          – Google: failures are rack-correlated
          – Microsoft: ToR switches fail the most
          – Facebook: router mis-configuration → cascade failure
      • Reduce the impact
          – Dual bonded Ethernet with 2x Top of Rack Switch
          – 10 GbE for more replication bandwidth
          – Bigger clusters
          – Facebook: sub-cluster block placement
  • 7. NameNode failure rare but visible
      1. Try to reboot/restart
      2. Bring up new NameNode server
          – with same IP
          – or restart DataNodes
      [Diagram: ToR switch, NameNode and 2ary NameNode, each labelled NN IP. Shared storage for filesystem image and journal ("edit log"). Secondary NN receives streamed journal and checkpoints filesystem image.]
      Yahoo!: 22 NameNode failures on 25 clusters in 18 months = .99999 availability
  • 8. NameNode failure costs
      • Manual intervention
      • Time to bring up service by hand
      • Remote applications receive errors and can fail.
      • MapReduce layer can fail.
      • Cost of spare server
      • Cost of 7x24 ops team
  • 9. What to improve
      • Address costs of NameNode failure in Hadoop 1
      • Add Live NN failover (HDFS 2.0)
      • Eliminate shared storage (HDFS 2.x)
      • Add resilience to the entire stack
  • 10. Full Stack HA: add resilience to planned/unplanned outages of layers underneath
  • 11. Hadoop Full Stack HA Architecture
      [Diagram: worker nodes of the Hadoop cluster, each a node with DN & TT running tasks; apps running outside and job-submitting clients; an HA cluster for master daemons with NN, JT and NN servers; labels "Failover" and "JT into Safemode"]
  • 12. HA in Hadoop 1 (HDP1)
      Use existing HA clustering technologies to add cold failover of key manager services:
      • VMWare vSphere HA
      • RedHat HA Linux
  • 13. vSphere: VMs are managed
  • 14. “Canary” monitoring
      • In-VM monitor daemon for each service
          – NameNode: process, ports, http, DFS operations
          – JobTracker: process, ports, http, JT status + special handling of HDFS state on startup.
      • Probe failures reported to vSphere
      • Triggers VM kill and restart
      • Differentiating hung service from GC pause hard
  • 15. Defining liveness
      • kill -0 `cat <path to .pid file>`
      • Port open <hostname, port>
      • HTTP GET <URL, response code range>
      • DFS list <filesystem, path>
      • DFS out of safe mode (filesystem)
      • JT cluster status <hostname, port>
      • JT out of safe mode <hostname, port>
  • 16. RedHat HA Linux
      [Diagram: under a ToR switch, hosts IP1 to IP4 carry NameNode (NN IP), NameNode (NN IP), 2ary NameNode (2NN IP) and (Job Tracker) (JT IP), alongside DataNodes]
      HA Linux: heartbeats & failover
  • 17. Linux HA Implementation
      • Replace init.d script with “Resource Agent” script
      • Hook up status probes as for vSphere
      • Detection & handling of hung process hard
      • Test in virtual + physical environments
      • Testing with physical clusters
      • ARP trouble on cross-switch failover: don't
  • 18. Yes, but does it work?
      public void testHungNN() {
        assertRestartsHDFS {
          nnServer.kill(19, "/var/run/hadoop/hadoop-namenode.pid")
        }
      }
      • Groovy JUnit tests + “chaos” library
      • Need a home for this: Bigtop?
  • 19. And how long does it take?
      • 60 Nodes, 60K files, 6 million blocks, 300 TB raw storage: 1-3 minutes
          – Failure detection and Failover: 0.5 to 2 minutes
          – Namenode Startup (exit safemode): 30 sec
      • 180 Nodes, 200K files, 18 million blocks, 900 TB raw storage: 2-4 minutes
          – Failure detection and Failover: 0.5 to 2 minutes
          – Namenode Startup (exit safemode): 110 sec
      • vSphere: add 60s for OS bootup.
      • Cold Failover is good enough for small/medium clusters
  • 20. IPC client resilience
      Configurable retry & time to block:
          ipc.client.connect.max.retries
          dfs.client.retry.policy.enabled
      1. Set in core-site.xml for automatic pickup.
      2. Failure-aware applications can tune/disable
      Blocking works for most clients
  • 21. Job Tracker Resilience
      • “Safe Mode”
          – rejects new submissions
          – does not fail ongoing Jobs, blacklist Task Trackers
      • FS monitoring and automatic safe mode entry: mapreduce.jt.hdfs.monitor.enable
      • Queue rebuild on restart: mapreduce.jt.hdfs.monitor.enable
      • Job Submission Client to retry: mapreduce.jobclient.retry.policy.enabled
  • 22. Testing full stack functionality
      1. Run existing test suites against a cluster being killed repeatedly: MapReduce, HBase, Pig
      2. Specific test jobs performing DFS operations from inside MapReduce jobs to stress MR layer
      Results: NN outages did not cause failures; two unrelated bugs in HDFS were found
  • 23. Putting it all together: Demo
  • 24. HA in Hadoop HDFS 2
  • 25. Hadoop 2.0 HA
      [Diagram: an active NN (IP1) and a standby NN (IP2), each with a Failure-Controller; three ZooKeeper nodes; DataNodes]
      Block reports to both NNs; Failure Controllers & Zookeeper co-ordinate
  • 27. Hadoop 2.1 HA
      [Diagram: an active NN (IP1) and a standby NN (IP2), each with a Failure-Controller; three ZooKeeper nodes; three Journal Nodes; DataNodes]
      Quorum Journal Manager replaces shared storage
      BookKeeper Journal Manager uses Apache BookKeeper
  • 28. When will HDFS 2 be ready?
      • CDH: shipping
      • Apache: beta-phase
  • 29. Moving forward
      • Retry policies for all remote client protocols/libraries in the stack.
      • Monitor/restart for all services
      • Zookeeper service lookup everywhere
      • YARN needs HA of Resource Manager
  • 30. Single Points of Failure
      There's always a SPOF
      Q. How do you find it?
      A. It finds you
  • 31. Questions? CFP for H Summit EU open!
  • 32. Monitoring Lifecycle (state diagram)
      • dependencies: block for upstream services, e.g. JT on HDFS
          – all dependencies live → booting
          – halt → halted
      • booting: wait for probes live; fail fast on live → fail
          – all probes live → live
          – boot timeout, or failure of live probe → failed
          – halt → halted
      • live: check all probes until halted or probe failure
          – probe failure, or probe timeout → failed
          – halt → halted
      • failed: 60s after heartbeats stop, vSphere restarts VM
      • halted: unhook from vSphere; immediate process exit
  • 33. vSphere versus RedHat HA
      • vSphere ideal for small clusters
          – one VM/service
          – fewer physical hosts than service VMs
          – Obvious choice for sites with vSphere HA
      • RedHat HA great for large clusters
          – turns a series of master nodes into a pool of servers with floating services & IP Addresses
          – downgrades “outages” to “transient events”
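
Two of the liveness probes listed on slide 15, "Port open <hostname, port>" and "HTTP GET <URL, response code range>", can be sketched in plain Java. This is a minimal illustration, not the monitor from the talk: the class name `LivenessProbes`, the method names and the timeout parameters are all hypothetical. The pid-file, DFS and JobTracker probes are left out because they need OS-specific calls or a Hadoop client.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.URI;

/** Hypothetical probe helpers; names and timeouts are illustrative assumptions. */
final class LivenessProbes {
    private LivenessProbes() {}

    /** "Port open": true if a TCP connection is accepted within the timeout. */
    static boolean portOpen(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // refused, unreachable or timed out: treat as down
        }
    }

    /** "HTTP GET": true if the status code lies in [minCode, maxCode]. Assumes an http(s) URL. */
    static boolean httpGetInRange(String url, int minCode, int maxCode, int timeoutMillis) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis); // a hung service shows up as a read timeout
            return inRange(conn.getResponseCode(), minCode, maxCode);
        } catch (IOException | IllegalArgumentException e) {
            return false; // malformed URL, connection failure or timeout
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    /** Inclusive range check on an HTTP status code. */
    static boolean inRange(int code, int minCode, int maxCode) {
        return code >= minCode && code <= maxCode;
    }
}
```

A monitor would call these periodically and report a failure only when a probe stays down. Because both probes rely on timeouts, they cannot by themselves tell a hung service from a long GC pause, which is the difficulty slides 14 and 17 point out.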

Editor's notes

  1. Once you adopt Hadoop, it can rapidly become the biggest central store of data in an organisation, at which point you start caring about how well it looks after your data. This talk aims to answer these questions.
  2. HDFS is built on the concept that in a large cluster, disk failure is inevitable. The system is designed to change the impact of this from the beeping of pagers to a background hum. A key part of the HDFS design: copying the blocks across machines means that the data stays available after the loss of a disk, a server or even an entire rack.
  3. There's lots of checksumming of the data going on to pick up corruption: CRCs are created at write time (and even verified end-to-end in a cross-machine write) and checked again at read time.
  4. Rack failures can generate a lot of replication traffic, as every block that was stored in the rack needs to be replicated at least once. The replication still has to follow the constraint of no more than one block copy per server. Much of this traffic is intra-rack, but every block which already has 2x replicas on a single rack will be replicated to another rack if possible. This is what scares ops teams. Important: there is no specific notion of "mass failure" or "network partition". Here HDFS only sees that four machines have gone down.
  5. When the NameNode fails, the cluster is offline. Client applications, including the MapReduce layer and HBase, see this and fail.
  6. When the NN fails, it's obvious to remote applications, which fail, and to the ops team, who have to fix it. It also adds cost to the system. Does it happen often? No, almost never, but having the ops team and hardware ready for it is the cost.
  7. VMWare monitoring is "canary based": when the bird stops singing, the service is in trouble. To implement this we have a new service running in the VM that monitors service health and stops singing when it decides the service is down. If the service, the OS or the monitor fails, vSphere reacts by restarting the VM. If the host server fails, all VMs on it get restarted elsewhere. One troublespot: hung processes. Socket blocks. Something needs to detect that and decide whether this is a service hang or a long GC. There is no way pre-Java 7 to know this, so it's just down to timeouts.
  8. One question is "how long does this take?" For clusters of under a few hundred machines, with not that much of an edit log to replay, failure detection dominates the time. vSphere is the slowest there, as it has to notice that the monitor has stopped sending heartbeats (~60s), then boot the OS (~60s) before bringing up the NN. This is why it's better for smaller clusters. Linux HA fails over faster; if set to poll every 10-15s then failover begins within 20s of the outage.
  9. The risk here isn't just the failover, it's all the other changes that went into HDFS 2. The bug reporting numbers for HDFS 2 are way down on the previous releases, which means that it is either much better or hasn't been tested enough. The big issue here is that the data in your HDFS cluster may be one of the most valuable assets of an organization; you can't afford to lose it. That's why it's good to be cautious with any filesystem, be it Linux or a layer above.
  10. This is the state machine used to monitor a service. As long as the state machine has not terminated, heartbeats are sent to vSphere every 10s. Underneath that we have a model of a service: blocking for (remote) services before even starting; booting, which waits for all probes to go live in a bounded time, and fails fast if any live probe reverts; live, where everything must respond, and timeouts of a probe are picked up too; and halted, a graceful unhook from vSphere, so that the heartbeat stopping isn't picked up.
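
The monitoring lifecycle in the last note can be sketched as a small transition function. This is a minimal sketch under stated assumptions, not the implementation from the talk: the class `ServiceMonitor`, the state and event names, and the choice to ignore events that the note does not mention (for example a probe timeout while still booting) are all illustrative.

```java
/** Hypothetical model of the service-monitor lifecycle; all names are illustrative. */
final class ServiceMonitor {
    private ServiceMonitor() {}

    enum State { DEPENDENCIES, BOOTING, LIVE, FAILED, HALTED }

    enum Event {
        DEPENDENCIES_LIVE,  // all upstream services (e.g. HDFS for the JT) respond
        ALL_PROBES_LIVE,    // every probe has gone live
        LIVE_PROBE_FAILED,  // a probe that was live has reverted
        BOOT_TIMEOUT,       // probes did not all go live within the bounded boot time
        PROBE_TIMEOUT,      // a probe stopped answering while the service was live
        HALT                // graceful shutdown request
    }

    /** Returns the next state; events with no defined transition leave the state unchanged. */
    static State next(State state, Event event) {
        if (isTerminated(state)) {
            return state; // FAILED and HALTED are final
        }
        if (event == Event.HALT) {
            return State.HALTED; // graceful unhook is possible from any running state
        }
        switch (state) {
            case DEPENDENCIES:
                return event == Event.DEPENDENCIES_LIVE ? State.BOOTING : state;
            case BOOTING:
                if (event == Event.ALL_PROBES_LIVE) {
                    return State.LIVE;
                }
                if (event == Event.LIVE_PROBE_FAILED || event == Event.BOOT_TIMEOUT) {
                    return State.FAILED; // fail fast during boot
                }
                return state;
            case LIVE:
                if (event == Event.LIVE_PROBE_FAILED || event == Event.PROBE_TIMEOUT) {
                    return State.FAILED;
                }
                return state;
            default:
                return state;
        }
    }

    static boolean isTerminated(State state) {
        return state == State.FAILED || state == State.HALTED;
    }

    /** Heartbeats go to the HA layer only while the machine has not terminated. */
    static boolean sendsHeartbeat(State state) {
        return !isTerminated(state);
    }
}
```

In this model both terminal states stop the heartbeat. What differs is the action taken on entry: on `HALTED` the monitor would first unhook from vSphere so that the missing heartbeat is not treated as a failure, while on `FAILED` it simply stops, and vSphere restarts the VM once it notices the heartbeats have gone.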