Hadoop Backup and Disaster Recovery
Jai Ranganathan
Cloudera Inc
What makes Hadoop different?

Not much
EXCEPT
• Tera- to Peta-bytes of data
• Commodity hardware
• Highly distributed
• Many different services
What needs protection?

• Data Sets: data & meta-data about your data (Hive)
• Applications: system applications (JT, NN, Region Servers, etc) and user applications
• Configuration: knobs and configurations necessary to run applications
We will focus on….

Data Sets

but not because the others aren’t important..
Existing systems & processes can help manage Apps & Configuration (to some extent)
Classes of Problems to Plan For
Hardware Failures
 • Data corruption on disk
 • Disk/Node crash
 • Rack failure


User/Application Error
 • Accidental or malicious data deletion
 • Corrupted data writes


Site Failures
 • Permanent site loss – fire, ice, etc
 • Temporary site loss – Network, Power, etc (more common)
Business goals must drive solutions
RPOs and RTOs are awesome…
But plan for what you care about – how much is this data worth?

Failure mode         Risk     Cost
Disk failure         High     Low
Node failure         High     Low
Rack failure         Medium   Medium
Accidental deletes   Medium   Medium
Site loss            Low      High
Basics of HDFS*

[HDFS architecture diagram]

* From Hadoop documentation
Hardware failures – Data Corruption
Data corruption on disk

• Checksum metadata for each block is stored with the file
• If checksums do not match, the name node discards the block and replaces it with a fresh copy
• The name node can write its metadata to multiple copies for safety – write to different file systems and make backups
Hardware Failures - Crashes
Disk/Node crash

• Synchronous replication saves the day – the first two replicas are always on different hosts
• Hardware failure detected by heartbeat loss
• Name node HA for meta-data
• HDFS automatically re-replicates blocks without enough replicas through a periodic process
Hardware Failures – Rack failure
Rack failure

• Configure at least 3 replicas and provide rack information (topology.node.switch.mapping.impl or topology.script.file.name)
• 3rd replica always in a different rack
• The 3rd replica is important – it allows a safe time window between failure and detection
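
A minimal sketch of a rack-topology script of the kind topology.script.file.name points at: Hadoop passes node addresses as arguments and expects one rack path per line. The subnets and rack names below are illustrative assumptions.

    #!/bin/bash
    # Hypothetical mapping for a two-rack cluster; adapt to your network
    for node in "$@"; do
      case "$node" in
        10.1.1.*) echo "/dc1/rack1" ;;
        10.1.2.*) echo "/dc1/rack2" ;;
        *)        echo "/default-rack" ;;
      esac
    done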
Don’t forget metadata

• Your data is defined by Hive metadata
• But this is easy! SQL backups as per usual for Hive safety
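
As a sketch, assuming a MySQL-backed metastore (database name, credentials, and backup path are assumptions):

    # Nightly logical backup of the Hive metastore database
    mysqldump --single-transaction -u hive -p metastore \
      > /backups/hive-metastore-$(date +%F).sql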
Cool.. Basic hardware is under control
Not quite

• Employ monitoring to track node health
• Examine data node block scanner reports (http://datanode:50075/blockScannerReport)
• Hadoop fsck is your friend

Of course, your friendly neighborhood Hadoop vendor has tools – Cloudera Manager health checks FTW!
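
For example (the datanode hostname is a placeholder):

    # Cluster-wide health check: reports corrupt, missing and under-replicated blocks
    hadoop fsck / -files -blocks -locations

    # Pull the block scanner report from one datanode
    curl http://datanode1:50075/blockScannerReport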
Phew.. Past the easy stuff
One more small detail…

Upgrades for HDFS should be treated with care
On-disk layout changes are risky!

• Save name node meta-data offsite
• Test the upgrade on a smaller cluster before pushing it out
• Data layout upgrades support roll-back, but be safe
• Make backups of all (or at least the important) data to a remote location before upgrading!
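
A cautious pre-upgrade sketch, assuming the name node metadata directory (dfs.name.dir) is /data/dfs/nn and backup-host is a remote machine:

    hadoop dfsadmin -safemode enter      # quiesce writes
    hadoop dfsadmin -saveNamespace       # persist a fresh fsimage + edits
    tar czf nn-meta-$(date +%F).tar.gz /data/dfs/nn
    scp nn-meta-*.tar.gz backup-host:/safe/location/
    hadoop dfsadmin -safemode leave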
Application or user errors

Apply the principle of least privilege

Permissions scope:
• Users only have access to data they must have access to

Quota management:
• Name quota: limits the number of files rooted at a dir
• Space quota: limits the bytes of files rooted at a dir
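
Both quotas are set through dfsadmin; for instance (directory and limits are illustrative):

    # Cap the number of files and directories rooted at /user/alice
    hadoop dfsadmin -setQuota 100000 /user/alice
    # Cap raw bytes rooted at /user/alice (replication counts against it)
    hadoop dfsadmin -setSpaceQuota 10t /user/alice
    # Check usage against both quotas
    hadoop fs -count -q /user/alice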
Protecting against accidental deletes

Trash server
When enabled, files are deleted into trash
Enable using fs.trash.interval to set the trash interval

Keep in mind:
• Trash deletion only works through the fs shell – programmatic deletes will not employ Trash
• .Trash is a per-user directory for restores
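
A restore then looks roughly like this (paths are illustrative):

    # With fs.trash.interval > 0, a shell delete moves the file into trash
    hadoop fs -rm /data/reports/q1.csv
    # Restore by moving it back out of the per-user trash directory
    hadoop fs -mv /user/$USER/.Trash/Current/data/reports/q1.csv /data/reports/
    # Note: -rm -skipTrash bypasses trash entirely – use with care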
Accidental deletes – don’t forget metadata

• Again, regular SQL backups are key
HDFS Snapshots
What are snapshots?
Snapshots represent the state of the system at a point in time
Often implemented using copy-on-write semantics

• In HDFS, an append-only fs means only deletes have to be managed
• Many of the problems with COW are gone!
HDFS Snapshots – coming to a distro near you

Community is hard at work on HDFS snapshots
Expect availability in major distros within the year

Some implementation details – NameNode snapshotting:
• Very fast snapping capability
• Consistency guarantees
• Restores need to perform a data copy
• .snapshot directories for access to individual files
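
In releases that ship the feature, the workflow looks roughly like this (paths and snapshot names are assumptions):

    # Admin enables snapshots on a directory
    hdfs dfsadmin -allowSnapshot /data/warehouse
    # Take a named snapshot
    hdfs dfs -createSnapshot /data/warehouse nightly-0601
    # Old file versions are readable under the .snapshot directory
    hadoop fs -ls /data/warehouse/.snapshot/nightly-0601
    # Restores copy data back out of the snapshot
    hadoop fs -cp /data/warehouse/.snapshot/nightly-0601/part-0000 /data/warehouse/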
What can HDFS Snapshots do for you?


  • Handles user/application data corruption
         • Handles accidental deletes
   • Can also be used for Test/Dev purposes!
HBase snapshots

            Oh hello, HBase!
Very similar construct to HDFS snapshots
               COW model

               • Fast snaps
        • Consistent snapshots
      • Restores still need a copy
    (hey, at least we are consistent)
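
In HBase versions that include snapshots, the shell workflow is roughly as follows (table and snapshot names are assumptions):

    hbase shell <<'EOF'
    snapshot 'orders', 'orders-snap-0601'
    clone_snapshot 'orders-snap-0601', 'orders_restored'
    list_snapshots
    EOF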
Hive metadata
The recurring theme of data + meta-data

Ideally, metadata is backed up in the same flow as the core data
Consistency of data and metadata is really important
Management of snapshots
Space considerations:

• % of cluster for snapshots
• Number of snapshots
• Alerting on space issues

Scheduling backups:

• Time based
• Workflow based
Great… Are we done?

        Don’t forget Roger Duronio!

Principle of least privilege still matters…
Disaster Recovery

[Diagram: Datacenter A replicating to Datacenter B – HDFS, Hive, HBase]
Teeing vs Copying

Teeing: send data during the ingest phase to both the production and replica clusters
• Time delay is minimal between clusters
• Bandwidth required could be larger
• Requires re-processing data on both sides
• No consistency between sites

Copying: data is copied from production to the replica as a separate step after processing
• Consistent data between both sites
• Process once only
• Time delay for RPO objectives when doing incremental copies
• More bandwidth needed
Recommendations?


       Scenario dependent
                But
Generally prefer copying over teeing
How to replicate – per service

HDFS
• Teeing: Flume and Sqoop support teeing
• Copying: DistCP for copying

HBase
• Teeing: application-level teeing
• Copying: HBase replication

Hive
• Teeing: N/A
• Copying: database import/export*

* Database import/export isn’t the full story
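
A minimal DistCp sketch (cluster hostnames and paths are assumptions):

    # Full copy from the production cluster to the DR cluster
    hadoop distcp hdfs://prod-nn:8020/data/warehouse \
                  hdfs://dr-nn:8020/data/warehouse

    # -update copies only changed files on later incremental runs
    hadoop distcp -update hdfs://prod-nn:8020/data/warehouse \
                  hdfs://dr-nn:8020/data/warehouse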
Hive metadata
The recurring theme of data + meta-data

Ideally, metadata is backed up in the same flow as the core data
Consistency of data and metadata is really important
Key considerations for large data movement

• Is your data compressed?
   – None of the systems support compression on the wire natively
   – WAN accelerators can help, but cost $$

• Do you know your bandwidth needs?
   – Initial data load
   – Daily ingest rate – maintain historical information

• Do you know your network security setup?
   – Data nodes & Region Servers talk to each other – they need network connectivity across sites

• Have you configured security appropriately?
   – Kerberos support for cross-realm trust is challenging

• What about cross-version copying?
   – Can’t always have both clusters on the same version – and cross-version copying is not trivial
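
One common workaround, sketched with assumed hostnames: read the source over the version-independent HFTP interface and run the DistCp job on the destination cluster:

    hadoop distcp hftp://old-nn:50070/data/warehouse \
                  hdfs://new-nn:8020/data/warehouse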
Management of replications
Scheduling replication jobs

• Time based
• Workflow based – kicked off from an Oozie script?

Prioritization

• Keep replications in a separate scheduler group and dedicate capacity to replication jobs
• Don’t schedule more map tasks than the available network bandwidth between sites can handle
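
For example (flag availability depends on your DistCp version; hosts and numbers are assumptions):

    # Cap concurrent copy tasks and, in newer DistCp versions, per-map bandwidth in MB/s
    hadoop distcp -m 20 -bandwidth 10 \
        hdfs://prod-nn:8020/data/warehouse hdfs://dr-nn:8020/data/warehouse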
Secondary configuration and usage
Hardware considerations
• Denser disk configurations are acceptable on the remote site, depending on workload goals – 4 TB disks vs 2 TB disks, etc
• Fewer nodes are typical – consider replicating only critical data. Be careful playing with replication factors

Usage considerations
• Physical partitioning makes the replica a great place for ad-hoc analytics
• Production workloads continue to run on the core cluster, but ad-hoc analytics can run on the replica cluster
• For HBase, all clusters can be used for data serving!
What about external systems?

• Backing up to external systems is a one-way street with large data volumes

• Can’t do useful processing on the other side

• The cost of Hadoop storage is fairly low, especially if you can drive work on it
Summary
• It can be done!

• Lots of gotchas and details to track in the process

• We haven’t even talked about applications and
  configuration!

• Failure workflows are important too – testing,
  testing, testing
Cloudera Enterprise BDR

[Diagram: Cloudera Enterprise architecture]

CLOUDERA MANAGER – Disaster Recovery Module: SELECT → CONFIGURE → SYNCHRONIZE → MONITOR

CDH:
• HDFS distributed replication – high-performance replication using MapReduce
• Hive metastore replication – the only disaster recovery solution for metadata

Editor's notes

  1. Data movement is expensive. Hardware is more likely to fail. There are more complex interactions in a distributed environment. Each service requires different hand-holding.
  2. Keep in mind that configuration may not even make sense to replicate – the remote side may have different configuration options.
  3. Data is split into blocks (default 128 MB). Blocks are replicated (default: 3 times). HDFS is rack aware.
  4. Cloudera Manager helps with replication by managing versions as well.
  5. Cross-version management. Improved distcp. Hive export/import with updates. Simple UI.