2. 2
• How to protect Data in Hadoop environments?
• Do we need Data Protection for Hadoop?
• What motivates people to question whether they need to protect Hadoop?
HOW DID I GET HERE?
3. 3
• Major backup vendors don’t have solutions
• Hadoop size and scale is a challenge
• Hadoop has inbuilt Data Protection properties
WHAT I FOUND
9. 9
• Onboard Data Protection methods
– Built into HDFS
– Captive
• Offboard Data Protection methods
– Getting copies of data out of Hadoop
HADOOP INBUILT DATA PROTECTION
10. 10
ONBOARD DATA PROTECTION
Access Layer Redundancy: NameNode HA (cf. redundant storage controllers)
Persistence Layer Redundancy: N-way Replication (cf. RAID/EC schemes)
11. 11
• Proactive Data Protection
• HDFS does not assume data stays correct
• Protects against data corruption
• Verify integrity and repair from replica copies
ONBOARD DATA PROTECTION
12. 12
• HDFS Snapshots
• Read only
• Directory level
• Not consistent at time of snapshot
• Preserves consistency on file close (beware open files!)
• Data owner controls the snapshot
ONBOARD DATA PROTECTION
13. 13
• HDFS Trash (recycle bin)
• Moves deleted files to user trash bin
• Deleted after predefined time
• Implemented in HDFS client
• Can be overridden by user
• Trash bin can be accessed or moved back
ONBOARD DATA PROTECTION
14. 14
• Distributed Copy
• HDFS, S3, OpenStack Swift, FTP, Azure (2.7.0)
• Single file copy performance bound to one data node
• 10 TB file @ 1 GbE ≈ 22 hours
OFFBOARD DATA PROTECTION
15. 15
To answer the question "Is Hadoop's inbuilt data protection good enough?"
we need to understand what we are protecting against…
17. 17
There is no such thing as software
that does not unexpectedly fail
18. 18
In 2009 Hortonworks examined
HDFS’s data integrity at Yahoo!
HDFS lost 650 blocks out of
329 million blocks on 10 clusters
with 20,000 nodes
85% due to software bugs
15% due to single block replica
Welcome.
Peter Marelas
Principal Systems Engineer for EMC's Data Protection Solutions Division
Today we will be learning about Data Protection for Hadoop Environments.
I'll share the story of how I got here; I didn't know Hadoop 12 months ago.
So my day job involves architecting solutions for Enterprise customers.
Most of the time I’m architecting solutions for mission critical workloads, like EDW, CRM, ERP.
But more recently customers have been asking us how to protect data in Hadoop environments.
And some customers even asked us: do we need to protect Hadoop environments?
So I didn’t have all the answers.
I spent about 1 month researching Hadoop Data Protection
Here is what I found.
None of the major backup vendors had solutions for Hadoop
Hadoop’s size and scale is so daunting most customers don’t even know where to start.
Hadoop has some interesting inbuilt data protection properties
So I figured the first two points we could investigate and probably solve
But I wanted to understand the last point before I did anything else
And so as part of my research I wanted to answer the question..
Are Hadoop’s inbuilt data protection properties good enough?
Now, before we explore that question, I want to take you through some of the constraints of traditional Hadoop architectures relative to Enterprise architectures, in the context of data protection
This is a typical Enterprise application architecture.
Blue boxes are the servers. Green boxes are the app storage.
Two ways to create data protection copies
Stream data via app servers to heterogeneous storage (grey boxes)
That’s what most backup solutions do today for Enterprise apps assuming sufficient time and resources
We call this a server-centric protection strategy
Other option is to use versioned storage replication to create our copies and recovery points
We call this a storage-centric protection strategy
Contrast this to a standard Hadoop architecture where storage and compute are combined
We cannot use storage-centric methods to protect the data – plain disk, no intelligence
So the constraint is we have to drive the process using a server-centric approach.
Given this constraint another goal was to find an efficient method to protect Hadoop
I am going to demo this approach at the end of this presentation
So lets go back to answer this question
Hadoop has two types of Data Protection properties that I have classified into onboard and offboard methods.
Onboard is concerned with protecting data without leaving the cluster
Offboard is about getting copies of data out of the cluster
If we look at onboard protection first
Hadoop provides redundancy at the data access layer using a Highly Available NameNode
This is like having redundant storage controllers in a storage system
For the persistence layer the Hadoop file system implements N-way replication across nodes and racks
This is equivalent to a RAID scheme for storage systems
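As a rough illustration of what that replication looks like in practice (the paths here are hypothetical, not from the talk), you can ask HDFS where a file's block replicas live and adjust the replication factor per path:

  # Show a file's blocks and which nodes/racks hold each replica
  $ hdfs fsck /data/warehouse/events.log -files -blocks -locations -racks
  # Raise the replication factor for a particularly important path and wait for completion
  $ hdfs dfs -setrep -w 5 /data/warehouse/critical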
Hadoop also provides proactive Data Protection
HDFS does not trust disk storage
Assumes disks will degrade and return the wrong data
To protect against this it generates checksums, regularly verifies them, and repairs corruption from replica copies
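You can see traces of this from the command line; a minimal sketch (the path is hypothetical):

  # Report the checksum HDFS maintains for a file
  $ hdfs dfs -checksum /data/warehouse/events.log
  # Ask the NameNode which files currently have corrupt or missing blocks
  $ hdfs fsck /data/warehouse -list-corruptfileblocks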
HDFS also supports read only snapshots
There are 2 caveats with them
They do not behave like storage system snapshots
Storage snapshots are consistent for open and closed files
HDFS snapshots are consistent for closed files only
If you want consistent recovery points you need to ensure files are closed before taking a snapshot
Also keep this in mind
Snapshots can be deleted by data owners
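Before moving on, a minimal sketch of the snapshot workflow (directory and snapshot names are hypothetical):

  # An administrator must first mark the directory as snapshottable
  $ hdfs dfsadmin -allowSnapshot /data/warehouse
  # The data owner (or a scheduled job) then takes named, read-only snapshots
  $ hdfs dfs -createSnapshot /data/warehouse nightly-20150601
  # Snapshots live under a hidden .snapshot directory and can be listed or deleted
  $ hdfs dfs -ls /data/warehouse/.snapshot
  $ hdfs dfs -deleteSnapshot /data/warehouse nightly-20150601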
HDFS has a trash feature that operates like a recycle bin
Files move into trash once deleted and are then removed after a predetermined time
Keep in mind
Implemented in HDFS client
Can be emptied at any time by file owner
Can be overridden by file owner
If you're deleting files some other way, there is no trash
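To make the behaviour concrete, a hedged sketch (paths are hypothetical); retention itself is set via fs.trash.interval in core-site.xml, in minutes:

  # A delete issued through the HDFS client moves the file into the user's trash
  $ hdfs dfs -rm /data/warehouse/tmp/part-00000
  # ...where it sits under the user's home directory until the interval expires
  $ hdfs dfs -ls /user/<username>/.Trash/Current
  # The owner can bypass trash entirely, or empty it at any time
  $ hdfs dfs -rm -skipTrash /data/warehouse/tmp/part-00000
  $ hdfs dfs -expunge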
So those were the onboard data protection properties that come with Hadoop.
Offboard data protection is provided by Hadoop distributed copy
Lets you create copies of files to various targets.
HDFS, S3, OpenStack Swift, FTP, Azure, etc.
Distributed copy is great as it distributes the work amongst nodes, so it can scale with your cluster
However, each file copy is mapped to one node
Single file copy performance is bound by the network performance of one data node
Keep this in mind
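For reference, a basic DistCp invocation looks like this (the cluster names and mapper count are hypothetical); the figure on the slide falls out of simple arithmetic:

  # Copy a directory tree to a second cluster; the file list is split across map tasks
  $ hadoop distcp -m 20 hdfs://nn1:8020/data/warehouse hdfs://nn2:8020/backup/warehouse
  # But each individual file is copied by a single mapper on a single node, so:
  #   10 TB over 1 GbE ≈ (10e12 bytes × 8) / 1e9 bits/s ≈ 80,000 s ≈ 22 hours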
So now we know what Hadoop provides out of the box with respect to Data Protection
We need to ask the question..
What are we protecting against?
And how do Hadoop's inbuilt methods fare?
This is a Data Loss Event Matrix I use to assess Data Protection strategies
On the left we have the events that can lead to data loss
To the right we have the rating
Then to the right again we have features and properties applicable to the event.
And to the far right, concerns about those features relative to the event
My conclusion.
Hadoop fares well when it comes to
Data Corruption, Component failures, Infrastructure software failures – firmware
Has risk when it comes to
Operational Failures, Site Failure, User Accidents, App software failures, Malicious user events, Malware
I am a big believer that no software is immune to failure
Some examples
Data integrity study @ Yahoo!
650 blocks lost out of 329 million
That’s a phenomenal achievement but look at the causes
85% due to software bugs
15% due to single block replica – operator error
That last one is interesting.
What I found is it's difficult to enforce data protection standards in Hadoop
You can set a default
But Data owners can define their own and change them retrospectively
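For example (hypothetical path), a file owner can quietly undo the cluster-wide default:

  # The cluster default may be 3 replicas, but the owner can drop a path to 1 after the fact
  $ hdfs dfs -setrep 1 /user/alice/results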
Although very rare
I did find one known open condition in the Apache codebase that can cause blocks to be lost
And a new thing to keep in mind: as of the 2.7 release, HDFS supports truncate operations
In the past we assumed immutability = protection
That assumption is no longer valid
Moral of the story
Plan for software failures
Plan for human failures
But be sensible in your approach
Not all data is equal
Only protect the data that is valuable
And only protect what can’t be derived again in a reasonable timeframe
Here are my three Guiding Principles for Data Protection Strategies
Diversify your protection copies
Analogy here is investments.
We hedge our risk by spreading investments across asset classes
We should do the same with data copies
Message: Keep your protection copies on a system that is diverse and different to the source system
We want to maintain logical and physical isolation so that problems that impact the source system do not propagate to the target system
For this to be successful we need to have separation of concerns
Rule: the system protecting the source data should not trust the source system
We want our protection copies to be frequently verified
We don’t want to assume data is written correctly or stays correct
We need to regularly verify in an automated way
So here is a sample strategy that you can use that aligns the protection method to the value of data.
Data that is desirable: protect it with HDFS trash only.
Data that is necessary: protect it with HDFS trash and snapshots.
Data that is essential: protect it with trash, snapshots, and distributed copy to another HDFS target.
Data that is critical: protect it with all of the above plus versioned copies to a diverse and different storage target.
I'll now demo how you can use Hadoop distributed copy to create versioned copies on Data Domain, which is diverse and different to a Hadoop cluster
Data Domain is our protection storage platform
It has a few unique properties
Does inline deduplication
Strong data integrity properties
Really fast at ingesting streaming data which is good for distributed copy
What’s unique about this approach is we are going to use distributed copy with an efficient incremental forever technique to maintain versioned copies
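The demo itself isn't reproduced here, but a minimal sketch of an incremental-forever pattern with DistCp might look like the following; the mounted share standing in for the protection storage, the paths, and the versioning note are my assumptions, not the actual demo configuration:

  # First run: seed a full copy onto the protection target (shown here as a mounted share)
  $ hadoop distcp /data/warehouse file:///mnt/protection/warehouse/current
  # Subsequent runs: -update skips files already present at the target with the same size
  # (-skipcrccheck avoids checksum comparison against a non-HDFS target),
  # so each pass transfers roughly the incremental change
  $ hadoop distcp -update -skipcrccheck /data/warehouse file:///mnt/protection/warehouse/current
  # Turning the "current" copy into dated recovery points is handled on the
  # protection target itself, outside DistCp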