2. 2
• How to protect Data in Hadoop environments?
• Do we need Data Protection for Hadoop?
• What motivates people to question whether they need to protect Hadoop?
HOW DID I GET HERE?
3. 3
• Major backup vendors don’t have solutions
• Hadoop size and scale is a challenge
• Hadoop has inbuilt Data Protection properties
WHAT I FOUND
9. 9
• Onboard Data Protection methods
– Built into HDFS
– Captive
• Offboard Data Protection methods
– Getting copies of data out of Hadoop
HADOOP INBUILT DATA PROTECTION
10. 10
ONBOARD DATA PROTECTION
Access Layer Redundancy: NameNode HA (cf. redundant storage controllers)
Persistence Layer Redundancy: N-way Replication (cf. RAID/EC schemes)
11. 11
• Proactive Data Protection
• HDFS does not assume data stays correct
• Protects against data corruption
• Verify integrity and repair from replica copies
ONBOARD DATA PROTECTION
12. 12
• HDFS Snapshots
• Read only
• Directory level
• Not consistent at time of snapshot
• Preserves consistency on file close (beware open files!)
• Data owner controls the snapshot
ONBOARD DATA PROTECTION
13. 13
• HDFS Trash (recycle bin)
• Moves deleted files to user trash bin
• Deleted after predefined time
• Implemented in HDFS client
• Can be overridden by user
• Trash bin can be accessed or moved back
ONBOARD DATA PROTECTION
14. 14
• Distributed Copy
• HDFS, S3, OpenStack Swift, FTP, Azure (2.7.0)
• Single file copy performance bound to one data node
• 10 TB file @ 1 GbE ≈ 22 hours
OFFBOARD DATA PROTECTION
15. 15
To answer the question "Is Hadoop's inbuilt data protection good enough?"
we need to understand what we are protecting against…
17. 17
There is no such thing as software
that does not unexpectedly fail
18. 18
In 2009 Hortonworks examined
HDFS’s data integrity at Yahoo!
HDFS lost 650 blocks out of
329 million blocks on 10 clusters
with 20,000 nodes
85% due to software bugs
15% due to single block replica
Welcome.
Peter Marelas
Principal Systems Engineer for EMC's Data Protection Solutions Division
Today we will be learning about Data Protection for Hadoop Environments.
I'll share the story of how I got here; I didn't know Hadoop 12 months ago.
So my day job involves architecting solutions for Enterprise customers.
Most of the time I’m architecting solutions for mission critical workloads, like EDW, CRM, ERP.
But more recently customers have been asking us how to protect data in Hadoop environments.
And some customers even asked us: do we need to protect Hadoop environments?
So I didn’t have all the answers.
I spent about 1 month researching Hadoop Data Protection
Here is what I found.
None of the major backup vendors had solutions for Hadoop
Hadoop’s size and scale is so daunting most customers don’t even know where to start.
Hadoop has some interesting inbuilt data protection properties
So I figured the first two points we could investigate and probably solve
But I wanted to understand the last point before I did anything else
And so as part of my research I wanted to answer the question..
Are Hadoop’s inbuilt data protection properties good enough?
Now, before we explore that question, I want to take you through some of the constraints of traditional Hadoop architectures relative to Enterprise architectures, in the context of data protection
This is a typical Enterprise application architecture.
Blue boxes are the servers. Green boxes are the app storage.
Two ways to create data protection copies
Stream data via app servers to heterogeneous storage (grey boxes)
That’s what most backup solutions do today for Enterprise apps assuming sufficient time and resources
We call this a server-centric protection strategy
Other option is to use versioned storage replication to create our copies and recovery points
We call this a storage-centric protection strategy
Contrast this to a standard Hadoop architecture where storage and compute are combined
We cannot use storage-centric methods to protect the data – plain disk, no intelligence
So the constraint is we have to drive the process using a server-centric approach.
Given this constraint another goal was to find an efficient method to protect Hadoop
I am going to demo this approach at the end of this presentation
So lets go back to answer this question
Hadoop has two types of Data Protection properties that I have classified into onboard and offboard methods.
Onboard is concerned with protecting data without leaving the cluster
Offboard is about getting copies of data out of the cluster
If we look at onboard protection first
Hadoop provides redundancy at the data access layer using a Highly Available NameNode
This is like having redundant storage controllers in a storage system
For the persistence layer the Hadoop file system implements N-way replication across nodes and racks
This is equivalent to a RAID scheme for storage systems
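As a rough illustration of what that replication looks like in practice (the paths here are hypothetical, not from the talk), you can ask HDFS where a file's block replicas live and adjust the replication factor per path:

  # Show a file's blocks and which nodes/racks hold each replica
  $ hdfs fsck /data/warehouse/events.log -files -blocks -locations -racks
  # Raise the replication factor for a particularly important path and wait for completion
  $ hdfs dfs -setrep -w 5 /data/warehouse/critical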
Hadoop also provides proactive Data Protection
HDFS does not trust disk storage
Assumes disks will degrade and return the wrong data
To protect against this it generates checksums, regularly verifies them, and repairs corruption from replica copies
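You can see traces of this from the command line; a minimal sketch (the path is hypothetical):

  # Report the checksum HDFS maintains for a file
  $ hdfs dfs -checksum /data/warehouse/events.log
  # Ask the NameNode which files currently have corrupt or missing blocks
  $ hdfs fsck /data/warehouse -list-corruptfileblocks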
HDFS also supports read only snapshots
There are 2 caveats with them
They do not behave like storage system snapshots
Storage snapshots are consistent for open and closed files
HDFS snapshots are consistent for closed files only
If you want consistent recovery points you need to ensure files are closed before taking a snapshot
Also keep this in mind
Snapshots can be deleted by data owners
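Before moving on, a minimal sketch of the snapshot workflow (directory and snapshot names are hypothetical):

  # An administrator must first mark the directory as snapshottable
  $ hdfs dfsadmin -allowSnapshot /data/warehouse
  # The data owner (or a scheduled job) then takes named, read-only snapshots
  $ hdfs dfs -createSnapshot /data/warehouse nightly-20150601
  # Snapshots live under a hidden .snapshot directory and can be listed or deleted
  $ hdfs dfs -ls /data/warehouse/.snapshot
  $ hdfs dfs -deleteSnapshot /data/warehouse nightly-20150601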
HDFS has a trash feature that operates like a recycle bin
Files move into trash once deleted and are then removed after a predetermined time
Keep in mind
Implemented in HDFS client
Can be emptied at any time by file owner
Can be overridden by file owner
If you're deleting files some other way, there is no trash
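To make the behaviour concrete, a hedged sketch (paths are hypothetical); retention itself is set via fs.trash.interval in core-site.xml, in minutes:

  # A delete issued through the HDFS client moves the file into the user's trash
  $ hdfs dfs -rm /data/warehouse/tmp/part-00000
  # ...where it sits under the user's home directory until the interval expires
  $ hdfs dfs -ls /user/<username>/.Trash/Current
  # The owner can bypass trash entirely, or empty it at any time
  $ hdfs dfs -rm -skipTrash /data/warehouse/tmp/part-00000
  $ hdfs dfs -expunge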
So those were the onboard data protection properties that come with Hadoop.
Offboard data protection is provided by Hadoop distributed copy
Lets you create copies of files to various targets.
HDFS, S3, OpenStack Swift, FTP, Azure, etc.
Distributed copy is great as it distributes the work amongst nodes, so it can scale with your cluster
However, each file copy is mapped to one node
Single file copy performance is bound by the network performance of one data node
Keep this in mind
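For reference, a basic DistCp invocation looks like this (the cluster names and mapper count are hypothetical); the figure on the slide falls out of simple arithmetic:

  # Copy a directory tree to a second cluster; the file list is split across map tasks
  $ hadoop distcp -m 20 hdfs://nn1:8020/data/warehouse hdfs://nn2:8020/backup/warehouse
  # But each individual file is copied by a single mapper on a single node, so:
  #   10 TB over 1 GbE ≈ (10e12 bytes × 8) / 1e9 bits/s ≈ 80,000 s ≈ 22 hours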
So now we know what Hadoop provides out of the box with respect to Data Protection
We need to ask the question..
What are we protecting against?
And how do Hadoop's inbuilt methods fare?
This is a Data Loss Event Matrix I use to assess Data Protection strategies
On the left we have the events that can lead to data loss
To the right we have the rating
Then to the right again we have features and properties applicable to the event.
And to the far right, concerns about those features relative to the event
My conclusion.
Hadoop fares well when it comes to
Data Corruption, Component failures, Infrastructure software failures – firmware
Has risk when it comes to
Operational Failures, Site Failure, User Accidents, App software failures, Malicious user events, Malware
I am a big believer that no software is immune to failure
Some examples
Data integrity study @ Yahoo!
650 blocks lost out of 329 million
That’s a phenomenal achievement but look at the causes
85% due to software bugs
15% due to single block replica – operator error
That last one is interesting.
What I found is it's difficult to enforce data protection standards in Hadoop
You can set a default
But Data owners can define their own and change them retrospectively
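For example (hypothetical path), a file owner can quietly undo the cluster-wide default:

  # The cluster default may be 3 replicas, but the owner can drop a path to 1 after the fact
  $ hdfs dfs -setrep 1 /user/alice/results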
Although very rare
I did find one known open condition in the Apache codebase that can cause blocks to be lost
And a new thing to keep in mind: as of the 2.7 release, HDFS supports truncate operations
In the past we assumed immutability = protection
That assumption is no longer valid
Moral of the story
Plan for software failures
Plan for human failures
But be sensible in your approach
Not all data is equal
Only protect the data that is valuable
And only protect what can’t be derived again in a reasonable timeframe
Here are my three Guiding Principles for Data Protection Strategies
Diversify your protection copies
Analogy here is investments.
We hedge our risk by spreading investments across asset classes
We should do the same with data copies
Message: Keep your protection copies on a system that is diverse and different to the source system
We want to maintain logical and physical isolation so that problems that impact the source system do not propagate to the target system
For this to be successful we need to have separation of concerns
Rule: the system protecting the source data should not trust the source system
We want our protection copies to be frequently verified
We don’t want to assume data is written correctly or stays correct
We need to regularly verify in an automated way
So here is a sample strategy that you can use that aligns the protection method to the value of data.
Data that is desirable: protect it with HDFS trash only.
Data that is necessary: protect it with HDFS trash and snapshots.
Data that is essential: protect it with trash, snapshots, and distributed copy to another HDFS target.
Data that is critical: protect it with all of the above plus versioned copies to a diverse and different storage target.
I'll now demo how you can use Hadoop distributed copy to create versioned copies on Data Domain, which is diverse and different to a Hadoop cluster
Data Domain is our protection storage platform
It has a few unique properties
Does inline deduplication
Strong data integrity properties
Really fast at ingesting streaming data which is good for distributed copy
What’s unique about this approach is we are going to use distributed copy with an efficient incremental forever technique to maintain versioned copies
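The demo itself isn't reproduced here, but a minimal sketch of an incremental-forever pattern with DistCp might look like the following; the mounted share standing in for the protection storage, the paths, and the versioning note are my assumptions, not the actual demo configuration:

  # First run: seed a full copy onto the protection target (shown here as a mounted share)
  $ hadoop distcp /data/warehouse file:///mnt/protection/warehouse/current
  # Subsequent runs: -update skips files already present at the target with the same size
  # (-skipcrccheck avoids checksum comparison against a non-HDFS target),
  # so each pass transfers roughly the incremental change
  $ hadoop distcp -update -skipcrccheck /data/warehouse file:///mnt/protection/warehouse/current
  # Turning the "current" copy into dated recovery points is handled on the
  # protection target itself, outside DistCp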