3. About Me
• Software Engineer @Hortonworks
– HDFS development lead
• PMC Member at Apache Hadoop
– Started working on Hadoop in 2007
– One of the most active contributors/committers of HDFS
• Used Hadoop to compute π at the two-quadrillionth (2×10^15th) bit
– It was a World Record.
• Received Ph.D. from the University of Maryland, College Park
– π = 3.141592654…
4. Previous Storage Model
• Datanode: a single logical storage consisting of
one or more physical storage media
Supporting Heterogeneous Storage 4
5. New Storage Model
• Datanode: a collection of storages,
each storage corresponding to a
physical storage medium.
6. Advantages of the Multi-Storage Model
• Heterogeneity
– Individual storages may have different characteristics: capacity, speed, cost, etc.
• Flexibility
– Clients can write to a particular storage type
• Scalability
– A datanode can support more storage drives (>100), since storage reports are sent individually
7. Storage Types
• ARCHIVE: archival drive
– slow but cheap
• DISK: hard disk drive
– default type
• SSD: solid state drive
– fast but expensive
• RAM_DISK: memory drive
– capacity is limited
– data can be lazily persisted to DISK
8. Block Storage Policy
• Describes how a data block is stored in HDFS
– When a new file/block is created, how should the data be stored?
– After a file/block is created, should its replicas be moved to other storage types?
– When there is not enough space in a particular storage type,
• should the operation fail?
• or fall back to another storage type?
9. Block Storage Policies
• HOT
– Store all replicas on DISK
– Fall back to ARCHIVE if necessary
– Default policy
• WARM
– Store one replica on DISK and the other replicas on ARCHIVE
– May fall back to DISK or ARCHIVE
– For data that needs to be archived soon
• COLD
– Store all replicas on ARCHIVE
– Do not fall back to any other storage type
– For archival data
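The replica placement implied by the HOT, WARM, and COLD policies can be sketched as a small helper. This is a hypothetical illustration of the mapping, not HDFS source code; the function name and structure are invented for clarity.

```python
# Illustrative sketch (not HDFS code) of how the HOT/WARM/COLD block
# storage policies map replicas to storage types.

def place_replicas(policy, replication=3):
    """Return the intended storage type for each replica under a policy."""
    if policy == "HOT":
        return ["DISK"] * replication             # may fall back to ARCHIVE
    if policy == "WARM":
        return ["DISK"] + ["ARCHIVE"] * (replication - 1)
    if policy == "COLD":
        return ["ARCHIVE"] * replication          # no fallback
    raise ValueError(f"unknown policy: {policy}")

print(place_replicas("WARM"))  # ['DISK', 'ARCHIVE', 'ARCHIVE']
```

The fallback behavior noted in the comments only applies when the preferred storage type has no space, per the policy definitions above.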
11. More Block Storage Policies
• ONE_SSD
– Store one replica on SSD and the other replicas on DISK
– May fall back to SSD or DISK
– For data that needs to be accessed locally after being written
• ALL_SSD
– Store all replicas on SSD
– Fall back to DISK if necessary
– For frequently accessed data
• LAZY_PERSIST
– Store only one replica, in memory
– The in-memory replica is lazily persisted to DISK later on
– For transient data
13. Enforcing Block Storage Requirements
• When a block is created,
– the block placement policy decides how to store its replicas.
• After blocks are stored in HDFS, datanodes or the network may fail, so that
– blocks may become under- or over-replicated
– blocks may fail to satisfy the rack requirement
– blocks may fail to satisfy the storage policy requirement
– the cluster may become imbalanced (some datanodes over-utilized while others are under-utilized)
HDFS needs to fix them!
14. Maintaining Block Storage Requirements
• Replication Monitor
– Copies/deletes replicas to satisfy the replication and rack requirements
– Does not move replicas to fulfill storage policy requirements or to balance the cluster
• Balancer
– Moves replicas around to balance the cluster
– Preserves the replication and rack requirements
– Preserves storage type: always moves a replica to another storage of the same type
• Mover (a new tool)
– Moves replicas around to satisfy storage policy requirements
– Preserves the replication and rack requirements
15. Feature Development
• HDFS-2832: Heterogeneous Storage
– 66 subtasks (Mainly contributed by Hortonworks)
– Committed to Hadoop 2.3.0
• HDFS-6584: Archival Storage
– 33 subtasks (Mainly contributed by Hortonworks)
– Committed to Hadoop 2.6.0
• HDFS-7584: Quota Support for Storage Types
– 11 subtasks (Mainly contributed by Hortonworks)
– Committed to Hadoop 2.7.0
• HDFS-7197: Enhancements to Mover
– 10 subtasks (Mainly contributed by eBay)
21. Data at eBay's Clusters
• What's stored?
– Different types of user behavior data
– Experimentation results
– Buyer and seller data
• How is it used?
HDFS Tiered Storage 21
23. Many options for Archival Storage
• Create another HDFS cluster
– Move data using DistCp
• Non-HDFS cheap storage
• Tiered Storage
– DISK
– ARCHIVE
24. Benefits of using Tiered Storage
• Data resides in one cluster
– No change for applications
– Security of datasets remains unchanged
– Audit log history can be aggregated from one cluster
– Data can be moved back to compute nodes on demand
– Operational simplicity
26. DISK vs. ARCHIVE Node

                        Regular (DISK)    Archive
Storage capacity        40 TB             210 TB
# of drives             12                60
# of processing units   32                4
Main memory             128 GB            64 GB
Cost / GB               X¢ (varies)       X/5¢ (varies)
Runs Node Manager       Yes               No
27. Archival Data Node Configuration
• dfs.datanode.data.dir
– [ARCHIVE]/hadoop/1/data, [ARCHIVE]/hadoop/2/data, …
• dfs.datanode.balance.max.concurrent.moves
– 400 (10 for DISK nodes)
• dfs.datanode.balance.bandwidthPerSec
– 400 MB (10 MB for DISK nodes)
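As a sketch, the settings above correspond to entries like the following in hdfs-site.xml on the ARCHIVE datanodes. The data directory paths are examples, and note that dfs.datanode.balance.bandwidthPerSec takes bytes per second, so 400 MB/s is written as 419430400.

```xml
<property>
  <name>dfs.datanode.data.dir</name>
  <value>[ARCHIVE]/hadoop/1/data,[ARCHIVE]/hadoop/2/data</value>
</property>
<property>
  <name>dfs.datanode.balance.max.concurrent.moves</name>
  <value>400</value>
</property>
<property>
  <name>dfs.datanode.balance.bandwidthPerSec</name>
  <!-- 400 MB/s expressed in bytes per second -->
  <value>419430400</value>
</property>
```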
28. Archival Process
• For each dataset
– Scan the HDFS audit logs to identify the data access pattern
– Derive an archival policy. Example:
– Set the storage policy on subdirectories based on the archival policy
– Run the Mover for the dataset
Time            Storage Policy   Block Placement
< 90 days       HOT              3 replicas on DISK
90–270 days     WARM             1 DISK, 2 ARCHIVE
> 270 days      COLD             3 replicas on ARCHIVE
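The age-based schedule in this example archival policy can be sketched as a small helper; this is an illustration of the policy table, not part of HDFS or eBay's tooling.

```python
# Hypothetical sketch: pick a storage policy from a dataset's age,
# following the example archival policy table (90/270-day thresholds).

def storage_policy(age_days):
    """Return the storage policy for data of the given age, in days."""
    if age_days < 90:
        return "HOT"    # 3 replicas on DISK
    if age_days < 270:
        return "WARM"   # 1 replica on DISK, 2 on ARCHIVE
    return "COLD"       # 3 replicas on ARCHIVE

print(storage_policy(100))  # WARM
```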
29. Notes
• Mover speed goes up with more sources and destinations
– With 2000 DISK nodes moving to 48 ARCHIVE nodes, 1 PB gets moved in 12 hours
• FSCK displays information on storage types and storage policies for a directory.
• Detailed storage tier information on the Namenode UI is in progress.
30. Summary
• eBay uses Tiered Storage to store rarely used data
– Reduces storage costs by using big storage with limited computing
• Tiered storage can be operated using storage types and storage policies.
• An archival policy needs to be set up for datasets based on their access pattern.
– Keep Audit Logs to analyze the access pattern.
• Tiered storage information is available via FSCK.
In this presentation, I will first describe the motivation behind archiving some of our datasets and the benefits of using tiered storage to archive the data.
Then we will go over the setup of tiered storage on one of our clusters.
I’ll describe our Archival process to identify the data to archive and their movement to archival storage.
Let us first review eBay’s requirements for Archival Storage by taking a look at our clusters.
The Apollo cluster was set up in 2012 with around 20 petabytes of storage; as we needed more storage and computing, we increased the capacity to 36 petabytes in 2013.
Towards the end of 2013 we were getting close to capacity again, so we added more datanodes and brought the capacity up to 40 petabytes.
Our computing capacity requirements for the Apollo cluster were satisfied at this point, but the data kept growing.
So it was evident that we needed to increase the storage capacity to store the existing data as well as the new data coming in.
So far, to increase capacity, we had added machines that have both storage and computing power. But this time we wanted to take advantage of the fact that we needed more storage but no more compute. We wanted to reduce costs by paying only for storage.
We decided to take a look at the data we store in our cluster to identify the data which are rarely used.
A major class of data that we store in the Apollo cluster is user behavior data, of which there are several types. There is experimentation data, buyer- and seller-related data, etc. This data is used by various teams at eBay. They write their own MapReduce jobs, Hive queries, Spark jobs, etc. to aggregate or summarize the data to derive insights or make decisions. For most of the datasets mentioned here, we expected that the data is heavily used initially and that usage then goes down.
We decided to verify this assumption by scanning our audit logs. The audit logs show who accessed which data and when. At eBay, we store our HDFS audit logs in HDFS itself, and we keep them for a long time.
This chart shows the access pattern of files under a dataset that contains information about users' sessions. The session data was stored in HDFS in early January 2014. The X axis shows the age of the data in months and the Y axis shows the access count. In the first month, there were 35K accesses to files under the directory. This dropped to 30K in February. In March, it was around 7.5K. From the 8th month onwards, there was not much access.
For this data, there is not much computation happening after 8 months, so there is no benefit to keeping it on a node that has computing power. We can move it to a datanode that has little computing power.
There are many options for Archival storage.
We could create an HDFS-only cluster using machines with low computing power. The data can be moved using DistCp.
Another option is to use cheap non-HDFS storage and copy the data from the original cluster to this external storage.
But if you want to keep the data in the same cluster, HDFS tiered storage is the right choice. HDFS tiered storage introduces the concept of storage types. There is a DISK storage type and an ARCHIVE storage type. The default DISK storage type refers to normal storage, where the datanode has both compute and storage so that we can take advantage of data locality for data processing.
The ARCHIVE storage type refers to datanodes that do not have much computing power. Since the computing power is low, we pay mainly for the storage and not for computing. This lowers the cost per GB for an ARCHIVE node compared to a DISK node.
Once you have DISK and ARCHIVE storage types in the cluster, you archive data by moving the blocks from DISK Storage to ARCHIVE Storage.
There are many benefits to archiving with tiered storage. Most of them stem from the fact that the dataset continues to reside in HDFS in the original cluster.
The applications that operate on the data do not need to change at all; the HDFS location remains the same.
The attributes of the dataset, such as ACLs and permissions, remain the same. The audit logs continue to record information as if nothing has changed for these datasets, so the audit log history contains the full history from the time the data was stored until you delete it.
If the data access pattern changes and archived data becomes heavily used, we can move the data back to compute nodes to regain data locality during computation.
There is also operational simplicity, since you do not need to manage a separate form of archival storage.
Setting up Tiered storage on our Apollo Cluster.
As I mentioned, we already had 40 PB of storage with storage type DISK, comprising 2000 datanodes.
We added a new set of datanodes that are heavy in storage but have limited computing power: 48 nodes with a total storage capacity of 10 petabytes.
This formed our tiered storage: 40 petabytes of DISK and 10 petabytes of ARCHIVE storage.
Let us compare a DISK datanode with an ARCHIVE datanode. Our regular DISK nodes have a capacity ranging from 20 TB to 40 TB; we have a lot of 20 TB nodes and a few 40 TB nodes, but the details in the slide refer to the 40 TB datanodes since they are the most recent. The ARCHIVE nodes have 210 TB of storage capacity per datanode.
The DISK nodes have 12 drives, whereas the ARCHIVE nodes have 60. The regular DISK nodes run computations on the data and have a total of 32 processing units. The ARCHIVE nodes only store data; their processing power is mainly needed to run the operating system and the datanode process, so they have only 4 processing units. The regular datanodes have 128 GB of main memory, whereas the ARCHIVE nodes have only 64 GB. The cost per GB of ARCHIVE storage is about one fifth that of DISK storage. No Node Manager runs on an ARCHIVE datanode, since the data is rarely accessed; it runs only the datanode process. The DISK nodes run both the Node Manager, to provision containers, and the datanode process.
The ARCHIVE nodes run only the datanode process. There are a few key differences in their datanode configuration.
The datanode directories should be marked as ARCHIVE.
The major operation that happens on these datanodes is the move operation from DISK storage, so we need to tune the ARCHIVE datanodes to support move operations efficiently.
There are two parameters that influence move operations on a datanode. dfs.datanode.balance.max.concurrent.moves specifies how many moves are concurrently allowed to or from a datanode. The second parameter is the allowed bandwidth for block moves to or from a datanode.
In our cluster, we had 2000 DISK datanodes moving blocks to 48 ARCHIVE datanodes; the ratio of source DISK nodes to destination ARCHIVE nodes is roughly 40:1.
In our clusters, for DISK datanodes we set the concurrent moves to 10. For ARCHIVE nodes, we set this value to 400, following that 40:1 ratio.
For DISK datanodes we set the bandwidth to 10 MB; for ARCHIVE nodes, we set it to 400 MB.
With these values, we moved around 1 petabyte of data from DISK to ARCHIVE storage in 12 hours without impacting normal cluster operations.
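As a rough sanity check on those numbers (back-of-the-envelope arithmetic, assuming the 48 ARCHIVE nodes are the bottleneck and run near their configured bandwidth), 48 nodes at 400 MB/s for 12 hours lands in the neighborhood of the 1 PB that was moved:

```python
# Back-of-the-envelope estimate of data moved in 12 hours, assuming the
# ARCHIVE side is the bottleneck and runs at its configured bandwidth cap.
archive_nodes = 48
bandwidth_mb_per_sec = 400      # per-node dfs.datanode.balance.bandwidthPerSec
seconds = 12 * 3600             # 12 hours

# MB -> PB using decimal units (1 PB = 1e9 MB)
moved_pb = archive_nodes * bandwidth_mb_per_sec * seconds / 1e9
print(f"{moved_pb:.2f} PB")  # ~0.83 PB at the full bandwidth cap
```

The estimate comes out slightly under 1 PB, which is consistent with the reported figure given that moves do not all run at exactly the cap.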
So far we have moved around 8 petabytes of data to archival storage. Going forward, we plan to derive an archival policy for each dataset. To derive an archival policy, we analyze the HDFS audit logs to determine the dataset's access pattern. Consider a dataset that is accessed heavily in the first 3 months, moderately in the next 6 months, and rarely beyond 9 months. For this dataset, we would define an archival policy where the directory's storage policy is set to HOT initially, changed to WARM for the next 6 months, and set to COLD after 9 months so that all replicas are moved to ARCHIVE storage.
Our archival process has the following steps for each dataset:
identify the data access pattern, derive the archival policy based on it, have a process that sets storage policies according to the derived policy, and run the Mover to enforce the storage policies.
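The policy-setting and movement steps map onto the HDFS CLI. A sketch is below; the path is a hypothetical example, and the commands require a running cluster with tiered storage configured.

```shell
# Set the storage policy derived from the access pattern
# (/data/sessions is an example path)
hdfs storagepolicies -setStoragePolicy -path /data/sessions -policy COLD

# Verify the policy took effect
hdfs storagepolicies -getStoragePolicy -path /data/sessions

# Run the Mover to migrate existing replicas to match the policy
hdfs mover -p /data/sessions
```

Setting the policy only affects placement of the data; the Mover is what actually relocates existing replicas to ARCHIVE storage.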
A few additional points to note about the tiered storage.
The speed of data movement from DISK nodes to ARCHIVE nodes is proportional to the number of DISK and ARCHIVE nodes.
With 2000 DISK nodes and 48 ARCHIVE nodes, we could archive 1 PB of data in 12 hours.
FSCK provides vital information for a given directory. We have enhanced FSCK so that it reports the storage tiers of the data blocks belonging to the directory.
Work on displaying storage tier information in the Namenode UI is in progress.
Summarizing my presentation,
eBay uses tiered storage to store both hot and cold data. Storage costs are reduced by lowering the compute capability of the nodes that store cold data.
HDFS tiered storage is operated via storage types and storage policies associated with directories.
An archival policy needs to be set up for each dataset based on its access pattern. Audit logs are useful for identifying access patterns, so collect them and store them for a long time. The archival process should change storage policies on directories to trigger the movement of their data.
FSCK provides information about tiered storage.