Reduce Storage Costs by 5x
Using The New HDFS Tiered Storage Feature
Tsz-Wo Nicholas Sze @Hortonworks
Benoy Antony @eBay
June 9, 2015
About Me
• Software Engineer @Hortonworks
– HDFS development lead
• PMC Member at Apache Hadoop
– Started working on Hadoop in 2007
– One of the most active contributors/committers of HDFS
• Used Hadoop to compute Pi at the two-quadrillionth (2×10^15th) bit
– It was a world record.
• Received Ph.D. from the University of Maryland, College Park
π = 3.141592654…
Previous Storage Model
• Datanode: a single logical storage consisting of
one or more physical storage media
New Storage Model
• Datanode: a collection of storages,
each storage corresponding to a
physical storage medium.
Advantages of Multiple Storage Model
• Heterogeneity
– Individual storages may have different characteristics: capacity, speed, cost, etc.
• Flexibility
– Clients can write to a particular storage type
• Scalability
– A datanode can support more storage drives (>100) since storage reports are sent individually.
Storage Types
• ARCHIVE: archival drive
– slow but cheap
• DISK: hard disk drive
– default type
• SSD: solid state drive
– fast but expensive
• RAM_DISK: memory drive
– capacity is limited
– data could be lazily persisted
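For orientation, a datanode declares each volume's storage type with a prefix on dfs.datanode.data.dir in hdfs-site.xml; directories without a prefix default to DISK. A minimal sketch with hypothetical paths:

    <property>
      <name>dfs.datanode.data.dir</name>
      <!-- each directory is tagged with its storage type; untagged means DISK -->
      <value>[DISK]/grid/0/data,[SSD]/grid/ssd/data,[RAM_DISK]/mnt/ramdisk/data</value>
    </property>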
Block Storage Policy
• Describe how to store a data block in HDFS
– When a new file/block is created, how to store the data?
– After a file/block is created, should it be replicated to some other storages?
– When there is not enough space in a particular storage type,
• Should the operation fail?
• Fall back to another storage type?
Block Storage Policies
• HOT
– Store all replicas to DISK
– Fall back to ARCHIVE if necessary.
– Default policy
• WARM
– Store a replica to DISK and the other replicas to ARCHIVE
– May fall back to DISK or ARCHIVE
– For data that needs to be archived soon
• COLD
– Store all replicas to ARCHIVE
– Does not fall back to any other storage type
– For archival data
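As a concrete illustration, policies are assigned per file or directory with the hdfs storagepolicies subcommand (available since Hadoop 2.6); the /data/archive path below is hypothetical:

    # List the built-in policies (HOT, WARM, COLD, ...)
    hdfs storagepolicies -listPolicies

    # Pin a directory to COLD so new replicas go to ARCHIVE storage
    hdfs storagepolicies -setStoragePolicy -path /data/archive -policy COLD

    # Verify which policy is in effect
    hdfs storagepolicies -getStoragePolicy -path /data/archive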
Example 1
More Block Storage Policies
• ONE_SSD
– Store a replica to SSD and the other replicas to DISK
– May fall back to SSD or DISK
– For data that needs to be accessed locally right after being written
• ALL_SSD
– Store all replicas to SSD
– Fall back to DISK if necessary
– For frequently accessed data
• LAZY_PERSIST
– Store only one replica to memory
– The replica stored in memory is lazily persisted to DISK later on
– For transient data
Example 2
Enforcing Block Storage Requirements
• When a block is created,
– the block placement policy determines how to store the replicas.
• After blocks are stored in HDFS, datanodes or the network may fail, so that
– Blocks may become under/over replicated
– Blocks may fail to satisfy rack requirement
– Blocks may fail to satisfy storage policy requirement
– Cluster may become imbalanced (some datanodes may be over utilized while some are under utilized)
HDFS needs to fix them!
Maintaining Block Storage Requirements
• Replication Monitor
– Copy/delete replicas to satisfy replication and rack requirements
– Won't move replicas around to fulfill storage policy requirements or to balance the cluster.
• Balancer
– Move replicas around to balance the cluster
– Preserves replication and rack requirements
– Preserves storage type: always moves a replica from one storage to another storage of the same type.
• Mover (a new tool)
– Moves replicas around to satisfy storage policy requirements
– Preserves replication and rack requirements
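A minimal sketch of how the two tools are invoked from the command line; the threshold and path are illustrative values, not from the talk:

    # Even out datanode utilization to within 10% of the cluster average;
    # replicas only move between storages of the same type
    hdfs balancer -threshold 10

    # Move existing replicas until they satisfy the storage policy of each
    # file under the given path (the Mover tool ships with Hadoop 2.6+)
    hdfs mover -p /data/archive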
Feature Development
• HDFS-2832: Heterogeneous Storage
– 66 subtasks (Mainly contributed by Hortonworks)
– Committed to Hadoop 2.3.0
• HDFS-6584: Archival Storage
– 33 subtasks (Mainly contributed by Hortonworks)
– Committed to Hadoop 2.6.0
• HDFS-7584: Quota Support for Storage Types
– 11 subtasks (Mainly contributed by Hortonworks)
– Committed to Hadoop 2.7.0
• HDFS-7197: Enhancements to Mover
– 10 subtasks (Mainly contributed by eBay)
HDFS Tiered Storage @eBay
About Me
• Software Engineer @eBay
• Apache Hadoop Committer
– working on Hadoop since 2011
• Focus on HDFS & Security
Outline
• Requirements for Archival Storage
• Setting up Tiered Storage
• Archival Process
eBay’s Apollo Cluster HDFS Capacity
[Bar chart: Apollo cluster HDFS capacity in petabytes, growing from about 20 PB in 2012 to 36 PB in 2013 and 40 PB in 2014 (Y axis 0–45).]
Cluster is getting Full
Data at eBay’ Clusters
• What’s Stored ?
– Different types of User behavior Data
– Experimentation results
– Buyer and Seller data
• How is it used ?
Access Pattern over Months: Session
[Chart: monthly access counts for the session dataset over months 1–17, starting at about 35,000 in month 1, dropping to 30,000 in month 2 and 7,500 in month 3, and approaching zero from month 8 onwards (Y axis 0–40,000).]
Many options for Archival Storage
• Create another HDFS cluster
– Move data using DistCp
• Cheap non-HDFS storage
• Tiered Storage
– DISK
– ARCHIVE
Benefits of using Tiered Storage
• Data resides in one cluster
– No change for applications
– Security of datasets remains unchanged
– Audit Log History can be aggregated from one cluster
– Data can be moved back to Compute nodes upon demand
– Operational simplicity
Tiered Storage on Apollo Cluster
[Bar chart: Apollo cluster capacity in petabytes, 2012–2014, with the 2014 bar adding the new 10 PB ARCHIVE tier on top of 40 PB of DISK (Y axis 0–60).]
Disk vs. Archival Node
                        Regular        Archive
Storage capacity        40 TB          210 TB
# of drives             12             60
# of processing units   32             4
Main memory             128 GB         64 GB
Cost / GB               X¢ (varies)    X/5¢ (varies)
Runs Node Manager       Yes            No
Archival Data Node Configuration
• dfs.datanode.data.dir
– [ARCHIVE]/hadoop/1/data, [ARCHIVE]/hadoop/2/data, …
• dfs.datanode.balance.max.concurrent.moves
– 400 (10 for DISK)
• dfs.datanode.balance.bandwidthPerSec
– 400 MB (10 MB for DISK)
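Expressed as hdfs-site.xml on an ARCHIVE datanode, the settings above would look roughly like the sketch below; note that dfs.datanode.balance.bandwidthPerSec is specified in bytes per second, so 400 MB/s is written as 419430400:

    <property>
      <name>dfs.datanode.data.dir</name>
      <value>[ARCHIVE]/hadoop/1/data,[ARCHIVE]/hadoop/2/data</value>
    </property>
    <property>
      <!-- 40x the DISK-node value of 10, matching the 1:40 node ratio -->
      <name>dfs.datanode.balance.max.concurrent.moves</name>
      <value>400</value>
    </property>
    <property>
      <!-- 400 MB/s in bytes per second (10 MB/s on DISK nodes) -->
      <name>dfs.datanode.balance.bandwidthPerSec</name>
      <value>419430400</value>
    </property>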
Archival Process
• For each dataset
– Scan HDFS Audit logs to identify the data access pattern
– Derive an Archival Policy (example in the table below)
– Set up storage policies on subdirectories based on the Archival Policy
– Run Mover for the dataset (a command-line sketch follows the table)
Time            Storage Policy    Block Placement
< 90 days       HOT               3 replicas in DISK
90–270 days     WARM              1 DISK, 2 ARCHIVE
> 270 days      COLD              3 replicas in ARCHIVE
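Put together, one pass of the archival process could look like the sketch below; the dataset path and monthly partition layout are hypothetical, and in practice the policy changes would be driven by the ages mined from the audit logs:

    # Partitions older than 270 days: all three replicas to ARCHIVE
    hdfs storagepolicies -setStoragePolicy -path /data/sessions/2014-06 -policy COLD

    # Partitions between 90 and 270 days old: 1 DISK + 2 ARCHIVE replicas
    hdfs storagepolicies -setStoragePolicy -path /data/sessions/2014-12 -policy WARM

    # Physically move replicas until they match the new policies
    hdfs mover -p /data/sessions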
Notes
• The Mover speed goes up with more sources and destinations
– With 2000 DISK nodes moving data to 48 ARCHIVE nodes, 1 PB gets moved in 12 hours
• FSCK displays information on storage types and storage policies for a directory.
• Displaying detailed storage tier information on the Namenode UI is in progress.
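As a rough sanity check on that figure: 1 PB in 12 hours is about 23 GB/s of aggregate movement, which spread over the 48 ARCHIVE destinations is on the order of 500 MB/s per node, in line with the 400 MB/s per-node bandwidth configured earlier.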
Summary
• eBay uses Tiered Storage to store rarely used data
– Reduces storage costs by using big storage with limited computing
• Tiered storage can be operated using storage types and storage policies.
• An Archival Policy needs to be set up for datasets based on their access pattern.
– Keep Audit Logs to analyze the access pattern.
• Tiered storage information is available via FSCK.
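For example, fsck in Hadoop 2.7+ accepts a -storagepolicies flag that prints a block-placement summary against the configured policies; the path here is illustrative:

    # Summarize how blocks under a path are placed relative to their policies
    hdfs fsck /data/sessions -storagepolicies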
Thank you!
Questions?


Editor's Notes

1. In this presentation, I will first describe the motivation behind archiving some of our datasets and the benefits of using tiered storage to archive the data. Then we will go over the setup of tiered storage on one of our clusters, and I'll describe our archival process for identifying the data to archive and moving it to archival storage. Let us first review eBay's requirements for archival storage by taking a look at our clusters.
2. The Apollo cluster was set up in 2012 with around 20 petabytes of storage. Then we needed more storage and computing, and increased the capacity to 36 petabytes in 2013. Towards the end of 2013 we were getting close to capacity again, so we added some more datanodes and got the capacity up to 40 petabytes.
3. Our computing capacity requirements for the Apollo cluster were satisfied at this point, but the data kept growing, so it was evident that we needed to increase the storage capacity to store the existing data as well as the new data coming in. So far, to increase capacity, we had added machines which have both storage and computing power. This time we wanted to take advantage of the fact that we needed more storage but no more compute, and to reduce costs by paying only for storage. We decided to look at the data stored in our cluster to identify the data which is rarely used.
4. A major class of data that we store in the Apollo cluster is user behavior data. There are different types of user behavior data: experimentation data, buyer and seller related data, etc. These data are used by various teams at eBay, who write their own MapReduce jobs, Hive queries, Spark jobs, etc. to aggregate and summarize the data to derive insights or make decisions. For most of the datasets mentioned here, we expected that the data is heavily used initially and then the usage goes down. We decided to verify this assumption by scanning our audit logs, which show who accessed which data and when. At eBay, we store our HDFS audit logs in HDFS itself and keep them for a long time.
5. This chart shows the access pattern of files under a dataset which has information about user sessions. The session data was stored in HDFS in early January 2014. The X axis shows the age of the data in months and the Y axis shows the access count. In the first month there were 35K accesses for files under the directory. This dropped to 30K in February, and in March it was around 7.5K. From the 8th month onwards there was not much access. For this data there is not much computation happening after 8 months, so there is no benefit in keeping it on a node which has computing power; we can move it to a datanode which has little computing power.
6. There are many options for archival storage. We could create an HDFS-only cluster using machines with low computing power and move the data using DistCp. Another option is to have cheap non-HDFS storage and copy the data from the original cluster to this external storage. But if you want to keep the data in the same cluster, HDFS tiered storage is the right choice. HDFS tiered storage introduces the concept of storage types: there is a DISK storage type and an ARCHIVE storage type. The default DISK storage type refers to normal storage, where the datanode has both compute and storage so that we can take advantage of data locality for data processing. The ARCHIVE storage type refers to datanodes which do not have computing power. Since the computing power is low, we pay mainly for storage and not for computing, which lowers the cost per GB for an archival node compared to a DISK node. Once you have DISK and ARCHIVE storage types in the cluster, you archive data by moving blocks from DISK storage to ARCHIVE storage.
7. There are many benefits to archiving with tiered storage, most of them due to the fact that the dataset continues to reside in HDFS on the original cluster. The applications which operate on the data do not need to change at all: the HDFS location remains the same, and the attributes on the dataset, like ACLs and permissions, remain the same. The audit logs continue to record information as if nothing had changed for these datasets, so the audit log history contains the full history from the time the data was stored until you delete it. If the access pattern changes and archived data gets heavily used, we can move the data back to compute nodes to achieve data locality during computation. There is also operational simplicity, since you do not need to manage a separate form of archival storage.
8. Setting up tiered storage on our Apollo cluster: as mentioned, we already had 40 PB of storage with storage type DISK, comprising 2000 datanodes. We added a new set of datanodes which are heavy in storage but have limited computing power: 48 nodes with a total storage capacity of 10 petabytes. This formed our tiered storage: 40 petabytes of DISK and 10 petabytes of ARCHIVE storage.
9. Let us do a comparison between a DISK datanode and an ARCHIVE datanode. Our regular DISK nodes have a capacity ranging from 20 TB to 40 TB; we have a lot of 20 TB nodes and a few 40 TB nodes, but the details in the slide refer to the 40 TB datanodes since they are the recent ones. The ARCHIVE nodes have 210 TB of storage capacity on one datanode. The DISK nodes have 12 drives whereas the ARCHIVE nodes have 60. The regular DISK nodes run computations on the data and have a total of 32 processing units. The ARCHIVE nodes store data only; processing power is needed mainly to run the operating system and the datanode process, so they have only 4 processing units. The regular datanodes have 128 GB of main memory whereas the ARCHIVE nodes have only 64 GB. The cost per GB of archival storage is about one fifth of that of DISK storage. There is no Node Manager running on the archival datanodes since the data is rarely accessed; they run only the datanode process, while the DISK nodes run both the Node Manager (to provision containers) and the datanode.
10. The archival nodes run only the datanode process, and there are a few key differences in the datanode configuration. The datanode directories should be marked as ARCHIVE. The major operation that happens on these datanodes is the move operation from DISK storage, so we need to tune the archival datanodes to support move operations efficiently. Two parameters influence move operations on a datanode: dfs.datanode.balance.max.concurrent.moves specifies how many moves are concurrently allowed to or from a datanode, and the second parameter is the allowed bandwidth for block moves to or from a datanode. In our cluster we had 2000 DISK datanodes moving blocks to 48 ARCHIVE datanodes, so the ratio between source DISK nodes and destination ARCHIVE nodes is roughly 1:40. For DISK datanodes we set concurrent moves to 10; for ARCHIVE nodes we set it to 400, following the 1:40 ratio. Likewise, we set the bandwidth to 10 MB for DISK datanodes and 400 MB for archival nodes. With these values we moved around 1 petabyte of data from DISK to ARCHIVE storage in 12 hours without impacting normal cluster operations.
11. So far we have moved around 8 petabytes of data to archival storage. Going forward, we plan to derive an archival policy for each dataset. To derive an archival policy, we analyze the HDFS audit logs to determine the data access pattern for the dataset. Consider a dataset which is accessed heavily in the initial 3 months, moderately in the next 6 months, and rarely beyond 9 months. For this dataset, we will define an archival policy where the directory's storage policy is set to HOT initially, changed to WARM for the next 6 months, and set to COLD after 9 months so that all replicas are moved to ARCHIVE storage. Our archival process has the following steps for each dataset: identify the data access pattern, derive the archival policy based on it, have a process which sets the storage policies according to the derived archival policy, and run Mover to enforce the storage policies.
12. A few additional points about tiered storage. The speed of data movement from DISK nodes to ARCHIVE nodes is proportional to the number of DISK and ARCHIVE nodes; with 2000 DISK nodes and 48 ARCHIVE nodes, we could archive 1 PB of data in 12 hours. FSCK provides vital information for a given directory, and we have enhanced it to report the storage tiers of the data blocks belonging to the directory. Work on displaying storage tier information in the Namenode UI is in progress.
13. Summarizing my presentation: eBay uses tiered storage to store hot data and cold data, and reduces storage costs by lowering the compute capability of the nodes which store cold data. HDFS tiered storage is operated via storage types and storage policies associated with directories. An archival policy needs to be set up for each dataset based on its access pattern; audit logs are useful for identifying the access patterns, so collect them and store them for a long time. The archival process should change storage policies on directories to trigger data movement, and FSCK provides information about tiered storage.