HDFS Tiered Storage:
Mounting Object Stores
in HDFS
Ewan Higgs
Thomas Demoor
4/28/2018©2018 Western Digital Corporation or its affiliates. All rights reserved.
Thomas Demoor
• Product Owner Object Storage Access
– S3-compatible features (versioning, lifecycle management, async. replication)
– Hadoop integration
• Apache Hadoop contributions:
– S3a filesystem improvements (Hadoop 2.6–2.8+)
– Object store optimizations: rename-free committer (HADOOP-13786)
– HDFS Tiering: Provided Storage (HDFS-9806)
• Ex: Queueing Theory PhD @ Ghent Uni, Belgium
• Tweets @thodemoor
Ewan Higgs
• Software Architect
– Hadoop integration
• Apache Hadoop contributions:
– HDFS Tiering: Provided Storage
• Ex:
– Embedded systems
– Managing market data in finance
– HPC Systems administrator
HDFS + Object Store
• External Tier for Hadoop: Object Store or HDFS or …
• Your trusted HDFS workflow
– Hadoop apps interact with HDFS
– Data tiering happens asynchronously in the background
– Admin controls data placement through the usual CLI
• HDFS Storage Policy
– DISK, SSD, RAM, ARCHIVE, PROVIDED
• Read path (HDFS-9806) merged in December 🎉
– Already being used in production at Microsoft
Demo
• AS AN Administrator,
• I CAN configure HDFS with an S3a backend with an appropriate Storage Policy
– hdfs storagepolicies -setStoragePolicy -policy PROVIDED -path /var/log
– hdfs syncservice -create -backupOnly -name activescale /var/log s3a://hadoop-logs/
• SO THAT when a user copies files to HDFS they are asynchronously copied to the synchronization endpoint
WIP – Mock Demo
• AS AN Administrator
• I CAN set the Storage Policy to be PROVIDED_ONLY
– hdfs storagepolicies -setStoragePolicy -policy PROVIDED_ONLY -path /var/log
• SO THAT data is no longer in the Datanode but is transparently read through from the synchronization endpoint on access.
S3a: quick recap
• S3a = Hadoop Compatible FileSystem
– (cf. wasb://, adl://, gcs://, ceph://, …)
– Direct IO between Hadoop apps and object store: s3a://bucket/object
– Scalable & Resilient: NameNode functions taken up by object store
– Functional since Hadoop 2.7, improvements in 2.8 & ongoing (2x write)
• Our contributions:
– Non-AWS endpoints
– YARN/Hadoop2
– Streaming upload
– …
• S3a in action:
– Hortonworks Data Cloud: https://hortonworks.com/products/cloud/aws/
– Databricks Cloud: https://databricks.com/product/databricks
– Backup entire HDFS clusters
– Archive cold data from HDFS clusters
[Diagram labels: APP, HADOOP CLUSTER, READ, WRITE]
Hortonworks evolution of Hadoop+Cloud
Evolution towards Cloud Storage as the Primary Data Lake
[Diagram, three stages. Stage 1: Application, HDFS, Input, Output, Backup, Restore. Stage 2: Application, HDFS, Input, Output, Copy. Stage 3: Application, HDFS, Input, Output, tmp.]
Hortonworks evolution of Hadoop+Cloud: Next
Evolution towards Cloud Storage as the Primary Data Lake
[Diagram, four stages. Stage 1: Application, HDFS, Input, Output, Backup, Restore. Stage 2: Application, HDFS, Input, Output, Copy. Stage 3: Application, HDFS, Input, Output, tmp. Stage 4: Application, HDFS, Input, Output, Write Through, Load On-Demand.]
WD Active Archive Object Storage
• Western Digital moving up the stack
• Scale-out object storage system for Private & Public Cloud
• Key features:
– Compatible with Amazon S3 API
– Strong consistency (not eventual!)
– Erasure coding for efficient storage
• Scale:
– Petabytes per rack
– Billions of objects per rack
– Linear scalability in # of racks
• More info at http://www.hgst.com/products/systems
S3a doesn’t fit all use cases
• Hadoop-compatible does not mean identical to HDFS
– S3a is not a FileSystem
• no notion of directories
• no support for flush/append (breaks Apache HBase: KV-store)
– No Data Locality:
• less performant for hot/real-time data
– Hadoop admin tools require HDFS:
• permissions/quota/security/…
• Goal: integrate better with HDFS
– Data-locality for hot data + object storage for cold data
– Offer familiar HDFS abstractions, admin tools, …
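To make "no notion of directories" concrete, here is a minimal sketch that models an object store as a flat key-to-bytes map. The names `store`, `list_prefix` and `rename_prefix` are hypothetical and are not part of S3a or Hadoop; the point is only that a "directory" is a shared key prefix, so renaming one touches every object under it.

```python
# Hypothetical in-memory model of an object store: a flat map from key to bytes.
store = {
    "logs/2018/04/a.log": b"a",
    "logs/2018/04/b.log": b"b",
    "tmp/x": b"x",
}


def list_prefix(store, prefix):
    """A 'directory listing' is just a filter over keys sharing a prefix."""
    return sorted(key for key in store if key.startswith(prefix))


def rename_prefix(store, src, dst):
    """A 'directory rename' must move every object under the prefix.

    Returns the number of objects touched, to show that the cost grows with
    the number of objects instead of being a single metadata update.
    """
    moved = 0
    for key in list_prefix(store, src):
        store[dst + key[len(src):]] = store.pop(key)
        moved += 1
    return moved
```

Calling `rename_prefix(store, "logs/", "archive/")` on the sample data moves two objects, one per key, which is the kind of cost a rename-free committer tries to avoid.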
Solution: “Mount” remote storage in HDFS
• Use HDFS to manage remote storage
[Diagram labels: APP, HADOOP CLUSTER, HDFS, READ, WRITE]
• Use HDFS to manage remote storage
– HDFS coordinates reads/writes to remote store
– Mount remote store as a PROVIDED tier in HDFS
• Details later in the talk
– Set StoragePolicy to move data between the tiers
[Diagram: HDFS namespace (/ with a, b, c) and remote namespace (/ with d, e, f); the remote namespace is mounted at mount point c. APP in the HADOOP CLUSTER reads and writes through HDFS; labels WRITE THROUGH and LOAD ON-DEMAND connect HDFS to the REMOTE STORE. Alias Map: HDFS Block -> Remote location.]
Best of Both Worlds
• HDFS + S3A/WASB/…
• HDFS for streaming append workloads and data locality
• Object Storage for dense resilient storage of arbitrarily large scale
• Can also do HDFS on HDFS
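The "Alias Map: HDFS Block -> Remote location" idea can be sketched as a lookup table from block ID to a byte range in a remote object. This is an illustration only: `RemoteRegion`, `build_alias_map` and `read_block` are hypothetical names, and the real AliasMap from HDFS-9806 may store different fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteRegion:
    """Where the bytes of one HDFS block live in the remote store (hypothetical)."""
    uri: str      # remote object, e.g. an s3a:// URI
    offset: int   # byte offset of the block inside the object
    length: int   # number of bytes in the block


def build_alias_map(uri, object_size, block_size, first_block_id):
    """Split one remote object into fixed-size blocks: block id -> RemoteRegion."""
    alias_map = {}
    offset, block_id = 0, first_block_id
    while offset < object_size:
        length = min(block_size, object_size - offset)
        alias_map[block_id] = RemoteRegion(uri, offset, length)
        offset += length
        block_id += 1
    return alias_map


def read_block(alias_map, block_id, fetch):
    """Resolve a block through the alias map and fetch only that byte range.

    `fetch(uri, offset, length)` is supplied by the caller and stands in for
    a ranged read against the remote store.
    """
    region = alias_map[block_id]
    return fetch(region.uri, region.offset, region.length)
```

With this shape, a read of a PROVIDED block never needs the whole object: the Datanode serving the block only asks the remote store for `length` bytes starting at `offset`.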
Caching
• This design means HDFS could be used as a cache for Object Storage
• Similar to Alluxio, Ignite, Rubix?
– These are all caching systems that dispatch between storage systems horizontally
– We want to tier the storage systems vertically
– We want HDFS to remain the source of truth for Hadoop
[Diagram, two layouts. First: three Compute nodes, each with Alluxio, alongside three HDFS nodes and a cloud store. Second: three Compute nodes, three HDFS nodes and a cloud store, without Alluxio.]
Requirements
• Preserve file-object mapping
– AliasMap (last year’s talk): used to synchronize namespaces
– Multiple Datanodes collaborate to move blocks which together form a file/object in the destination system
• Minimize impact on frontend traffic / Efficient data transfer
– Obvious: read all blocks into a single Datanode to reconstruct a file before transferring
– Efficient: copy block by block directly out of the cluster using
• S3: multipart upload
• WASB: append blobs
• HDFS: tmpdir + concat
• Deployment model
– Running in the Namenode is easy to deploy but adds resource pressure
– An external service is more difficult to deploy for some sites but reduces resource pressure
• See AliasMap from HDFS-9806 and SPS service from HDFS-10285
• Would like to re-use SPS protocols
Tech Details
• Applications write to HDFS
– First to DISK, then the SyncService asynchronously copies to the synchronization endpoint
– When files have been copied, the extraneous disk replicas can be removed
[Diagram: Client, Namenode, three Datanodes holding the Blocks of a File, and an External Store. Labels: Write File; Multipart Init; Multipart PutPart; Multipart Complete; Extraneous Blocks are removed.]
Satisfy Synchronization: How it works
• MountManager
– Manages all the local mount points
• SyncServiceSatisfier
– Service in Namenode that periodically creates a diff by comparing Snapshots
• PhasedPlanFactory
– Generates the plan for ordering the operations in the diff
• E.g. Create, Modify, Rename, Delete of Files and Directories
• SyncTask Tracker
– Runs the work and tracks it
• Namespace operations (MetadataSyncTask) run from Namenode
• Data operations (BlockSyncTask) run in Datanode (heartbeat mechanism)
Tech Details – Namenode Components
[Diagram: MountManager, SyncServiceSatisfier, PhasedPlanFactory, SyncTask Tracker, SnapshotDiffReport]
Example Diff – Simple Case
Simple Case - New dirs; new files
Commands
#given /basic-test
mkdir -p /basic-test/a/b/c/d
touch /basic-test/a/b/c/d/f1.bin
touch /basic-test/f1.bin
SnapshotDiffReport
M d .
+ d ./a
+ f ./f1.bin
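As a rough illustration of how a diff in the format shown on this slide could be turned into ordered work, the sketch below parses the three-column lines (operation, type, path) and orders the creates so that directories come before files. `DiffEntry`, `parse_diff` and `creation_order` are hypothetical names; the actual SnapshotDiffReport classes in HDFS are not used here, and the ordering rule is only an assumption based on the "create directories, then create files" steps of the PhasedPlan.

```python
from typing import NamedTuple, Optional


class DiffEntry(NamedTuple):
    op: str                 # "+", "-", "M" or "R"
    kind: str               # "d" for directory, "f" for file
    path: str
    target: Optional[str]   # only set for renames ("R")


def parse_diff(report):
    """Parse lines such as '+ d ./a' or 'R f ./a.bin -> ./b.bin'."""
    entries = []
    for line in report.strip().splitlines():
        op, kind, rest = line.split(None, 2)
        path, _, target = rest.partition(" -> ")
        entries.append(DiffEntry(op, kind, path, target or None))
    return entries


def creation_order(entries):
    """Creates only: directories first, then files; shallow paths before deep ones."""
    creates = [e for e in entries if e.op == "+"]
    return sorted(creates, key=lambda e: (e.kind != "d", e.path.count("/"), e.path))
```

For the simple case, `creation_order(parse_diff("M d .\n+ d ./a\n+ f ./f1.bin"))` yields the directory `./a` followed by the file `./f1.bin`.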
Example Diff – Harder Case
Harder Case - Cycle
Commands
#given /swap-test/a.bin
#given /swap-test/b.bin
mv /swap-test/a.bin /swap-test/tmp
mv /swap-test/b.bin /swap-test/a.bin
mv /swap-test/tmp /swap-test/b.bin
SnapshotDiffReport
M d .
R f ./a.bin -> ./b.bin
R f ./b.bin -> ./a.bin
Use solution in DistCpSync for PhasedPlan
• DistCpSync Algorithm:
1. Rewrite Target Destinations based on other Renames
2. Renames and Deletes to temp directory
3. Renames to final location (mkdirs if directory doesn’t exist yet)
4. Drop temp directory (and the deletes with it)
5. Creates and Modifies
• PhasedPlan Algorithm:
1. Rewrite Target Destinations based on other Renames
2. Renames to temp directory
3. Delete Files and Directories
4. Renames to final location (mkdirs if directory doesn’t exist yet)
5. Drop temp directory
6. Create Directories on end point
7. Create Files on end point
PhasedPlan (for the Harder Case cycle above)
M d .
R f ./a.bin -> ./tmp/b.bin
R f ./b.bin -> ./tmp/a.bin
R f ./tmp/b.bin -> ./b.bin
R f ./tmp/a.bin -> ./a.bin
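The rename phases can be sketched as follows: every rename first goes into a temp directory, and only then to its final location, so a cycle such as swapping a.bin and b.bin never overwrites a live path. `phased_renames` and `apply_plan` are hypothetical helpers, not the PhasedPlanFactory code; the sketch covers only the two rename phases and assumes destination basenames are unique within one plan.

```python
def phased_renames(renames, tmp_dir="./tmp"):
    """Turn (src, dst) renames into two phases: src -> tmp, then tmp -> dst.

    Assumption: the staged name is the destination's basename inside tmp_dir,
    so two destinations with the same basename would collide.
    """
    to_tmp, to_final = [], []
    for src, dst in renames:
        staged = tmp_dir + "/" + dst.rsplit("/", 1)[-1]
        to_tmp.append((src, staged))
        to_final.append((staged, dst))
    return to_tmp + to_final


def apply_plan(plan, namespace):
    """Apply renames in order to a {path: content} dict; refuse to overwrite."""
    for src, dst in plan:
        if dst in namespace:
            raise FileExistsError(dst)
        namespace[dst] = namespace.pop(src)
    return namespace
```

For the swap, `phased_renames([("./a.bin", "./b.bin"), ("./b.bin", "./a.bin")])` produces the same four renames as the PhasedPlan listing, whereas applying the two original renames directly fails on the first step because `./b.bin` still exists.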
MultipartUploader
• Common concept in Object Storage
– Supported by S3, WASB
• Usage in Hadoop
– S3A uses it – see Steve Loughran’s talk
– New to HDFS – HDFS-13186
• Three phases
– UploadHandle initMultipart(Path filePath)
– PartHandle putPart(Path filePath, InputStream inputStream, int partNumber, UploadHandle uploadId, long lengthInBytes)
– void complete(Path filePath, List<Pair<Integer, PartHandle>> handles, UploadHandle multipartUploadId)
• Benefits:
– Object/File Isolation – you only see the results when it’s done
– Can be written in parallel across multiple nodes
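The three phases and the two benefits can be illustrated with a toy in-memory store. This is not the Hadoop MultipartUploader API: `InMemoryMultipartStore` and `upload_blocks` are hypothetical, the method names only mirror the signatures listed on the slide, and threads stand in for Datanodes uploading their blocks in parallel.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class InMemoryMultipartStore:
    """Toy stand-in for an object store with multipart upload (hypothetical)."""

    def __init__(self):
        self.objects = {}    # completed objects, visible to readers
        self._pending = {}   # upload id -> {part number: bytes}

    def init_multipart(self, key):
        upload_id = uuid.uuid4().hex
        self._pending[upload_id] = {}
        return upload_id

    def put_part(self, key, data, part_number, upload_id):
        self._pending[upload_id][part_number] = data
        return (part_number, len(data))   # opaque part handle

    def complete(self, key, handles, upload_id):
        # Parts are assembled by part number, regardless of arrival order.
        parts = self._pending.pop(upload_id)
        self.objects[key] = b"".join(parts[n] for n, _ in sorted(handles))


def upload_blocks(store, key, blocks):
    """Upload each block as one part, in parallel, then complete the object."""
    upload_id = store.init_multipart(key)
    with ThreadPoolExecutor(max_workers=4) as pool:
        handles = list(pool.map(
            lambda item: store.put_part(key, item[1], item[0], upload_id),
            enumerate(blocks, start=1)))
    store.complete(key, handles, upload_id)
    return handles
```

Until `complete` runs, the key is absent from `store.objects`, which is the isolation property; because parts are keyed by number, the order in which workers finish does not affect the assembled object.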
MultipartUploader in SyncService
Namenode work reduction
• All this work requires a synchronization point
• Namenode is the obvious place
• Namenode is too busy in large systems
• Optionally push SyncService to external service
– Precedents: AliasMap, StoragePolicySatisfier
Resources
• Tiered Storage HDFS-12090 [issues.apache.org]
– Design documentation
– List of subtasks, lots of linked tickets – take one!
– Discussion of scope, implementation, and feedback
• Joint work Microsoft – Western Digital
– {thomas.demoor, ewan.higgs}@wdc.com
– {cdoug,vijala}@microsoft.com
Thanks!
Virajith Jalaparthi, Chris Douglas, Ewan Higgs, Kasper Janssens, Bert Verslyppe, Hendrik Depauw, Rakesh R, Daryn Sharp, Steve Loughran, Thomas Demoor, Allen Wittenauer, and more!
Extra Slides
Create Directory
Delete Directory
Rename Directory
Create File
Delete File
Rename File
Modify File

Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Plus de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Dernier

Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Dernier (20)

Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

HDFS tiered storage: mounting object stores in HDFS

  • 1. HDFS Tiered Storage: Mounting Object Stores in HDFS. Ewan Higgs, Thomas Demoor. 4/28/2018. ©2018 Western Digital Corporation or its affiliates. All rights reserved.
  • 2. Thomas Demoor
      • Product Owner, Object Storage Access
          – S3-compatible features (versioning, lifecycle management, async replication)
          – Hadoop integration
      • Apache Hadoop contributions:
          – S3a filesystem improvements (Hadoop 2.6–2.8+)
          – Object store optimizations: rename-free committer (HADOOP-13786)
          – HDFS Tiering: Provided Storage (HDFS-9806)
      • Ex: Queueing Theory PhD @ Ghent Uni, Belgium
      • Tweets @thodemoor
  • 3. Ewan Higgs
      • Software Architect
          – Hadoop integration
      • Apache Hadoop contributions:
          – HDFS Tiering: Provided Storage
      • Ex:
          – Embedded systems
          – Managing market data in finance
          – HPC systems administrator
  • 4. HDFS + Object Store
      • External tier for Hadoop: object store or HDFS or …
      • Your trusted HDFS workflow
          – Hadoop apps interact with HDFS
          – Data tiering happens asynchronously in the background
          – Admin controls data placement through the usual CLI
      • HDFS Storage Policy
          – DISK, SSD, RAM, ARCHIVE, PROVIDED
      • Read path (HDFS-9806) merged in December 2017 🎉
          – Already being used in production at Microsoft.
  • 5. Demo
      • AS AN Administrator,
      • I CAN configure HDFS with an S3a backend with an appropriate Storage Policy
            hdfs storagepolicies -setStoragePolicy -policy PROVIDED -path /var/log
            hdfs syncservice -create -backupOnly -name activescale /var/logs s3a://hadoop-logs/
      • SO THAT when a user copies files to HDFS they are asynchronously copied to the synchronization endpoint
  • 6. WIP – Mock Demo
      • AS AN Administrator
      • I CAN set the Storage Policy to be PROVIDED_ONLY
            hdfs storagepolicies -setStoragePolicy -policy PROVIDED_ONLY -path /var/log
      • SO THAT data is no longer in the Datanode but is transparently read through from the synchronization endpoint on access.
  • 7. S3a: quick recap
      • S3a = Hadoop Compatible FileSystem
          – (cf. wasb://, adl://, gcs://, ceph://, …)
          – Direct IO between Hadoop apps and object store: s3a://bucket/object
          – Scalable & resilient: NameNode functions taken up by object store
          – Functional since Hadoop 2.7, improvements in 2.8 & ongoing (2x write)
      • Our contributions:
          – Non-AWS endpoints
          – YARN/Hadoop2
          – Streaming upload
          – …
      • S3a in action:
          – Hortonworks Data Cloud: https://hortonworks.com/products/cloud/aws/
          – Databricks Cloud: https://databricks.com/product/databricks
          – Backup entire HDFS clusters
          – Archive cold data from HDFS clusters
      • [Diagram: APP in HADOOP CLUSTER, READ/WRITE]
  • 8. Hortonworks evolution of Hadoop+Cloud
      • Evolution towards cloud storage as the primary data lake
      • [Diagram: three stages, each with Application, HDFS, Input and Output; the stages are additionally labelled Backup/Restore, Copy, and tmp]
  • 9. Hortonworks evolution of Hadoop+Cloud: Next
      • Evolution towards cloud storage as the primary data lake
      • [Diagram: the same three stages (Backup/Restore, Copy, tmp) plus a fourth stage labelled Write Through / Load On-Demand]
  • 10. WD Active Archive Object Storage
      • Western Digital moving up the stack
      • Scale-out object storage system for private & public cloud
      • Key features:
          – Compatible with Amazon S3 API
          – Strong consistency (not eventual!)
          – Erasure coding for efficient storage
      • Scale:
          – Petabytes per rack
          – Billions of objects per rack
          – Linear scalability in # of racks
      • More info at http://www.hgst.com/products/systems
  • 11. S3a doesn’t fit all use cases
      • Hadoop-compatible does not mean identical to HDFS
          – S3a is not a FileSystem
              • no notion of directories
              • no support for flush/append (breaks Apache HBase: KV-store)
          – No data locality:
              • less performant for hot/real-time data
          – Hadoop admin tools require HDFS:
              • permissions/quota/security/…
      • Goal: integrate better with HDFS
          – Data locality for hot data + object storage for cold data
          – Offer familiar HDFS abstractions, admin tools, …
      • [Diagram: APP in HADOOP CLUSTER, READ/WRITE]
  • 12. Solution: “Mount” remote storage in HDFS
      • Use HDFS to manage remote storage
          – HDFS coordinates reads/writes to remote store
          – Mount remote store as a PROVIDED tier in HDFS
              • Details later in the talk
          – Set StoragePolicy to move data between the tiers
      • [Diagram: HDFS namespace (/, a, b, …) and remote namespace (/, d, e, f, …); the remote namespace is mounted at mount point c, with d, e, f beneath it. APP in HADOOP CLUSTER does READ/WRITE against HDFS, which does WRITE THROUGH and LOAD ON-DEMAND against the REMOTE STORE. Alias Map: HDFS Block -> Remote location]
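
The Alias Map on slide 12 can be pictured as a lookup table from an HDFS block to a region of a remote object. The following is a minimal sketch in Python, not the HDFS-9806 implementation: the `RemoteRegion` record, the `build_alias_map` helper, the block ids, and the fixed block size are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RemoteRegion:
    """Hypothetical record: where the bytes of one HDFS block live remotely."""
    uri: str
    offset: int
    length: int


def build_alias_map(uri: str, size: int, block_size: int,
                    first_block_id: int) -> Dict[int, RemoteRegion]:
    """Split a remote object of `size` bytes into block-sized regions.

    Block ids are assigned sequentially here purely for illustration.
    """
    alias_map: Dict[int, RemoteRegion] = {}
    offset = 0
    block_id = first_block_id
    while offset < size:
        length = min(block_size, size - offset)
        alias_map[block_id] = RemoteRegion(uri, offset, length)
        offset += length
        block_id += 1
    return alias_map


alias_map = build_alias_map("s3a://hadoop-logs/app.log", size=300,
                            block_size=128, first_block_id=1000)
for block_id, region in sorted(alias_map.items()):
    print(block_id, region)
```

With such a table, a read of a PROVIDED block can be translated into a ranged read against the remote store, which is the "load on-demand" arrow in the diagram.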
  • 13. Best of Both Worlds
      • HDFS + S3A/WASB/…
      • HDFS for streaming append workloads and data locality
      • Object storage for dense, resilient storage of arbitrarily large scale
      • Can also do HDFS on HDFS
  • 14. Caching
      • This design means HDFS could be used as a cache for object storage
      • Similar to Alluxio, Ignite, Rubix?
          – These are all caching systems that dispatch between storage systems horizontally
          – We want to tier the storage systems vertically
          – We want HDFS to remain the source of truth for Hadoop
      • [Diagram: two layouts. One shows Compute, Alluxio, HDFS and ☁️; the other shows Compute, HDFS and ☁️ without Alluxio]
  • 15. Requirements
      • Preserve file-object mapping
          – AliasMap (last year’s talk): used to synchronize namespaces
          – Multiple Datanodes collaborate to move blocks which together form a file/object in the destination system
      • Minimize impact on frontend traffic / efficient data transfer
          – Obvious: read all blocks into a single Datanode to reconstruct a file before transferring
          – Efficient: transfer directly, copying block by block out of the cluster using
              • S3: multipart upload
              • WASB: append blobs
              • HDFS: tmpdir + concat
      • Deployment model
          – Running in the Namenode is easy to deploy but adds resource pressure
          – An external service is more difficult to deploy for some sites but reduces resource pressure
              • See AliasMap from HDFS-9806 and SPS service from HDFS-10825
              • Would like to re-use SPS protocols
  • 16. Tech Details
  • 17. Tech Details
      • Applications write to HDFS
          – First to DISK, then SyncService asynchronously copies to the synchronization endpoint
          – When files have been copied, the extraneous disk replicas can be removed
      • [Diagram: Client, Namenode, three Datanodes holding the Blocks of a File, and an External Store. Labelled steps: Write File, Multipart Init, Multipart PutPart, Multipart Complete. Extraneous Blocks are removed.]
  • 18. Satisfy Synchronization: How it works
      • MountManager
          – Manages all the local mount points
      • SyncServiceSatisfier
          – Service in Namenode that periodically creates a diff by comparing Snapshots
      • PhasedPlanFactory
          – Generates the plan for ordering the operations in the diff
              • E.g. Create, Modify, Rename, Delete of Files and Directories
      • SyncTask Tracker
          – Runs the work and tracks it
              • Namespace operations (MetadataSyncTask) run from Namenode
              • Data operations (BlockSyncTask) run in Datanode (❤️-beat mechanism)
  • 19. Tech Details – Namenode Components
      • [Diagram: MountManager, SyncServiceSatisfier, SnapshotDiffReport, PhasedPlanFactory, SyncTask Tracker]
  • 20. Example Diff – Simple Case: new dirs; new files
      • Commands
            #given /basic-test
            mkdir -p /basic-test/a/b/c
            touch /basic-test/a/b/c/d/f1.bin
            touch /basic-test/f1.bin
      • SnapshotDiffReport
            M d .
            + d ./a
            + f ./f1.bin
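
The report on slide 20 lists only the topmost new entries (`./a`, `./f1.bin`) and marks their parent directory as modified. A rough Python sketch that reproduces that shape from two listings is shown below. It is not the HDFS snapshot diff code: the `snapshot_diff` function and the dict-based snapshots are assumptions, and renames and content modifications are deliberately ignored.

```python
import posixpath
from typing import Dict, List


def snapshot_diff(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Compare two listings mapping relative paths to 'd' (dir) or 'f' (file).

    Only creations and deletions are detected; '.' denotes the snapshot root.
    """
    created = set(after) - set(before)
    deleted = set(before) - set(after)

    def parent(path: str) -> str:
        return posixpath.dirname(path) or "."

    # Report only the topmost entry of each created or deleted subtree.
    top_created = sorted(p for p in created if parent(p) not in created)
    top_deleted = sorted(p for p in deleted if parent(p) not in deleted)
    modified = sorted({parent(p) for p in top_created + top_deleted})

    report = [f"M d {d}" for d in modified]
    report += [f"+ {after[p]} {p}" for p in top_created]
    report += [f"- {before[p]} {p}" for p in top_deleted]
    return report


before = {".": "d"}
after = {
    ".": "d",
    "./a": "d", "./a/b": "d", "./a/b/c": "d", "./a/b/c/d": "d",
    "./a/b/c/d/f1.bin": "f",
    "./f1.bin": "f",
}
report = snapshot_diff(before, after)
print("\n".join(report))
```

Reporting only the topmost created directory keeps the diff small; the synchronization plan then has to walk the new subtree itself to find the files to copy.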
  • 21. Example Diff – Harder Case: cycle
      • Commands
            #given /swap-test/a.bin
            #given /swap-test/b.bin
            mv /swap-test/a.bin /swap-test/tmp
            mv /swap-test/b.bin /swap-test/a.bin
            mv /swap-test/tmp /swap-test/b.bin
      • SnapshotDiffReport
            M d .
            R f ./a.bin -> ./b.bin
            R f ./b.bin -> ./a.bin
  • 22. Use solution in DistCpSync for PhasedPlan
      • DistCpSync algorithm:
          1. Rewrite target destinations based on other renames
          2. Renames and deletes to temp directory
          3. Renames to final location (mkdirs if directory doesn’t exist yet)
          4. Drop temp directory (and the deletes with it)
          5. Creates and modifies
      • PhasedPlan algorithm:
          1. Rewrite target destinations based on other renames
          2. Renames to temp directory
          3. Delete files and directories
          4. Renames to final location (mkdirs if directory doesn’t exist yet)
          5. Drop temp directory
          6. Create directories on end point
          7. Create files on end point
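
The rename phases of the PhasedPlan algorithm on slide 22 (steps 2 and 4) can be sketched in a few lines of Python. This is an illustration under assumptions, not the PhasedPlanFactory code: `phased_renames`, `apply_plan`, the `.tmp` directory name, and the dict used as a namespace are all hypothetical, and the delete and create phases are left out.

```python
from typing import Dict, List, Tuple

Rename = Tuple[str, str]  # (source path, destination path)


def phased_renames(renames: List[Rename], tmp_dir: str) -> List[Rename]:
    """Stage every source in tmp_dir first, then move each to its final name."""
    to_tmp = [(src, f"{tmp_dir}/{i}") for i, (src, _dst) in enumerate(renames)]
    to_final = [(f"{tmp_dir}/{i}", dst) for i, (_src, dst) in enumerate(renames)]
    return to_tmp + to_final


def apply_plan(namespace: Dict[str, str], plan: List[Rename]) -> Dict[str, str]:
    """Apply renames in order to a path -> content mapping, refusing overwrites."""
    ns = dict(namespace)
    for src, dst in plan:
        if dst in ns:
            raise ValueError(f"rename would overwrite {dst}")
        ns[dst] = ns.pop(src)
    return ns


# The swap from the "harder case" diff: a.bin and b.bin exchange names.
swap = [("/swap-test/a.bin", "/swap-test/b.bin"),
        ("/swap-test/b.bin", "/swap-test/a.bin")]
start = {"/swap-test/a.bin": "A", "/swap-test/b.bin": "B"}

plan = phased_renames(swap, "/swap-test/.tmp")
result = apply_plan(start, plan)
print(result)
```

Applying the two renames of the diff directly would fail on the first step, because the destination still exists; routing every rename through the temp directory breaks the cycle at the cost of one extra rename per entry.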
  • 23. Example Diff – Harder Case: cycle
      • Commands
            #given /swap-test/a.bin
            #given /swap-test/b.bin
            mv /swap-test/a.bin /swap-test/tmp
            mv /swap-test/b.bin /swap-test/a.bin
            mv /swap-test/tmp /swap-test/b.bin
      • SnapshotDiffReport
            M d .
            R f ./a.bin -> ./b.bin
            R f ./b.bin -> ./a.bin
      • PhasedPlan
            M d .
            R f ./a.bin -> ./tmp/b.bin
            R f ./b.bin -> ./tmp/a.bin
            R f ./tmp/b.bin -> ./b.bin
            R f ./tmp/a.bin -> ./a.bin
  • 24. Tech Details – Namenode Components
      • [Diagram: MountManager, SyncServiceSatisfier, SnapshotDiffReport, PhasedPlanFactory, SyncTask Tracker]
  • 25. MultipartUploader
      • Common concept in object storage
          – Supported by S3, WASB
      • Usage in Hadoop
          – S3A uses it (see Steve Loughran’s talk)
          – New to HDFS: HDFS-13186
      • Three phases
          – UploadHandle initMultipart(Path filePath)
          – PartHandle putPart(Path filePath, InputStream inputStream, int partNumber, UploadHandle uploadId, long lengthInBytes)
          – void complete(Path filePath, List<Pair<Integer, PartHandle>> handles, UploadHandle multipartUploadId)
      • Benefits:
          – Object/file isolation: you only see the results when it’s done
          – Can be written in parallel across multiple nodes
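
The three phases on slide 25 and the two benefits listed there can be demonstrated with a toy in-memory store. The Python sketch below only mirrors the shape of the slide's signatures; `InMemoryMultipartStore` and its method names are hypothetical stand-ins, not the Hadoop `MultipartUploader` API, and plain strings and tuples replace `UploadHandle` and `PartHandle`.

```python
import uuid
from typing import Dict, List, Tuple

PartRef = Tuple[int, str]  # (part number, upload id): stand-in for a PartHandle


class InMemoryMultipartStore:
    """Toy object store illustrating init / put part / complete."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}               # completed, visible objects
        self._uploads: Dict[str, Dict[int, bytes]] = {}   # upload id -> staged parts

    def init_multipart(self, path: str) -> str:
        upload_id = uuid.uuid4().hex
        self._uploads[upload_id] = {}
        return upload_id

    def put_part(self, path: str, data: bytes, part_number: int,
                 upload_id: str) -> PartRef:
        # Parts are staged privately; nothing is visible under `path` yet.
        self._uploads[upload_id][part_number] = data
        return part_number, upload_id

    def complete(self, path: str, handles: List[PartRef], upload_id: str) -> None:
        parts = self._uploads.pop(upload_id)
        # Assemble in part-number order, regardless of upload order.
        self.objects[path] = b"".join(parts[n] for n, _ in sorted(handles))


store = InMemoryMultipartStore()
path = "s3a://hadoop-logs/app.log"
upload = store.init_multipart(path)
# Parts may arrive out of order, e.g. from different Datanodes.
handles = [
    store.put_part(path, b"world", 2, upload),
    store.put_part(path, b"hello ", 1, upload),
]
visible_before = path in store.objects
store.complete(path, handles, upload)
print(visible_before, store.objects[path])
```

Because each part carries its own number, the parts can be uploaded by different nodes in any order, and the object only becomes visible once `complete` is called.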
  • 26. MultipartUploader in SyncService
  • 27. Namenode work reduction
      • All this work requires a synchronization point
      • Namenode is the obvious place
      • Namenode is too busy in large systems
      • Optionally push SyncService to external service
          – Precedents: AliasMap, StoragePolicySatisfier
  • 28. Resources
      • Tiered Storage HDFS-12090 [issues.apache.org]
          – Design documentation
          – List of subtasks, lots of linked tickets: take one!
          – Discussion of scope, implementation, and feedback
      • Joint work Microsoft – Western Digital
          – {thomas.demoor, ewan.higgs}@wdc.com
          – {cdoug,vijala}@microsoft.com
  • 29. Thanks!
      • Virajith Jalaparthi, Chris Douglas, Ewan Higgs, Kasper Janssens, Bert Verslyppe, Hendrik Depauw, Rakesh R, Daryn Sharp, Steve Loughran, Thomas Demoor, Allen Wittenauer, and more!
  • 30.
  • 31. Extra Slides
  • 32. Create Directory
  • 33. Delete Directory
  • 34. Rename Directory
  • 35. Create File
  • 36. Delete File
  • 37. Rename File
  • 38. Modify File
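
The three MultipartUploader phases listed on slide 25 (init, put parts, complete) can be illustrated with a small in-memory sketch. This is not the Hadoop API: the class `MultipartSketch`, its `String` upload IDs, the `int` part handles, and the `read` helper are all assumptions made so the example is self-contained. It only aims to show the two benefits the slide names: parts can arrive in any order, and nothing is visible until `complete` is called.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for an object store; not the Hadoop API. */
final class MultipartSketch {
    // Parts staged per upload ID; readers cannot see them until complete().
    private final Map<String, Map<Integer, byte[]>> staged = new HashMap<>();
    // Finished objects, keyed by path.
    private final Map<String, byte[]> objects = new HashMap<>();
    private int nextId = 0;

    /** Phase 1: open an upload and hand back an ID for it. */
    String initMultipart(String filePath) {
        String uploadId = filePath + "#" + (nextId++);
        staged.put(uploadId, new TreeMap<>());
        return uploadId;
    }

    /** Phase 2: stage one part; parts may arrive in any order. */
    int putPart(String filePath, InputStream in, int partNumber, String uploadId) {
        Map<Integer, byte[]> parts = staged.get(uploadId);
        if (parts == null) {
            throw new IllegalStateException("unknown upload: " + uploadId);
        }
        parts.put(partNumber, readAll(in));
        return partNumber; // the part number doubles as the part handle here
    }

    /** Phase 3: concatenate the listed parts in part-number order and publish. */
    void complete(String filePath, List<Integer> partNumbers, String uploadId) {
        Map<Integer, byte[]> parts = staged.remove(uploadId);
        if (parts == null) {
            throw new IllegalStateException("unknown upload: " + uploadId);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // TreeMap iterates in ascending part number.
        for (Map.Entry<Integer, byte[]> e : parts.entrySet()) {
            if (partNumbers.contains(e.getKey())) {
                out.write(e.getValue(), 0, e.getValue().length);
            }
        }
        objects.put(filePath, out.toByteArray());
    }

    /** Returns the published object, or null if no upload has completed. */
    byte[] read(String filePath) {
        return objects.get(filePath);
    }

    static InputStream stream(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8));
    }

    private static byte[] readAll(InputStream in) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        MultipartSketch store = new MultipartSketch();
        String path = "/var/logs/app.log";
        String id = store.initMultipart(path);
        store.putPart(path, stream("world"), 2, id);   // part 2 arrives first
        store.putPart(path, stream("hello "), 1, id);
        System.out.println(store.read(path) == null);  // nothing is visible yet
        store.complete(path, Arrays.asList(1, 2), id);
        System.out.println(new String(store.read(path), StandardCharsets.UTF_8));
    }
}
```

In a real deployment each `putPart` call could run on a different node, which is why the part number, not arrival order, decides the final layout.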

Editor's notes

  1. StoragePolicy - cf SSD, DISK, …
  2. hdfs syncservice -create -backupOnly -name activescale /var/logs s3a://hadoop-logs/. hdfs copyFromLocal: subdir logs per day (14 days). Apps interact directly with HDFS. Use aws cli to show objects pop up in backend? hdfs show storage policy.
  3. Show progress of backup, wait to finish. hdfs cat a file which is in provided only (seamless, fast).
  4. S3A recap: difference between HDFS and S3A; use cases where S3A falls down. Best of both worlds: HDFS + S3A/WASB/… Other mixed systems: HDFS + HDFS.
  5. Alluxio: virtual central overlay across all your storages, aka source of truth. We see HDFS as the Hadoop overlay/cache of your data lake / object store. They are the intersection of what all storages can do / mappings that can be made (limited S3 REST API, ...). We get the best of both worlds. Caveat: we are not experts on each of these projects.
  6. https://docs.microsoft.com/en-us/rest/api/storageservices/understanding-block-blobs--append-blobs--and-page-blobs
  7. PhasedPlanFactory - Could refactor DistCpSync to reuse code but it’s undergoing a lot of work for 3.1
  8. While it might seem like you can simply run the commands in a particular order to get the correct results, it’s not that simple!
  9. We can continue to make arbitrary rings. Detecting Cycles is not nice!
  10. Q: Does anyone see a problem here? A: Renaming files for an object store is not pleasant; doing it twice is worse.
  11. We can continue to make arbitrary rings. Detecting Cycles is not nice!
  12. Major design consideration: code in the hadoop-hdfs project needs to make multipart uploads to an object store (e.g. S3A, WASB). So the multipart upload ID or part information cannot resolve to actual data. To solve this, we wrap a byte[] as UploadHandle and PartHandle that contain the actual information required to pass to the underlying API. GCS supports Composite Objects, which are similar to how HDFS will work: it’s a concatenation of existing objects/files.
  13. Tough to see!!
  14. Same as create file because it must be idempotent.
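
Note 12 describes wrapping a byte[] as UploadHandle and PartHandle so that HDFS code can carry backend-specific upload information without understanding it. A minimal sketch of that idea follows. The names `OpaqueHandle` and `S3LikeBackend`, and the choice of a UTF-8 string as the backend's upload ID, are assumptions for illustration only and are not taken from the Hadoop source.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical opaque handle: callers see bytes only, never backend types. */
final class OpaqueHandle {
    private final byte[] bytes;

    private OpaqueHandle(byte[] bytes) {
        this.bytes = bytes.clone(); // defensive copy keeps the handle immutable
    }

    static OpaqueHandle from(byte[] bytes) {
        return new OpaqueHandle(bytes);
    }

    byte[] toByteArray() {
        return bytes.clone(); // never expose the internal array
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof OpaqueHandle
            && Arrays.equals(bytes, ((OpaqueHandle) o).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }
}

/** Hypothetical backend adapter: the only code that knows what the bytes mean. */
final class S3LikeBackend {
    /** Encode the backend's own upload ID into an opaque handle. */
    static OpaqueHandle wrapUploadId(String backendUploadId) {
        return OpaqueHandle.from(backendUploadId.getBytes(StandardCharsets.UTF_8));
    }

    /** Decode the handle again when calling the underlying API. */
    static String unwrapUploadId(OpaqueHandle handle) {
        return new String(handle.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        OpaqueHandle handle = wrapUploadId("upload-123");
        // Generic code can store, compare, and forward the handle as bytes.
        System.out.println(unwrapUploadId(handle));
    }
}
```

Because the handle is just bytes, the generic layer can store it or pass it between nodes, and only the adapter for a given store needs to interpret it.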