Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data alongside cloud object stores, such as Microsoft Azure, and on-premises object stores, such as Western Digital's ActiveScale. In these settings, applications often manage data spread across multiple storage systems or clusters, which requires complex workflows for synchronizing data between filesystems, whether for business continuity planning (BCP) or to support hybrid cloud architectures that meet business goals for durability, performance, and coordination.
To resolve this complexity, HDFS-9806 added a PROVIDED storage tier to HDFS that allows mounting external namespaces, both object stores and other HDFS clusters. Building on this functionality, remote namespaces can now be kept synchronized with HDFS: writes to remote storage happen asynchronously, while a local application that accesses remotely stored file data can read it back synchronously and transparently. This talk, which corresponds to the work in progress under HDFS-12090, will present how a Hadoop admin can manage storage tiering between clusters, and how HDFS handles this internally through the snapshotting mechanism and by asynchronously satisfying the storage policy.
Speakers
Thomas Demoor, Object Storage Architect, Western Digital
Ewan Higgs, Software Engineer, Western Digital
hdfs syncservice --create --backupOnly --name activescale /var/logs s3a://hadoop-logs/
hdfs copyFromLocal: one subdirectory of logs per day (14 days)
Apps interact directly with HDFS.
Use the AWS CLI to show objects appearing in the backend.
Show the HDFS storage policy of the file.
Show progress of the backup; wait for it to finish.
hdfs cat a file that is in PROVIDED storage only (seamless, fast).
S3A Recap
Difference between HDFS and S3A
Use cases where S3A falls down
Best of both worlds
HDFS + S3A/WASB/…
Other mixed systems
HDFS + HDFS
Alluxio
Virtual central overlay across all your storages, a.k.a. the source of truth. We see HDFS as the Hadoop overlay/cache of your data lake / object store.
They are limited to the intersection of what all the storages can do and the mappings that can be made (e.g., the limited S3 REST API). We get the best of both worlds.
Caveat: we are not experts on each of these projects.
PhasedPlanFactory: we could refactor DistCpSync to reuse its code, but it is undergoing a lot of work for 3.1.
While it might seem like you can simply run the commands in a particular order to get the correct results, it’s not that simple!
We can continue to make arbitrary rings.
Detecting Cycles is not nice!
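To make the cycle problem concrete, here is a minimal sketch of detecting a ring of sync mounts with a depth-first search. The mount graph, class name, and edge representation are all illustrative assumptions, not the actual HDFS-12090 implementation.

```java
import java.util.*;

// Hypothetical sketch: sync mounts form a directed graph (source namespace ->
// target namespace); a chain of mounts that loops back forms a "ring".
public class MountCycleCheck {
    // Standard DFS with a "visiting" (gray) set and a "done" (black) set;
    // re-entering a node on the visiting set means we found a back edge.
    static boolean hasCycle(Map<String, List<String>> edges) {
        Set<String> visiting = new HashSet<>(), done = new HashSet<>();
        for (String n : edges.keySet()) {
            if (dfs(n, edges, visiting, done)) return true;
        }
        return false;
    }

    static boolean dfs(String n, Map<String, List<String>> edges,
                       Set<String> visiting, Set<String> done) {
        if (done.contains(n)) return false;
        if (!visiting.add(n)) return true; // back edge: cycle found
        for (String m : edges.getOrDefault(n, List.of())) {
            if (dfs(m, edges, visiting, done)) return true;
        }
        visiting.remove(n);
        done.add(n);
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> mounts = new HashMap<>();
        mounts.put("hdfs://a/logs", List.of("s3a://bucket/logs"));
        mounts.put("s3a://bucket/logs", List.of("hdfs://b/mirror"));
        mounts.put("hdfs://b/mirror", List.of("hdfs://a/logs")); // closes the ring
        System.out.println(hasCycle(mounts)); // prints "true"
    }
}
```

The detection itself is textbook graph traversal; the unpleasant part alluded to above is deciding where to run it, since the mounts live in different clusters' configurations.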
Q: Does anyone see a problem here?
A: Renaming files for an object store is not pleasant; doing it twice is worse.
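Why rename is unpleasant on an object store: there is no rename primitive, so "renaming a directory" is emulated as a copy of every object under the old prefix followed by a delete of the originals. The toy in-memory store below illustrates the mechanism; it is a sketch of the semantics, not S3A's actual code.

```java
import java.util.*;

// Illustrative sketch: rename-by-prefix on a flat key/value object namespace.
// The emulation is O(data copied) rather than O(1), and it is not atomic --
// a concurrent reader can observe a half-renamed state.
public class FakeObjectStore {
    private final TreeMap<String, byte[]> objects = new TreeMap<>();

    void put(String key, byte[] data) { objects.put(key, data); }
    byte[] get(String key) { return objects.get(key); }

    // Emulated rename: copy then delete, one object at a time.
    void renamePrefix(String oldPrefix, String newPrefix) {
        // Snapshot the matching keys before mutating the map.
        List<String> keys = new ArrayList<>(
            objects.subMap(oldPrefix, oldPrefix + Character.MAX_VALUE).keySet());
        for (String k : keys) {
            objects.put(newPrefix + k.substring(oldPrefix.length()), objects.get(k));
            objects.remove(k);
        }
    }
}
```

Doing this twice (as a naive command ordering can require) doubles both the data movement and the window in which readers see inconsistent listings.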
Major design consideration: code in the hadoop-hdfs project needs to perform multipart uploads against an object store (e.g., S3A, WASB), so from within hadoop-hdfs the multipart upload ID or part information cannot be resolved to actual data.
To solve this, we wrap a byte[] in UploadHandle and PartHandle types that carry the actual information to be passed back to the underlying API.
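The opaque-handle pattern can be sketched as follows. UploadHandle and PartHandle are real names in the Hadoop filesystem API, but the class shapes and the fromS3UploadId/toS3UploadId helpers here are simplified, standalone illustrations rather than the actual Hadoop code.

```java
import java.nio.charset.StandardCharsets;

// Sketch of the pattern described above: hadoop-hdfs cannot interpret
// store-specific upload IDs, so they travel as opaque bytes, and only the
// store connector (e.g. the S3A side) encodes and decodes them.
public class Handles {
    // Opaque wrapper around whatever the store returned (e.g. an S3 uploadId).
    public static class UploadHandle {
        private final byte[] bytes;
        public UploadHandle(byte[] bytes) { this.bytes = bytes.clone(); }
        public byte[] toByteArray() { return bytes.clone(); }
    }

    // The connector serializes its native ID into the handle...
    public static UploadHandle fromS3UploadId(String uploadId) {
        return new UploadHandle(uploadId.getBytes(StandardCharsets.UTF_8));
    }

    // ...and decodes it when the handle comes back through hadoop-hdfs,
    // which only ever saw raw bytes in between.
    public static String toS3UploadId(UploadHandle h) {
        return new String(h.toByteArray(), StandardCharsets.UTF_8);
    }
}
```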
GCS supports Composite Objects, which are similar to how HDFS will work: it’s a concatenation of existing objects/files.
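The semantics of a composite object are simple to state: its content is the byte-wise concatenation of its source objects, in order. The sketch below shows only that semantic model; it is not the GCS API.

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

// Minimal model of composite-object content: concatenate the parts in order.
// This mirrors what a GCS "compose" operation produces for its sources.
public class Compose {
    public static byte[] compose(List<byte[]> parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] p : parts) out.write(p, 0, p.length);
        return out.toByteArray();
    }
}
```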
Tough to see!!
Same as creating a file, because the operation must be idempotent.