Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, such as Azure HDInsight and Amazon EMR. In these settings, but also in more traditional, on-premises deployments, applications often manage data stored in multiple storage systems or clusters, requiring complex workflows to synchronize data between filesystems to achieve goals for durability, performance, and coordination.
Building on existing heterogeneous storage support, we add a storage tier to HDFS to work with external stores, allowing remote namespaces to be "mounted" in HDFS. This capability not only supports transparent caching of remote data as HDFS blocks, it also supports synchronous writes to remote clusters for business continuity planning (BCP) and supports hybrid cloud architectures.
This idea was presented at last year's Summit in San Jose. Much progress has been made since then, and the feature is in active development at the Apache Software Foundation on branch HDFS-9806, driven by Microsoft and Western Digital. We will discuss the refined design and implementation, and present how end users and admins will be able to use this powerful functionality.
2. >id
Thomas Demoor
• PO/Architect @ Western Digital
• S3-compatible object storage
• Hadoop:
  – S3a optimizations
    • Fast uploader (stream from memory)
    • Hadoop 2 / YARN support
    • Coming up: object-store committer
  – HDFS Tiered Storage
Virajith Jalaparti
• Scientist @ Microsoft CISL
• Hadoop
  – HDFS Tiered Storage
3. Overview
• HDFS Tiered Storage
  – Mount and manage remote stores through HDFS
• Earlier talks
  – Hadoop Summit '16, San Jose
  – Dataworks Summit '17, Munich
• This talk
  – Introduce Tiered Storage in HDFS (design, read path, …)
  – Focus on progress since earlier talks (mounting in HDFS, write path, …)
  – Demo
[Diagram: an app in a Hadoop cluster reads/writes through HDFS, which is backed by a remote store]
4. Use Case I: Ephemeral Hadoop Clusters
• EMR on S3, HDInsight over WASB, …
• Several workarounds used today
  – DistCp
  – Use only remote storage
  – Explicitly manage local and cloud storage
• Goal: Seamlessly use local and remote (cloud) stores as one instance of HDFS
  – Retrieve data to local cluster on-demand
  – Use local storage to cache data
[Diagram: multiple Hadoop clusters reading/writing data in a cloud store (e.g., S3, WASB)]
5. Use Case II: Backup data to object stores
• Business value of Hadoop + Object Storage:
  – Data retention: very high fault tolerance (erasure coding)
  – Economics: cheap storage for cold data
  – Business continuity planning: backup, migrate, …
• Public Clouds: Microsoft Azure, AWS S3, GCS, …
• Private Clouds: WD ActiveScale Object Storage
  – S3-compatible object storage system
  – Linear scalability in # racks, objects, throughput
  – Entry level (100s of TB) – scale out (5 PB+/rack)
  – http://www.hgst.com/products/systems
6. Use Case II: Backup data to object stores
• Today: Hadoop Compatible FileSystems (s3a://, wasb://)
  – Direct IO between Hadoop apps and object store
  – Scalable & resilient: outsourcing NameNode functions
• Compatible does not mean identical
  – Most are not even FileSystems (notion of directories, append, …)
  – No data locality: less performant for hot/real-time data
  – Hadoop admin tools require HDFS: permissions/quota/security/…
  – Workaround: explicitly manage local HDFS and remote cloud storage
• Goal: integrate better with HDFS
  – Data locality for hot data + object storage for cold data
  – Offer familiar HDFS admin abstractions
[Diagram: an app in a Hadoop cluster reading and writing directly against the remote store]
7. Solution: "Mount" remote storage in HDFS
• Use HDFS to manage remote storage
  – HDFS coordinates reads/writes to remote store
  – Mount remote store as a PROVIDED tier in HDFS
    • Details later in the talk
  – Set StoragePolicy to move data between the tiers
[Diagram: a subtree (d, e, f) of the remote namespace is mounted at mount point c in the HDFS namespace; the app reads/writes through HDFS, which writes through to the remote store and loads data on demand]
8. Solution: "Mount" remote storage in HDFS
• Use HDFS to manage remote storage
  – HDFS coordinates reads/writes to remote store
  – Mount remote store as a PROVIDED tier in HDFS
    • Details later in the talk
  – Set StoragePolicy to move data between the tiers
• Benefits
  – Transparent to users/applications
  – Provides unified namespace
  – Can extend HDFS support for quotas, security, etc.
  – Enables caching/prefetching
[Diagram: an app in a Hadoop cluster reads/writes through HDFS, which is backed by the remote store]
9. Challenges
• Synchronize metadata without copying data
  – Dynamically page in "blocks" on demand
  – Define policies to prefetch and evict local replicas
• Mirror changes in remote namespace
  – Handle out-of-band churn in remote storage
  – Avoid dropping valid, cached data (e.g., rename)
• Handle writes consistently
  – Writes committed to the backing store must "make sense"
• Dynamic mounting
  – Efficient/clean mount-unmount behavior
  – One object store mapping to multiple Namenodes
10. Outline
• Use cases
• Mounting remote stores in HDFS
• Demo
  1. Backup from on-prem HDFS cluster to Azure Blob Store
  2. Spin up an ephemeral HDFS cluster on Azure
• Types of mounts
• Reads in Tiered HDFS
• Writes in Tiered HDFS
13. Types of mounts
• Ephemeral mounts
  – Access data in remote store using HDFS (Use Case I)
  – <source>: remoteFS://remote/path
  – <dest>: hdfs://local/path
  – Changes are bi-directional
• Backup mounts
  – Backup data from HDFS to remote store (Use Case II)
  – <source>: hdfs://local/path
  – <dest>: remoteFS://remote/path
  – Changes are uni-directional

hdfs dfsadmin -mount <source> <dest> [-ephemeral|-backup]

[Diagram: in an ephemeral mount, the app's HDFS exchanges changes with the remote store in both directions; in a backup mount, changes flow only from HDFS to the remote store]
14. Reads in ephemeral mounts
[Diagram: the subtree /c/d (children e, f, g) of the remote namespace remoteFS:// is mounted as /d in the HDFS cluster; a client issues read(/d/e) to the NameNode, a DataNode issues read(/c/d/e) against the remote store, and the file data is streamed back through the DataNode to the client]
15. Enabled using the PROVIDED Storage Type
• Peer to RAM, SSD, DISK in HDFS (HDFS-2832)
• Data in remote store mapped to HDFS blocks on PROVIDED storage
  – Each block associated with BlockAlias = (REF, nonce)
    • Nonce used to detect changes on external store
    • REF = (file URI, offset, length); nonce = GUID
    • E.g., REF = (s3a://bucket/file, 0, 1024); nonce = <ETag>
  – Mapping stored in an AliasMap
    • Can use a KV store which is external to or in the NN
• PROVIDEDVolume on Datanodes reads/writes data from/to remote store
[Diagram: in the NN, the FSNamesystem maps a local file /a/foo to blocks b_i…b_j and a remote file /remote/bar to blocks b_k…b_l; the BlockManager maps b_i to storages {s1, s2, s3} and b_k to {s_PROVIDED}; the AliasMap maps b_k to Alias_k; DataNodes expose RAM_DISK, SSD, DISK, and PROVIDED storage types]
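The BlockAlias/AliasMap scheme can be sketched in a few lines. This is a hypothetical, in-memory model for illustration, not the actual HDFS-9806 classes (names and signatures are assumptions): a block ID resolves to a (URI, offset, length, nonce) tuple, and a mismatched nonce flags an out-of-band change in the remote store.

```python
from collections import namedtuple

# Hypothetical model of a BlockAlias: REF = (uri, offset, length), plus a nonce.
BlockAlias = namedtuple("BlockAlias", ["uri", "offset", "length", "nonce"])

class AliasMap:
    """In-memory stand-in for the KV store mapping block IDs to aliases."""

    def __init__(self):
        self._aliases = {}

    def put(self, block_id, alias):
        self._aliases[block_id] = alias

    def resolve(self, block_id, current_nonce):
        alias = self._aliases[block_id]
        # The nonce (e.g., an S3 ETag) detects that the remote object
        # changed after this block mapping was created.
        if alias.nonce != current_nonce:
            raise ValueError("remote object changed; alias for %s is stale" % block_id)
        return alias

alias_map = AliasMap()
alias_map.put("b_k", BlockAlias("s3a://bucket/file", 0, 1024, "etag-1"))
print(alias_map.resolve("b_k", "etag-1"))  # nonce matches: alias is returned
```

A read of a PROVIDED block would resolve its alias this way before issuing a ranged read against the remote store; a stale nonce becomes a detected conflict rather than silently returning wrong bytes.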
16. Example: Using an immutable cloud store
• Create FSImage and AliasMap
  – Block StoragePolicy can be set as required
  – E.g.: {rep=2, PROVIDED, DISK}

FSImage:
  /d/e → {b_1, b_2, …}
  /d/f/z1 → {b_i, b_i+1, …}
  b_i → {rep = 1, PROVIDED}
AliasMap:
  b_i → {(remote://c/d/f/z1, 0, L), inodeId1}
  b_i+1 → {(remote://c/d/f/z1, L, 2L), inodeId1}

[Diagram: the remote namespace remoteFS://, with the subtree /c/d (children e, f, g) to be mounted]
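Generating the AliasMap amounts to carving each remote file into fixed-size PROVIDED blocks. A minimal sketch of that step, under the assumption that the image generator simply walks each file by offset (the function name and tuple layout are illustrative, not the actual tool's API):

```python
def carve_blocks(uri, file_len, block_size, nonce, first_block_id):
    """Carve a remote file into fixed-size PROVIDED blocks, yielding
    (block_id, (uri, offset, length, nonce)) entries for the AliasMap."""
    entries = []
    offset, block_id = 0, first_block_id
    while offset < file_len:
        # Last block may be shorter than block_size.
        length = min(block_size, file_len - offset)
        entries.append((block_id, (uri, offset, length, nonce)))
        offset += length
        block_id += 1
    return entries

entries = carve_blocks("remote://c/d/f/z1", 2500, 1024, "inodeId1", 100)
# three blocks: offsets 0, 1024, 2048 with lengths 1024, 1024, 452
```

The FSImage would record only the block IDs (and storage policy) per file; the AliasMap holds the remote coordinates for each ID.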
17. Example: Using an immutable cloud store
• Start NN with the FSImage
• All blocks reachable when a DN with PROVIDED storage heartbeats in
[Diagram: the NN loads the FSImage and AliasMap; once DataNodes DN1 and DN2 with PROVIDED storage report in, the BlockManager marks the blocks of the mounted remote subtree /d (children e, f, g) as available]
18. Example: Using an immutable cloud store
• DN uses BlockAlias to read from external store
  – Data can be cached locally as it is read (read-through cache)
[Diagram: the DFSClient calls getBlockLocations("/d/f/z1", 0, L) on the NN, which returns LocatedBlocks {{DN2, b_i, PROVIDED}}; DN2 looks up b_i in the AliasMap, opens the remote file (open("remote:///c/d/f/z1/", GUID1)), and streams the file data back to the client]
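The DN-side read path above behaves like a read-through cache. The following is a conceptual model only (the dict-based `remote` stand-in and its ranged-read convention are assumptions): on a cache miss, the DN resolves the block's alias, fetches the byte range from the remote store, and keeps a local replica for subsequent reads.

```python
def read_provided_block(block_id, alias_map, remote, local_cache):
    """Sketch of the DN read path for a PROVIDED block: resolve the alias,
    fetch from the remote store, and cache the bytes as a local replica."""
    if block_id in local_cache:            # already cached as a local block
        return local_cache[block_id]
    uri, offset, length, nonce = alias_map[block_id]
    data = remote[(uri, offset)][:length]  # stand-in for a ranged remote read
    local_cache[block_id] = data           # read-through cache
    return data

alias_map = {"b_i": ("remote://c/d/f/z1", 0, 4, "inodeId1")}
remote = {("remote://c/d/f/z1", 0): b"hello world"}
cache = {}
print(read_provided_block("b_i", alias_map, remote, cache))  # b'hell', now cached
```

Because the bytes pass through the DN, the NN can steer clients toward whichever DN should hold the cached copy, which is what enables the placement-policy and prefetching combinations mentioned later.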
19. Writes in ephemeral mounts
• Metadata operations
  – create(), mkdir(), chown, etc.
  – Synchronous on remote store
  – For FileSystems: Namenode performs operation on remote store first
  – For blob stores: metadata operations need not be propagated
    • Example: blob stores like S3 have no notion of directories for directly accessing clients
• Data operations
  – One of the Datanodes in the write pipeline writes to remote store
  – BlockAlias passed in write pipeline
[Diagram: the DFSClient writes through the pipeline DN1 → DN2 → DN3, passing the BlockAlias along; the last DN writes the replica through to the remote store]
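The ordering of the synchronous metadata path matters: performing the operation on the remote store first means a remote failure can be returned to the client without rolling back any NameNode state. A sketch of that ordering, with toy stand-ins for the remote FileSystem and the local namespace (both are illustrative assumptions):

```python
def mkdir_through(path, remote_fs, local_ns):
    """Ephemeral-mount metadata write: remote store first, then local.
    If the remote mkdir raises, the local namespace is never touched,
    so there is nothing to revert."""
    remote_fs.mkdir(path)   # may raise; the error is surfaced to the client
    local_ns.add(path)      # only reached after the remote op succeeded

class FakeStore:
    def __init__(self, fail=False):
        self.paths, self.fail = set(), fail
    def mkdir(self, path):
        if self.fail:
            raise IOError("remote store unavailable")
        self.paths.add(path)
    def add(self, path):
        self.paths.add(path)

remote, local = FakeStore(), FakeStore()
mkdir_through("/d/new", remote, local)   # both namespaces now contain /d/new
```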
20. Writes in Backup mounts
• Daemon on Namenode backs up metadata/data in the mount
• Delegates work to Datanodes (similar to SPS [HDFS-10285])
• Backup of data based on remote store capabilities
  – For FileSystems: write block by block
  – For blob stores: multi-part upload to upload blocks in parallel
[Diagram: a coordinator DN in the Hadoop cluster directs DN1 and DN2 to upload their blocks to the remote store]
21. Writes in Backup mounts
• Daemon on Namenode backs up metadata/data in the mount
• Delegates work to Datanodes (similar to SPS [HDFS-10285])
• Backup of data based on remote store capabilities
  – For FileSystems: write block by block
  – For blob stores: multi-part upload to upload blocks in parallel
• Use snapshots to maintain a consistent view
  – Backup a particular snapshot
  – Backup changes from previous snapshot
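The snapshot-based loop can be illustrated with a toy diff: copy the first snapshot in full, then repeatedly copy only the entries that changed between consecutive snapshots. The path→version map below is a deliberate simplification of an HDFS snapshot, used only to show which entries an incremental round would transfer:

```python
def snapshot_diff(old, new):
    """Entries added or modified between two snapshots (path -> version maps).
    Only these need to be copied in an incremental backup round."""
    return {path: v for path, v in new.items() if old.get(path) != v}

s1 = {"/a": 1, "/b": 1}
s2 = {"/a": 1, "/b": 2, "/c": 1}   # /b modified, /c created since s1
delta = snapshot_diff(s1, s2)
print(delta)  # {'/b': 2, '/c': 1}; /a is unchanged and skipped
```

The backup daemon would keep taking snapshots and shipping deltas this way until the mount is removed, so the remote store always reflects a consistent point-in-time view rather than a mid-flight mixture of writes.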
22. Assumptions
• Churn is rare and relatively predictable
  – Analytic workloads, ETL into external/cloud storage, compute in cluster
• Clusters are either consumers or producers for a subtree/region
  – FileSystem API has too little information to resolve conflicts
[Diagram: an ingest/ETL pipeline writes into a raw-data bucket; an analytics cluster consumes it and writes to a separate analytic-results bucket]
23. Conflict resolution
• Conflicts occur when remote store is directly modified
• Detected
  – On read operations: e.g., using open-by-nonce operation
  – On write operations: e.g., file to be created is already present
• Pluggable policy to resolve conflicts
  – "HDFS wins"
  – "Remote store wins"
  – Rename files under conflict
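The pluggable policies listed above might look like the following dispatch. The policy names and the `.conflict` rename convention are assumptions for illustration, not the branch's actual interface:

```python
def resolve_conflict(policy, path, hdfs_entry, remote_entry):
    """Return the surviving entry (or entries) for a path that was modified
    both in HDFS and directly in the remote store."""
    if policy == "hdfs-wins":
        return {path: hdfs_entry}          # local HDFS view prevails
    if policy == "remote-wins":
        return {path: remote_entry}        # remote store view prevails
    if policy == "rename":
        # Keep both versions; the HDFS one is renamed out of the way.
        return {path + ".conflict": hdfs_entry, path: remote_entry}
    raise ValueError("unknown policy: %s" % policy)

print(resolve_conflict("rename", "/d/e", "hdfs-v2", "remote-v2"))
```

Because the FileSystem API alone cannot say which side's change is "right", the choice is left to the administrator per mount.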
24. Status
• Read-only ephemeral mounts
  – HDFS-9806 branch on Apache Hadoop
• Backup mounts
  – Prototype available on GitHub
• Next:
  – Writes in ephemeral mounts
  – Conflict resolution
  – Create mounts in a running Namenode
25. Resources + Q&A
• HDFS Tiered Storage HDFS-9806
  – Design documentation
  – List of subtasks, lots of linked tickets – take one!
  – Discussion of scope, implementation, and feedback
• Joint work Microsoft – Western Digital
  – {thomas.demoor, ewan.higgs}@wdc.com
  – {cdoug,vijala}@microsoft.com
27. Benefits of the PROVIDED design
• Use existing HDFS features to enforce quotas, limits on storage tiers
  – Simpler implementation, no mismatch between HDFS invariants and framework
• Supports different types of back-end stores
  – org.apache.hadoop.FileSystem, blob stores, etc.
• Credentials hidden from client
  – Only NN and DNs require credentials of external store
  – HDFS can be used to enforce access controls for remote store
• Enables several policies to improve performance
  – Set replication in FSImage to pre-fetch
  – Read-through cache
  – Actively pre-fetch while cluster is running
Editor's notes
Welcome. Thanks for coming. We’re discussing a proposal for implementing tiering in HDFS, building on its support for heterogeneous storage.
Thomas…
Hi, I am Virajith, and I am currently working as a Scientist at the Microsoft Cloud Information and Services Lab, where I started the project on HDFS Tiered Storage about a year ago with Chris Douglas.
• Started the effort almost a year ago now
• Chris and Virajith posted a design doc; Ewan and I were trying to solve the same problem, so we joined forces
• In those talks, we discussed the design. Today, we will of course reintroduce that
•But we want to focus on the progress we’ve made on mounting and the write path.
•And there is a demo. No tricks!
• Ephemeral, a.k.a. short-lived, Hadoop clusters
•EMR, HDInsight, whatever custom env you have with k8s, …
•Persistent data lives in remote store outside of Hadoop cluster
•Need to load in data at start-up and backup data before cluster is spun down
• Several workarounds: distcp, sacrificing performance by using remote only, or explicitly managing both remote and local in the app
• To address this use case, the goal is for our proposed solution to present a single HDFS instance that abstracts away the underlying topology and retrieves/stores data on demand
• Local storage can be seen as a temporary cache
•Every year at Hadoop Summit interacting with public cloud object stores gains more attention
•Why would one want to use object stores with Hadoop
•data is stored very efficiently at low cost
•enables lots of data movement workflows
•Some people have / need private cloud(scale, compliance, …)
•Install an object store into your DC next to your on-prem Hadoop cluster and get high performance
We happen to make one of these; there are others as well.
•The proposed solution allows mounting remote stores into HDFS
•This can be another HDFS cluster or object stores or …
•Mounting is a well-known abstraction for any (storage) admin
• We leverage HDFS Heterogeneous Storage by adding a new type, PROVIDED
•Data can be moved by setting the StoragePolicy: <PROVIDED> or <SSD, PROVIDED>
•Main benefits is that we transparently abstract away the underlying storage.
•The user/app does not know whether data is local or not, HDFS handles this completely, offering a unified namespace and all the regular HDFS admin tools “just work”
• Furthermore, there are interesting caching / load-on-demand opportunities
There are a few challenges. These can broadly be grouped into the read path and the write path.
In the read path, we're mostly focused on caching and synchronizing changes to the object storage.
In the write path, we're concerned with writing new blocks and dynamically mounting object stores. We consider this phase 2.
Before we go into the technical details of how we make all this work, let’s look at a demo. In this demo, I will show you how we can backup data in an on-prem HDFS cluster to Azure blob store, and once the data is backed up, show that we can spin up HDFS clusters in Azure that can consume this data. So, we will illustrate the ability to both write and read data to remote stores.
[Show local cluster HDFS page] Here we have an on-prem HDFS cluster, which
[show directories in UI] contains two directories under /user/hadoop.
[show hdfs-site.xml] For backup to work, we specify the backup path in the hdfs configuration file. In this case, it is this URL in azure blob store.
[start running the setStoragePolicy command] Now suppose you want to back up the workloads directory. For this, in the current prototype, we just set the storage policy of the directory to be PROVIDED, and use the –scheduleBlockMoves flag to start the storage policy satisfier. We built this prototype on top of the SPS work from Intel that is happening in HDFS-10285.
[run the command] Now, once the command is run, let's go to the Azure portal to verify that we see it.
[show that the directory appears][switch between HDFS web page and Azure web page] Now we can see that all the files under the workloads directory are backed up to Azure.
Now suppose we want to backup another directory. Let’s do it for the bin directory under YCSB.
[run back up command for YCSB].
[go back to azure and show that we have it backed up] See, we now have the YCSB/bin directory backed up.
It is as simple as this. Just set the storage policy and data will be backed up to the configured location.
Now let’s see how we can mount this data in azure blob store on a cluster in Azure. I have already started up a few VMs on Azure that will serve as the hosts for HDFS.
[start creating FSImage] First, for the mount to work, we have to create an FSImage to describe what files are stored on the blob store.
[show blocks.csv] This creates a block map, which I will describe later – it is essentially a mapping from block ids to the paths on the remote store. For this demo, we just use a text file but this can be in a KV store.
[run command] now let’s start HDFS on the cloud.
[Show the UI of the hdfs cluster] Here is the webpage for HDFS running on azure. [show the URL] this machine is on azure.
[now go to the file browser] Here we can see that the data all appears in the cluster on the cloud. Now, isn’t that cool
So, what have we seen this in this demo?
We started off with an on-prem cluster,
-> we were able to back up data to azure blob store by setting the storage policy on the data.
-> Then we created an FSImage of this backup to describe what the blob store contains.
-> And finally we were able to start up an HDFS cluster on Azure which reads this FSImage and mounts the data in the blob store.
-> So, a particular location on the on-prem cluster is
-> mapped to a corresponding blob on the blob store,
-> and is eventually accessible on the cluster in the cloud.
Now let’s go into the technical details on how all this works in HDFS.
We define two kinds of mounts in our work for the 2 kinds of use cases we aim to address. The first are ephemeral mounts where we use HDFS to access data in remote stores. Here the source of the data is the remote store, and the mount destination is in HDFS.
-> The change propagation is bi-directional. So, any changes in the remote store are propagated to HDFS, and any changes in HDFS are propagated to the remote store.
-> The 2nd kind of mounts we define are backup mounts.
-> These are used to backup data in HDFS to remote stores. So, the source here is HDFS and the destination is the remote store.
-> The change propagation is one-directional here – only changes in HDFS are transferred to the remote store.
We define two kinds of mounts to simplify how we reason about the semantics of the mounts. Another option is to define merge mounts where we merge the contents of the source and destination – however, the semantics of such mounts can get complicated
Now let’s look at how these mounts work in practice. To start off, I will talk about how reads work in ephemeral mounts.
-> Suppose, this is the part of the remote namespace we want to
-> mount in HDFS. If the mount is successful, we should be able to access data in the cloud through HDFS. That is
-> if a client comes and requests for a particular file, say /d/e, from HDFS, then HDFS should be
-> able to read the file from the external store,
-> get back the data from the external store and
-> stream the data back to the client.
In this work, we enable this----
For this, we introduce a new storage type called Provided which will be a peer to existing storage types. The Provided storage type is used to refer to data in the remote store.
-> So, Datanodes can now support 4 kinds of storage types.
-> Data in the remote store is mapped to HDFS blocks on provided storage.
So, in HDFS today, the NN is partitioned into a namespace (FSNamesystem) that maps files to a sequence of block IDs, and the BlockManager, which is responsible for block lifecycle management and maintaining the locations of the blocks of any file. In this example, file /a/foo is mapped to blocks with ids b_i to b_j. Each block ID is mapped to a list of replicas resident on a storage attached to a datanode. For example, here we have block b_i mapped to storages s1, s2 and s3.
-> As HDFS understands blocks, for files in the provided storage, we use a similar mapping. So, a file /remote/bar is mapped to blocks bk to b_l, and each of these blocks is mapped to a provided storage.
-> However, this is not sufficient to locate the data in the remote store. We need some mapping between these blocks and how the data is laid out in the remote store. For this, every block on "provided" storage is mapped to an alias. An alias is simply a tuple: a reference, which is something resolvable in the namespace of the remote store, and a nonce to verify that the reference still locates the data matching that block.
For example, if the remote store is another FileSystem, then my reference may be a URI, offset, length. and the nonce can be a GUID like an inode or fileID. If the remote store is a blob store like s3, the REF can be a blob name, offset and length, and the nonce can be a ETAG id.
-> We also maintain an AliasMap which contains the mapping between the block ids and their aliases. This can be in the NN or be an external KV store.
-> Finally, we have provided volumes in Datanodes which are used to read and write data from the external store. The provided volume essentially implements a client that is capable of talking to the external store.
Summary: the AliasMap helps us map HDFS metadata to metadata on the remote store, and the PROVIDED storage type helps HDFS understand that the data is actually remote.
Let’s drill down into an example and walk through how the ephemeral mounts would work. Assume we want to mount this remote subtree in HDFS.
-> For this we generate two things– the FSImage, and the AliasMap.
-> The FSImage is a mirror of the metadata. Every file in this image is partitioned it a sequence of blocks. The image contains only the block IDs and storage policy for each block. Along with the FSImage, we also generate the AliasMap
-> which stores for each block id, the block Alias on the external store. Each Alias points to the file on the remote store the block references, the offset of the block, and the length of the block, and a nonce (inodeId, LMT) sufficient to detect inconsistency.
The FSImage and AliasMap can now be used to start up a HDFS Namenode. If we set the replication to be > 1, we can load the data into the cluster before any clients read it.
-> When a DN configured with a provided storage volume reports in, the NN assumes that all blocks in the AliasMap are reachable through this Datanode, and it marks all the blocks as available. There are no individual block reports for provided blocks.
-> So, when a client calls getBlockLocations() for a provided file z1,
-> the blockmanager resolves the composite DN to a
-> physical DN that is configured with a provided volume. The DN can be chosen based on a pluggable policy. For example, we can resolve the location to the closest DN to the client.
-> Now, when the client goes to the DN to read the provided block, the DN knows only that the block is provided but doesn't have the block local to it. So
-> it goes to the alias map to resolve the block id to the Alias on the remote store,
-> The DN uses this Alias to open the corresponding file on the remote store, reads the file, and passes the data along to the client,
-> because the block is read through the DN, we can also cache the data as a local block.
I will next briefly talk about how writes work with provided storage.
First lets look at ephemeral mounts.
When the remote store for an ephemeral mount is a FileSystem, metadata operations are first performed in the remote store by the NN and then performed locally. This ensures that if the remote operation fails, the NN can fail the client without having to revert any local state.
For remote stores that are blob stores or that do not support metadata information such as permissions or directories, metadata operations need not be propagated to the remote store.
For operations that involve writing to files on the remote store, we plug into the existing write pipeline in HDFS. The BlockAlias is passed along to the DN that writes the Provided replica, The DN the uses the Information in the Alias to figure out where to write in the remote store. Any failures in writing to the remote store can be recovered similar to failures in the existing write pipeline.
As opposed to ephemeral mounts, where write operations will be initiated by the client, for back up mounts, writes should happen without any continuous user/client interaction. So, for this, we have a daemon in the NN that backs up the data in the mount.
Whenever any subtree is set up for backup, the backup daemon goes over all the files in this directory and backs them up. It delegates the work of backing up individual files to datanodes similar to how SPS works which is what we used for the prototype in our demo.
The backup can happen based on the capabilities of the remote store. For FSes…., For blob stores, ….
As we back up, the files in HDFS might change. To maintain a consistent view on the remote store, we use snapshots in HDFS. When a backup is initiated, we take a snapshot of the subtree that is being backed up, and copy over the metadata and data of that snapshot. During this time the subtree would have evolved. Once the first snapshot has been copied, we take a 2nd snapshot, figure out the deltas between the snapshots and copy these deltas over. We continue doing this, moving from snapshot to snapshot, until the backup is unmounted.
In this work, we try to mount stores without expecting any APIs beyond those supported by the FileSystem API. Without additional support from the remote stores, these APIs are generally not sufficient to keep HDFS and the remote store in tight synchronization. Even if we mount the remote store as read-only, we can only get eventual consistency without support from the remote store.
However, in general we provide workable semantics for big data workloads. In most scenarios we target, churn is relatively rare, and is generally predictable. For example, most data ingest happens in year/month/day/hour layouts and is mostly additive. Because of this, we can have some simple heuristics that help resolve inconsistencies between the remote store and HDFS.
We also assume that clusters are either producers or consumers of data. If clusters both produce and consume data, then we might run into conflicts. And in most cases, we do not have enough information across multiple storage systems to resolve such conflicts.
Fundamentally: no magic here, but we try to provide a tractable solution that covers the most common cases and deployments.
Please join us. We have a design document posted to JIRA, an active discussion of the implementation choices, and we’ll be starting a branch to host these changes. The existing work on READ_ONLY_SHARED replicas has a superlative design doc, if you want to contribute but need some orientation in the internal details.
We have a few minutes for questions, but please find us after the talk. There are far more details than we can possibly cover in a single presentation and we’re still setting the design, so we’re very open to collaboration. Thanks, and... let’s take a couple questions.
There are a few points worth calling out, here.
* First, this is a relatively small change to HDFS. The only client-visible change adds a new storage type. As a user, this is simpler than coordinating with copying jobs. In our cloud example, all the cluster’s data is immediately available once it’s in the namespace, even if the replication policy hasn’t prefetched data into local media.
* Second, particularly for read-only mounts, this is a narrow API to implement. For cloud backup scenarios- where the NN is the only writer to the namespace- then we only need the block to object ID map and NN metadata to mount a prefix/snapshot of the cluster.
In our example the cloud credentials are hidden from the client. S3/WAS both authenticate clients to containers using a single key. Because HDFS owns and protects the external store’s credentials, the client only accesses data permitted by HDFS. Generally, we can use features of HDFS that aren’t directly supported by the backing store if we can define the mapping.
Finally, because the client reads through the DN, it can cache a copy of the block on read. Pointedly, the NN can direct the client to any DN that should cache a copy on read, opening some interesting combinations of placement policies and read-through caching. The DN isn’t necessarily the closest to the client, but it may follow another objective function or replication policy.