2. Original Goal
Provide greater availability and durability with geographically distinct replicas.
Multi-Region Replication
• Replicate objects to other Swift clusters.
• Allow a configurable number of remote replicas.
• Ideally allow per container configuration.
Problems
• Very complex to implement; even the simpler feature I propose
is already pretty complex.
• Swift currently only has a cluster-wide replica count.
• Tracking how many replicas are remote and where adds
complexity.
• Per-container remote replica counts add complexity.
Complexity = More Time and More Bugs
3. New Goal
Provide greater availability and durability with geographically distinct replicas.
Simpler Container Synchronization
• Replicate objects to other Swift clusters.
• Remote replica count is not configurable; it is whatever replica
count the remote cluster is already configured for.
• Per container configuration allowed, but just "to where".
Benefits
• Much simpler (but still complex).
• Doesn't alter fundamental Swift internals.
• Per container configuration that doesn't change behavior,
only the destination.
• Side Benefit: Can actually synchronize containers within the
same cluster. (Migrating an account to another, for instance.)
Simpler = Less Time and Fewer Bugs
4. How the User Would Use It
1. Set the first container's X-Container-Sync-To and
X-Container-Sync-Key values; set the To to the second
container's URL and the Key to a made-up secret:
$ st post -t https://cluster2/v1/AUTH_gholt/container2 -k secret container1
2. Set the second container's X-Container-Sync-To and
X-Container-Sync-Key values; set the To to the first
container's URL and the Key to the same made-up secret:
$ st post -t https://cluster1/v1/AUTH_gholt/container1 -k secret container2
Now any objects already in either container will be synced to the
other, as will any objects added later.
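
The same thing can be done with plain HTTP; below is a rough Python sketch of
the two POSTs above (the token values are placeholders, and this is only an
illustration of the API calls, not part of the feature itself):

    import urllib.request

    def set_sync(container_url, auth_token, sync_to, sync_key):
        # POST container metadata to turn sync on; Swift replies 204 No Content.
        req = urllib.request.Request(
            container_url,
            method="POST",
            headers={
                "X-Auth-Token": auth_token,
                "X-Container-Sync-To": sync_to,
                "X-Container-Sync-Key": sync_key,
            },
        )
        with urllib.request.urlopen(req) as resp:
            return resp.status

    # Mirrors the two st commands above; the tokens are made up.
    set_sync("https://cluster1/v1/AUTH_gholt/container1", "<token for cluster1>",
             "https://cluster2/v1/AUTH_gholt/container2", "secret")
    set_sync("https://cluster2/v1/AUTH_gholt/container2", "<token for cluster2>",
             "https://cluster1/v1/AUTH_gholt/container1", "secret")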
5. Advanced Container Synchronization
You can synchronize more than just two containers.
Normally you just synchronize the two containers:
Container 1 <-> Container 2
But, you could synchronize more by using a chain:
Container 1 -> Container 2 -> Container 3
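
A small sketch of how such a chain could be wired up, reusing the header
scheme from the previous slide; the third cluster's URL and the choice to
close the loop back to Container 1 (so changes made anywhere reach every
container) are my assumptions, since the slide only shows the chain itself:

    # Each container's X-Container-Sync-To points at the next one in the chain.
    chain = [
        "https://cluster1/v1/AUTH_gholt/container1",
        "https://cluster2/v1/AUTH_gholt/container2",
        "https://cluster3/v1/AUTH_gholt/container3",  # hypothetical third cluster
    ]
    sync_key = "secret"

    for here, there in zip(chain, chain[1:] + chain[:1]):  # last link closes the loop
        container = here.rsplit("/", 1)[-1]
        print(f"st post -t {there} -k {sync_key} {container}")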
6. Caveats
• Valid X-Container-Sync-To destinations must be configured for each
cluster ahead of time. The feature is based on Cluster Trust.
• The Swift cluster clocks need to be set reasonably close to one another.
Swift timestamps each operation and these timestamps are used in conflict
resolution. For example, if an object is deleted on one cluster and
overwritten on the other, whichever has the newest timestamp will win (see
the sketch after this list).
• There needs to be enough bandwidth between the clusters to keep up
with all the changes to the synchronized containers.
• There will be a burst of bandwidth used when turning the feature on for
an existing container full of objects.
• A user has no explicit guarantee when a change will make it to the remote
cluster. For example, a successful PUT means that cluster has the object,
not the remote cluster. The synchronization happens in the background.
• Does not sync object POSTs yet (more on this later).
• Since background syncs come from the container servers themselves, they
need to communicate with the remote cluster, probably requiring an HTTP
proxy, and probably one per zone to avoid choke points.
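
To make the timestamp caveat concrete, here is a tiny last-write-wins sketch;
the timestamp values are made up, and the real conflict resolution happens
inside Swift itself:

    # Hypothetical X-Timestamp values for two conflicting operations.
    delete_on_cluster_1 = ("DELETE", 1300000020.12345)
    put_on_cluster_2 = ("PUT", 1300000021.67890)

    # Whichever operation carries the newest timestamp wins, so both clusters
    # converge on the same outcome once the sync catches up.
    winner = max(delete_on_cluster_1, put_on_cluster_2, key=lambda op: op[1])
    print(winner[0], "wins")  # -> PUT wins: the object ends up on both clusters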
7. What’s Left To Do?
HTTP Proxying
Tests
Documentation
POSTs
Because object POSTs don't currently cause a container database update, we need
to either cause an update or come up with another way to synchronize them.
The current plan is to modify POSTs to actually be a COPY internally.
Downside: POSTs to large files will take longer.
Upside: We have noticed very few POSTs in production.
8. Live Account Migrations
This is a big step towards live account migrations.
1. Turn on sync for the linked accounts on the two clusters.
2. Wait for the new account to get caught up.
3. Switch auth response URL to new account and revoke all existing account tokens.
4. Put old account in a read-only mode.
5. Turn off sync from the new account to the old.
6. Wait until old account is no longer sending updates plus some safety time.
7. Purge old account.
Missing Pieces:
• Account sync (creating new containers on both sides, deletes and posts too).
• Account read-only mode.
• Using alternate operator-only headers so they don't conflict with the user's,
while keeping the user from seeing or modifying the values.
9. Implementation
st
• Updated to set/read X-Container-Sync-To and X-Container-Sync-Key.
Swauth and container-server
• Require a new conf value, allowed_sync_hosts, indicating the allowed remote
clusters.
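
For instance, something along these lines in container-server.conf (the
section and host names shown are assumptions; only the allowed_sync_hosts
value comes from the slide):

    [DEFAULT]
    # remote cluster hosts this cluster will trust for container sync
    allowed_sync_hosts = cluster1.example.com, cluster2.example.com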
swift-container-sync
• New daemon that runs on every container server.
• Scans every container database looking for ones with sync turned on.
• Sends updates based on any new ROWIDs in the container database.
• Keeps sync points (the last ROWIDs sent out) in the local container databases.
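
A toy, in-memory sketch of that per-pass loop; everything here (the dict
"databases", the field names, the print standing in for the actual update)
is illustrative rather than the real daemon:

    containers = [
        {"meta": {"X-Container-Sync-To": "https://cluster2/v1/AUTH_gholt/container2",
                  "X-Container-Sync-Key": "secret"},
         "rows": [(1, "obj-a"), (2, "obj-b"), (3, "obj-c")],      # (ROWID, object name)
         "sync_point": -1},
        {"meta": {}, "rows": [(1, "other")], "sync_point": -1},   # sync not turned on
    ]

    def sync_pass(containers):
        for c in containers:
            to = c["meta"].get("X-Container-Sync-To")
            key = c["meta"].get("X-Container-Sync-Key")
            if not (to and key):
                continue                        # scan every database, skip non-synced ones
            for rowid, name in c["rows"]:
                if rowid > c["sync_point"]:     # only ROWIDs newer than the stored sync point
                    print(f"would send {name} to {to} (signed with the sync key)")
            c["sync_point"] = max(r for r, _ in c["rows"])  # remember how far we got

    sync_pass(containers)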
10. Complexity - swift-container-sync
There are three container databases on different servers for each container.
It would be needless and quite wasteful for each to send all the updates.
The easiest solution is to just have one send out the updates, but:
• What if that one is down?
• Couldn't synchronization be done faster if all three were involved?
Instead, each sends a different third of the updates (assuming 3 replicas here).
• Downside: If one is down, a third of the updates will be delayed until it comes back up.
So, in addition, each node will send all older updates to ensure quicker synchronization.
• Normally, each server does a third of the updates.
• Each server also does all older updates for assurance.
• The vast majority of assurance updates will short-circuit.
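
One simple way to carve the new rows into thirds (the exact split rule is an
assumption for illustration; the slides only say each node handles a different
third) is by ROWID modulo the replica count:

    def rows_this_node_sends(new_rowids, node_index, replica_count=3):
        # node_index is this node's position (0, 1, or 2) among the container's replicas
        return [r for r in new_rowids if r % replica_count == node_index]

    print(rows_this_node_sends(range(1, 7), node_index=0))  # -> [3, 6]
    print(rows_this_node_sends(range(1, 7), node_index=1))  # -> [1, 4]
    print(rows_this_node_sends(range(1, 7), node_index=2))  # -> [2, 5]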
11. In The Weeds
• Two sync points are kept per container database.
• All rows between the two sync points trigger updates. *
• Any rows newer than both sync points cause updates
depending on the node's position for the container (primary
nodes do one third, etc. depending on the replica count of
course).
• After a sync run, the first sync point is set to the newest
ROWID known and the second sync point is set to the newest
ROWID for which all updates have been sent.
* This is a slight lie. It actually only needs to send the two-thirds of updates it isn't
primarily responsible for since it knows it already sent the other third.
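
Putting the two sync points and the node-position rule together, a runnable
sketch in Python; the modulo split and every name are illustrative, and it
sends all of the older rows rather than the two-thirds the footnote describes:

    def sync_run(rowids, sync_point1, sync_point2, node_index, replica_count=3):
        # Returns (rows this node sends, new sync_point1, new sync_point2).
        to_send = []
        for rowid in rowids:
            if sync_point2 < rowid <= sync_point1:
                # Older rows: sent for assurance; most short-circuit remotely.
                to_send.append(rowid)
            elif rowid > sync_point1 and rowid % replica_count == node_index:
                # Newer rows: only the node responsible for this row sends it.
                to_send.append(rowid)
        newest = max(rowids) if rowids else sync_point1
        # SyncPoint1 becomes the newest ROWID known; SyncPoint2 becomes the
        # newest ROWID for which all updates have now been sent.
        return to_send, newest, sync_point1

    # Mirrors the example on the next two slides (replica count 3, the first
    # of the three nodes, i.e. node_index 0):
    sends, sp1, sp2 = sync_run(range(1, 7), -1, -1, node_index=0)
    print(sends, sp1, sp2)   # [3, 6] 6 -1  -- a third of rows 1-6; points become 6 and -1
    sends, sp1, sp2 = sync_run(range(1, 13), sp1, sp2, node_index=0)
    print(sends, sp1, sp2)   # [1, 2, 3, 4, 5, 6, 9, 12] 12 6  -- all of 1-6 plus a third of 7-12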
12. In The Weeds
An example may help. Assume replica count is 3 and perfectly
matching ROWIDs starting at 1.
First sync run, database has 6 rows:
• SyncPoint1 starts as -1.
• SyncPoint2 starts as -1.
• No rows between points, so no "all updates" rows.
• Six rows newer than SyncPoint1, so a third of the rows are sent by
node 1, another third by node 2, remaining third by node 3.
• SyncPoint1 is set as 6 (the newest ROWID known).
• SyncPoint2 is left as -1 since no "all updates" rows were synced.
13. In The Weeds
Next sync run, database has 12 rows:
• SyncPoint1 starts as 6.
• SyncPoint2 starts as -1.
• The rows between -1 and 6 all trigger updates (most of which should
short-circuit on the remote end as having already been done).
• Six more rows newer than SyncPoint1, so a third of the rows are
sent by node 1, another third by node 2, remaining third by node 3.
• SyncPoint1 is set as 12 (the newest ROWID known).
• SyncPoint2 is set as 6 (the newest "all updates" ROWID).
In this way, under normal circumstances each node sends its share of updates
each run and just sends a batch of older updates to ensure nothing was missed.
14. Extras
• swift-container-sync can be configured to spend only a limited amount of time
trying to sync a given container -- this keeps one extremely busy container from
starving out all the others.
• A crash of a container server means lost container database copies that will be
replaced by one of the remaining copies on the other servers. The
reestablished server will get the sync points from the copy, but no updates will
be lost due to the "all updates" algorithm the other two followed.
• Rebalancing the container ring moves container database copies around, but
results in the same behavior as a crashed server would.
• For bidirectional sync setups, the receiver will send the sender back the
updates (though they will short-circuit). The only way I can think of to prevent
that is to track where updates were received from (X-Loop), but that's expensive.
Anything Else?
gholt@rackspace.com
http://tlohg.com/