Theseus' Data
Migrating Production Backends with Zero-Downtime
"The ship wherein Theseus and the youth of Athens returned from
Crete had thirty oars, and was preserved by the Athenians down
even to the time of Demetrius Phalereus, for they took away the old
planks as they decayed, putting in new and stronger timber in their
place, in so much that this ship became a standing example among
the philosophers, for the logical question of things that grow; one
side holding that the ship remained the same, and the other
contending that it was not the same."
—Plutarch, Theseus
The name of my talk comes from a parable found in the writings of the 1st-century Greek historian Plutarch. He basically asks: if a ship
traveling from Crete to Athens is replaced board by board, so that upon arriving no plank is original, is it the same ship that arrives in Athens?
“Did anybody drown?”
—Some engineer
Data modeling is hard
In the abstract, it’s all ontology - objects and their relationships.
Data modeling is harder
In reality, you have to be concerned about access patterns and how they map to your data-stores. Say we’ve got a document store (like mongo) and we
want to save third-party demographic data on our users: stuff like Facebook friends, Rapleaf info on your email address, political district info, etc. And it
makes sense to use nested documents, because we’ve got business logic around merging the various sources of data down to attributes on the user
object. And everybody’s happy, particularly the PM who wants to add some other piece of demographic data to be able to slice and dice some UX
experiments.
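To make the shape concrete, here is a hedged sketch of what one of those appended-data documents might look like, written as a plain Ruby hash. The field and vendor attributes are hypothetical, not our actual schema: one outer document per user, one embedded document per third-party vendor.

```ruby
# Hypothetical shape of an "appended user data" document: one outer document
# per user, one embedded document per vendor, merged down to user attributes
# by business logic elsewhere.
appended_user_data = {
  user_id: 12_345,
  facebook: {
    friend_ids: [23, 42, 108],            # used for "friends also signed" features
    updated_at: Time.utc(2014, 3, 1)
  },
  rapleaf: {
    gender:     "female",
    age_range:  "25-34",
    updated_at: Time.utc(2014, 2, 14)
  },
  districts: {
    congressional: "CA-12",
    updated_at:    Time.utc(2014, 1, 20)
  }
}
```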
Ontological semantic drift
“Why does everything keep changing?”

—Some engineer
But this stuff is hard, primarily because things change. Guaranteed. So we’ve got our appended-data nested-documents collection, and then we get this
great idea: let’s batch through all FB-connected users, find their friends who are also change.org members, find what petitions they’ve signed, and send
out recommendation emails (“10 of your friends signed this great petition!”).
And then it crashes. Spectacularly. We’re cursoring through every nested document of every document to pull out a single field that’s not indexed, that
may not even exist. And mongo heats up and pagers go off and infrastructure engineers come over with hatred in their eyes.
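A minimal sketch of what that offending batch might look like, assuming a Mongoid-backed model named AppendedUserData over the hypothetical document above (the helper at the bottom is also hypothetical); the problem is the full-collection scan over an unindexed embedded field.

```ruby
# Illustrative only: cursor through every appended-data document to pull one
# unindexed embedded field. With no index on "facebook.friend_ids", this is a
# full collection scan, and most documents may not even have the field.
AppendedUserData
  .where("facebook.friend_ids" => { "$exists" => true })
  .only(:user_id, "facebook.friend_ids")
  .each do |doc|
    enqueue_recommendation_emails(doc)   # hypothetical helper
  end
```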
Ontological semantic drift
“The future is not later than having been, and having-been is
not earlier than the Present. Temporality temporalizes itself as
a future which makes present in a process of having been.”
—Heidegger, Being and Time
So when you see an engineer pulling her hair out wondering why the data model no longer fits, and you’ve got 8 levels of business logic to handle
mapping things to your current uses, and performance is a mercurial black box, point her to Heidegger. And then run. Being and Time is really large and
hurts when thrown at you.
In reality though, I think the Heideggerian response to the parable of Theseus’ ship is a particularly astute one. What it is to be is to BE in time. It’s not that
there’s an infinite succession of present “Ships” - now and now and now and now - nor that there’s some eternal Ship that exists independent of all of
those particular presents - but rather the very being that is “Shipness” can only be spoken of sensibly against the horizon of time.
Undergrad philosophy lecture aside, this conception of our data models and objects as inherently temporalized and contextualized, reflecting the past and
projecting the future, is a powerful one for tackling technical debt and business changes, perhaps better reformulated as “ontological semantic drift”.
This is what our site looked like in 2007 (excuse the broken images, it’s from the Wayback Machine). Everything centered around a “Change” - that had
Events and Actions, Opinions, Blog posts, etc.
And this is us today. Centered around a petition and its signatures - plus shares, comments, news updates, victory declarations, etc. As you can imagine,
lots has changed with regards to our “ontology”
Some things have changed…
As the business objects of our company changed, their schemas and logic changed as well. Here’s a particularly odd wart in our Rails codebase: a “Petition”
is actually an “Event” (remember the “Actions” in the old screenshot?). As the “Event” gradually changed, it became weird to think of it as an “Event”, so let’s
just call it a “Petition”. Everything’s fixed, right? In this case, really dealing with this gradual but fundamental shift in a core data model was eventually
handled by breaking it out into its own “petition service” that could then encapsulate the logic around what it is to be a petition, and provide a clean
interface to any other components in our application that interacted with petitions. Moving core elements into a service-oriented architecture like this is a
great way to deal with the technical debt and “ontological semantic drift” that occurs to your data models over time. But it’s also expensive.
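For flavor, here is a hedged sketch of what that kind of rename-without-migration wart can look like in a Rails codebase; the class, table, and column names are invented for illustration, not our actual code.

```ruby
# Illustrative only: a "Petition" that is still an "Event" underneath.
# The table and its 2007-era columns never changed; only the class name did.
class Petition < ActiveRecord::Base
  self.table_name = "events"   # the old table lives on

  # Business logic keeps papering over the old ontology:
  scope :live, -> { where(event_type: "petition", deleted: false) }

  def title
    self[:event_name]   # the headline still lives in a generically named column
  end
end
```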
A “simple” case study
Appended User Data in mongo: an outer document with multiple embedded documents (one for each particular vendor). Updates usually touch an entire
embedded document (not the whole outer document). Reads are usually based on a particular user id, or a batch of user ids, with some reads where we just
want to grab a single field in a single nested document without loading the whole thing.
Preparation
First we had to do auditing. Where did we have old data (fields, etc.) in mongo that wasn’t being used at all? Where did we have business logic to work around
those warts? Is it worth doing some initial data sanitization in mongo?
Also, what are our access patterns? What types of data are always queried together, and what is queried separately? How do we want to distribute it? For us that
meant choosing a partition key and clustering key as we moved this to CQL backed by Cassandra.
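As a hedged sketch of where that led (the table and column names are hypothetical), here is the kind of CQL schema those access patterns point toward, written as a Ruby heredoc so it could be executed through a driver session: partition by user so all of a user’s vendor rows live together, cluster by vendor so each old embedded-document-equivalent can be read or updated on its own.

```ruby
# Hypothetical CQL table for appended user data:
#   partition key  = user_id  (all of a user's rows live on one partition)
#   clustering key = vendor   (each old embedded document becomes one row)
CREATE_APPENDED_DATA = <<-CQL
  CREATE TABLE IF NOT EXISTS appended_user_data (
    user_id    bigint,
    vendor     text,
    payload    text,          -- vendor-specific attributes, serialized
    updated_at timestamp,
    PRIMARY KEY ((user_id), vendor)
  )
CQL

# session.execute(CREATE_APPENDED_DATA)   # e.g. via a Cassandra driver session
```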
A side note on hidden complexity
As you start to untangle these old models, you might notice that pulling on one thread pulls on another, which pulls on another. When you notice this happening, you
can have a couple of responses. One, which I’ll discuss in the next example in a little bit, is to head back to the whiteboard and work to
understand what you actually need to build (rather than figuring out how to re-engineer what you already have). The other is to make sure you have a firm
grasp on what you’re actually changing, to ensure you’re changing the smallest independent chunks and building larger changes atop that,
rather than going top-down without realizing all the complexity at the bottom until you get there, 2 weeks late.
Parallel model
•Interface to new data store/model

•Separate models, separate writing

•Some duplication of “fat model” code
So we’ve explored the underlying data and how it’s being stored, as well as spent some time pulling at and scoping out the complexity of the business
logic and how it potentially massages some temporal/semantic drift to “fit” your current application. 
With that understanding, you can then make a parallel model to serve as an interface to your new data store. 
I’ve been a fan of making completely separate models, rather than forking the write/read access portions of the existing model to also write to the other
db. There’s something to be said for not having extra files (that you’re just going to have to go and clean up later), and for not repeating your
business logic (especially making sure you don’t lose something important, like an async trigger after save that fires off an email).
Those two objections stated, I’ve found that a lot of the business logic in your “fat models” is actually just code trying to account for all the ontological
semantic drift that has taken place over time, and by starting with a new file you get a chance to clean up all the cruft around the actually essential
interface, access points, and relationships that you need for some particular object.
Parallel model
So I’ve followed this basic pattern: keeping the old model around and adding another function so I can also access the new model. We’ve copied our
business logic over, discarding all the cruft that’s built up over time - maybe our friends list was an array of strings, or an array of integers, or a string
that needed to be split on some weird delimiter - we don’t really care about any of that now (with our new model). But we’ve kept things like our async
triggers, and other model-level functionality - like determining what a user’s gender is when we’ve got multiple data sources each contributing their best
guess.
Now, we’ve got tests, right? And we’ve added some unit testing within our new model, making sure validation still occurs, the calls to the actual db layer are
correct, etc. But what about testing whether we’ve missed something in porting things over to the new model?
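A minimal sketch of that parallel-model pattern, with everything here (class names, the data-access object, the job class) as assumptions rather than our actual code: the old model stays untouched apart from a bridge to its counterpart, and the new model carries only the essential interface plus the side-effects we can’t afford to lose.

```ruby
# Old mongo-backed model: untouched apart from a bridge to its counterpart.
class AppendedUserData   # hypothetical existing Mongoid model
  # ...years of accumulated business logic...

  def parallel_record
    CassandraAppendedUserData.for_user(user_id)
  end
end

# New Cassandra-backed model: essential interface only, cruft left behind,
# but model-level behaviour we can't lose is ported over.
class CassandraAppendedUserData
  def self.for_user(user_id)
    new(user_id)
  end

  def initialize(user_id)
    @user_id = user_id
  end

  # One row per vendor; AppendedDataStore is a hypothetical thin DAO over the
  # CQL table sketched earlier.
  def save(vendor, payload)
    AppendedDataStore.upsert(@user_id, vendor, payload)
  end

  # Preserved async side-effect, e.g. the after-save email trigger.
  def fire_save_triggers
    Resque.enqueue(AppendedDataEmailJob, @user_id)   # hypothetical job class
  end

  # Same "best guess across vendors" logic as the old fat model, minus the cruft.
  def gender
    rapleaf = AppendedDataStore.read(@user_id, :rapleaf) || {}
    rapleaf["gender"]
  end
end
```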
Parallel model
Let’s make a minor change then, and run our test suite backed by the new model. Assuming we haven’t missed something (and there are no gaping holes around
key flows in our test suite), we should be able to use this as a measure of how completely the new model fulfills the responsibilities of the old one.
Of course in reality things aren’t that simple, and you’ll surely run into issues with things like factories and fixtures. I’ve generally taken that as an
invitation to clean up and speed up test suites by switching to mocked/stubbed objects rather than factories - where it makes sense to do so.
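One hedged way to do that swap, assuming an RSpec suite and an environment flag (both of which are assumptions here, not our actual setup): repoint the old constant at the new class before the suite runs, and treat any failures as a checklist of responsibilities that haven’t been ported yet.

```ruby
# spec/support/appended_data_backend.rb (illustrative)
# Run the existing suite against the new model with:
#   APPENDED_DATA_BACKEND=cassandra bundle exec rspec
RSpec.configure do |config|
  config.before(:suite) do
    if ENV["APPENDED_DATA_BACKEND"] == "cassandra"
      # Repoint the old constant at the new model; failures in the existing
      # suite then measure what the new model doesn't yet cover.
      Object.send(:remove_const, :AppendedUserData) if defined?(AppendedUserData)
      Object.const_set(:AppendedUserData, CassandraAppendedUserData)
    end
  end
end
```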
Shadow writing
Shadow writing
•Backfill or rebuild 

•other source-of-truth?

•Shadow writes to new model

•Triggers
On backfilling: can we ensure data is updated chronologically? Can we start shadow writing before backfilling, and then only backfill data that is older than
the shadow-written data? Can we do this on a per-column basis (thanks, Cassandra)?
Shadow writing to the new model is relatively straightforward. One thing to consider is whether the write needs to be synchronous or whether it can be pushed to
a queue. But any extra asynchrony means doubly ensuring your backfill won’t cause newer data to get overwritten by older data if queue ordering gets
messy.
Finally, triggers: we’ve used rollout flags in the past to decide which model, e.g., sends out an email after_save - the old one or the new one. Presumably
you have test coverage that checks for this, so when we flip things to favoring the new model we can feel relatively confident that we’re not losing any
side-effects or things like that.
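A hedged sketch of what that write path can look like during the shadow phase (the wrapper, flag name, and trigger methods are assumptions; $rollout is the rollout gem’s flag object): the old model stays the source of truth, the new model gets a shadow write, and a flag decides which side owns side-effects like the after-save email.

```ruby
# Illustrative shadow-write path: the old model is still the source of truth.
def save_appended_data(user, vendor, payload)
  old_record = AppendedUserData.find_or_initialize_by(user_id: user.id)
  old_record.write_vendor(vendor, payload)   # hypothetical existing writer
  old_record.save!

  # Shadow write to the new model. This could instead be pushed to a queue
  # (e.g. Resque.enqueue(ShadowWriteJob, user.id, vendor, payload)), at the
  # cost of having to guard against out-of-order writes during the backfill.
  CassandraAppendedUserData.for_user(user.id).save(vendor, payload)

  # Exactly one side fires the after-save trigger, controlled by a rollout flag.
  if $rollout.active?(:appended_data_new_triggers, user)
    CassandraAppendedUserData.for_user(user.id).fire_save_triggers
  else
    old_record.fire_save_triggers            # hypothetical existing trigger
  end
end
```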
A side note on backfilling
A fun story about backfilling that I probably shouldn’t tell. I was planning to enqueue a whole bunch (like 5000+) asynchronous jobs to pull some data
from Redshift and S3 and then run an EMR job to generate some statistics over the data. 
I was dumb, or not paranoid enough (which is often the same thing when we’re talking about doing large data migrations on in-flight production data), and
didn’t do any calculations around how many workers on how many nodes would be available to start processing these jobs. So I enqueued a few around
7pm one Thursday evening, and waited around for a few minutes to make sure they all finished successfully. Then I enqueued the rest of them, and went
out to dinner.
While we’re waiting for our food, I get a HipChat notification: some of the Resque queues are backed up, email sends aren’t going out in a timely fashion,
etc. I immediately realize that I must’ve forgotten to change the backfill jobs to go to a specific backfill queue, and that they’re blocking other jobs from
finishing. Another engineer (who wasn’t out eating dinner, or is more dedicated than I am - thanks, Scott) cancels all the enqueued jobs, and we decide to leave
the in-flight ones running as they should finish up within 15 minutes. I close my phone and promise to check back after dinner.
Here’s what our Redshift query load normally looks like. At any given time, there could be a few to maybe a dozen queries in flight, most lasting only a few
seconds, a few taking a minute or two.
A side note on backfilling
And here’s what our Redshift query load looked like when I got home from dinner.
We were basically trying to dump the same 100M-row table for different time windows 80 times in parallel, which made Redshift very, very sad.
So infrastructure began to just kill the queries in Redshift, thinking that would make the workers drop their connections and time out. Unfortunately, here I
was being just the right amount of paranoid and had a fallback to MySQL in case Redshift dropped the connection (it happens sometimes), and in normal
non-backfill circumstances only one of these would be running at a time. So these 80+ queries then all went awry on our Galera cluster, bringing down the
entire site for a few minutes until we manually killed the Resque processes and canceled the runaway queries.
The moral of the story being -
Moral of the story
Don’t enqueue thousands of background jobs and
then immediately go out to dinner. 

And use a backfill queue.
Be aware of how your backfill is going to affect things like db resources, queues, etc. Sure, this is only going to run once, but you want to be able to turn it
down, turn it up, or turn it off - and you really want to be able to turn it back on/up and have things just continue seamlessly. There’s nothing much worse
than running a backfill for a day and a half only to have to turn it off because a viral campaign caused massively increased site traffic, and then restart the
whole damn thing when the spike is over.
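A hedged sketch of that discipline using Resque (the kill switch, cursor scheme, and model calls are assumptions): a dedicated backfill queue so production jobs never wait behind it, a pause flag so it can be turned off during a traffic spike, and a cursor so turning it back on resumes where it left off.

```ruby
# Illustrative resumable backfill job on its own Resque queue.
class AppendedDataBackfillJob
  @queue = :backfill   # never share a queue with production jobs

  BATCH_SIZE = 500

  def self.perform(last_user_id)
    return if Backfill.paused?(:appended_data)   # hypothetical kill switch (e.g. a Redis flag)

    users = User.where("id > ?", last_user_id).order(:id).limit(BATCH_SIZE)
    users.each do |user|
      record = AppendedUserData.where(user_id: user.id).first
      next unless record
      # Only writes columns older than any shadow-written data, so the backfill
      # can't clobber fresher writes (per-column timestamps help here).
      CassandraAppendedUserData.for_user(user.id).backfill(record)   # hypothetical
    end

    # Re-enqueue with the new cursor; resuming later just means enqueueing one
    # job with the last cursor value.
    Resque.enqueue(AppendedDataBackfillJob, users.last.id) if users.any?
  end
end
```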
Shadow reading
Shadow reading
•First sanity check. 

•Then sanity check again.

•Group specific rollout possible?

•Keep writing to the original model! (see the sketch below)
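A minimal sketch of that shadow-read phase (the reader and helper names are assumptions): the old model still answers and is still written; the new model is read alongside it, compared, and mismatches are logged rather than failing the request.

```ruby
# Illustrative shadow read: the old model answers, the new model is compared.
def friend_ids_for(user)
  record  = AppendedUserData.where(user_id: user.id).first
  old_ids = record && record.facebook_friend_ids     # hypothetical reader

  begin
    new_ids = CassandraAppendedUserData.for_user(user.id).friend_ids   # hypothetical reader
    if Array(old_ids).sort != Array(new_ids).sort
      Rails.logger.warn("appended_data shadow-read mismatch for user #{user.id}")
    end
  rescue => e
    Rails.logger.warn("appended_data shadow read failed for user #{user.id}: #{e.class}")
  end

  old_ids   # the original model remains the source of truth
end
```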
Hold on to your butts
What makes a “simple” case
• Interface will stay relatively unchanged

• Data access patterns generally mirrored
across old and new models 

• Original implementation not intrinsically tied to
data-store/architecture
A “complex” case study
• Client suppressions

• Batch writes

• Batch reads (deliveries, stats)

• Online reads (sponsored petition filtering)
Client suppressions: batch files (with lots of duplicates from previous files) dropped off by the client. We want to load all suppressions, so we don’t show
people in the suppression lists ads from that org, and we don’t deliver opt-ins to that org from people they already know about or who have already
opted out of the org.
We also want to calculate stats (overlap between existing change.org users, new users in the suppression file, etc.), and then be able to query that data with
windows (“show me everybody who opted in between January and March 2014”, taking into account who the organization had suppressed through that
window).
A “complex” case study
We had a relatively complex suppression-store setup - mostly due to issues we had running LevelDB over the network. Much of the business logic was
written so that operations that had to touch LevelDB could all run on a particular node (where the db was running locally). This was accomplished by
shoehorning Resque into a state machine of sorts, which kinda worked, except when it didn’t, or when LevelDB got corrupted, and people got woken up in
the middle of the night because client deliveries were stalled.
Additionally, we were overloading the “value” in LevelDB to be a JSON structure, which is fine if you want to get the whole thing and have a single thread
updating the whole thing, but less fine when you want to get a particular “column”, or have multiple threads trying to update multiple columns (without
stepping on each other’s toes).
A “complex” case study
• What did we actually need to support?

• How had our modeling prevented us from
dealing with ontological semantic drift?
Load files.
Generate stats. 
Online suppression.
Batch suppression. But why? 
Architectural choices (specifically LevelDB on a single node) had driven data modeling.
A new “feature” like “deliver opt-ins for this particular time window” wasn’t possible.
A “complex” case study
Going off primary use-cases, we want to support batch writes in parallel, support online reads (without having to cache outside of the suppression store
like we did with LevelDB), support more redundancy in the data, more availability, etc.
But something that provides all of this is bad at providing aggregate stats.
So let’s pull that out into a separate problem, use Hadoop for what it is good at (tallying up co-occurrences in large text files), and be done with it.
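As a stand-in for what that Hadoop stats job computes (the file layout, field positions, and bucket names are all assumptions), here is a small Ruby tally over suppression-file rows; the real job runs essentially the same counting logic as a map/reduce over much larger files.

```ruby
# Illustrative tally: given tab-separated rows of
#   client_id <TAB> email <TAB> bucket   (bucket = existing_member / new / opted_out)
# count co-occurrences of (client, bucket). Run as: ruby tally.rb suppressions.tsv
counts = Hash.new(0)

ARGF.each_line do |line|
  client_id, _email, bucket = line.chomp.split("\t")
  counts[[client_id, bucket]] += 1
end

counts.sort.each do |(client_id, bucket), n|
  puts [client_id, bucket, n].join("\t")
end
```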
A “complex” case study
• Rollout without shadow writing

• Two separate worlds need to run

• Again, backfill woes!

• Shadow delivery with automated checks

• Per group rollout
Some takeaways
• Sometimes you have to scrap the whole thing

• Service-oriented-architecture to encapsulate
concerns
Sometimes it’s worth it to scrap the whole thing and rewrite it. A service-oriented architecture is helpful here for separating out concerns and
having the core business logic deal with clearly defined interfaces.
Some takeaways
• But that’s expensive

• Technical debt and technology/scaling
changes

• Ontological semantic drift
But that’s expensive. By thinking about technical debt and technology/scaling changes in terms of ontological semantic drift - i.e., as an essential part of
your data models themselves - it can be easier to structure models that lend themselves to flexibility without having some present-version of the future baked in.
Some takeaways
When dealing with existing ontological semantic
drift, determining the complexity level of a change
- before embarking on writing code - is crucial.
Some takeaways
Simpler changes generally involve the same steps:

1. create new and backfill

2. shadow-write new

3. shadow-read new, verify against old

4. real read new, shadow-write & verify against old (see the sketch after this list)

5. remove old
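Steps 3 and 4 are where a rollout flag earns its keep. A hedged sketch of a read path graduating through those steps (the flag name, readers, and comparison helpers are assumptions; $rollout is the rollout gem’s flag object):

```ruby
# Illustrative read path gated by a rollout flag.
def appended_data_for(user)
  if $rollout.active?(:appended_data_read_new, user)
    # Step 4: the new model answers; the old model is still written and verified.
    result = CassandraAppendedUserData.for_user(user.id)
    verify_against_old(user, result)   # hypothetical out-of-band comparison
    result
  else
    # Step 3: the old model answers; the new model is shadow-read and compared.
    result = AppendedUserData.where(user_id: user.id).first
    shadow_read_new(user, result)      # hypothetical out-of-band comparison
    result
  end
end
```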
Some takeaways
• Verification (of end-results) is possible even
where shadow-verification is not. 

• Gradually favoring the new data model can
also provide a level of assurance.
Some takeaways
More complex changes should force you back to the whiteboard to see what your objects actually are all about, which is intrinsically related to how you
access them. 
“Being prepared for ontological semantic drift” doesn’t mean future-tripping and trying to plan for the future-as-present, but rather making interfaces
and side-effects more explicit so you don’t have a tangled mess of craziness when things necessarily change over time.
Thanks for listening!
I want to thank my colleagues at change.org for supporting me, even when I
break the site while out to dinner. 

Vijay Ramesh

Software Engineer, Data Science

vijay@change.org

vijaykramesh
We’re hiring!
If problems like these keep you up at night, we’d
love to have you join our team! 

Check out change.org/careers or come chat with
me after the talk.
