Theseus' Data
Migrating Production Backends with Zero-Downtime
"The ship wherein Theseus and the youth of Athens returned from
Crete had thirty oars, and was preserved by the Athenians down
even to the time of Demetrius Phalereus, for they took away the old
planks as they decayed, putting in new and stronger timber in their
place, in so much that this ship became a standing example among
the philosophers, for the logical question of things that grow; one
side holding that the ship remained the same, and the other
contending that it was not the same."
—Plutarch, Theseus
The name of my talk comes from a parable found in the writings of the 1st-century Greek historian Plutarch. He basically asks: if a ship
traveling from Crete to Athens is replaced board by board, so that upon arriving no plank is original, is it the same ship that arrives in Athens?
“Did anybody drown?”
—Some engineer
Data modeling is hard
In the abstract, it’s all ontology - objects and their relationships.
Data modeling is harder
In reality, you have to be concerned about access patterns and how they map to your data-stores. Say we’ve got a document store (like mongo) and we
want to save third-party demographic data on our users: stuff like Facebook friends, Rapleaf info on your email address, political district info, etc. And it
makes sense to use nested documents, because we’ve got business logic around merging the various sources of data down to attributes on the user
object. And everybody’s happy, particularly the PM who wants to add some other piece of demographic data to be able to slice and dice some UX
experiments.
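To make the shape concrete, here is a hedged sketch of what one of those appended-data documents might look like, written as a plain Ruby hash. The field and vendor attributes are hypothetical, not our actual schema: one outer document per user, one embedded document per third-party vendor.

```ruby
# Hypothetical shape of an "appended user data" document: one outer document
# per user, one embedded document per vendor, merged down to user attributes
# by business logic elsewhere.
appended_user_data = {
  user_id: 12_345,
  facebook: {
    friend_ids: [23, 42, 108],            # used for "friends also signed" features
    updated_at: Time.utc(2014, 3, 1)
  },
  rapleaf: {
    gender:     "female",
    age_range:  "25-34",
    updated_at: Time.utc(2014, 2, 14)
  },
  districts: {
    congressional: "CA-12",
    updated_at:    Time.utc(2014, 1, 20)
  }
}
```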
Ontological semantic drift
“Why does everything keep changing?”

—Some engineer
But this stuff is hard, primarily because things change. Guaranteed. So we’ve got our appended-data nested-documents collection, and then we get this
great idea: let’s batch through all FB-connected users, find their friends who are also change.org members, find what petitions they’ve signed, and send
out recommendation emails (“10 of your friends signed this great petition!”).
And then it crashes. Spectacularly. We’re cursoring through every nested document of every document to pull out a single field that’s not indexed, that
may not even exist. And mongo heats up and pagers go off and infrastructure engineers come over with hatred in their eyes.
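A minimal sketch of what that offending batch might look like, assuming a Mongoid-backed model named AppendedUserData over the hypothetical document above (the helper at the bottom is also hypothetical); the problem is the full-collection scan over an unindexed embedded field.

```ruby
# Illustrative only: cursor through every appended-data document to pull one
# unindexed embedded field. With no index on "facebook.friend_ids", this is a
# full collection scan, and most documents may not even have the field.
AppendedUserData
  .where("facebook.friend_ids" => { "$exists" => true })
  .only(:user_id, "facebook.friend_ids")
  .each do |doc|
    enqueue_recommendation_emails(doc)   # hypothetical helper
  end
```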
Ontological semantic drift
“The future is not later than having been, and having-been is
not earlier than the Present. Temporality temporalizes itself as
a future which makes present in a process of having been.”
—Heidegger, Being and Time
So when you see an engineer pulling her hair out wondering why the data model no longer fits, and you’ve got 8 levels of business logic to handle
mapping things to your current uses, and performance is a mercurial black box, point her to Heidegger. And then run. Being and Time is really large and
hurts when thrown at you.
In reality though, I think the Heideggerian response to the parable of Theseus’ ship is a particularly astute one. What it is to be is to BE in time. It’s not that
there’s an infinite succession of present “Ships” - now and now and now and now - nor that there’s some eternal Ship that exists independent of all of
those particular presents - but rather the very being that is “Shipness” can only be spoken of sensibly against the horizon of time.
Undergrad philosophy lecture aside, this conception of our data models and objects as inherently temporalized and contextualized, reflecting the past and
projecting the future, is a powerful one for tackling technical debt and business changes, perhaps better reformulated as “ontological semantic drift”.
This is what our site looked like in 2007 (excuse the broken images, it’s from the Wayback Machine). Everything centered around a “Change” - that had
Events and Actions, Opinions, Blog posts, etc.
And this is us today. Centered around a petition and its signatures - plus shares, comments, news updates, victory declarations, etc. As you can imagine,
lots has changed with regards to our “ontology”
Some things have changed…
As the business objects of our company changed, their schemas and logic changed as well. Here’s a particularly odd wart in our Rails codebase: a “Petition”
is actually an “Event” (remember the “Actions” in the old screenshot?). As the “Event” gradually changed, it became weird to think of it as an “Event”, so let’s
just call it a “Petition”. Everything’s fixed, right? In this case, really dealing with this gradual but fundamental shift in a core data model was eventually
handled by breaking it out into its own “petition service” that could then encapsulate the logic around what it is to be a petition, and provide a clean
interface to any other components in our application that interacted with petitions. Moving core elements into a service-oriented architecture like this is a
great way to deal with the technical debt and “ontological semantic drift” that occurs to your data models over time. But it’s also expensive.
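For flavor, here is a hedged sketch of what that kind of rename-without-migration wart can look like in a Rails codebase; the class, table, and column names are invented for illustration, not our actual code.

```ruby
# Illustrative only: a "Petition" that is still an "Event" underneath.
# The table and its 2007-era columns never changed; only the class name did.
class Petition < ActiveRecord::Base
  self.table_name = "events"   # the old table lives on

  # Business logic keeps papering over the old ontology:
  scope :live, -> { where(event_type: "petition", deleted: false) }

  def title
    self[:event_name]   # the headline still lives in a generically named column
  end
end
```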
A “simple” case study
Appended User Data in mongo: an outer document with multiple embedded documents (one for each particular vendor). Updates usually touch an entire
embedded document (not the whole outer document). Reads are usually based on a particular user id, or a batch of user ids, with some reads where we just
want to grab a single field in a single nested document without loading the whole thing.
Preparation
First we had to do auditing. Where did we have old data (fields, etc.) in mongo that wasn’t being used at all? Where did we have business logic to work around
those warts? Is it worth doing some initial data sanitization in mongo?
Also, what are our access patterns? What types of data are always queried together, and what is queried separately? How do we want to distribute it? For us that
meant choosing a partition key and clustering key as we moved this to CQL backed by Cassandra.
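As a hedged sketch of where that led (the table and column names are hypothetical), here is the kind of CQL schema those access patterns point toward, written as a Ruby heredoc so it could be executed through a driver session: partition by user so all of a user’s vendor rows live together, cluster by vendor so each old embedded-document-equivalent can be read or updated on its own.

```ruby
# Hypothetical CQL table for appended user data:
#   partition key  = user_id  (all of a user's rows live on one partition)
#   clustering key = vendor   (each old embedded document becomes one row)
CREATE_APPENDED_DATA = <<-CQL
  CREATE TABLE IF NOT EXISTS appended_user_data (
    user_id    bigint,
    vendor     text,
    payload    text,          -- vendor-specific attributes, serialized
    updated_at timestamp,
    PRIMARY KEY ((user_id), vendor)
  )
CQL

# session.execute(CREATE_APPENDED_DATA)   # e.g. via a Cassandra driver session
```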
A side note on hidden complexity
As you start to untangle these old models, you might notice that pulling on one thread pulls on another, which pulls on another. When you notice this happening, you
can have a couple of responses. One, which I’ll discuss in the next example in a little bit, is to head back to the whiteboard and work to
understand what you actually need to build (rather than figuring out how to re-engineer what you already have). The other is to make sure you have a firm
grasp on what you’re actually changing, to ensure you’re changing the smallest independent chunks and building larger changes atop that,
rather than going top-down without realizing all the complexity at the bottom until you get there, 2 weeks late.
Parallel model
•Interface to new data store/model

•Separate models, separate writing

•Some duplication of “fat model” code
So we’ve explored the underlying data and how it’s being stored, as well as spent some time pulling at and scoping out the complexity of the business
logic and how it potentially massages some temporal/semantic drift to “fit” your current application. 
With that understanding, you can then make a parallel model to serve as an interface to your new data store. 
I’ve been a fan of making completely separate models, rather than forking the write/read access portions of the existing model to also write to the other
db. There’s something to be said for not having extra files (that you’re just going to have to go and clean up later), and for not repeating your
business logic (especially making sure you don’t lose something important, like an async trigger after save that fires off an email).
Those two objections stated, I’ve found that a lot of the business logic in your “fat models” is actually just code trying to account for all the ontological
semantic drift that has taken place over time, and by starting with a new file you get a chance to clean up all the cruft around the actually essential
interface, access points, and relationships that you need for some particular object.
Parallel model
So I’ve followed this basic pattern: keeping the old model around and adding another function so I can also access the new model. We’ve copied our
business logic over, discarding all the cruft that’s built up over time - maybe our friends list was an array of strings, or an array of integers, or a string
that needed to be split on some weird delimiter - we don’t really care about any of that now (with our new model). But we’ve kept things like our async
triggers, and other model-level functionality - like determining what a user’s gender is when we’ve got multiple data sources each contributing their best
guess.
Now, we’ve got tests, right? And we’ve added some unit testing within our new model, making sure validation still occurs, the calls to the actual db layer are
correct, etc. But what about testing whether we’ve missed something in porting things over to the new model?
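A minimal sketch of that parallel-model pattern, with everything here (class names, the data-access object, the job class) as assumptions rather than our actual code: the old model stays untouched apart from a bridge to its counterpart, and the new model carries only the essential interface plus the side-effects we can’t afford to lose.

```ruby
# Old mongo-backed model: untouched apart from a bridge to its counterpart.
class AppendedUserData   # hypothetical existing Mongoid model
  # ...years of accumulated business logic...

  def parallel_record
    CassandraAppendedUserData.for_user(user_id)
  end
end

# New Cassandra-backed model: essential interface only, cruft left behind,
# but model-level behaviour we can't lose is ported over.
class CassandraAppendedUserData
  def self.for_user(user_id)
    new(user_id)
  end

  def initialize(user_id)
    @user_id = user_id
  end

  # One row per vendor; AppendedDataStore is a hypothetical thin DAO over the
  # CQL table sketched earlier.
  def save(vendor, payload)
    AppendedDataStore.upsert(@user_id, vendor, payload)
  end

  # Preserved async side-effect, e.g. the after-save email trigger.
  def fire_save_triggers
    Resque.enqueue(AppendedDataEmailJob, @user_id)   # hypothetical job class
  end

  # Same "best guess across vendors" logic as the old fat model, minus the cruft.
  def gender
    rapleaf = AppendedDataStore.read(@user_id, :rapleaf) || {}
    rapleaf["gender"]
  end
end
```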
Parallel model
Let’s make a minor change then, and run our test suite backed by the new model. Assuming we haven’t missed something (and there are no gaping holes around
key flows in our test suite), we should be able to use this as a measure of how completely the new model fulfills the responsibilities of the old one.
Of course in reality things aren’t that simple, and you’ll surely run into issues with things like factories and fixtures. I’ve generally taken that as an
invitation to clean up and speed up test suites by switching to mocked/stubbed objects rather than factories - where it makes sense to do so.
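One hedged way to do that swap, assuming an RSpec suite and an environment flag (both of which are assumptions here, not our actual setup): repoint the old constant at the new class before the suite runs, and treat any failures as a checklist of responsibilities that haven’t been ported yet.

```ruby
# spec/support/appended_data_backend.rb (illustrative)
# Run the existing suite against the new model with:
#   APPENDED_DATA_BACKEND=cassandra bundle exec rspec
RSpec.configure do |config|
  config.before(:suite) do
    if ENV["APPENDED_DATA_BACKEND"] == "cassandra"
      # Repoint the old constant at the new model; failures in the existing
      # suite then measure what the new model doesn't yet cover.
      Object.send(:remove_const, :AppendedUserData) if defined?(AppendedUserData)
      Object.const_set(:AppendedUserData, CassandraAppendedUserData)
    end
  end
end
```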
Shadow writing
Shadow writing
•Backfill or rebuild 

•other source-of-truth?

•Shadow writes to new model

•Triggers
On backfilling: can we ensure data is updated chronologically? Can we start shadow writing before backfilling, and then only backfill data that is older than
the shadow-written data? Can we do this on a per-column basis (thanks, Cassandra)?
Shadow writing to the new model is relatively straightforward. One thing to consider is whether the write needs to be synchronous or whether it can be pushed to
a queue. But any extra asynchrony means doubly ensuring your backfill won’t cause newer data to get overwritten by older data if queue ordering gets
messy.
Finally, triggers: we’ve used rollout flags in the past to decide which model, e.g., sends out an email after_save - the old one or the new one. Presumably
you have test coverage that checks for this, so when we flip things to favoring the new model we can feel relatively confident that we’re not losing any
side-effects or things like that.
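A hedged sketch of what that write path can look like during the shadow phase (the wrapper, flag name, and trigger methods are assumptions; $rollout is the rollout gem’s flag object): the old model stays the source of truth, the new model gets a shadow write, and a flag decides which side owns side-effects like the after-save email.

```ruby
# Illustrative shadow-write path: the old model is still the source of truth.
def save_appended_data(user, vendor, payload)
  old_record = AppendedUserData.find_or_initialize_by(user_id: user.id)
  old_record.write_vendor(vendor, payload)   # hypothetical existing writer
  old_record.save!

  # Shadow write to the new model. This could instead be pushed to a queue
  # (e.g. Resque.enqueue(ShadowWriteJob, user.id, vendor, payload)), at the
  # cost of having to guard against out-of-order writes during the backfill.
  CassandraAppendedUserData.for_user(user.id).save(vendor, payload)

  # Exactly one side fires the after-save trigger, controlled by a rollout flag.
  if $rollout.active?(:appended_data_new_triggers, user)
    CassandraAppendedUserData.for_user(user.id).fire_save_triggers
  else
    old_record.fire_save_triggers            # hypothetical existing trigger
  end
end
```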
A side note on backfilling
A fun story about backfilling that I probably shouldn’t tell. I was planning to enqueue a whole bunch (like 5000+) asynchronous jobs to pull some data
from Redshift and S3 and then run an EMR job to generate some statistics over the data. 
I was dumb, or not paranoid enough (which is often the same thing when we’re talking about doing large data migrations on in-flight production data), and
didn’t do any calculations around how many workers on how many nodes would be available to start processing these jobs. So I enqueued a few around
7pm one Thursday evening, and waited around for a few minutes to make sure they all finished successfully. Then I enqueued the rest of them, and went
out to dinner.
While we’re waiting for our food, I get a HipChat notification: some of the Resque queues are backed up, email sends aren’t going out in a timely fashion,
etc. I immediately realize that I must’ve forgotten to change the backfill jobs to go to a specific backfill queue, and that they’re blocking other jobs from
finishing. Another engineer (who wasn’t out eating dinner, or is more dedicated than I am - thanks, Scott) cancels all the enqueued jobs, and we decide to leave
the in-flight ones running as they should finish up within 15 minutes. I close my phone and promise to check back after dinner.
Here’s what our Redshift query load normally looks like. At any given time, there could be a few to maybe a dozen queries in flight, most lasting only a few
seconds, a few taking a minute or two.
A side note on backfilling
And here’s what our Redshift query load looked like when I got home from dinner.
We were basically trying to dump the same 100M-row table for different time windows 80 times in parallel, which made Redshift very, very sad.
So infrastructure began to just kill the queries in Redshift, thinking that would make the workers drop their connections and time out. Unfortunately, here I
was being just the right amount of paranoid and had a fallback to MySQL in case Redshift dropped the connection (it happens sometimes), and in normal
non-backfill circumstances only one of these would be running at a time. So these 80+ queries then all went awry on our Galera cluster, bringing down the
entire site for a few minutes until we manually killed the Resque processes and canceled the runaway queries.
The moral of the story being -
Moral of the story
Don’t enqueue thousands of background jobs and
then immediately go out to dinner. 

And use a backfill queue.
Be aware of how your backfill is going to affect things like db resources, queues, etc. Sure, this is only going to run once, but you want to be able to turn it
down, turn it up, or turn it off - and you really want to be able to turn it back on/up and have things just continue seamlessly. There’s nothing much worse
than running a backfill for a day and a half only to have to turn it off because a viral campaign caused massively increased site traffic, and then restart the
whole damn thing when the spike is over.
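A hedged sketch of that discipline using Resque (the kill switch, cursor scheme, and model calls are assumptions): a dedicated backfill queue so production jobs never wait behind it, a pause flag so it can be turned off during a traffic spike, and a cursor so turning it back on resumes where it left off.

```ruby
# Illustrative resumable backfill job on its own Resque queue.
class AppendedDataBackfillJob
  @queue = :backfill   # never share a queue with production jobs

  BATCH_SIZE = 500

  def self.perform(last_user_id)
    return if Backfill.paused?(:appended_data)   # hypothetical kill switch (e.g. a Redis flag)

    users = User.where("id > ?", last_user_id).order(:id).limit(BATCH_SIZE)
    users.each do |user|
      record = AppendedUserData.where(user_id: user.id).first
      next unless record
      # Only writes columns older than any shadow-written data, so the backfill
      # can't clobber fresher writes (per-column timestamps help here).
      CassandraAppendedUserData.for_user(user.id).backfill(record)   # hypothetical
    end

    # Re-enqueue with the new cursor; resuming later just means enqueueing one
    # job with the last cursor value.
    Resque.enqueue(AppendedDataBackfillJob, users.last.id) if users.any?
  end
end
```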
Shadow reading
Shadow reading
•First sanity check. 

•Then sanity check again.

•Group specific rollout possible?

•Keep writing to the original model! (see the sketch below)
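A minimal sketch of that shadow-read phase (the reader and helper names are assumptions): the old model still answers and is still written; the new model is read alongside it, compared, and mismatches are logged rather than failing the request.

```ruby
# Illustrative shadow read: the old model answers, the new model is compared.
def friend_ids_for(user)
  record  = AppendedUserData.where(user_id: user.id).first
  old_ids = record && record.facebook_friend_ids     # hypothetical reader

  begin
    new_ids = CassandraAppendedUserData.for_user(user.id).friend_ids   # hypothetical reader
    if Array(old_ids).sort != Array(new_ids).sort
      Rails.logger.warn("appended_data shadow-read mismatch for user #{user.id}")
    end
  rescue => e
    Rails.logger.warn("appended_data shadow read failed for user #{user.id}: #{e.class}")
  end

  old_ids   # the original model remains the source of truth
end
```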
Hold on to your butts
What makes a “simple” case
• Interface will stay relatively unchanged

• Data access patterns generally mirrored
across old and new models 

• Original implementation not intrinsically tied to
data-store/architecture
A “complex” case study
• Client suppressions

• Batch writes

• Batch reads (deliveries, stats)

• Online reads (sponsored petition filtering)
Client suppressions: batch files (with lots of duplicates from previous files) dropped off by the client. We want to load all suppressions, so we don’t show
people in the suppression lists ads from that org, and we don’t deliver opt-ins to that org from people they already know about or who have already
opted out of the org.
We also want to calculate stats (overlap between existing change.org users, new users in the suppression file, etc.), and then be able to query that data with
windows (“show me everybody who opted in between January and March 2014”, taking into account who the organization had suppressed through that
window).
A “complex” case study
We had a relatively complex suppression-store setup - mostly due to issues we had running LevelDB over the network. Much of the business logic was
written so that operations that had to touch LevelDB could all run on a particular node (where the db was running locally). This was accomplished by
shoehorning Resque into a state machine of sorts, which kinda worked, except when it didn’t, or when LevelDB got corrupted, and people got woken up in
the middle of the night because client deliveries were stalled.
Additionally, we were overloading the “value” in LevelDB to be a JSON structure, which is fine if you want to get the whole thing and have a single thread
updating the whole thing, but less fine when you want to get a particular “column”, or have multiple threads trying to update multiple columns (without
stepping on each other’s toes).
A “complex” case study
• What did we actually need to support?

• How had our modeling prevented us from
dealing with ontological semantic drift?
Load files.
Generate stats. 
Online suppression.
Batch suppression. But why? 
Architectural choices (specifically LevelDB on a single node) had driven data modeling.
A new “feature” like “deliver opt-ins for this particular time window” wasn’t possible.
A “complex” case study
Going off primary use-cases, we want to support batch writes in parallel, support online reads (without having to cache outside of the suppression store
like we did with LevelDB), support more redundancy in the data, more availability, etc.
But something that provides all of this is bad at providing aggregate stats.
So let’s pull that out into a separate problem, use Hadoop for what it is good at (tallying up co-occurrences in large text files), and be done with it.
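As a stand-in for what that Hadoop stats job computes (the file layout, field positions, and bucket names are all assumptions), here is a small Ruby tally over suppression-file rows; the real job runs essentially the same counting logic as a map/reduce over much larger files.

```ruby
# Illustrative tally: given tab-separated rows of
#   client_id <TAB> email <TAB> bucket   (bucket = existing_member / new / opted_out)
# count co-occurrences of (client, bucket). Run as: ruby tally.rb suppressions.tsv
counts = Hash.new(0)

ARGF.each_line do |line|
  client_id, _email, bucket = line.chomp.split("\t")
  counts[[client_id, bucket]] += 1
end

counts.sort.each do |(client_id, bucket), n|
  puts [client_id, bucket, n].join("\t")
end
```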
A “complex” case study
• Rollout without shadow writing

• Two separate worlds need to run

• Again, backfill woes!

• Shadow delivery with automated checks

• Per group rollout
Some takeaways
• Sometimes you have to scrap the whole thing

• Service-oriented-architecture to encapsulate
concerns
Sometimes it’s worth it to scrap the whole thing and rewrite it. A service-oriented architecture is helpful here for separating out concerns and
having the core business logic deal with clearly defined interfaces.
Some takeaways
• But that’s expensive

• Technical debt and technology/scaling
changes

• Ontological semantic drift
But that’s expensive. By thinking about technical debt and technology/scaling changes in terms of ontological semantic drift - i.e., as an essential part of
your data models themselves - it can be easier to structure models that lend themselves to flexibility without having some present-version of the future baked in.
Some takeaways
When dealing with existing ontological semantic
drift, determining the complexity level of a change
- before embarking on writing code - is crucial.
Some takeaways
Simpler changes generally involve the same steps:

1. create new and backfill

2. shadow-write new

3. shadow-read new, verify against old

4. real read new, shadow-write & verify against old (see the sketch after this list)

5. remove old
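Steps 3 and 4 are where a rollout flag earns its keep. A hedged sketch of a read path graduating through those steps (the flag name, readers, and comparison helpers are assumptions; $rollout is the rollout gem’s flag object):

```ruby
# Illustrative read path gated by a rollout flag.
def appended_data_for(user)
  if $rollout.active?(:appended_data_read_new, user)
    # Step 4: the new model answers; the old model is still written and verified.
    result = CassandraAppendedUserData.for_user(user.id)
    verify_against_old(user, result)   # hypothetical out-of-band comparison
    result
  else
    # Step 3: the old model answers; the new model is shadow-read and compared.
    result = AppendedUserData.where(user_id: user.id).first
    shadow_read_new(user, result)      # hypothetical out-of-band comparison
    result
  end
end
```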
Some takeaways
• Verification (of end-results) is possible even
where shadow-verification is not. 

• Gradually favoring the new data model can
also provide a level of assurance.
Some takeaways
More complex changes should force you back to the whiteboard to see what your objects actually are all about, which is intrinsically related to how you
access them. 
“Being prepared for ontological semantic drift” doesn’t mean future-tripping and trying to plan for the future-as-present, but rather making interfaces
and side-effects more explicit so you don’t have a tangled mess of craziness when things necessarily change over time.
Thanks for listening!
I want to thank my colleagues at change.org for supporting me, even when I
break the site while out to dinner. 

Vijay Ramesh

Software Engineer, Data Science

vijay@change.org

vijaykramesh
We’re hiring!
If problems like these keep you up at night, we’d
love to have you join our team! 

Check out change.org/careers or come chat with
me after the talk.
