Dyn is an internet infrastructure company that provides managed DNS and email delivery services. The document discusses an approach called "Dark Architecture" for upgrading systems with zero downtime. It involves running the new and legacy systems in parallel, comparing outputs, and gradually shifting traffic to the new system once it is proven equivalent. The approach aims to continuously deliver value while keeping customers unaffected during the multi-stage migration process.
Gluecon 2013 - Dark Architecture and How to Forklift Upgrade Your System - Dyn Inc
1. Dark Architecture & How to Forklift Upgrade
Your Infrastructure with Zero Downtime
Cory von Wallenstein
Chief Technology Officer,
Dyn Inc.
@cvonwallenstein
@cvonwallenstein from @DynInc at #gluecon
2. But First, Who Is Dyn?
• Internet Infrastructure as a Service
– Managed DNS and Email Delivery
• 230 Global Employees (we bootstrapped to 170)
• Headquarters in Manchester, NH (offices in SFO & UK too)
• Raised first financing in Oct 2012: $38MM from NorthBridge
3. Problem We Are Trying To Solve
• Today: Inputs -> Black Magic (Your Current System Architecture) -> Outputs
• Goal: Inputs -> Different Black Magic (Your New System Architecture) -> Outputs
• Scale
– x10, x10^2, etc.
• Performance
– (t2 - t0) <= (t1 - t0)
• Coupling
– Tight -> Loose
5. Why Things Get This Way
• Time to market reigns supreme
– MVP was very… minimum… on… everything
– Sooner is better than perfect
• Prototype to production to scale without
architectural rigor
– Skillset for system engineering in high demand
• Seen more often in small teams who find
product market fit faster than expected
– Inexperience, but we’ve all been there
6. Dark Architecture
• A way of thinking about, and a technical
approach to, solving the
scale/performance/coupling problem while
enabling the business to succeed and keeping
(some of) your hair
• We stand on shoulders of giants
– Fowler, Amazon, Netflix, etc.
7. High Level of Dark Architecture
• Legacy approach: Flag Day Upgrade/Deploy
– Scope out 3 month upgrade to swap architecture A
to B, turns into 6 months, don’t get to anything else,
cross fingers on flag day, fight fires where broken,
gain weight, lose hair, girlfriend breaks up with you,
team quits, FML…
• Evolved approach: Fowler’s Blue/Green Deploy
– Two copies of system, load balancing to rapidly
deploy new system version, rapidly fail back to
legacy on failure (only one active at a time)
8. High Level of Dark Architecture
• Dark Architecture Approach
– Two copies of the system, both active. Send the inputs for a
workflow to both, compare the outputs, and throw one
away (the system whose output you throw away is
the “dark architecture”). Log and inspect output
differences, and gain confidence in the new system as
the differences go away. Then swap which output you
throw away (effectively bringing the “dark” architecture
“light”). Reach an equilibrium on which workflows get
processed by which system so your business has
flexibility. High five everyone, onward and upward.
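The dual-run idea on this slide can be sketched in a few lines. This is a minimal illustration under assumptions, not Dyn's implementation: `dark_run`, `legacy_count`, and `new_count` are hypothetical names, and outputs are assumed to be comparable with `!=`.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("dark_architecture")


def dark_run(workflow_input: Any,
             light: Callable[[Any], Any],
             dark: Callable[[Any], Any]) -> Any:
    """Send the same input to both systems; only the light output is returned."""
    light_output = light(workflow_input)
    try:
        dark_output = dark(workflow_input)
    except Exception:
        # A failure in the dark system must never affect the caller.
        logger.exception("dark system failed on input %r", workflow_input)
        return light_output
    if dark_output != light_output:
        # Log the difference for later inspection, then throw the dark output away.
        logger.warning("output difference on input %r: light=%r dark=%r",
                       workflow_input, light_output, dark_output)
    return light_output


# Hypothetical stand-ins for one workflow on the legacy and the new system.
def legacy_count(queries):
    return len(queries)


def new_count(queries):
    return sum(1 for _ in queries)


result = dark_run(["a.example.com", "b.example.com"], legacy_count, new_count)
```

Bringing the dark system "light" for this workflow then amounts to swapping the `light` and `dark` arguments.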
9. Tangible Examples
• Scaling Global DNS Stats beyond 17 POPs
– MySQL to Cassandra, Log file rsync to agg counts
10. Tangible Examples
• Scaling Email Delivery beyond 1 billion/month
– Cron to daemon (2011), Perl to Node.js (now)
11. Dark Architecture Manifesto
1. Clear definition of success over ambiguity
– Likely scale/performance measured, may get
blank stares on coupling
2. Continuously deliver value over months of no
visible progress
3. Confidence in functional equivalence over
scope creep
4. ^5’s over finger pointing
5. Plan for failure over cross fingers
12. Dark Architecture Manifesto
6. Customer impact over elegant system
diagrams
7. System flows over system components
8. Operational confidence and familiarity over
trial by fire
9. Having a ten item list over a nine item list
10. Architecture evolution over architecture
revolution
13. Scope and Priority
• Prioritize a backlog of input/output workflows
by amount of pain
– Don’t think on a system component level
• “swap MySQL for Cassandra”
– Think on a system workflow level
• “retrieve query logs and render *.example.com graphs”
– This exercise will force you to hone scope to
exactly where the pain is so you can focus on
delivering the solution to this pain first and save
others for later.
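As a rough sketch of a workflow-level backlog, assuming each workflow is given a pain score: the `Workflow` type, the 1 to 10 scale, and every entry except the graph-rendering workflow quoted on this slide are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workflow:
    name: str   # described as an input/output flow, not as a component swap
    pain: int   # hypothetical score from 1 (mild) to 10 (severe)


def prioritize(backlog: List[Workflow]) -> List[Workflow]:
    """Order the backlog so the most painful workflow is migrated first."""
    return sorted(backlog, key=lambda w: w.pain, reverse=True)


backlog = [
    Workflow("export monthly usage report", pain=3),                          # hypothetical
    Workflow("retrieve query logs and render *.example.com graphs", pain=9),
    Workflow("send account notification email", pain=5),                      # hypothetical
]

ordered = prioritize(backlog)
```

Note that nothing in the backlog reads "swap MySQL for Cassandra"; the storage change is only a means to fixing the first workflow in the ordered list.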
15. Legacy Approach: Week 0
• Legacy System: 100% of functionality enabled, 100% of functionality consumed
16. Legacy Approach: Week 1
• Legacy System: 100% of functionality enabled, 100% of functionality consumed
• New System: 0% of functionality enabled, 0% of functionality consumed
17. Legacy Approach: Week 4
• Legacy System: 100% of functionality enabled, 100% of functionality consumed
• New System: 25% of functionality enabled, 0% of functionality consumed
• Most people start with easy pieces under a
misguided “crawl walk run” philosophy. Quick
wins on easy stuff while saving hard problems
for later rarely ends well.
18. Legacy Approach: Week 8
• Legacy System: 100% of functionality enabled, 100% of functionality consumed
• New System: 35% of functionality enabled, 0% of functionality consumed
• Progress slows as harder problems are encountered
19. Legacy Approach: Week 12
• Legacy System: 100% of functionality enabled, 100% of functionality consumed
• New System: 80% of functionality enabled, 0% of functionality consumed
• 80% of projects spend 80% of their calendar time
at 80% perceived completion. I’m 80% sure.
20. Legacy Approach: Week 24
• Legacy System: 100% of functionality enabled, 100% of functionality consumed
• New System: 100% of functionality enabled, 0% of functionality consumed
• Other fires came up, things took longer than
expected, you know… business. Morale has never
been lower.
21. Legacy Approach: Flag Day!
• Legacy System: 100% of functionality enabled, 0% of functionality consumed
• New System: 100% of functionality enabled, 100% of functionality consumed
24. Dark Architecture Approach: Week 0
• Legacy System: 100% of functionality enabled, 100% of functionality consumed
25. Dark Architecture Approach: Week 1
• Legacy System: 100% of functionality enabled, 100% of functionality consumed
• New System: 0% of functionality enabled, 0% of functionality consumed
26. Dark Architecture Approach: Week 2
• Legacy System: 100% of functionality enabled, 100% of functionality consumed
• New System: 0% of functionality enabled, 0% of functionality consumed
• No functionality yet, just dark architecture
framework for two inputs and two outputs
(throwing one output away)
27. Dark Architecture Approach: Week 3
• Legacy System: 100% of functionality enabled, 100% of functionality consumed
• New System: 2% of functionality enabled, 2% of functionality consumed (dark)
• Throw one away, but log and inspect differences!
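One way to make "log and inspect differences" measurable is to track a mismatch rate per workflow. The sketch below is an illustration under assumptions: `DifferenceLog` and the workflow name are hypothetical, and the talk does not prescribe any particular threshold for "confident".

```python
from collections import Counter
from typing import Any, Optional


class DifferenceLog:
    """Counts how often the dark output differs from the consumed output."""

    def __init__(self) -> None:
        self.total = Counter()
        self.mismatches = Counter()

    def record(self, workflow: str, legacy_output: Any, new_output: Any) -> None:
        self.total[workflow] += 1
        if legacy_output != new_output:
            self.mismatches[workflow] += 1

    def mismatch_rate(self, workflow: str) -> Optional[float]:
        """Fraction of differing runs, or None if the workflow was never seen."""
        seen = self.total[workflow]
        return self.mismatches[workflow] / seen if seen else None


log = DifferenceLog()
# Hypothetical observations for one workflow: three matches, one difference.
for legacy_out, new_out in [(10, 10), (7, 7), (3, 4), (5, 5)]:
    log.record("render graphs", legacy_out, new_out)

rate = log.mismatch_rate("render graphs")
```

Confidence grows as this rate trends toward zero over a representative period of real traffic.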
28. Dark Architecture Approach: Week 4
• Legacy System: 100% of functionality enabled, 98% of functionality consumed
• New System: 2% of functionality enabled, 2% of functionality consumed
• Gain confidence operating with two equal outputs,
switch which one is thrown away for that workflow.
Goes horribly wrong? Switch back.
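The per-workflow switch and the "switch back" escape hatch could look roughly like this. It is a sketch under assumptions: `WorkflowRouter`, `promote`, `fail_back`, and the two stand-in systems are hypothetical names, and both systems are modeled as plain callables.

```python
from typing import Any, Callable, Set

System = Callable[[str, Any], Any]


class WorkflowRouter:
    """Runs both systems for every request and picks which output is consumed."""

    def __init__(self, legacy: System, new: System) -> None:
        self.legacy = legacy
        self.new = new
        self.consumed_from_new: Set[str] = set()  # workflows brought "light"

    def promote(self, workflow: str) -> None:
        """Start consuming the new system's output for this workflow."""
        self.consumed_from_new.add(workflow)

    def fail_back(self, workflow: str) -> None:
        """Goes horribly wrong? Consume the legacy output again."""
        self.consumed_from_new.discard(workflow)

    def handle(self, workflow: str, payload: Any) -> Any:
        legacy_output = self.legacy(workflow, payload)
        new_output = self.new(workflow, payload)
        if workflow in self.consumed_from_new:
            return new_output   # legacy output is the one thrown away
        return legacy_output    # new output is the one thrown away


# Hypothetical stand-ins that tag their output with the system that produced it.
def legacy_system(workflow, payload):
    return ("legacy", payload)


def new_system(workflow, payload):
    return ("new", payload)


router = WorkflowRouter(legacy_system, new_system)
before = router.handle("render graphs", 1)
router.promote("render graphs")
after = router.handle("render graphs", 1)
router.fail_back("render graphs")
reverted = router.handle("render graphs", 1)
```

Because both systems keep running after a promotion, failing back is a routing change rather than a redeploy.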
29. Dark Architecture Approach: Week 12
• Legacy System: 100% of functionality enabled, 80% of functionality consumed
• New System: 20% of functionality enabled, 20% of functionality consumed
• Where do we stand at expected 3 months?
Most painful 20% of problems resolved…
now we have flexibility for what to do next.
30. Customer impact over elegant
system diagrams
• Your customers are not paying you to have
pretty whiteboards of elegant system
architectures
• Your customers are paying you to make their
pain go away. This gets priority.
• It’s OK to have different workflows handled by
different systems to give your team agility
– Other priorities came up? System is stable.
– Have technical debt time? Continue arch migration
31. Parting Takeaways
• Manifesto is a preference, not a rule
• Think in flows not components
• Deliver most painful pieces first so when
priorities change, you’re not left half complete.
• Process success >>> process name
• Be realistic. DA provides flexibility and frequent
victories for morale and some value delivered
sooner, but it won’t necessarily make a full
architecture migration faster in calendar days.