Architecting for Failure
                         Michael Brunton-Spall
                            @bruntonspall
                         QCon London 2012




Thursday, 8 March 2012
The inevitability of failure
                    • If you take nothing else away:
                     • Systems will fail
                    • Architect for failure
                     • prevention
                     • mitigation

guardian.co.uk
                           circa 2008


Basic Architecture

                                Apache


                               AppServer


                               Database


Basic Architecture


                    • This is your basic J2EE stack
                    • Let’s apply scaling basics


Scaling
                                     Load Balancer

                          Apache        Apache        Apache

                                     Load Balancer

                         AppServer    AppServer      AppServer

                                       Database

Scaled Architecture

                    • We don’t scale databases this way
                    • Load balancers give scaling
                    • Also a bit of spatial redundancy
                    • But what about our data center?

Redundancy
                                   Global Load Balancer

                         Load Balancer              Load Balancer

            Apache                Apache         Apache          Apache

                         Load Balancer              Load Balancer

        AppServer               AppServer      AppServer       AppServer

                           Database                       Database
Redundant Architecture
                    • Real spatial redundancy
                    • Global load balancing via DNS
                    • Twin datacenters
                     • Redundant power
                     • Redundant internet connectivity
                    • Database in Active/Passive
Success stories

                    • Serves 3.5m unique daily browsers
                    • Over 1.6m unique pieces of content
                    • Supports hundreds of editorial staff
                    • Staff create articles, audio, video, galleries,
                         interactives, micro-sites



Drawbacks
                    • Monolithic system
                     • Understands everything
                       • football leagues
                       • financial markets
                       • mortgage applications
                       • content!
The microapps revolution circa 2011


Microapps

                    • Separation of Systems
                    • SSI-like technology
                    • HTTP


AppEngine, Python, Ruby, EC2 - Oh My
                    • Proliferation of systems, languages and
                         frameworks
                    • Faster development
                    • Increased innovation
                    • Hack Days!
                    • Built on content API
Microapps Architecture
                         Apache


                    AppServer


                         Database



Microapps Architecture
                         Apache       EC2


                    AppServer       AppEngine


                         Database



Microapps Architecture
                         Apache       EC2


                    AppServer       AppEngine


                         Database   Content
                                    API (EC2)


Microapps Architecture
                         Apache               EC2


                    AppServer       Cache   AppEngine


                         Database           Content
                                            API (EC2)


The cost of diversification

                    • Support
                    • Maintenance
                    • Decided to settle on JVM stack primarily


Benefits

                    • Lots of small simple applications
                    • Can code, release, test in isolation
                    • Cache
                     • max-age
                     • stale-if-error
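
A minimal sketch of how a microapp response might carry these cache directives. The helper name `cache_control` and the durations are assumptions for illustration; the directive names come from HTTP caching and its stale-content extensions.

```python
from typing import Optional


def cache_control(max_age: int,
                  stale_if_error: Optional[int] = None,
                  stale_while_revalidate: Optional[int] = None) -> str:
    """Build a Cache-Control header value; all durations are in seconds."""
    directives = [f"max-age={max_age}"]
    if stale_while_revalidate is not None:
        directives.append(f"stale-while-revalidate={stale_while_revalidate}")
    if stale_if_error is not None:
        directives.append(f"stale-if-error={stale_if_error}")
    return ", ".join(directives)


# Hypothetical policy: fresh for one minute; if the microapp starts
# returning errors, the cache may keep serving the old copy for a day.
header = cache_control(60, stale_if_error=86400)
print("Cache-Control:", header)
```

With a header like this, a failing microapp degrades to slightly old content rather than an error on the page.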

Drawbacks

                    • Increased architectural complexity
                     • Need a big cache
                     • Context


The biggest problem

                    • Microapp latency affects CMS latency
                    • Failure is not a problem
                    • Slow is a problem
                     • stale-while-revalidate?
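
One way to turn "slow" into an ordinary failure is to put a deadline on every microapp call and fall back to a stale copy when it is missed. This is only a sketch under assumptions: `fetch_with_deadline`, the pool size and the timings are hypothetical, not the system described in the talk.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool for outbound microapp calls (size is an assumption).
_pool = ThreadPoolExecutor(max_workers=4)


def fetch_with_deadline(fetch, deadline_s, fallback):
    """Return fetch() if it finishes within deadline_s seconds, else fallback."""
    future = _pool.submit(fetch)
    try:
        return future.result(timeout=deadline_s)
    except FutureTimeout:
        # Too slow: treat it exactly like a failure and serve the stale copy.
        return fallback
    except Exception:
        # fetch() itself raised: also serve the stale copy.
        return fallback


def slow_microapp():
    time.sleep(0.3)  # simulates a microapp that has become slow
    return "fresh fragment"


print(fetch_with_deadline(slow_microapp, 0.05, "stale fragment"))
```

Note that the slow call keeps running in its worker thread after the deadline; the caller simply stops waiting for it.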

Emergency Mode




Emergency Mode

                    • Dynamic pages are expensive
                    • ‘Peaky’ traffic
                    • Often small subset of functionality
                    • Trade off dynamism for speed

Emergency Mode

                    • Caches do not expire based on time
                    • Serve pressed pages first
                    • Render pages from caches second
                    • Render page as normal finally
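
The serving order above can be sketched roughly as follows. The function `serve`, the dictionary-based stores and the 60 second TTL are assumptions made for illustration; the behaviour outside emergency mode in particular is a guess at a conventional time-based cache.

```python
import time
from typing import Callable, Dict, Tuple

CacheEntry = Tuple[str, float]  # (html, expiry as a Unix timestamp)


def serve(path: str,
          pressed: Dict[str, str],
          cache: Dict[str, CacheEntry],
          render: Callable[[str], str],
          emergency: bool,
          ttl_s: float = 60.0) -> str:
    now = time.time()
    if emergency:
        # 1. Pressed pages first.
        if path in pressed:
            return pressed[path]
        # 2. Then any cached copy, ignoring its expiry time.
        if path in cache:
            return cache[path][0]
    else:
        # Normal mode (assumed): honour the time-based expiry.
        entry = cache.get(path)
        if entry is not None and entry[1] > now:
            return entry[0]
    # 3. Finally, render as normal and remember the result.
    html = render(path)
    cache[path] = (html, now + ttl_s)
    return html
```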

Page Pressing

                    • In-memory caches aren’t enough
                    • Need a full page cache
                    • Stored on disk as generated HTML
                    • Served like static files
                    • Capable of over 1k pages/s per server
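
A rough sketch of what pressing a page to disk could look like. The path mapping (`<root>/<url path>/index.html`) and the function names are hypothetical; query strings and path sanitising are deliberately left out.

```python
import os
import tempfile
from pathlib import Path


def pressed_file(root: Path, url_path: str) -> Path:
    """Map a URL path such as /uk/budget to <root>/uk/budget/index.html."""
    return root / url_path.strip("/") / "index.html"


def press_page(root: Path, url_path: str, html: str) -> Path:
    """Write generated HTML to disk so a web server can serve it as a static file."""
    target = pressed_file(root, url_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file and rename it into place, so a reader
    # never sees a half-written page.
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(html)
    os.replace(tmp_name, str(target))
    return target
```

Because the result is a plain file, the front-end web server can serve it without touching the application server at all.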

Really cache everything

                    • Except for microapps
                    • Emergency mode for CMS doesn’t affect
                         microapps by design
                    • Microapps are cached anyway


Gotta cache ‘em all

     • 1.6 million pieces of content
     • http://www.guardian.co.uk/uk/budget
     • http://www.guardian.co.uk/travel/france+travel/skiing
     • http://www.guardian.co.uk/theguardian/2012/mar/02
     • http://www.guardian.co.uk/technology/apple?page=2

Cache what’s important

                    • Content - when modified
                    • Navigation - Every 2 weeks
                    • Automatic but important - Every 2 weeks
                    • Automatic (eg tag combiners) - Never
                    • Can force a page press

Monitoring

                    • Help find the problem
                    • What has gone wrong?
                    • When did it go wrong?
                    • What changed when it went wrong?
                    • What can I turn off?

Monitoring
                    • Aggregate stats
                     • individual, microapp, per colo, per stage
                    • Monitor everything?
                     • Is cpu usage that important?
                     • Consistent
                    • Alerting is not monitoring
Automatic switches

                    • Release valves
                    • Emergency mode
                    • Database off mode


Switch if a threshold is met
                    • Average page response time
                    • Reset after timeout (say 60 seconds)
                    • Prevents Ping-Pong of switches
                    • Not an error, normal behaviour
                    • Trends should be monitorable
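
A threshold switch with a hold time might look something like this. The class `AutoSwitch`, the rolling window of samples and the injectable clock are assumptions for the sketch; only the idea of tripping on average response time and resetting after roughly 60 seconds comes from the slide.

```python
import time
from collections import deque


class AutoSwitch:
    """Trips when the rolling average response time exceeds a threshold,
    then stays on for hold_s seconds to avoid ping-ponging."""

    def __init__(self, threshold_ms, hold_s=60.0, window=100, clock=time.monotonic):
        self.threshold_ms = threshold_ms
        self.hold_s = hold_s
        self.samples = deque(maxlen=window)  # most recent response times
        self.clock = clock
        self.tripped_at = None

    def average(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def record(self, response_ms):
        self.samples.append(response_ms)
        if self.tripped_at is None and self.average() > self.threshold_ms:
            # Normal behaviour, not an error: note the time and switch on.
            self.tripped_at = self.clock()

    def is_on(self):
        if self.tripped_at is None:
            return False
        if self.clock() - self.tripped_at >= self.hold_s:
            # Hold time elapsed: reset and start measuring afresh.
            self.tripped_at = None
            self.samples.clear()
            return False
        return True
```

Request handling would call `record()` after each page and check `is_on()` before deciding whether to serve in emergency mode; logging each trip and reset would make the trend visible to monitoring.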

Diagnosing failure



Why do I care?
                         • Your architecture must be easy to
                           diagnose
                         • These skills aren’t common enough
                          • Basic Unix skills (sed, grep, cut, sort)
                          • Log analysis
                         • Take these into account when you design
Logs, Logs, Logs, Logs

                    • When an issue occurs
                     • Copy logs from the affected server
                     • System, Stdout, Application, JVM
                    • Reboot/disable/rebuild affected server
                    • Parse logs in parallel

Logging
                    • Logs must be useful
                    • Don’t log extraneous data (not too large)
                    • Important data:
                     • Date and Time
                     • Affected code
                    • Parseable
Loggable Events
                    • Request logging (logged on completion, including time taken)
                    • External service requests (logged on completion, including
                         time taken)
                    • Interesting events
                    • Exceptions
                    • Database calls?
Loggable Events
        2012-03-06 14:58:19,351 [resin-tcp-connection-*:8080-19] INFO com.gu.management.logging.RequestLoggingFilter - Request for /pages/Guardian/artanddesign/artblog/2008/jan/31/catchofthedaysecondlifes completed in 231 ms

                    • Date and Time: 2012-03-06 14:58:19,351
                    • Thread name: resin-tcp-connection-*:8080-19
                    • Class name: com.gu.management.logging.RequestLoggingFilter

Log analysis is your friend
                    • Simple tools for a simple life
                     • Grep
                     • Cut
                     • Uniq
                     • Sort
                     • Sed and Awk
# Extract date, time, thread, path and response time for completed requests
zgrep "RequestLoggingFilter - Request for.*completed in " "$LOGFILE" |
  grep -v " /management/" |
  cut -d" " -f1,2,3,10,13 > "$COMPLETED_REQUESTS_FILE"




# Cumulative count of requests by response time, slowest first
cut -d " " -f5 "$COMPLETED_REQUESTS_FILE" |
  sort -nr |
  uniq -c |
  awk '{ SUM += $1; print $2, SUM }' > "$CUMULATIVE_REQUESTS_FILE"

# Also: awk '{ print $5 }' in place of cut -d " " -f5




You can get complicated
                         • When sed/awk et al aren’t enough
                         • Write your own
                          • Log parsing into mysql
                          • select count(*) from database_calls
                             where request_id in (select id from
                             requests where path like '/travel/france/%')
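
As a starting point for "write your own", the request log line shown earlier can be parsed into a structured record that is ready to load into a database. This is a sketch: the regular expression is inferred from that single example line, and the field names are assumptions.

```python
import re
from datetime import datetime

# Pattern inferred from the example RequestLoggingFilter line.
REQUEST_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<logger>\S+) - "
    r"Request for (?P<path>\S+) completed in (?P<millis>\d+) ms$"
)


def parse_request(line):
    """Return a dict for a completed-request line, or None if it does not match."""
    match = REQUEST_LINE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["timestamp"] = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S,%f")
    record["millis"] = int(record["millis"])
    return record


sample = ("2012-03-06 14:58:19,351 [resin-tcp-connection-*:8080-19] "
          "INFO com.gu.management.logging.RequestLoggingFilter - "
          "Request for /pages/Guardian/artanddesign/artblog/2008/jan/"
          "31/catchofthedaysecondlifes completed in 231 ms")
print(parse_request(sample))
```

Each record could then be inserted into a `requests` table so that queries like the one above become possible.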


Other kinds of failure



Not all about software

                    • Your application
                    • The system it runs on
                    • Infrastructure failures
                    • Network failures
                    • Bugs

Systems Failure

                    • Your system itself might get inconsistent
                    • Garbage collection loops
                    • Database connections
                    • Infinite loops
                    • File Handles

Infrastructure failure

                    • Power fails
                    • UPS fails
                    • Database machine fails
                    • Your own machine fails

Network failure

                    • Routers fail
                    • Uplinks fail
                    • Internet routing failures
                    • DNS failures
                    • Browser issues

Predictable Failure

                    • Hard drives filling up
                    • CPU max usage
                    • Network usage
                    • AppEngine/EC2 budgets
                    • Capacity planning

Unpredictable failure
                    • “There are things we know that we know,
                         things we know that we don’t know, and
                         things we don’t know that we don’t know”
                    • MTBF and MTTR (mean time between failures, mean time to recovery)
                    • If you can’t predict failure:
                     • Recover faster
                     • Mitigate the issue
External dependencies
                    • Who is more likely to break, you or
                         twitter?
                    • But can you predict when twitter will
                         break?
                    • Never depend on a third party
                    • They will let you down
                    • At the worst possible time
So what have we learnt?


Open Platform

                    • Need to handle peaky load
                    • Fault isolation from main database
                    • Feels like we’ve been here before...


Content API Architecture
                             Apache


                            AppServer


                              Solr



Content API
                         Architecture
                                      Apache
                                      Apache
                                       Apache


      Database             Indexer   AppServer
                                     AppServer
                                      AppServer


                            Solr       Solr
                                        Solr
                                         Solr


Content API
                         Architecture
                          Console     Apache
                                      Apache
                                       Apache


      Database             Indexer   AppServer
                                     AppServer
                                      AppServer


                            Solr       Solr
                                        Solr
                                         Solr


Benefits

                    • Indexer provides data isolation
                    • Solr replication gives “read only replicas”
                    • EC2 instances can be spun up when
                         necessary




Benefits
                    • Switches on backend
                     • Indexing
                     • Features
                     • Replication
                    • Switches in API
                     • content.guardianapis.com/.json?show-
Drawbacks / Todo
                    • Indexer latency
                     • Message based indexing
                    • Replication latency
                     • ElasticSearch/SolrCloud/Mongo?
                    • Live updating data
                     • Separation of APIs
Summary

                    • Expect Failure
                    • Plan for failure
                     • At 4am
                    • Keep it simple
                    • Keep everything independent

Thank You
                       • michael.brunton-spall@guardian.co.uk
                       • @bruntonspall
                       • Thanks to Lisa van Gelder (@techbint),
                                 Mat Wall (@matwall), Philip Wills
                                 (@philwills) and Graham Tackley
                                 (@tackers)
 Giant Furry Rat - “Lost land of the Volcano” courtesy of BBC natural history unit
 Panic Button - http://www.flickr.com/photos/trancemist/361935363/
 Long Meg Sidings - http://www.flickr.com/photos/ingythewingy/5243875486/
 Server Rack - http://www.flickr.com/photos/jamisonjudd/2433102356/
 Release Valve - http://www.flickr.com/photos/kayveeinc/4107697872
 Ancient Planet - http://www.flickr.com/photos/gsfc/4479185727/
 Solar system - http://www.flickr.com/photos/gsfc/4479185727
 Gauges - http://www.flickr.com/photos/dgoodphoto/5264024028
 Logs - http://www.flickr.com/photos/catzrule/5693655199
 Higgs boson - http://www.flickr.com/photos/jurvetson/4233962874
 Toolbox - http://www.flickr.com/photos/jrhode/4632887921



Thursday, 8 March 2012

Contenu connexe

Similaire à Architecting for failure

MongoDB - Who, What & Where!
MongoDB - Who, What & Where!MongoDB - Who, What & Where!
MongoDB - Who, What & Where!Mark Hillick
 
Direct memory jugl-2012.03.08
Direct memory jugl-2012.03.08Direct memory jugl-2012.03.08
Direct memory jugl-2012.03.08Benoit Perroud
 
Ora mysql bothGetting the best of both worlds with Oracle 11g and MySQL Enter...
Ora mysql bothGetting the best of both worlds with Oracle 11g and MySQL Enter...Ora mysql bothGetting the best of both worlds with Oracle 11g and MySQL Enter...
Ora mysql bothGetting the best of both worlds with Oracle 11g and MySQL Enter...Ivan Zoratti
 
Engines Lightning Talk
Engines Lightning TalkEngines Lightning Talk
Engines Lightning TalkDan Pickett
 
Games for the Masses (QCon London 2012)
Games for the Masses (QCon London 2012)Games for the Masses (QCon London 2012)
Games for the Masses (QCon London 2012)Wooga
 
Red Dirt Ruby Conference
Red Dirt Ruby ConferenceRed Dirt Ruby Conference
Red Dirt Ruby ConferenceJohn Woodell
 
Oracle my sql cluster cge
Oracle my sql cluster cgeOracle my sql cluster cge
Oracle my sql cluster cgeseungdon1
 
AIIM New England - ECM in an Interoperable World
AIIM New England - ECM in an Interoperable WorldAIIM New England - ECM in an Interoperable World
AIIM New England - ECM in an Interoperable WorldCheryl McKinnon
 
Accelerating big data with ioMemory and Cisco UCS and NOSQL
Accelerating big data with ioMemory and Cisco UCS and NOSQLAccelerating big data with ioMemory and Cisco UCS and NOSQL
Accelerating big data with ioMemory and Cisco UCS and NOSQLSumeet Bansal
 
Implementing MongoDB at Shutterfly (Kenny Gorman)
Implementing MongoDB at Shutterfly (Kenny Gorman)Implementing MongoDB at Shutterfly (Kenny Gorman)
Implementing MongoDB at Shutterfly (Kenny Gorman)MongoSF
 
MySQL DW Breakfast
MySQL DW BreakfastMySQL DW Breakfast
MySQL DW BreakfastIvan Zoratti
 
Dan node meetup_socket_talk
Dan node meetup_socket_talkDan node meetup_socket_talk
Dan node meetup_socket_talkIshi von Meier
 
Future Proofing MySQL by Robert Hodges, Continuent
Future Proofing MySQL by Robert Hodges, ContinuentFuture Proofing MySQL by Robert Hodges, Continuent
Future Proofing MySQL by Robert Hodges, ContinuentEero Teerikorpi
 
2011 June - Singapore GTUG presentation. App Engine program update + intro to Go
2011 June - Singapore GTUG presentation. App Engine program update + intro to Go2011 June - Singapore GTUG presentation. App Engine program update + intro to Go
2011 June - Singapore GTUG presentation. App Engine program update + intro to Goikailan
 
Better front-end development in Atlassian plugins
Better front-end development in Atlassian pluginsBetter front-end development in Atlassian plugins
Better front-end development in Atlassian pluginsAtlassian
 
Cloud4all Architecture Overview
Cloud4all Architecture OverviewCloud4all Architecture Overview
Cloud4all Architecture Overviewicchp2012
 
Introducing Immutant
Introducing Immutant Introducing Immutant
Introducing Immutant Jim Crossley
 
My sql 5.5_product_update
My sql 5.5_product_updateMy sql 5.5_product_update
My sql 5.5_product_updatehenriquesidney
 

Similaire à Architecting for failure (20)

MongoDB - Who, What & Where!
MongoDB - Who, What & Where!MongoDB - Who, What & Where!
MongoDB - Who, What & Where!
 
Direct memory jugl-2012.03.08
Direct memory jugl-2012.03.08Direct memory jugl-2012.03.08
Direct memory jugl-2012.03.08
 
Ora mysql bothGetting the best of both worlds with Oracle 11g and MySQL Enter...
Ora mysql bothGetting the best of both worlds with Oracle 11g and MySQL Enter...Ora mysql bothGetting the best of both worlds with Oracle 11g and MySQL Enter...
Ora mysql bothGetting the best of both worlds with Oracle 11g and MySQL Enter...
 
Engines Lightning Talk
Engines Lightning TalkEngines Lightning Talk
Engines Lightning Talk
 
Games for the Masses (QCon London 2012)
Games for the Masses (QCon London 2012)Games for the Masses (QCon London 2012)
Games for the Masses (QCon London 2012)
 
Red Dirt Ruby Conference
Red Dirt Ruby ConferenceRed Dirt Ruby Conference
Red Dirt Ruby Conference
 
Oracle my sql cluster cge
Oracle my sql cluster cgeOracle my sql cluster cge
Oracle my sql cluster cge
 
AIIM New England - ECM in an Interoperable World
AIIM New England - ECM in an Interoperable WorldAIIM New England - ECM in an Interoperable World
AIIM New England - ECM in an Interoperable World
 
Accelerating big data with ioMemory and Cisco UCS and NOSQL
Accelerating big data with ioMemory and Cisco UCS and NOSQLAccelerating big data with ioMemory and Cisco UCS and NOSQL
Accelerating big data with ioMemory and Cisco UCS and NOSQL
 
Implementing MongoDB at Shutterfly (Kenny Gorman)
Implementing MongoDB at Shutterfly (Kenny Gorman)Implementing MongoDB at Shutterfly (Kenny Gorman)
Implementing MongoDB at Shutterfly (Kenny Gorman)
 
App Engine Meetup
App Engine MeetupApp Engine Meetup
App Engine Meetup
 
MySQL DW Breakfast
MySQL DW BreakfastMySQL DW Breakfast
MySQL DW Breakfast
 
Dan node meetup_socket_talk
Dan node meetup_socket_talkDan node meetup_socket_talk
Dan node meetup_socket_talk
 
Future Proofing MySQL by Robert Hodges, Continuent
Future Proofing MySQL by Robert Hodges, ContinuentFuture Proofing MySQL by Robert Hodges, Continuent
Future Proofing MySQL by Robert Hodges, Continuent
 
2011 June - Singapore GTUG presentation. App Engine program update + intro to Go
2011 June - Singapore GTUG presentation. App Engine program update + intro to Go2011 June - Singapore GTUG presentation. App Engine program update + intro to Go
2011 June - Singapore GTUG presentation. App Engine program update + intro to Go
 
Better front-end development in Atlassian plugins
Better front-end development in Atlassian pluginsBetter front-end development in Atlassian plugins
Better front-end development in Atlassian plugins
 
Cloud4all Architecture Overview
Cloud4all Architecture OverviewCloud4all Architecture Overview
Cloud4all Architecture Overview
 
Introducing Immutant
Introducing Immutant Introducing Immutant
Introducing Immutant
 
My sql 5.5_product_update
My sql 5.5_product_updateMy sql 5.5_product_update
My sql 5.5_product_update
 
Oracle Database appliancepptx
Oracle Database appliancepptxOracle Database appliancepptx
Oracle Database appliancepptx
 

Dernier

Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 

Dernier (20)

Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 

Architecting for failure

  • 1. Architecting for Failure Michael Brunton-Spall @bruntonspall QCon London 2012 Thursday, 8 March 2012
  • 2. The inevitability of failure • If you take nothing else away: • Systems will fail • Architect for failure • prevention • mitigation
  • 3. guardian.co.uk circa 2008
  • 4. Basic Architecture (diagram): Apache → AppServer → Database
  • 5. Basic Architecture • This is your basic J2EE stack • Let's apply scaling basics
  • 6. Scaling (diagram): Load Balancer → 3 × Apache → Load Balancer → 3 × AppServer → Database
  • 7. Scaled Architecture • We don’t scale databases this way • Load balancers give scaling • Also a bit of spatial redundancy • But what about our data center?
  • 8. Redundancy (diagram): Global Load Balancer → two datacenters, each with Load Balancer → 2 × Apache → Load Balancer → 2 × AppServer → Database
  • 9. Redundant Architecture • Real spatial redundancy • Global load balancing via DNS • Twin datacenters • Redundant power • Redundant internet connectivity • Database in Active/Passive
  • 10. Success stories • Serves 3.5m unique daily browsers • Over 1.6m unique pieces of content • supports hundreds of editorial staff • create articles, audio, video, galleries, interactives, micro-sites
  • 11. Drawbacks • Monolithic system • Understands everything • football leagues • financial markets • mortgage applications • content!
  • 12. The microapps revolution circa 2011
  • 15. Microapps • Separation of Systems • SSI-like technology • HTTP
  • 16. AppEngine, Python, Ruby, EC2 - Oh My • Proliferation of systems, languages and frameworks • Faster development • Increased innovation • Hack Days! • Built on content API
  • 17. Microapps Architecture Apache AppServer Database
  • 18. Microapps Architecture Apache EC2 AppServer AppEngine Database
  • 19. Microapps Architecture Apache EC2 AppServer AppEngine Database Content API (EC2)
  • 20. Microapps Architecture Apache EC2 AppServer Cache AppEngine Database Content API (EC2)
  • 21. The cost of diversification • Support • Maintenance • Decided to settle on JVM stack primarily
  • 22. Benefits • Lots of small simple applications • Can code, release, test in isolation • Cache • max-age • stale-if-error
  • 23. Drawbacks • Increased architectural complexity • Need a big cache • Context
  • 24. The biggest problem • Microapp latency affects CMS latency • Failure is not a problem • Slow is a problem • stale-while-revalidate?
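The cache directives named on slides 22 and 24 (max-age, stale-if-error, stale-while-revalidate) are standard Cache-Control extensions from RFC 5861. A minimal sketch of how a microapp might build such a header follows; the helper name `cache_control` and the numeric values are illustrative assumptions, not taken from the talk.

```python
def cache_control(max_age, stale_if_error=None, stale_while_revalidate=None):
    """Build a Cache-Control header value (hypothetical helper)."""
    directives = [f"max-age={max_age}"]
    if stale_while_revalidate is not None:
        # Serve the stale copy immediately while refreshing in the background,
        # so a slow microapp does not slow down the page that includes it.
        directives.append(f"stale-while-revalidate={stale_while_revalidate}")
    if stale_if_error is not None:
        # Keep serving the stale copy if the origin returns an error.
        directives.append(f"stale-if-error={stale_if_error}")
    return ", ".join(directives)


# Illustrative values: fresh for 60 s, stale-but-usable for 30 s while
# revalidating, and for a day if the microapp is failing.
header = cache_control(60, stale_if_error=86400, stale_while_revalidate=30)
print("Cache-Control:", header)
# prints: Cache-Control: max-age=60, stale-while-revalidate=30, stale-if-error=86400
```

Whether a given cache honours these directives depends on the cache in front of the microapp; that choice is not stated on the slides.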
  • 26. Emergency Mode • Dynamic pages are expensive • ‘Peaky’ traffic • Often small subset of functionality • Trade off dynamism for speed
  • 27. Emergency Mode • Caches do not expire based on time • Serve pressed pages first • Render pages from caches second • Render page as normal finally
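The lookup order on slide 27 can be sketched as a simple fallback chain. This is only an illustration under stated assumptions: the function `serve`, the dictionaries standing in for the pressed-page store and the in-memory cache, and the `render` callback are all hypothetical names, and the real CMS is a JVM application whose internals the slides do not show.

```python
def serve(path, emergency_mode, pressed, cache, render):
    """Return (source, html) for a request.

    pressed: dict standing in for full pages stored on disk (assumption)
    cache:   dict standing in for in-memory caches, expiry ignored (assumption)
    render:  callable that builds the page dynamically
    """
    if emergency_mode:
        if path in pressed:      # 1. serve pressed pages first
            return "pressed", pressed[path]
        if path in cache:        # 2. render from caches, ignoring their age
            return "cache", cache[path]
    # 3. render the page as normal (also the only path outside emergency mode)
    return "rendered", render(path)


pressed_pages = {"/uk/budget": "<html>pressed budget page</html>"}
memory_cache = {"/technology/apple": "<html>cached apple page</html>"}

for p in ("/uk/budget", "/technology/apple", "/travel/france"):
    source, _ = serve(p, True, pressed_pages, memory_cache,
                      lambda path: f"<html>fresh {path}</html>")
    print(p, "->", source)
```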
  • 28. Page Pressing • In memory caches aren’t enough • Need a full page cache • Stored on disk as generated HTML • Served like static files • Capable of over 1k pages/s per server
  • 29. Really cache everything • Except for microapps • Emergency mode for CMS doesn’t affect microapps by design • Microapps are cached anyway
  • 30. Gotta cache ‘em all • 1.6 million pieces of content • http://www.guardian.co.uk/uk/budget • http://www.guardian.co.uk/travel/france+travel/skiing • http://www.guardian.co.uk/theguardian/2012/mar/02 • http://www.guardian.co.uk/technology/apple?page=2
  • 31. Cache what’s important • Content - when modified • Navigation - Every 2 weeks • Automatic but important - Every 2 weeks • Automatic (eg tag combiners) - Never • Can force a page press
  • 32. Monitoring • Help find the problem • What has gone wrong? • When did it go wrong? • What changed when it went wrong? • What can I turn off?
  • 33. Monitoring • Aggregate stats • individual, microapp, per colo, per stage • Monitor everything? • Is cpu usage that important? • Consistent • Alerting is not monitoring
  • 34. Automatic switches • Release valves • Emergency mode • Database off mode
  • 35. Switch if a threshold is met • Average page response time • Reset after timeout (say 60 seconds) • Prevents Ping-Pong of switches • Not an error, normal behaviour • Trends should be monitorable
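A minimal sketch of the automatic switch described on slide 35: it trips when the average page response time crosses a threshold and stays on until a fixed timeout has passed, which is what prevents the switch from ping-ponging. The class name `AutoSwitch`, the 500 ms threshold, and the injectable clock are assumptions made for illustration; only the 60-second reset comes from the slide.

```python
import time


class AutoSwitch:
    """Threshold-triggered switch with a fixed hold time (hypothetical)."""

    def __init__(self, threshold_ms, reset_after_s=60.0, clock=time.monotonic):
        self.threshold_ms = threshold_ms
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so the behaviour can be tested
        self.tripped_at = None      # None means the switch is off

    def record(self, avg_response_ms):
        """Feed in the latest average page response time."""
        if self.tripped_at is None and avg_response_ms > self.threshold_ms:
            self.tripped_at = self.clock()

    def is_on(self):
        """True while the switch is held on; resets itself after the timeout."""
        if self.tripped_at is None:
            return False
        if self.clock() - self.tripped_at >= self.reset_after_s:
            self.tripped_at = None
            return False
        return True


# Usage with a fake clock so the example runs instantly.
now = [0.0]
emergency = AutoSwitch(threshold_ms=500, reset_after_s=60, clock=lambda: now[0])
emergency.record(900)          # slow average: the switch trips
now[0] = 30.0
emergency.record(100)          # fast again, but the switch holds for 60 s
print(emergency.is_on())       # prints: True
now[0] = 61.0
print(emergency.is_on())       # prints: False
```

Because good readings during the hold period are ignored, a site hovering around the threshold flips at most once per timeout rather than on every sample.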
  • 37. Why do I care? • Your architecture must be easy to diagnose • These skills aren’t common enough • Basic unix skills (sed, grep, cut, sort) • Log analysis • Take these into account when you design
  • 38. Logs, Logs, Logs, Logs • When an issue occurs • Copy logs from the affected server • System, Stdout, Application, JVM • reboot/disable/rebuild affected server • Parse logs in parallel
  • 39. Logging • Logs must be useful • Don’t log extraneous data (not too large) • Important data: • Date and Time • Affected code • Parseable
  • 40. Loggable Events • Request Logging (after including time) • External service requests (after including time) • Interesting events • Exceptions • Database calls?
  • 41. Loggable Events 2012-03-06 14:58:19,351 [resin-tcp-connection-*:8080-19] INFO com.gu.management.logging.RequestLoggingFilter - Request for /pages/Guardian/artanddesign/artblog/2008/jan/31/catchofthedaysecondlifes completed in 231 ms
  • 42-44. Loggable Events (the same log line, annotated step by step) • Date and Time: 2012-03-06 14:58:19,351 • Thread name: [resin-tcp-connection-*:8080-19] • Class name: com.gu.management.logging.RequestLoggingFilter
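Because the request log line has a fixed shape, each of the annotated parts can be pulled out mechanically. The sketch below parses the line from the slide with a regular expression; the pattern and the field names (`when`, `thread`, `level`, `cls`, `path`, `ms`) are assumptions written for this one line format, with the path written on a single line.

```python
import re
from datetime import datetime

LINE = ("2012-03-06 14:58:19,351 [resin-tcp-connection-*:8080-19] INFO "
        "com.gu.management.logging.RequestLoggingFilter - Request for "
        "/pages/Guardian/artanddesign/artblog/2008/jan/31/"
        "catchofthedaysecondlifes completed in 231 ms")

# Hypothetical pattern: date and time, [thread], level, class, path, duration.
PATTERN = re.compile(
    r"(?P<when>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<thread>[^\]]+)\]\s+(?P<level>\w+)\s+(?P<cls>\S+) - "
    r"Request for (?P<path>\S+) completed in (?P<ms>\d+) ms"
)


def parse(line):
    """Return a dict of fields, or None if the line is not a request log."""
    match = PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["when"] = datetime.strptime(record["when"], "%Y-%m-%d %H:%M:%S,%f")
    record["ms"] = int(record["ms"])
    return record


record = parse(LINE)
print(record["when"], record["thread"], record["cls"], record["ms"])
```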
  • 45. Log analysis is your friend • Simple tools for a simple life • Grep • Cut • Uniq • Sort • Sed and Awk
  • 46. zgrep "RequestLoggingFilter - Request for.*completed in " $LOGFILE | grep -v " /management/" | cut -d" " -f1,2,3,10,13 > $COMPLETED_REQUESTS_FILE
  • 47. cat $COMPLETED_REQUESTS_FILE | cut -d " " -f5 | sort -nr | uniq -c | awk '{ SUM += $1; print $2, SUM }' > $CUMULATIVE_REQUESTS_FILE
  • 48. cat $COMPLETED_REQUESTS_FILE | cut -d " " -f5 | sort -nr | uniq -c | awk '{ SUM += $1; print $2, SUM }' > $CUMULATIVE_REQUESTS_FILE • Also: awk '{ print $5 }' (in place of the cut step)
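The pipeline above takes the duration column, sorts it from slowest to fastest, counts how many requests share each duration, and keeps a running total. For readers less familiar with sort, uniq and awk, the same computation can be sketched in Python; the function name `cumulative_by_duration` and the sample durations are made up for illustration.

```python
from collections import Counter


def cumulative_by_duration(durations_ms):
    """Mirror `sort -nr | uniq -c | awk '{ SUM += $1; print $2, SUM }'`.

    Returns (duration, running_total) pairs, slowest first, where
    running_total is the number of requests that took at least that long.
    """
    counts = Counter(durations_ms)
    total = 0
    rows = []
    for ms in sorted(counts, reverse=True):
        total += counts[ms]
        rows.append((ms, total))
    return rows


# Made-up durations in milliseconds.
print(cumulative_by_duration([231, 12, 231, 950, 12, 12]))
# prints: [(950, 1), (231, 3), (12, 6)]
```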
  • 49. You can get complicated • When sed/awk et al aren’t enough • Write your own • Log parsing into mysql • select count(*) from database_calls where request_id in (select id from requests where path like '/travel/france/%')
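To make the query on this slide concrete, here is a small runnable sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it needs no server. The table names `requests` and `database_calls` and the columns `id`, `path` and `request_id` come from the slide's query; the `ms` columns and all of the rows are invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE requests (id INTEGER PRIMARY KEY, path TEXT, ms INTEGER);
    CREATE TABLE database_calls (
        id INTEGER PRIMARY KEY,
        request_id INTEGER REFERENCES requests(id),
        ms INTEGER
    );
""")

# Invented sample data: two requests and the database calls each one made.
db.executemany("INSERT INTO requests VALUES (?, ?, ?)", [
    (1, "/travel/france/skiing", 231),
    (2, "/uk/budget", 80),
])
db.executemany("INSERT INTO database_calls (request_id, ms) VALUES (?, ?)", [
    (1, 12), (1, 40), (2, 5),
])

# The slide's query: how many database calls did /travel/france/ pages make?
(count,) = db.execute(
    "SELECT count(*) FROM database_calls WHERE request_id IN "
    "(SELECT id FROM requests WHERE path LIKE ?)",
    ("/travel/france/%",),
).fetchone()
print(count)  # prints: 2
```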
  • 50. Other kinds of failure
  • 51. Not all about software • Your application • The system it runs on • Infrastructure failures • Network failures • Bugs
  • 52. Systems Failure • Your system itself might get inconsistent • Garbage collection loops • Database connections • Infinite loops • File Handles
  • 53. Infrastructure failure • Power fails • UPS fails • Database machine fails • Your own machine fails
  • 54. Network failure • Routers fail • Uplinks fail • Internet routing failures • DNS failures • Browser issues
  • 55. Predictable Failure • Hard drives filling up • CPU max usage • Network usage • AppEngine/EC2 budgets • Capacity planning
  • 56. Unpredictable failure • “There are things we know that we know, things we know that we don’t know, and things we don’t know that we don’t know” • MTBF and MTBR • If you can’t predict failure: • Recover faster • Mitigate the issue
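The "recover faster" advice can be illustrated with the usual steady-state availability formula, availability = MTBF / (MTBF + MTTR). This reads the slide's "MTBR" as mean time to recovery, more commonly abbreviated MTTR, which is an assumption; the hour figures are invented.

```python
def availability(mtbf_hours, mttr_hours):
    """Fraction of time the system is up, given mean time between failures
    and mean time to recovery (standard steady-state formula)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Invented figures: fail every 1000 hours, take 4 hours to recover.
baseline = availability(1000, 4)
fewer_failures = availability(2000, 4)    # failures half as frequent
faster_recovery = availability(1000, 2)   # recovery twice as fast

print(f"baseline        {baseline:.6f}")
print(f"fewer failures  {fewer_failures:.6f}")
print(f"faster recovery {faster_recovery:.6f}")
```

Halving recovery time yields the same availability as doubling the time between failures, and when failures cannot be predicted it is usually the only one of the two that is under your control.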
  • 57. External dependencies • Who is more likely to break, you or twitter? • But can you predict when twitter will break? • Never depend on a third party • They will let you down • At the worst possible time
  • 58. So what have we learnt?
  • 59. Open Platform • Need to handle peaky load • Fault isolation from main database • feels like we’ve been here before...
  • 60. Content API Architecture Apache AppServer Solr
  • 61. Content API Architecture Apache Apache Apache Database Indexer AppServer AppServer AppServer Solr Solr Solr Solr
  • 62. Content API Architecture Console Apache Apache Apache Database Indexer AppServer AppServer AppServer Solr Solr Solr Solr
  • 63. Benefits • Indexer provides data isolation • solr replication gives “read only replicas” • EC2 instances can be spun up when necessary
  • 64. Benefits • Switches on backend • Indexing • Features • Replication • Switches in API • content.guardianapis.com/.json?show-
  • 65. Drawbacks / Todo • Indexer latency • Message based indexing • Replication latency • ElasticSearch/SolrCloud/Mongo? • Live updating data • Separation of APIs
  • 66. Summary • Expect Failure • Plan for failure • At 4am • Keep it simple • Keep everything independent
  • 67. Thank You • michael.brunton-spall@guardian.co.uk • @bruntonspall • Thanks to Lisa van Gelder (@techbint), Mat Wall (@matwall), Philip Wills (@philwills) and Graham Tackley (@tackers) • Giant Furry Rat - “Lost land of the Volcano” courtesy of BBC natural history unit • Panic Button - http://www.flickr.com/photos/trancemist/361935363/ • Long Meg Sidings - http://www.flickr.com/photos/ingythewingy/5243875486/ • Server Rack - http://www.flickr.com/photos/jamisonjudd/2433102356/ • Release Valve - http://www.flickr.com/photos/kayveeinc/4107697872 • Ancient Planet - http://www.flickr.com/photos/gsfc/4479185727/ • Solar system - http://www.flickr.com/photos/gsfc/4479185727 • Gauges - http://www.flickr.com/photos/dgoodphoto/5264024028 • Logs - http://www.flickr.com/photos/catzrule/5693655199 • Higgs boson - http://www.flickr.com/photos/jurvetson/4233962874 • Toolbox - http://www.flickr.com/photos/jrhode/4632887921