Failover and Takeover Contingency Mechanisms
                    for Network Partition and Node Failure


              Macías López, Laura M. Castro, David Cabrero

                  MADS Research Group – Universidade da Coruña (Spain)


                                 Erlang Workshop
                         Copenhagen, 14th September 2012




Erlang Workshop (2012)                                           Fail/Takeover Mechanisms   1 / 25
Why are we (all) here?




Why are we (presenting this work) here?




                                      concurrency!

                                      high-availability!

                                      distribution!





Unexpected problems after deployment!

     node failures!
     system failure!



Outline



1   The system

2   The problems at deployment

3   The solution

4   Final remarks




The system
ADVERTISE


Distributed system for transmitting advertisements to set-top boxes (STBs) in customers' homes
over a cable operator's digital TV network (iDTV)




ADVERTISE’s requirements




     ensure the appropriate coordination of advertising mechanisms:

            compilation of events

            emission of advertising signals to STBs during a period of time

            recording hits (displays) of a specific piece of advertisement



Major challenge
Managing the size of the communications network:

              a growing number of operator customers (∼ 100,000)



ADVERTISE’s architecture




ADVERTISE’s structure




ADVERTISE as an Erlang Distributed Application


To meet its requirements, ADVERTISE was designed as a distributed application over several nodes




The problems at deployment
The symptoms




The ADVERTISE deployment environment presented some particularities that had not been foreseen:

     some nodes showed a tendency to fail more often than others

     network partitions were common during certain time periods (noon, night)


In this situation...
                      Fault tolerance requirements were not met!



The diagnosis




ADVERTISE was developed and tested over several physical machines




ADVERTISE was deployed over several virtual machines

      running on a single physical machine

      using a shared hard disk

      sharing the network link

      sharing the host with other apps/VMs


Frequent saturation of shared resources was perceived by ADVERTISE
nodes as short network partitions.
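
To illustrate why saturation can look like a partition, here is a minimal Python sketch of a timeout-based failure detector. It is not ADVERTISE code, nor the Erlang runtime's actual mechanism; the `TICK_TIMEOUT` value and the `suspect_down` helper are hypothetical. The point is only that a peer whose heartbeats are delayed past the timeout cannot be told apart from a peer that is unreachable.

```python
# Illustrative sketch only: names and values are assumptions, not taken from ADVERTISE.
TICK_TIMEOUT = 4.0  # hypothetical: seconds of silence before a peer is suspected down


def suspect_down(last_heartbeat: dict[str, float], now: float,
                 timeout: float = TICK_TIMEOUT) -> set[str]:
    """Return the peers whose last heartbeat is older than the timeout."""
    return {peer for peer, seen in last_heartbeat.items() if now - seen > timeout}


# Time (in seconds) at which each peer was last heard from. node_b is alive,
# but its heartbeat was delayed because the shared host was saturated;
# node_c answered on time.
last_heartbeat = {"node_b": 10.0, "node_c": 14.5}

suspected = suspect_down(last_heartbeat, now=15.0)
print(suspected)
```

With a longer timeout the delayed peer would not be suspected, at the cost of detecting real failures more slowly.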


The consequences




If nodes lose connectivity, conclude that all the others are down, and take over
system functions, inconsistencies are likely when connectivity is restored
(duplicated responsibilities, data inconsistencies).



Perceived network partitions led to cascading failovers
Duplicated registration of global names, random killing of conflicting
processes, overflow and eventual shutdown of the supervision mechanisms.
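
The duplicated-registration problem can be sketched as follows, in Python. This is a simplified model, not ADVERTISE code: each side of a partition keeps its own name registry (name to process identifier), and `find_conflicts` is a hypothetical helper that lists the names registered on both sides with different owners once the registries are merged. All names and identifiers are invented for illustration.

```python
# Illustrative sketch only: a toy model of name registries on two sides of a partition.

def find_conflicts(side_a: dict[str, str],
                   side_b: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return names registered on both sides with different owners."""
    return {
        name: (side_a[name], side_b[name])
        for name in side_a.keys() & side_b.keys()
        if side_a[name] != side_b[name]
    }


# During the perceived partition, each side believed the other was down
# and started its own copy of a (hypothetical) globally named process.
side_a = {"global_sup": "pid_on_node1", "campaign_store": "pid_on_node1"}
side_b = {"global_sup": "pid_on_node2"}

conflicts = find_conflicts(side_a, side_b)
print(conflicts)
```

Only one owner per name can survive the merge, so one of the conflicting processes has to be terminated when connectivity returns.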




The solution

For ADVERTISE, data consistency was more important than availability:

     the system could not afford to have advertising campaigns, rules,
     or media lost or left inconsistent

     instead, it was acceptable for no ads to be sent to STBs
     (or for them to be delayed)

The solution
We redesigned ADVERTISE to be deployed over a minimum of 3 nodes,
                             and never to run on an isolated node

ADVERTISE initialisation




ADVERTISE boot




Node integrity check




   1    Retrieve the last known population of active nodes, List_actives

   2    Retrieve the list of all ADVERTISE nodes from the configuration, List_all

   3    Filter List_all, removing ping-unreachable nodes

   4    If

                          (filtered(List_all) = List_actives) ∧ (|List_actives| = 1)

        ADVERTISE is suspended immediately,
        and the node is rebooted once connectivity is restored
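
The steps above can be sketched in Python as a single predicate. This is a language-neutral illustration under stated assumptions, not the authors' Erlang implementation: `must_suspend` and `is_reachable` are hypothetical names, and the sketch assumes that a node always counts itself as reachable.

```python
# Illustrative sketch only: function and parameter names are assumptions.
from typing import Callable, Iterable


def must_suspend(last_actives: Iterable[str],
                 all_nodes: Iterable[str],
                 is_reachable: Callable[[str], bool]) -> bool:
    """Step 4: suspend when the reachable population equals the last known
    active population and that population has exactly one member."""
    actives = set(last_actives)                              # step 1
    reachable = {n for n in all_nodes if is_reachable(n)}    # steps 2 and 3
    return reachable == actives and len(actives) == 1


all_nodes = ["n1", "n2", "n3"]

# n1 is isolated: it can only reach itself, and it is the only known active node.
isolated = must_suspend(["n1"], all_nodes, lambda n: n == "n1")

# All three nodes are active and reachable: no suspension.
healthy = must_suspend(all_nodes, all_nodes, lambda n: True)

print(isolated, healthy)
```

Passing the reachability test in as a function keeps the predicate independent of how the ping is actually performed.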


Distributed AC check




  1    The DAC (distributed application controller) is queried on all nodes to get the PID of the ADVERTISE local supervisor

  2    If ∃ n ∈ List_all for which the ADVERTISE local supervisor PID could not be
       retrieved, node failure is assumed
          1   If n ∈ List_actives, it replies to ping from the global supervisor but
              cannot reach the others; after a timeout:

                  1   If n ∉ List_actives, node failure is confirmed
                  2   If n ∈ List_actives, the node is up and we reboot it
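
The decision logic of this check can be sketched in Python as a small classifier. It is a hedged illustration, not the authors' code: `dac_check` and its return labels are hypothetical, the membership tests are passed in as booleans, and the sketch assumes that the first sub-case after the timeout means the node is no longer in the active list.

```python
# Illustrative sketch only: names and labels are assumptions.

def dac_check(sup_pid_found: bool, active_now: bool, active_after_timeout: bool) -> str:
    """Classify one node according to the steps above."""
    if sup_pid_found:
        return "ok"                 # local supervisor PID retrieved: nothing to do
    if not active_now:
        return "failure_assumed"    # step 2: PID missing, node not in the active list
    # Step 2.1: the node answers ping but its supervisor cannot be reached;
    # membership is checked again after a timeout.
    if active_after_timeout:
        return "reboot"             # node is up, so it is rebooted
    return "failure_confirmed"      # node has dropped out of the active list


results = [
    dac_check(True, True, True),
    dac_check(False, False, False),
    dac_check(False, True, True),
    dac_check(False, True, False),
]
print(results)
```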




Current ADVERTISE deployment



     Cluster of 3 virtual nodes, handling an average of 18K STBs per node,
     with peaks of 23K STBs during prime time

     Our tests reached a maximum of 45K STBs per node

     System running with no incidents reported in the last 4 months

     Most intensive advertising campaign was a 2-month promotion:
     displayed over 66 million times, with a peak of 140K times in 1 hour

     An average campaign can be displayed a total of 500K times, with peaks of up
     to 30K in 1 hour during prime time on Saturday night

Final remarks
Lessons learned




When designing a distributed Erlang app, one must take into account:

    Network reliability
    Latency of requests
    Bandwidth
    Network security
    Network topology
    Heterogeneity of components
    Scalability




Your mileage may vary!




Had ADVERTISE's requirements been substantially different,
we would probably have favoured availability over consistency, for instance.

                                        And that would be a different story...



Questions?




                      Audience ! thanks




            Some images and icons were downloaded from: openclipart.org




DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 

Failover and takeover contingency mechanisms for network partition and node failure

  • 1. Failover and Takeover Contingency Mechanisms for Network Partition and Node Failure. Macías López, Laura M. Castro, David Cabrero. MADS Research Group – Universidade da Coruña (Spain). Erlang Workshop, Copenhagen, 14th September 2012. Erlang Workshop (2012) Fail/Takeover Mechanisms 1 / 25
  • 2. Why are we (all) here?
  • 8. Why are we (presenting this work) here? Concurrency! High availability! Distribution!
  • 9. Why are we (presenting this work) here? Unexpected problems after deployment: node failures! System failure!
  • 10. Why are we (presenting this work) here?
  • 12. Outline: (1) The system; (2) The problems at deployment; (3) The solution; (4) Final remarks.
  • 13. The system: ADVERTISE. A distributed system that transmits advertisements to set-top boxes (STBs) in customers' homes over a cable operator's digital TV network (iDTV).
  • 14. The system: ADVERTISE's requirements. Ensure the appropriate coordination of the advertising mechanisms: compilation of events; emission of advertising signals to STBs during a period of time; recording of hits (displays) of a specific piece of advertisement. Major challenge: managing the size of the communications network, given the operator's growing number of customers (~100,000).
  • 15. The system: ADVERTISE's architecture
  • 21. The system: ADVERTISE's structure
  • 22. The system: ADVERTISE as an Erlang distributed application. To meet its requirements, ADVERTISE was designed as a distributed application running over several nodes.
  • 24. The problems at deployment: the symptoms. The ADVERTISE deployment environment presented some particularities that had not been foreseen: some nodes tended to fail more often than others; network partitions were common during certain time periods (noon, night). In this situation, fault tolerance requirements were not met!
  • 25. The problems at deployment: the diagnosis. ADVERTISE was developed and tested over several physical machines.
  • 27. The problems at deployment: the diagnosis. ADVERTISE was deployed over several virtual machines running on a single physical machine, using a shared hard disk and a shared network link, and sharing them with other apps/VMs. Frequent saturation of the shared resources was perceived by ADVERTISE nodes as short network partitions.
  • 28. The problems at deployment: the consequences. If nodes lose connectivity, believe that all the others are down, and take over system functions, inconsistencies are likely when connectivity is restored (duplicated responsibilities, data inconsistencies). Perceived network partitions led to cascading failovers: duplicated registration of global names, random killing of conflicting processes, and overflow and eventual stop of the supervision mechanisms.
  • 29. The solution. For ADVERTISE, data consistency was more important than availability: the system could not afford advertising campaigns, rules, or media being lost or becoming inconsistent; instead, it was acceptable that no ads were sent to STBs (or that they were delayed). The solution: we redesigned ADVERTISE to be deployed over a minimum of 3 nodes, and never on an isolated node.
  • 30. The solution: ADVERTISE initialisation
  • 36. The solution: ADVERTISE boot
  • 40. The solution: node integrity check. (1) Retrieve the last known population of active nodes, List_actives. (2) Retrieve the list of all ADVERTISE nodes from the configuration, List_all. (3) Filter List_all, removing ping-unreachable nodes. (4) If (filtered(List_all) = List_actives) ∧ (|List_actives| = 1), ADVERTISE is suspended immediately, and the node is rebooted once connectivity is restored.
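The four steps of the node integrity check can be sketched as a single predicate. This is only an illustration in Python, not the authors' Erlang code: the function name `node_is_isolated`, the `is_reachable` callback (standing in for a ping), and the node names are all hypothetical, and the sketch assumes the local node counts as reachable and appears in both lists.

```python
from typing import Callable, Iterable, Set


def node_is_isolated(
    last_actives: Iterable[str],
    all_nodes: Iterable[str],
    is_reachable: Callable[[str], bool],
) -> bool:
    """Return True when ADVERTISE should be suspended on this node.

    Steps, following the slide:
      1. last_actives: last known population of active nodes
      2. all_nodes: every ADVERTISE node listed in the configuration
      3. all_nodes is filtered down to the nodes that answer a ping
      4. the node is isolated when the filtered list equals last_actives
         and that list holds exactly one node
    """
    actives: Set[str] = set(last_actives)
    reachable: Set[str] = {n for n in all_nodes if is_reachable(n)}
    return reachable == actives and len(actives) == 1


# Hypothetical three-node configuration.
config = ["a@host1", "b@host2", "c@host3"]

# Partitioned: only the local node answers, and it is the only active one.
print(node_is_isolated(["a@host1"], config, lambda n: n == "a@host1"))  # True

# Healthy: every configured node answers and is active.
print(node_is_isolated(config, config, lambda n: True))  # False
```

Injecting the reachability test as a callback keeps the decision rule separate from the network, so the rule can be exercised without a live cluster.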
  • 41. The solution: distributed AC check. (1) The DAC is queried on all nodes to get the PID of the ADVERTISE local supervisor. (2) If ∃n ∈ List_all for which the ADVERTISE local supervisor PID could not be retrieved, node failure is assumed. (2.1) If n ∈ List_actives, it replies to ping from the global supervisor but cannot reach the others; after a timeout: (a) if n ∉ List_actives, node failure is confirmed; (b) if n ∈ List_actives, the node is up and we reboot it.
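The decision logic of the distributed AC check might be sketched as follows. This is an illustrative Python model rather than the authors' Erlang code. It reads the two post-timeout branches as "no longer in the active list: failure confirmed" and "still in the active list: node is up, reboot it", which is an interpretation of the slide. The names `Verdict` and `dac_check`, the PID strings, and the node names are hypothetical.

```python
from enum import Enum
from typing import Dict, Iterable, Optional, Set


class Verdict(Enum):
    HEALTHY = "local supervisor PID retrieved"
    FAILURE_ASSUMED = "PID missing, node not in the active list"
    FAILURE_CONFIRMED = "PID missing, node left the active list after the timeout"
    REBOOT = "PID missing, node still active after the timeout"


def dac_check(
    all_nodes: Iterable[str],
    sup_pids: Dict[str, Optional[str]],
    actives_before: Set[str],
    actives_after: Set[str],
) -> Dict[str, Verdict]:
    """Classify each configured node.

    sup_pids maps a node to its ADVERTISE local supervisor PID, or None
    when the PID could not be retrieved. actives_before and actives_after
    are the active-node lists observed before and after the timeout.
    """
    verdicts: Dict[str, Verdict] = {}
    for node in all_nodes:
        if sup_pids.get(node) is not None:
            verdicts[node] = Verdict.HEALTHY
        elif node not in actives_before:
            verdicts[node] = Verdict.FAILURE_ASSUMED
        elif node in actives_after:
            # Answers ping but runs no local supervisor: up, so reboot it.
            verdicts[node] = Verdict.REBOOT
        else:
            verdicts[node] = Verdict.FAILURE_CONFIRMED
    return verdicts


nodes = ["a@host1", "b@host2", "c@host3"]
pids = {"a@host1": "<0.90.0>", "b@host2": None, "c@host3": None}
result = dac_check(nodes, pids, {"a@host1", "b@host2", "c@host3"}, {"a@host1", "b@host2"})
for node, verdict in result.items():
    print(node, verdict.name)
```

In this run, `a@host1` is healthy, `b@host2` is rebooted, and `c@host3` is confirmed as failed.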
  • 42. The solution: current ADVERTISE deployment. A cluster of 3 virtual nodes handles an average of 18K STBs per node, with peaks of 23K STBs during prime time. Our tests reached a maximum of 45K STBs per node. The system has been running with no incidents reported in the last 4 months. The most intensive advertising campaign was a 2-month promotion: displayed over 66 million times, with a peak of 140K times in 1 hour. An average campaign can be displayed a total of 500K times, with peaks of up to 30K in 1 hour during prime time on Saturday night.
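A rough headroom estimate can be worked out from these figures. The sketch below assumes, purely for illustration, that when one node fails its STBs are spread evenly over the surviving nodes; the slides do not state how load is redistributed, and the function name `load_per_survivor` is hypothetical.

```python
def load_per_survivor(stbs_per_node: float, nodes: int, failed: int) -> float:
    """STBs each surviving node would handle if load is spread evenly."""
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("at least one node must survive")
    return stbs_per_node * nodes / survivors


TESTED_MAX = 45_000  # maximum STBs per node reached in the authors' tests

for label, per_node in (("average", 18_000), ("prime-time peak", 23_000)):
    load = load_per_survivor(per_node, nodes=3, failed=1)
    print(f"{label}: {load:.0f} STBs per surviving node, "
          f"within tested maximum: {load <= TESTED_MAX}")
```

Under that assumption, losing one of the three nodes would leave each survivor with 27,000 STBs on average and 34,500 at peak, both below the 45K reached in testing.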
  • 43. Final remarks: lessons learned. When designing a distributed Erlang app, one must take into account: network security, network reliability, network topology, latency of requests, heterogeneity of components, bandwidth, and scalability.
  • 45. Final remarks: your mileage may vary! Had ADVERTISE's requirements been substantially different, we would probably have favoured availability over consistency, for instance. And that would be a different story...
  • 46. Questions? Audience ! thanks (Some images and icons were downloaded from openclipart.org)