Distributed Programming
          and Data Consistency
          by Paulo Gaspar
          @paulogaspar7 on Twitter
          This will be placed at:
          http://www.slideshare.net/paulogaspar7

Thursday, 24 June 2010

Twitter: @paulogaspar7 - http://twitter.com/paulogaspar7
Blog: http://paulogaspar7.blogspot.com/
This presentation is about...
                             Awareness of how most of us do some
                             kind of Distributed Computing these days

                             Tuning Data Consistency for Fun and Profit

                             Tools and sources of knowledge related to
                             Distributed Computing

                             Where to get some source code too


Consistency Perception




What is Consistency?

Our perception of consistency is related to what we know about the system and its state. That is how we figure
out what might fit...
What isn’t?

...and what does not fit. Obviously a person will have a different degree of precision and tolerance than an
automated system.
Consistency across time

Consistency also has a time axis, with state sequences that make sense...
1 of 3=> Expected event sequence (3 slide animation which SlideShare won’t handle)
Consistency across time

2 of 3=> Expected event sequence (3 slide animation which SlideShare won’t handle)
Consistency across time

3 of 3=> Expected event sequence (3 slide animation which SlideShare won’t handle)
Inconsistency across time

...and state sequences that do NOT make sense.
1 of 3=> UNexpected event sequence (3 slide animation which SlideShare won’t handle)
Inconsistency across time

2 of 3=> UNexpected event sequence (3 slide animation which SlideShare won’t handle)
Inconsistency across time

3 of 3=> UNexpected event sequence (3 slide animation which SlideShare won’t handle)
Consistency is perception
                     ...and time matters...

Again, each (type of) observer will have a different degree of evaluation precision and tolerance to inconsistencies.
Cache Consistency
          (Low Latency - high read performance)




The Case for LB Caches
                Memcached at FB:
         You HAVE TO Replicate to Scale-Out

An example of how you still might have to replicate in order to scale, even with a very high performance store.

The reason for FB’s issue (might lack some detail):
 http://highscalability.com/blog/2009/10/26/facebooks-memcached-multiget-hole-more-machines-more-capacit.html

“What happens when you add more servers is that the number of requests is not reduced, only the number of keys
in each request is reduced. The number keys returned in a request only matters if you are bandwidth limited. The
server is still on the hook for processing the same number of requests. Adding more machines doesn't change the
number of request a server has to process and since these servers are already CPU bound they simply can't handle
more load. So adding more servers doesn't help you handle more requests. Not what we usually expect. This is
another example of why architecture matters.”
So, now it “Loadbalances”...

...and with load balancing, inconsistencies along the time axis can happen (e.g. by reading alternately from
out-of-sync backends)
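
A minimal sketch of how this can happen, assuming a plain round-robin balancer in front of two cache backends where one has not yet received the latest update. The names Replica, read_sequence and is_monotonic are made up for this illustration and do not come from any particular product.

```python
from itertools import cycle


class Replica:
    """A cache backend that only knows the last version replicated to it."""

    def __init__(self, name):
        self.name = name
        self.version = 0

    def apply(self, version):
        self.version = version


def read_sequence(replicas, reads):
    """Issue reads round-robin, as a simple load balancer would."""
    picker = cycle(replicas)
    return [next(picker).version for _ in range(reads)]


def is_monotonic(versions):
    """True when an observer never sees an older version after a newer one."""
    return all(a <= b for a, b in zip(versions, versions[1:]))


fresh, stale = Replica("fresh"), Replica("stale")
fresh.apply(2)  # the update has reached this backend...
stale.apply(1)  # ...but this one is still lagging behind

observed = read_sequence([fresh, stale], 4)
print(observed)                # [2, 1, 2, 1]
print(is_monotonic(observed))  # False: the client sees time go backwards
```

Pinning a client to a single backend (sticky sessions) would make each client's view monotonic again, at the cost of less even load distribution.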
...but then you can have...

With the possibility of state sequences that do NOT make sense.
1 of 3=> UNexpected event sequence (3 slide animation which SlideShare won’t handle)
Inconsistency across time

2 of 3=> UNexpected event sequence (3 slide animation which SlideShare won’t handle)
Inconsistency across time

3 of 3=> UNexpected event sequence (3 slide animation which SlideShare won’t handle)
...now it can pick >1 versions!

Why you can have inconsistencies along the time axis.
Data Caching Consistency
                               Multi-layer and/or Load Balanced caches

                               Changes to cached data => can cause
                               => Inconsistency Across Time

                               Some candidate solutions:
                                    All in Memory with update Push
                                    (instead of TTL + Pull)
                                    Cache Replication/Synchronization
                                    The “Schrödinger” Consistency Model...

Even on a “live” site you can use a short-lived cache. If the user can NOT observe the exact time of each server state
change, are any server-to-client delays (due to caching) really there?

Moreover, the choice is often between small update-to-view delays due to caching and really big ones (or the site
being down) due to overload.
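
A short-lived cache of this kind can be sketched as below. This is only an illustration of the "TTL + Pull" approach: the TTLCache class, its loader callback and the injectable clock are hypothetical names chosen for the example, and the 3-second default is an arbitrary value.

```python
import time


class TTLCache:
    """Pull-through cache whose entries expire after a fixed time to live."""

    def __init__(self, loader, ttl_seconds=3.0, clock=time.monotonic):
        self._loader = loader    # called on a miss to fetch the fresh value
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so that tests can control time
        self._entries = {}       # key -> (value, expiry time)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]      # still fresh enough: no backend hit
        value = self._loader(key)
        self._entries[key] = (value, now + self._ttl)
        return value
```

With this shape the backend sees at most one load per key per TTL window from each cache instance, however many readers there are, and no reader sees data older than the TTL.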
The “Schrödinger”
         Consistency Model
                               A Schrödinger’s Cache?
                                    Data Inconsistencies only
                                    matter if they can be observed
                                    ...but the observer might just be another
                                    system if its work quality is affected
                                    Parallelism x Accuracy of State Evaluation

                               The case for the 3-second Cache on a “Live Site”

Parallelism x Accurate State Evaluation:
- Cliff Click’s Non-Blocking Counter
- How many loaves of bread exist at a given moment in the stores of a large supermarket network?

You might have to live without an at-the-moment state evaluation. An accurate evaluation of a past moment’s state might
still be possible.
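
The trade-off can be illustrated with a striped counter. This is a rough sketch loosely inspired by the idea above, NOT Cliff Click's implementation: his counter is lock-free, while this one simply uses one lock per stripe, and the names StripedCounter and run_workers are invented for the example.

```python
import threading


class StripedCounter:
    """Counter split into stripes so that writers rarely contend."""

    def __init__(self, stripes=8):
        self._counts = [0] * stripes
        self._locks = [threading.Lock() for _ in range(stripes)]

    def increment(self):
        # Route each thread to a stripe by its id (an illustrative choice).
        i = threading.get_ident() % len(self._counts)
        with self._locks[i]:
            self._counts[i] += 1

    def estimate(self):
        # No global lock is taken: while writers are active the sum may
        # already be out of date, so this is only an estimate.
        return sum(self._counts)


def run_workers(counter, threads=4, increments=1000):
    """Increment from several threads and wait for all of them to finish."""
    def work():
        for _ in range(increments):
            counter.increment()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()


counter = StripedCounter()
run_workers(counter)
print(counter.estimate())  # exact here only because every writer has finished
```

Like the supermarket example, the total is only trustworthy for a moment at which nobody is writing; reading it while writers are active gives an approximation.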

Java Caching Solutions
                               http://java-source.net/open-source/cache-solutions
                                    EhCache, OSCache, JBoss Cache, Apache JCS,
                                    Terracotta, etc
                                    EhCache now at Terracotta

                               Oracle Coherence

                               GigaSpaces XAP Data Grid

                               IBM WebSphere eXtreme Scale


* FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION

Other Interesting
         Caching Solutions

                               Varnish (HTTP Cache)

                               Redis

                               Memcached

                               There are many, many others...



Slow and Big Consistency
          (The Higher Latency - BigData)




MapReduce is for embarrassingly
      parallel problems with some time...

Consistency scenarios, starting from the most “sexy” (Web, petabytes of data):
* MapReduce works like vote counting - votes are mapped to voting tables, counted, then “reduced” to stats;
* MR is appropriate for "embarrassingly parallel" tasks, like indexing the Internet and other huge processing tasks;
* We should use it whenever possible;
* There is a lot to be learned about MapReduce:
 - Evaluating and expressing candidate problems;
 - Building and managing its infrastructure;
 - etc.
* Even MR has coordination needs;
* Even MR should have SLAs (Service Level Agreements).
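
The vote-counting analogy can be sketched in a few lines. This is a single-process illustration of the map and reduce steps only, with no distribution, scheduling or fault tolerance; the ballots and the function names map_table and reduce_counts are made up for the example.

```python
from collections import Counter
from functools import reduce


def map_table(ballots):
    """Map step: each voting table counts its own ballots independently."""
    return Counter(ballots)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Hypothetical ballots, one list per voting table.
tables = [["A", "B", "A"], ["B", "B"], ["A", "C"]]

partials = [map_table(t) for t in tables]  # could run on separate machines
totals = reduce(reduce_counts, partials, Counter())
print(dict(totals))
```

Because the merge is associative and commutative, partial counts can be combined in any order and grouping, which is what makes the problem "embarrassingly parallel".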
Coordination
             Consensus needed for Map Reduce
                   Consensus is the process of agreeing on one result
                   among a group of participants

             Consensus is not as easy as it seems
                   Byzantine Generals' Problem + 2 Generals' Problem
                   All solutions are probabilistic



* FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION

http://en.wikipedia.org/wiki/Two_Generals'_Problem

http://en.wikipedia.org/wiki/Byzantine_fault_tolerance#Origin

The thought experiment involves considering how they might go about coming to consensus. In its simplest form one general (referred to as the "first general" below) is known
to be the leader, decides on the time of attack, and must communicate this time to the other general. The requirement that causes the "problem" is that both generals must
attack at the agreed upon time to succeed. Having a solitary general attack is considered a disastrous failure. The problem is to come up with algorithms that the generals can
use, including sending messages and processing received messages, that can allow them to correctly conclude:
Yes, we will both attack at the agreed upon time.
Note that it is quite simple for the generals to come to an agreement on the time to attack. One successful message with a successful acknowledgement suffices for that. The
subtlety of the Two Generals' Problem is in the impossibility of designing algorithms for the generals to use to safely agree to the above statement.

Illustrating the problem
The first general may start by sending a message "Let us attack at 9 o'clock in the morning." However, once dispatched, the first general has no idea whether or not the
messenger got through. Any amount of uncertainty may lead the first general to hesitate to attack, since if the second general does not also attack at that time, the city's
garrison will repel the advance, leading to the destruction of that attacking general's forces. Knowing this, the second general may send a confirmation back to the first: "I
received your message and will attack at 9 o'clock." However, what if the confirmation messenger were captured? The second general, knowing that the first will hesitate
without the confirmation, may himself hesitate. A solution might seem to be to have the first general send a second confirmation: "I received your confirmation of the planned
attack." However, what if that messenger were captured? It quickly becomes evident that no matter how many rounds of confirmation are made there is no way to guarantee
the second requirement that both generals agree the message was delivered.
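
The "All solutions are probabilistic" point from the slide can be made concrete with a small calculation. As a sketch, assume the first general sends several messengers with the same message and that each one is captured independently with the same probability; both the independence assumption and the 50% capture rate used below are illustrative, not taken from the sources above.

```python
def delivery_probability(loss_rate, messengers):
    """Chance that at least one of several independent messengers arrives."""
    return 1 - loss_rate ** messengers


for n in (1, 2, 5, 10):
    print(n, delivery_probability(0.5, n))
```

The probability of delivery gets closer and closer to 1 as more messengers are sent, but for any finite number of messengers it stays below 1. Confidence can be bought, certainty cannot.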
MapReduce + Consensus
                                    (Google + Hadoop Implementations)

                Google, coordination by Chubby using Paxos.
                Used only at Google;
                Google BigTable is a Wide Column Store which works
                on top of GoogleFS. Used only at Google;
                Hadoop, used at Amazon, Facebook, Rackspace,
                Twitter, Yahoo!, etc.;
                Hadoop ZooKeeper implements a Paxos variation and
                is used at Rackspace, Yahoo!, etc.;
                Hadoop HBase is a Wide Column Store, on top of
                HDFS and now uses ZooKeeper. Used at Yahoo! etc.

* FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION

Parallel between Google’s internally developed systems and their Hadoop counterparts.
 http://hadoop.apache.org/
 http://labs.google.com/papers/

The very interesting “coordinators”:
 http://labs.google.com/papers/chubby.html
 http://hadoop.apache.org/zookeeper/

ZooKeeper sure looks like a very interesting and reusable piece of software.

Curiosity: HBase has been faster since it started using ZooKeeper... is that also because of ZooKeeper?
 http://hadoop.apache.org/hbase/
Apache Hadoop Projects
                              HBase (Distributed DB)
                              HDFS (Distributed File System)
                              MapReduce
                              Pig (Query Language++)
                              ZooKeeper (Coordination)
                              others (Common, Hive, Chukwa)

* FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION

http://hadoop.apache.org/
Apache ZooKeeper
                  Distributed Coordination
                  for Distributed Applications
                  Design Goals:
                        Simple, Replicated, Ordered, Fast (very resilient too)

                  Other Properties:
                        Thousands of clients, better for 10:1 reads:writes



http://hadoop.apache.org/zookeeper/docs/r3.3.1/zookeeperOver.html
Apache ZooKeeper API
                           Simple API:
                                    Filesystem like node tree
                                    Conditional Updates
                                    Watches (notifications)
                                    Ephemeral and Sequence Nodes

                           Out of box + recipes for:
                                    Name Service, Configuration, Group
                                    Membership, Barriers, Queues, Locks,
                                    Two-Phase Commit, Leader Election

* FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION

http://hadoop.apache.org/zookeeper/docs/r3.3.1/zookeeperOver.html
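
To give an idea of how ephemeral and sequence nodes combine into a leader election recipe, here is a toy in-memory model. It is NOT the ZooKeeper client API: ToyCoordinator and its methods are hypothetical stand-ins, there is no network or replication, and the watches a real recipe would use to notify the next candidate are left out.

```python
class ToyCoordinator:
    """In-memory stand-in for a coordination service (not ZooKeeper itself)."""

    def __init__(self):
        self._next_seq = 0
        self._nodes = {}  # sequence number -> owning session id

    def create_ephemeral_sequential(self, session_id):
        """Create a node that gets the next sequence number and is tied to a session."""
        seq = self._next_seq
        self._next_seq += 1
        self._nodes[seq] = session_id
        return seq

    def close_session(self, session_id):
        """Ephemeral nodes disappear when their owning session ends."""
        self._nodes = {s: o for s, o in self._nodes.items() if o != session_id}

    def leader(self):
        """The candidate holding the lowest live sequence number leads."""
        if not self._nodes:
            return None
        return self._nodes[min(self._nodes)]


zk = ToyCoordinator()
for session in ("app-1", "app-2", "app-3"):
    zk.create_ephemeral_sequential(session)

print(zk.leader())         # app-1
zk.close_session("app-1")  # the leader crashes or disconnects
print(zk.leader())         # app-2
```

The appeal of the recipe is that failover needs no extra protocol: when the leader's session dies, its node vanishes and the next lowest sequence number takes over.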
Consistency w/ Interaction
          (Low Latency - read/write - harder stuff)




Two “High”/Sexy reasons for
            Distributing Data Storage
                                    (not just cache)


                High Performance Data Access
                (Read / Write)
                High Availability (HA)


Why care about HA?

                1.7% HDDs fail in the 1st year, 8.6% in the 3rd (Google)
                Unrecoverable RAM errors/year: 1.3% machines,
                0.22% DIMM (Google)
                Router, Rack, PDU, misc. network failures
                Over 4 nines only through redundancy, best hardware
                never good enough (James Hamilton-MS and Amazon)
                Hey! This might affect smaller fish like us!!!



* FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION

Sources:
For Google’s numbers check the slideware at:
 http://videolectures.net/wsdm09_dean_cblirs/

For the James Hamilton quote:
 http://mvdirona.com/jrh/TalksAndPapers/JamesRH_Ladis2008.pdf

Another very quoted paper with Google’s DRAM failure stats and patterns:
 http://research.google.com/pubs/pub35162.html

You can find other HA and Systems related papers from Google and James Hamilton at:
 http://mvdirona.com/jrh/work/
 http://research.google.com/pubs/DistributedSystemsandParallelComputing.html
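
A back-of-the-envelope calculation shows why these numbers matter even for small installations. The 1.7% first-year HDD failure rate is the figure from the slide; the fleet of 100 disks, the 99% availability of a single machine and the assumption that failures are independent are all illustrative choices, not data from the sources above.

```python
def prob_any_failure(annual_failure_rate, disks):
    """Chance that at least one of several independent disks fails in a year."""
    return 1 - (1 - annual_failure_rate) ** disks


def combined_availability(single, replicas):
    """Availability when the service is up as long as any one replica is up."""
    return 1 - (1 - single) ** replicas


print(prob_any_failure(0.017, 100))    # roughly 0.82
print(combined_availability(0.99, 1))  # two nines on one machine
print(combined_availability(0.99, 2))  # about four nines with two replicas
print(combined_availability(0.99, 3))  # about six nines with three replicas
```

With only 100 disks, at least one first-year failure is already more likely than not, and each extra independent replica multiplies the unavailability by the single-machine failure probability. In practice failures are often correlated (same rack, same PDU, same bug), so real gains are smaller.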
Why care about Latency?
                Google: Half a second delay caused a 20% drop in
                traffic (30 results instead of 10, via Marissa Mayer);
                Amazon found every 100ms of latency costs 1% sales
                (via Greg Linden);
                A broker could lose $4 million in revenues per
                millisecond if their electronic trading platform is 5 ms
                behind the competition (via NYT).
                Hey! This affects anyone online!



* FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION

You can find all these references through this page (if you follow the links):
 http://highscalability.com/latency-everywhere-and-it-costs-you-sales-how-crush-it

Including these:
  http://glinden.blogspot.com/2006/11/marissa-mayer-at-web-20.html
  http://perspectives.mvdirona.com/2009/10/31/TheCostOfLatency.aspx
  http://www.nytimes.com/2009/07/24/business/24trading.html?_r=2&hp
Fallacies of Distributed Computing
          (What can go wrong?)
                                    1. The network is reliable;
                                    2. Latency is zero;
                                    3. Bandwidth is infinite;
                                    4. The network is secure;
                                    5. Topology doesn't change;
                                    6. There is one administrator;
                                    7. Transport cost is zero;
                                    8. The network is homogeneous.


* FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION

Just a reminder of this classic on the HA challenges. A few more details at:
  http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing
Other Distributed Data Contexts
        (the less sexy daily stuff?)

                   EAI / B2B / Systems Integration

                   Geographic Distribution
                   (e.g.:Health System+Hospitals)

                   Systems with n-tier / SOA Architectures

                   Elasticity on Peaks (still sexy...)


The daily jobs of so many IT professionals have much more to do with these common kinds of distributed systems
than with the sexier kind we talked about before. But these fields too would benefit from learning the lessons
and using the technologies we are talking about.
The CAP Theorem and
          Eventual Consistency



CAP Theorem History
              1999: 1st mention in the “Harvest, Yield and Scalable Tolerant Systems”
              paper by Eric A. Brewer (Berkeley/Inktomi) and Armando Fox (Stanford/Berkeley)

              2000-07-19: Brewer’s CAP Conjecture part of Brewer’s keynote to the PODC
              Conference

              2002-06: Brewer’s CAP Theorem proof published by Seth Gilbert (MIT) and
              Nancy Lynch (MIT)

              2007-10-02: “Amazon's Dynamo” post by Werner Vogels
              (Amazon’s CTO) quoting a paper (by him + others)
              2007-12-19: “Eventually Consistent” post by Werner Vogels (Amazon’s CTO)

              2010-04-23: With the PACELC model, Daniel Abadi reminds us on his blog
              of the importance of latency in BASE vs. ACID and in other tuning
              decisions for designs that revolve around CAP.



* FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION

The online book “CouchDB: The Definitive Guide” has an interesting introduction to these concepts - the “Eventual
Consistency” chapter:
 http://books.couchdb.org/relax/intro/eventual-consistency

Really essential and truly amazing is the Dynamo paper by Werner Vogels et al, proof that BASE really works in
truly industrial sites, even with stats describing real life behavior:
  http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

...and the now famous Eventually Consistent post by Werner Vogels:
  http://www.allthingsdistributed.com/2007/12/eventually_consistent.html

If you dislike the introductory (justifiable) drama, just jump to the next part, because this article by Julian Browne
is the best I found about Brewer’s CAP Theorem and its history:
  http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

You should still take a look at:
* The 1997 “Cluster-Based Scalable Network Services” paper (Brewer et al.) where the BASE vs ACID dilemma is
already mentioned:
 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1.2034&rep=rep1&type=pdf

* The 1999 “Harvest, Yield and Scalable Tolerant Systems” paper (Brewer et al.) where the CAP conjecture is already
mentioned:
 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3690&rep=rep1&type=pdf

* The PODC 2000 keynote, by Brewer, that made the CAP conjecture and the BASE concept “popular”:
 http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf

* You might also see with your own eyes how CAP became a proved Theorem:
 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf

* The PACELC model was described by Daniel Abadi on his blog at:
 http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html

Definition of ACID:
The CAP Theorem
               strong Consistency, high Availability, Partition-resilience:
                                   pick at most 2

I simply had to put The Diagram, of course.

According to http://books.couchdb.org/relax/intro/eventual-consistency
...examples of what goes in the intersections:
CA => Classic RDBMS
      Enforced Consistency
CP => Paxos / Consensus
      Consensus Protocols for HA Consistency
AP => CouchDB + Eventually Consistent DBs
      Eventual Consistency
Eventual Consistency for
          Availability
          BASE                                                          ACID
          (Basically Available Soft-state Eventual consistency)         (Atomicity, Consistency, Isolation, Durability)


                Weak Consistency                                              Strong consistency
                (stale data ok)                                               (NO stale data)
                Availability first                                             Isolation
                Best effort                                                   Focus on “commit”

                Approximate answers OK                                        Availability?

                Aggressive (optimistic)                                       Conservative (pessimistic)

                Faster / Lower Latency                                        Safer


* FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION

You can find a variation of this slide in Brewer’s 2000 PODC keynote at:
 http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf

I skipped these rather controversial bits:
  ACID: * Nested transactions; * Difficult evolution (e.g. schema)
  BASE: * Simpler! * Easier evolution

I have already tried both ways (data stores with and without schema) and I would rather have some schema mechanism for the most
complex stuff.


ACID:
A)tomicity
Either all of the tasks of a transaction are performed or none of them are.

C)onsistency
A database remains in a consistent state before the start of the transaction and after the transaction is over (whether successful
or not).

I)solation
Other operations cannot access or see the data in an intermediate state during a transaction.

D)urability
Once the user has been notified of success, the transaction will persist. This means it will survive system failure, and that the
database system has checked the integrity constraints and won't need to abort the transaction.
CAP Trade-offs
                CA without P: Databases providing distributed transactions can
                only do it while their network is ok;

                CP without A: While there is a partition, transactions to an ACID
                database may be blocked until the partition heals
                (to avoid merge conflicts -> inconsistency);

                AP without C: Caching provides client-server partition resilience
                by replicating data, even if the partition prevents verifying if a
                replica is fresh. In general, any distributed DB problem can be
                solved with either:
                      expiration-based caching to get AP;
                      or replicas and majority voting to get PC
                      (minority is unavailable).


* VERY FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION

Concept introduced in the 1999 “Harvest, Yield and Scalable Tolerant Systems” paper (Brewer et al.):
 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3690&rep=rep1&type=pdf

I should probably skip this slide during a live presentation. This is stuff you have to read about.
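
The "replicas and majority voting" option from the slide can be sketched as a simple quorum check. This only shows which side of a partition may keep accepting writes; it is not a consensus protocol, and the function names has_quorum and partition_outcome are invented for the example.

```python
def has_quorum(reachable, cluster_size):
    """A side may proceed only if it can reach a strict majority of replicas."""
    return reachable > cluster_size // 2


def partition_outcome(cluster_size, side_a):
    """Which side of a two-way network partition stays available?"""
    side_b = cluster_size - side_a
    return {
        "side_a": has_quorum(side_a, cluster_size),
        "side_b": has_quorum(side_b, cluster_size),
    }


print(partition_outcome(5, 3))  # {'side_a': True, 'side_b': False}
print(partition_outcome(4, 2))  # {'side_a': False, 'side_b': False}
```

At most one side can ever hold a majority, so the two sides cannot accept conflicting writes: consistency is kept, and the price is that the minority (or, in an even split, everybody) is unavailable until the partition heals.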
Living with CAP
             All systems are probabilistic, whether they realize it or not
             And so are Distributed Transactions (2 Generals Problem)
             Life is Eventually Consistent
             Weak CAP Principle: The stronger the guarantees made
             about any two of C, A and P, the weaker the guarantees
             that can be made about the third
             Systems should degrade gracefully, instead of all or
             nothing (e.g.: displaying data from available partitions)


* #1 => Explain why Life is Eventually Consistent

Steve Yen clearly illustrates the “Life is Eventually Consistent” idea on the slideware (slides 40 to 45) he used for
his “No SQL is a Horseless Carriage” talk at NoSQL Oakland 2009:
 http://dl.dropbox.com/u/2075876/nosql-steve-yen.pdf

The Weak CAP Principle was introduced in the 1999 “Harvest, Yield and Scalable Tolerant Systems” paper (Brewer et
al.):
 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3690&rep=rep1&type=pdf

To understand how hard (ACID) Distributed Transactions are, there is an excellent history of the concepts related
to this problem here:
 http://betathoughts.blogspot.com/2007/06/brief-history-of-consensus-2pc-and.html

The difficulties of (ACID) Distributed Transactions are well illustrated by the classic Two Generals’ Problem:
 http://en.wikipedia.org/wiki/Two_Generals'_Problem

Leslie Lamport et al further explore the problem (and its solutions) on the classic “The Byzantine Generals Problem”
paper:
 http://research.microsoft.com/en-us/um/people/lamport/pubs/byz.pdf

And if you think that Two Phase Commit is a 100% reliable mechanism... think again:
 http://www.cs.cornell.edu/courses/cs614/2004sp/papers/Ske81.pdf

This is just to illustrate the difficulty of the problem. There are more reliable mechanisms, like Three Phase
Commit:
 http://en.wikipedia.org/wiki/Three-phase_commit_protocol
 http://ei.cs.vt.edu/~cs5204/fall99/distributedDBMS/sreenu/3pc.html

...or the so called Paxos Commit:
  http://research.microsoft.com/pubs/64636/tr-2003-96.pdf
CAP Theorem History
              1999: 1st mention in the “Harvest, Yield and Scalable Tolerant Systems”
              paper by Eric A. Brewer (Berkeley/Inktomi) and Armando Fox (Stanford/Berkeley)

              2000-07-19: Brewer’s CAP Conjecture part of Brewer’s keynote to the PODC
              Conference

              2002-06: Brewer’s CAP Theorem proof published by Seth Gilbert (MIT) and
              Nancy Lynch (MIT)

              2007-10-02: “Amazon's Dynamo” post by Werner Vogels
              (Amazon’s CTO) quoting a paper (by him + others)
              2007-12-19: “Eventually Consistent” post by Werner Vogels (Amazon’s CTO)

              2010-04-23: With the PACELC model, Daniel Abadi reminds us on his blog
              of the importance of latency in BASE vs. ACID and in other tuning
              decisions for designs that revolve around CAP.



* VERY FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION

* Repeated slide, repeated notes (to pass focus from CAP to Dynamo and Eventual Consistency):

The online book “CouchDB: The Definitive Guide” has an interesting introduction to these concepts - the “Eventual
Consistency” chapter:
 http://books.couchdb.org/relax/intro/eventual-consistency

Really essential and truly amazing is the Dynamo paper by Werner Vogels et al, proof that BASE really works in
truly industrial sites, even with stats describing real life behavior:
  http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

...and the now famous Eventually Consistent post by Werner Vogels:
  http://www.allthingsdistributed.com/2007/12/eventually_consistent.html

If you dislike the introductory (justifiable) drama, just jump to the next part because this article, by Julian Browne,
is the best I found about the Brewer’s CAP Theorem and its history:
  http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

You should still take a look at:
* The 1997 “Cluster-Based Scalable Network Services” paper (Brewer et al.) where the BASE vs ACID dilemma is
already mentioned:
 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1.2034&rep=rep1&type=pdf

* The 1999 “Harvest, Yield, and Scalable Tolerant Systems” paper (Brewer et al.) where the CAP conjecture is already
mentioned:
 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3690&rep=rep1&type=pdf

* The PODC 2000 keynote, by Brewer, that made the CAP conjecture and the BASE concept “popular”:
 http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf

* You might also see with your own eyes how CAP became a proved Theorem:
 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf

* The PACELC model was described by Daniel Abadi on his blog at:
 http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html
Amazon’s Dynamo DB
                        Also a “Wide Column Store”


         Problem                              Technique
         Partitioning                         Consistent Hashing
         High Availability for writes         Vector clocks with reconciliation during reads
         Handling temporary failures          Sloppy Quorum and hinted handoff (NRW)
         Recovering from permanent failures   Anti-entropy using Merkle trees
         Membership and failure detection     Gossip-based membership protocol and failure detection


                 (in bold some techniques which could improve many
                          “enterprise” / “every-day” solutions)
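
To make the “Consistent Hashing” row a bit more concrete, here is a minimal Python sketch of a hash ring with virtual nodes. It is only an illustration under assumptions, not Dynamo's code: the HashRing class, the node names, the vnodes count and the use of SHA-256 are all hypothetical choices made for this example.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring (hypothetical names, not Dynamo's implementation)."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        # SHA-256 is an assumption made for this sketch; any well-spread hash works.
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise: the first ring point after the key's hash owns the key.
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"cart:{i}" for i in range(1000)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
print(f"{len(moved)} of {len(keys)} keys changed owner after adding node-d")
```

The point of the design is that adding a node only takes keys away from existing nodes and hands them to the new one; keys never move between the old nodes, which is what keeps repartitioning cheap.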

* FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION

The source here is the already mentioned Dynamo paper:
  http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

Strict distributed DBs, rather than dealing with the uncertainty of the correctness of an answer, make data
unavailable until it is absolutely certain that it is correct.

At Amazon, SLAs are expressed and measured at the 99.9th percentile of the distribution, since the average or median
is not good enough to provide a good experience for all. The choice of 99.9% over an even higher percentile was based
on a cost-benefit analysis which demonstrated a significant increase in cost to improve performance that much.
Experience with Amazon’s production systems has shown that this approach provides a better overall experience
compared to systems that meet SLAs defined on the mean or median.
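
A small worked example of why the mean and the median hide the slow tail. This is a sketch with made-up numbers (the latencies and the nearest-rank percentile helper are assumptions for illustration, not data from the Dynamo paper):

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 9,980 fast requests and 20 very slow ones.
latencies_ms = [10] * 9980 + [2000] * 20

mean_ms = sum(latencies_ms) / len(latencies_ms)
median_ms = percentile(latencies_ms, 50)
p999_ms = percentile(latencies_ms, 99.9)

print(f"mean={mean_ms:.2f} ms  median={median_ms} ms  p99.9={p999_ms} ms")
```

With these numbers the mean and the median both look healthy while the 99.9th percentile exposes the two-second requests, which is the kind of user experience a percentile-based SLA is meant to catch.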
Tuning Consistency:
                                    N: number of nodes to replicate each item to;
                                    W: number of required nodes for write success;
                                    R: number of required nodes for read success.

                                    W < N = remaining nodes will receive the write later.
                                    R < N = remaining nodes ignored.


Also based on the already mentioned Dynamo paper:
  http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

...but you can find a similar diagram and similar mechanisms described for several (NoSQL) databases that
partially clone Dynamo.
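
The N/W/R knobs can be explored with a few lines of Python. The Dynamo paper notes that setting R + W > N yields a quorum-like system; the sketch below checks that overlap by brute force and shows a stale read when the condition does not hold. The function names and the (version, value) tuples are assumptions made for this example, not any database's API.

```python
from itertools import combinations


def quorums_always_overlap(n, w, r):
    """True if every write set of size w shares a node with every read set of size r."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


def read_latest(replicas, read_set):
    """Return the (version, value) pair with the highest version among contacted replicas."""
    return max(replicas[i] for i in read_set)


# N=3 replicas all hold version 1; a write of version 2 succeeds on W=2 of them.
replicas = [(1, "old"), (1, "old"), (1, "old")]
for i in (0, 1):
    replicas[i] = (2, "new")

print(quorums_always_overlap(3, 2, 2))   # R + W > N
print(quorums_always_overlap(3, 2, 1))   # R + W == N
print(read_latest(replicas, (1, 2)))     # R=2 read that includes an updated replica
print(read_latest(replicas, (2,)))       # R=1 read that hits only the lagging replica
```

Lowering R or W buys latency and availability at the price of possibly reading a replica that has not received the write yet, which is exactly the tuning decision the slide describes.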
Previous Experiences...




Eventually Consistent Systems

                               Banks
                               EAI Integrations
                               Many messaging based (SOA) systems
                               Google
                               Amazon
                               Etc.




Unlike what many examples say, banks often use Eventual Consistency for many (limited value/risk) transactions,
or use “large” fixed periodic transaction / compensation windows to process large numbers of larger value
movements. So much for those ACID transaction examples...
Amazon Dynamo Lessons
          (according to the paper)

                 Data returned to Shopping Cart 24h profiling:
                 0.00057% of requests saw 2 versions; 0.00047% of
                 requests saw 3 versions and 0.00009% of requests
                 saw 4 versions.
                 In two years applications have received successful
                 responses (without timing out) for 99.9995% of their
                 requests and no data loss event has occurred to date;
                 With coordination via Gossip protocol it is harder to
                 scale further than a few hundred nodes.
                 (Could be better w/ Chubby / ZK like coordinators?)

Also based on the already mentioned Dynamo paper:
  http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

Wikipedia has an article on Gossip Protocols (although, at the date I write this, not as precise as other Wikipedia
articles I just quoted):
 http://en.wikipedia.org/wiki/Gossip_protocol
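
For readers who have never seen one, this is a minimal push-gossip simulation in Python. It is a generic textbook-style sketch, not Dynamo's membership protocol: the push_gossip function, its parameters and the "one random peer per round" rule are assumptions chosen to keep the example short.

```python
import random


def push_gossip(n_nodes, seed=0, max_rounds=1000):
    """Spread one rumour from node 0; every informed node pushes to one random peer per round."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    messages = 0
    while len(informed) < n_nodes and rounds < max_rounds:
        rounds += 1
        newly_informed = set()
        for _ in informed:
            # The peer is picked blindly, so it may already know the rumour.
            newly_informed.add(rng.randrange(n_nodes))
            messages += 1
        informed |= newly_informed
    return rounds, messages, len(informed)


rounds, messages, informed = push_gossip(64)
print(f"{informed} nodes informed after {rounds} rounds and {messages} messages")
```

Because peers are chosen blindly, many messages land on nodes that already know the rumour. That redundancy is what makes gossip robust, and it is also part of the overhead discussed in the notes that follow.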

The solution I mention as a possibly more scalable alternative to Gossip Protocols for consensus is the use of
Paxos (or derivatives) Coordinators, like Google’s proprietary Chubby or the open source Apache Hadoop
ZooKeeper.

When I first wrote and used these slides (at my SAPO Codebits 2009 talk), the only support I had for my (then
intuitive) belief that these more directed approaches should be more efficient than Gossip Protocols was section
6.6 of the Dynamo paper, which even mentions the possibility of “introducing hierarchical extensions to
Dynamo”.

Thanks to my SAPO Codebits talk I met Henrique Moniz, then a Ph.D. student at the University of Lisbon. After I
discussed this issue (consensus scalability) with him he pointed me to a couple of interesting papers, one of which
immediately captured my attention:
* Gossip-based broadcast protocols by João Leitão
 http://www.gsd.inesc-id.pt/~jleitao/pdf/masterthesis-leitao.pdf

This paper offers a more complete description of the overhead of gossip protocols and, to my surprise, also points
out a few reliability weak spots in known Gossip Protocols. The paper goes on to present a more robust and efficient
Gossip Protocol called “HyParView” using a more “directed” approach.

HyParView sure looks like an interesting solution in terms of robustness for environments with a high incidence
of system/network failures, but I still believe that using coordinators will be more efficient in a well controlled data
center.

Not that using coordinators and making them scale out BIG is exactly trivial, as you can read here:
-On the “Vertical Paxos and Primary-Backup Replication” paper, by Leslie Lamport et al, that Henrique Moniz
pointed me to:
 http://research.microsoft.com/pubs/80907/podc09v6.pdf

-Or on this interesting article from the Cloudera’s blog about the (now upcoming) Observers feature of Apache
NoSQL Java being used at
                Cassandra: at Facebook, being introduced on
                Twitter, persistent cache at reddit, replacing MySQL
                at digg, etc.
                Voldemort: at LinkedIn, Gilt Groupe e-commerce
                site (check Geir Magnusson’s QCon presentation);
                HBase: Yahoo!, Twitter, Adobe, Ning,
                Stumbleupon, Meetup, etc. Often used for high
                volume analytics but also for other high volume stores
                and M-R tasks.


* FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION

http://www.infoq.com/presentations/Project-Voldemort-at-Gilt-Groupe
Is NoSQL better than SQL?
                The NoSQL vs. SQL database debate is really about
                ACID vs. BASE databases
                A query language advantage indicator is given by
                Hadoop Pig use at Twitter (via Kevin Weil):
                      The Pig version is
                            5% of the code
                            5% of the time
                            Within 50% of the execution time
                Anyone who has used c-tree-like DB APIs can say the same


http://squarecog.wordpress.com/2009/11/03/apache-pig-apittsburgh-hadoop-user-group/

http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009
Some interesting techniques...
          (...which we could all be using...)




Wikipedia image



                                               Vector Clocks
                           On each internal event a process increments its logical clock;
                           Before sending a message, it increments its own clock in the
                           vector and sends the vector with the message;
                           On receiving a message, it increments its clock and updates
                           each element of its own vector to max(own, msg).


* FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION

Also based on the already mentioned Dynamo paper:
  http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

...and on the Wikipedia article about this algorithm:
  http://en.wikipedia.org/wiki/Vector_clock

Vector Clocks (and other similar algorithms) have a predecessor in Lamport timestamps:
 http://en.wikipedia.org/wiki/Lamport_timestamps

Introduced in the classic paper “Time, Clocks, and the Ordering of Events in a Distributed System” by Leslie
Lamport:
  http://en.wikipedia.org/wiki/Lamport_timestamps
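
The three rules on the Vector Clocks slide translate almost line by line into code. The following Python sketch is an illustration only: the VectorClock class, the compare helper and the process names p1/p2/p3 are hypothetical, and clocks are kept as plain dictionaries where a missing entry counts as 0.

```python
class VectorClock:
    """One process's vector clock, stored as {process id: counter}."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.clock = {}

    def tick(self):
        # Rule 1: on each internal event, increment the process's own entry.
        self.clock[self.node_id] = self.clock.get(self.node_id, 0) + 1

    def send(self):
        # Rule 2: increment the own entry, then ship a copy of the whole vector.
        self.tick()
        return dict(self.clock)

    def receive(self, msg_clock):
        # Rule 3: increment the own entry, then take the element-wise maximum.
        self.tick()
        for node, count in msg_clock.items():
            self.clock[node] = max(self.clock.get(node, 0), count)


def compare(a, b):
    """Order two clock snapshots: 'before', 'after', 'equal' or 'concurrent'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


p1, p2, p3 = VectorClock("p1"), VectorClock("p2"), VectorClock("p3")
m1 = p1.send()        # p1 sends a message to p2
p2.receive(m1)
m2 = p2.send()        # p2 sends a later message
p3.tick()             # p3 does something on its own, unaware of both

print(compare(m1, m2))
print(compare(m2, p3.clock))
```

The "concurrent" outcome is the interesting one: it is the case a single timestamp cannot express, and the one where a store like Dynamo has to keep both versions and reconcile them on read.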
Wikipedia image




                                    Merkle Tree / Hash Tree
                                Used to verify / compare a set of data blocks
                                and efficiently find where the mismatches are.


* FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION

Also based on the already mentioned Dynamo paper:
  http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

...and on the Wikipedia article about this algorithm:
  http://en.wikipedia.org/wiki/Hash_tree
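
How a hash tree finds the mismatching blocks can be sketched in a few lines of Python. This is a minimal illustration under assumptions (SHA-256, both replicas holding the same number of blocks, hypothetical function names), not the anti-entropy code of Dynamo or of any of its clones.

```python
import hashlib


def _h(data):
    return hashlib.sha256(data).digest()


def build_tree(blocks):
    """Return the tree as a list of levels: leaf hashes first, root level last."""
    level = [_h(block) for block in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            # Odd number of nodes: pair the last one with itself.
            level = level + [level[-1]]
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def diff_blocks(tree_a, tree_b):
    """Indices of differing blocks, descending only into subtrees whose hashes differ."""
    if len(tree_a[0]) != len(tree_b[0]):
        raise ValueError("this sketch only compares trees with the same number of blocks")
    suspects = [0]  # node indices at the current level, starting at the root
    for depth in range(len(tree_a) - 1, 0, -1):
        differing = [i for i in suspects if tree_a[depth][i] != tree_b[depth][i]]
        width_below = len(tree_a[depth - 1])
        suspects = [c for i in differing for c in (2 * i, 2 * i + 1) if c < width_below]
    return [i for i in suspects if tree_a[0][i] != tree_b[0][i]]


blocks_a = [f"block-{i}".encode() for i in range(8)]
blocks_b = list(blocks_a)
blocks_b[5] = b"corrupted"

print(diff_blocks(build_tree(blocks_a), build_tree(blocks_b)))
```

When two replicas agree, comparing the two root hashes is enough; when they disagree, only the branches with differing hashes are followed, so the amount of data exchanged grows with the number of mismatches rather than with the size of the data set.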
ACID and FAST
          (Lowest Latency - read/write - hardest stuff)




Immediately Consistent Systems

                  Trading
                  Online Gambling

                  Data-grids:
                   Oracle Coherence
                   Gigaspaces
                  All Data in RAM
                  Can do ACID
                  Very High Speed
                  Max. Scale-out



* FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION

Trading and Online Gambling really need to do large volumes of fast ACID transactions and are the big customers
of Data Grids.

Why Online Gambling needs ACID transactions has everything to do with the type of game and the type of rules/assets
(some virtual) it involves.

Why Trading really needs ACID is a bit more obvious: you might be able to compensate an overdraft at a bank
(more so for limited values) but you really cannot sell shares you do not have for sale.

The performance needs are obvious for both too. For Trading there are even some new reasons, like (again):
  http://www.nytimes.com/2009/07/24/business/24trading.html?_r=2&hp
Tools
          (Most with source code to pick from)




NoSQL Taxonomy
          by Steve Yen [PG]
                key‐value‐cache: memcached, repcached, coherence, infinispan, eXtreme scale, jboss
                cache, velocity, terracotta [???]
                key‐value‐store: keyspace [w/Paxos], flare, schema‐free, RAMCloud [, Mnesia (Erlang),
                Chordless]
                eventually‐consistent key‐value‐store: dynamo, Voldemort, Dynomite, SubRecord,
                MotionDb, Dovetaildb
                ordered‐key‐value‐store: tokyo tyrant[, BerkeleyDB, JDBM], lightcloud, NMDB, luxio,
                memcachedb, actord
                data‐structures server: redis
                tuple‐store: gigaspaces [?], coord, apache river
                object database: ZopeDB, db4o, Shoal
                document store: CouchDB [evC, MVCC], MongoDB [evC], Jackrabbit, XML Databases,
                ThruDB, CloudKit, Persevere, Riak Basho [evC], Scalaris [Erlang, w/Paxos]
                wide columnar store: BigTable, Hadoop HBase [w/ Zookeeper], [Amazon Dynamo-evC, ]
                Cassandra [evC], Hypertable, KAI, OpenNeptune, Qbase, KDI
                [graph database: Neo4J, Sones, etc.]


From the slideware (slide 54) Steve Yen used for his “No SQL is a Horseless Carriage” talk at NoSQL Oakland 2009:
 http://dl.dropbox.com/u/2075876/nosql-steve-yen.pdf

I do not completely understand or agree with Steve’s criteria but it sure is a possible starting point on building a
database/storage taxonomy.

The stuff in square brackets is mine. “evC” means Eventually Consistent and “?” just means I have doubts / don’t
understand some specific classification.
Some related Solutions I find interesting...
                Zookeeper
                (use it, configuration, elasticity, group membership, leader election, notification...)
                JDBM, BerkeleyDB (careful w/ the OS license)
                (just use them for very fast persistent storage)
                Voldemort and Cassandra
                (use them or pick code for Vector Clocks, Merkle Trees, data compression,
                communications and other code - nice code bases)
                Redis (not Java, but usable from Java and a kind of Swiss Army knife)
                The Riak Basho Bitcask store idea. I used something similar (but not generic) in Java:
                http://downloads.basho.com/papers/bitcask-intro.pdf
                EhCache
                (the pre-Terracotta version shows how simplistic some stuff can be and still work)
                Just use RMI and native Java serialization (as EhCache does)
                JBoss Netty
                (if you want to do seriously fast network communication)
                Varnish
                (an HTTP cache which knows how to use Virtual Memory)


http://downloads.basho.com/papers/bitcask-intro.pdf
Opportunities
          (...to use these tools)




Some cases we could talk about...
                               EAI Integrations
                               (Should use Vector Clocks?)
                               Zookeeper at the “Farm” (Config./Coord.)
                               Live soccer game site
                               Web sites in general
                               Log like / timeline systems
                               (forums, healthcare, Twitter, etc.)
                               Analytics
                               Logistics Planning across EU case
                               Trading


This is the placeholder slide to exercise the ideas and discuss possible applications of some of the mechanisms
which were presented on this talk (had no time at Codebits... still tuning this not-so-easy presentation).

Except for the last two scenarios (and the Twitter alternative in the “Log like” one), all others represent quite
common types of problems which you can meet without having to work for a Fortune Top 50 company or for a
mega web portal / service. Even an “Analytics” case with enough data to justify using MapReduce is common enough.
Many large (but not necessarily huge) companies often give up on doing more with the data they have just because of
the trouble of finding a way to do it (“more”).

* “Analytics” (lots of data + easy on consistency as it is) currently seems to be the playground of MapReduce, with
Hadoop stuff being used “everywhere”. Look at how many times you can find the words “analytics” or
“analysis” (and “MapReduce”) on these “Powered by” Hadoop web pages:
 http://wiki.apache.org/hadoop/PoweredBy
 http://wiki.apache.org/hadoop/Hbase/PoweredBy

* “Live soccer game...” is a nice problem to discuss short-lived caching and its consistency issues;

* “Log like / timeline systems...” are systems where information is mostly “insert only” and most of the effort to
keep consistency is related to keeping proper ordering information (with timestamps usually being enough),
properly merging the data from different sources and respecting the explicit or implicit SLAs on data
synchronization. Obviously, there are different difficulties across the several cases mentioned here, depending on
data flow, necessary performance, etc.;

* “EAI Integrations” often need better knowledge about ordering and are not as simple as the previous scenario.
Due to factors like the use of asynchronous and event driven mechanisms and the possibility of having updates
for a given document across multiple steps of one or more processes, a timestamp is often too limited as
ordering information... but is often the most you get. IMO this is a good scenario for using Vector Clocks and
company;

* “Zookeeper” is a great system even if “just” to configure the simplest web (or webservice) farm, to coordinate the
simplest cross farm operations (e.g.: cache related) or just for each server to know which are its peers;

* “Logistics Planning” is a complex scenario which demands a mix of solutions. It revolves around a logistics
company which transports goods across Europe, with planning offices in different countries. I will probably have
to remove it from this slide for any future talk I might give on this topic even if it is the most interesting of them
all. So, it does not make much sense to develop it here (maybe a blog post since, to me, this is a >10 year old
Q&A





Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 

Distributed Programming and Data Consistency w/ Notes - June 2010 update

  • 1. Distributed Programming and Data Consistency, by Paulo Gaspar (@paulogaspar7 on Twitter). This will be placed at: http://www.slideshare.net/paulogaspar7
    Thursday, 24 June 2010
    Twitter: @paulogaspar7 - http://twitter.com/paulogaspar7
    Blog: http://paulogaspar7.blogspot.com/
  • 2. This presentation is about...
    - Awareness of how most of us do some kind of Distributed Computing these days
    - Tuning Data Consistency for Fun and Profit
    - Tools and sources of knowledge related to Distributed Computing
    - Where to get some source code too
  • 4. What is Consistency?
    Notes: Our perception of consistency is related to what we know about the system and its state. That is how we figure out what might fit...
  • 5. What isn’t?
    Notes: ...and what does not fit. Obviously, a person will have a different degree of precision and tolerance than an automated system.
  • 6. Consistency across time
    Notes: Consistency also has a time axis, with state sequences that make sense... 1 of 3 => Expected event sequence (3-slide animation which SlideShare won’t handle)
  • 7. Consistency across time
    Notes: 2 of 3 => Expected event sequence (3-slide animation which SlideShare won’t handle)
  • 8. Consistency across time
    Notes: 3 of 3 => Expected event sequence (3-slide animation which SlideShare won’t handle)
  • 9. Inconsistency across time
    Notes: ...and state sequences that do NOT make sense. 1 of 3 => UNexpected event sequence (3-slide animation which SlideShare won’t handle)
  • 10. Inconsistency across time
    Notes: 2 of 3 => UNexpected event sequence (3-slide animation which SlideShare won’t handle)
  • 11. Inconsistency across time
    Notes: 3 of 3 => UNexpected event sequence (3-slide animation which SlideShare won’t handle)
  • 12. Consistency is perception ...and time matters...
    Notes: Again, each (type of) observer will have a different degree of evaluation precision and tolerance to inconsistencies.
  • 13. Cache Consistency (Low Latency - high read performance)
  • 14. The Case for LB Caches. Memcached at FB: You HAVE TO Replicate to Scale-Out
    Notes: An example of how you still might have to replicate in order to scale, even with a very high performance store. The reason for FB’s issue (might lack some detail): http://highscalability.com/blog/2009/10/26/facebooks-memcached-multiget-hole-more-machines-more-capacit.html
    “What happens when you add more servers is that the number of requests is not reduced, only the number of keys in each request is reduced. The number keys returned in a request only matters if you are bandwidth limited. The server is still on the hook for processing the same number of requests. Adding more machines doesn't change the number of request a server has to process and since these servers are already CPU bound they simply can't handle more load. So adding more servers doesn't help you handle more requests. Not what we usually expect. This is another example of why architecture matters.”
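The effect described in the quote can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: keys are sharded by a hypothetical modulo rule, every multiget asks for a run of consecutive key ids, and the function name `requests_per_server` is made up for this example. It counts how many requests land on each server when the same multiget workload is spread over more machines.

```python
from collections import Counter

def requests_per_server(num_multigets, keys_per_multiget, num_servers):
    """Count requests per server when each multiget is split by key.

    Assumption: keys are sharded with a simple modulo rule (hypothetical).
    """
    hits = Counter()
    for m in range(num_multigets):
        # Hypothetical key ids: each multiget asks for a distinct run of keys.
        keys = range(m * keys_per_multiget, (m + 1) * keys_per_multiget)
        # One request goes to every server that owns at least one of the keys.
        touched = {key % num_servers for key in keys}
        for server in touched:
            hits[server] += 1
    return hits

for servers in (10, 20):
    hits = requests_per_server(1000, 100, servers)
    print(servers, "servers ->", max(hits.values()), "requests on the busiest server")
```

In this model each server still receives one request per multiget whether there are 10 or 20 servers; only the number of keys per request shrinks, which is why replication (rather than more partitions) is what relieves CPU-bound servers.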
  • 15. So, now it “Loadbalances”...
    Notes: ...and with LB, inconsistencies along the time axis can happen (e.g. by reading from alternate out-of-sync backends)
  • 16. ...but then you can have...
    Notes: With the possibility of state sequences that do NOT make sense. 1 of 3 => UNexpected event sequence (3-slide animation which SlideShare won’t handle)
  • 17. Inconsistency across time
    Notes: 2 of 3 => UNexpected event sequence (3-slide animation which SlideShare won’t handle)
  • 18. Inconsistency across time
    Notes: 3 of 3 => UNexpected event sequence (3-slide animation which SlideShare won’t handle)
  • 19. ...now it can pick >1 versions!
    Notes: Why you can have inconsistencies along the time axis.
  • 20. Data Caching Consistency
    - Multi-layer and/or Load Balanced caches
    - Changes to cached data => can cause => Inconsistency Across Time
    - Some candidate solutions: All in Memory with update Push (instead of TTL + Pull); Cache Replication/Synchronization; The “Schrodinger” Consistency Model...
    Notes: Even on a “live” site you can use a short-lived cache. If the user can NOT observe the exact time of each server state change, are any server-to-client delays (due to caching) really there? Moreover, it is often a choice between small update-until-view delays due to caching and really big ones (or the site down) due to overload.
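The "TTL + Pull" versus "update Push" options from slide 20 can be sketched in a few lines of Python. This is a minimal, single-threaded illustration, not any particular product's API: the class name `TTLCache`, its `loader` callback and the injectable `clock` are all assumptions made for the example.

```python
import time

class TTLCache:
    """Short-lived cache: pull on miss or expiry, optional push on update."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # hypothetical callback that reads the backend
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the demo needs no real waiting
        self._entries = {}         # key -> (value, expiry time)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]        # possibly stale, but at most ttl_seconds old
        value = self._loader(key)  # TTL + Pull: reload after expiry
        self._entries[key] = (value, now + self._ttl)
        return value

    def push(self, key, value):
        """Update Push: the writer refreshes the cached entry directly."""
        self._entries[key] = (value, self._clock() + self._ttl)

# Demo with a fake clock and a dict standing in for the backend.
now = [0.0]
backend = {"headline": "v1"}
cache = TTLCache(backend.__getitem__, ttl_seconds=3, clock=lambda: now[0])
print(cache.get("headline"))   # loads v1 from the backend
backend["headline"] = "v2"
print(cache.get("headline"))   # still v1: the change is not yet observable
now[0] = 3.0
print(cache.get("headline"))   # TTL expired, v2 is pulled
```

The staleness window is bounded by the TTL, which is the tuning knob the notes argue about: a few seconds of delay that nobody can observe versus an overloaded backend.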
  • 21. The “Schrodinger” Consistency Model
    - A Schrodinger’s Cache? Data inconsistencies only matter if they can be observed
    - ...but the observer might just be another system, if its work quality is affected
    - Parallelism x Accuracy of State Evaluation
    - The case for the 3” Cache on a “Live Site”
    Notes: Parallelism x Accurate State Evaluation:
    - Cliff Click’s Non Blocking Counter
    - How many loaves of bread exist at a given moment in the stores of a large supermarket network? We might have to live without an at-the-moment state evaluation. Accurate evaluation of a past moment’s state might still be possible.
    Even on a “live” site you can use a short-lived cache. If the user can NOT observe the exact time of each server state change, are any server-to-client delays (due to caching) really there? Moreover, it is often a choice between small update-until-view delays due to caching and really big ones (or the site down) due to overload.
  • 22. Java Caching Solutions
    - http://java-source.net/open-source/cache-solutions
    - EhCache, OSCache, JBoss Cache, Apache JCS, Terracotta, etc.
    - EhCache now at Terracotta
    - Oracle Coherence
    - GigaSpaces XAP Data Grid
    - IBM WebSphere eXtreme Scale
    Notes: * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
  • 23. Other Interesting Caching Solutions
    - Varnish (HTTP Cache)
    - Redis
    - Memcached
    - There are many, many others...
  • 24. Slow and Big Consistency (The Higher Latency - BigData)
  • 25. MapReduce is for embarrassingly parallel problems with some time...
    Notes: Consistency scenarios, starting from the most “sexy” (Web, Petabytes of Data):
    * MapReduce works like vote counting: votes are mapped to voting tables, counted, and “reduced” to stats;
    * MR is appropriate for "embarrassingly parallel" tasks, like indexing the Internet and other huge processing tasks;
    * We should use it whenever possible;
    * There is a lot to be learned about MapReduce:
      - Evaluating and expressing candidate problems;
      - Building and managing its infrastructure;
      - etc.
    * Even MR has coordination needs;
    * Even MR should have SLAs (Service Level Agreements).
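The vote-counting analogy can be written down as a tiny single-process Python sketch. It only shows the shape of the map, shuffle and reduce steps; it is not Hadoop or Google MapReduce code, and the function names and the sample ballots are invented for illustration.

```python
from collections import defaultdict

def map_ballot(ballot):
    """Map step: each ballot becomes a (candidate, 1) pair."""
    yield (ballot, 1)

def shuffle(pairs):
    """Shuffle step: group all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_votes(candidate, counts):
    """Reduce step: collapse one candidate's partial counts into a total."""
    return candidate, sum(counts)

def count_votes(voting_tables):
    # Each voting table could be mapped on a different machine; here we loop.
    pairs = [pair
             for table in voting_tables
             for ballot in table
             for pair in map_ballot(ballot)]
    return dict(reduce_votes(c, v) for c, v in shuffle(pairs).items())

tables = [["ana", "rui", "ana"], ["rui", "ana"], ["eva"]]
print(count_votes(tables))  # {'ana': 3, 'rui': 2, 'eva': 1}
```

Because no voting table needs to know about any other until the reduce step, the map phase parallelizes trivially; the coordination needs mentioned in the notes appear when work has to be assigned, tracked and retried across machines.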
  • 26. Coordination
    - Consensus needed for Map Reduce
    - Consensus is the process of agreeing on one result among a group of participants
    - Consensus is not as easy as it seems
    - Byzantine Generals' Problem + 2 Generals' Problem
    - All solutions are probabilistic
    Notes: * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
    http://en.wikipedia.org/wiki/Two_Generals'_Problem
    http://en.wikipedia.org/wiki/Byzantine_fault_tolerance#Origin
    The thought experiment involves considering how they might go about coming to consensus. In its simplest form one general (referred to as the "first general" below) is known to be the leader, decides on the time of attack, and must communicate this time to the other general. The requirement that causes the "problem" is that both generals must attack at the agreed upon time to succeed. Having a solitary general attack is considered a disastrous failure. The problem is to come up with algorithms that the generals can use, including sending messages and processing received messages, that can allow them to correctly conclude: Yes, we will both attack at the agreed upon time.
    Note that it is quite simple for the generals to come to an agreement on the time to attack. One successful message with a successful acknowledgement suffices for that. The subtlety of the Two Generals' Problem is in the impossibility of designing algorithms for the generals to use to safely agree to the above statement.
    Illustrating the problem: The first general may start by sending a message "Let us attack at 9 o'clock in the morning." However, once dispatched, the first general has no idea whether or not the messenger got through. Any amount of uncertainty may lead the first general to hesitate to attack, since if the second general does not also attack at that time, the city's garrison will repel the advance, leading to the destruction of that attacking general's forces.
    Knowing this, the second general may send a confirmation back to the first: "I received your message and will attack at 9 o'clock." However, what if the confirmation messenger were captured? The second general, knowing that the first will hesitate without the confirmation, may himself hesitate. A solution might seem to be to have the first general send a second confirmation: "I received your confirmation of the planned attack." However, what if that messenger were captured? It quickly becomes evident that no matter how many rounds of confirmation are made there is no way to guarantee the second requirement that both generals agree the message was delivered.
  • 27. MapReduce + Consensus (Google + Hadoop Implementations)
    - Google: coordination by Chubby, using Paxos. Used only at Google;
    - Google BigTable is a Wide Column Store which works on top of GoogleFS. Used only at Google;
    - Hadoop: used at Amazon, Facebook, Rackspace, Twitter, Yahoo!, etc.;
    - Hadoop ZooKeeper implements a Paxos variation and is used at Rackspace, Yahoo!, etc.;
    - Hadoop HBase is a Wide Column Store on top of HDFS, and now uses ZooKeeper. Used at Yahoo!, etc.
    Notes: * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
    Parallel between Google’s internally developed systems and their Hadoop counterparts: http://hadoop.apache.org/ http://labs.google.com/papers/
    The very interesting “coordinators”: http://labs.google.com/papers/chubby.html http://hadoop.apache.org/zookeeper/
    ZooKeeper sure looks like a very interesting and reusable piece of software.
    Curiosity: HBase has been faster since it started using ZooKeeper... is it also because of ZooKeeper??? http://hadoop.apache.org/hbase/
  • 28. Apache Hadoop Projects
    - HBase (Distributed DB)
    - HDFS (Distributed File System)
    - MapReduce
    - Pig (Query Language++)
    - ZooKeeper (Coordination)
    - others (Common, Hive, Chukwa)
    Notes: * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
    http://hadoop.apache.org/
  • 29. Apache ZooKeeper: Distributed Coordination for Distributed Applications
    - Design Goals: Simple, Replicated, Ordered, Fast (very resilient too)
    - Other Properties: Thousands of clients, better for 10:1 reads:writes
    Notes: http://hadoop.apache.org/zookeeper/docs/r3.3.1/zookeeperOver.html
  • 30. Apache ZooKeeper API
    - Simple API: Filesystem-like node tree; Conditional Updates; Watches (notifications); Ephemeral and Sequence Nodes
    - Out of the box + recipes for: Name Service, Configuration, Group Membership, Barriers, Queues, Locks, 2P Commit, Leader Election
    Notes: * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
    http://hadoop.apache.org/zookeeper/docs/r3.3.1/zookeeperOver.html
  • 31. Consistency w/ Interaction (Low Latency - read/write - harder stuff)
  • 32. Two “High”/Sexy reasons for Distributing Data Storage (not just cache)
    - High Performance Data Access (Read / Write)
    - High Availability (HA)
  • 33. Why care about HA?
    - 1.7% of HDDs fail in the 1st year, 8.6% in the 3rd (Google)
    - Unrecoverable RAM errors/year: 1.3% of machines, 0.22% of DIMMs (Google)
    - Router, Rack, PDU, misc. network failures
    - Over 4 nines only through redundancy; the best hardware is never good enough (James Hamilton - MS and Amazon)
    - Hey! This might affect smaller fish like us!!!
    Notes: * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
    Sources: For Google’s numbers check the slideware at: http://videolectures.net/wsdm09_dean_cblirs/
    For the James Hamilton quote: http://mvdirona.com/jrh/TalksAndPapers/JamesRH_Ladis2008.pdf
    Another much-quoted paper with Google’s DRAM failure stats and patterns: http://research.google.com/pubs/pub35162.html
    You can find other HA and Systems related papers from Google and James Hamilton at: http://mvdirona.com/jrh/work/ http://research.google.com/pubs/DistributedSystemsandParallelComputing.html
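To make the "over 4 nines only through redundancy" point concrete, here is a small Python sketch of the usual back-of-the-envelope arithmetic. It assumes replicas fail independently and that one live replica is enough to serve, which is a simplification (shared racks, power and network break independence); the 99% single-machine figure and the function names are illustrative assumptions, not numbers from the cited sources.

```python
import math

def combined_availability(single, replicas):
    """Availability of a service that is up if at least one replica is up.

    Assumes independent failures: all replicas are down with probability
    (1 - single) ** replicas.
    """
    return 1 - (1 - single) ** replicas

def nines(availability):
    """Express availability as a number of nines (0.999 is about 3)."""
    return -math.log10(1 - availability)

for replicas in (1, 2, 3):
    a = combined_availability(0.99, replicas)
    print(f"{replicas} replica(s): {a:.6f} (~{nines(a):.1f} nines)")
```

Under these assumptions, two 99% machines already reach roughly four nines and three reach roughly six, a level that no single piece of hardware offers on its own.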
  • 34. Why care about Latency?
    - Google: Half a second delay caused a 20% drop in traffic (30 results instead of 10, via Marissa Mayer);
    - Amazon found every 100ms of latency costs 1% of sales (via Greg Linden);
    - A broker could lose $4 million in revenues per millisecond if their electronic trading platform is 5 ms behind the competition (via NYT).
    - Hey! This affects anyone online!
    Notes: * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
    You can find all these references through this page (if you follow the links): http://highscalability.com/latency-everywhere-and-it-costs-you-sales-how-crush-it
    Including these: http://glinden.blogspot.com/2006/11/marissa-mayer-at-web-20.html http://perspectives.mvdirona.com/2009/10/31/TheCostOfLatency.aspx http://www.nytimes.com/2009/07/24/business/24trading.html?_r=2&hp
  • 35. Fallacies of Distributed Computing (What can go wrong?)
    1. The network is reliable;
    2. Latency is zero;
    3. Bandwidth is infinite;
    4. The network is secure;
    5. Topology doesn't change;
    6. There is one administrator;
    7. Transport cost is zero;
    8. The network is homogeneous.
    Notes: * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
    Just a reminder of this classic on the HA challenges. A few more details at: http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing
  • 36. Other Distributed Data Contexts (the less sexy daily stuff?)
    - EAI / B2B / Systems Integration
    - Geographic Distribution (e.g.: Health System + Hospitals)
    - Systems with n-tier / SOA Architectures
    - Elasticity on Peaks (still sexy...)
    Notes: The daily jobs of so many IT professionals are much more related to this type of common distributed system than to the sexier kind we talked about before. But these fields too would benefit from learning the lessons and using the technologies we are talking about.
  • 37. The CAP Theorem and Eventual Consistency
  • 38. CAP Theorem History
    - 1999: 1st mention in the “Harvest, Yield and Scalable Tolerant Systems” paper by Eric A. Brewer (Berkeley/Inktomi) and Armando Fox (Stanford/Berkeley)
    - 2000-07-19: Brewer’s CAP Conjecture is part of Brewer’s keynote to the PODC Conference
    - 2002-06: Brewer’s CAP Theorem proof published by Seth Gilbert (MIT) and Nancy Lynch (MIT)
    - 2007-10-02: “Amazon's Dynamo” post by Werner Vogels (Amazon’s CTO), quoting a paper (by him + others)
    - 2007-12-19: “Eventually Consistent” post by Werner Vogels (Amazon’s CTO)
    - 2010-04-23: With the PACELC model, Daniel Abadi reminds us of, and explains on his blog, the obvious importance of latency in BASE vs. ACID and other tuning decisions over designs which revolve around CAP.
    Notes: * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
    The online book “CouchDB: The Definitive Guide” has an interesting introduction to these concepts - the “Eventual Consistency” chapter: http://books.couchdb.org/relax/intro/eventual-consistency
    Really essential and truly amazing is the Dynamo paper by Werner Vogels et al, proof that BASE really works in truly industrial sites, even with stats describing real life behavior: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
    ...and the now famous Eventually Consistent post by Werner Vogels: http://www.allthingsdistributed.com/2007/12/eventually_consistent.html
    If you dislike the introductory (justifiable) drama, just jump to the next part, because this article, by Julian Browne, is the best I found about Brewer’s CAP Theorem and its history: http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
    You should still take a look at:
    * The 1997 “Cluster-Based Scalable Network Services” paper (Brewer et al.) where the BASE vs ACID dilemma is already mentioned: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1.2034&rep=rep1&type=pdf
    * The 1999 “Harvest, Yield and Scalable Tolerant Systems” paper (Brewer et al.) where the CAP conjecture is already mentioned: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3690&rep=rep1&type=pdf
    * The PODC 2000 keynote, by Brewer, that made the CAP conjecture and the BASE concept “popular”: http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
    * You might also see with your own eyes how CAP became a proved Theorem: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf
    * The PACELC model was described by Daniel Abadi on his blog at: http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html
    Definition of ACID: see the notes on slide 40.
  • 39. The CAP Theorem: strong Consistency, high Availability, Partition-resilience: pick at most 2
    Notes: I simply had to put The Diagram, of course. According to http://books.couchdb.org/relax/intro/eventual-consistency ...examples of what goes in the intersections:
    CP => Classic RDBMS: Enforced Consistency
    CA => Paxos / Consensus: Consensus Protocols for HA Consistency
    AP => CouchDB + Eventually Consistent DBs: Eventual Consistency
  • 40. Eventual Consistency for Availability
    BASE                                      | ACID
    (Basically Available, Soft-state,         | (Atomicity, Consistency,
     Eventual consistency)                    |  Isolation, Durability)
    Weak consistency (stale data ok)          | Strong consistency (NO stale data)
    Availability first                        | Isolation
    Best effort                               | Focus on “commit”
    Approximate answers OK                    | Availability?
    Aggressive (optimistic)                   | Conservative (pessimistic)
    Faster / Lower Latency                    | Safer
    Notes: * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
    You can find a variation of this slide in Brewer’s 2000 PODC keynote at: http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
    I skipped these rather controversial bits: ACID: * Nested transactions; * Difficult evolution (e.g. schema). BASE: * Simpler! * Easier evolution.
    I already tried both ways (data stores with and without schema) and I would rather have some schema mechanism for the most complex stuff.
    ACID:
    A)tomicity: Either all of the tasks of a transaction are performed or none of them are.
    C)onsistency: A database remains in a consistent state before the start of the transaction and after the transaction is over (whether successful or not).
    I)solation: Other operations cannot access or see the data in an intermediate state during a transaction.
    D)urability: Once the user has been notified of success, the transaction will persist. This means it will survive system failure, and that the database system has checked the integrity constraints and won't need to abort the transaction.
  • 41. CAP Trade-offs
    - CA without P: Databases providing distributed transactions can only do it while their network is ok;
    - CP without A: While there is a partition, transactions to an ACID database may be blocked until the partition heals (to avoid merge conflicts -> inconsistency);
    - AP without C: Caching provides client-server partition resilience by replicating data, even if the partition prevents verifying whether a replica is fresh.
    - In general, any distributed DB problem can be solved with either: expiration-based caching to get AP; or replicas and majority voting to get CP (the minority is unavailable).
    Notes: * VERY FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
    Concept introduced in the 1999 “Harvest, Yield and Scalable Tolerant Systems” paper (Brewer et al.): http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3690&rep=rep1&type=pdf
    I should probably skip this slide during a live presentation. This is stuff you have to read about.
  • 42. Living with CAP
    - All systems are probabilistic, whether they realize it or not
    - And so are Distributed Transactions (2 Generals Problem)
    - Life is Eventually Consistent
    - Weak CAP Principle: The stronger the guarantees made about any two of C, A and P, the weaker the guarantees that can be made about the third
    - Systems should degrade gracefully, instead of all or nothing (e.g.: displaying data from available partitions)
    Notes: * #1 => Explain why Life is Eventually Consistent
    Steve Yen clearly illustrates the “Life is Eventually Consistent” idea in the slideware (slides 40 to 45) he used for his “No SQL is a Horseless Carriage” talk at NoSQL Oakland 2009: http://dl.dropbox.com/u/2075876/nosql-steve-yen.pdf
    The Weak CAP Principle was introduced in the 1999 “Harvest, Yield and Scalable Tolerant Systems” paper (Brewer et al.): http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3690&rep=rep1&type=pdf
    To understand how hard (ACID) Distributed Transactions are, you have an excellent history of the concepts related to this problem here: http://betathoughts.blogspot.com/2007/06/brief-history-of-consensus-2pc-and.html
    The difficulties of (ACID) Distributed Transactions are well illustrated by the classic Two Generals’ Problem: http://en.wikipedia.org/wiki/Two_Generals'_Problem
    Leslie Lamport et al further explore the problem (and its solutions) in the classic “The Byzantine Generals Problem” paper: http://research.microsoft.com/en-us/um/people/lamport/pubs/byz.pdf
    And if you think that Two Phase Commit is a 100% reliable mechanism... think again: http://www.cs.cornell.edu/courses/cs614/2004sp/papers/Ske81.pdf
    This is just to illustrate the difficulty of the problem. There are more reliable mechanisms, like Three Phase Commit: http://en.wikipedia.org/wiki/Three-phase_commit_protocol http://ei.cs.vt.edu/~cs5204/fall99/distributedDBMS/sreenu/3pc.html
    ...or the so-called Paxos Commit: http://research.microsoft.com/pubs/64636/tr-2003-96.pdf
  • 43. CAP Theorem History (repeated slide; same timeline as slide 38)
    Notes: * VERY FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
    * Repeated slide, repeated notes (to pass focus from CAP to Dynamo and Eventual Consistency): the references are the same as in the notes on slide 38.
  • 44. Amazon’s Dynamo DB. Also a “Wide Column Store”
    Problem                             | Technique
    Partitioning                        | Consistent Hashing
    High Availability for writes        | Vector clocks with reconciliation during reads
    Handling temporary failures         | Sloppy Quorum and hinted handoff (NRW)
    Recovering from permanent failures  | Anti-entropy using Merkle trees
    Membership and failure detection    | Gossip-based membership protocol and failure detection
    (in bold some techniques which could improve many “enterprise” / “every-day” solutions)
    Notes: * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION
    The source here is the already mentioned Dynamo paper: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
    Strict distributed DBs, rather than dealing with the uncertainty of the correctness of an answer, make data unavailable until it is absolutely certain that it is correct.
    At Amazon, SLAs are expressed and measured at the 99.9th percentile of the distribution - the average or median is not good enough to provide a good experience for all. The choice of 99.9% over an even higher percentile has been made based on a cost-benefit analysis which demonstrated a significant increase in cost to improve performance that much. Experiences with Amazon’s production systems have shown that this approach provides a better overall experience compared to those systems that meet SLAs defined based on the mean or median.
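Consistent hashing, the partitioning technique in the table above, is one of those that fits well in "every-day" systems. The Python sketch below is a minimal illustration and not Dynamo's actual implementation: the `HashRing` class, the use of SHA-256, the `node#i` virtual-node naming and the 64 virtual nodes per server are all assumptions chosen for the example.

```python
import bisect
import hashlib

def _hash(value):
    """Map a string to a point on the ring (hash choice is an assumption)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        # Each node is placed at several points to even out the key ranges.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        # The owner is the first ring point clockwise from the key's hash.
        index = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[index][1]

keys = [f"user:{i}" for i in range(1000)]
before = HashRing(["A", "B", "C"])
after = HashRing(["A", "B", "C", "D"])
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
print(f"{len(moved)} of {len(keys)} keys moved after adding node D")
```

The property worth noticing is that every key that changes owner moves to the new node D; keys never shuffle between the old nodes, unlike with `hash(key) % number_of_nodes` sharding.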
  • 45. Tuning Consistency
    - N: number of nodes to replicate each item to;
    - W: number of nodes required for a write to succeed;
    - R: number of nodes required for a read to succeed.
    - W < N: the remaining nodes will receive the write later.
    - R < N: the remaining nodes are ignored.
    Notes: Also based on the already mentioned Dynamo paper: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
    ...but you can find a similar diagram and similar mechanisms described for several (NoSQL) databases that partially clone Dynamo.
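A common way to reason about the N, R and W knobs is the overlap rule: when R + W > N, every read quorum shares at least one node with every write quorum, so a read contacts at least one replica that has the latest acknowledged write. The Python sketch below checks that rule by brute force for small clusters; the function name is made up for the example, and it models strict quorums only (it ignores Dynamo's sloppy quorums and hinted handoff).

```python
from itertools import combinations

def quorums_always_overlap(n, r, w):
    """True if every set of w written nodes intersects every set of r read nodes."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )

for n, r, w in [(3, 2, 2), (3, 1, 3), (3, 1, 1), (5, 2, 3)]:
    overlap = quorums_always_overlap(n, r, w)
    print(f"N={n} R={r} W={w}: R+W>N is {r + w > n}, quorums always overlap: {overlap}")
```

Configurations such as N=3, R=1, W=1 trade that guarantee for lower latency: a read may hit a replica that has not received the write yet, which is exactly the eventual-consistency window being tuned.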
  • 47. Eventually Consistent Systems Banks EAI Integrations Many messaging based (SOA) systems Google Amazon Etc. quinta-feira, 24 de Junho de 2010 47 Unlike what many examples say, Banks often use Eventual Consistency on many (limited value/risk) transactions - or use “large” periodic transaction / compensation fixed windows to process large numbers of larger value movements. So much for those ACID transaction examples...
  • 48. Amazon Dynamo Lessons (according to the paper) Data returned to Shopping Cart 24h profiling: 0.00057% of requests saw 2 versions; 0.00047% of requests saw 3 versions and 0.00009% of requests saw 4 versions. In two years applications have received successful responses (without timing out) for 99.9995% of its requests and no data loss event has occurred to date; With coordination via Gossip protocol it is harder to scale further than a few hundred nodes. (Could be better w/ Chubby / ZK like coordinators?) quinta-feira, 24 de Junho de 2010 48 Also based in the already mentioned Dynamo paper: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html Wikipedia has an article on Gossip Protocols (although, at the data I write this, not as precise as other Wikipedia articles I just quoted): http://en.wikipedia.org/wiki/Gossip_protocol The solution I mention as a possibly more scalable alternative to Gossip Protocols for consensus is the use of Paxos (or derivates) Coordinators, like the proprietary Google’s Chubby or the open source Apache Hadoop Zookeeper. When I first wrote and used (at my SAPO Codebits 2009 talk) these slides, the only support I still had to my (then intuitive) belief that these more directed approaches should be more efficient than Gossip Protocols was the 6.6 part from the Dynamo paper - the paper even mentions the possibility of “introducing hierarchical extensions to Dynamo”. Thanks to my SAPO Codebits talk I met Henrique Moniz, then a Ph.D. student at the University of Lisbon. After I discussed this issue (consensus scalability) with him he pointed me to a couple of interesting papers, one of which immediately captured my attention: * Gossip-based broadcast protocols by João Leitão http://www.gsd.inesc-id.pt/~jleitao/pdf/masterthesis-leitao.pdf This paper offers a more complete description of gossip protocols overhead and, to my surprise, also pointed a few reliability weak spots on known Gossip Protocols. 
The paper goes on to present a more robust and efficient Gossip Protocol called “HyParView”, which uses a more “directed” approach. HyParView sure looks like an interesting solution in terms of robustness for environments with a high incidence of system/network failures, but I still believe that using coordinators will be more efficient in a well controlled data center. Not that using coordinators and making them scale out BIG is exactly trivial, as you can read here: - In the “Vertical Paxos and Primary-Backup Replication” paper, by Leslie Lamport et al., that Henrique Moniz pointed me to: http://research.microsoft.com/pubs/80907/podc09v6.pdf - Or in this interesting article from Cloudera’s blog about the (now upcoming) Observers feature of Apache
  • 49. NoSQL Java being used at... Cassandra: at Facebook, being introduced on Twitter, persistent cache at reddit, replacing MySQL at digg, etc.; Voldemort: at LinkedIn, Gilt Groupe e-commerce site (check Geir Magnusson’s QCon presentation); HBase: Yahoo!, Twitter, Adobe, Ning, Stumbleupon, Meetup, etc. Often used for high volume analytics but also for other high volume stores and M-R tasks. * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION http://www.infoq.com/presentations/Project-Voldemort-at-Gilt-Groupe
  • 50. Is NoSQL better than SQL? The NoSQL vs. SQL database debate is really about ACID vs. BASE databases. An indicator of the advantage of having a query language is given by the use of Hadoop Pig at Twitter (via Kevin Weil). The Pig version is: 5% of the code; 5% of the time; within 50% of the execution time. Anyone who has used c-tree-like DB APIs can say the same. http://squarecog.wordpress.com/2009/11/03/apache-pig-apittsburgh-hadoop-user-group/ http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009
  • 51. Some interesting techniques... (...which we could all be using...)
  • 52. Vector Clocks (Wikipedia image): On each internal event, a process increments its own logical clock in the vector; Before sending a message, it increments its own clock in the vector and sends the whole vector with the message; On receiving a message, it increments its own clock and updates each element of its own vector to max(own, msg). * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION Also based on the already mentioned Dynamo paper: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html ...and on the Wikipedia article about this algorithm: http://en.wikipedia.org/wiki/Vector_clock Vector Clocks (and other similar algorithms) have a predecessor in Lamport timestamps: http://en.wikipedia.org/wiki/Lamport_timestamps Introduced in the classic paper “Time, Clocks, and the Ordering of Events in a Distributed System” by Leslie Lamport: http://en.wikipedia.org/wiki/Lamport_timestamps
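The three Vector Clock rules from slide 52 can be sketched in a few lines of Python. This is only an illustrative sketch, not code from Dynamo, Voldemort or Cassandra: the names `Process`, `happened_before` and `concurrent` are hypothetical, and the vector is assumed to be a plain dict from process id to counter (missing entries count as 0).

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process holding a vector clock (process id -> counter)."""
    pid: str
    clock: dict = field(default_factory=dict)

    def _tick(self):
        # Increment this process's own entry in the vector.
        self.clock[self.pid] = self.clock.get(self.pid, 0) + 1

    def internal_event(self):
        # Rule 1: an internal event increments the own logical clock.
        self._tick()

    def send(self):
        # Rule 2: increment the own clock, then ship a copy of the vector.
        self._tick()
        return dict(self.clock)

    def receive(self, msg_clock):
        # Rule 3: increment the own clock, then take max(own, msg) per entry.
        self._tick()
        for pid, count in msg_clock.items():
            self.clock[pid] = max(self.clock.get(pid, 0), count)


def _less_or_equal(a, b):
    # True if every entry of a is <= the matching entry of b.
    return all(a.get(k, 0) <= b.get(k, 0) for k in set(a) | set(b))


def happened_before(a, b):
    # a causally precedes b: a <= b everywhere and the clocks differ.
    return _less_or_equal(a, b) and not _less_or_equal(b, a)


def concurrent(a, b):
    # Neither clock dominates the other: the events are causally unrelated.
    return not _less_or_equal(a, b) and not _less_or_equal(b, a)


proc_a, proc_b = Process("A"), Process("B")
msg = proc_a.send()                # A's vector travels with the message
proc_b.internal_event()            # B does something on its own
before_receive = dict(proc_b.clock)
proc_b.receive(msg)                # B merges A's vector into its own
```

In this run `msg` and `before_receive` are concurrent (neither process had heard from the other), while `msg` happened before B's clock after the receive. That is the extra ordering knowledge a single timestamp cannot give.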
  • 53. Merkle Tree / Hash Tree (Wikipedia image): Used to verify / compare a set of data blocks and efficiently find where the mismatches are. * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION Also based on the already mentioned Dynamo paper: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html ...and on the Wikipedia article about this algorithm: http://en.wikipedia.org/wiki/Hash_tree
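To make the “efficiently find where the mismatches are” part of slide 53 concrete, here is a minimal Python sketch of a binary hash tree. It is an illustration under assumptions, not the Dynamo or Cassandra implementation: `build_tree` and `diff_blocks` are hypothetical names, SHA-256 is an arbitrary choice of hash, and both replicas are assumed to hold the same power-of-two number of blocks.

```python
import hashlib


def _hash(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(blocks):
    """Return the tree as a list of levels: leaf hashes first, root last.

    Assumes len(blocks) is a power of two (a simplification for this sketch).
    """
    level = [_hash(block) for block in blocks]
    levels = [level]
    while len(level) > 1:
        # Each parent is the hash of its two children concatenated.
        level = [_hash(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def diff_blocks(levels_a, levels_b):
    """Return indices of differing leaf blocks.

    Descends only into subtrees whose hashes differ, so identical
    subtrees are skipped after a single comparison.
    """
    mismatches = []
    stack = [(len(levels_a) - 1, 0)]  # start at the root
    while stack:
        depth, index = stack.pop()
        if levels_a[depth][index] == levels_b[depth][index]:
            continue  # whole subtree is identical
        if depth == 0:
            mismatches.append(index)  # reached a differing leaf
        else:
            stack.append((depth - 1, 2 * index))
            stack.append((depth - 1, 2 * index + 1))
    return sorted(mismatches)


replica_a = build_tree([b"a", b"b", b"c", b"d"])
replica_b = build_tree([b"a", b"b", b"X", b"d"])
changed = diff_blocks(replica_a, replica_b)
```

Two replicas that agree only need to exchange and compare the root hash. When they disagree, the comparison walks down just the mismatching branches, so the data blocks themselves are transferred only for the leaves that actually differ.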
  • 54. ACID and FAST (Lowest Latency - read/write - hardest stuff)
  • 55. Immediately Consistent Systems: Data-grids: Oracle Coherence, Gigaspaces; Used for: Trading, Online Gambling; All Data in RAM; Can do ACID; Very High Speed; Max. Scale-out. * FAST SLIDE - CAN READ BETTER AND CHECK REFERENCES AFTER THE LIVE PRESENTATION Trading and Online Gambling really need to do large volumes of fast ACID transactions and are the big customers of Data Grids. Why Online Gambling needs ACID transactions has everything to do with the type of game and the type of rules/assets (some virtual) it involves. Why Trading really needs ACID is a bit more obvious: you might be able to compensate for an overdraft at a bank (more so for limited values) but you really cannot sell shares you do not have for sale. The performance needs are obvious for both too. For Trading there are even some new reasons, like (again): http://www.nytimes.com/2009/07/24/business/24trading.html?_r=2&hp
  • 56. Tools (Most with source code to pick from)
  • 57. NoSQL Taxonomy by Steve Yen [PG]: key-value-cache: memcached, repcached, coherence, infinispan, eXtreme scale, jboss cache, velocity, terracotta [???]; key-value-store: keyspace [w/Paxos], flare, schema-free, RAMCloud [, Mnesia (Erlang), Chordless]; eventually-consistent key-value-store: dynamo, Voldemort, Dynomite, SubRecord, MotionDb, Dovetaildb; ordered-key-value-store: tokyo tyrant[, BerkeleyDB, JDBM], lightcloud, NMDB, luxio, memcachedb, actord; data-structures server: redis; tuple-store: gigaspaces [?], coord, apache river; object database: ZopeDB, db4o, Shoal; document store: CouchDB [evC, MVCC], MongoDB [evC], Jackrabbit, XML Databases, ThruDB, CloudKit, Persevere, Riak Basho [evC], Scalaris [Erlang, w/Paxos]; wide columnar store: BigTable, Hadoop HBase [w/ Zookeeper], [Amazon Dynamo-evC, ] Cassandra [evC], Hypertable, KAI, OpenNeptune, Qbase, KDI; [graph database: Neo4J, Sones, etc.] From the slideware (slide 54) Steve Yen used for his “No SQL is a Horseless Carriage” talk at NoSQL Oakland 2009: http://dl.dropbox.com/u/2075876/nosql-steve-yen.pdf I do not completely understand or agree with Steve’s criteria but it sure is a possible starting point for building a database/storage taxonomy. The stuff in square brackets is mine. “evC” means Eventually Consistent and “?” just means I have doubts about / don’t understand some specific classification.
  • 58. Some related Solutions I find interesting... Zookeeper (use it: configuration, elasticity, group membership, leader election, notification...); JDBM, BerkeleyDB (careful w/ the OS license) (just use them for very fast persistent storage); Voldemort and Cassandra (use them or pick code for Vector Clocks, Merkle Trees, data compression, communications and other code - nice code bases); Redis (not Java, but usable from Java and a kind of Swiss Army knife); The Riak Basho Bitcask store idea. I used something similar (but not generic) in Java: http://downloads.basho.com/papers/bitcask-intro.pdf; EhCache (the pre-Terracotta version shows how simplistic some stuff can be and still work); Just use RMI and native Java serialization (as EhCache does); JBoss Netty (if you want to do seriously fast network communication); Varnish (an HTTP cache which knows how to use Virtual Memory)
  • 59. Opportunities (...to use these tools)
  • 60. Some cases we could talk about... EAI Integrations (Should use Vector Clocks?); Zookeeper at the “Farm” (Config./Coord.); Live soccer game site; Web sites in general; Log like / timeline systems (forums, healthcare, Twitter, etc.); Analytics; Logistic Planning across EU case; Trading. This is the placeholder slide to exercise the ideas and discuss possible applications of some of the mechanisms which were presented in this talk (had no time at Codebits... still tuning this not-so-easy presentation). Except for the last two scenarios (and the Twitter alternative on the “Log like” one) all others represent quite common types of problems which you can meet without having to work for a Fortune Top 50 company or for a mega web portal / service. Even an “Analytics” case with enough data to justify using MapReduce is common enough. Many large (but not necessarily huge) companies often quit doing more with the data they have just because of the trouble of finding a way to do it (“more”). * “Analytics” (high data volume + easy on consistency as it is) currently seems to be the playground of MapReduce, with Hadoop stuff being used “everywhere”. Look at how many times you can find the words “analytics” or “analysis” (and “MapReduce”) on these “Powered by” Hadoop web pages: http://wiki.apache.org/hadoop/PoweredBy http://wiki.apache.org/hadoop/Hbase/PoweredBy * “Live soccer game...” is a nice problem to discuss short-lived caching and its consistency issues; * “Log like / timeline systems...” are systems where information is mostly “insert only” and most of the effort to keep consistency is related to keeping proper ordering information (with timestamps usually being enough), properly merging the data from different sources and respecting the explicit or implicit SLAs on data synchronizations. 
Obviously, there are different difficulties across the several cases mentioned here, depending on data flow, necessary performance, etc.; * “EAI Integrations” often need better knowledge about ordering and are not as simple as the previous scenario. Due to factors like the use of asynchronous and event driven mechanisms and the possibility of having updates for a given document across multiple steps of a (multiple) process(es), a timestamp is often too limited as ordering information... but is often the most you get. IMO this is a good scenario for using Vector Clocks and company; * “Zookeeper” is a great system even if “just” to configure the simplest web (or webservice) farm, to coordinate the simplest cross farm operations (e.g.: cache related) or just for each server to know which are its peers; * “Logistic Planning” is a complex scenario which demands a mix of solutions. It revolves around a logistics company which transports goods across Europe, with planning offices in different countries. I will probably have to remove it from this slide for any future talk I might give on this topic even if it is the most interesting of them all. So, it does not make much sense to develop it here (maybe a blog post since, to me, this is a >10 year old
  • 61. Q&A