The document summarizes the Paxos Commit algorithm, which uses the Paxos consensus algorithm to achieve fault-tolerant distributed transaction commit. It describes the key participants in Paxos Commit - the resource managers, leader, and acceptors. The base version of Paxos Commit is explained along with optimizations. Performance is analyzed in comparison to two-phase commit. Failure scenarios are discussed, showing how Paxos Commit can handle leader failure or missing resource managers. The relationship between Paxos Commit and two-phase commit is also clarified.
Slide 1
The Paxos Commit Algorithm
Paxos Commit Protocol
Jim Gray and Leslie Lamport
Microsoft Research - 1 January 2004
Review by Ahmed Hamza
Slide 2
Agenda
Paxos Commit Algorithm: Overview
The participating processes
The resource managers
The leader
The acceptors
Paxos Commit Algorithm: the base version
Failure scenarios
Optimizations for Paxos Commit
Performance
Paxos Commit vs. Two-Phase Commit
Using a dynamic set of resource managers
Slide 3
Paxos Commit Algorithm: Overview
Paxos was applied to transaction commit by Leslie Lamport and Jim Gray in "Consensus on Transaction Commit"
One instance of Paxos (the consensus algorithm) is executed for each resource manager, in order to agree upon the value (Prepared/Aborted) proposed by it
A non-synchronous commit algorithm
Fault-tolerant (unlike 2PC)
Intended to be used in systems where failures are fail-stop only, for both processes and the network
Safety is guaranteed (unlike 3PC)
Formally specified and model-checked
Can be optimized to the theoretically best performance
Slide 4
Participants: the resource managers
N resource managers ("RMs") execute the distributed transaction, then each chooses a value (its "locally chosen value", or "LCV"): 'p' for prepared iff it is willing to commit
Every RM tries to get its LCV accepted by a majority set of acceptors ("MS": any subset with a cardinality strictly greater than half of the total)
Each RM is the first proposer in its own instance of Paxos
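The "majority set" notion can be sketched as a small predicate (a hypothetical helper, not from the paper): any subset of the acceptors whose cardinality is strictly greater than half of the total qualifies, which is what guarantees that any two majority sets intersect.

```python
# Hypothetical sketch: a majority set (MS) is any subset of the acceptors
# with cardinality strictly greater than half of the total.
def is_majority_set(subset, acceptors):
    """Return True iff `subset` is a majority set (MS) of `acceptors`."""
    members = set(subset)
    return members <= set(acceptors) and len(members) > len(acceptors) / 2

acceptors = {"AC1", "AC2", "AC3", "AC4", "AC5"}
print(is_majority_set({"AC1", "AC2", "AC3"}, acceptors))  # True  (3 > 2.5)
print(is_majority_set({"AC1", "AC2"}, acceptors))         # False (2 > 2.5 fails)
```

Because two subsets each larger than half of A must overlap, a value accepted by one MS cannot be contradicted by another.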
Participants: the leader
Coordinates the commit algorithm
All the instances of Paxos share the same leader
It is not a single point of failure (unlike the coordinator in 2PC)
Assumed to be always defined (true: many leader-(s)election algorithms exist) and unique (not necessarily true, but unlike 3PC, safety does not rely on it)
Slide 5
Participants: the acceptors
A denotes the set of acceptors
All the instances of Paxos share the same set A of acceptors
2F+1 acceptors are involved in order to achieve tolerance to F failures
We will consider only F+1 acceptors, leaving F more for "spare" purposes (less communication overhead)
Each acceptor keeps track of its own progress in an N×1 vector
The vectors need to be merged into an N×|MS| table, called aState, in order to take the global decision (we want "many" p's)
[Figure: RM1, RM2, and RM3 each send their value ('p', answered with "Ok!") through Paxos to the consensus box, a majority set MS of the acceptors AC1–AC5.]

aState:
               Acc1  Acc2  Acc3  Acc4  Acc5
1st instance    a     a     a     a     a
2nd instance    p     p     p     p     p
3rd instance    p     p     p     p     p
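The merge step above can be sketched in a few lines (a hypothetical illustration; `merge_astate` and the data layout are ours, not the paper's): each acceptor reports an N-entry vector with one slot per RM instance, and stacking the vectors column-wise yields the aState table, whose rows are then inspected per RM.

```python
# Hypothetical sketch: each acceptor keeps an N-entry vector (one slot per
# RM / Paxos instance); merging the vectors column-wise yields aState,
# where aState[rm] maps each acceptor to the value it holds for that RM.
def merge_astate(vectors):
    """vectors: {acceptor_id: [value per RM instance]} -> list of rows."""
    n = len(next(iter(vectors.values())))  # number of Paxos instances (RMs)
    return [{acc: vec[rm] for acc, vec in vectors.items()} for rm in range(n)]

vectors = {
    "Acc1": ["a", "p", "p"],
    "Acc2": ["a", "p", "p"],
    "Acc3": ["a", "p", "p"],
}
aState = merge_astate(vectors)
print(aState[0])  # row for the 1st instance: {'Acc1': 'a', 'Acc2': 'a', 'Acc3': 'a'}
```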
Slide 6
Paxos Commit (base)
[Message-flow diagram. Legend: ⊙ = writes on log; rm ∈ RM, acc ∈ MS; v ∈ {p, a}; N = 5 RMs (RM0–RM4), F = 2, acceptors AC0–AC2, leader L.
1. RM0 sends BeginCommit to the leader (1 message).
2. The leader sends prepare to the other RMs ((N-1) messages).
3. Each RM rm sends p2a⟨rm, 0, v(rm)⟩ for ballot 0 to the acceptors ((N(F+1)-1) messages).
4. The acceptors log the proposals and send their p2b messages to the leader; as an optimization, the leader is not blocked as long as enough acceptors respond.
5. If the global commit condition holds, the leader sends p3 commit, else p3 abort, to the RMs (N messages).
T1 and T2 mark the failure points analyzed on the next slides.]
Slide 7
Global Commit Condition

Global Commit ≡ (∀ rm)(∃ b)(∃ MS)(∀ acc ∈ MS)( p2b⟨acc, rm, b, 'p'⟩ was sent and received )

That is: there must be one and only one row for each RM involved in the commit; in each of those rows there must be at least F+1 entries that have 'p' as a value and refer to the same ballot.
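The condition can be evaluated mechanically over aState. The sketch below is a hypothetical rendering (function name and data layout are ours): each row maps an acceptor to the (ballot, value) pair it reported via p2b, or None, and a row passes only if at least F+1 'p' entries share one ballot.

```python
# Hypothetical check of the global commit condition.
# aState: one row per RM; row: {acceptor: (ballot, value) or None}.
def global_commit(aState, f):
    """Commit iff every RM's row has >= F+1 'p' entries sharing one ballot."""
    for row in aState:
        ballots = {}  # ballot -> number of 'p' entries at that ballot
        for entry in row.values():
            if entry is not None and entry[1] == "p":
                ballots[entry[0]] = ballots.get(entry[0], 0) + 1
        if not any(count >= f + 1 for count in ballots.values()):
            return False
    return True

# F = 1, so two matching 'p' entries per row are needed:
row_ok  = {"Acc1": (0, "p"), "Acc2": (0, "p"), "Acc3": None}
row_bad = {"Acc1": (0, "p"), "Acc2": (1, "p"), "Acc3": None}  # ballots differ
print(global_commit([row_ok], 1))           # True
print(global_commit([row_ok, row_bad], 1))  # False
```

Note that row_bad fails even though it has two 'p' entries: they refer to different ballots, which the condition forbids.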
Slide 8
[T1] What if some RMs do not submit their LCV?
[Figure: for the missing RM j's instance, the leader runs phase 1 of Paxos with ballot bL1 > 0 towards one majority of acceptors; v ∈ {p, a}.]

The leader sends p1a: «Has resource manager j ever proposed a value to you?»
Each acceptor answers with p1b, promising not to answer any ballot bL2 < bL1, in one of two ways:
(1) Acceptor_i: «Yes, in my last session (ballot) b_i with it I accepted its proposal v_i»
(2) Acceptor_i: «No, never»
If at least |MS| acceptors answered, the leader sends p2a: «I am j, I propose V», where:
if case (2) holds for ALL of them, then V = 'a' [FREE];
else V = v(maximum({b_i})) [FORCED]
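The FREE/FORCED rule above can be sketched as a small selection function (a hypothetical illustration; `choose_value` and the reply encoding are ours): each p1b reply is either None (the acceptor accepted nothing for this instance) or a (ballot, value) pair, and the leader may invent 'a' only when every reply is empty.

```python
# Hypothetical sketch of the leader's phase-1 rule for a missing RM's instance.
def choose_value(p1b_replies, quorum_size):
    """p1b_replies: list of (ballot, value) or None (nothing accepted).
    Returns the value the leader must propose in p2a, or None if no quorum."""
    if len(p1b_replies) < quorum_size:
        return None                 # not enough acceptors answered yet
    accepted = [r for r in p1b_replies if r is not None]
    if not accepted:
        return "a"                  # FREE: safe to propose abort for RM j
    return max(accepted)[1]         # FORCED: value at the maximum ballot b_i

print(choose_value([None, None, None], 3))          # 'a' (free to abort)
print(choose_value([None, (2, "p"), (1, "a")], 3))  # 'p' (forced by ballot 2)
```

Taking the value of the highest ballot is what preserves safety: if RM j's 'p' was ever accepted by a majority, at least one acceptor in any quorum still reports it.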
Slide 9
[T2] What if the leader fails?
If the leader fails, some leader-(s)election algorithm is executed. A faulty election (2+ leaders) does not preclude safety (unlike 3PC), but it can impede progress…

[Figure: two leaders L1 and L2 alternately win the acceptors' trust with ever-larger ballots b1 < b2 < b3 < b4; after each timeout T, the previously trusted leader is ignored.]

Non-terminating example: an infinite sequence of p1a-p1b-p2a messages from 2 leaders
Not really likely to happen
It can be avoided (random T?)
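The "random T" remedy can be sketched as randomized backoff (a hypothetical illustration; the function and its parameters are ours, not from the paper): each would-be leader waits a random delay before retrying phase 1 with a larger ballot, so two duelling leaders quickly desynchronize and one of them gets a full round through.

```python
# Hypothetical sketch: randomized backoff before a leader retries phase 1,
# making the endless p1a/p1b/p2a duel between two leaders unlikely.
import random

def next_attempt_delay(base=0.05, spread=0.2):
    """Random wait (seconds) before retrying with a higher ballot."""
    return base + random.uniform(0.0, spread)

delay = next_attempt_delay()
print(0.05 <= delay <= 0.25)  # True
```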
Slide 10
Optimizations for Paxos Commit (1)
Co-location: each acceptor is on the same node as an RM, and the initiating RM is on the same node as the initial leader

[Figure: RM0, the leader L, and AC0 share a node; RM1/AC1 and RM2/AC2 share nodes; BeginCommit, p2a, and p3 travel locally where possible.]

Saves 1 message phase (BeginCommit) and F+2 messages

"Real-time assumptions": RMs can prepare spontaneously. The prepare phase is not needed anymore; RMs just "know" they have to prepare within some amount of time

[Figure: the (N-1) prepare messages from the leader disappear.]

Saves 1 message phase (prepare) and N-1 messages
Slide 11
Optimizations for Paxos Commit (2)
Phase 3 elimination: the acceptors send their phase2b messages (the columns of aState) directly to the RMs, which evaluate the global commit condition themselves

[Figure: instead of p2b messages to the leader followed by p3 to the RMs, each acceptor sends its p2b column straight to every RM.]

Paxos Commit + Phase 3 Elimination = Faster Paxos Commit (FPC)
FPC + Co-location + R.T.A. = Optimal Consensus Algorithm
Slide 12
Performance
                        2PC                  Paxos Commit            Faster Paxos Commit
                   No coloc.  Coloc.     No coloc.    Coloc.      No coloc.     Coloc.
Message delays*        4        3            5          4             4           3
Messages*            3N-1     3N-3      NF+F+3N-1    NF+3N-3      2NF+3N-1   2FN-2F+3N-3
Stable storage
write delays**         2                     2                        2
Stable storage
writes**              N+1                  N+F+1                    N+F+1
*Not assuming RMs' concurrent preparation (slides-like scenario)
**Assuming RMs' concurrent preparation (real-time constraints needed)
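The message-count formulas in the table can be checked numerically; the sketch below encodes them (the dictionary layout is ours) and evaluates all three protocols at F = 0, where the counts coincide.

```python
# Message-count formulas from the table, as functions of N (RMs) and F.
messages = {
    "2PC": {
        "no_coloc": lambda n, f: 3 * n - 1,
        "coloc":    lambda n, f: 3 * n - 3,
    },
    "Paxos Commit": {
        "no_coloc": lambda n, f: n * f + f + 3 * n - 1,
        "coloc":    lambda n, f: n * f + 3 * n - 3,
    },
    "Faster Paxos Commit": {
        "no_coloc": lambda n, f: 2 * n * f + 3 * n - 1,
        "coloc":    lambda n, f: 2 * f * n - 2 * f + 3 * n - 3,
    },
}

n, f = 5, 0  # a single acceptor: F = 0
for proto, variants in messages.items():
    print(proto, variants["coloc"](n, f))  # all three print 12
```

With F = 0 every F-term vanishes and all three columns collapse to the 2PC counts, which motivates the question below.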
If we deploy only one acceptor for Paxos Commit (F=0), its fault tolerance and cost are the same as 2PC's. Are they exactly the same protocol in that case?
Slide 13
Paxos Commit vs. 2PC
Yes, but…

[Figure: side-by-side message flows of 2PC from Lamport and Gray's paper and 2PC from the slides of the course, showing the TM, RM1, the other RMs, and the failure points T1 and T2.]

…they are two slightly different versions of 2PC!
Slide 14
Using a dynamic set of RMs
You add one process, the registrar, that acts just like another resource manager, except for the following:
  v_registrar ∉ {p, a}
  v_registrar = {rm : rm joined the transaction}
RMs can join the transaction until the commit protocol begins.

[Figure: RM2 and RM3 join through the registrar REG; RM1, RM2, and RM3 send 'p' to the acceptors AC1–AC5 (majority set MS) via Paxos, while REG proposes the set RM1;RM2;RM3 in its own instance.]

The global commit condition now holds on the set of resource managers proposed by the registrar and decided in its own instance of Paxos:

Global Commit (DynRM) ≡ (∀ rm ∈ v_registrar)(∃ b)(∃ MS)(∀ acc ∈ MS)( p2b⟨acc, rm, b, 'p'⟩ was sent and received )
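The dynamic condition differs from the static one only in where the set of RMs comes from, which a short sketch makes concrete (hypothetical; the function name and data layout are ours): the check now iterates over the RM set decided in the registrar's instance, so an RM that joined but never got 'p' accepted blocks the commit.

```python
# Hypothetical sketch: with a registrar, the commit condition quantifies over
# the RM set decided in the registrar's own Paxos instance.
# aState: {rm: {acceptor: (ballot, value) or None}}.
def global_commit_dyn(registrar_value, aState, f):
    """Commit iff every rm in registrar_value has >= F+1 'p' entries
    sharing one ballot in its aState row."""
    for rm in registrar_value:
        row = aState.get(rm, {})
        ballots = {}
        for entry in row.values():
            if entry is not None and entry[1] == "p":
                ballots[entry[0]] = ballots.get(entry[0], 0) + 1
        if not any(count >= f + 1 for count in ballots.values()):
            return False
    return True

aState = {"RM1": {"Acc1": (0, "p"), "Acc2": (0, "p")},
          "RM2": {"Acc1": (0, "p"), "Acc2": (0, "p")}}
print(global_commit_dyn({"RM1", "RM2"}, aState, 1))         # True
print(global_commit_dyn({"RM1", "RM2", "RM3"}, aState, 1))  # False: RM3 joined but never prepared
```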