Replication
Transaction Processing Concepts and Techniques
Western Institute for Computer Science at Stanford Univ.
Philip A. Bernstein
Copyright © 1999 Philip A. Bernstein
[Course schedule grid: five daily sessions (9:00, 11:00, 1:30, 3:30, 7:00), Mon-Fri. Topics: Overview, Fault Tolerance, T Models, Party, TP mons, Lock Theory, Lock Techniq, Queues, Workflow, Log, ResMgr, CICS & Inet, Adv TM, Cyberbrick, Files & Buffers, COM+, Corba, Replication, Party, B-tree, Access Paths, Groupware, Benchmark. This lecture: Replication.]
Outline
1. Introduction
2. Primary-Copy Replication
3. Multi-Master Replication
4. Other Approaches
1. Introduction
• Replication - using multiple copies of a server
(called replicas) for better availability and
performance.
• If you’re not careful, replication can lead to
– worse performance - updates must be applied to all
replicas and synchronized
– worse availability - some algorithms require multiple
replicas to be operational for any of them to be used
Replicated Server
• Can replicate servers on a common resource
– Data sharing - DB servers communicate with shared disk
[Diagram: a Client calls Server Replica 1 and Server Replica 2, which share a common Resource.]
• Helps availability in primary-backup scenario
• Requires replica cache coherence mechanism …
• Hence, this helps performance only if
– little conflict between transactions at different servers or
– loose coherence guarantees (e.g. read committed)
Replicated Resource
• To get more improvement in availability,
replicate the resources (too)
• Also increases potential throughput
• This is what’s usually meant by replication
• It’s the scenario we’ll focus on
[Diagram: two Clients, each calling its own Server Replica; each Server Replica has its own Resource replica.]
Synchronous Replication
• Replicas function just like non-replicated servers
• Synchronous replication - transaction updates all
replicas of every item it updates
[Diagram: one transaction - Start; Write(x1); Write(x2); Write(x3); Commit - updates all three replicas x1, x2, x3.]
• Issues
– If you just use transactions, availability suffers. For high availability, the algorithms are complex and expensive.
– Too expensive for most applications, due to heavy
distributed transaction load (2-phase commit)
– Can’t control when updates are applied to replicas
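To make the cost concrete, here is a minimal sketch of the write-all discipline under 2-phase commit; the Replica class and its prepare/commit/abort methods are illustrative assumptions, not any product's interface.

```python
class Replica:
    """Toy replica: holds committed values; may be down."""
    def __init__(self, name, up=True):
        self.name, self.up = name, up
        self.data, self.staged = {}, {}

    def prepare(self, item, value):       # phase 1: stage the write
        if not self.up:
            return False
        self.staged[item] = value
        return True

    def commit(self, item):               # phase 2: install the write
        self.data[item] = self.staged.pop(item)

    def abort(self, item):
        self.staged.pop(item, None)

def synchronous_write(replicas, item, value):
    # Every replica must vote yes; a single unavailable replica aborts all,
    # which is why synchronous replication can hurt availability.
    if not all(r.prepare(item, value) for r in replicas):
        for r in replicas:
            r.abort(item)
        raise RuntimeError("a replica is down; the transaction must abort")
    for r in replicas:
        r.commit(item)
```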
Synchronous Replication - Issues
[Example execution: R1[xA], R2[yD], W2[xB], W1[yC], with copy yD failing and then copy xA failing mid-execution. Not equivalent to a one-copy execution, even if xA and yD never recover!]
• DBMS products support it only in special situations
• Expense of a heavy distributed transaction load
(2-phase commit)
• Can’t control when updates are applied to replica
• Requires heavy-duty synchronization of failures, so the algorithms are complex and expensive, as the example above shows
Asynchronous Replication
• Asynchronous replication
– Each transaction updates one replica.
– Updates are propagated later to other replicas.
• Primary copy: All transactions update the same copy
• Multi-master: Transactions update different copies
– Useful for disconnected operation, partitioned network
• Both approaches ensure that
– Updates propagate to all replicas
– If new updates stop, replicas converge to the same state
• Only primary copy ensures serializability
– Details later …
2. Primary-Copy Replication
• Designate one replica as the primary copy (publisher)
• Transactions may update only the primary copy
• Updates to the primary are sent later to secondary replicas
(subscribers) in the order they were applied to the primary
[Diagram: transactions T1 ... Tn (e.g., T1: Start; ... Write(x1) ...; Commit) all update the Primary Copy x1; the primary's updates stream to the Secondaries x2 ... xm.]
Asynchronous Update Propagation
• Collect updates at primary using triggers or the log
• Triggers (Oracle8, Rdb, SQL Server, DB2, …)
– On every update at primary, a trigger fires to store the
update in the update propagation table.
• Post-process the log to generate update propagations
(SQL Server, DB2, Tandem Non-Stop SQL)
– Off-line, so saves trigger and triggered update overhead,
though R/W log synchronization also has a cost
– Requires administration (e.g., what if the log reader fails?)
• Optionally identify updated fields to compress log
• Most DB systems support this today.
– First in IBM IMS, Tandem NonStop SQL, DEC/Rdb, and ad hoc systems
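As a rough illustration (assumed names, not any product's schema), trigger-style capture amounts to appending each primary update to a propagation table in commit order and draining that table to the secondaries:

```python
from collections import deque

propagation_table = deque()               # stands in for the update propagation table

def on_update(table, key, new_values):
    """Plays the trigger's role: record each primary update, in order."""
    propagation_table.append((table, key, new_values))

def propagate(secondaries):
    """Apply queued updates to every secondary in primary commit order."""
    while propagation_table:
        table, key, new_values = propagation_table.popleft()
        for s in secondaries:             # each secondary modeled as a dict of tables
            s.setdefault(table, {})[key] = new_values
```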
Request Propagation
• Like updates, must ensure requests run in the same
order at primary and replica.
– Log the requests or extend triggers to capture them.
• Could run request synchronously at all replicas,
but commit even if one replica fails.
– Need a recovery procedure for failed replicas.
– Supported by Digital’s RTR.
• Replicate a request rather than the updates produced
by the request (e.g., a stored procedure call).
[Diagram: the request SP1: Write(x); Write(y) is replicated from DB-A to DB-B; each database executes w[x], w[y] against its own copy of x and y.]
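A minimal sketch of the idea, assuming the request is deterministic (the hypothetical sp1 stands in for a stored procedure): ship the call, not the writes, and run it at every replica in the same order.

```python
def sp1(db):
    """Hypothetical stored procedure: Write(x), then Write(y)."""
    db["x"] = db.get("x", 0) + 1
    db["y"] = 2 * db["x"]

def run_everywhere(request, replicas):
    for db in replicas:                   # same request, same order, everywhere
        request(db)

db_a, db_b = {}, {}
run_everywhere(sp1, [db_a, db_b])
assert db_a == db_b                       # deterministic request => identical states
```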
Products
• All major DBMS products have a rich primary-copy replication mechanism
• Differences are in detailed features
– performance
– ease of management
– richness of filtering predicates
– push vs. pull propagation
– stored procedure support
– transports (e.g. Sybase SQL Anywhere can use email!)
– …
• The following summary is necessarily incomplete
SQL Server 7.0
• Publication - a collection of articles to subscribe to
• Article – a horizontal/vertical table slice or stored proc
– Customizable table filter (WHERE clause or stored proc)
– Stored proc may be transaction protected (replicate on
commit). Replicates the requests instead of each update.
• Snapshot replication makes a copy
• Transactional replication maintains the copy by
propagating updates from publisher to subscribers
– Post-processes log to store updates in Distribution DB
– Distribution DB may be separate from the publisher DB
– Updates can be pushed to or pulled from subscriber
– Can customize propagated updates using stored procs
SQL Server 7.0 (cont’d)
• Immediate updating subscriber
– Can update data, synchronizing with publisher via 2PC
– Uses triggers to capture updates (Not For
Replication disables trigger for publisher’s updates)
– Subscriber sends before/after row timestamp. Publisher
checks row didn’t change since subscriber’s current copy
– Publisher then forwards updates to other subscribers
• Access control lists protect publishers from
unauthorized subscribers
• Merge replication - described later
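A minimal sketch of the publisher-side check for an immediate updating subscriber (assumed names and data layout, not SQL Server's actual interface): the subscriber's update carries the row timestamp it read, and the publisher accepts the update only if the row hasn't changed since.

```python
def publisher_apply(rows, key, before_ts, new_row, new_ts):
    """rows: key -> (row, timestamp) at the publisher."""
    _, current_ts = rows[key]
    if current_ts != before_ts:           # row changed since the subscriber's copy
        return False                      # reject; the subscriber's 2PC aborts
    rows[key] = (new_row, new_ts)         # accept, then forward to other subscribers
    return True
```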
Oracle 8i
• Like SQL Server, can replicate updates to table
fragments or stored proc calls at the master copy
• Uses triggers to capture updates in a deferred queue
– Updates are row-oriented, identified by primary key
– Can optimize by sending keys and updated columns only
• Updates are grouped by transaction, and transactions are propagated:
– Either serially in commit order or
– in parallel with some dependent transaction ordering:
each read reads the “commit number” of the data item;
updates are ordered by dependent commit number
• Snapshots are updated in a batch refresh.
– Pushed from master to snapshots, using queue scheduler
DB2
• Very similar feature set to SQL Server and Oracle
• Filtered subscriber (no stored proc replication (?))
– Create snapshot, then update incrementally (push or pull)
• Captures DB2 updates from the DB2 log
– For other systems, captures updates using triggers
• Many table type options:
– Read-only snapshot copy, optionally with timestamp
– Aggregates, with cumulative or incremental values
– Consistent change data, optionally with row versions
– “Replica” tables, for multi-master updating
• Interoperates with many third-party DBMSs
Failure Handling
• Secondary failure - nothing to do till it recovers
– At recovery, apply the updates it missed while down
– Needs to determine which updates it missed,
but this is no different than log-based recovery
– If down for too long, may be faster to get a whole copy
• Primary failure – Products just wait till it recovers
– Can get higher availability by electing a new primary
– A secondary that detects primary’s failure announces a
new election by broadcasting its unique replica identifier
– Other secondaries reply with their replica identifier
– The largest replica identifier wins
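A minimal sketch of that election, with the message layer abstracted away as a list of ids that answered the broadcast:

```python
def elect_new_primary(my_id, responder_ids):
    """responder_ids: replica ids of the secondaries that replied."""
    return max(set(responder_ids) | {my_id})   # largest replica identifier wins

# e.g. replica 3 detects the failure and hears from replicas 1 and 5:
assert elect_new_primary(3, [1, 5]) == 5
```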
Failure Handling (cont’d)
• Primary failure (cont’d)
• All replicas must now check that they have the
same updates from the failed primary
– During the election, each replica reports the id of the
last log record it received from the primary
– The most up-to-date replica sends its latest updates to
(at least) the new primary.
– Could still lose an update that committed at the primary
and wasn’t forwarded before the primary failed …
but solving it requires synchronous replication
(2-phase commit to propagate updates to replicas)
Communications Failures
• Secondaries can’t distinguish a primary failure from
a communication failure that partitions the network.
• If the secondaries elect a new primary and the old
primary is still running, there will be a reconciliation
problem when they’re reunited. This is multi-master.
• To avoid this, one partition must know it’s the only
one that can operate, and can’t communicate with
other partitions to figure this out.
• Could make a static decision.
The partition that has the primary wins.
• Dynamic solutions are based on Majority Consensus
Majority Consensus
• Whenever a set of communicating replicas detects a
replica failure or recovery, they test if they have a
majority (more than half) of the replicas.
• If so, they can elect a primary
• Only one set of replicas can have a majority.
• Doesn’t work well with an even number of copies.
– Useless with 2 copies
• Quorum consensus
– Give a weight to each replica
– The replica set that has a majority of the weight wins
– E.g. 2 replicas, one has weight 1, the other weight 2
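A minimal sketch of the quorum test (majority consensus is the special case where every weight is 1):

```python
def has_quorum(connected, weights):
    """connected: replica ids in this partition; weights: id -> weight."""
    return sum(weights[r] for r in connected) > sum(weights.values()) / 2

weights = {"A": 1, "B": 2}                # the 2-replica example above
assert has_quorum({"B"}, weights)         # B alone holds 2 of 3: a quorum
assert not has_quorum({"A"}, weights)     # A alone holds only 1 of 3
```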
3. Multi-Master Replication
• Some systems must operate when partitioned.
– Requires many updatable copies, not just one primary
– Conflicting updates on different copies are detected late
• Classic example - salesperson’s disconnected laptop
– Laptop holds: Customer table (rarely updated), Orders table (insert mostly), Customer log table (append only)
– So conflicting updates from different salespeople are rare
• Use primary-copy algorithm, with multiple masters
– Each master exchanges updates (“gossips”) with other
replicas when it reconnects to the network
– Conflicting updates require reconciliation (i.e. merging)
• In Lotus Notes, Access, SQL Server, Oracle, …
Example of Conflicting Updates
A Classic Race Condition
[Diagram: initially x=0 at the Primary, Replica 1, and Replica 2. Replica 1 runs T1: X=1 and sends (X=1) to the others; Replica 2 runs T2: X=2 and sends (X=2). Each copy applies (X=1) and (X=2) in whatever order they happen to arrive.]
• Replicas end up in different states
Thomas’ Write Rule
• To ensure replicas end up in the same state
– Tag each data item with a timestamp
– A transaction updates the value and timestamp of data
items (timestamps monotonically increase)
– An update to a replica is applied only if the update’s
timestamp is greater than the data item’s timestamp
– You only need to keep timestamps of data items that
were recently updated (where an older update could still
be floating around the system)
• All multi-master products use some variation of this
• Robert Thomas, ACM TODS, June ’79
– Same article that invented majority consensus
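A minimal sketch of the rule itself; the store layout is an assumption for illustration:

```python
def apply_update(store, key, value, ts):
    """store: key -> (value, timestamp). Apply only if the update is newer."""
    _, current_ts = store.get(key, (None, -1))
    if ts > current_ts:
        store[key] = (value, ts)
        return True
    return False                          # stale update: discard it

replica = {}
apply_update(replica, "x", 2, ts=2)       # T2's update arrives first
apply_update(replica, "x", 1, ts=1)       # T1's older update is rejected
assert replica["x"] == (2, 2)             # every replica converges to this state
```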
Thomas’ Write Rule ⇏ Serializability
[Diagram: initially x=0, TS=0 everywhere. At Replica 1, T1 reads x=0 (TS=0) and writes X=1, TS=1; at Replica 2, T2 reads x=0 (TS=0) and writes X=2, TS=2. Both updates are sent to every copy; by the write rule, X=2 (TS=2) overwrites X=1 (TS=1) wherever it arrives.]
• Replicas end in the same state, but neither T1 nor T2 reads the other’s output, so the execution isn’t serializable.
Multi-Master Performance
• The longer a replica is disconnected and
performing updates, the more likely it will
need reconciliation
• The amount of propagation activity
increases with more replicas
– If each replica is performing updates,
the effect is quadratic
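A rough count makes the quadratic effect concrete: with n replicas each producing u updates per period, every update must eventually reach the other n−1 replicas, so total propagation traffic per period is n(n−1)u, i.e., it grows as n².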
Microsoft Access and SQL Server
• Multi-master replication without a primary
• Each row R of a table has 4 additional columns
– globally unique id (GUID)
– generation number, to determine which updates from
other replicas have been applied
– version number = the number of updates to R
– array of [replica, version number] pairs, identifies the
largest version number it got for R from other replicas
• Uses Thomas’ write rule, based on version numbers
– Access uses replica id to break ties. SQL Server 7 uses
subscriber priority or custom conflict resolution.
Generation Numbers (Access/SQL cont’d)
• Each replica has a current generation number
• A replica updates a row’s generation number
whenever it updates the row
• A replica knows the generation number it had when it last exchanged updates with R′, for every replica R′.
• A replica increments its generation number every
time it exchanges updates with another replica.
• So, when exchanging updates with R′, it should
send all rows with a generation number larger than
what it had when last exchanging updates with R′.
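A minimal sketch of the scheme (the class layout is an assumption): rows are stamped with the replica's current generation, and an exchange with R′ sends every row stamped after the last exchange with R′.

```python
class MergeReplica:
    def __init__(self):
        self.generation = 1
        self.rows = {}                    # key -> (value, generation at last update)
        self.last_exchange = {}           # peer -> my generation at last exchange

    def update(self, key, value):
        self.rows[key] = (value, self.generation)

    def exchange_with(self, peer):
        cutoff = self.last_exchange.get(peer, 0)
        to_send = {k: v for k, (v, g) in self.rows.items() if g > cutoff}
        self.last_exchange[peer] = self.generation
        self.generation += 1              # bump the generation at every exchange
        return to_send
```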
Duplicate Updates (Access/SQL cont’d)
• Some rejected updates are saved for later analysis
• To identify duplicate updates so they can be discarded:
– When applying an update to x, replace x’s array of
[replica, version#] pairs by the update’s array.
– To avoid processing the same update via many paths,
check version number of arriving update against the array
• Consider a rejected update to x at R from R´, where
– [R´, V] describes R´ in x’s array, and
– V´ is the version number sent by R´.
– If V ≥ V´, then R saw R´’s updates
– If V < V´, then R didn’t see R´’s updates, so store it in
the conflict table for later reconciliation
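A minimal sketch of that test, with x's array modeled as a dict from replica id to version number:

```python
conflict_table = []                       # rejected-but-unseen updates, kept for reconciliation

def handle_rejected_update(x_versions, sender, v_sent, update):
    """x_versions: replica id -> version number recorded in x's array."""
    v_known = x_versions.get(sender, 0)
    if v_known >= v_sent:
        return "duplicate"                # sender's updates already seen: discard
    conflict_table.append(update)         # genuinely new update that lost: reconcile later
    return "conflict"
```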
Oracle 8i (revisited)
• Masters replicate entire tables
– Updates are pushed from each master to the other masters and to snapshots (synchronous or asynchronous)
– Updates include before values (you can disable if
conflicts are impossible)
– They recommend masters should always be connected
• Snapshots are updatable ⇒ “multi-master”
– Each propagation transaction updates its queue entry
(instead of update-oriented generation numbers)
• Conflict detection
– Before-value at replica is different than in update
– Uniqueness constraint is violated
– Row with the update’s key doesn’t exist
Oracle 8i Conflict Resolution
• Built-in resolution strategies (defined per column-group)
– Add difference between the old and new values of the originating
site to the destination site
– Average the value of the current site and the originating site
– Min or max of the two values
– The one with min or max timestamp
– The site or value with maximum priority
– Can apply methods in sequence: e.g., by time, then by priority.
• Can call custom procs to log, notify, or resolve the conflict
– Parameters - update’s before/after value and row’s current value
• For a given update, if no built-in or custom conflict
resolution applies, then the entire transaction is logged.
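A minimal sketch (assumed names, not Oracle's interface) of two of the built-in methods; each takes the row's current value at the destination plus the update's before and after values:

```python
def additive(current, before, after):
    """Apply the originating site's delta to the destination's value."""
    return current + (after - before)

def maximum(current, before, after):
    """Keep the larger of the two values."""
    return max(current, after)

# Both sites started at 10; we set x=12 locally, the other site sent 10 -> 15:
assert additive(12, 10, 15) == 17         # both increments survive
assert maximum(12, 10, 15) == 15
```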
4. Other Approaches
• Non-transactional replication using timestamped
updates and variations of Thomas’ write rule
– directory services are managed this way
• Quorum consensus per-transaction
– Read and write a quorum of copies
– Each data item has a version number and timestamp
– Each read chooses a replica with largest version number
– Each write sets the version number one greater than any it has seen
– No special work needed during a failure or recovery
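A minimal sketch over toy replicas, assuming read and write quorum sizes chosen so that r + w exceeds the number of copies (so quorums always intersect):

```python
import random

def quorum_read(replicas, r):
    quorum = random.sample(replicas, r)
    return max(quorum, key=lambda rep: rep["version"])["value"]

def quorum_write(replicas, w, value):
    quorum = random.sample(replicas, w)
    new_version = max(rep["version"] for rep in quorum) + 1
    for rep in quorum:
        rep["value"], rep["version"] = value, new_version

replicas = [{"value": 0, "version": 0} for _ in range(3)]
quorum_write(replicas, w=2, value=42)     # write any majority
assert quorum_read(replicas, r=2) == 42   # any read majority sees the new version
```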
Other Approaches (cont’d)
• Read-one replica, write-all-available replicas
– Requires careful management of failures and recoveries
• E.g., Virtual partition algorithm
– Each node knows the nodes it can communicate with,
called its view
– Transaction T can execute if its home node has a
view including a quorum of T’s readset and writeset
(i.e. the data it can read or write)
– If a node fails or recovers, run a view formation protocol
(much like an election protocol)
– For each data item with a read quorum, read the latest version and update the copies with smaller version numbers
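A minimal sketch of the view test (the copy placement and the majority rule are simplifying assumptions):

```python
def can_execute(view, copies, readset, writeset):
    """view: reachable nodes; copies: item -> set of nodes holding a copy."""
    for item in readset | writeset:
        if len(copies[item] & view) <= len(copies[item]) / 2:
            return False                  # no quorum of this item's copies in view
    return True
```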
Summary
• State-of-the-art products have rich functionality.
– It’s a complicated world for app designers
– Lots of options to choose from
• Most failover stories are weak
– Fine for data warehousing
– For 24×7 TP, need better integration with cluster
node failover