Parallel Recovery in
PostgreSQL
Koichi Suzuki
May 31st, 2023
© Copyright EnterpriseDB Corporation, 2023. All rights reserved.
Agenda
• Recovery Overview
• Possibility of recovery parallelism
• Additional rules and synchronization
• Implementation architecture
• Current achievement
• Remaining work
Recovery Overview
• WAL (Write-Ahead Log) plays an essential role in backup and recovery.
• It is also used in log-shipping replication.
• Recovery is done by a dedicated startup process.
[Diagram: data updates go to the WAL; checkpoints flush buffers to the database data files; the serialized WAL is copied to an archive and shipped to a database in recovery mode, which reads and replays each WAL record in series.]
Single replay thread, why?
● Single-threaded replay guarantees a consistent result
○ WAL order is the order of writes to the database
○ Replaying in the write order is simple and the safest.
Current WAL replay architecture
WAL record type:
● Categorized by “resource manager”
○ One for each type of data object
■ Heap, transaction, Btree, GIN, ….
○ Each resource manager provides a replay function for its records.
● Replay (“redo” in the source) reads the resource manager ID (type RmgrId) of
each WAL record and calls the replay function for that resource.
○ StartupXLOG() in xlog.c (see the sketch below)
● Replay is done by the dedicated startup process.
● Many more things are handled in StartupXLOG():
○ Determining whether the postmaster can begin to accept connections,
○ Feeding commit/abort back to the primary,
○ Recovery target handling,
….(many more)...
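
In outline, the dispatch works like the following sketch (paraphrased and heavily simplified from the PostgreSQL 14 sources; not the verbatim code):

    /* Paraphrased, heavily simplified redo loop from StartupXLOG() */
    record = ReadRecord(xlogreader, LOG, false);
    while (record != NULL)
    {
        /* ...consistency checks, recovery-target handling, etc... */

        /* Dispatch on the record's resource manager ID. */
        RmgrTable[record->xl_rmid].rm_redo(xlogreader);

        record = ReadRecord(xlogreader, LOG, false);
    }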
Current list of RmgrId (as of V14/15)
XLOG
Transaction
Storage
CLOG
Database
Tablespace
MultiXact
RelMap
Standby
Heap2
Heap
Btree
Hash
Gin
Gist
Sequence
SPGist
BRIN
CommitTs
ReplicationOrigin
Generic
LogicalMessage
When a new resource manager (for example, a new access method or index type)
is added, a new RmgrId and its replay function are added. No changes
are needed in the current redo framework.
Structure of WAL (simplified)
[Diagram: WAL as a sequence of fixed-size segments (16 MB by default) starting from initdb, each segment containing many WAL records.]
The LSN is the 64-bit position of a WAL record from the very beginning of the WAL.
● Each WAL segment has its own header.
● A WAL record may span multiple WAL segment files.
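
As an illustration of the arithmetic (PostgreSQL's own macros, such as XLByteToSeg, do this with the configured segment size), a minimal standalone sketch:

    #include <stdint.h>

    #define WAL_SEG_SIZE ((uint64_t) 16 * 1024 * 1024)   /* default: 16 MB */

    /* Number of the segment file that contains this LSN. */
    static inline uint64_t
    lsn_to_segno(uint64_t lsn)
    {
        return lsn / WAL_SEG_SIZE;
    }

    /* Byte offset of the LSN within that segment. */
    static inline uint32_t
    lsn_to_offset(uint64_t lsn)
    {
        return (uint32_t) (lsn % WAL_SEG_SIZE);
    }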
Motivation for parallelism in recovery
Motivation
• Reduce the time to restore from a backup.
• Reduce replication lag in log-shipping replication.
• At the very least, WAL records for different database pages can be replayed in parallel.
Challenge (for example)
• There are many constraints we must follow during WAL replay
○ When replaying COMMIT/ABORT, all the WAL records for the
transaction must have been replayed.
○ Some WAL records have replay information for multiple pages.
Basic idea of constraints in parallel replay
1. For each database page, replay must be done in LSN order.
2. WAL for different pages can be replayed by different worker processes in
parallel.
3. If a WAL record is not associated with a database page, it is replayed in LSN
order by a dedicated worker (the transaction worker).
4. For some WAL records, such as timeline changes, all the preceding WAL records
must have been replayed before the record itself is replayed.
5. If a WAL record is associated with multiple pages, it is replayed by one of
the assigned workers.
○ To maintain the first constraint, we need some synchronization (mentioned
later).
6. Before replaying commit/abort/prepare, it must be confirmed that all the WAL
records for the transaction have been replayed (mentioned later).
Worker processes for parallel recovery
All the workers are the startup process itself or its child processes.
● Reader worker (dedicated)
○ This is the startup process itself. It reads each WAL record and queues it to
the distributor worker.
● Distributor worker (dedicated)
○ Parses each WAL record and determines which worker process should handle it.
○ If a WAL record is associated with block data, it is assigned to one of
the block workers based on the hash value of the block (see the sketch below).
○ Adds each WAL record to the queue of the chosen worker process in LSN order.
○ Could be merged into the reader worker.
● Transaction worker (dedicated)
○ Replays all the WAL records without block information.
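
A minimal sketch of the hash-based assignment mentioned above; the BlockId struct and the FNV-1a hash are illustrative stand-ins, not the patch's actual code:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct BlockId          /* identifies one database page */
    {
        uint32_t spcOid;            /* tablespace */
        uint32_t dbOid;             /* database */
        uint32_t relNumber;         /* relation file */
        uint32_t blockNum;          /* block within the relation */
    } BlockId;

    /* Any mix works, as long as the same page always maps to the same
     * block worker, preserving per-page LSN order (constraint 1). */
    static int
    assign_block_worker(const BlockId *b, int num_block_workers)
    {
        const unsigned char *p = (const unsigned char *) b;
        uint64_t h = 1469598103934665603ULL;          /* FNV-1a basis */

        for (size_t i = 0; i < sizeof(*b); i++)
            h = (h ^ p[i]) * 1099511628211ULL;
        return (int) (h % (uint64_t) num_block_workers);
    }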
Worker processes for parallel recovery (cont.)
● Block worker (multiple, configurable)
○ Replays the WAL records assigned to it.
○ Handling of multi-page WAL will be mentioned later.
● Invalid page worker (dedicated)
○ A target database page may be missing or inconsistent because its table
might be dropped by subsequent WALs.
○ Current recovery registers such a page as an invalid page and corrects the
bookkeeping when replaying the corresponding DELETE/DROP WAL.
○ Now a dedicated process, to reuse the existing code, whose hash table lives
in the memory context belonging to a single process.
○ Receives queue elements from the transaction and block workers.
○ Could be modified to use shared memory; in that case, this worker is not
needed.
Worker processes
[Diagram: the worker processes, configured with three block workers; all are the startup process or its children.]
Shared data among workers
● WAL records and their parsed data are shared among workers through dynamic
shared memory (dsm).
○ Allocated by the reader worker (startup) before the other workers are forked.
○ Handling of multi-page WAL will be mentioned later.
● Other shared data structures are accommodated in the same dsm segment (see the fragment below).
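
A hedged fragment of that idea using PostgreSQL's dsm API; the function name, variable names, and layout comment are placeholders, not the patch's actual code:

    #include "postgres.h"
    #include "storage/dsm.h"

    static dsm_segment *pr_seg;     /* the single shared segment */
    static void        *pr_base;    /* its mapped address */

    /* Called in the startup (reader) process before the workers are
     * forked, so that every child inherits the mapping. */
    static void
    pr_create_shared_area(Size preplay_buffer_size)
    {
        pr_seg  = dsm_create(preplay_buffer_size, 0);
        pr_base = dsm_segment_address(pr_seg);
        /* carve pr_base into: control block, worker status array,
         * queues, and the cyclic WAL buffer */
    }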
Shared data among workers (cont.)
● Buffer for WAL records and their parsed results,
● Status of outstanding transactions,
○ Latest LSN assigned to each block worker,
● Status of each worker,
● Queues to assign WAL to workers,
● History of outstanding WAL waiting for replay,
○ Used to advance the replayed LSN.
● Data is protected using spinlocks.
○ No I/O or wait events should occur while a spinlock is held.
○ A worker should hold at most one spinlock at a time.
■ May have to allow exceptions though…
WAL buffer is allocated in cyclic manner
● The WAL buffer is allocated in a cyclic manner to avoid fragmentation.
● Every WAL record is eventually replayed and its data freed, so cyclic
allocation means we don't have to worry about memory fragmentation.
● A buffer is allocated at the start of the main free area (see the sketch below).
● When a buffer is freed, it is concatenated with the surrounding free area and,
if necessary, the main free area info is updated.
● A free area between allocated buffers is left alone until it is
concatenated to the main free area.
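
A minimal sketch of the allocation side under these rules; it is illustrative only, glossing over the full-versus-empty ambiguity when head meets tail, the recording of the skipped end gap, and the free/merge path described above:

    #include <stddef.h>

    typedef struct CyclicBuf
    {
        char   *area;   /* fixed region, e.g. in dynamic shared memory */
        size_t  size;
        size_t  head;   /* start of the main free area (next allocation) */
        size_t  tail;   /* end of the main free area (oldest live buffer) */
    } CyclicBuf;

    /* Returns NULL when the caller must wait for workers to replay WAL
     * and free buffers, then retry. */
    static void *
    cyclic_alloc(CyclicBuf *cb, size_t len)
    {
        size_t start = cb->head;

        if (start >= cb->tail)              /* free space wraps around */
        {
            if (cb->size - start >= len)    /* fits before the end */
            {
                cb->head = start + len;
                return cb->area + start;
            }
            start = 0;                      /* wrap to the beginning */
        }
        if (start + len <= cb->tail)        /* fits before the tail */
        {
            cb->head = start + len;
            return cb->area + start;
        }
        return NULL;                        /* insufficient main free area */
    }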
WAL buffer is allocated in cyclic manner (cont.)
[Diagram: snapshots of the buffer over time; the allocated area grows cyclically from the head of the main free area toward its tail, wrapping around at the end of the region.]
● If sufficient main free area is not available, wait until other workers replay WALs and free their buffers.
● If all workers are waiting for buffer space, replay aborts and suggests enlarging the shared buffer (configurable through a GUC).
● With three block workers and ten queues per worker, 4 MB of buffer looks sufficient. More may be needed with more block workers and deeper queues.
Transaction status
● Composed of the latest WAL LSN assigned to each block worker, kept for each
transaction.
● When the worker replays that LSN, the entry is cleared.
● When the transaction worker replays commit/abort/prepare, it checks that all
the WALs for the transaction have been replayed (see the sketch below).
● If not, it waits until the relevant workers have replayed all their assigned WALs.
[Diagram: for each outstanding transaction, an array of the latest assigned LSN per block worker (worker 1 … worker n); the transaction worker waits until all the workers have replayed the corresponding WAL before commit/abort/prepare.]
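
A sketch of this bookkeeping in C; the struct layout, the fixed worker bound, and the names are illustrative, not the patch's actual structures:

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_BLOCK_WORKERS 8     /* illustrative fixed bound */

    typedef struct TxnReplayState
    {
        uint32_t xid;               /* the transaction */
        /* Latest WAL LSN assigned to each block worker for this
         * transaction; 0 means nothing outstanding for that worker. */
        uint64_t last_assigned[MAX_BLOCK_WORKERS];
    } TxnReplayState;

    /* commit/abort/prepare may be replayed only when every block worker
     * has replayed its outstanding WAL for this transaction. */
    static bool
    txn_ready_to_finish(const TxnReplayState *t,
                        const uint64_t *worker_replayed_lsn, int nworkers)
    {
        for (int i = 0; i < nworkers; i++)
        {
            if (t->last_assigned[i] != 0 &&
                worker_replayed_lsn[i] < t->last_assigned[i])
                return false;       /* wait for worker i, then re-check */
        }
        return true;
    }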
Worker status
● Latest assigned WAL LSN,
● Latest replayed WAL LSN,
● Flags to indicate
○ whether the worker is waiting for sync from other workers,
○ whether the worker failed,
○ whether the worker terminated.
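
The per-worker record might look like the following illustrative struct (field names invented, not the patch's definition):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct WorkerStatus
    {
        uint64_t last_assigned_lsn;   /* latest WAL LSN queued to the worker */
        uint64_t last_replayed_lsn;   /* latest WAL LSN the worker replayed */
        bool     waiting_for_sync;    /* blocked on sync from other workers */
        bool     failed;
        bool     terminated;
    } WorkerStatus;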
WAL history
● The recovered point can be advanced only when all the preceding WALs have
been replayed.
● If any earlier WAL has not been replayed, advancing the recovered point must
wait until that WAL is replayed.
The WAL history tracks this status.
[Diagram: a sequence of WAL records, some replayed and some not; the recovered point advances only across the contiguous prefix of replayed records.]
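
A minimal sketch of advancing the recovered point over such a history (the array layout and names are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct WalHistEntry
    {
        uint64_t end_lsn;       /* end LSN of the WAL record */
        bool     replayed;
    } WalHistEntry;

    /* Entries are kept in LSN order; the recovered point may only move
     * across a contiguous prefix of replayed entries. */
    static uint64_t
    advance_recovered_point(const WalHistEntry *hist, int n,
                            uint64_t recovered)
    {
        for (int i = 0; i < n; i++)
        {
            if (!hist[i].replayed)
                break;              /* must wait for this record */
            recovered = hist[i].end_lsn;
        }
        return recovered;
    }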
Synchronization among workers
• A dedicated queue element is sent to ask for a sync message.
• Sync messages are sent via sockets (unix domain datagrams). Each worker has its
own socket on which it receives syncs (see the sketch below).
• For multi-page WAL, another mechanism is implemented for performance.
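
A hedged sketch of such a channel using unix domain datagram sockets; the socket paths, the one-byte message, and the omission of error handling are simplifications, not the patch's code:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    /* Receiver side: each worker binds its own datagram socket. */
    static int
    open_sync_socket(const char *path)
    {
        struct sockaddr_un a = { .sun_family = AF_UNIX };
        int fd = socket(AF_UNIX, SOCK_DGRAM, 0);

        strncpy(a.sun_path, path, sizeof(a.sun_path) - 1);
        unlink(path);                       /* remove any stale socket */
        bind(fd, (struct sockaddr *) &a, sizeof(a));
        return fd;                          /* block in recv() for syncs */
    }

    /* Sender side: write a one-byte sync to the peer's socket. */
    static void
    send_sync(int fd, const char *peer_path)
    {
        struct sockaddr_un a = { .sun_family = AF_UNIX };
        char msg = 'S';

        strncpy(a.sun_path, peer_path, sizeof(a.sun_path) - 1);
        sendto(fd, &msg, 1, 0, (struct sockaddr *) &a, sizeof(a));
    }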
Waiting for all the preceding WAL replay
● The distributor worker sends a queue element asking for synchronization.
● The distributor then reads the sync data from the target worker's socket.
● When the target worker dequeues the synchronization request, it writes the
sync data to its socket.
Synchronization for multi-page WAL
● A multi-page WAL record is assigned to all the block workers chosen by the
hash values of its pages.
● The WAL record carries an array of the assigned workers and a count of
remaining workers (initialized to the number of assigned block workers).
● When a block worker receives the WAL record, it checks whether it is the last
remaining worker.
○ If so, the worker replays the WAL and then writes a sync to all the other
assigned workers.
○ If not, the worker waits for a sync from that last worker.
● This guarantees that WAL replay for each page happens in WAL LSN order (see
the sketch below).
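
The "last worker replays" rule can be sketched with an atomic counter (C11 atomics here; the patch's own mechanism may differ in detail):

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct MultiPageWal
    {
        atomic_int remaining;   /* set by the distributor to the number
                                 * of assigned block workers */
        /* ... parsed record, array of assigned workers ... */
    } MultiPageWal;

    /* Returns true if the calling block worker is the last one to pick
     * this WAL up; that worker replays it and then syncs the others.
     * Everyone else waits for the sync from that worker. */
    static bool
    i_am_last_worker(MultiPageWal *w)
    {
        return atomic_fetch_sub(&w->remaining, 1) == 1;
    }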
Synchronization for multi-page WAL (cont.)
[Diagram: a multi-page WAL record is queued to all assigned block workers by the distributor worker; workers that pick it up early wait, while the last block worker to pick it up replays the record and syncs the others.]
Current status of the work and
achievement
Current Status
● Code is almost done.
● It still contains a lot of debug code.
● Runs under gdb control.
● Parallelism appears to work without issue.
● Existing redo functions run without modification.
Affected major source files
(A = added, M = modified)

A  src/backend/access/transam/parallel_replay.c
   src/include/access/parallel_replay.h
   Outline: Main parallel recovery logic; each worker's loop.

M  src/backend/access/transam/xlog.c
   src/backend/access/transam/xlogreader.c
   src/backend/access/transam/xlogutils.c
   src/include/access/xlog.h
   src/include/access/xlogreader.h
   src/include/access/xlogutils.h
   Outline: Read and parse WAL into the shared buffer; pass WAL to the distributor worker.

M  src/backend/postmaster/postmaster.c
   src/backend/postmaster/startup.c
   src/backend/storage/lmgr/proc.c
   src/backend/utils/activity/backend_status.c
   src/backend/utils/adt/pgstatfuncs.c
   src/include/miscadmin.h
   src/include/postmaster/postmaster.h
   src/include/storage/proc.h
   src/include/utils/backend_status.h
   Outline: Fork worker processes.
Affected major source files (cont.)
M  src/backend/utils/misc/guc.c
   Outline: Additional GUC parameters.

M  src/backend/utils/error/elog.c
   src/backend/utils/activity/backend_status.c
   src/backend/utils/misc/ps_status.c
   Outline: Add the workers to logging and status reporting.
Additional major GUC parameters
parallel_replay (bool, default off): Enable parallel recovery.
num_preplay_workers (int, default 5): Number of parallel replay workers,
including the reader, distributor, transaction, and invalid page workers.
num_preplay_queues (int, default 128): Number of queue elements used to pass
WAL to the various workers.
num_preplay_max_txn (int, default max_connections): Number of transactions that
can be outstanding in parallel during recovery.
preplay_buffers (int, default calculated): Shared buffer size, allocated as
dynamic shared memory.
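
For illustration, a hypothetical postgresql.conf fragment for a patched build (the values are invented, and these GUCs do not exist in stock PostgreSQL):

    # Hypothetical settings; valid only with this patch applied.
    parallel_replay = on
    num_preplay_workers = 7     # assumed: 4 dedicated workers + 3 block workers
    num_preplay_queues = 128
    num_preplay_max_txn = 100   # default: max_connections
    # preplay_buffers left unset so a suitable size is calculated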
Remaining work
Remaining work
● Clean up test code.
● Clean up structures, which still contain unused data from past
trial-and-error attempts.
● Validate consistent recovery point determination.
● Determine the end of WAL read.
● Port to version 16 or later.
● Run without gdb and measure the performance gain:
○ Archive recovery,
○ Log-shipping replication.
Issues for further
discussion/development
Issues for discussion
● Synchronization:
○ Now uses unix domain sockets directly. May need a wrapper for portability.
● Queueing:
○ Now uses POSIX mqueue directly. It is very fast and simple, and supports
many-to-one queueing.
○ Is it portable to other environments such as Windows?
● Bringing it into PG core:
○ Large amount of code: around 7k lines with debug code,
○ maybe around 4k lines without debug code.
○ Need an expert's help to divide this into manageable smaller pieces for the
commitfest.
Resources
Source repo
https://github.com/koichi-szk/postgres.git
Branch: parallel_replay_14_6
Test tools/environment
https://github.com/koichi-szk/parallel_recovery_test.git
Branch: main
PostgreSQL Wiki:
https://wiki.postgresql.org/wiki/Parallel_Recovery
These are all public now. Please let me know if you are interested.
Thank you very much
Koichi Suzuki
koichi.suzuki@enterprisedb.com
koichi.dbms@gmail.com
Challenge in the test
● The startup process starts before the postmaster begins to accept connections.
● There is no direct means to pause it and report its PID so gdb can attach.
● Each worker writes this information to a debug file, then waits until gdb is
attached and a signal file is touched.
[Screenshot: the debug-file output is pasted to terminals; each terminal runs gdb for one worker.]