3. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
What people say about NFS
4. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
What people say about NFS
IT’S FAST!
CC0 1.0
5. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
PostgreSQL & NFS
Myths & Truths
Jonathan Battiato
www.2ndquadrant.com
6. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
Jonathan Battiato
Linux SysAdmin & DBA
Jonathan Battiato
Linux SysAdmin & DBA
jonny_lee_84
jonathan.battiato@2ndquadrant.it
jonny-lee31
~$ whoami
7. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
Thanks to:
Giuseppe Broccolo, PhD
PostgreSQL & PostGIS consultant
@giubro gbroccolo7
giuseppe.broccolo@2ndquadrant.it
gbroccolo gemini__81
8. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
#PGNFS
#PostgreSQL
#NFS
TAGs
@jonny_lee_84
@2ndquadrant_it
@PGDayIT
9. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
Network File System
general characteristics
Part 1
10. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
Network File System (Sun Microsystem®
, 1984)
●
A protocol for distributed file system – v2, v3, v4
●
servers export the file system, clients mounts the export locally
●
Many clients, one server
●
High performances through fast network
LAN
11. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
NFS v3 vs. NFS v2
●
Strenght
– TCP (UDP is still a choice)
– Asynchronous communication (caching)
– A daemon manages file locking and connections
– Larger block size than 8KB (up to 64KB - depends on Kernel)
●
Weakness
– Stateless (NFS)
– Plain text data transmissions
– Host authentication only
– Dynamic port assignment
12. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
NFS v4 vs. NFS v3
●
Partial solutions
– Single port (2049)
easy to apply firewall rules to filter NFS traffic
– Authentication and locking deamons included in the protocol
– operations are always stateful
– UDP not allowed anymore
NFS v4 is the default on RHEL 6
13. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
Client-Server Optimization
●
Block size (wsize, rsize – nfs protocol level)
Requests are organised in data blocks exchanged between client & server
●
NFS relies on TCP or UDP protocols (transport level)
UDP is statless, TCP is stateful
TCP choose automatically the packet size
●
Packet size & MTU (network level)
Packets size can be optimised to transfer blocks within the network: jumbo frames
CHOOSE WISELY!
14. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
Main concerns about synchronisation
●
NO Time synchronisation
Use NTP!
●
File Locking
v2 does not support file locking
v3 uses daemon for locking, performances drop when in use
●
Delayed write cache
Caching writes could cause data loss
Without write cache performances drop
●
Read (metadata) cache
NFS server may show not-updated attributes, any program that relies on file attributes may not work
15. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
Part 2
NFS & PostgreSQL
16. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
NFS & PostgreSQL
www.postgresql.org/docs/9.6/static/creating-cluster.html#CREATING-CLUSTER-NFS
●
NFS disks are completely transparent for PostgreSQL
●
If NFS client/server implementation does not provide
standard file system semantics, this can cause
reliability problems
●
PostgreSQL advices
– NFS mount options
●
avoid soft-mounting: hard
– General mount options
●
avoid asynchronous writes: fsync
LAN
17. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
hard-mount option
NFS calls must be retried indefinitely
both data and WAL entries sequentiality may not be preserved
write cache
WALs have to be flushed as a database action is committed
many processes are involved during WALs flush on disk
→ several NFS clients
→ WAL entries sequentiality may not be preserved
attribute cache
many processes are involved during WALs flush on disk
→ several NFS clients
→ files attributes may not have consistent views
Sources of corruptions
18. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
Synchronous writes
sync vs. fsync=on:
– NFS v2:
●
if sync is specified, the server will not complete a request until both data/metadata are
written to the disk
– NFS v3 / v4:
●
if sync is specified, the server will complete a request returning the status of the write:
– NFS_FILE_SYNC, NFS_DATA_SYNC, NFS_UNSTABLE
●
data is effectively forced to be flushed on disk once a sync method system call is issued
– be careful to set fsync=on
19. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
Reliability vs. Performance
●
NFS exports must be mounted with safe options
●
Are these option fine for a database?
– is the performance deeply impacted?
– is data safely guaranteed in case of crash?
20. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
Part 3
NFS & PostgreSQL
performance
&
reliability benchmarks
21. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
The customer
NetApp
FAS 8080 Full Flash
Clustered Mode
physical host
storage
2 CPUs x 8GB RAM
RHEL 6
Kernel v. 2.6.32
PostgreSQL 9.5.3
Ethernet
10Gb/s
No routers
22. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
Reliability tests
●
try different crash scenarios under high concurrency load, check if
the instance recovers properly, then execute a VACUUM FULL:
– kill -9 to postmaster
– forced reboot through kernel SysReq:
●
echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger
– power off of the VM from the VMWare vSphere® remote panel
– kill the TCP/IP connections through tcpkill
[root@pgsql] ~# tcpkill host X.X.X.X
tcpkill: listening on eth0 [host X.X.X.X]
23. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
●
try different crash scenarios under high concurrency load, check if
the instance recovers properly, then execute a VACUUM FULL:
– kill -9 to postmaster
– forced reboot through kernel SysReq:
●
echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger
– power off of the VM from the VMWare vSphere® remote panel
– kill the TCP/IP connections through tcpkill
[root@pgsql] ~# tcpkill host X.X.X.X
tcpkill: listening on eth0 [host X.X.X.X]
Reliability tests
24. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
●
try different crash scenarios under high concurrency load, check if
the instance recovers properly, then execute a VACUUM FULL:
– kill -9 to postmaster
– forced reboot through kernel SysReq:
●
echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger
– power off of the VM from the VMWare vSphere® remote panel
– kill the TCP/IP connections through tcpkill
[root@pgsql] ~# tcpkill host X.X.X.X
tcpkill: listening on eth0 [host X.X.X.X]
Reliability tests
25. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
●
try different crash scenarios under high concurrency load, check if
the instance recovers properly, then execute a VACUUM FULL:
– kill -9 to postmaster
– forced reboot through kernel SysReq:
●
echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger
– power off of the VM from the VMWare vSphere® remote panel
– kill the TCP/IP connections through tcpkill
[root@pgsql] ~# tcpkill host X.X.X.X
tcpkill: listening on eth0 [host X.X.X.X]
Reliability tests
26. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
●
try different crash scenarios under high concurrency load, check if
the instance recovers properly, then execute a VACUUM FULL:
– kill -9 to postmaster
– forced reboot through kernel SysReq:
●
echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger
– power off of the VM from the VMWare vSphere® remote panel
– kill the TCP/IP connections through tcpkill
[root@pgsql] ~# tcpkill host X.X.X.X
tcpkill: listening on eth0 [host X.X.X.X]
Reliability tests
27. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
T. Vondra – test the persistence of recycled WALs in ext4
– update attributes of recycled WALs flush them on disk through→ fdatasync
– fdatasync does not force the flush of metadata the update may get lost after a crash→
– logged changes are contained in file “in the future” data loss!→
github.com/2ndQuadrant/ext4-data-loss
●
INSERT/UPDATE new records in parallel and synchronously on the db and on a file
●
simulate a crash compare db & file contents after the crash recovery→
Reliability tests
28. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
Check NFS behaviour for recycled WALs (file attributes caching)
– noac
●
– ac
●
Reliability tests
29. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
Check NFS behaviour for recycled WALs (file attributes caching)
– noac
●
several tests, no data loss!
– ac
●
Reliability tests
30. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
Check NFS behaviour for recycled WALs (file attributes caching)
– noac
●
several tests, no data loss!
– ac
●
several tests, no data loss!
Reliability tests
31. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
Check NFS behaviour for recycled WALs (file attributes caching)
– noac
●
several tests, no data loss!
– ac
●
several tests, no data loss!
Reliability tests
32. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
All tests passed, but this does not ensure that it is totally safe!
Reliability tests
33. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
Enhance reliability for PostgreSQL DBs
●
allow page checksums (9.3+): initdb –data-checksums ...
– execute CRC32 calculation for each 8KB data block
●
a checksum failure means that the data block is corrupted
– force wal_log_hint=true
●
write the entire 8KB page to the WAL, even for hint bits modification
– take into account the impact to the performance:
●
R: checksum extra-calculation every 8KB
●
W: increase the amount of information logged into WALs
34. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
Enhance performance for PostgreSQL DBs
●
sync’ed writes are slow, even if NFS caching is enabled
●
if possible, consider asynchronous commit:
– let the server return success as soon as the transaction is logically completed
●
synchronous_commit=off – it can be set per user/session
●
WAL entries will be flushed in a second moment,
but not later than 3x wal_writer_delay
●
in case of crashes, the DB can be recovered in a consistent state,
but there could be a window of data loss
35. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
file system I/O speed:
– reads: 700MB/s
– writes: 700MB/s
synchronous_commit=on:
– INSERT:
●
noac ~ ac (single & multi client)
– SELECT:
●
noac ~ 16x ac
DB commit rate and SELECT rate:
synchronous_commit=off:
– INSERT:
●
noac~ 8x ac (single & multi client)
– SELECT:
●
noac ~ 16x ac
36. Jonathan Battiato 2ndQuadrant Italia
@2ndquadrant_it @jonny_lee_84
PGDay.IT 2016 - 10th
edition
Prato Dec 13th
2016
Conclusions
●
NFS is not natively thought for reliability purposes
– the protocol is thought to enhance the performance
– NFS v4 is preferable
●
PostgreSQL allows to adopt many countermeasures
– it is at least able to promptly detect data corruptions
●
PostgreSQL can be used with NFS
– ready to accept minimal data loss