1. Big Bad PostgreSQL: A Case Study
Moving a “large,” “complicated,” and mission-critical data warehouse from Oracle to PostgreSQL for cost control.
Monday, August 2, 2010
2. Who am I? @postwait on twitter
Author of “Scalable Internet Architectures”
Pearson, ISBN: 067232699X
CEO of OmniTI
We build scalable and secure web applications
I am an Engineer
A practitioner of academic computing.
IEEE member and Senior ACM member.
On the Editorial Board of ACM’s Queue magazine.
I work on/with a lot of Open Source software:
Apache, perl, Linux, Solaris, PostgreSQL,
Varnish, Spread, Reconnoiter, etc.
I have experience.
I’ve had the unique opportunity to watch a great many catastrophes.
I enjoy immersing myself in the pathology of architecture failures.
3. Overall Architecture
[Flattened architecture diagram; the recoverable components:]
• Oracle 8i OLTP instance (0.5 TB + 0.25 TB, Hitachi JBOD): drives the site
• Oracle 8i OLTP warm backup / warm spare (0.5 TB and 1.5 TB, Hitachi MTI)
• Oracle 8i log import and processing (0.75 TB, JBOD)
• MySQL 4.1 log importer (1.2 TB, IDE RAID)
• Data warehouse (1.2 TB, SATA RAID)
• Data exporter: bulk selects / data exports
4. Database Situation
• The problems:
• The database is growing.
• The OLTP and ODS/warehouse are too slow.
• A lot of application code against the OLTP system.
• Minimal application code against the ODS system.
• Oracle:
• Licensed per processor.
• Really, really, really expensive on a large scale.
• PostgreSQL:
• No licensing costs.
• Good support for complex queries.
5. Database Choices
• Must keep Oracle on OLTP
• Complex, Oracle-specific web application.
• Need more processors.
• ODS: Oracle not required.
• Complex queries from limited sources.
• Needs more space and power.
• Result:
• Move ODS Oracle licenses to OLTP
• Run PostgreSQL on ODS
6. PostgreSQL gotchas
• For an OLTP system that does thousands of
updates per second, vacuuming is a hassle.
• No upgrades?!
• pg_migrator is coming along.
• Less community experience with large
databases.
• Replication features less evolved
• though PostgreSQL 9.0 ups the ante
7. PostgreSQL ♥ ODS
• Mostly inserts.
• Updates/Deletes controlled, not real-time.
• pl/perl (leverage DBI/DBD for remote
database connectivity).
• Monster queries.
• Extensible.
8. Choosing Linux
• Popular, liked, good community support.
• Chronic problems:
• kernel panics
• filesystems remounting read-only
• filesystems don’t support snapshots
• LVM is clunky on enterprise storage
• 20 outages in 4 months
9. Choosing Solaris 10
• Switched to Solaris 10
• No crashes, better system-level tools.
• prstat, iostat, vmstat, smf, fault-management.
• ZFS
• snapshots (persistent), BLI backups.
• Excellent support for enterprise storage.
• DTrace.
• Free (too).
10. Oracle features we need
• Partitioning
• Statistics and Aggregations
• rank over partition, lead, lag, etc.
• Large selects (100GB)
• Autonomous transactions
• Replication from Oracle (to Oracle)
11. Partitioning
For large data sets:
pgods=# select count(1) from ods.ods_tblpick_super;
count
------------
3247286017
(1 row)
• Next biggest tables: billions
• Allows us to cluster data over specific ranges
(by date in our case)
• Simple, cheap archiving and removal of data.
• Can put ranges used less often in different
tablespaces (slower, cheaper storage)
12. Partitioning PostgreSQL style
• PostgreSQL doesn’t support partitioning...
• It supports inheritance... (what’s this?)
• some crazy object-relational paradigm.
• We can use it to implement partitioning:
• One master table with no rows.
• Child tables that have our partition constraints.
• Rules on the master table for insert/update/delete.
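The recipe above can be sketched in SQL for one date-ranged child; the table and column names here are invented for illustration, following the partitioning pattern in the 8.x documentation:

```
-- Master table: holds no rows itself.
CREATE TABLE ods.maillog (
    userid    integer,
    senddate  date
);

-- One child per date range; the CHECK constraint is what
-- lets the planner exclude this partition.
CREATE TABLE ods.maillog_2010_07 (
    CHECK (senddate >= DATE '2010-07-01'
       AND senddate <  DATE '2010-08-01')
) INHERITS (ods.maillog);

-- Route inserts on the master to the right child (one rule
-- per partition).
CREATE RULE maillog_insert_2010_07 AS
    ON INSERT TO ods.maillog
    WHERE (senddate >= DATE '2010-07-01'
       AND senddate <  DATE '2010-08-01')
    DO INSTEAD
    INSERT INTO ods.maillog_2010_07 VALUES (NEW.*);
```

With constraint_exclusion enabled, a query filtered on senddate only touches the children whose CHECK constraints can match.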
13. Partitioning PostgreSQL realized
• Cheaply add new empty partitions
• Cheaply remove old partitions
• Migrate less-often-accessed partitions to slower
storage
• Different index strategies per partition
• PostgreSQL >8.1 supports constraint checking on
inherited tables.
• smarter planning
• smarter executing
14. RANK OVER PARTITION
• In Oracle:
select userid, email from (
    select u.userid, u.email,
           row_number() over
               (partition by u.email order by userid desc) as position
    from (...)) where position = 1
• In PostgreSQL:
FOR v_row IN select u.userid, u.email from (...) order by email, userid desc
LOOP
    IF v_row.email != v_last_email THEN
        RETURN NEXT v_row;
        v_last_email := v_row.email;
        v_rownum := v_rownum + 1;
    END IF;
END LOOP;
With 8.4, we have windowing functions
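On 8.4 and later, the Oracle form ports nearly verbatim; the main syntactic difference is that PostgreSQL requires an alias on the subquery in FROM (the inner `(...)` stays elided as in the original):

```
SELECT userid, email FROM (
    SELECT u.userid, u.email,
           row_number() OVER
               (PARTITION BY u.email ORDER BY userid DESC) AS position
    FROM (...) u
) ranked
WHERE position = 1;
```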
15. Large SELECTs
• Application code does:
select u.*, b.browser, m.lastmess
from ods.ods_users u,
ods.ods_browsers b,
( select userid, min(senddate) as senddate
from ods.ods_maillog
group by userid ) m,
ods.ods_maillog l
where u.userid = b.userid
and u.userid = m.userid
and u.userid = l.userid
and l.senddate = m.senddate;
• The width of these rows is about 2k
• 50 million row return set
• > 100 GB of data
16. The Large SELECT Problem
• libpq will buffer the entire result in memory.
• This affects language bindings (DBD::Pg).
• This is an utterly deficient default behavior.
• This can be avoided by using cursors
• Requires the app to be PostgreSQL specific.
• You open a cursor.
• Then FETCH the row count you desire.
17. Big SELECTs the Postgres way
The previous “big” query becomes:
DECLARE bigdump CURSOR FOR
select u.*, b.browser, m.lastmess
from ods.ods_users u,
ods.ods_browsers b,
( select userid, min(senddate) as senddate
from ods.ods_maillog
group by userid ) m,
ods.ods_maillog l
where u.userid = b.userid
and u.userid = m.userid
and u.userid = l.userid
and l.senddate = m.senddate;
Then, in a loop:
FETCH FORWARD 10000 FROM bigdump;
18. Autonomous Transactions
• In Oracle we have over 2000 custom stored procedures.
• During these procedures, we like to:
• COMMIT incrementally
Useful for long transactions (update/delete) that
need not be atomic -- incremental COMMITs.
• start a new top-level txn that can COMMIT
Useful for logging progress in a stored procedure so
that you know how far you progressed and how long
each step took even if it rolls back.
19. PostgreSQL shortcoming
• PostgreSQL simply does not support
autonomous transactions; to quote core
developers, “that would be hard.”
• When in doubt, use brute force.
• Use pl/perl to use DBD::Pg to connect to
ourselves (a new backend) and execute a new
top-level transaction.
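A minimal sketch of that brute-force trick, assuming the untrusted plperlu language is installed; the connection string, function name, and log table are invented for illustration:

```
CREATE OR REPLACE FUNCTION log_progress(msg text) RETURNS void AS $$
    use DBI;
    # A second backend connection is a new top-level transaction,
    # independent of the caller's: its COMMIT survives even if the
    # calling transaction later rolls back.
    my $dbh = DBI->connect('dbi:Pg:dbname=pgods', 'ods', '',
                           { AutoCommit => 0, RaiseError => 1 });
    $dbh->do('INSERT INTO ods.proc_log (logged_at, message)
              VALUES (now(), ?)', undef, $_[0]);
    $dbh->commit;
    $dbh->disconnect;
$$ LANGUAGE plperlu;
```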
20. Replication
• Cross vendor database replication isn’t too difficult.
• Helps a lot when you can do it inside the database.
• Using dbi-link (based on pl/perl and DBI) we can.
• We can connect to any remote database.
• INSERT into local tables directly from remote
SELECT statements.
[snapshots]
• LOOP over remote SELECT statements and
process them row-by-row.
[replaying remote DML logs]
21. Replication (really)
• Through a combination of snapshotting and DML
replay we:
• replicate into over 2000 tables in PostgreSQL
from Oracle
• snapshot replication of 200
• DML replay logs for 1800
• PostgreSQL to Oracle is a bit harder
• out-of-band export and imports
22. New Architecture
• Master: Sun v890 and Hitachi AMS + warm standby
running Oracle
(1TB)
• Logs: several customs
running MySQL instances
(2TB each)
• ODS BI: 2x Sun v40
running PostgreSQL 8.3
(6TB on Sun JBODs on ZFS each)
• ODS archive: 2x custom
running PostgreSQL 8.3
(14TB internal storage on ZFS each)
23. PostgreSQL is Lacking
• pg_dump is too intrusive.
• Poor system-level instrumentation.
• Poor methods to determine specific contention.
• It relies on the operating system’s filesystem cache.
(which makes PostgreSQL inconsistent across its
supported OS base)
24. Enter Solaris
• Solaris is a UNIX from Sun Microsystems.
• Is it different than other UNIX/UNIX-like systems?
• Mostly it isn’t different (hence the term UNIX)
• It does have extremely strong ABI backward
compatibility.
• It’s stable and works well on large machines.
• Solaris 10 shakes things up a bit:
• DTrace
• ZFS
• Zones
25. Solaris / ZFS
• ZFS: Zettabyte Filesystem.
• 2^64 snapshots, 2^48 files/directory, 2^64 bytes/filesystem,
2^78 (256 ZiB) bytes in a pool, 2^64 devices/pool, 2^64 pools/system
• Extremely cheap differential backups.
• I have a 5 TB database, I need a backup!
• No rollback in your database? What is this? MySQL?
• No rollback in your filesystem?
• ZFS has snapshots, rollback, clone and promote.
• OMG! Life altering features.
• Caveat: ZFS is slower than alternatives, by about 10% with tuning.
26. Solaris / Zones
• Zones: Virtual Environments.
• Shared kernel.
• Can share filesystems.
• Segregated processes and privileges.
• No big deal for databases, right?
But Wait!
27. Solaris / ZFS + Zones = Magic Juju
https://labs.omniti.com/trac/pgsoltools/browser/trunk/pitr_clone/clonedb_startclone.sh
• ZFS snapshot, clone, delegate to zone, boot and run.
• When done, halt zone, destroy clone.
• We get a point-in-time copy of our PostgreSQL database:
• read-write,
• low disk-space requirements,
• NO LOCKS! Welcome back pg_dump,
you don’t suck (as much) anymore.
• Fast snapshot to usable copy time:
• On our 20 GB database: 1 minute.
• On our 1.2 TB database: 2 minutes.
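The snapshot/clone/zone dance looks roughly like this; pool, filesystem, and zone names are invented, and the real, scripted version is the pitr_clone script linked above:

```
# Instantaneous, lock-free snapshot of the live PGDATA filesystem:
zfs snapshot data/pgdata@pitr
# Writable clone of that snapshot (near-zero disk cost):
zfs clone data/pgdata@pitr data/pgclone
# Delegate the clone to a zone and boot it; postgres runs
# crash recovery against the point-in-time copy:
zonecfg -z pgclone ...
zoneadm -z pgclone boot
# ...use the read-write copy (e.g. run pg_dump), then tear down:
zoneadm -z pgclone halt
zfs destroy data/pgclone
zfs destroy data/pgdata@pitr
```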
28. ZFS: how I saved my soul.
• Database crash. Bad. 1.2 TB of data... busted.
The reason Robert Treat looks a bit older than he
should.
• xlogs corrupted. catalog indexes corrupted.
• Fault? PostgreSQL bug? Bad memory? Who knows?
• Trial & error on a 1.2 TB data set is a cruel experience.
• In real-life, most recovery actions are destructive
actions.
• PostgreSQL is no different.
• Rollback to last checkpoint (ZFS), hack postgres code,
try, fail, repeat.
29. Let DTrace open your eyes
• DTrace: Dynamic Tracing
• Dynamically instrument “stuff” in the system:
• system calls (like strace/truss/ktrace).
• process/scheduler activity (on/off cpu, semaphores, conditions).
• see signals sent and received.
• trace kernel functions, networking.
• watch I/O down to the disk.
• user-space processes, each function... each machine instruction!
• Add probes into apps where it makes sense to you.
30. Can you see what I see?
• There is EXPLAIN... when that isn’t enough...
• There is EXPLAIN ANALYZE... when that isn’t enough.
• There is DTrace.
; dtrace -q -n '
postgresql*:::statement-start
{
  self->query = copyinstr(arg0);
  self->ok = 1;
}
io:::start
/self->ok/
{
  @[self->query,
    args[0]->b_flags & B_READ ? "read" : "write",
    args[1]->dev_statname] = sum(args[0]->b_bcount);
}'
dtrace: description 'postgres*:::statement-start' matched 14 probes
^C
select count(1) from c2w_ods.tblusers where zipcode between 10000 and 11000;
read sd1 16384
select division, sum(amount), avg(amount) from ods.billings where txn_timestamp
between '2006-01-01 00:00:00' and '2006-04-01 00:00:00' group by division;
read sd2 71647232
32. Results
• Move ODS Oracle licenses to OLTP
• Run PostgreSQL on ODS
• Save $800k in license costs.
• Spend $100k in labor costs.
• Learn a lot.
33. Thanks!
• Thank you.
• http://omniti.com/does/postgresql
• We’re hiring, but only if you love:
• lots of data on lots of disks on lots of big boxes
• smart people
• hard problems
• more than one database technology (including PostgreSQL)
• responsibility