1. Big Bad PostgreSQL: A Case Study
Moving a “large,” “complicated,” and mission-critical data warehouse from Oracle to PostgreSQL for cost control.
Monday, August 2, 2010
2. Who am I? @postwait on twitter
Author of “Scalable Internet Architectures”
Pearson, ISBN: 067232699X
CEO of OmniTI
We build scalable and secure web applications
I am an Engineer
A practitioner of academic computing.
IEEE member and Senior ACM member.
On the Editorial Board of ACM’s Queue magazine.
I work on/with a lot of Open Source software:
Apache, perl, Linux, Solaris, PostgreSQL,
Varnish, Spread, Reconnoiter, etc.
I have experience.
I’ve had the unique opportunity to watch a great many catastrophes.
I enjoy immersing myself in the pathology of architecture failures.
3. Overall Architecture
[Flattened architecture diagram; the recoverable components:]
• Oracle 8i OLTP instance (0.5 TB + 0.25 TB, Hitachi JBOD): drives the site
• Oracle 8i OLTP warm backup / warm spare (0.5 TB and 1.5 TB, Hitachi MTI)
• Oracle 8i log import and processing (0.75 TB, JBOD)
• MySQL 4.1 log importer (1.2 TB, IDE RAID)
• Data warehouse (1.2 TB, SATA RAID)
• Data exporter: bulk selects / data exports
4. Database Situation
• The problems:
• The database is growing.
• The OLTP and ODS/warehouse are too slow.
• A lot of application code against the OLTP system.
• Minimal application code against the ODS system.
• Oracle:
• Licensed per processor.
• Really, really, really expensive on a large scale.
• PostgreSQL:
• No licensing costs.
• Good support for complex queries.
5. Database Choices
• Must keep Oracle on OLTP
• Complex, Oracle-specific web application.
• Need more processors.
• ODS: Oracle not required.
• Complex queries from limited sources.
• Needs more space and power.
• Result:
• Move ODS Oracle licenses to OLTP
• Run PostgreSQL on ODS
6. PostgreSQL gotchas
• For an OLTP system that does thousands of
updates per second, vacuuming is a hassle.
• No upgrades?!
• pg_migrator is coming along.
• Less community experience with large
databases.
• Replication features less evolved
• though PostgreSQL 9.0 ups the ante
7. PostgreSQL ♥ ODS
• Mostly inserts.
• Updates/Deletes controlled, not real-time.
• pl/perl (leverage DBI/DBD for remote
database connectivity).
• Monster queries.
• Extensible.
8. Choosing Linux
• Popular, liked, good community support.
• Chronic problems:
• kernel panics
• filesystems remounting read-only
• filesystems don’t support snapshots
• LVM is clunky on enterprise storage
• 20 outages in 4 months
9. Choosing Solaris 10
• Switched to Solaris 10
• No crashes, better system-level tools.
• prstat, iostat, vmstat, smf, fault-management.
• ZFS
• snapshots (persistent), BLI backups.
• Excellent support for enterprise storage.
• DTrace.
• Free (too).
10. Oracle features we need
• Partitioning
• Statistics and Aggregations
• rank over partition, lead, lag, etc.
• Large selects (100GB)
• Autonomous transactions
• Replication from Oracle (to Oracle)
11. Partitioning
For large data sets:
pgods=# select count(1) from ods.ods_tblpick_super;
count
------------
3247286017
(1 row)
• Next biggest tables: billions
• Allows us to cluster data over specific ranges
(by date in our case)
• Simple, cheap archiving and removal of data.
• Can put ranges used less often in different
tablespaces (slower, cheaper storage)
12. Partitioning PostgreSQL style
• PostgreSQL doesn’t support partitioning...
• It supports inheritance... (what’s this?)
• some crazy object-relational paradigm.
• We can use it to implement partitioning:
• One master table with no rows.
• Child tables that have our partition constraints.
• Rules on the master table for insert/update/delete.
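The recipe above can be sketched in SQL for one date-ranged child; the table and column names here are invented for illustration, following the partitioning pattern in the 8.x documentation:

```
-- Master table: holds no rows itself.
CREATE TABLE ods.maillog (
    userid    integer,
    senddate  date
);

-- One child per date range; the CHECK constraint is what
-- lets the planner exclude this partition.
CREATE TABLE ods.maillog_2010_07 (
    CHECK (senddate >= DATE '2010-07-01'
       AND senddate <  DATE '2010-08-01')
) INHERITS (ods.maillog);

-- Route inserts on the master to the right child (one rule
-- per partition).
CREATE RULE maillog_insert_2010_07 AS
    ON INSERT TO ods.maillog
    WHERE (senddate >= DATE '2010-07-01'
       AND senddate <  DATE '2010-08-01')
    DO INSTEAD
    INSERT INTO ods.maillog_2010_07 VALUES (NEW.*);
```

With constraint_exclusion enabled, a query filtered on senddate only touches the children whose CHECK constraints can match.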
13. Partitioning PostgreSQL realized
• Cheaply add new empty partitions
• Cheaply remove old partitions
• Migrate less-often-accessed partitions to slower
storage
• Different index strategies per partition
• PostgreSQL >8.1 supports constraint checking on
inherited tables.
• smarter planning
• smarter executing
14. RANK OVER PARTITION
• In Oracle:
select userid, email from (
    select u.userid, u.email,
           row_number() over
               (partition by u.email order by userid desc) as position
    from (...)) where position = 1
• In PostgreSQL:
FOR v_row IN select u.userid, u.email from (...) order by email, userid desc
LOOP
    IF v_row.email != v_last_email THEN
        RETURN NEXT v_row;
        v_last_email := v_row.email;
        v_rownum := v_rownum + 1;
    END IF;
END LOOP;
With 8.4, we have windowing functions
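On 8.4 and later, the Oracle form ports nearly verbatim; the main syntactic difference is that PostgreSQL requires an alias on the subquery in FROM (the inner `(...)` stays elided as in the original):

```
SELECT userid, email FROM (
    SELECT u.userid, u.email,
           row_number() OVER
               (PARTITION BY u.email ORDER BY userid DESC) AS position
    FROM (...) u
) ranked
WHERE position = 1;
```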
15. Large SELECTs
• Application code does:
select u.*, b.browser, m.lastmess
from ods.ods_users u,
ods.ods_browsers b,
( select userid, min(senddate) as senddate
from ods.ods_maillog
group by userid ) m,
ods.ods_maillog l
where u.userid = b.userid
and u.userid = m.userid
and u.userid = l.userid
and l.senddate = m.senddate;
• The width of these rows is about 2k
• 50 million row return set
• > 100 GB of data
16. The Large SELECT Problem
• libpq will buffer the entire result in memory.
• This affects language bindings (DBD::Pg).
• This is an utterly deficient default behavior.
• This can be avoided by using cursors
• Requires the app to be PostgreSQL specific.
• You open a cursor.
• Then FETCH the row count you desire.
17. Big SELECTs the Postgres way
The previous “big” query becomes:
DECLARE bigdump CURSOR FOR
select u.*, b.browser, m.lastmess
from ods.ods_users u,
ods.ods_browsers b,
( select userid, min(senddate) as senddate
from ods.ods_maillog
group by userid ) m,
ods.ods_maillog l
where u.userid = b.userid
and u.userid = m.userid
and u.userid = l.userid
and l.senddate = m.senddate;
Then, in a loop:
FETCH FORWARD 10000 FROM bigdump;
18. Autonomous Transactions
• In Oracle we have over 2000 custom stored procedures.
• During these procedures, we like to:
• COMMIT incrementally
Useful for long transactions (update/delete) that
need not be atomic -- incremental COMMITs.
• start a new top-level txn that can COMMIT
Useful for logging progress in a stored procedure so
that you know how far you progressed and how long
each step took even if it rolls back.
19. PostgreSQL shortcoming
• PostgreSQL simply does not support
autonomous transactions; to quote core
developers, “that would be hard.”
• When in doubt, use brute force.
• Use pl/perl to use DBD::Pg to connect to
ourselves (a new backend) and execute a new
top-level transaction.
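A minimal sketch of that brute-force trick, assuming the untrusted plperlu language is installed; the connection string, function name, and log table are invented for illustration:

```
CREATE OR REPLACE FUNCTION log_progress(msg text) RETURNS void AS $$
    use DBI;
    # A second backend connection is a new top-level transaction,
    # independent of the caller's: its COMMIT survives even if the
    # calling transaction later rolls back.
    my $dbh = DBI->connect('dbi:Pg:dbname=pgods', 'ods', '',
                           { AutoCommit => 0, RaiseError => 1 });
    $dbh->do('INSERT INTO ods.proc_log (logged_at, message)
              VALUES (now(), ?)', undef, $_[0]);
    $dbh->commit;
    $dbh->disconnect;
$$ LANGUAGE plperlu;
```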
20. Replication
• Cross vendor database replication isn’t too difficult.
• Helps a lot when you can do it inside the database.
• Using dbi-link (based on pl/perl and DBI) we can.
• We can connect to any remote database.
• INSERT into local tables directly from remote
SELECT statements.
[snapshots]
• LOOP over remote SELECT statements and
process them row-by-row.
[replaying remote DML logs]
21. Replication (really)
• Through a combination of snapshotting and DML
replay we:
• replicate into over 2000 tables in PostgreSQL
from Oracle
• snapshot replication of 200
• DML replay logs for 1800
• PostgreSQL to Oracle is a bit harder
• out-of-band export and imports
22. New Architecture
• Master: Sun v890 and Hitachi AMS + warm standby
running Oracle
(1TB)
• Logs: several customs
running MySQL instances
(2TB each)
• ODS BI: 2x Sun v40
running PostgreSQL 8.3
(6TB on Sun JBODs on ZFS each)
• ODS archive: 2x custom
running PostgreSQL 8.3
(14TB internal storage on ZFS each)
23. PostgreSQL is Lacking
• pg_dump is too intrusive.
• Poor system-level instrumentation.
• Poor methods to determine specific contention.
• It relies on the operating system’s filesystem cache.
(which makes PostgreSQL inconsistent across its
supported OS base)
24. Enter Solaris
• Solaris is a UNIX from Sun Microsystems.
• Is it different than other UNIX/UNIX-like systems?
• Mostly it isn’t different (hence the term UNIX)
• It does have extremely strong ABI backward
compatibility.
• It’s stable and works well on large machines.
• Solaris 10 shakes things up a bit:
• DTrace
• ZFS
• Zones
25. Solaris / ZFS
• ZFS: Zettabyte Filesystem.
• 2^64 snapshots, 2^48 files/directory, 2^64 bytes/filesystem,
2^78 (256 ZiB) bytes in a pool, 2^64 devices/pool, 2^64 pools/system
• Extremely cheap differential backups.
• I have a 5 TB database, I need a backup!
• No rollback in your database? What is this? MySQL?
• No rollback in your filesystem?
• ZFS has snapshots, rollback, clone and promote.
• OMG! Life altering features.
• Caveat: ZFS is slower than alternatives, by about 10% with tuning.
26. Solaris / Zones
• Zones: Virtual Environments.
• Shared kernel.
• Can share filesystems.
• Segregated processes and privileges.
• No big deal for databases, right?
But Wait!
27. Solaris / ZFS + Zones = Magic Juju
https://labs.omniti.com/trac/pgsoltools/browser/trunk/pitr_clone/clonedb_startclone.sh
• ZFS snapshot, clone, delegate to zone, boot and run.
• When done, halt zone, destroy clone.
• We get a point-in-time copy of our PostgreSQL database:
• read-write,
• low disk-space requirements,
• NO LOCKS! Welcome back pg_dump,
you don’t suck (as much) anymore.
• Fast snapshot to usable copy time:
• On our 20 GB database: 1 minute.
• On our 1.2 TB database: 2 minutes.
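The snapshot/clone/zone dance looks roughly like this; pool, filesystem, and zone names are invented, and the real, scripted version is the pitr_clone script linked above:

```
# Instantaneous, lock-free snapshot of the live PGDATA filesystem:
zfs snapshot data/pgdata@pitr
# Writable clone of that snapshot (near-zero disk cost):
zfs clone data/pgdata@pitr data/pgclone
# Delegate the clone to a zone and boot it; postgres runs
# crash recovery against the point-in-time copy:
zonecfg -z pgclone ...
zoneadm -z pgclone boot
# ...use the read-write copy (e.g. run pg_dump), then tear down:
zoneadm -z pgclone halt
zfs destroy data/pgclone
zfs destroy data/pgdata@pitr
```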
28. ZFS: how I saved my soul.
• Database crash. Bad. 1.2 TB of data... busted.
The reason Robert Treat looks a bit older than he
should.
• xlogs corrupted. catalog indexes corrupted.
• Fault? PostgreSQL bug? Bad memory? Who knows?
• Trial & error on a 1.2 TB data set is a cruel experience.
• In real-life, most recovery actions are destructive
actions.
• PostgreSQL is no different.
• Rollback to last checkpoint (ZFS), hack postgres code,
try, fail, repeat.
29. Let DTrace open your eyes
• DTrace: Dynamic Tracing
• Dynamically instrument “stuff” in the system:
• system calls (like strace/truss/ktrace).
• process/scheduler activity (on/off cpu, semaphores, conditions).
• see signals sent and received.
• trace kernel functions, networking.
• watch I/O down to the disk.
• user-space processes, each function... each machine instruction!
• Add probes into apps where it makes sense to you.
30. Can you see what I see?
• There is EXPLAIN... when that isn’t enough...
• There is EXPLAIN ANALYZE... when that isn’t enough.
• There is DTrace.
; dtrace -q -n '
postgresql*:::statement-start
{
  self->query = copyinstr(arg0);
  self->ok = 1;
}
io:::start
/self->ok/
{
  @[self->query,
    args[0]->b_flags & B_READ ? "read" : "write",
    args[1]->dev_statname] = sum(args[0]->b_bcount);
}'
dtrace: description 'postgres*:::statement-start' matched 14 probes
^C
select count(1) from c2w_ods.tblusers where zipcode between 10000 and 11000;
read sd1 16384
select division, sum(amount), avg(amount) from ods.billings where txn_timestamp
between '2006-01-01 00:00:00' and '2006-04-01 00:00:00' group by division;
read sd2 71647232
32. Results
• Move ODS Oracle licenses to OLTP
• Run PostgreSQL on ODS
• Save $800k in license costs.
• Spend $100k in labor costs.
• Learn a lot.
33. Thanks!
• Thank you.
• http://omniti.com/does/postgresql
• We’re hiring, but only if you love:
• lots of data on lots of disks on lots of big boxes
• smart people
• hard problems
• more than one database technology (including PostgreSQL)
• responsibility