Really Big Elephants

                          Josh Berkus
           MySQL User Conference 2011
I will cover:
   ●   advantages of Postgres for DW
   ●   configuration
   ●   tablespaces
   ●   ETL/ELT
   ●   windowing
   ●   partitioning
   ●   materialized views
I won't cover:
   ●   hardware selection
   ●   EAV / blobs
   ●   denormalization
   ●   DW query tuning
   ●   external DW tools
   ●   backups & upgrades
What is a
“data warehouse”?
synonyms etc.
●   Business Intelligence
    ●   also BI/DW
●   Analytics database
●   OnLine Analytical Processing
●   Data Mining
●   Decision Support
OLTP                                      vs   DW
●   many single-row writes                     ●   few large batch imports
●   current data                               ●   years of data
●   queries generated by user activity         ●   queries generated by large reports
●   < 1s response times                        ●   queries can run for hours
●   0.5 to 5x RAM                              ●   5x to 2000x RAM
●   100 to 1000 users                          ●   1 to 10 users
●   constraints                                ●   no constraints
Why use
 PostgreSQL for
data warehousing?
Complex Queries
      CASE WHEN ((SUM(inventory.closed_on_hand) + SUM(changes.received) + SUM(changes.adjustments) +
SUM(changes.transferred_in-changes.transferred_out)) <> 0) THEN ROUND((CAST(SUM(changes.sold_and_closed +
changes.returned_and_closed) AS numeric) * 100) / CAST(SUM(starting.closed_on_hand) + SUM(changes.received) +
SUM(changes.adjustments) + SUM(changes.transferred_in-changes.transferred_out) AS numeric), 5) ELSE 0 END AS "Percent_Sold",
      CASE WHEN (SUM(changes.sold_and_closed) <> 0) THEN ROUND(100*((SUM(changes.closed_markdown_units_sold)*1.0) /
SUM(changes.sold_and_closed)), 5) ELSE 0 END AS "Percent_of_Units_Sold_with_Markdown",
      CASE WHEN (SUM(changes.sold_and_closed * _sku.retail_price) <> 0) THEN
ROUND(100*(SUM(changes.closed_markdown_dollars_sold)*1.0) / SUM(changes.sold_and_closed * _sku.retail_price), 5) ELSE 0 END AS
      '0' AS "Percent_of_Total_Sales",
      CASE WHEN SUM((changes.sold_and_closed + changes.returned_and_closed) * _sku.retail_price) IS NULL THEN 0 ELSE
SUM((changes.sold_and_closed + changes.returned_and_closed) * _sku.retail_price) END AS "Net_Sales_at_Retail",
      '0' AS "Percent_of_Ending_Inventory_at_Retail", SUM(inventory.closed_on_hand * _sku.retail_price) AS
      "_store"."label" AS "Store",
      "_department"."label" AS "Department",
      "_vendor"."name" AS "Vendor_Name"
        JOIN inventory as starting
                ON inventory.warehouse_id = starting.warehouse_id
                        AND inventory.sku_id = starting.sku_id
                ( SELECT warehouse_id, sku_id,
                        sum(received) as received,
                        sum(transferred_in) as transferred_in,
                        sum(transferred_out) as transferred_out,
                        sum(adjustments) as adjustments,
                        sum(sold) as sold
                FROM movement
                WHERE movement.movement_date BETWEEN '2010-08-05' AND '2010-08-19'
                GROUP BY sku_id, warehouse_id ) as changes
                ON inventory.warehouse_id = changes.warehouse_id
                        AND inventory.sku_id = changes.sku_id
      JOIN _sku ON = inventory.sku_id
      JOIN _warehouse ON = inventory.warehouse_id
      JOIN _location_hierarchy AS _store ON = _warehouse.store_id
                AND _store.type = 'Store'
      JOIN _product ON = _sku.product_id
      JOIN _merchandise_hierarchy AS _department
Complex Queries
●   JOIN optimization
    ●   5 different JOIN types
    ●   approximate planning for 20+ table joins
●   subqueries in any clause
    ●   plus nested subqueries
●   windowing queries
●   recursive queries
Big Data Features
●   big tables      → partitioning
●   big databases   → tablespaces
●   big backups     → PITR
●   big updates     → binary replication
●   big queries     → resource control
●   add data analysis functionality from
    external libraries inside the database
    ●   financial analysis
    ●   genetic sequencing
    ●   approximate queries
●   create your own:
    ●   data types
    ●   functions
    ●   aggregates
    ●   operators
“I'm running a partitioning scheme using 256 tables with a maximum
of 16 million rows (namely IPv4-addresses) and a current total of
about 2.5 billion rows, there are no deletes though, but lots of

“I use PostgreSQL basically as a data warehouse to store all the
genetic data that our lab generates … With this configuration I figure
I'll have ~3TB for my main data tables and 1TB for indexes. ”

 ●   lots of experience with large databases
 ●   blogs, tools, online help
Sweet Spot
[Figure: horizontal scale from 0 to 30 with a "DW Database" series; the other labels were lost]
DW Databases
●   Vertica         ●   Netezza
●   Greenplum       ●   HadoopDB
●   Aster Data      ●   LucidDB
●   Infobright      ●   MonetDB
●   Teradata        ●   SciDB
●   Hadoop/HBase    ●   Paraccel
How do I configure
 PostgreSQL for
data warehousing?
General Setup
●   Latest version of PostgreSQL
●   System with lots of drives
    ●   6 to 48 drives
        –   or 2 to 12 SSDs
    ●   High-throughput RAID
●   Write ahead log (WAL) on separate disk(s)
    ●   10 to 50 GB space
separate the
DW workload
onto its own
server
few connections
max_connections = 10 to 40

raise those memory limits!
shared_buffers = 1/8 to ¼ of RAM
work_mem = 128MB to 1GB
maintenance_work_mem = 512MB to 1GB
temp_buffers = 128MB to 1GB
effective_cache_size = ¾ of RAM
wal_buffers = 16MB
No autovacuum
autovacuum = off
vacuum_cost_delay = 0
●   do your VACUUMs and ANALYZEs as part
    of the batch load process
    ●   usually several of them
●   also maintain tables by partitioning
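With autovacuum off, the maintenance has to be scripted into the load itself. A minimal sketch, assuming a hypothetical import table `weblog_import` and a hypothetical file path (neither is from the talk):

```sql
-- Hypothetical nightly load step; table name and path are assumptions.
COPY weblog_import FROM '/data/incoming/weblog.csv' WITH csv;

-- VACUUM cannot run inside a transaction block, so issue it as its
-- own statement after the load has committed.
VACUUM ANALYZE weblog_import;
```

If a batch only inserted rows into a fresh table, a plain `ANALYZE` may be enough to refresh planner statistics; the `VACUUM` matters most when the batch also updated or deleted rows.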
What are
tablespaces?
logical data extents
●   lets you put some of your data on specific
    devices / disks

CREATE TABLESPACE history_log
LOCATION '/mnt/san2/history_log';
tablespace reasons
●   parallelize access
    ●   your largest “fact table” on one tablespace
    ●   its indexes on another
        –   not as useful if you have a good SAN
●   temp tablespace for temp tables
●   move key join tables to SSD
●   migrate to new storage one table at a time
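The reasons above map onto a handful of statements. This is only a sketch: every tablespace name, mount point, and table name here is hypothetical, and the directories must already exist and be owned by the PostgreSQL OS user.

```sql
-- Hypothetical names and mount points, for illustration only.
CREATE TABLESPACE fact_disk  LOCATION '/mnt/array1/pg_fact';
CREATE TABLESPACE index_disk LOCATION '/mnt/array2/pg_index';
CREATE TABLESPACE scratch    LOCATION '/mnt/array3/pg_temp';
CREATE TABLESPACE new_san    LOCATION '/mnt/san3/pg_data';

-- Largest fact table on one tablespace, its indexes on another.
CREATE TABLE sales_fact (
  sale_id     BIGINT,
  sell_date   DATE,
  sale_amount NUMERIC
) TABLESPACE fact_disk;

CREATE INDEX sales_fact_date_idx
  ON sales_fact (sell_date) TABLESPACE index_disk;

-- Send temp tables and sort spill files to the scratch tablespace.
SET temp_tablespaces = 'scratch';

-- Migrate one table to new storage; this rewrites the table and
-- holds an exclusive lock on it while it runs.
ALTER TABLE sales_fact SET TABLESPACE new_san;
```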
What is ETL
and how do I do it?
Extract, Transform, Load
●   how you turn external raw data into
    normalized database data
    ●   Apache logs → web analytics DB
    ●   CSV POS files → financial reporting DB
    ●   OLTP server → 10-year data warehouse
●   also called ELT when the transformation is
    done inside the database
    ●   PostgreSQL is particularly good for ELT
●   batch INSERTs into 100's or 1000's per
    transaction
    ●   row-at-a-time is very slow
●   create and load import tables in one
    transaction
●   add indexes and constraints after load
●   insert several streams in parallel
    ●   but not more than CPU cores
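The loading tips above might look like this in practice. The table `import_sales`, its columns, and the sample rows are made up for illustration; a real batch would carry hundreds or thousands of rows per statement.

```sql
BEGIN;

-- Hypothetical import table, created in the same transaction as the load.
CREATE TABLE import_sales (
  seller_id   INT,
  sell_date   DATE,
  sale_amount NUMERIC
);

-- One multi-row INSERT instead of one statement per row.
INSERT INTO import_sales (seller_id, sell_date, sale_amount) VALUES
  (101, '2011-06-01', 19.99),
  (101, '2011-06-01', 5.00),
  (102, '2011-06-02', 42.50);

COMMIT;

-- Indexes and constraints go on after the data is in place.
CREATE INDEX import_sales_date_idx ON import_sales (sell_date);
ALTER TABLE import_sales
  ALTER COLUMN seller_id SET NOT NULL;
```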
●   Powerful, efficient delimited file loader
    ●   almost bug-free - we use it for backup
    ●   3-5X faster than inserts
    ●   works with most delimited files
●   Not fault-tolerant
    ●   also have to know structure in advance
    ●   try pgloader for better COPY
COPY weblog_new FROM
'20110605.csv' with csv;
COPY traffic_snapshot FROM
'traffic_20110605192241' delimiter
'|' null as 'N';
\copy weblog_summary_june TO 'Desktop/weblog-june2011.csv' with csv header
L: in 9.1: FDW
CREATE FOREIGN TABLE raw_hits
( hit_time TIMESTAMP,
  page TEXT )
SERVER file_fdw
OPTIONS (format 'csv', delimiter
';', filename '/var/log/hits.log');
L: in 9.1: FDW
CREATE TABLE hits_2011041617 AS
SELECT page, count(*)
FROM raw_hits
WHERE hit_time >
  '2011-04-16 16:00:00' AND
  hit_time <= '2011-04-16 17:00:00'
GROUP BY page;
T: temporary tables
CREATE TEMPORARY TABLE
sales_records_june_rollup AS
SELECT seller_id, location,
  sell_date, sum(sale_amount)
FROM raw_sales
WHERE sell_date BETWEEN '2011-06-01'
  AND '2011-06-30 23:59:59.999'
GROUP BY seller_id, location,
  sell_date;
in 9.1: unlogged tables
●   like MyISAM without the risk

AS SELECT hit_time, page
FROM raw_hits, hit_watermark
WHERE hit_time > last_watermark
  AND is_valid(page);
T: stored procedures
●   multiple languages
    ●   SQL PL/pgSQL
    ●   PL/Perl PL/Python PL/PHP
    ●   PL/R PL/Java
    ●   allows you to use external data processing
        libraries in the database
●   custom aggregates, operators, more
CREATE OR REPLACE FUNCTION normalize_query ( queryin text )
RETURNS text LANGUAGE plperl AS $f$
# this function "normalizes" queries by stripping out constants.
# some regexes by Guillaume Smet under The PostgreSQL License.
local $_ = $_[0];
#first cleanup the whitespace
        s/\s+/ /g;
        s/,(\S)/, $1/g;
#remove any double quotes and quoted text
#remove TRUE and FALSE
#remove any bare numbers or hex numbers
#normalize any IN statements
#return the normalized query
return $_;
$f$;
sql <- paste("SELECT id as x,hit as y FROM mytemp LIMIT
str <- c(pg.spi.exec(sql));
mymain <- "Graph 2";
mysub <- paste("The worst offender is: ",str[1,3]," with
",str[1,2]," hits",sep="");
myxlab <- "Top 30 IP Addresses";
myylab <- "Number of Hits";
mtext("Probes by intrusive IP Addresses",side=3);
ELT Tips
●   bulk insert into a new table instead of
    updating/deleting an existing table
●   update all columns in one operation
    instead of one at a time
●   use views and custom functions to simplify
    your queries
●   inserting into your long-term tables should
    be the very last step – no updates after!
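A rough sketch of these tips, reusing the `raw_sales` columns from the temporary-table slide. The names `sales_clean` and `sales_history` are hypothetical, `sale_amount` is assumed to be NUMERIC, and `sales_history` is assumed to have the same four columns in the same order.

```sql
BEGIN;

-- Build the cleaned table in one pass: every column is transformed
-- in the same statement rather than by a series of UPDATEs.
CREATE TABLE sales_clean AS
SELECT seller_id,
       upper(trim(location))  AS location,
       sell_date::date        AS sell_date,
       round(sale_amount, 2)  AS sale_amount
FROM raw_sales
WHERE sale_amount IS NOT NULL;

-- Final step: append to the long-term table, then leave it alone.
INSERT INTO sales_history
SELECT * FROM sales_clean;

COMMIT;
```

Writing a fresh table avoids the dead rows that a large UPDATE or DELETE would leave behind for VACUUM to clean up.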
What's a
windowing query?
regular aggregate
windowing function
CREATE TABLE events (
  event_id INT,
  event_type TEXT,
  start TIMESTAMP,
  duration INTERVAL,
  event_desc TEXT
);
SELECT MAX(concurrent)
FROM (
  SELECT SUM(tally)
    OVER (ORDER BY start)
    AS concurrent
   FROM (
    SELECT start, 1::INT as tally
      FROM events
    UNION ALL
    SELECT (start + duration), -1
      FROM events )
   AS event_vert) AS ec;
UPDATE partition_name SET drop_month = dropit
FROM (
SELECT round_id,
       CASE WHEN ( ( row_number() over
       (partition by team_id order by team_id, total_points) )
              <= ( drop_lowest ) ) THEN 0 ELSE 1 END as dropit
FROM (
       SELECT team.team_id, round.round_id, month_points as total_points,
              row_number() OVER (
                     partition by team.team_id, kal.positions
                     order by team.team_id, kal.positions,
                     month_points desc ) as ordinal,
                     at_least, numdrop as drop_lowest
       FROM partition_name as rdrop
              JOIN round USING (round_id)
              JOIN team USING (team_id)
              JOIN pick ON round.round_id = pick.round_id
                     and pick.pick_period @> this_period
              LEFT OUTER JOIN keep_at_least kal
                     ON rdrop.pool_id = kal.pool_id
                     and pick.position_id = any ( kal.positions )
                     WHERE rdrop.pool_id = this_pool
                            AND team.team_id = this_team ) as ranking
       WHERE ordinal > at_least or at_least is null
       ) as droplow
 WHERE droplow.round_id = partition_name.round_id
     AND partition_name.pool_id = this_pool AND dropit = 0;
SELECT round_id,
    CASE WHEN ( ( row_number() OVER
    (partition by team_id
       order by team_id, total_points) )
          <= ( drop_lowest ) )
     THEN 0 ELSE 1 END as dropit
FROM (
    SELECT team.team_id, round.round_id,
         month_points as total_points,
          row_number() OVER (
              partition by team.team_id,
                 kal.positions
              order by team.team_id,
                 kal.positions, month_points
                   desc ) as ordinal
stream processing SQL
●   replace multiple queries with a single query
    ●   avoid scanning large tables multiple times
●   replace pages of application code
    ●   and MB of data transmission
●   SQL alternative to map/reduce
    ●   (for some data mining tasks)
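As an illustration of the single-scan point, the sketch below computes a running total, a per-seller ranking, and a share of total in one pass over `raw_sales` (columns as in the temporary-table slide). Without window functions this would typically take three queries or a self-join. The output column names are hypothetical.

```sql
SELECT seller_id, sell_date, sale_amount,
       -- running total per seller, in date order
       sum(sale_amount) OVER (PARTITION BY seller_id
                              ORDER BY sell_date)      AS running_total,
       -- rank of each sale among that seller's sales
       rank() OVER (PARTITION BY seller_id
                    ORDER BY sale_amount DESC)         AS rank_for_seller,
       -- share of the seller's total; NULLIF guards against a zero total
       sale_amount
         / NULLIF(sum(sale_amount) OVER (PARTITION BY seller_id), 0)
                                                       AS share_of_seller
FROM raw_sales;
```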
How do I partition
  my tables?
Postgres partitioning
●   based on table inheritance and constraint
    exclusion
    ●   partitions are also full tables
    ●   explicit constraints define the range of the
        partition
    ●   triggers or RULEs handle insert/update
CREATE TABLE sales (
  sell_date TIMESTAMPTZ NOT NULL,
  seller_id INT NOT NULL,
  item_id INT NOT NULL,
  sale_amount NUMERIC NOT NULL,
  narrative TEXT );
CREATE TABLE sales_2011_06 (
  CONSTRAINT partition_date_range
  CHECK (sell_date >= '2011-06-01'
    AND sell_date < '2011-07-01' )
  ) INHERITS ( sales );
CREATE FUNCTION sales_insert ()
RETURNS trigger
LANGUAGE plpgsql AS $f$
BEGIN
   CASE WHEN NEW.sell_date < '2011-06-01'
      THEN INSERT INTO sales_2011_05 VALUES (NEW.*);
   WHEN NEW.sell_date < '2011-07-01'
      THEN INSERT INTO sales_2011_06 VALUES (NEW.*);
   WHEN NEW.sell_date < '2011-08-01'
      THEN INSERT INTO sales_2011_07 VALUES (NEW.*);
   ELSE
      INSERT INTO sales_overflow VALUES (NEW.*);
   END CASE;
   RETURN NULL;
END; $f$;

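The routing function only takes effect once it is attached to the parent table. A sketch of that step and of a query that benefits from constraint exclusion; the trigger name is hypothetical, and this assumes `sales_insert()` returns NULL so the row is not also stored in the parent.

```sql
-- Route every row inserted into the parent table through sales_insert().
CREATE TRIGGER sales_insert_trigger
  BEFORE INSERT ON sales
  FOR EACH ROW EXECUTE PROCEDURE sales_insert();

-- Let the planner skip partitions whose CHECK constraint rules them out.
SET constraint_exclusion = partition;

-- Partitions whose date range excludes this predicate are not scanned.
SELECT sum(sale_amount)
FROM sales
WHERE sell_date >= '2011-06-10'
  AND sell_date <  '2011-06-20';
```

Running the query under `EXPLAIN` is a quick way to confirm which partitions the planner actually visits.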
Postgres partitioning
●   Good for:
    ●   “rolling off” data
    ●   DB maintenance
    ●   queries which use the partition key
    ●   under 300 partitions
    ●   insert performance
●   Bad for:
    ●   administration
    ●   queries which do not use the partition key
    ●   JOINs
    ●   over 300 partitions
    ●   update
you need a data
           expiration policy
●   you can't plan your DW otherwise
    ●   sets your storage requirements
    ●   lets you project how queries will run when
        database is “full”
●   will take a lot of meetings
    ●   people don't like talking about deleting data
●   raw import data              1 month
●   detail-level transactions    3 years
●   detail-level web logs        1 year
●   rollups                     10 years
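With date-based partitions, enforcing an expiration policy can be as cheap as dropping a table. A sketch using the `sales_2011_05` partition named in the trigger function, assuming it inherits from `sales` like `sales_2011_06` does:

```sql
-- Detach the expired partition first so queries on sales stop seeing it.
ALTER TABLE sales_2011_05 NO INHERIT sales;

-- Archive it here if the policy requires (for example with pg_dump),
-- then drop it. No row-by-row DELETE and no VACUUM afterwards.
DROP TABLE sales_2011_05;
```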
What's a
materialized view?
query results as table
●   calculate once, read many times
    ●   complex/expensive queries
    ●   frequently referenced
●   not necessarily a whole query
    ●   often part of a query
●   manually maintained in PostgreSQL
    ●   automagic support not complete yet
SELECT page,
  COUNT(*) as total_hits
FROM hit_counter
WHERE date_trunc('day', hit_date)
  BETWEEN now() - INTERVAL '7 days'
    AND now()
GROUP BY page
ORDER BY total_hits DESC LIMIT 10;
CREATE TABLE page_hits (
  page TEXT,
  hit_day DATE,
  total_hits INT,
  CONSTRAINT page_hits_pk
  PRIMARY KEY(hit_day, page)
);
each day:

INSERT INTO page_hits
SELECT page,
  date_trunc('day', hit_date)
    as hit_day,
  COUNT(*) as total_hits
FROM hit_counter
WHERE date_trunc('day', hit_date)
  = date_trunc('day',
      now() - INTERVAL '1 day')
GROUP BY page,
  date_trunc('day', hit_date)
ORDER BY total_hits DESC;
SELECT page, total_hits
FROM page_hits
WHERE hit_day BETWEEN
  now() - INTERVAL '7 days'
  AND now();
maintaining matviews
BEST:         update matviews
              at batch load time
GOOD:         update matview according
              to clock/calendar
BAD for DW:   update matviews
              using a trigger
matview tips
●   matviews should be small
    ●   1/10 to ¼ of RAM
●   each matview should support several
    queries
    ●   or one really really important one
●   truncate + insert, don't update
●   index matviews like crazy
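A sketch of the truncate-and-insert refresh, run at batch load time. The matview `top_pages_week` and its index are hypothetical; `page_hits` is the daily summary table defined earlier in the deck.

```sql
-- One-time setup (hypothetical matview and index).
CREATE TABLE top_pages_week (
  page       TEXT PRIMARY KEY,
  total_hits BIGINT
);
CREATE INDEX top_pages_week_hits_idx
  ON top_pages_week (total_hits DESC);

-- Refresh, run as part of each batch load.
BEGIN;

-- TRUNCATE is transactional in PostgreSQL, so a failure rolls back to
-- the old contents; it does block readers until COMMIT.
TRUNCATE top_pages_week;

INSERT INTO top_pages_week (page, total_hits)
SELECT page, sum(total_hits)
FROM page_hits
WHERE hit_day >= current_date - 7
GROUP BY page;

COMMIT;

-- Refresh planner statistics for the rebuilt table.
ANALYZE top_pages_week;
```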
●   Josh Berkus:
    ●   blog:
●   PostgreSQL:
    ●   pgexperts:
●   Upcoming Events
    ●   pgCon: Ottawa: May 17-20
    ●   OpenSourceBridge: Portland: June
         This talk is copyright 2010 Josh Berkus and is licensed under the creative commons attribution
         license. Special thanks for materials to: Elein Mustain (PL/R), Hitoshi Harada and David Fetter
         (windowing functions), Andrew Dunstan (file_FDW)

Contenu connexe


Loadays MySQL
Loadays MySQLLoadays MySQL
Loadays MySQLlefredbe
PostgreSQL Write-Ahead Log (Heikki Linnakangas)
PostgreSQL Write-Ahead Log (Heikki Linnakangas) PostgreSQL Write-Ahead Log (Heikki Linnakangas)
PostgreSQL Write-Ahead Log (Heikki Linnakangas) Ontico
Tuning Linux for Databases.
Tuning Linux for Databases.Tuning Linux for Databases.
Tuning Linux for Databases.Alexey Lesovsky
Tool it Up! - Session #3 - MySQL
Tool it Up! - Session #3 - MySQLTool it Up! - Session #3 - MySQL
Tool it Up! - Session #3 - MySQLtoolitup
MySQL Performance Tuning at COSCUP 2014
MySQL Performance Tuning at COSCUP 2014MySQL Performance Tuning at COSCUP 2014
MySQL Performance Tuning at COSCUP 2014Ryusuke Kajiyama
Autovacuum, explained for engineers, new improved version 2015 Vienna
Autovacuum, explained for engineers, new improved version 2015 ViennaAutovacuum, explained for engineers, new improved version 2015 Vienna
Autovacuum, explained for engineers, new improved version 2015 ViennaPostgreSQL-Consulting
Out of the box replication in postgres 9.4
Out of the box replication in postgres 9.4Out of the box replication in postgres 9.4
Out of the box replication in postgres 9.4Denish Patel
Out of the box replication in postgres 9.4(pg confus)
Out of the box replication in postgres 9.4(pg confus)Out of the box replication in postgres 9.4(pg confus)
Out of the box replication in postgres 9.4(pg confus)Denish Patel
PostgreSQL and Redis - talk at pgcon 2013
PostgreSQL and Redis - talk at pgcon 2013PostgreSQL and Redis - talk at pgcon 2013
PostgreSQL and Redis - talk at pgcon 2013Andrew Dunstan
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...DataStax
Pgbr 2013 postgres on aws
Pgbr 2013   postgres on awsPgbr 2013   postgres on aws
Pgbr 2013 postgres on awsEmanuel Calvo
15 Ways to Kill Your Mysql Application Performance
15 Ways to Kill Your Mysql Application Performance15 Ways to Kill Your Mysql Application Performance
15 Ways to Kill Your Mysql Application Performanceguest9912e5
Toro DB- Open-source, MongoDB-compatible database, built on top of PostgreSQL
Toro DB- Open-source, MongoDB-compatible database,  built on top of PostgreSQLToro DB- Open-source, MongoDB-compatible database,  built on top of PostgreSQL
Toro DB- Open-source, MongoDB-compatible database, built on top of PostgreSQLInMobi Technology
Как PostgreSQL работает с диском
Как PostgreSQL работает с дискомКак PostgreSQL работает с диском
Как PostgreSQL работает с дискомPostgreSQL-Consulting
Development to Production with Sharded MongoDB Clusters
Development to Production with Sharded MongoDB ClustersDevelopment to Production with Sharded MongoDB Clusters
Development to Production with Sharded MongoDB ClustersSeveralnines
Introduction to Vacuum Freezing and XID
Introduction to Vacuum Freezing and XIDIntroduction to Vacuum Freezing and XID
Introduction to Vacuum Freezing and XIDPGConf APAC
Countdown to PostgreSQL v9.5 - Foriegn Tables can be part of Inheritance Tree
Countdown to PostgreSQL v9.5 - Foriegn Tables can be part of Inheritance Tree Countdown to PostgreSQL v9.5 - Foriegn Tables can be part of Inheritance Tree
Countdown to PostgreSQL v9.5 - Foriegn Tables can be part of Inheritance Tree Ashnikbiz
PGConf.ASIA 2019 Bali - Building PostgreSQL as a Service with Kubernetes - Ta...
PGConf.ASIA 2019 Bali - Building PostgreSQL as a Service with Kubernetes - Ta...PGConf.ASIA 2019 Bali - Building PostgreSQL as a Service with Kubernetes - Ta...
PGConf.ASIA 2019 Bali - Building PostgreSQL as a Service with Kubernetes - Ta...Equnix Business Solutions

Tendances (19)

Loadays MySQL
Loadays MySQLLoadays MySQL
Loadays MySQL
PostgreSQL Write-Ahead Log (Heikki Linnakangas)
PostgreSQL Write-Ahead Log (Heikki Linnakangas) PostgreSQL Write-Ahead Log (Heikki Linnakangas)
PostgreSQL Write-Ahead Log (Heikki Linnakangas)
Tuning Linux for Databases.
Tuning Linux for Databases.Tuning Linux for Databases.
Tuning Linux for Databases.
Tool it Up! - Session #3 - MySQL
Tool it Up! - Session #3 - MySQLTool it Up! - Session #3 - MySQL
Tool it Up! - Session #3 - MySQL
Get to know PostgreSQL!
Get to know PostgreSQL!Get to know PostgreSQL!
Get to know PostgreSQL!
MySQL Performance Tuning at COSCUP 2014
MySQL Performance Tuning at COSCUP 2014MySQL Performance Tuning at COSCUP 2014
MySQL Performance Tuning at COSCUP 2014
Autovacuum, explained for engineers, new improved version 2015 Vienna
Autovacuum, explained for engineers, new improved version 2015 ViennaAutovacuum, explained for engineers, new improved version 2015 Vienna
Autovacuum, explained for engineers, new improved version 2015 Vienna
Out of the box replication in postgres 9.4
Out of the box replication in postgres 9.4Out of the box replication in postgres 9.4
Out of the box replication in postgres 9.4
Out of the box replication in postgres 9.4(pg confus)
Out of the box replication in postgres 9.4(pg confus)Out of the box replication in postgres 9.4(pg confus)
Out of the box replication in postgres 9.4(pg confus)
PostgreSQL and Redis - talk at pgcon 2013
PostgreSQL and Redis - talk at pgcon 2013PostgreSQL and Redis - talk at pgcon 2013
PostgreSQL and Redis - talk at pgcon 2013
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...
Pgbr 2013 postgres on aws
Pgbr 2013   postgres on awsPgbr 2013   postgres on aws
Pgbr 2013 postgres on aws
15 Ways to Kill Your Mysql Application Performance
15 Ways to Kill Your Mysql Application Performance15 Ways to Kill Your Mysql Application Performance
15 Ways to Kill Your Mysql Application Performance
Toro DB- Open-source, MongoDB-compatible database, built on top of PostgreSQL
Toro DB- Open-source, MongoDB-compatible database,  built on top of PostgreSQLToro DB- Open-source, MongoDB-compatible database,  built on top of PostgreSQL
Toro DB- Open-source, MongoDB-compatible database, built on top of PostgreSQL
Как PostgreSQL работает с диском
Как PostgreSQL работает с дискомКак PostgreSQL работает с диском
Как PostgreSQL работает с диском
Development to Production with Sharded MongoDB Clusters
Development to Production with Sharded MongoDB ClustersDevelopment to Production with Sharded MongoDB Clusters
Development to Production with Sharded MongoDB Clusters
Introduction to Vacuum Freezing and XID
Introduction to Vacuum Freezing and XIDIntroduction to Vacuum Freezing and XID
Introduction to Vacuum Freezing and XID
Countdown to PostgreSQL v9.5 - Foriegn Tables can be part of Inheritance Tree
Countdown to PostgreSQL v9.5 - Foriegn Tables can be part of Inheritance Tree Countdown to PostgreSQL v9.5 - Foriegn Tables can be part of Inheritance Tree
Countdown to PostgreSQL v9.5 - Foriegn Tables can be part of Inheritance Tree
PGConf.ASIA 2019 Bali - Building PostgreSQL as a Service with Kubernetes - Ta...
PGConf.ASIA 2019 Bali - Building PostgreSQL as a Service with Kubernetes - Ta...PGConf.ASIA 2019 Bali - Building PostgreSQL as a Service with Kubernetes - Ta...
PGConf.ASIA 2019 Bali - Building PostgreSQL as a Service with Kubernetes - Ta...

En vedette

Managing a 14 TB reporting datawarehouse with postgresql
Managing a 14 TB reporting datawarehouse with postgresql Managing a 14 TB reporting datawarehouse with postgresql
Managing a 14 TB reporting datawarehouse with postgresql Soumya Ranjan Subudhi
Pattern driven Enterprise Architecture
Pattern driven Enterprise ArchitecturePattern driven Enterprise Architecture
Pattern driven Enterprise ArchitectureWSO2
Data Warehouse Best Practices
Data Warehouse Best PracticesData Warehouse Best Practices
Data Warehouse Best PracticesEduardo Castro
PostgreSQL Replication in 10 Minutes - SCALE
PostgreSQL Replication in 10  Minutes - SCALEPostgreSQL Replication in 10  Minutes - SCALE
PostgreSQL Replication in 10 Minutes - SCALEPostgreSQL Experts, Inc.
Designing Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQLDesigning Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQLVenu Anuganti
Linux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performanceLinux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performancePostgreSQL-Consulting
Greenplum Database Overview
Greenplum Database Overview Greenplum Database Overview
Greenplum Database Overview EMC
Postgresql database administration volume 1
Postgresql database administration volume 1Postgresql database administration volume 1
Postgresql database administration volume 1Federico Campoli
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesIvo Andreev
One Coin One Brick project
One Coin One Brick projectOne Coin One Brick project
One Coin One Brick projectHuy Nguyen
Open source data_warehousing_overview
Open source data_warehousing_overviewOpen source data_warehousing_overview
Open source data_warehousing_overviewAlex Meadows
cstore_fdw: Columnar Storage for PostgreSQL
cstore_fdw: Columnar Storage for PostgreSQLcstore_fdw: Columnar Storage for PostgreSQL
cstore_fdw: Columnar Storage for PostgreSQLCitus Data
Monitoring pg with_graphite_grafana
Monitoring pg with_graphite_grafanaMonitoring pg with_graphite_grafana
Monitoring pg with_graphite_grafanaJan Wieck
Why PostgreSQL for Analytics Infrastructure (DW)?
Why PostgreSQL for Analytics Infrastructure (DW)?Why PostgreSQL for Analytics Infrastructure (DW)?
Why PostgreSQL for Analytics Infrastructure (DW)?Huy Nguyen
Table partitioning in PostgreSQL + Rails
Table partitioning in PostgreSQL + RailsTable partitioning in PostgreSQL + Rails
Table partitioning in PostgreSQL + RailsAgnieszka Figiel
Lecture 05 - The Data Warehouse and Technology
Lecture 05 - The Data Warehouse and TechnologyLecture 05 - The Data Warehouse and Technology
Lecture 05 - The Data Warehouse and Technologyphanleson
Fun facts about Vietnam
Fun facts about VietnamFun facts about Vietnam
Fun facts about VietnamHuy Nguyen

En vedette (20)

Managing a 14 TB reporting datawarehouse with postgresql
Managing a 14 TB reporting datawarehouse with postgresql Managing a 14 TB reporting datawarehouse with postgresql
Managing a 14 TB reporting datawarehouse with postgresql
5 Steps to PostgreSQL Performance
5 Steps to PostgreSQL Performance5 Steps to PostgreSQL Performance
5 Steps to PostgreSQL Performance
PostgreSQL Replication Tutorial
PostgreSQL Replication TutorialPostgreSQL Replication Tutorial
PostgreSQL Replication Tutorial
Pattern driven Enterprise Architecture
Pattern driven Enterprise ArchitecturePattern driven Enterprise Architecture
Pattern driven Enterprise Architecture
Data Warehouse Best Practices
Data Warehouse Best PracticesData Warehouse Best Practices
Data Warehouse Best Practices
PostgreSQL Replication in 10 Minutes - SCALE
PostgreSQL Replication in 10  Minutes - SCALEPostgreSQL Replication in 10  Minutes - SCALE
PostgreSQL Replication in 10 Minutes - SCALE
Designing Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQLDesigning Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQL
Linux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performanceLinux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performance
Greenplum Database Overview
Greenplum Database Overview Greenplum Database Overview
Greenplum Database Overview
Postgresql database administration volume 1
Postgresql database administration volume 1Postgresql database administration volume 1
Postgresql database administration volume 1
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best Practices
One Coin One Brick project
One Coin One Brick projectOne Coin One Brick project
One Coin One Brick project
Open source data_warehousing_overview
Open source data_warehousing_overviewOpen source data_warehousing_overview
Open source data_warehousing_overview
cstore_fdw: Columnar Storage for PostgreSQL
cstore_fdw: Columnar Storage for PostgreSQLcstore_fdw: Columnar Storage for PostgreSQL
cstore_fdw: Columnar Storage for PostgreSQL
Monitoring pg with_graphite_grafana
Monitoring pg with_graphite_grafanaMonitoring pg with_graphite_grafana
Monitoring pg with_graphite_grafana
Why PostgreSQL for Analytics Infrastructure (DW)?
Why PostgreSQL for Analytics Infrastructure (DW)?Why PostgreSQL for Analytics Infrastructure (DW)?
Why PostgreSQL for Analytics Infrastructure (DW)?
Table partitioning in PostgreSQL + Rails
Table partitioning in PostgreSQL + RailsTable partitioning in PostgreSQL + Rails
Table partitioning in PostgreSQL + Rails
Lecture 05 - The Data Warehouse and Technology
Lecture 05 - The Data Warehouse and TechnologyLecture 05 - The Data Warehouse and Technology
Lecture 05 - The Data Warehouse and Technology
Fun facts about Vietnam
Fun facts about VietnamFun facts about Vietnam
Fun facts about Vietnam
Database Health Check
Database Health CheckDatabase Health Check
Database Health Check

Similaire à Really Big Elephants: PostgreSQL DW

AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive Omid Vahdaty
NoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyNoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyGuillaume Lefranc
PostgreSQL Table Partitioning / Sharding
PostgreSQL Table Partitioning / ShardingPostgreSQL Table Partitioning / Sharding
PostgreSQL Table Partitioning / ShardingAmir Reza Hashemi
Best Practices for Migrating Your Data Warehouse to Amazon Redshift






Really Big Elephants: PostgreSQL DW

  • 1. Really Big Elephants Data Warehousing with PostgreSQL Josh Berkus MySQL User Conference 2011
  • 2. Included/Excluded I will cover: ● advantages of Postgres for DW ● configuration ● tablespaces ● ETL/ELT ● windowing ● partitioning ● materialized views I won't cover: ● hardware selection ● EAV / blobs ● denormalization ● DW query tuning ● external DW tools ● backups & upgrades
  • 3. What is a “data warehouse”?
  • 4. synonyms etc. ● Business Intelligence ● also BI/DW ● Analytics database ● OnLine Analytical Processing (OLAP) ● Data Mining ● Decision Support
  • 5. OLTP vs DW OLTP: ● many single-row writes ● current data ● queries generated by user activity ● < 1s response times ● 0.5 to 5x RAM DW: ● few large batch imports ● years of data ● queries generated by large reports ● queries can run for hours ● 5x to 2000x RAM
  • 6. OLTP vs DW OLTP: ● 100 to 1000 users ● constraints DW: ● 1 to 10 users ● no constraints
  • 7. Why use PostgreSQL for data warehousing?
  • 8. Complex Queries SELECT CASE WHEN ((SUM(inventory.closed_on_hand) + SUM(changes.received) + SUM(changes.adjustments) + SUM(changes.transferred_in-changes.transferred_out)) <> 0) THEN ROUND((CAST(SUM(changes.sold_and_closed + changes.returned_and_closed) AS numeric) * 100) / CAST(SUM(starting.closed_on_hand) + SUM(changes.received) + SUM(changes.adjustments) + SUM(changes.transferred_in-changes.transferred_out) AS numeric), 5) ELSE 0 END AS "Percent_Sold", CASE WHEN (SUM(changes.sold_and_closed) <> 0) THEN ROUND(100*((SUM(changes.closed_markdown_units_sold)*1.0) / SUM(changes.sold_and_closed)), 5) ELSE 0 END AS "Percent_of_Units_Sold_with_Markdown", CASE WHEN (SUM(changes.sold_and_closed * _sku.retail_price) <> 0) THEN ROUND(100*(SUM(changes.closed_markdown_dollars_sold)*1.0) / SUM(changes.sold_and_closed * _sku.retail_price), 5) ELSE 0 END AS "Markdown_Percent", '0' AS "Percent_of_Total_Sales", CASE WHEN SUM((changes.sold_and_closed + changes.returned_and_closed) * _sku.retail_price) IS NULL THEN 0 ELSE SUM((changes.sold_and_closed + changes.returned_and_closed) * _sku.retail_price) END AS "Net_Sales_at_Retail", '0' AS "Percent_of_Ending_Inventory_at_Retail", SUM(inventory.closed_on_hand * _sku.retail_price) AS "Ending_Inventory_at_Retail", "_store"."label" AS "Store", "_department"."label" AS "Department", "_vendor"."name" AS "Vendor_Name" FROM inventory JOIN inventory as starting ON inventory.warehouse_id = starting.warehouse_id AND inventory.sku_id = starting.sku_id LEFT OUTER JOIN ( SELECT warehouse_id, sku_id, sum(received) as received, sum(transferred_in) as transferred_in, sum(transferred_out) as transferred_out, sum(adjustments) as adjustments, sum(sold) as sold FROM movement WHERE movement.movement_date BETWEEN '2010-08-05' AND '2010-08-19' GROUP BY sku_id, warehouse_id ) as changes ON inventory.warehouse_id = changes.warehouse_id AND inventory.sku_id = changes.sku_id JOIN _sku ON = inventory.sku_id JOIN _warehouse ON = inventory.warehouse_id JOIN 
_location_hierarchy AS _store ON = _warehouse.store_id AND _store.type = 'Store' JOIN _product ON = _sku.product_id JOIN _merchandise_hierarchy AS _department
  • 9. Complex Queries ● JOIN optimization ● 5 different JOIN types ● approximate planning for 20+ table joins ● subqueries in any clause ● plus nested subqueries ● windowing queries ● recursive queries
  • 10. Big Data Features ● big tables: partitioning ● big databases: tablespaces ● big backups: PITR ● big updates: binary replication ● big queries: resource control
  • 11. Extensibility ● add data analysis functionality from external libraries inside the database ● financial analysis ● genetic sequencing ● approximate queries ● create your own: ● data types ● functions ● aggregates ● operators
  • 12. Community “I'm running a partitioning scheme using 256 tables with a maximum of 16 million rows (namely IPv4-addresses) and a current total of about 2.5 billion rows, there are no deletes though, but lots of updates.” “I use PostgreSQL basically as a data warehouse to store all the genetic data that our lab generates … With this configuration I figure I'll have ~3TB for my main data tables and 1TB for indexes. ” ● lots of experience with large databases ● blogs, tools, online help
  • 13. Sweet Spot [chart comparing MySQL, PostgreSQL and DW Database; axis scale 0 to 30]
  • 14. DW Databases ● Vertica ● Netezza ● Greenplum ● HadoopDB ● Aster Data ● LucidDB ● Infobright ● MonetDB ● Teradata ● SciDB ● Hadoop/HBase ● Paraccel
  • 15. DW Databases ● Vertica ● Netezza ● Greenplum ● HadoopDB ● Aster Data ● LucidDB ● Infobright ● MonetDB ● Teradata ● SciDB ● Hadoop/HBase ● Paraccel
  • 16. How do I configure PostgreSQL for data warehousing?
  • 17. General Setup ● Latest version of PostgreSQL ● System with lots of drives ● 6 to 48 drives – or 2 to 12 SSDs ● High-throughput RAID ● Write ahead log (WAL) on separate disk(s) ● 10 to 50 GB space
  • 19. Settings ● few connections: max_connections = 10 to 40 ● raise those memory limits! ● shared_buffers = 1/8 to ¼ of RAM ● work_mem = 128MB to 1GB ● maintenance_work_mem = 512MB to 1GB ● temp_buffers = 128MB to 1GB ● effective_cache_size = ¾ of RAM ● wal_buffers = 16MB
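As a rough illustration of the ranges on the settings slide, the sketch below applies them to a hypothetical dedicated server with 64 GB of RAM. The 64 GB figure and the value picked inside each range are assumptions, not recommendations from the talk. `ALTER SYSTEM` only exists in PostgreSQL 9.4 and later; on the 9.0/9.1 releases current when the talk was given, the same values would go into postgresql.conf by hand.

```sql
-- Hypothetical 64 GB data warehouse server; values picked from the slide's ranges.
ALTER SYSTEM SET max_connections = 20;            -- few connections (10 to 40)
ALTER SYSTEM SET shared_buffers = '16GB';         -- 1/4 of RAM
ALTER SYSTEM SET work_mem = '512MB';              -- can be used several times by one query
ALTER SYSTEM SET maintenance_work_mem = '1GB';
ALTER SYSTEM SET temp_buffers = '256MB';
ALTER SYSTEM SET effective_cache_size = '48GB';   -- 3/4 of RAM
ALTER SYSTEM SET wal_buffers = '16MB';

-- max_connections, shared_buffers and wal_buffers need a server restart;
-- the rest are picked up after a reload.
SELECT pg_reload_conf();
```

Large `work_mem` values are only reasonable because the connection count is kept low; each sort or hash step in a query may claim that much memory.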
  • 20. No autovacuum autovacuum = off vacuum_cost_delay = 0 ● do your VACUUMs and ANALYZEs as part of the batch load process ● usually several of them ● also maintain tables by partitioning
  • 22. logical data extents ● lets you put some of your data on specific devices / disks CREATE TABLESPACE history_log LOCATION '/mnt/san2/history_log'; ALTER TABLE history_log SET TABLESPACE history_log;
  • 23. tablespace reasons ● parallelize access ● your largest “fact table” on one tablespace ● its indexes on another – not as useful if you have a good SAN ● temp tablespace for temp tables ● move key join tables to SSD ● migrate to new storage one table at a time
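The reasons on the tablespace slide can be sketched as follows. The tablespace names, mount points and index name are hypothetical, and `sales` is borrowed from the partitioning slides later in the deck.

```sql
-- Hypothetical mount points; each directory must already exist, be empty,
-- and be owned by the PostgreSQL operating-system user.
CREATE TABLESPACE fact_data  LOCATION '/mnt/array1/pg_fact';
CREATE TABLESPACE fact_index LOCATION '/mnt/array2/pg_index';
CREATE TABLESPACE scratch    LOCATION '/mnt/ssd1/pg_temp';

-- Largest fact table on one tablespace, its index on another.
ALTER TABLE sales SET TABLESPACE fact_data;
CREATE INDEX sales_sell_date_idx ON sales (sell_date) TABLESPACE fact_index;

-- Send temporary tables and sort spill files to the scratch tablespace
-- for the current session.
SET temp_tablespaces = 'scratch';
```

`ALTER TABLE ... SET TABLESPACE` rewrites the table under an exclusive lock, which is why the slide suggests migrating to new storage one table at a time.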
  • 24. What is ETL and how do I do it?
  • 25. Extract, Transform, Load ● how you turn external raw data into normalized database data ● Apache logs → web analytics DB ● CSV POS files → financial reporting DB ● OLTP server → 10-year data warehouse ● also called ELT when the transformation is done inside the database ● PostgreSQL is particularly good for ELT
  • 26. L: INSERT ● batch INSERTs into 100's or 1000's per transaction ● row-at-a-time is very slow ● create and load import tables in one transaction ● add indexes and constraints after load ● insert several streams in parallel ● but not more than CPU cores
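A minimal sketch of the INSERT advice above, assuming a hypothetical staging table named `sales_import` and made-up sample rows:

```sql
BEGIN;

-- Hypothetical staging table, created in the same transaction as the load.
CREATE TABLE sales_import (
    sell_date   TIMESTAMPTZ,
    seller_id   INT,
    item_id     INT,
    sale_amount NUMERIC
);

-- One multi-row INSERT instead of one statement per row; a real batch
-- would carry hundreds or thousands of rows.
INSERT INTO sales_import (sell_date, seller_id, item_id, sale_amount) VALUES
    ('2011-06-05 09:15:00+00', 17, 4021, 19.99),
    ('2011-06-05 09:16:30+00', 17, 4388,  4.50),
    ('2011-06-05 09:17:02+00', 23, 4021, 19.99);

COMMIT;

-- Indexes and constraints go on after the data is in.
CREATE INDEX sales_import_seller_idx ON sales_import (seller_id);
ALTER TABLE sales_import ALTER COLUMN sell_date SET NOT NULL;
```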
  • 27. L: COPY ● Powerful, efficient delimited file loader ● almost bug-free - we use it for backup ● 3-5X faster than inserts ● works with most delimited files ● Not fault-tolerant ● also have to know structure in advance ● try pgloader for better COPY
  • 28. L: COPY COPY weblog_new FROM '/mnt/transfers/weblogs/weblog-20110605.csv' with csv; COPY traffic_snapshot FROM 'traffic_20110605192241' delimiter '|' null as 'N'; \copy weblog_summary_june TO 'Desktop/weblog-june2011.csv' with csv header
  • 29. L: in 9.1: FDW CREATE FOREIGN TABLE raw_hits ( hit_time TIMESTAMP, page TEXT ) SERVER file_fdw OPTIONS (format 'csv', delimiter ';', filename '/var/log/hits.log');
  • 30. L: in 9.1: FDW CREATE TABLE hits_2011041617 AS SELECT page, count(*) FROM raw_hits WHERE hit_time > '2011-04-16 16:00:00' AND hit_time <= '2011-04-16 17:00:00' GROUP BY page;
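The `CREATE FOREIGN TABLE` statement on the first FDW slide refers to a server named `file_fdw`, which has to exist first. A sketch of that one-time setup, run as a superuser:

```sql
-- file_fdw ships in contrib; CREATE EXTENSION is available from 9.1.
CREATE EXTENSION file_fdw;

-- The server name matches the SERVER clause used for raw_hits.
CREATE SERVER file_fdw FOREIGN DATA WRAPPER file_fdw;
```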
  • 31. T: temporary tables CREATE TEMPORARY TABLE sales_records_june_rollup ON COMMIT DROP AS SELECT seller_id, location, sell_date, sum(sale_amount), array_agg(item_id) FROM raw_sales WHERE sell_date BETWEEN '2011-06-01' AND '2011-06-30 23:59:59.999' GROUP BY seller_id, location, sell_date;
  • 32. in 9.1: unlogged tables ● like MyISAM without the risk CREATE UNLOGGED TABLE cleaned_log_import AS SELECT hit_time, page FROM raw_hits, hit_watermark WHERE hit_time > last_watermark AND is_valid(page);
  • 33. T: stored procedures ● multiple languages ● SQL ● PL/pgSQL ● PL/Perl ● PL/Python ● PL/PHP ● PL/R ● PL/Java ● allows you to use external data processing libraries in the database ● custom aggregates, operators, more
  • 34. CREATE OR REPLACE FUNCTION normalize_query ( queryin text ) RETURNS TEXT LANGUAGE PLPERL STABLE STRICT AS $f$
      # this function "normalizes" queries by stripping out constants.
      # some regexes by Guillaume Smet under The PostgreSQL License.
      local $_ = $_[0];
      #first cleanup the whitespace
      s/\s+/ /g; s/\s,/,/g; s/,(\S)/, $1/g; s/^\s//g; s/\s$//g;
      #remove any double quotes and quoted text
      s/\\'//g; s/'[^']*'/''/g; s/''('')+/''/g;
      #remove TRUE and FALSE
      s/(\W)TRUE(\W)/$1BOOL$2/gi; s/(\W)FALSE(\W)/$1BOOL$2/gi;
      #remove any bare numbers or hex numbers
      s/([^a-zA-Z_\$-])-?([0-9]+)/${1}0/g; s/([^a-z_\$-])0x[0-9a-f]{1,10}/${1}0x/ig;
      #normalize any IN statements
      s/(IN\s*)\(['0x,\s]*\)/${1}(...)/ig;
      #return the normalized query
      return $_;
      $f$;
  • 35. CREATE OR REPLACE FUNCTION f_graph2() RETURNS text AS ' sql <- paste("SELECT id as x,hit as y FROM mytemp LIMIT 30",sep=""); str <- c(pg.spi.exec(sql)); mymain <- "Graph 2"; mysub <- paste("The worst offender is: ",str[1,3]," with ",str[1,2]," hits",sep=""); myxlab <- "Top 30 IP Addresses"; myylab <- "Number of Hits"; pdf(''/tmp/graph2.pdf''); plot(str,type="b",main=mymain,sub=mysub,xlab=myxlab,ylab =myylab,lwd=3); mtext("Probes by intrusive IP Addresses",side=3);; print(''DONE''); ' LANGUAGE plr;
  • 36.
  • 37. ELT Tips ● bulk insert into a new table instead of updating/deleting an existing table ● update all columns in one operation instead of one at a time ● use views and custom functions to simplify your queries ● inserting into your long-term tables should be the very last step – no updates after!
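The ELT tips above can be sketched as one batch. The file path, the `sales_raw` and `sales_clean` scratch tables and the cleanup rules are hypothetical; `sales` is the long-term table from the partitioning slides.

```sql
BEGIN;

-- 1. Land the raw file in a scratch table (hypothetical path and layout).
CREATE TEMPORARY TABLE sales_raw (
    sell_date   TEXT,
    seller_id   TEXT,
    item_id     TEXT,
    sale_amount TEXT,
    narrative   TEXT
) ON COMMIT DROP;

COPY sales_raw FROM '/mnt/transfers/sales/sales-20110605.csv' WITH CSV;

-- 2. Transform by building a new table, all columns in one pass,
--    rather than running a series of UPDATEs against sales_raw.
CREATE TEMPORARY TABLE sales_clean ON COMMIT DROP AS
SELECT sell_date::timestamptz      AS sell_date,
       seller_id::int              AS seller_id,
       item_id::int                AS item_id,
       sale_amount::numeric        AS sale_amount,
       NULLIF(trim(narrative), '') AS narrative
FROM sales_raw
WHERE sale_amount IS NOT NULL;

-- 3. Inserting into the long-term table is the very last step.
INSERT INTO sales (sell_date, seller_id, item_id, sale_amount, narrative)
SELECT sell_date, seller_id, item_id, sale_amount, narrative
FROM sales_clean;

COMMIT;

-- With autovacuum off, refresh statistics as part of the batch.
ANALYZE sales;
```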
  • 41. CREATE TABLE events ( event_id INT, event_type TEXT, start TIMESTAMPTZ, duration INTERVAL, event_desc TEXT );
  • 42. SELECT MAX(concurrent) FROM ( SELECT SUM(tally) OVER (ORDER BY start) AS concurrent FROM ( SELECT start, 1::INT as tally FROM events UNION ALL SELECT (start + duration), -1 FROM events ) AS event_vert) AS ec;
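To see why the running sum yields concurrency, it may help to run the inner part of that query on a few rows. The sketch below uses three made-up events (09:00 to 11:00, 10:00 to 12:00, 11:30 to 12:30) supplied through a CTE instead of the real `events` table; the `at_time` column name is an assumption added for readability.

```sql
-- Three hypothetical events: 09:00-11:00, 10:00-12:00 and 11:30-12:30.
WITH events (start, duration) AS (
    VALUES (TIMESTAMPTZ '2011-04-12 09:00+00', INTERVAL '2 hours'),
           (TIMESTAMPTZ '2011-04-12 10:00+00', INTERVAL '2 hours'),
           (TIMESTAMPTZ '2011-04-12 11:30+00', INTERVAL '1 hour')
),
event_vert AS (
    -- +1 when an event starts, -1 when it ends
    SELECT start AS at_time, 1 AS tally FROM events
    UNION ALL
    SELECT start + duration, -1 FROM events
)
SELECT at_time,
       tally,
       SUM(tally) OVER (ORDER BY at_time) AS concurrent
FROM event_vert
ORDER BY at_time;
```

Ordered by time, the running total goes 1, 2, 1, 2, 1, 0, so taking `MAX(concurrent)` over these rows gives a peak of 2 simultaneous events.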
  • 43. UPDATE partition_name SET drop_month = dropit FROM ( SELECT round_id, CASE WHEN ( ( row_number() over (partition by team_id order by team_id, total_points) ) <= ( drop_lowest ) ) THEN 0 ELSE 1 END as dropit FROM ( SELECT team.team_id, round.round_id, month_points as total_points, row_number() OVER ( partition by team.team_id, kal.positions order by team.team_id, kal.positions, month_points desc ) as ordinal, at_least, numdrop as drop_lowest FROM partition_name as rdrop JOIN round USING (round_id) JOIN team USING (team_id) JOIN pick ON round.round_id = pick.round_id and pick.pick_period @> this_period LEFT OUTER JOIN keep_at_least kal ON rdrop.pool_id = kal.pool_id and pick.position_id = any ( kal.positions ) WHERE rdrop.pool_id = this_pool AND team.team_id = this_team ) as ranking WHERE ordinal > at_least or at_least is null ) as droplow WHERE droplow.round_id = partition_name .round_id AND partition_name .pool_id = this_pool AND dropit = 0;
  • 44. SELECT round_id, CASE WHEN ( ( row_number() OVER (partition by team_id order by team_id, total_points) ) <= ( drop_lowest ) ) THEN 0 ELSE 1 END as dropit FROM ( SELECT team.team_id, round.round_id, month_points as total_points, row_number() OVER ( partition by team.team_id, kal.positions order by team.team_id, kal.positions, month_points desc ) as ordinal
  • 45. stream processing SQL ● replace multiple queries with a single query ● avoid scanning large tables multiple times ● replace pages of application code ● and MB of data transmission ● SQL alternative to map/reduce ● (for some data mining tasks)
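As one small instance of replacing several queries with a single pass, the sketch below computes per-seller totals, each seller's share of the grand total and a ranking in one scan. It assumes the `sales` table from the partitioning slides; the date range and output column names are made up.

```sql
-- Per-seller totals, share of the grand total, and a ranking,
-- all from one scan of sales instead of three separate queries.
SELECT seller_id,
       SUM(sale_amount) AS seller_total,
       SUM(sale_amount)
         / NULLIF(SUM(SUM(sale_amount)) OVER (), 0) AS share_of_total,
       RANK() OVER (ORDER BY SUM(sale_amount) DESC) AS seller_rank
FROM sales
WHERE sell_date >= '2011-06-01' AND sell_date < '2011-07-01'
GROUP BY seller_id
ORDER BY seller_rank;
```

The window functions run over the grouped rows, so the grand total and the ranking come from the same aggregation rather than from extra scans or application code.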
  • 46. How do I partition my tables?
  • 47. Postgres partitioning ● based on table inheritance and constraint exclusion ● partitions are also full tables ● explicit constraints define the range of the partition ● triggers or RULEs handle insert/update
  • 48. CREATE TABLE sales ( sell_date TIMESTAMPTZ NOT NULL, seller_id INT NOT NULL, item_id INT NOT NULL, sale_amount NUMERIC NOT NULL, narrative TEXT );
  • 49. CREATE TABLE sales_2011_06 ( CONSTRAINT partition_date_range CHECK (sell_date >= '2011-06-01' AND sell_date < '2011-07-01' ) ) INHERITS ( sales );
  • 50. CREATE FUNCTION sales_insert () RETURNS trigger LANGUAGE plpgsql AS $f$ BEGIN CASE WHEN NEW.sell_date < '2011-06-01' THEN INSERT INTO sales_2011_05 VALUES (NEW.*); WHEN NEW.sell_date < '2011-07-01' THEN INSERT INTO sales_2011_06 VALUES (NEW.*); WHEN NEW.sell_date >= '2011-07-01' THEN INSERT INTO sales_2011_07 VALUES (NEW.*); ELSE INSERT INTO sales_overflow VALUES (NEW.*); END CASE; RETURN NULL; END;$f$; CREATE TRIGGER sales_insert BEFORE INSERT ON sales FOR EACH ROW EXECUTE PROCEDURE sales_insert();
  • 51. Postgres partitioning Good for: ● “rolling off” data ● DB maintenance ● queries which use the partition key ● under 300 partitions ● insert performance Bad for: ● administration ● queries which do not use the partition key ● JOINs ● over 300 partitions ● update performance
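"Rolling off" data and partition-key queries can be sketched with the monthly `sales` partitions from the preceding slides. The date range in the `EXPLAIN` is made up.

```sql
-- Detach the oldest month from the parent; queries on sales stop seeing it.
ALTER TABLE sales_2011_05 NO INHERIT sales;

-- Archive it first if needed, then drop it. No long DELETE, no vacuum debt.
DROP TABLE sales_2011_05;

-- Check that a query on the partition key only touches matching partitions.
-- 'partition' has been the default setting since 8.4.
SET constraint_exclusion = partition;
EXPLAIN
SELECT SUM(sale_amount)
FROM sales
WHERE sell_date >= '2011-06-10' AND sell_date < '2011-06-17';
```

The plan should list only the parent and `sales_2011_06`. After a partition is dropped, the insert trigger function also has to be updated so that it no longer routes rows to the missing table.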
  • 52. you need a data expiration policy ● you can't plan your DW otherwise ● sets your storage requirements ● lets you project how queries will run when database is “full” ● will take a lot of meetings ● people don't like talking about deleting data
  • 53. you need a data expiration policy ● raw import data: 1 month ● detail-level transactions: 3 years ● detail-level web logs: 1 year ● rollups: 10 years
  • 55. query results as table ● calculate once, read many times ● complex/expensive queries ● frequently referenced ● not necessarily a whole query ● often part of a query ● manually maintained in PostgreSQL ● automagic support not complete yet
  • 56. SELECT page, COUNT(*) as total_hits FROM hit_counter WHERE date_trunc('day', hit_date) BETWEEN now() - INTERVAL '7 days' AND now() GROUP BY page ORDER BY total_hits DESC LIMIT 10;
  • 57. CREATE TABLE page_hits ( page TEXT, hit_day DATE, total_hits INT, CONSTRAINT page_hits_pk PRIMARY KEY(hit_day, page) );
  • 58. each day: INSERT INTO page_hits SELECT page, date_trunc('day', hit_date) as hit_day, COUNT(*) as total_hits FROM hit_counter WHERE date_trunc('day', hit_date) = date_trunc('day', now() - INTERVAL '1 day') GROUP BY page, date_trunc('day', hit_date) ORDER BY total_hits DESC;
  • 59. SELECT page, total_hits FROM page_hits WHERE hit_day BETWEEN now() - INTERVAL '7 days' AND now();
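Because `page_hits` holds one row per page per day, a seven-day "top pages" report has to add the daily rows back together. A sketch of that report against the rollup table, where the `week_hits` alias is an assumption:

```sql
-- Top 10 pages over the last seven days, read from the daily rollup
-- instead of rescanning hit_counter.
SELECT page, SUM(total_hits) AS week_hits
FROM page_hits
WHERE hit_day >= current_date - 7
GROUP BY page
ORDER BY week_hits DESC
LIMIT 10;
```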
  • 60. maintaining matviews ● BEST: update matviews at batch load time ● GOOD: update matview according to clock/calendar ● BAD for DW: update matviews using a trigger
  • 61. matview tips ● matviews should be small ● 1/10 to ¼ of RAM ● each matview should support several queries ● or one really really important one ● truncate + insert, don't update ● index matviews like crazy
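The "truncate + insert, don't update" tip could look like this for the `page_hits` table from the earlier slides. This is a sketch of a full rebuild at batch load time; the index name is hypothetical.

```sql
BEGIN;

-- Rebuild the whole rollup rather than updating rows in place.
TRUNCATE page_hits;

INSERT INTO page_hits (page, hit_day, total_hits)
SELECT page,
       date_trunc('day', hit_date)::date AS hit_day,
       COUNT(*)                          AS total_hits
FROM hit_counter
GROUP BY page, date_trunc('day', hit_date)::date;

COMMIT;

-- One-time extra index for lookups by page, then fresh statistics
-- after every rebuild.
CREATE INDEX page_hits_page_idx ON page_hits (page, hit_day);
ANALYZE page_hits;
```

`TRUNCATE` holds an exclusive lock on `page_hits` until the transaction commits, so readers wait during the rebuild. That is usually acceptable inside a batch window and is one more reason to keep matviews small.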
  • 62. Contact ● Josh Berkus: ● blog: ● PostgreSQL: ● pgexperts: ● Upcoming Events ● pgCon: Ottawa: May 17-20 ● OpenSourceBridge: Portland: June This talk is copyright 2010 Josh Berkus and is licensed under the creative commons attribution license. Special thanks for materials to: Elein Mustain (PL/R), Hitoshi Harada and David Fetter (windowing functions), Andrew Dunstan (file_FDW)