Data Warehousing 101
Everything you never wanted to know about big databases
but were forced to find out anyway
Josh Berkus
Open Source Bridge 2011
contents
covering
● concepts of DW
● some DW
techniques
● databases
not covering
● hardware
● analytics/reporting
tools
BIG
DATA
1970
What is a
“data warehouse”?
Big Data?
OLTP vs DW
OLTP:
● many single-row writes
● current data
● queries generated by user activity
● < 1s response times
● 1000's of users
DW:
● few large batch imports
● years of data
● queries generated by large reports
● queries can run for minutes/hours
● 10's of users
OLTP vs DW
OLTP: big data for many concurrent requests to small amounts of data each
DW: big data for low concurrency requests to very large amounts of data each
synonyms
&
subclasses
archiving
WORN data: “write once, read never”
● grows indefinitely
● usually a result of regulatory compliance
● main concern: storage efficiency
data mining
the database where you don't know what's in
there, but you want to find out
● lots of data (TB to PB)
● mostly “semi-structured”
● data produced as a side effect of other
business processes
● needs CPU-intensive processing
BI: Business Intelligence
DSS: Decision Support
OLAP: Online Analytical
Processing
Analytics
BI/DSS/OLAP/Analytics
databases which support visualization of
large amounts of data
● data is fairly well understood
● most data can be reduced to categories,
geography, and taxonomy
● primarily about indexing
What is a
“dimension”?
dimensions vs. facts
Fact
Table
customers
/ accounts
category
subcategory
sub-subcategory
dimension examples
● location/region/country/quadrant
● product categorization
● URL
● transaction type
● account hierarchy
● IP address
● OS/version/build
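A minimal sketch of how facts and dimensions relate, assuming a hypothetical sales fact table and a product-category dimension (all names and numbers here are invented for illustration). Facts hold the measurements; the dimension holds the hierarchy you group by.

```javascript
// Hypothetical dimension table: product categorization as a hierarchy.
const categoryDim = new Map([
  [1, { category: "hardware", subcategory: "storage" }],
  [2, { category: "hardware", subcategory: "network" }],
  [3, { category: "software", subcategory: "database" }],
]);

// Hypothetical fact table: one row per sale, pointing at a dimension key.
const salesFacts = [
  { categoryId: 1, amount: 100 },
  { categoryId: 2, amount: 50 },
  { categoryId: 3, amount: 200 },
  { categoryId: 1, amount: 25 },
];

// Roll the facts up to any level of the dimension hierarchy.
function rollUp(facts, dim, level) {
  const totals = {};
  for (const fact of facts) {
    const key = dim.get(fact.categoryId)[level];
    totals[key] = (totals[key] || 0) + fact.amount;
  }
  return totals;
}

const byCategory = rollUp(salesFacts, categoryDim, "category");
const bySubcategory = rollUp(salesFacts, categoryDim, "subcategory");
```

The same facts answer both questions; only the level of the dimension changes.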
dimension synonyms
● facet
● taxonomy
● secondary index
● view
What is ETL?
Extract, Transform, Load
● how you turn external raw data into useful
database data
● Apache logs → web analytics DB
● CSV POS files → financial reporting DB
● OLTP server → 10-year data warehouse
● also called ELT when the transformation is
done inside the database
Purpose of ETL/ELT
getting data into the data warehouse
● clean up garbage data
● split out attributes
● “normalize” dimensional data
● deduplication
● calculate materialized views / indexes
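As a rough illustration of the "Apache logs → web analytics DB" case, the sketch below cleans up garbage lines, splits out attributes, and deduplicates. The sample lines, the regular expression, and the field names are assumptions made for this example, not part of any particular ETL tool.

```javascript
// Hypothetical raw input: Apache-style access log lines, including one
// garbage line and one exact duplicate.
const rawLines = [
  '10.0.0.1 - - [21/Jun/2011:10:00:00 -0700] "GET /index.html HTTP/1.1" 200 512',
  '10.0.0.1 - - [21/Jun/2011:10:00:00 -0700] "GET /index.html HTTP/1.1" 200 512',
  "### corrupted line ###",
  '10.0.0.2 - - [21/Jun/2011:10:00:05 -0700] "GET /about.html HTTP/1.1" 404 128',
];

// Assumed pattern for the common log format; real logs may need more cases.
const LOG_PATTERN =
  /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+)$/;

function transform(lines) {
  const seen = new Set();
  const rows = [];
  for (const line of lines) {
    const m = LOG_PATTERN.exec(line);
    if (!m) continue;             // clean up garbage data
    if (seen.has(line)) continue; // deduplication
    seen.add(line);
    rows.push({                   // split out attributes
      ip: m[1],
      time: m[2],
      method: m[3],
      path: m[4],
      status: Number(m[5]),
      bytes: Number(m[6]),
    });
  }
  return rows;
}

const cleanRows = transform(rawLines);
```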
ETL Tools
● K.E.T.T.L.E.
● Ad-hoc scripting
ELT Tips
think volume
● bulk processing or parallel processing
● no row-at-a-time, document-at-a-time
● insert into permanent storage should be
the last step
● no updates
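One way to read these tips, sketched with an in-memory stand-in for the warehouse: work in batches, do all transformation on a staging copy, and make the bulk insert into permanent storage the final step. The batch size, the `warehouse` object, and the cents conversion are illustrative assumptions.

```javascript
// Split rows into fixed-size batches (the batch size is an arbitrary choice).
function toBatches(rows, batchSize) {
  const batches = [];
  for (let i = 0; i < rows.length; i += batchSize) {
    batches.push(rows.slice(i, i + batchSize));
  }
  return batches;
}

// Stand-in for permanent storage; counts how many bulk writes happened.
const warehouse = { rows: [], bulkWrites: 0 };

function bulkInsert(target, batch) {
  target.rows.push(...batch);
  target.bulkWrites += 1;
}

function load(rawRows, batchSize) {
  for (const batch of toBatches(rawRows, batchSize)) {
    // All transformation happens on the in-memory staging copy...
    const staged = batch
      .filter((r) => r.amount !== null)
      .map((r) => ({ ...r, amount: Math.round(r.amount * 100) }));
    // ...and the insert into permanent storage is the last step.
    bulkInsert(warehouse, staged);
  }
}

// Ten hypothetical raw rows, one with a missing amount.
const raw = [];
for (let i = 0; i < 10; i++) {
  raw.push({ id: i, amount: i === 3 ? null : i * 1.5 });
}
load(raw, 4);
```

Nothing already loaded is ever updated; bad rows are dropped before they reach permanent storage.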
Queues not Extract
What kind of
database should I
use for DW?
5 Types
1. Standard Relational
2. MPP
3. Column Store
4. Map/Reduce
5. Enterprise Search
standard relational
the all-purpose solution for not-that-big data
● adequate for all tasks
● but not excellent at any of them
● easy to use
● low resource requirements
● well-supported by all software
● familiar
● not suitable for really big data
MySQL
PostgreSQL
[Chart: DW Database Sweet Spots]
What's MPP?
Massively Parallel Processing
appliance software
MPP
cpu-intensive data warehousing
● data mining, some analytics
● supporting complex query logic
● moderately big data (1-200TB)
● drawbacks: proprietary, expensive
● now hybridizes with other types
What's a
column store?
column store
inversion of a row store:
indexes become data
data becomes indexes
column stores
for aggregations and transformations of
highly structured data
● good for BI, analytics, some archiving
● moderately big data (0.5-100TB)
● bad for data mining
● slow to add new data / purge data
● usually support compression
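The row-versus-column inversion can be sketched in a few lines. This is a toy model under invented data, not how any particular column store lays out its files: each column becomes its own array, an aggregate reads only the array it needs, and repetitive columns compress well with run-length encoding.

```javascript
// Row store: one object per row.
const rowStore = [
  { region: "west", product: "a", amount: 10 },
  { region: "west", product: "b", amount: 20 },
  { region: "east", product: "a", amount: 5 },
  { region: "east", product: "a", amount: 15 },
];

// Column store: one array per column, all in the same row order.
function toColumns(rows) {
  const columns = {};
  for (const name of Object.keys(rows[0])) {
    columns[name] = rows.map((row) => row[name]);
  }
  return columns;
}

// Run-length encoding: repeated neighbours collapse to [value, count] pairs.
function runLengthEncode(values) {
  const runs = [];
  for (const value of values) {
    const last = runs[runs.length - 1];
    if (last && last[0] === value) {
      last[1] += 1;
    } else {
      runs.push([value, 1]);
    }
  }
  return runs;
}

const columnStore = toColumns(rowStore);
// An aggregate touches only the one column it needs.
const totalAmount = columnStore.amount.reduce((sum, v) => sum + v, 0);
const encodedRegion = runLengthEncode(columnStore.region);
```

The same layout hints at why adding or purging rows is slow: every column array has to be touched for each row.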
What's
map/reduce?
map/reduce
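// map: emit one [parent, link index] key per link in the document
function(doc) {
  for (var i in doc.links) {
    emit([doc.parent, i], null);
  }
}
// reduce: nothing to aggregate, the keys are the result
function(keys, values) {
  return null;
}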
map/reduce
// Map
function (doc) {
  emit(doc.val, doc.val);
}
// Reduce
function (keys, values, rereduce) {
  // This computes the (population) standard deviation of the mapped results
  var stdDeviation = 0.0;
  var count = 0;
  var total = 0.0;
  var sqrTotal = 0.0;
  var i;
  if (!rereduce) {
    // This is the reduce phase: we are reducing over the values emitted
    // by the map functions.
    for (i = 0; i < values.length; i++) {
      total = total + values[i];
      sqrTotal = sqrTotal + (values[i] * values[i]);
    }
    count = values.length;
  } else {
    // This is the rereduce phase: we are re-reducing previously
    // reduced values.
    for (i = 0; i < values.length; i++) {
      count = count + values[i].count;
      total = total + values[i].total;
      sqrTotal = sqrTotal + values[i].sqrTotal;
    }
  }
  var variance = (sqrTotal - ((total * total) / count)) / count;
  stdDeviation = Math.sqrt(variance);
  // The reduce result. It contains enough information to be rereduced
  // with other reduce results.
  return {"stdDeviation": stdDeviation, "count": count,
          "total": total, "sqrTotal": sqrTotal};
}
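To see why the reduce result carries count, total, and sqrTotal, here is a small standalone driver that imitates the reduce and rereduce phases over two partitions. It is a sketch only: the `runJob` driver, the `emit` callback passed as a parameter, and the two-argument `reduce` are simplifications assumed for this example, not the CouchDB API.

```javascript
// Map: emit one value per document (mirrors emit(doc.val, doc.val)).
function map(doc, emit) {
  emit(doc.val, doc.val);
}

// Reduce: keep count/total/sqrTotal so partial results can be combined.
function reduce(values, rereduce) {
  const out = { count: 0, total: 0, sqrTotal: 0 };
  for (const v of values) {
    if (rereduce) {
      out.count += v.count;
      out.total += v.total;
      out.sqrTotal += v.sqrTotal;
    } else {
      out.count += 1;
      out.total += v;
      out.sqrTotal += v * v;
    }
  }
  const variance =
    (out.sqrTotal - (out.total * out.total) / out.count) / out.count;
  out.stdDeviation = Math.sqrt(variance);
  return out;
}

// Hypothetical driver: each inner array stands for the documents on one node.
function runJob(partitions) {
  const partials = partitions.map((docs) => {
    const emitted = [];
    docs.forEach((doc) => map(doc, (key, value) => emitted.push(value)));
    return reduce(emitted, false); // reduce phase, per node
  });
  return reduce(partials, true);   // rereduce phase, combining nodes
}

const result = runJob([
  [{ val: 2 }, { val: 4 }, { val: 4 }, { val: 4 }],
  [{ val: 5 }, { val: 5 }, { val: 7 }, { val: 9 }],
]);
```

Each node's standard deviation alone could not be combined into the overall figure; the running sums can.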
map/reduce vs. MPP
map/reduce:
● open source
● petabytes
● write routines by hand
● inefficient
● generic
● cheap HW / cloud
● DIY tools
MPP:
● proprietary
● terabytes
● advanced query support
● efficient
● specific
● needs good HW
● integrated tools
What's enterprise
search?
enterprise search
ElasticSearch
enterprise search
when you need to do DW with a huge pile of
partly processed “documents”
● does: light data mining, light BI/analytics
● best “full text” and keyword search
● supports “approximate results”
● lots of special features for web data
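Keyword search over a pile of documents usually rests on an inverted index, which maps each term to the documents containing it. The sketch below is a toy version under invented documents; real search engines add ranking, stemming, and approximate matching that are not shown here.

```javascript
// Lowercase and split on anything that is not a letter or digit.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Inverted index: term -> list of ids of documents containing that term.
function buildIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    for (const term of new Set(tokenize(doc.text))) {
      if (!index.has(term)) index.set(term, []);
      index.get(term).push(doc.id);
    }
  }
  return index;
}

// Return ids of documents that contain every term in the query.
function search(index, query) {
  const lists = tokenize(query).map((term) => index.get(term) || []);
  if (lists.length === 0) return [];
  return lists.reduce((acc, ids) => acc.filter((id) => ids.includes(id)));
}

// Hypothetical documents.
const docs = [
  { id: 1, text: "PostgreSQL data warehouse tuning" },
  { id: 2, text: "Column store data compression" },
  { id: 3, text: "Warehouse forklift safety" },
];
const index = buildIndex(docs);
const hits = search(index, "data warehouse");
```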
E.S. vs. C-Store
Enterprise search:
● batch load
● semi-structured data
● uncompressed
● star schema
● sharding
● approximate results
Column store:
● batch load
● fully normalized data
● compressed
● snowflake schema
● parallel query
● exact results
What's a
windowing query?
regular aggregate
windowing function
CREATE TABLE events (
    event_id INT,
    event_type TEXT,
    start TIMESTAMPTZ,
    duration INTERVAL,
    event_desc TEXT
);
SELECT MAX(concurrent)
FROM (
    SELECT SUM(tally) OVER (ORDER BY start) AS concurrent
    FROM (
        SELECT start, 1::INT AS tally
        FROM events
        UNION ALL
        SELECT (start + duration), -1
        FROM events
    ) AS event_vert
) AS ec;
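For readers who think more easily in application code, the same peak-concurrency logic can be sketched as a single pass over sorted tallies. The event data and the minutes-since-midnight time unit are assumptions for this example; the function mirrors the query's steps rather than replacing it.

```javascript
// Hypothetical events: start and duration in minutes since midnight.
const events = [
  { start: 0, duration: 30 },
  { start: 10, duration: 30 },
  { start: 20, duration: 5 },
  { start: 40, duration: 10 },
];

function maxConcurrent(rows) {
  // "Verticalize": +1 at each start, -1 at each end (the UNION ALL step).
  const tallies = [];
  for (const e of rows) {
    tallies.push({ at: e.start, tally: 1 });
    tallies.push({ at: e.start + e.duration, tally: -1 });
  }
  tallies.sort((a, b) => a.at - b.at);

  // Running sum in time order (the SUM(...) OVER (ORDER BY ...) step).
  let running = 0;
  let max = 0;
  for (let i = 0; i < tallies.length; i++) {
    running += tallies[i].tally;
    // Rows sharing a timestamp are peers: compare only after the last one.
    const lastPeer =
      i + 1 === tallies.length || tallies[i + 1].at !== tallies[i].at;
    if (lastPeer && running > max) max = running;
  }
  return max;
}

const peak = maxConcurrent(events);
```

In the database the whole computation stays in one scan of the table; this version first has to pull every event over the wire.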
stream processing SQL
● replace multiple queries with a single query
● avoid scanning large tables multiple times
● replace pages of application code and MB of data transmission
● SQL alternative to map/reduce (for some data mining tasks)
What's a
materialized view?
query results as table
● calculate once, read many times
● complex/expensive queries
● frequently referenced
● not necessarily a whole query
● often part of a query
● might be manually or automatically
updated
● depends on product
non-relational matviews
● CouchDB Views
  ● cache results of map/reduce jobs
  ● updated on data read
● Solr / Elastic Search “Faceted Search”
  ● cached indexed results of complex searches
  ● updated on data change
maintaining matviews
BEST: update matviews at batch load time
GOOD: update matview according to clock/calendar
FAIR: update matview on data request
BAD for DW: update matviews using a trigger
matview tips
● matviews should be small
  ● 1/10 to ¼ of RAM on each node
● each matview should support several queries
  ● or one really really important one
● truncate + append, don't update
● index matviews like crazy
  ● if they are not indexes themselves
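A toy illustration of "query results as table", refreshed at batch load time with truncate + append. The `factRows` and `dailyTotals` names are hypothetical, and in-memory arrays stand in for real tables.

```javascript
const factRows = [];  // stand-in for a large fact table
let dailyTotals = []; // the "matview": query results kept as a small table

function rebuildDailyTotals() {
  const totals = new Map();
  for (const row of factRows) {
    totals.set(row.day, (totals.get(row.day) || 0) + row.amount);
  }
  // Truncate + append: build a fresh result set instead of updating in place.
  dailyTotals = Array.from(totals, ([day, total]) => ({ day, total }));
}

function batchLoad(rows) {
  factRows.push(...rows);
  rebuildDailyTotals(); // refresh the matview at batch load time
}

batchLoad([
  { day: "2011-06-20", amount: 10 },
  { day: "2011-06-20", amount: 5 },
]);
batchLoad([{ day: "2011-06-21", amount: 7 }]);

// Reports read the small precomputed table, not the fact table.
const report = dailyTotals.find((r) => r.day === "2011-06-20");
```

A per-row trigger would redo this work on every insert, which is why it sits at the bottom of the ranking for DW loads.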
What's OLAP?
cubes
[Cube diagram; axes: Site, Repeat Visitors, Browser]
drill-down
OLAP
● OnLine Analytical Processing
● Visualization technique
● all data as a multi-dimensional space
● great for decision support
● CPU & RAM intensive
● hard to do on really big data
● Works well with column stores
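A cube and a drill-down can be approximated as "aggregate over a chosen set of dimensions, optionally after filtering on another". The sketch below uses the Site, Repeat Visitors, and Browser axes from the cubes slide; the visit data and the `aggregate` helper are invented for illustration and are not a real OLAP engine.

```javascript
// Hypothetical visit facts with three dimensions.
const visits = [
  { site: "blog", browser: "firefox", repeat: true, hits: 3 },
  { site: "blog", browser: "chrome", repeat: false, hits: 1 },
  { site: "shop", browser: "firefox", repeat: false, hits: 2 },
  { site: "shop", browser: "firefox", repeat: true, hits: 4 },
];

// Sum hits over any subset of dimensions, optionally filtering first.
function aggregate(facts, dimensions, filter = {}) {
  const cells = {};
  for (const fact of facts) {
    const keep = Object.keys(filter).every((d) => fact[d] === filter[d]);
    if (!keep) continue;
    const key = dimensions.map((d) => String(fact[d])).join("|");
    cells[key] = (cells[key] || 0) + fact.hits;
  }
  return cells;
}

// Top-level view: one cell per site.
const bySite = aggregate(visits, ["site"]);
// Drill-down: open up the "shop" cell by browser and repeat-visitor flag.
const shopDetail = aggregate(visits, ["browser", "repeat"], { site: "shop" });
```

Every extra dimension multiplies the number of cells, which is one reason cubes get CPU and RAM intensive on really big data.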
Contact
● Josh Berkus: josh@pgexperts.com
● blog: blogs.ittoolbox.com/database/soup
● twitter: @fuzzychef
● PostgreSQL: www.postgresql.org
● pgexperts: www.pgexperts.com
This talk is copyright 2011 Josh Berkus and is licensed under the Creative Commons Attribution
license. Many images were taken from Google Images and are copyright their original creators,
whom I don't actually know. Logos are trademarks of their respective owners, and are used here
under fair use.

More Related Content

What's hot

Inside SQL Server In-Memory OLTP
Inside SQL Server In-Memory OLTPInside SQL Server In-Memory OLTP
Inside SQL Server In-Memory OLTPBob Ward
 
Inside sql server in memory oltp sql sat nyc 2017
Inside sql server in memory oltp sql sat nyc 2017Inside sql server in memory oltp sql sat nyc 2017
Inside sql server in memory oltp sql sat nyc 2017Bob Ward
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/TridentJulian Hyde
 
PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs PGConf APAC
 
15 Ways to Kill Your Mysql Application Performance
15 Ways to Kill Your Mysql Application Performance15 Ways to Kill Your Mysql Application Performance
15 Ways to Kill Your Mysql Application Performanceguest9912e5
 
Postgresql Database Administration Basic - Day1
Postgresql  Database Administration Basic  - Day1Postgresql  Database Administration Basic  - Day1
Postgresql Database Administration Basic - Day1PoguttuezhiniVP
 
Tuning Apache Phoenix/HBase
Tuning Apache Phoenix/HBaseTuning Apache Phoenix/HBase
Tuning Apache Phoenix/HBaseAnil Gupta
 
Delta: Building Merge on Read
Delta: Building Merge on ReadDelta: Building Merge on Read
Delta: Building Merge on ReadDatabricks
 
Changing your huge table's data types in production
Changing your huge table's data types in productionChanging your huge table's data types in production
Changing your huge table's data types in productionJimmy Angelakos
 
Re-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series DatabaseRe-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series DatabaseAll Things Open
 
Dbvisit replicate: logical replication made easy
Dbvisit replicate: logical replication made easyDbvisit replicate: logical replication made easy
Dbvisit replicate: logical replication made easyFranck Pachot
 
Big Data and PostgreSQL
Big Data and PostgreSQLBig Data and PostgreSQL
Big Data and PostgreSQLPGConf APAC
 
Redshift performance tuning
Redshift performance tuningRedshift performance tuning
Redshift performance tuningCarlos del Cacho
 
An introduction to SQL Server in-memory OLTP Engine
An introduction to SQL Server in-memory OLTP EngineAn introduction to SQL Server in-memory OLTP Engine
An introduction to SQL Server in-memory OLTP EngineKrishnakumar S
 
Latest performance changes by Scylla - Project optimus / Nolimits
Latest performance changes by Scylla - Project optimus / Nolimits Latest performance changes by Scylla - Project optimus / Nolimits
Latest performance changes by Scylla - Project optimus / Nolimits ScyllaDB
 
CBO choice between Index and Full Scan: the good, the bad and the ugly param...
CBO choice between Index and Full Scan:  the good, the bad and the ugly param...CBO choice between Index and Full Scan:  the good, the bad and the ugly param...
CBO choice between Index and Full Scan: the good, the bad and the ugly param...Franck Pachot
 

What's hot (20)

Inside SQL Server In-Memory OLTP
Inside SQL Server In-Memory OLTPInside SQL Server In-Memory OLTP
Inside SQL Server In-Memory OLTP
 
Inside sql server in memory oltp sql sat nyc 2017
Inside sql server in memory oltp sql sat nyc 2017Inside sql server in memory oltp sql sat nyc 2017
Inside sql server in memory oltp sql sat nyc 2017
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 
PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs
 
15 Ways to Kill Your Mysql Application Performance
15 Ways to Kill Your Mysql Application Performance15 Ways to Kill Your Mysql Application Performance
15 Ways to Kill Your Mysql Application Performance
 
Postgresql Database Administration Basic - Day1
Postgresql  Database Administration Basic  - Day1Postgresql  Database Administration Basic  - Day1
Postgresql Database Administration Basic - Day1
 
Tuning Apache Phoenix/HBase
Tuning Apache Phoenix/HBaseTuning Apache Phoenix/HBase
Tuning Apache Phoenix/HBase
 
Oracle NOLOGGING
Oracle NOLOGGINGOracle NOLOGGING
Oracle NOLOGGING
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Delta: Building Merge on Read
Delta: Building Merge on ReadDelta: Building Merge on Read
Delta: Building Merge on Read
 
Changing your huge table's data types in production
Changing your huge table's data types in productionChanging your huge table's data types in production
Changing your huge table's data types in production
 
Re-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series DatabaseRe-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series Database
 
Dbvisit replicate: logical replication made easy
Dbvisit replicate: logical replication made easyDbvisit replicate: logical replication made easy
Dbvisit replicate: logical replication made easy
 
Big Data and PostgreSQL
Big Data and PostgreSQLBig Data and PostgreSQL
Big Data and PostgreSQL
 
Redshift performance tuning
Redshift performance tuningRedshift performance tuning
Redshift performance tuning
 
An introduction to SQL Server in-memory OLTP Engine
An introduction to SQL Server in-memory OLTP EngineAn introduction to SQL Server in-memory OLTP Engine
An introduction to SQL Server in-memory OLTP Engine
 
Latest performance changes by Scylla - Project optimus / Nolimits
Latest performance changes by Scylla - Project optimus / Nolimits Latest performance changes by Scylla - Project optimus / Nolimits
Latest performance changes by Scylla - Project optimus / Nolimits
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
CBO choice between Index and Full Scan: the good, the bad and the ugly param...
CBO choice between Index and Full Scan:  the good, the bad and the ugly param...CBO choice between Index and Full Scan:  the good, the bad and the ugly param...
CBO choice between Index and Full Scan: the good, the bad and the ugly param...
 
Pgbr 2013 fts
Pgbr 2013 ftsPgbr 2013 fts
Pgbr 2013 fts
 

Viewers also liked

Giulietta 2.0 jt dm 2 170-cv distinctive
Giulietta 2.0 jt dm 2 170-cv distinctiveGiulietta 2.0 jt dm 2 170-cv distinctive
Giulietta 2.0 jt dm 2 170-cv distinctiveDavide Ciambelli
 
2012 07 ijl2012_seminar_diamond-club-brunch
2012 07 ijl2012_seminar_diamond-club-brunch2012 07 ijl2012_seminar_diamond-club-brunch
2012 07 ijl2012_seminar_diamond-club-brunchEddie Prentice
 
El mejor trabajo que he visto en mi vida
El mejor trabajo que he visto en mi vidaEl mejor trabajo que he visto en mi vida
El mejor trabajo que he visto en mi vidallucNapoleonpro
 
Goldmedia Mobile Monitor 2011- Vortrag ConLife 2011
Goldmedia Mobile Monitor 2011- Vortrag ConLife 2011Goldmedia Mobile Monitor 2011- Vortrag ConLife 2011
Goldmedia Mobile Monitor 2011- Vortrag ConLife 2011Goldmedia Group
 
Relaciones industriales mafer
Relaciones industriales maferRelaciones industriales mafer
Relaciones industriales mafermafer_12
 
How to Engage Your Customer With Mobile Marketing in 2013
How to Engage Your Customer With Mobile Marketing in 2013How to Engage Your Customer With Mobile Marketing in 2013
How to Engage Your Customer With Mobile Marketing in 2013Lorel Marketing Group
 
Los hermanos (Brother's) su origen, desarrollo y testimonio
Los hermanos (Brother's)   su origen, desarrollo y testimonioLos hermanos (Brother's)   su origen, desarrollo y testimonio
Los hermanos (Brother's) su origen, desarrollo y testimonioantoniomd
 
Edmodo red social educativa
Edmodo red social educativaEdmodo red social educativa
Edmodo red social educativamiguellibra2012
 
Parque Nacional Peneda Gerês
Parque Nacional Peneda GerêsParque Nacional Peneda Gerês
Parque Nacional Peneda Gerêsmonadela
 
G1.pacheco.guallasamin.rina.comercio electronico
G1.pacheco.guallasamin.rina.comercio electronicoG1.pacheco.guallasamin.rina.comercio electronico
G1.pacheco.guallasamin.rina.comercio electronicoRina Pacheco
 
Practical Applications for Data Warehousing, Analytics, BI, and Meta-Integrat...
Practical Applications for Data Warehousing, Analytics, BI, and Meta-Integrat...Practical Applications for Data Warehousing, Analytics, BI, and Meta-Integrat...
Practical Applications for Data Warehousing, Analytics, BI, and Meta-Integrat...DATAVERSITY
 
NREL’s Research Support Facility: Making Plug Loads Count
NREL’s Research Support Facility:  Making Plug Loads CountNREL’s Research Support Facility:  Making Plug Loads Count
NREL’s Research Support Facility: Making Plug Loads CountShanti Pless
 
Benedetti, mario primavera con una esquina rota
Benedetti, mario   primavera con una esquina rotaBenedetti, mario   primavera con una esquina rota
Benedetti, mario primavera con una esquina rotaClarita Cra
 
Catálogo Deportivo Petzl-2014
Catálogo Deportivo Petzl-2014Catálogo Deportivo Petzl-2014
Catálogo Deportivo Petzl-2014IANASA
 
Impacts of Lafayette Mining In The Island of Rapu-Rapu Albay Resulting From C...
Impacts of Lafayette Mining In The Island of Rapu-Rapu Albay Resulting From C...Impacts of Lafayette Mining In The Island of Rapu-Rapu Albay Resulting From C...
Impacts of Lafayette Mining In The Island of Rapu-Rapu Albay Resulting From C...No to mining in Palawan
 
Goods4Cast - Inventory Optimization Solution
Goods4Cast - Inventory Optimization SolutionGoods4Cast - Inventory Optimization Solution
Goods4Cast - Inventory Optimization SolutionSergey Kotik
 

Viewers also liked (20)

Giulietta 2.0 jt dm 2 170-cv distinctive
Giulietta 2.0 jt dm 2 170-cv distinctiveGiulietta 2.0 jt dm 2 170-cv distinctive
Giulietta 2.0 jt dm 2 170-cv distinctive
 
Movie pitch
Movie pitchMovie pitch
Movie pitch
 
2012 07 ijl2012_seminar_diamond-club-brunch
2012 07 ijl2012_seminar_diamond-club-brunch2012 07 ijl2012_seminar_diamond-club-brunch
2012 07 ijl2012_seminar_diamond-club-brunch
 
El mejor trabajo que he visto en mi vida
El mejor trabajo que he visto en mi vidaEl mejor trabajo que he visto en mi vida
El mejor trabajo que he visto en mi vida
 
Goldmedia Mobile Monitor 2011- Vortrag ConLife 2011
Goldmedia Mobile Monitor 2011- Vortrag ConLife 2011Goldmedia Mobile Monitor 2011- Vortrag ConLife 2011
Goldmedia Mobile Monitor 2011- Vortrag ConLife 2011
 
Relaciones industriales mafer
Relaciones industriales maferRelaciones industriales mafer
Relaciones industriales mafer
 
How to Engage Your Customer With Mobile Marketing in 2013
How to Engage Your Customer With Mobile Marketing in 2013How to Engage Your Customer With Mobile Marketing in 2013
How to Engage Your Customer With Mobile Marketing in 2013
 
Summary Design Trends Report 08/09
Summary Design Trends Report 08/09Summary Design Trends Report 08/09
Summary Design Trends Report 08/09
 
Los hermanos (Brother's) su origen, desarrollo y testimonio
Los hermanos (Brother's)   su origen, desarrollo y testimonioLos hermanos (Brother's)   su origen, desarrollo y testimonio
Los hermanos (Brother's) su origen, desarrollo y testimonio
 
Edmodo red social educativa
Edmodo red social educativaEdmodo red social educativa
Edmodo red social educativa
 
Parque Nacional Peneda Gerês
Parque Nacional Peneda GerêsParque Nacional Peneda Gerês
Parque Nacional Peneda Gerês
 
BonDia Lleida 14122011
BonDia Lleida 14122011BonDia Lleida 14122011
BonDia Lleida 14122011
 
Nano formulas
Nano formulasNano formulas
Nano formulas
 
G1.pacheco.guallasamin.rina.comercio electronico
G1.pacheco.guallasamin.rina.comercio electronicoG1.pacheco.guallasamin.rina.comercio electronico
G1.pacheco.guallasamin.rina.comercio electronico
 
Practical Applications for Data Warehousing, Analytics, BI, and Meta-Integrat...
Practical Applications for Data Warehousing, Analytics, BI, and Meta-Integrat...Practical Applications for Data Warehousing, Analytics, BI, and Meta-Integrat...
Practical Applications for Data Warehousing, Analytics, BI, and Meta-Integrat...
 
NREL’s Research Support Facility: Making Plug Loads Count
NREL’s Research Support Facility:  Making Plug Loads CountNREL’s Research Support Facility:  Making Plug Loads Count
NREL’s Research Support Facility: Making Plug Loads Count
 
Benedetti, mario primavera con una esquina rota
Benedetti, mario   primavera con una esquina rotaBenedetti, mario   primavera con una esquina rota
Benedetti, mario primavera con una esquina rota
 
Catálogo Deportivo Petzl-2014
Catálogo Deportivo Petzl-2014Catálogo Deportivo Petzl-2014
Catálogo Deportivo Petzl-2014
 
Impacts of Lafayette Mining In The Island of Rapu-Rapu Albay Resulting From C...
Impacts of Lafayette Mining In The Island of Rapu-Rapu Albay Resulting From C...Impacts of Lafayette Mining In The Island of Rapu-Rapu Albay Resulting From C...
Impacts of Lafayette Mining In The Island of Rapu-Rapu Albay Resulting From C...
 
Goods4Cast - Inventory Optimization Solution
Goods4Cast - Inventory Optimization SolutionGoods4Cast - Inventory Optimization Solution
Goods4Cast - Inventory Optimization Solution
 

Similar to Data Warehousing 101(and a video)

Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the Worldjhugg
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartMukesh Singh
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dan Lynn
 
Big Data processing with Apache Spark
Big Data processing with Apache SparkBig Data processing with Apache Spark
Big Data processing with Apache SparkLucian Neghina
 
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...InfluxData
 
NoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyNoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyGuillaume Lefranc
 
OSMC 2019 | How to improve database Observability by Charles Judith
OSMC 2019 | How to improve database Observability by Charles JudithOSMC 2019 | How to improve database Observability by Charles Judith
OSMC 2019 | How to improve database Observability by Charles JudithNETWAYS
 
ETL Practices for Better or Worse
ETL Practices for Better or WorseETL Practices for Better or Worse
ETL Practices for Better or WorseEric Sun
 
Are we there Yet?? (The long journey of Migrating from close source to opens...
Are we there Yet?? (The long journey of Migrating from close source to opens...Are we there Yet?? (The long journey of Migrating from close source to opens...
Are we there Yet?? (The long journey of Migrating from close source to opens...Marco Tusa
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentationargonauts007
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dan Lynn
 
Voxxed Athens 2018 - Methods and Practices for Guaranteed Failure in Big Data
Voxxed Athens 2018 - Methods and Practices for Guaranteed Failure in Big DataVoxxed Athens 2018 - Methods and Practices for Guaranteed Failure in Big Data
Voxxed Athens 2018 - Methods and Practices for Guaranteed Failure in Big DataVoxxed Athens
 
Run your queries 14X faster without any investment!
Run your queries 14X faster without any investment!Run your queries 14X faster without any investment!
Run your queries 14X faster without any investment!Knoldus Inc.
 
An Introduction to MapReduce
An Introduction to MapReduce An Introduction to MapReduce
An Introduction to MapReduce Sina Ebrahimi
 
Oracle 12 c new-features
Oracle 12 c new-featuresOracle 12 c new-features
Oracle 12 c new-featuresNavneet Upneja
 
#OSSPARIS19 - How to improve database observability - CHARLES JUDITH, Criteo
#OSSPARIS19 - How to improve database observability - CHARLES JUDITH, Criteo#OSSPARIS19 - How to improve database observability - CHARLES JUDITH, Criteo
#OSSPARIS19 - How to improve database observability - CHARLES JUDITH, CriteoParis Open Source Summit
 

Similar to Data Warehousing 101(and a video) (20)

Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the World
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
Big Data processing with Apache Spark
Big Data processing with Apache SparkBig Data processing with Apache Spark
Big Data processing with Apache Spark
 
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns
 
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
 
NoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyNoSQL Solutions - a comparative study
NoSQL Solutions - a comparative study
 
OSMC 2019 | How to improve database Observability by Charles Judith
OSMC 2019 | How to improve database Observability by Charles JudithOSMC 2019 | How to improve database Observability by Charles Judith
OSMC 2019 | How to improve database Observability by Charles Judith
 
ETL Practices for Better or Worse
ETL Practices for Better or WorseETL Practices for Better or Worse
ETL Practices for Better or Worse
 
Are we there Yet?? (The long journey of Migrating from close source to opens...
Are we there Yet?? (The long journey of Migrating from close source to opens...Are we there Yet?? (The long journey of Migrating from close source to opens...
Are we there Yet?? (The long journey of Migrating from close source to opens...
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentation
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016
 
Druid
DruidDruid
Druid
 
BigData Hadoop
BigData Hadoop BigData Hadoop
BigData Hadoop
 
Voxxed Athens 2018 - Methods and Practices for Guaranteed Failure in Big Data
Voxxed Athens 2018 - Methods and Practices for Guaranteed Failure in Big DataVoxxed Athens 2018 - Methods and Practices for Guaranteed Failure in Big Data
Voxxed Athens 2018 - Methods and Practices for Guaranteed Failure in Big Data
 
Run your queries 14X faster without any investment!
Run your queries 14X faster without any investment!Run your queries 14X faster without any investment!
Run your queries 14X faster without any investment!
 
An Introduction to MapReduce
An Introduction to MapReduce An Introduction to MapReduce
An Introduction to MapReduce
 
try
trytry
try
 
Oracle 12 c new-features
Oracle 12 c new-featuresOracle 12 c new-features
Oracle 12 c new-features
 
#OSSPARIS19 - How to improve database observability - CHARLES JUDITH, Criteo
#OSSPARIS19 - How to improve database observability - CHARLES JUDITH, Criteo#OSSPARIS19 - How to improve database observability - CHARLES JUDITH, Criteo
#OSSPARIS19 - How to improve database observability - CHARLES JUDITH, Criteo
 

More from PostgreSQL Experts, Inc.

PostgreSQL Replication in 10 Minutes - SCALE
PostgreSQL Replication in 10  Minutes - SCALEPostgreSQL Replication in 10  Minutes - SCALE
PostgreSQL Replication in 10 Minutes - SCALEPostgreSQL Experts, Inc.
 
Elephant Roads: PostgreSQL Patches and Variants
Elephant Roads: PostgreSQL Patches and VariantsElephant Roads: PostgreSQL Patches and Variants
Elephant Roads: PostgreSQL Patches and VariantsPostgreSQL Experts, Inc.
 

More from PostgreSQL Experts, Inc. (20)

Shootout at the PAAS Corral
Shootout at the PAAS CorralShootout at the PAAS Corral
Shootout at the PAAS Corral
 
Shootout at the AWS Corral
Shootout at the AWS CorralShootout at the AWS Corral
Shootout at the AWS Corral
 
Fail over fail_back
Fail over fail_backFail over fail_back
Fail over fail_back
 
PostgreSQL Replication in 10 Minutes - SCALE
PostgreSQL Replication in 10  Minutes - SCALEPostgreSQL Replication in 10  Minutes - SCALE
PostgreSQL Replication in 10 Minutes - SCALE
 
HowTo DR
HowTo DRHowTo DR
HowTo DR
 
Give A Great Tech Talk 2013
Give A Great Tech Talk 2013Give A Great Tech Talk 2013
Give A Great Tech Talk 2013
 
Pg py-and-squid-pypgday
Pg py-and-squid-pypgdayPg py-and-squid-pypgday
Pg py-and-squid-pypgday
 
92 grand prix_2013
92 grand prix_201392 grand prix_2013
92 grand prix_2013
 
Five steps perform_2013
Five steps perform_2013Five steps perform_2013
Five steps perform_2013
 
7 Ways To Crash Postgres
7 Ways To Crash Postgres7 Ways To Crash Postgres
7 Ways To Crash Postgres
 
PWNage: Producing a newsletter with Perl
PWNage: Producing a newsletter with PerlPWNage: Producing a newsletter with Perl
PWNage: Producing a newsletter with Perl
 
10 Ways to Destroy Your Community
10 Ways to Destroy Your Community10 Ways to Destroy Your Community
10 Ways to Destroy Your Community
 
Open Source Press Relations
Open Source Press RelationsOpen Source Press Relations
Open Source Press Relations
 
5 (more) Ways To Destroy Your Community
5 (more) Ways To Destroy Your Community5 (more) Ways To Destroy Your Community
5 (more) Ways To Destroy Your Community
 
Preventing Community (from Linux Collab)
Preventing Community (from Linux Collab)Preventing Community (from Linux Collab)
Preventing Community (from Linux Collab)
 
Development of 8.3 In India
Development of 8.3 In IndiaDevelopment of 8.3 In India
Development of 8.3 In India
 
PostgreSQL and MySQL
PostgreSQL and MySQLPostgreSQL and MySQL
PostgreSQL and MySQL
 
50 Ways To Love Your Project
50 Ways To Love Your Project50 Ways To Love Your Project
50 Ways To Love Your Project
 
8.4 Upcoming Features
8.4 Upcoming Features 8.4 Upcoming Features
8.4 Upcoming Features
 
Elephant Roads: PostgreSQL Patches and Variants
Elephant Roads: PostgreSQL Patches and VariantsElephant Roads: PostgreSQL Patches and Variants
Elephant Roads: PostgreSQL Patches and Variants
 


Data Warehousing 101 (and a video)

  • 1. Data Warehousing 101 Everything you never wanted to know about big databases but were forced to find out anyway Josh Berkus Open Source Bridge 2011
  • 2. contents covering ● concepts of DW ● some DW techniques ● databases not covering ● hardware ● analytics/reporting tools
  • 6. What is a “data warehouse”?
  • 9. OLTP vs DW OLTP: ● many single-row writes ● current data ● queries generated by user activity ● < 1s response times ● 1000's of users DW: ● few large batch imports ● years of data ● queries generated by large reports ● queries can run for minutes/hours ● 10's of users
  • 10. OLTP vs DW OLTP: big data for many concurrent requests to small amounts of data each. DW: big data for low-concurrency requests to very large amounts of data each.
  • 13. archiving WORN data: “write once, read never” ● grows indefinitely ● usually a result of regulatory compliance ● main concern: storage efficiency
  • 15. data mining the database where you don't know what's in there, but you want to find out ● lots of data (TB to PB) ● mostly “semi-structured” ● data produced as a side effect of other business processes ● needs CPU-intensive processing
  • 16. BI: Business Intelligence DSS: Decision Support OLAP: Online Analytical Processing Analytics
  • 18. BI/DSS/OLAP/Analytics databases which support visualization of large amounts of data ● data is fairly well understood ● most data can be reduced to categories, geography, and taxonomy ● primarily about indexing
  • 20. dimensions vs. facts Fact Table customers / accounts category subcategory sub-subcategory
  • 21. dimension examples ● location/region/country/quadrant ● product categorization ● URL ● transaction type ● account hierarchy ● IP address ● OS/version/build
  • 22. dimension synonyms ● facet ● taxonomy ● secondary index ● view
  • 24. Extract, Transform, Load ● how you turn external raw data into useful database data ● Apache logs → web analytics DB ● CSV POS files → financial reporting DB ● OLTP server → 10-year data warehouse ● also called ELT when the transformation is done inside the database
  • 25. Purpose of ETL/ELT getting data into the data warehouse ● clean up garbage data ● split out attributes ● “normalize” dimensional data ● deduplication ● calculate materialized views / indexes
  • 29. ELT Tips think volume ● bulk processing or parallel processing ● no row-at-a-time, document-at-a-time ● insert into permanent storage should be the last step ● no updates
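The ETL purposes and volume tips above can be sketched as a single batch transform. This is a minimal illustration, not the talk's own code: the sample log lines, the `LOG_PATTERN` regular expression, and the `transformBatch` name are all assumptions made for the example. It drops garbage lines, splits each line into attributes, deduplicates, and hands back one batch so that the insert into permanent storage can be a single final step.

```javascript
// Hypothetical input: Apache common-log-format lines, one duplicate, one garbage line.
const rawLines = [
  '10.0.0.1 - - [10/Jun/2011:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326',
  '10.0.0.1 - - [10/Jun/2011:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326',
  'garbage line that does not parse',
  '10.0.0.2 - - [10/Jun/2011:13:56:01 -0700] "POST /login HTTP/1.1" 302 0',
];

// Assumed pattern for the common log format: host, timestamp, request line, status, bytes.
const LOG_PATTERN = /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+|-)$/;

// Transform a whole batch at once instead of loading row by row.
function transformBatch(lines) {
  const seen = new Set();
  const rows = [];
  for (const line of lines) {
    const m = LOG_PATTERN.exec(line);
    if (!m) continue;             // clean up garbage data
    if (seen.has(line)) continue; // deduplication
    seen.add(line);
    rows.push({                   // split out attributes
      ip: m[1],
      time: m[2],
      method: m[3],
      url: m[4],
      status: Number(m[5]),
      bytes: m[6] === '-' ? 0 : Number(m[6]),
    });
  }
  return rows;
}

// The batch is what a bulk loader would insert in one last step.
const batch = transformBatch(rawLines);
console.log(batch.length, 'rows ready for bulk load');
```

In a real pipeline the returned batch would go to the database's bulk-load path rather than to row-at-a-time inserts.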
  • 31. What kind of database should I use for DW?
  • 32. 5 Types 1. Standard Relational 2. MPP 3. Column Store 4. Map/Reduce 5. Enterprise Search
  • 34. standard relational the all-purpose solution for not-that-big data ● adequate for all tasks ● but not excellent at any of them ● easy to use ● low resource requirements ● well-supported by all software ● familiar ● not suitable for really big data
  • 35. Sweet Spots [chart: MySQL, PostgreSQL, DW Database; both axes run from 0 to 30]
  • 39. MPP cpu-intensive data warehousing ● data mining, some analytics ● supporting complex query logic ● moderately big data (1-200TB) ● drawbacks: proprietary, expensive ● now hybridizes with other types
  • 42. column store inversion of a row store: indexes become data, data becomes indexes
  • 44. column stores for aggregations and transformations of highly structured data ● good for BI, analytics, some archiving ● moderately big data (0.5-100TB) ● bad for data mining ● slow to add new data / purge data ● usually support compression
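To make the row-store inversion concrete, here is a small in-memory sketch. The sample data and the `toColumnStore` and `runLengthEncode` helpers are hypothetical and only illustrate the idea; real column stores use far more elaborate storage. Storing each attribute as its own array means an aggregate touches only the one column it needs, and repeated values in a column compress well.

```javascript
// Hypothetical fact rows, stored row by row.
const rowStore = [
  { region: 'east', product: 'widget', amount: 10 },
  { region: 'east', product: 'gadget', amount: 25 },
  { region: 'west', product: 'widget', amount: 5 },
  { region: 'west', product: 'widget', amount: 60 },
];

// Invert the layout: one array per attribute instead of one object per row.
function toColumnStore(rows) {
  const columns = {};
  for (const row of rows) {
    for (const [name, value] of Object.entries(row)) {
      if (!columns[name]) columns[name] = [];
      columns[name].push(value);
    }
  }
  return columns;
}

// Simple run-length encoding, one way repeated column values can be compressed.
function runLengthEncode(values) {
  const runs = [];
  for (const value of values) {
    const last = runs[runs.length - 1];
    if (last && last.value === value) last.count += 1;
    else runs.push({ value, count: 1 });
  }
  return runs;
}

const columnStore = toColumnStore(rowStore);

// The aggregate reads only the "amount" column.
const totalAmount = columnStore.amount.reduce((sum, n) => sum + n, 0);
const regionRuns = runLengthEncode(columnStore.region);
console.log(totalAmount, regionRuns);
```

The same layout hints at why adding or purging single rows is slow: every column array has to be touched.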
  • 48. map/reduce
      // map
      function(doc) {
        for (var i in doc.links) {
          emit([doc.parent, i], null);
        }
      }
      // reduce
      function(keys, values) {
        return null;
      }
  • 49. map/reduce
      // Map
      function (doc) {
        emit(doc.val, doc.val)
      }
      // Reduce
      function (keys, values, rereduce) {
        // This computes the standard deviation of the mapped results
        var stdDeviation=0.0;
        var count=0;
        var total=0.0;
        var sqrTotal=0.0;
        if (!rereduce) {
          // This is the reduce phase, we are reducing over emitted values from
          // the map functions.
          for(var i in values) {
            total = total + values[i];
            sqrTotal = sqrTotal + (values[i] * values[i]);
          }
          count = values.length;
        } else {
          // This is the rereduce phase, we are re-reducing previously
          // reduced values.
          for(var i in values) {
            count = count + values[i].count;
            total = total + values[i].total;
            sqrTotal = sqrTotal + values[i].sqrTotal;
          }
        }
        var variance = (sqrTotal - ((total * total)/count)) / count;
        stdDeviation = Math.sqrt(variance);
        // the reduce result. It contains enough information to be rereduced
        // with other reduce results.
        return {"stdDeviation":stdDeviation,"count":count, "total":total,"sqrTotal":sqrTotal};
      };
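The point of the reduce/rereduce split is that partial results carry enough state (count, total, sum of squares) to be combined later. The sketch below checks that property locally with plain Node.js and no database. The function name `stdDevReduce` and the sample values are assumptions for the example, and the logic mirrors the slide's reduce function.

```javascript
// Same logic as the slide's reduce function, given a name so it can be called directly.
function stdDevReduce(keys, values, rereduce) {
  let count = 0;
  let total = 0;
  let sqrTotal = 0;
  if (!rereduce) {
    // Reduce phase: values are the raw numbers emitted by map.
    for (const v of values) {
      total += v;
      sqrTotal += v * v;
    }
    count = values.length;
  } else {
    // Rereduce phase: values are earlier reduce results.
    for (const partial of values) {
      count += partial.count;
      total += partial.total;
      sqrTotal += partial.sqrTotal;
    }
  }
  // Population variance: E[x^2] - (E[x])^2.
  const variance = (sqrTotal - (total * total) / count) / count;
  return { stdDeviation: Math.sqrt(variance), count, total, sqrTotal };
}

// Hypothetical sample values.
const sample = [2, 4, 4, 4, 5, 5, 7, 9];

// One pass over everything.
const singlePass = stdDevReduce(null, sample, false);

// Two partial reduces (as if on two nodes), then a rereduce over the partials.
const partials = [
  stdDevReduce(null, sample.slice(0, 4), false),
  stdDevReduce(null, sample.slice(4), false),
];
const combined = stdDevReduce(null, partials, true);

console.log(singlePass.stdDeviation, combined.stdDeviation);
```

Both paths give the same standard deviation, which is what lets the work be spread across nodes and merged afterwards.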
  • 50. map/reduce vs. MPP map/reduce: ● open source ● petabytes ● write routines by hand ● inefficient ● generic ● cheap HW / cloud ● DIY tools MPP: ● proprietary ● terabytes ● advanced query support ● efficient ● specific ● needs good HW ● integrated tools
  • 53. enterprise search when you need to do DW with a huge pile of partly processed “documents” ● does: light data mining, light BI/analytics ● best “full text” and keyword search ● supports “approximate results” ● lots of special features for web data
  • 54. E.S. vs. C-Store E.S.: ● batch load ● semi-structured data ● uncompressed ● star schema ● sharding ● approximate results C-Store: ● batch load ● fully normalized data ● compressed ● snowflake schema ● parallel query ● exact results
  • 58. CREATE TABLE events ( event_id INT, event_type TEXT, start TIMESTAMPTZ, duration INTERVAL, event_desc TEXT );
  • 59. SELECT MAX(concurrent) FROM ( SELECT SUM(tally) OVER (ORDER BY start) AS concurrent FROM ( SELECT start, 1::INT as tally FROM events UNION ALL SELECT (start + duration), -1 FROM events ) AS event_vert) AS ec;
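The windowing query above finds peak concurrency by turning every event into a +1 at its start and a -1 at its end, then taking a running sum in time order. The same idea as a minimal sketch outside the database: the sample events, the use of plain numbers for times, and the `maxConcurrent` name are assumptions for illustration. Ends are sorted before starts at the same instant, so an event that ends exactly when another begins is not counted as overlapping.

```javascript
// Hypothetical events; start and duration are plain numbers (say, minutes).
const events = [
  { id: 'A', start: 0, duration: 10 },
  { id: 'B', start: 5, duration: 10 },
  { id: 'C', start: 8, duration: 2 },
  { id: 'D', start: 10, duration: 5 },
];

function maxConcurrent(evts) {
  // One +1 row per start and one -1 row per end, like the UNION ALL subquery.
  const tallies = [];
  for (const e of evts) {
    tallies.push({ time: e.start, tally: 1 });
    tallies.push({ time: e.start + e.duration, tally: -1 });
  }
  // Order by time; at equal times apply the -1 rows first.
  tallies.sort((a, b) => a.time - b.time || a.tally - b.tally);

  // Running sum, keeping the largest value seen.
  let running = 0;
  let peak = 0;
  for (const t of tallies) {
    running += t.tally;
    if (running > peak) peak = running;
  }
  return peak;
}

const peakConcurrency = maxConcurrent(events);
console.log(peakConcurrency);
```

Doing this in application code means pulling every event row out of the database first, which is the data transmission the single SQL query avoids.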
  • 60. stream processing SQL ● replace multiple queries with a single query ● avoid scanning large tables multiple times ● replace pages of application code ● and MB of data transmission ● SQL alternative to map/reduce ● (for some data mining tasks)
  • 62. query results as table ● calculate once, read many times ● complex/expensive queries ● frequently referenced ● not necessarily a whole query ● often part of a query ● might be manually or automatically updated ● depends on product
  • 63. non-relational matviews ● CouchDB Views ● cache results of map/reduce jobs ● updated on data read ● Solr / Elastic Search “Faceted Search” ● cached indexed results of complex searches ● updated on data change
  • 64. maintaining matviews BEST: update matviews at batch load time GOOD: update matview according to clock/calendar FAIR: update matview on data request BAD for DW: update matviews using a trigger
  • 65. matview tips ● matviews should be small ● 1/10 to ¼ of RAM on each node ● each matview should support several queries ● or one really really important one ● truncate + append, don't update ● index matviews like crazy ● if they are not indexes themselves
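The "update at batch load time" and "truncate + append" advice can be sketched with in-memory structures standing in for tables. Everything named here (`factTable`, `hitsByUrl`, `loadBatch`, `refreshMatview`, the sample rows) is hypothetical. The summary is recomputed and rewritten in full once per batch, rather than being adjusted row by row the way a trigger would.

```javascript
// Stand-ins for a fact table and a small materialized view of hits per URL.
const factTable = [];
const hitsByUrl = new Map();

// Truncate + append: recompute the summary, empty the matview, then refill it.
function refreshMatview() {
  const counts = {};
  for (const row of factTable) {
    counts[row.url] = (counts[row.url] || 0) + 1;
  }
  hitsByUrl.clear();                      // truncate
  for (const [url, hits] of Object.entries(counts)) {
    hitsByUrl.set(url, hits);             // append
  }
}

// Load a whole batch, then refresh the matview once as the final step.
function loadBatch(batch) {
  factTable.push(...batch);
  refreshMatview();
}

// Hypothetical batches.
loadBatch([{ url: '/a' }, { url: '/b' }, { url: '/a' }]);
loadBatch([{ url: '/a' }, { url: '/c' }]);

console.log([...hitsByUrl.entries()]);
```

Reads between batches hit only the small summary, which is the "calculate once, read many times" trade the matview is making.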
  • 69. OLAP ● OnLine Analytical Processing ● Visualization technique ● all data as a multi-dimensional space ● great for decision support ● CPU & RAM intensive ● hard to do on really big data ● Works well with column stores
  • 70. Contact ● Josh Berkus: josh@pgexperts.com ● blog: blogs.ittoolbox.com/database/soup ● twitter: @fuzzychef ● PostgreSQL: www.postgresql.org ● pgexperts: www.pgexperts.com This talk is copyright 2011 Josh Berkus and is licensed under the Creative Commons Attribution license. Many images were taken from google images and are copyright their original creators, whom I don't actually know. Logos are trademark their respective owners, and are used here under fair use.