3. What is CitusDB?
• CitusDB is a scalable analytics database that
extends PostgreSQL
– Citus shards your data and automatically parallelizes
your queries
– Citus isn’t a fork of Postgres. Rather, it hooks onto the
planner and executor for distributed query execution.
– Always rebased to newest Postgres version
– Natively supports new data types and extensions
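As a minimal sketch of how this looks in practice (the exact function name varies by Citus version; newer releases expose create_distributed_table, older ones master_create_distributed_table):

-- sketch, assuming a recent Citus release
CREATE TABLE events (event_id bigint, payload jsonb);
SELECT create_distributed_table('events', 'event_id');
-- this query is now parallelized across the worker shards
SELECT count(*) FROM events;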
4. [Architecture diagram: the master node (extended PostgreSQL) keeps the shard and shard placement metadata; worker nodes #1, #2, and #3 (each extended PostgreSQL) hold the shard placements. 1 shard = 1 Postgres table.]
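A hedged peek at that metadata (pg_dist_shard is the Citus catalog table holding it; the column names are from current Citus and may differ across versions):

-- on the master node: one row per shard, with min/max partition values
SELECT logicalrelid, shardid, shardminvalue, shardmaxvalue
FROM pg_dist_shard;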
17. Columnar Store Motivation
• Read subset of columns to reduce I/O
• Better compression
– Less disk usage
– Less disk I/O
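For instance (orders is a hypothetical wide table, not from the slides), a columnar layout means this query reads only the two referenced column streams from disk instead of every full row:

-- only the order_date and price columns are read from disk
SELECT date_trunc('month', order_date) AS month, sum(price)
FROM orders
GROUP BY month;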
18. State of the Columnar Store
1. Fork a popular database, swap in your
storage engine, and never look back
2. Develop an open columnar store format for
the Hadoop Distributed Filesystem (HDFS)
3. Use PostgreSQL extension machinery for in-memory
stores / external databases
19. ORC File Layout benefits
1. Columnar layout – reads only the columns related to the query
2. Compression – groups column values (10K at a time) and compresses them
3. Skip indexes – applies predicate filtering to skip over unrelated values
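A sketch of the skip-index idea (same hypothetical orders table): each block stores per-column min/max values, so a selective predicate lets the reader discard whole blocks without reading or decompressing them:

-- blocks whose min/max range for order_date lies entirely
-- before 2014-01-01 are skipped outright
SELECT sum(price)
FROM orders
WHERE order_date >= DATE '2014-01-01';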
21. Compression
• Current compression method is PG_LZ
from PostgreSQL core
• Easy to add new compression methods
depending on the CPU / disk trade-off
• cstore_fdw enables using different
compression methods at the column block
level
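These knobs are exposed as foreign-table options; this sketch uses the option names from the cstore_fdw README (compression, block_row_count) and assumes a cstore_server foreign server already exists (the full setup is sketched under the Summary slide):

-- PG_LZ compression, applied per 10K-row column block
CREATE FOREIGN TABLE orders_cstore (
    order_id   bigint,
    order_date date,
    price      numeric
)
SERVER cstore_server
OPTIONS (compression 'pglz', block_row_count '10000');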
23. Drawbacks to ORC
• Supports only a limited set of data types. Each data type also needs a separate code path for min/max value collection and constraint exclusion.
• Statistics gathering and table JOINs are an afterthought.
24. Recent Benchmark Results
• TPC-H is a standard benchmark
• Performed in-memory, SSD, and HDD tests
on 10 GB of data
• Used m2.2xlarge and m3.2xlarge on EC2
• Compared vanilla PostgreSQL, CStore, and
CStore with compression
29. 1.1 Release
• CStore is an open source project actively in
development: github.com/citusdata/cstore_fdw
– Improved statistics gathering
– Automatic management of table filenames
– Management of table file data
30. Future Work
– Improve memory usage
– Native Delete / Insert / Update support
– Improve read query performance (vectorized
execution)
– Different compression codecs
– Many more; contribute to the discussion on
GitHub!
31. Summary
• CStore: Open source columnar store fdw for
Postgres
• Improves query times, reduces disk I/O, and
reduces disk utilization
• Uses foreign data wrapper APIs
1. Supports all PostgreSQL data types
2. Statistics collection for better query plans
3. Load extension. Create Table. Copy. (sketch below)
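A minimal end-to-end sketch of that three-step flow, following the cstore_fdw README (the CSV path is hypothetical):

-- 1. load the extension and define a server
CREATE EXTENSION cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

-- 2. create a columnar foreign table
CREATE FOREIGN TABLE customer_reviews (
    customer_id   text,
    review_date   date,
    review_rating int
)
SERVER cstore_server
OPTIONS (compression 'pglz');

-- 3. load data with COPY
COPY customer_reviews FROM '/tmp/customer_reviews.csv' WITH CSV;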
32. cstore_fdw – Columnar Store
for Analytic Workloads
Hadi Moshayedi – hadi@citusdata.com
Ben Redman – ben@citusdata.com
Editor's notes
Columnar store for PostgreSQL
Ozgun .. founder at Citus Data
SF and Istanbul <short bio>
Hadi did bulk of the work on the columnar store
Have about 30 slides and a demo. I’ll put things into context with 2 slides on Citus
Technical talk. If you have questions, please feel free to interrupt
Speak slowly.
Team trip in Ayvalık
Why did we build cstore_fdw? Context around what we built and why cstore_fdw is very applicable to our users
When I say extends: we didn’t take a particular version of Postgres and fork from there. Instead we went from 8.4 to 9.0, etc.
We used the existing API and integration points: query planner and executor hooks are an example.
Let’s take an example distributed table, and see how it’s spread across the worker nodes.
The yellow boxes here are shards that make up the distributed table.
Worker node extensions
Master node extensions
1 shard = 1 postgres table = 1 cstore table
I/O bottlenecks can be even more of an issue because of parallelism
Column “Id” is laid out sequentially on disk, then column “size”, and so forth.
I just spoke about how we reduce I/O; you also get better compression. Why?
Now that we’re motivated, let’s do a demo!
Before we started, we wanted to get a picture of the landscape
(1/ you could integrate your storage engine back into a popular database)
Talk about how Hadoop was working on solving this problem because they have similar needs (reading/writing bytes), all open source and shared. The ORC file format was developed by FB and Hortonworks
* Pick the best of the latter two approaches
RCFile paper published in ICDE ’11 – Performance comparisons in the paper. FB and Ohio State
First do some horizontal partitioning, then do vertical partitioning (use examples)
Adopted by Hive and Pig – projects within the Hadoop ecosystem
The second generation specification supersedes the first one
The specification is open on the web
Reiterate how indexes work
Second generation. Developed by Hortonworks and Facebook
ORC columnar file layout
Lightweight indexes fit into memory (min/max values for each column)
(Stripes allow you to benefit from sequential I/O read benefits – you read in bigger chunks from disk – not so applicable to SSDs)
Decompress only related blocks (lower decompression overhead) -> evolutionary approach
Block – indexes are per block
Block – compression is per block (talk a bit about this in a second)
Index data kept in protocol buffer format -> backward compatible
The difference between TOAST tables and us is that we compress at the block level, so hopefully better
What does this mean for you?
Lineitem goes from 9.1GB -> 2.4GB
1/ In-memory: Effective memory size increases (If you have 1GB of RAM, your working set of 3-4GB can now fit into RAM)
2/ SSD: SSDs are expensive and you save notably from storage costs. You also read less from disk. Reduce disk bottlenecks.
3/ Rotational: Your disk I/O bound query performance significantly improves. Also, if the user stores PB of data in a distributed cluster, the customer saves from hardware costs.
Cstore can also keep min, max, sum, count, etc.
Limited set of types INT**, BOOL, TEXT**, DECIMAL**, TIMESTAMP
Decided to use PostgreSQL’s datum representation for saving the data.
FDWs offer a nice API to collect a random sample from the data.
Looking to improve cost estimation for cstore_fdw query costs.
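Concretely, running ANALYZE on a cstore table goes through that FDW sampling API, giving the planner row counts and column statistics:

-- builds planner statistics via the FDW's sampling callback
ANALYZE customer_reviews;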
TPC-H is an ad-hoc, decision support benchmark.
Each table has between 10-20 columns. So not the best benchmark to demonstrate column store performance.
Talk about what graphs are going to show
m3.2xlarge (2 x 80G SSD, 30G ram, 4x3.25 ECU - 10G tests)
m2.2xlarge (1 x 850G HDD, 34.2G ram, 4x3.25 ECU - 10G tests)
Representative queries
Q6: 68s -> 25s (Q3: 85s -> 44s)
1/ Reduces disk bottlenecks
2/ Saves on disk costs
cstore is slightly faster. cstore with compression is slightly slower due to the compression’s CPU cost.
Effective memory size increases
1/ Compression (Instead of fitting 1GB, users can now fit in 2-3GB)
2/ If queries always select a subset of the columns, only those columns occupy the working set
3/ Ideally, skip indexes are always kept in memory (they get referenced on each query)
Bug fixes!
Better cost estimates for join operations!
Improves query times, reduces disk I/O, and reduces disk utilization