3. What is CitusDB?
• CitusDB is a scalable analytics database that
extends PostgreSQL
– Citus shards your data and automatically parallelizes
your queries
– Citus isn’t a fork of Postgres. Rather, it hooks onto the
planner and executor for distributed query execution.
– Always rebased to newest Postgres version
– Natively supports new data types and extensions
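As a minimal sketch of how this looks in practice (the exact function name varies by Citus version; newer releases expose create_distributed_table, older ones master_create_distributed_table):

-- sketch, assuming a recent Citus release
CREATE TABLE events (event_id bigint, payload jsonb);
SELECT create_distributed_table('events', 'event_id');
-- this query is now parallelized across the worker shards
SELECT count(*) FROM events;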
4. [Architecture diagram: the master node (extended PostgreSQL) keeps the shard and shard placement metadata; worker nodes #1, #2, and #3 (each extended PostgreSQL) hold the shard placements. 1 shard = 1 Postgres table.]
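A hedged peek at that metadata (pg_dist_shard is the Citus catalog table holding it; the column names are from current Citus and may differ across versions):

-- on the master node: one row per shard, with min/max partition values
SELECT logicalrelid, shardid, shardminvalue, shardmaxvalue
FROM pg_dist_shard;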
17. Columnar Store Motivation
• Read subset of columns to reduce I/O
• Better compression
– Less disk usage
– Less disk I/O
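For instance (orders is a hypothetical wide table, not from the slides), a columnar layout means this query reads only the two referenced column streams from disk instead of every full row:

-- only the order_date and price columns are read from disk
SELECT date_trunc('month', order_date) AS month, sum(price)
FROM orders
GROUP BY month;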
18. State of the Columnar Store
1. Fork a popular database, swap in your
storage engine, and never look back
2. Develop an open columnar store format for
the Hadoop Distributed Filesystem (HDFS)
3. Use PostgreSQL extension machinery for in-memory
stores / external databases
19. ORC File Layout benefits
1. Columnar layout – reads only the columns related to the query
2. Compression – groups column values (10K at a time) and compresses them
3. Skip indexes – applies predicate filtering to skip over unrelated values
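A sketch of the skip-index idea (same hypothetical orders table): each block stores per-column min/max values, so a selective predicate lets the reader discard whole blocks without reading or decompressing them:

-- blocks whose min/max range for order_date lies entirely
-- before 2014-01-01 are skipped outright
SELECT sum(price)
FROM orders
WHERE order_date >= DATE '2014-01-01';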
21. Compression
• Current compression method is PG_LZ
from PostgreSQL core
• Easy to add new compression methods
depending on the CPU / disk trade-off
• cstore_fdw enables using different
compression methods at the column block
level
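These knobs are exposed as foreign-table options; this sketch uses the option names from the cstore_fdw README (compression, block_row_count) and assumes a cstore_server foreign server already exists (the full setup is sketched under the Summary slide):

-- PG_LZ compression, applied per 10K-row column block
CREATE FOREIGN TABLE orders_cstore (
    order_id   bigint,
    order_date date,
    price      numeric
)
SERVER cstore_server
OPTIONS (compression 'pglz', block_row_count '10000');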
23. Drawbacks to ORC
• Supports only a limited set of data types. Each data type also needs a separate code path for min/max value collection and constraint exclusion.
• Statistics gathering and table JOINs are an afterthought.
24. Recent Benchmark Results
• TPC-H is a standard benchmark
• Performed in-memory, SSD, and HDD tests
on 10 GB of data
• Used m2.2xlarge and m3.2xlarge on EC2
• Compared vanilla PostgreSQL, CStore, and
CStore with compression
29. 1.1 Release
• CStore is an open source project actively in
development: github.com/citusdata/cstore_fdw
– Improved statistics gathering
– Automatic management of table filenames
– Management of table file data
30. Future Work
– Improve memory usage
– Native Delete / Insert / Update support
– Improve read query performance (vectorized
execution)
– Different compression codecs
– Many more; contribute to the discussion on
GitHub!
31. Summary
• CStore: Open source columnar store fdw for
Postgres
• Improves query times, reduces disk I/O, and
reduces disk utilization
• Uses foreign data wrapper APIs
1. Supports all PostgreSQL data types
2. Statistics collection for better query plans
3. Load extension. Create Table. Copy. (sketch below)
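A minimal end-to-end sketch of that three-step flow, following the cstore_fdw README (the CSV path is hypothetical):

-- 1. load the extension and define a server
CREATE EXTENSION cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

-- 2. create a columnar foreign table
CREATE FOREIGN TABLE customer_reviews (
    customer_id   text,
    review_date   date,
    review_rating int
)
SERVER cstore_server
OPTIONS (compression 'pglz');

-- 3. load data with COPY
COPY customer_reviews FROM '/tmp/customer_reviews.csv' WITH CSV;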
32. cstore_fdw – Columnar Store
for Analytic Workloads
Hadi Moshayedi – hadi@citusdata.com
Ben Redman – ben@citusdata.com
Editor's notes
Columnar store for PostgreSQL
Ozgun .. founder at Citus Data
SF and Istanbul <short bio>
Hadi did bulk of the work on the columnar store
Have about 30 slides and a demo. I’ll put things into context with 2 slides on Citus
Technical talk. If you have questions, please feel free to interrupt
Speak slowly.
Team trip in Ayvalık
Why did we build cstore_fdw? Context around what we built and why cstore_fdw is very applicable to our users
When I say extends: we didn’t take a particular version of Postgres and fork from there. Instead we went from 8.4 to 9.0, etc.
We used the existing API and integration points: query planner and executor hooks are an example.
Let’s take an example distributed table, and see how it’s spread across the worker nodes.
The yellow boxes here are shards that make up the distributed table.
Worker node extensions
Master node extensions
1 shard = 1 postgres table = 1 cstore table
I/O bottlenecks can be even more of an issue because of parallelism
Column “Id” is laid out sequentially on disk, then column “size”, and so forth.
I just spoke about how we reduce I/O; you also get better compression. Why?
Now that we’re motivated, let’s do a demo!
Before we started, we wanted to get a picture of the landscape
(1/ you could integrate your storage engine back into a popular database)
Talk about how Hadoop was working on solving this problem because they have similar needs (reading/writing bytes), all open source and shared. The ORC file format was developed by FB and Hortonworks
* Pick the best of the latter two approaches
RCFile paper published in ICDE ’11 – Performance comparisons in the paper. FB and Ohio State
First do some horizontal partitioning, then do vertical partitioning (use examples)
Adopted by Hive and Pig – projects within the Hadoop ecosystem
The second generation specification supersedes the first one
The specification is open on the web
Reiterate how indexes work
Second generation. Developed by Hortonworks and Facebook
ORC columnar file layout
Lightweight indexes fit into memory (min/max values for each column)
(Stripes allow you to benefit from sequential I/O read benefits – you read in bigger chunks from disk – not so applicable to SSDs)
Decompress only related blocks (lower decompression overhead) -> evolutionary approach
Block – indexes are per block
Block – compression is per block (talk a bit about this in a second)
Index data kept in protocol buffer format -> backward compatible
The difference between TOAST tables and us is that we compress at the block level, so hopefully better
What does this mean for you?
Lineitem goes from 9.1GB -> 2.4GB
1/ In-memory: Effective memory size increases (If you have 1GB of RAM, your working set of 3-4GB can now fit into RAM)
2/ SSD: SSDs are expensive and you save notably from storage costs. You also read less from disk. Reduce disk bottlenecks.
3/ Rotational: Your disk I/O bound query performance significantly improves. Also, if the user stores PB of data in a distributed cluster, the customer saves from hardware costs.
Cstore can also keep min, max, sum, count, etc.
Limited set of types INT**, BOOL, TEXT**, DECIMAL**, TIMESTAMP
Decided to use PostgreSQL’s datum representation for saving the data.
FDWs offer a nice API to collect a random sample from the data.
Looking to improve cost estimation for cstore_fdw query costs.
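Concretely, running ANALYZE on a cstore table goes through that FDW sampling API, giving the planner row counts and column statistics:

-- builds planner statistics via the FDW's sampling callback
ANALYZE customer_reviews;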
TPC-H is an ad-hoc, decision support benchmark.
Each table has between 10-20 columns. So not the best benchmark to demonstrate column store performance.
Talk about what graphs are going to show
m3.2xlarge (2 x 80G SSD, 30G ram, 4x3.25 ECU - 10G tests)
m2.2xlarge (1 x 850G HDD, 34.2G ram, 4x3.25 ECU - 10G tests)
Representative queries
Q6: 68s -> 25s (Q3: 85s -> 44s)
1/ Reduces disk bottlenecks
2/ Saves on disk costs
cstore is slightly faster. cstore with compression is slightly slower due to the compression’s CPU cost.
Effective memory size increases
1/ Compression (Instead of fitting 1GB, users can now fit in 2-3GB)
2/ If queries always select a subset of the columns, only those columns occupy the working set
3/ Ideally, skip indexes are always kept in memory (they get referenced on each query)
Bug fixes!
Better cost estimates for join operations!
Improves query times, reduces disk I/O, and reduces disk utilization