What's the great thing about a database? Why, it stores data, of course! But one feature that makes a database useful is the variety of data types it can store, and the breadth and sophistication of the data types in PostgreSQL are second to none, including some novel data types that do not exist in any other database software!
This talk will take an in-depth look at the special data types built right into PostgreSQL version 9.4, including:
* INET types
* UUIDs
* Geometries
* Arrays
* Ranges
* Document-based Data Types:
  * Key-value store (hstore)
  * JSON (text [JSON] & binary [JSONB])
We will also have some cleverly concocted examples to show how all of these data types can work together harmoniously.
On Beyond (PostgreSQL) Data Types
1. On Beyond Data Types
Jonathan S. Katz
PostgreSQL España
February 16, 2015
2. About
• CTO, VenueBook
• Co-Organizer, NYC PostgreSQL User Group (NYCPUG)
• Director, United States PostgreSQL Association
• First time in Spain!
• @jkatz05
3. A Brief Note on NYCPUG
• Active since 2010
• Over 1,300 members
• Monthly Meetups
• PGConf NYC 2014
• 259 attendees
• PGConf US 2015:
• Mar 25 - 27 @ New York Marriott Downtown
• Already 160+ registrations
12. I kid you not, I can spend close to an hour on just those data types
13. PostgreSQL Primitives
Oversimplified Summary
• Strings
• Use "text" unless you need an actual length limit on strings; in that case use "varchar"
• Don't use "char"
• Integers
• Use "int"
• If you seriously have big numbers, use "bigint"
• Numerical types
• Use "numeric" almost always
• If you have an IEEE 754 data source you need to record, use "float"
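The "numeric" versus "float" advice can be illustrated outside the database. The following is a minimal Python sketch, assuming that Python's `decimal.Decimal` is a fair stand-in for exact decimal arithmetic of the kind "numeric" provides and that Python's `float` is IEEE 754 double precision; it is an analogy, not PostgreSQL's implementation.

```python
from decimal import Decimal

# IEEE 754 binary floating point cannot represent 0.1 or 0.2 exactly,
# so the sum is only approximately 0.3.
float_sum = 0.1 + 0.2

# Exact decimal arithmetic (the analogue of "numeric" in this sketch)
# keeps decimal fractions exact.
decimal_sum = Decimal("0.1") + Decimal("0.2")

print(float_sum == 0.3)               # False
print(decimal_sum == Decimal("0.3"))  # True
```

This is why exact decimal types are the usual default for money and other values where rounding surprises are unwelcome.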
14. And If We Had More Time
• (argh no pun intended)
• timestamp with time zone, timestamp without time zone
• date
• time with time zone, time without time zone
• interval
15. Summary of PostgreSQL Date/Time Types
• They are AWESOME
• Flexible input that you can customize
• Can perform mathematical operations in native format
• Thank you intervals!
• IMO better support than most programming languages have, let alone databases
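To make "mathematical operations in native format" concrete, here is a small Python sketch that uses `datetime` and `timedelta` as rough analogues of PostgreSQL timestamps and intervals. The two dates come from this deck; the 09:00 start time and the 50-minute length are made-up values for illustration.

```python
from datetime import date, datetime, timedelta

# timestamp + interval -> timestamp
talk_start = datetime(2015, 2, 16, 9, 0)   # hypothetical start time
talk_length = timedelta(minutes=50)        # hypothetical interval
talk_end = talk_start + talk_length

# date - date -> a number of days
days_until_pgconf = date(2015, 3, 25) - date(2015, 2, 16)

print(talk_end)                # 2015-02-16 09:50:00
print(days_until_pgconf.days)  # 37
```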
17. PostgreSQL is an ORDBMS
• Designed to support more complex data types
• Complex data types => additional functionality
• Data Integrity
• Performance
24. Geometric Performance
CREATE TABLE houses (plot box);

INSERT INTO houses
SELECT box(
    point((500 * random())::int, (500 * random())::int),
    point((750 * random() + 500)::int, (750 * random() + 500)::int)
)
FROM generate_series(1, 1000000);

obdt=# CREATE INDEX houses_plot_idx ON houses (plot);
ERROR:  data type box has no default operator class for access method "btree"
HINT:  You must specify an operator class for the index or define a default operator class for the data type.
25. Solution #1: Expression Indexes
obdt=# EXPLAIN ANALYZE SELECT * FROM houses WHERE area(plot) BETWEEN 50000 AND 75000;
-------------
 Seq Scan on houses (cost=0.00..27353.00 rows=5000 width=32) (actual time=0.077..214.431 rows=26272 loops=1)
   Filter: ((area(plot) >= 50000::double precision) AND (area(plot) <= 75000::double precision))
   Rows Removed by Filter: 973728
 Total runtime: 215.965 ms

obdt=# CREATE INDEX houses_plot_area_idx ON houses (area(plot));

obdt=# EXPLAIN ANALYZE SELECT * FROM houses WHERE area(plot) BETWEEN 50000 AND 75000;
------------
 Bitmap Heap Scan on houses (cost=107.68..7159.38 rows=5000 width=32) (actual time=5.433..14.686 rows=26272 loops=1)
   Recheck Cond: ((area(plot) >= 50000::double precision) AND (area(plot) <= 75000::double precision))
   -> Bitmap Index Scan on houses_plot_area_idx (cost=0.00..106.43 rows=5000 width=0) (actual time=4.300..4.300 rows=26272 loops=1)
        Index Cond: ((area(plot) >= 50000::double precision) AND (area(plot) <= 75000::double precision))
 Total runtime: 16.025 ms
http://www.postgresql.org/docs/current/static/indexes-expressional.html
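The idea behind an expression index can be sketched in a few lines of Python: precompute the expression (here, the area of each box) once, keep the results sorted, and answer a BETWEEN query with a binary search instead of a full scan. This is an illustrative model only; the names `lookup` and `seq_scan`, the 10,000-row sample, and the sorted-list layout are assumptions, and PostgreSQL's B-tree is considerably more involved.

```python
import bisect
import random

random.seed(0)

def area(box):
    """Area of a box given as two corner points."""
    (x1, y1), (x2, y2) = box
    return abs(x2 - x1) * abs(y2 - y1)

# A small stand-in for the "houses" table (corner ranges follow the slide).
houses = [
    ((random.randint(0, 500), random.randint(0, 500)),
     (random.randint(500, 1250), random.randint(500, 1250)))
    for _ in range(10_000)
]

# "Index build": (expression value, row position) pairs sorted by value.
index = sorted((area(plot), i) for i, plot in enumerate(houses))
keys = [key for key, _ in index]

def lookup(lo, hi):
    """Rows with lo <= area(plot) <= hi, found by binary search on the index."""
    left = bisect.bisect_left(keys, lo)
    right = bisect.bisect_right(keys, hi)
    return [houses[i] for _, i in index[left:right]]

def seq_scan(lo, hi):
    """The unindexed alternative: evaluate the expression for every row."""
    return [plot for plot in houses if lo <= area(plot) <= hi]

print(len(lookup(50_000, 75_000)) == len(seq_scan(50_000, 75_000)))  # True
```

The trade-off is the same as in the database: the index costs space and must be maintained on writes, but reads no longer evaluate `area` for every row.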
26. Solution #2: GiST Indexes
obdt=# EXPLAIN ANALYZE SELECT * FROM houses WHERE plot @> '((100,100),(300,300))'::box;
------------
 Seq Scan on houses (cost=0.00..19853.00 rows=1000 width=32) (actual time=0.009..96.680 rows=40520 loops=1)
   Filter: (plot @> '(300,300),(100,100)'::box)
   Rows Removed by Filter: 959480
 Total runtime: 98.662 ms

obdt=# CREATE INDEX houses_plot_gist_idx ON houses USING gist(plot);

obdt=# EXPLAIN ANALYZE SELECT * FROM houses WHERE plot @> '((100,100),(300,300))'::box;
------------
 Bitmap Heap Scan on houses (cost=56.16..2813.20 rows=1000 width=32) (actual time=12.053..24.468 rows=40520 loops=1)
   Recheck Cond: (plot @> '(300,300),(100,100)'::box)
   -> Bitmap Index Scan on houses_plot_gist_idx (cost=0.00..55.91 rows=1000 width=0) (actual time=10.700..10.700 rows=40520 loops=1)
        Index Cond: (plot @> '(300,300),(100,100)'::box)
 Total runtime: 26.451 ms
http://www.postgresql.org/docs/current/static/indexes-types.html
27. Solution #2+: KNN-GiST
obdt=# CREATE TABLE locations (geocode point);

obdt=# INSERT INTO locations
SELECT point(90 * random(), 180 * random())
FROM generate_series(1, 1000000);

obdt=# EXPLAIN ANALYZE SELECT * FROM locations ORDER BY geocode <-> point(41.88853,-87.628852) LIMIT 10;
------------
 Limit (cost=39519.39..39519.42 rows=10 width=16) (actual time=319.306..319.309 rows=10 loops=1)
   -> Sort (cost=39519.39..42019.67 rows=1000110 width=16) (actual time=319.305..319.307 rows=10 loops=1)
        Sort Key: ((geocode <-> '(41.88853,-87.628852)'::point))
        Sort Method: top-N heapsort Memory: 25kB
        -> Seq Scan on locations (cost=0.00..17907.38 rows=1000110 width=16) (actual time=0.019..189.687 rows=1000000 loops=1)
 Total runtime: 319.332 ms

obdt=# CREATE INDEX locations_geocode_gist_idx ON locations USING gist(geocode);

obdt=# EXPLAIN ANALYZE SELECT * FROM locations ORDER BY geocode <-> point(41.88853,-87.628852) LIMIT 10;
------------
 Limit (cost=0.29..1.06 rows=10 width=16) (actual time=0.098..0.235 rows=10 loops=1)
   -> Index Scan using locations_geocode_gist_idx on locations (cost=0.29..77936.29 rows=1000000 width=16) (actual time=0.097..0.234 rows=10 loops=1)
        Order By: (geocode <-> '(41.88853,-87.628852)'::point)
 Total runtime: 0.257 ms
http://www.slideshare.net/jkatz05/knn-39127023
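For intuition, the unindexed plan above (a sequential scan feeding a top-N heapsort) corresponds roughly to the following Python sketch, which computes the Euclidean distance to every point and keeps the 10 smallest. The function name `nearest` and the 100,000-point sample are assumptions for illustration; a KNN-GiST index avoids visiting every row, which this sketch does not attempt to model.

```python
import heapq
import math
import random

random.seed(1)

# A smaller stand-in for the "locations" table from the slide.
locations = [(90 * random.random(), 180 * random.random())
             for _ in range(100_000)]
target = (41.88853, -87.628852)

def nearest(points, origin, k=10):
    """Return the k points closest to origin (Euclidean distance, like <->)."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, origin))

closest = nearest(locations, target)
print(len(closest))  # 10
```

Every call still touches all points, which is why the unindexed query time grows with table size while the indexed one stays small.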
28. Solution #3: PostGIS
• For when you are doing real things with shapes
• (and geographic information systems)
29. For more on PostGIS, please go back in time to yesterday and see Regina & Leo's tutorial
30. Let's Take a Break With UUIDs
2024e06c-44ff-5047-b1ae-00def276d043
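As a quick illustration of what is packed into that value, here is a Python sketch that parses the UUID shown above with the standard `uuid` module. It assumes nothing about how the value was generated beyond what its bits say.

```python
import uuid

u = uuid.UUID("2024e06c-44ff-5047-b1ae-00def276d043")

# A UUID is a 128-bit value: 16 bytes, however it is displayed.
print(len(u.bytes))  # 16

# The first hex digit of the third group encodes the UUID version.
print(u.version)     # 5

# The textual form round-trips unchanged.
print(str(u))        # 2024e06c-44ff-5047-b1ae-00def276d043
```

Storing the value as a native 16-byte type rather than as 36 characters of text is the main practical argument for a dedicated UUID type.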
34. Networks can do Math
http://www.postgresql.org/docs/current/static/functions-net.html
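The kind of arithmetic meant here can be sketched with Python's `ipaddress` module, which offers operations comparable to PostgreSQL's network operators and functions (adding an offset to an address, subtracting addresses, testing containment, deriving the netmask and broadcast address). All addresses below are made-up examples, and the mapping to PostgreSQL operators is an analogy rather than an exact equivalence.

```python
import ipaddress

addr = ipaddress.ip_address("192.168.1.5")
net = ipaddress.ip_network("192.168.1.0/24")

# Address + integer offset (compare: inet + bigint)
shifted = addr + 25

# Difference between two addresses (compare: inet - inet)
distance = int(ipaddress.ip_address("192.168.1.43")) - int(addr)

# Containment test (compare: the "is contained within" operator <<)
inside = addr in net

print(shifted)                # 192.168.1.30
print(distance)               # 38
print(inside)                 # True
print(net.netmask)            # 255.255.255.0
print(net.broadcast_address)  # 192.168.1.255
```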
35. Postgres Can Help Manage Your Routing Tables
...perhaps with a foreign data wrapper and a background worker, it could fully manage your routing tables?
http://www.postgresql.org/docs/current/static/functions-net.html
36. Arrays
• ...because a database is an "array" of tuples
• ...and a "tuple" is kind of like an array
• ...can we have an array within a tuple?
43. Array Functions
Array to String
obdt=# SELECT array_to_string(ARRAY[1,2,NULL,4], ',', '*');
-----------------
 1,2,*,4

Array to Set
obdt=# SELECT unnest(ARRAY[1,2,3]);
 unnest
--------
      1
      2
      3
http://www.postgresql.org/docs/current/static/functions-array.html
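The behaviour of the two functions above can be modelled in Python as follows. This is a sketch of the semantics only: the Python functions reuse the PostgreSQL names for readability, `None` stands in for SQL NULL, and the rule that NULL elements are skipped when no replacement string is given follows the PostgreSQL documentation.

```python
def array_to_string(values, delimiter, null_string=None):
    """Join elements with a delimiter; NULLs are replaced or skipped."""
    parts = []
    for value in values:
        if value is None:
            if null_string is None:
                continue            # no replacement given: skip the NULL
            parts.append(null_string)
        else:
            parts.append(str(value))
    return delimiter.join(parts)

def unnest(values):
    """Expand an array into a set of rows, one element per row."""
    for value in values:
        yield value

print(array_to_string([1, 2, None, 4], ",", "*"))  # 1,2,*,4
print(array_to_string([1, 2, None, 4], ","))       # 1,2,4
print(list(unnest([1, 2, 3])))                     # [1, 2, 3]
```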
44. array_agg
• useful for variable-length lists or "unknown # of columns"
obdt=# SELECT
    t.title,
    array_agg(s.full_name)
FROM talk t
JOIN speakers_talks st ON st.talk_id = t.id
JOIN speaker s ON s.id = st.speaker_id
GROUP BY t.title;

     title      |        array_agg
----------------+-------------------------
 Data Types     | {Jonathan, Jim}
 Administration | {Bruce}
 User Groups    | {Josh, Jonathan, Magnus}
http://www.postgresql.org/docs/current/static/functions-array.html
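What `array_agg` does with GROUP BY can be sketched in Python as collecting one list per group. The rows below mirror the joined result shown above; the helper name `array_agg_by_title` is made up for this illustration.

```python
from collections import defaultdict

# (title, full_name) pairs, as the three-way join would produce them.
joined_rows = [
    ("Data Types", "Jonathan"),
    ("Data Types", "Jim"),
    ("Administration", "Bruce"),
    ("User Groups", "Josh"),
    ("User Groups", "Jonathan"),
    ("User Groups", "Magnus"),
]

def array_agg_by_title(rows):
    """Group rows by title and collect the names into a list per title."""
    groups = defaultdict(list)
    for title, full_name in rows:
        groups[title].append(full_name)
    return dict(groups)

speakers = array_agg_by_title(joined_rows)
print(speakers["User Groups"])  # ['Josh', 'Jonathan', 'Magnus']
```

Each group can hold a different number of names, which is the "unknown # of columns" case that a fixed set of result columns cannot express.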
47. Before Postgres 9.2
• OVERLAPS
SELECT
    ('2013-01-08'::date, '2013-01-10'::date) OVERLAPS
    ('2013-01-09'::date, '2013-01-12'::date);
• Limitations:
• Only date/time
• Bounds are fixed: Start <= x < End
48. Postgres 9.2+
• INT4RANGE (integer)
• INT8RANGE (bigint)
• NUMRANGE (numeric)
• TSRANGE (timestamp without time zone)
• TSTZRANGE (timestamp with time zone)
• DATERANGE (date)
http://www.postgresql.org/docs/current/static/rangetypes.html
49. Range Type Size
• Size on disk = 2 * (data type) + 1
• sometimes magic if bounds are equal
obdt=# SELECT pg_column_size(daterange(CURRENT_DATE, CURRENT_DATE));
----------------
 9

obdt=# SELECT pg_column_size(daterange(CURRENT_DATE,CURRENT_DATE + 1));
----------------
 17
50. Range Bounds
• Ranges can be inclusive, exclusive or both
• [2,4] => 2 ≤ x ≤ 4
• [2,4) => 2 ≤ x < 4
• (2,4] => 2 < x ≤ 4
• (2,4) => 2 < x < 4
• Can also be empty
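The four bound combinations above can be written down as a small Python model. The functions `in_range` and `is_empty` are hypothetical helpers for this sketch, the bounds string uses the same bracket notation as the slide, and the default of `"[)"` matches the default used by PostgreSQL's range constructor functions.

```python
def in_range(x, lower, upper, bounds="[)"):
    """Is x inside the range? '[' and ']' are inclusive, '(' and ')' exclusive."""
    lower_ok = x >= lower if bounds[0] == "[" else x > lower
    upper_ok = x <= upper if bounds[1] == "]" else x < upper
    return lower_ok and upper_ok

def is_empty(lower, upper, bounds="[)"):
    """With equal bounds, only the fully inclusive range contains anything."""
    return lower == upper and bounds != "[]"

print(in_range(4, 2, 4, "[]"))  # True:  2 <= 4 <= 4
print(in_range(4, 2, 4, "[)"))  # False: upper bound is exclusive
print(in_range(2, 2, 4, "(]"))  # False: lower bound is exclusive
print(is_empty(2, 2, "[)"))     # True:  no x satisfies 2 <= x < 2
print(is_empty(2, 2, "[]"))     # False: contains exactly 2
```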
51. Infinite Ranges
• Ranges can be infinite
– [2,) => 2 ≤ x < ∞
– (,2] => -∞ < x ≤ 2
• CAVEAT EMPTOR
– “infinity” has special meaning with timestamp ranges
– [CURRENT_TIMESTAMP,) = [CURRENT_TIMESTAMP,]
– [CURRENT_TIMESTAMP, 'infinity') <> [CURRENT_TIMESTAMP, 'infinity']
54. Finding Overlapping Ranges
obdt=# SELECT *
FROM cars
WHERE cars.price_range && int4range(13000, 15000, '[]')
ORDER BY lower(cars.price_range);
-----------
 id |        name         |  price_range
----+---------------------+---------------
  5 | Ford Mustang        | [11000,15001)
  6 | Lincoln Continental | [12000,14001)
http://www.postgresql.org/docs/current/static/functions-range.html
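Two things happen in the query above: the inclusive integer range [13000,15000] is stored in canonical half-open form, which is why the output shows bounds such as 15001, and `&&` tests whether two ranges share at least one value. A minimal Python sketch of both, assuming non-empty integer ranges; the helper names and the third car are invented for illustration.

```python
def canonical(lower, upper, bounds="[)"):
    """Convert an integer range to the half-open form [lower, upper)."""
    if bounds[0] == "(":
        lower += 1
    if bounds[1] == "]":
        upper += 1
    return (lower, upper)

def overlaps(a, b):
    """Do two non-empty half-open ranges share at least one value (like &&)?"""
    return a[0] < b[1] and b[0] < a[1]

cars = {
    "Ford Mustang": (11000, 15001),
    "Lincoln Continental": (12000, 14001),
    "Hypothetical Luxury Car": (20000, 30001),  # made-up row that should not match
}

search = canonical(13000, 15000, "[]")
matches = sorted(
    (name for name, price_range in cars.items() if overlaps(price_range, search)),
    key=lambda name: cars[name][0],             # ORDER BY lower(price_range)
)

print(search)   # (13000, 15001)
print(matches)  # ['Ford Mustang', 'Lincoln Continental']
```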
55. Ranges + GiST
obdt=# CREATE INDEX ranges_bounds_gist_idx ON ranges USING gist (bounds);

obdt=# EXPLAIN ANALYZE SELECT * FROM ranges WHERE int4range(500,1000) && bounds;
------------
 Bitmap Heap Scan on ranges (actual time=0.283..0.370 rows=653 loops=1)
   Recheck Cond: ('[500,1000)'::int4range && bounds)
   -> Bitmap Index Scan on ranges_bounds_gist_idx (actual time=0.275..0.275 rows=653 loops=1)
        Index Cond: ('[500,1000)'::int4range && bounds)
 Total runtime: 0.435 ms
56. Large Search Range?
test=# EXPLAIN ANALYZE SELECT * FROM ranges WHERE int4range(10000,1000000) && bounds;
 QUERY PLAN
-------------
 Bitmap Heap Scan on ranges (actual time=184.028..270.323 rows=993068 loops=1)
   Recheck Cond: ('[10000,1000000)'::int4range && bounds)
   -> Bitmap Index Scan on ranges_bounds_gist_idx (actual time=183.060..183.060 rows=993068 loops=1)
        Index Cond: ('[10000,1000000)'::int4range && bounds)
 Total runtime: 313.743 ms
57. SP-GiST
• space-partitioned generalized search tree
• ideal for non-balanced data structures
– k-d trees, quad-trees, suffix trees
– divides search space into partitions of unequal size
• matching partitioning rule = fast search
• traditionally designed for "in-memory" use, adapted here to play nicely with disk I/O
http://www.postgresql.org/docs/9.3/static/spgist.html
66. hstore Performance
obdt=# EXPLAIN ANALYZE SELECT * FROM keypairs WHERE data ? '3';
-----------------------
 Seq Scan on keypairs (cost=0.00..19135.06 rows=950 width=32) (actual time=0.071..214.007 rows=1 loops=1)
   Filter: (data ? '3'::text)
   Rows Removed by Filter: 999999
 Total runtime: 214.028 ms

obdt=# CREATE INDEX keypairs_data_gin_idx ON keypairs USING gin(data);

obdt=# EXPLAIN ANALYZE SELECT * FROM keypairs WHERE data ? '3';
--------------
 Bitmap Heap Scan on keypairs (cost=27.75..2775.66 rows=1000 width=24) (actual time=0.046..0.046 rows=1 loops=1)
   Recheck Cond: (data ? '3'::text)
   -> Bitmap Index Scan on keypairs_data_gin_idx (cost=0.00..27.50 rows=1000 width=0) (actual time=0.041..0.041 rows=1 loops=1)
        Index Cond: (data ? '3'::text)
 Total runtime: 0.073 ms
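The speed-up comes from the inverted structure of a GIN index: instead of opening every row to look for a key, the index maps each key to the rows that contain it. A toy Python sketch of that idea follows; the three sample rows and the helper names are assumptions, and a real GIN index stores far more than this dictionary does.

```python
from collections import defaultdict

# Stand-in for the "keypairs" table: each row is one hstore-like dict.
rows = [
    {"1": "a"},
    {"2": "b", "3": "c"},
    {"4": "d"},
]

def build_inverted_index(table):
    """Map every key to the set of row positions that contain it."""
    index = defaultdict(set)
    for row_id, row in enumerate(table):
        for key in row:
            index[key].add(row_id)
    return index

def rows_with_key(index, key):
    """Answer 'data ? key' from the index without scanning the table."""
    return sorted(index.get(key, ()))

gin_like = build_inverted_index(rows)
print(rows_with_key(gin_like, "3"))   # [1]
print(rows_with_key(gin_like, "99"))  # []
```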
67. JSON and PostgreSQL
• Started in 2010 as a Google Summer of Code Project
• https://wiki.postgresql.org/wiki/JSON_datatype_GSoC_2010
• Goal:
• be similar to XML data type functionality in Postgres
• be committed as an extension for PostgreSQL 9.1
68. What Happened?
• Different proposals over how to finalize the implementation
• binary vs. text
• Core vs Extension
• Discussions between “old” vs. “new” ways of packaging for extensions
71. PostgreSQL 9.2: JSON
• JSON data type in core PostgreSQL
• based on RFC 4627
• only “strictly” follows if your database encoding is UTF-8
• text-based format
• checks for validity
72. PostgreSQL 9.2: JSON
obdt=# SELECT '[{"PUG": "NYC"}]'::json;
------------------
 [{"PUG": "NYC"}]

obdt=# SELECT '[{"PUG": "NYC"]'::json;
ERROR:  invalid input syntax for type json at character 8
DETAIL:  Expected "," or "}", but found "]".
CONTEXT:  JSON data, line 1: [{"PUG": "NYC"]
http://www.postgresql.org/docs/current/static/datatype-json.html
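The validity check shown above has a close analogue in most JSON libraries. As a rough Python sketch, using the same two inputs as the slide (the helper name `is_valid_json` is made up, and Python's parser is not guaranteed to accept exactly the same inputs as PostgreSQL's):

```python
import json

def is_valid_json(text):
    """Return True if the text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True

print(is_valid_json('[{"PUG": "NYC"}]'))  # True
print(is_valid_json('[{"PUG": "NYC"]'))   # False: the object is never closed
```

Doing this check at insert time means malformed documents are rejected once, on the way in, rather than discovered later by whichever query reads them.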
76. PostgreSQL 9.3: JSON Ups its Game
• Added operators and functions to read / prepare JSON
• Added casts from hstore to JSON
77. PostgreSQL 9.3: JSON
 Operator |                      Description                       |                       Example
----------+--------------------------------------------------------+------------------------------------------------------
 ->       | return JSON array element OR JSON object field         | '[1,2,3]'::json -> 0;
          |                                                        | '{"a": 1, "b": 2, "c": 3}'::json -> 'b';
 ->>      | return JSON array element OR JSON object field AS text | '[1,2,3]'::json ->> 0;
          |                                                        | '{"a": 1, "b": 2, "c": 3}'::json ->> 'b';
 #>       | return JSON object using path                          | '{"a": 1, "b": 2, "c": [1,2,3]}'::json #> '{c, 0}';
 #>>      | return JSON object using path AS text                  | '{"a": 1, "b": 2, "c": [1,2,3]}'::json #>> '{c, 0}';
http://www.postgresql.org/docs/current/static/functions-json.html
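The difference between the four operators is easier to see in a small model. The Python sketch below works on already-parsed JSON values; `get`, `as_text`, and `get_path` are hypothetical helpers standing in for `->`, the "AS text" step, and `#>`. It uses `None` for both JSON null and SQL NULL, which blurs a distinction PostgreSQL keeps, and the text it produces for nested values may be formatted differently from PostgreSQL's.

```python
import json

def get(value, key):
    """Like ->: index an array by integer or an object by field name."""
    if isinstance(value, list) and isinstance(key, int):
        return value[key] if 0 <= key < len(value) else None
    if isinstance(value, dict) and isinstance(key, str):
        return value.get(key)
    return None

def as_text(value):
    """The extra step in ->> and #>>: return the result as text."""
    if value is None:
        return None
    return value if isinstance(value, str) else json.dumps(value)

def get_path(value, path):
    """Like #>: follow a list of text path elements into the document."""
    for step in path:
        if isinstance(value, list):
            if not step.isdigit():
                return None
            step = int(step)      # array positions arrive as text
        value = get(value, step)
        if value is None:
            return None
    return value

doc = {"a": 1, "b": 2, "c": [1, 2, 3]}
print(get([1, 2, 3], 0))                    # 1
print(get(doc, "b"))                        # 2   (still a JSON number)
print(as_text(get(doc, "b")))               # '2' (now text)
print(get_path(doc, ["c", "0"]))            # 1
print(as_text(get_path(doc, ["c", "0"])))   # '1'
```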
78. Operator Gotchas
SELECT * FROM category_documents
WHERE data->'title' = 'PostgreSQL';
ERROR:  operator does not exist: json = unknown
LINE 1: ...ECT * FROM category_documents WHERE data->'title' = 'Postgre...
                                                             ^
HINT:  No operator matches the given name and argument type(s). You might need to add explicit type casts.
79. Operator Gotchas
SELECT * FROM category_documents
WHERE data->>'title' = 'PostgreSQL';
-----------------------
 {"cat_id":252739,"cat_pages":14,"cat_subcats":0,"cat_files":0,"title":"PostgreSQL"}
(1 row)
80. For the Upcoming Examples
• Wikipedia English category titles – all 1,823,644 that I downloaded
• Relation looks something like:
   Column    |  Type   |     Modifiers
-------------+---------+--------------------
 cat_id      | integer | not null
 cat_pages   | integer | not null default 0
 cat_subcats | integer | not null default 0
 cat_files   | integer | not null default 0
 title       | text    |
81. Performance?
EXPLAIN ANALYZE SELECT * FROM category_documents
WHERE data->>'title' = 'PostgreSQL';
---------------------
 Seq Scan on category_documents (cost=0.00..57894.18 rows=9160 width=32) (actual time=360.083..2712.094 rows=1 loops=1)
   Filter: ((data ->> 'title'::text) = 'PostgreSQL'::text)
   Rows Removed by Filter: 1823643
 Total runtime: 2712.127 ms
82. Performance?
CREATE INDEX category_documents_idx ON category_documents (data);
ERROR:  data type json has no default operator class for access method "btree"
HINT:  You must specify an operator class for the index or define a default operator class for the data type.
83. Let’s Be Clever
• json_extract_path, json_extract_path_text
• Like #> and #>>, but take the path as a list of arguments
SELECT json_extract_path(
    '{"a": 1, "b": 2, "c": [1,2,3]}'::json,
    'c', '0');
--------
 1
84. Performance Revisited
CREATE INDEX category_documents_data_idx
ON category_documents
    (json_extract_path_text(data, 'title'));

obdt=# EXPLAIN ANALYZE
SELECT * FROM category_documents
WHERE json_extract_path_text(data, 'title') = 'PostgreSQL';
------------
 Bitmap Heap Scan on category_documents (cost=303.09..20011.96 rows=9118 width=32) (actual time=0.090..0.091 rows=1 loops=1)
   Recheck Cond: (json_extract_path_text(data, VARIADIC '{title}'::text[]) = 'PostgreSQL'::text)
   -> Bitmap Index Scan on category_documents_data_idx (cost=0.00..300.81 rows=9118 width=0) (actual time=0.086..0.086 rows=1 loops=1)
        Index Cond: (json_extract_path_text(data, VARIADIC '{title}'::text[]) = 'PostgreSQL'::text)
 Total runtime: 0.105 ms
85. The Relation vs JSON
• Size on Disk
• category (relation) - 136MB
• category_documents (JSON) - 238MB
• Index Size for “title”
• category - 89MB
• category_documents - 89MB
• Average Performance for looking up “PostgreSQL”
• category - 0.065ms
• category_documents - 0.070ms
86. JSON Aggregates
• (this is pretty cool)
• json_agg
SELECT b, json_agg(stuff)
FROM stuff
GROUP BY b;

  b   |             json_agg
------+----------------------------------
 neat | [{"a":4,"b":"neat","c":[4,5,6]}]
 wow  | [{"a":1,"b":"wow","c":[1,2,3]}, +
      |  {"a":3,"b":"wow","c":[7,8,9]}]
 cool | [{"a":2,"b":"cool","c":[4,5,6]}]
http://www.postgresql.org/docs/current/static/functions-json.html
87. hstore gets in the game
• hstore_to_json
• converts hstore to json, treating all values as strings
• hstore_to_json_loose
• converts hstore to json, but also tries to distinguish between data types and “convert” them to proper JSON representations
SELECT hstore_to_json_loose('"a key"=>1, b=>t, c=>null, d=>12345, e=>012345, f=>1.234, g=>2.345e+4');
----------------
 {"b": true, "c": null, "d": 12345, "e": "012345", "f": 1.234, "g": 2.345e+4, "a key": 1}
88. Next Steps?
• In PostgreSQL 9.3, JSON became much more useful, but…
• Difficult to search within JSON
• Difficult to build new JSON objects
90. “Nested hstore”
• Proposed at PGCon 2013 by Oleg Bartunov and Teodor Sigaev
• Hierarchical key-value storage system that supports arrays too and stored in binary format
• Takes advantage of GIN indexing mechanism in PostgreSQL
• “Generalized Inverted Index”
• Built to search within composite objects
• Arrays, fulltext search, hstore
• …JSON?
http://www.pgcon.org/2013/schedule/attachments/280_hstore-pgcon-2013.pdf
91. How JSONB Came to Be
• JSON is the “lingua franca” for transmitting data on the web
• The PostgreSQL JSON type was in a text format and preserved text exactly as input
• e.g. duplicate keys are preserved
• Create a new data type that merges the nested hstore work to create a JSON type stored in a binary format: JSONB
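The text-versus-binary distinction can be sketched in Python by comparing a raw JSON string with its parsed form. The raw string plays the role of the text JSON type (input kept exactly, duplicate keys and whitespace included), and the parsed dictionary is a loose stand-in for a decomposed format such as JSONB, where only the last value of a duplicated key survives. The sample document is made up, and a Python dict is only an analogy for the on-disk format.

```python
import json

raw = '{"a": 1, "a": 2,    "b": 3}'

# Text storage: the input is kept exactly, duplicates and spacing included.
stored_as_text = raw

# Parsed storage: structure is kept, the original text is not.
stored_as_parsed = json.loads(raw)

print(stored_as_text)    # {"a": 1, "a": 2,    "b": 3}
print(stored_as_parsed)  # {'a': 2, 'b': 3}
```

The parsed form costs work at input time but can be searched and indexed without re-parsing the text on every read.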
92. JSONB ≠ BSON
BSON is a data type created by MongoDB as a “superset of JSON”
JSONB lives in PostgreSQL and is just JSON that is stored in a binary format on disk
93. JSONB Gives Us More Operators
• a @> b - is b contained within a?
• { "a": 1, "b": 2 } @> { "a": 1 } -- TRUE
• a <@ b - is a contained within b?
• { "a": 1 } <@ { "a": 1, "b": 2 } -- TRUE
• a ? b - does the key “b” exist in JSONB a?
• { "a": 1, "b": 2 } ? 'a' -- TRUE
• a ?| b - do any of the keys in the array “b” exist in JSONB a?
• { "a": 1, "b": 2 } ?| ARRAY['b', 'c'] -- TRUE
• a ?& b - do all of the keys in the array “b” exist in JSONB a?
• { "a": 1, "b": 2 } ?& ARRAY['a', 'b'] -- TRUE
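The operator semantics above can be modelled on parsed JSON values in Python. This is a simplified sketch: `contains`, `has_key`, `has_any_key`, and `has_all_keys` are made-up names for `@>`, `?`, `?|`, and `?&`; the key tests only look at top-level object keys; and the special case in which a top-level array contains a bare primitive is not modelled.

```python
def contains(a, b):
    """Like a @> b: is every part of b present, with matching structure, in a?"""
    if isinstance(a, dict) and isinstance(b, dict):
        return all(k in a and contains(a[k], v) for k, v in b.items())
    if isinstance(a, list) and isinstance(b, list):
        return all(any(contains(x, y) for x in a) for y in b)
    if isinstance(a, bool) or isinstance(b, bool):
        # Keep JSON true/false distinct from the numbers 1/0.
        return isinstance(a, bool) and isinstance(b, bool) and a == b
    return a == b

def has_key(a, key):
    """Like a ? key, for top-level object keys."""
    return isinstance(a, dict) and key in a

def has_any_key(a, keys):
    """Like a ?| keys."""
    return any(has_key(a, k) for k in keys)

def has_all_keys(a, keys):
    """Like a ?& keys."""
    return all(has_key(a, k) for k in keys)

doc = {"a": 1, "b": 2}
print(contains(doc, {"a": 1}))        # True   (doc @> {"a": 1})
print(contains({"a": 1}, doc))        # False  (the reverse does not hold)
print(has_key(doc, "a"))              # True
print(has_any_key(doc, ["b", "c"]))   # True
print(has_all_keys(doc, ["b", "c"]))  # False
```

`a <@ b` is simply `contains(b, a)` in this model.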
94. JSONB Gives us GIN
• Recall - GIN indexes are used to "look inside" objects
• JSONB has two flavors of GIN:
• Standard - supports @>, ?, ?|, ?&
• "Path Ops" - supports only @>
CREATE INDEX category_documents_data_idx ON category_documents USING gin(data);
CREATE INDEX category_documents_path_data_idx ON category_documents USING gin(data jsonb_path_ops);
96. JSONB Gives Us Speed
EXPLAIN ANALYZE SELECT * FROM category_documents
    WHERE data @> '{"title": "PostgreSQL"}';
------------
 Bitmap Heap Scan on category_documents (cost=38.13..6091.65 rows=1824 width=153) (actual time=0.021..0.022 rows=1 loops=1)
   Recheck Cond: (data @> '{"title": "PostgreSQL"}'::jsonb)
   Heap Blocks: exact=1
   -> Bitmap Index Scan on category_documents_path_data_idx (cost=0.00..37.68 rows=1824 width=0) (actual time=0.012..0.012 rows=1 loops=1)
        Index Cond: (data @> '{"title": "PostgreSQL"}'::jsonb)
 Planning time: 0.070 ms
 Execution time: 0.043 ms
97. JSONB + Wikipedia Categories: By the Numbers
• Size on Disk
• category (relation) - 136MB
• category_documents (JSON) - 238MB
• category_documents (JSONB) - 325MB
• Index Size for “title”
• category - 89MB
• category_documents (JSON with one key using an expression index) - 89MB
• category_documents (JSONB, all GIN ops) - 311MB
• category_documents (JSONB, just @>) - 203MB
• Average Performance for looking up “PostgreSQL”
• category - 0.065ms
• category_documents (JSON with one key using an expression index) - 0.070ms
• category_documents (JSONB, all GIN ops) - 0.115ms
• category_documents (JSONB, just @>) - 0.045ms
99. In Summary
• PostgreSQL has a lot of advanced data types
• They are easy to access
• They have a lot of functionality around them
• They are durable
• They perform well (but of course must be used correctly)
• Furthermore, you can extend PostgreSQL to:
• Better manipulate your favorite data type
• Create more data types
• ...well, do basically what you want it to do