Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
What is HyperLogLog and
Why You Will Love It
Burak Yücesoy
What is COUNT(DISTINCT)?
• The number of unique elements (cardinality) in a given data set
• Useful for finding things like…
    • The number of unique users who visited your web page
    • The number of unique products in your inventory
What is COUNT(DISTINCT)?
logins
username | date
----------+-----------
Alice | 2018-10-02
Bob | 2018-10-03
Alice | 2018-10-05
Eve | 2018-10-07
Bob | 2018-10-07
Bob | 2018-10-08
• Number of logins: 6
• Number of unique users who log in: 3
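As a small illustration of the table above (a sketch in Python rather than SQL; the list literal simply restates the slide's rows), the two numbers correspond to the length of the list and the size of the set of usernames:

```python
# Rows of the `logins` table from the slide, as (username, date) pairs.
logins = [
    ("Alice", "2018-10-02"),
    ("Bob", "2018-10-03"),
    ("Alice", "2018-10-05"),
    ("Eve", "2018-10-07"),
    ("Bob", "2018-10-07"),
    ("Bob", "2018-10-08"),
]

total_logins = len(logins)                        # like COUNT(*)
unique_users = len({user for user, _ in logins})  # like COUNT(DISTINCT username)
print(total_logins, unique_users)
```

The set has to remember every distinct username it has seen, which is where the memory cost of an exact distinct count comes from.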
Problems with Traditional COUNT(DISTINCT)
• Slow
• High memory footprint
• Cannot work with appended/streaming data
There is a better way!
HyperLogLog (HLL) is a faster alternative to COUNT(DISTINCT) with a low memory footprint:
• Approximation algorithm
• Estimates the cardinality (i.e. COUNT(DISTINCT)) of given data
• Mathematically provable error bounds
• It can estimate cardinalities well beyond 10^9 with a 1% error rate using only 6 KB of memory
Is it OK to approximate?
It depends...
Is it OK to approximate?
• Counting the number of unique felonies associated with a person: not OK
• Counting the number of unique visits to my web page: OK
HLL
• Very fast
• Low memory footprint
• Can work with streaming data
• Can merge estimations of two separate datasets efficiently
How does HLL work?
Steps:
1. Hash all elements
    a. Ensures a uniform data distribution
    b. Lets all data types be treated the same way
2. Observe rare bit patterns
3. Apply stochastic averaging
How does HLL work? - Observing rare bit patterns
Alice → hash: 645403841 → binary: 0010...001
Number of leading zeros: 2
Maximum number of leading zeros: 2
How does HLL work? - Observing rare bit patterns
Bob → hash: 1492309842 → binary: 0101...010
Number of leading zeros: 1
Maximum number of leading zeros: 2
How does HLL work? - Observing rare bit patterns
...
Maximum number of leading zeros: 7
Cardinality estimation: 2^7 = 128
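The leading-zero bookkeeping on the Alice and Bob slides can be sketched in a few lines of Python. This assumes the hash values are read as 32-bit numbers (the two values on the slides fit in 32 bits); the function name is made up for illustration.

```python
def leading_zeros(hash_value: int, width: int = 32) -> int:
    """Count the leading zero bits of hash_value viewed as a width-bit number."""
    return width - hash_value.bit_length()

alice = leading_zeros(645403841)    # binary starts with 0010...
bob = leading_zeros(1492309842)     # binary starts with 0101...

# Keep only the maximum seen so far; the rough estimate is 2 ** maximum.
max_zeros = max(alice, bob)
rough_estimate = 2 ** max_zeros
print(alice, bob, max_zeros, rough_estimate)
```

Only one small integer (the running maximum) is stored, no matter how many elements pass through.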
How does HLL work? Stochastic Averaging
Measure the same thing repeatedly and take the average.
How does HLL work? Stochastic Averaging
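The data is split into partitions, and each partition keeps its own maximum number of leading zeros:
partition   | max. leading zeros | estimate
------------+--------------------+---------
Partition 1 | 7                  | 2^7
Partition 2 | 5                  | 2^5
Partition 3 | 12                 | 2^12
Estimation: 228.968...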
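The slide's estimate of 228.968... is reproduced if the per-partition estimates are combined as (number of partitions) × (harmonic mean of 2^zeros). The Python sketch below does exactly that; it is a simplification, since the full HyperLogLog algorithm also applies a bias-correction constant that is left out here. The function name is illustrative.

```python
def raw_estimate(max_zeros_per_partition):
    """Partitions times the harmonic mean of 2**zeros (no bias correction)."""
    m = len(max_zeros_per_partition)
    harmonic_mean = m / sum(2.0 ** -z for z in max_zeros_per_partition)
    return m * harmonic_mean

estimate = raw_estimate([7, 5, 12])
print(round(estimate, 2))
```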
How does HLL work? Stochastic Averaging
Hash: 01000101...010
• The first m bits decide the partition number
• The remaining bits are used to count leading zeros
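A minimal Python sketch of this split, assuming a 32-bit hash and calling the number of leading index bits `index_bits` (so there are 2 ** index_bits partitions). The function name and the 0-based partition index are choices made for this illustration.

```python
def split_hash(hash_value: int, index_bits: int, width: int = 32):
    """Return (partition index, leading zeros in the remaining bits)."""
    remaining_bits = width - index_bits
    partition = hash_value >> remaining_bits             # top index_bits bits
    rest = hash_value & ((1 << remaining_bits) - 1)      # everything after them
    zeros = remaining_bits - rest.bit_length()
    return partition, zeros

# Alice's hash from the slides starts with 001|00110...: partition index 1
# (0-based), followed by two leading zeros in the remaining bits.
partition, zeros = split_hash(645403841, index_bits=3)
print(partition, zeros)
```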
Error rate of HLL
• Typical error rate: 1.04 / sqrt(number of partitions)
• Memory needed: number of partitions * log(log(max. value in hash space)) bits
• Can estimate cardinalities well beyond 10^9 with a 1% error rate while using only 6 kilobytes of memory
• Memory vs. accuracy tradeoff
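To get a feel for the tradeoff, the error formula above can be evaluated for a few partition counts. The partition counts below are arbitrary examples, not values taken from the talk.

```python
import math

def typical_error(partitions: int) -> float:
    """Typical relative error: 1.04 / sqrt(number of partitions)."""
    return 1.04 / math.sqrt(partitions)

for partitions in (256, 1024, 4096, 16384):
    print(partitions, f"{typical_error(partitions):.2%}")
```

Because of the square root, quadrupling the number of partitions (and so the memory) only halves the error.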
Why does HLL work?
It turns out that a combination of lots of bad observations is a good observation.
Some interesting examples
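Data: Alice, Alice, Alice, …, Alice
Alice → hash: 645403841 → binary: 00100110...001
partition   | max. leading zeros | estimate
------------+--------------------+---------
Partition 1 | 0                  | 2^0
Partition 2 | 2                  | 2^2
...         | ...                | ...
Partition 8 | 0                  | 2^0
Harmonic mean: 1.103...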
Some interesting examples
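Data: Charlie
Charlie → hash: 0 → binary: 00000000...000
partition   | max. leading zeros | estimate
------------+--------------------+---------
Partition 1 | 29                 | 2^29
Partition 2 | 0                  | 2^0
...         | ...                | ...
Partition 8 | 0                  | 2^0
Harmonic mean: 1.142...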
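The two examples above show why a harmonic mean is used: a single extreme partition barely moves it. A short Python check with the 8 partitions from the slides follows; the comparison with the arithmetic mean is added here for contrast and is not on the slides.

```python
def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)

def arithmetic_mean(values):
    return sum(values) / len(values)

# Per-partition estimates (2 ** max leading zeros) for the two examples.
only_alice = [2 ** z for z in (0, 2, 0, 0, 0, 0, 0, 0)]
only_charlie = [2 ** z for z in (29, 0, 0, 0, 0, 0, 0, 0)]

print(harmonic_mean(only_alice))       # about 1.103
print(harmonic_mean(only_charlie))     # about 1.142
print(arithmetic_mean(only_charlie))   # tens of millions
```

Even a hash of all zeros, the rarest possible pattern, leaves the harmonic mean close to 1, whereas an arithmetic mean would be dominated by that one partition.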
HLL in PostgreSQL
● https://github.com/citusdata/postgresql-hll
HLL in PostgreSQL
postgresql-hll uses a data structure, also called hll, to keep the maximum number of leading zeros of each partition.
• Use hll_hash_bigint to hash elements.
    • There are similar functions for other common data types.
• Use hll_add_agg to aggregate hashed elements into an hll data structure.
• Use hll_cardinality to materialize an hll data structure into an actual distinct count.
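To make the hash → aggregate → materialize flow concrete, here is a toy sketch in Python. It is not the postgresql-hll implementation or its API: the class and method names are invented for this illustration, SHA-256 stands in for the extension's hash functions, and the estimate uses the standard HyperLogLog form (position of the first 1 bit, plus the usual bias-correction constant) without the small-range and large-range corrections a real implementation applies.

```python
import hashlib

class ToyHLL:
    """Illustrative HyperLogLog; names are hypothetical, not postgresql-hll's."""

    def __init__(self, index_bits: int = 10):
        self.index_bits = index_bits
        self.m = 1 << index_bits            # number of partitions
        self.registers = [0] * self.m       # one small integer per partition

    @staticmethod
    def _hash64(value) -> int:
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, value) -> None:
        h = self._hash64(value)
        rest_bits = 64 - self.index_bits
        partition = h >> rest_bits
        rest = h & ((1 << rest_bits) - 1)
        rank = rest_bits - rest.bit_length() + 1    # leading zeros + 1
        self.registers[partition] = max(self.registers[partition], rank)

    def cardinality(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)       # valid for m >= 128
        return alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)

sketch = ToyHLL()
for user_id in range(20_000):
    sketch.add(user_id)
    sketch.add(user_id)     # duplicates never change the registers
print(round(sketch.cardinality()))
```

With 1024 partitions the typical error is about 3%, so the printed value should land near 20,000 while the sketch stores only 1024 small integers.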
Real Time Dashboard with
HyperLogLog
What is Rollup?
Precomputed aggregates for a period of time and a set of dimensions.
What is Rollup?
CREATE TABLE rollup_events_5min (
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    event_count bigint,
    device_distinct_count bigint,
    session_distinct_count bigint,
    minute timestamp
);

CREATE TABLE events (
    id bigint,
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    device_id bigint,
    session_id bigint,
    timestamp timestamp
);
What is Rollup?
INSERT INTO rollup_events_5min
SELECT
    customer_id,
    event_type,
    country,
    browser,
    COUNT(*) AS event_count,
    COUNT(DISTINCT device_id) AS device_distinct_count,
    COUNT(DISTINCT session_id) AS session_distinct_count,
    -- floor each timestamp to the start of its 5-minute (300-second) bucket
    date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP 'epoch' AS minute
FROM events
WHERE timestamp >= $1 AND timestamp <= $2
GROUP BY customer_id, event_type, country, browser, minute;
Benefits of Rollup Tables
• Fast, indexed lookups of aggregates
• Avoid expensive repeated computations
• Rollups are compact (they use less space) and can be kept over longer periods
• Rollups can be further aggregated
What if I want to get the aggregation result for a 1-hour period?
SELECT
    customer_id,
    event_type,
    country,
    browser,
    SUM(event_count) AS event_count,
    -- Caution: a sum of distinct counts is not a distinct count. A device or
    -- session seen in several 5-minute buckets is counted once per bucket.
    SUM(device_distinct_count) AS device_distinct_count,
    SUM(session_distinct_count) AS session_distinct_count,
    date_trunc('hour', minute) AS hour
FROM rollup_events_5min
GROUP BY customer_id, event_type, country, browser, hour;
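Adding up per-bucket distinct counts overcounts anything that appears in more than one bucket. A tiny Python sketch with made-up device IDs (the IDs and the two-bucket setup are purely illustrative) shows the gap:

```python
# Hypothetical device IDs seen in two consecutive 5-minute buckets.
bucket_1 = {101, 102, 103}
bucket_2 = {102, 103, 104}

summed = len(bucket_1) + len(bucket_2)   # what summing the stored counts gives
actual = len(bucket_1 | bucket_2)        # true number of distinct devices
print(summed, actual)
```

Once only the counts are stored, the information needed to remove the overlap is gone, which is why plain numeric distinct counts cannot be rolled up further.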
Rollup Table with HLL
-- With HLL: distinct counts are stored as hll data structures
CREATE TABLE rollup_events_5min (
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    event_count bigint,
    device_distinct_count hll,
    session_distinct_count hll,
    minute timestamp
);

-- Previous definition, for comparison: distinct counts stored as plain numbers
CREATE TABLE rollup_events_5min (
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    event_count bigint,
    device_distinct_count bigint,
    session_distinct_count bigint,
    minute timestamp
);
Rollup Table with HLL
INSERT INTO rollup_events_5min
SELECT
    customer_id,
    event_type,
    country,
    browser,
    COUNT(*) AS event_count,
    hll_add_agg(hll_hash_bigint(device_id)) AS device_distinct_count,
    hll_add_agg(hll_hash_bigint(session_id)) AS session_distinct_count,
    -- floor each timestamp to the start of its 5-minute (300-second) bucket
    date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP 'epoch' AS minute
FROM events
WHERE timestamp >= $1 AND timestamp <= $2
GROUP BY customer_id, event_type, country, browser, minute;
What if I want to get the aggregation result for a 1-hour period?
SELECT
    customer_id,
    event_type,
    country,
    browser,
    SUM(event_count) AS event_count,
    hll_union_agg(device_distinct_count) AS device_distinct_count,
    hll_union_agg(session_distinct_count) AS session_distinct_count,
    date_trunc('hour', minute) AS hour
FROM rollup_events_5min
GROUP BY customer_id, event_type, country, browser, hour;
How to Merge COUNT(DISTINCT) with HLL
Interval 1
partition               | max. leading zeros
------------------------+-------------------
Interval 1, Partition 1 | 7
Interval 1, Partition 2 | 5
Interval 1, Partition 3 | 12
Intermediate result: HLL(7, 5, 12)
How to Merge COUNT(DISTINCT) with HLL
Interval 2
partition               | max. leading zeros
------------------------+-------------------
Interval 2, Partition 1 | 11
Interval 2, Partition 2 | 7
Interval 2, Partition 3 | 8
Intermediate result: HLL(11, 7, 8)
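How to Merge COUNT(DISTINCT) with HLL
hll_union_agg keeps the larger value for each partition:
HLL(7, 5, 12) + HLL(11, 7, 8) → HLL(11, 7, 12)
partition   | max. leading zeros | estimate
------------+--------------------+---------
Partition 1 | 11                 | 2^11
Partition 2 | 7                  | 2^7
Partition 3 | 12                 | 2^12
Estimation: 1053.257...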
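The merge step can be sketched in Python as an element-wise maximum over the two intermediate results. This mirrors the slide's picture, not the internals of hll_union_agg; the function names are illustrative, and the estimate again omits the bias-correction constant of the full algorithm.

```python
def merge(*sketches):
    """Element-wise maximum of the per-partition values."""
    return tuple(max(values) for values in zip(*sketches))

def raw_estimate(max_zeros_per_partition):
    """Partitions times the harmonic mean of 2**zeros (no bias correction)."""
    m = len(max_zeros_per_partition)
    return m * m / sum(2.0 ** -z for z in max_zeros_per_partition)

interval_1 = (7, 5, 12)
interval_2 = (11, 7, 8)
merged = merge(interval_1, interval_2)
print(merged, round(raw_estimate(merged), 2))
```

The merged values are exactly what a single pass over both intervals would have produced, so merging loses no accuracy.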
How to Merge COUNT(DISTINCT) with HLL
Interval 1 + Interval 2
partition   | Interval 1 | Interval 2 | merged
------------+------------+------------+-------
Partition 1 | 7          | 11         | 11
Partition 2 | 5          | 7          | 7
Partition 3 | 12         | 8          | 12
Estimation: 1053.257...
Rollup Table with HLL
• What if ...
• Without hll, you would have to maintain 2^n - 1 rollup tables to cover all combinations of n columns (multiply this by the number of time intervals).
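The 2^n - 1 figure is the number of non-empty subsets of the dimension columns. A short Python check, using the four dimension columns of the rollup table as the example:

```python
from itertools import combinations

dimensions = ["customer_id", "event_type", "country", "browser"]

# Every non-empty combination of dimensions would need its own rollup table
# if distinct counts could not be merged.
groupings = [
    combo
    for size in range(1, len(dimensions) + 1)
    for combo in combinations(dimensions, size)
]
print(len(groupings), 2 ** len(dimensions) - 1)
```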
What Happens in a Distributed Scenario?
postgresql-hll in distributed environment
1. Separate data into shards.
Shards: events_001, events_002, events_003
postgresql-hll in distributed environment
2. Put shards into separate nodes.
Coordinator
Worker Node 1: events_001
Worker Node 2: events_002
Worker Node 3: events_003
postgresql-hll in distributed environment
3. For each shard, calculate hll (but do not materialize).
Shard 1
partition            | max. leading zeros
---------------------+-------------------
Shard 1, Partition 1 | 7
Shard 1, Partition 2 | 5
Shard 1, Partition 3 | 12
Intermediate result: HLL(7, 5, 12)
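postgresql-hll in distributed environment
4. Pull intermediate results to a single node.
node          | shard      | intermediate result sent to Coordinator
--------------+------------+----------------------------------------
Worker Node 1 | events_001 | HLL(6, 4, 11)
Worker Node 2 | events_002 | HLL(10, 6, 7)
Worker Node 3 | events_003 | HLL(7, 12, 5)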
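postgresql-hll in distributed environment
5. Merge separate hll data structures and materialize them.
HLL(11, 7, 8) + HLL(7, 5, 12) + HLL(8, 13, 6) → HLL(11, 13, 12)
partition   | max. leading zeros | estimate
------------+--------------------+---------
Partition 1 | 11                 | 2^11
Partition 2 | 13                 | 2^13
Partition 3 | 12                 | 2^12
Estimation: 10532.571...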
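The coordinator's work in step 5 can be sketched in Python as a fold over the per-shard results. The shard labels below are hypothetical, and the estimate omits the bias-correction constant of the full algorithm. The point of the sketch is that the order in which results arrive does not matter.

```python
from functools import reduce
from itertools import permutations

def merge_two(a, b):
    """Element-wise maximum of two per-partition tuples."""
    return tuple(max(x, y) for x, y in zip(a, b))

def raw_estimate(max_zeros_per_partition):
    """Partitions times the harmonic mean of 2**zeros (no bias correction)."""
    m = len(max_zeros_per_partition)
    return m * m / sum(2.0 ** -z for z in max_zeros_per_partition)

# Intermediate results as shown on the slide; shard labels are made up.
shard_results = {
    "shard_a": (11, 7, 8),
    "shard_b": (7, 5, 12),
    "shard_c": (8, 13, 6),
}

merged = reduce(merge_two, shard_results.values())
# Any arrival order gives the same merged structure.
all_orders = {reduce(merge_two, order) for order in permutations(shard_results.values())}
print(merged, round(raw_estimate(merged), 3))
```

Each worker ships only its small hll structure rather than its raw rows, and only the coordinator turns the merged structure into a number.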
Thanks & Questions
Burak Yücesoy
burak@citusdata.com
@byucesoy
www.citusdata.com @citusdata

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

What is HyperLogLog and Why You Will Love It | PostgreSQL Conference Europe 2018 | Burak Yucesoy

  • 1. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 What is HyperLogLog and Why You Will Love It Burak Yücesoy
  • 2. What is COUNT(DISTINCT)? The number of unique elements (cardinality) in a given data set. Useful for finding things like the number of unique users who visited your web page or the number of unique products in your inventory.
  • 3. What is COUNT(DISTINCT)? Example table "logins":

      username | date
      ---------+-----------
      Alice    | 2018-10-02
      Bob      | 2018-10-03
      Alice    | 2018-10-05
      Eve      | 2018-10-07
      Bob      | 2018-10-07
      Bob      | 2018-10-08

    Number of logins: 6. Number of unique users who logged in: 3.
  • 4. Problems with Traditional COUNT(DISTINCT): slow; high memory footprint; cannot work with appended/streaming data.
  • 5. There is a better way! HyperLogLog (HLL) is a faster alternative to COUNT(DISTINCT) with a low memory footprint. It is an approximation algorithm that estimates the cardinality (i.e. COUNT(DISTINCT)) of given data, with mathematically provable error bounds. It can estimate cardinalities well beyond 10^9 with a 1% error rate using only 6 KB of memory.
  • 6. Is it OK to approximate? It depends...
  • 7. Is it OK to approximate? Counting the number of unique felonies associated with a person: not OK. Counting the number of unique visits to my web page: OK.
  • 8. HLL: very fast; low memory footprint; can work with streaming data; can merge estimations of two separate datasets efficiently.
  • 9. How does HLL work? Steps: 1. Hash all elements (a. ensures uniform data distribution; b. all data types can be treated the same). 2. Observe rare bit patterns. 3. Stochastic averaging.
  • 10. How does HLL work? Observing rare bit patterns: hash(Alice) = 645403841, binary 0010...001. Number of leading zeros: 2. Maximum number of leading zeros: 2.
  • 11. How does HLL work? Observing rare bit patterns: hash(Bob) = 1492309842, binary 0101...010. Number of leading zeros: 1. Maximum number of leading zeros: 2.
  • 12. How does HLL work? Observing rare bit patterns: ... Maximum number of leading zeros: 7. Cardinality estimation: 2^7.
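The "rare bit pattern" idea on these slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `hash32` is a hypothetical helper that takes the first 4 bytes of SHA-1 as a 32-bit hash (the talk does not say which hash function produced its example values), and `crude_estimate` is the single-observation estimator of the slides, not a full HyperLogLog.

```python
import hashlib


def hash32(value: str) -> int:
    """Hypothetical 32-bit hash: first 4 bytes of SHA-1 (an assumption, not the talk's hash)."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def leading_zeros(x: int, width: int = 32) -> int:
    """Number of leading zero bits of x when written with `width` bits."""
    return width - x.bit_length()


def crude_estimate(values) -> int:
    """Estimate cardinality as 2 ** (maximum number of leading zeros seen)."""
    max_zeros = max(leading_zeros(hash32(v)) for v in values)
    return 2 ** max_zeros


# The two hash values quoted on the slides:
print(leading_zeros(645403841))   # 2  (binary 0010...)
print(leading_zeros(1492309842))  # 1  (binary 0101...)
```

Because only the maximum is kept, feeding the same element in repeatedly never changes the estimate, which is what makes the approach suitable for distinct counting.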
  • 13. How does HLL work? Stochastic averaging: measuring the same thing repeatedly and taking the average.
  • 16. How does HLL work? Stochastic averaging: the data is split into three partitions. Maximum number of leading zeros per partition: Partition 1 = 7, Partition 2 = 5, Partition 3 = 12, giving per-partition estimates 2^7, 2^5 and 2^12. Estimation: 228.968...
  • 17. How does HLL work? Stochastic averaging: for a hash such as 01000101...010, the first m bits decide the partition number and the remaining bits are used to count leading zeros.
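A minimal Python sketch of this partitioning step follows, assuming 32-bit hashes. The names `split_hash` and `combine` are made up for illustration. `combine` uses the simplified estimator the slides appear to use (number of partitions times the harmonic mean of the per-partition estimates); production HyperLogLog implementations add bias corrections that are left out here.

```python
def split_hash(h: int, index_bits: int, width: int = 32):
    """Return (partition number, leading zeros of the remaining bits)."""
    rest_bits = width - index_bits
    partition = h >> rest_bits                 # first `index_bits` bits
    rest = h & ((1 << rest_bits) - 1)          # remaining bits
    zeros = rest_bits - rest.bit_length()      # leading zeros within the remainder
    return partition, zeros


def combine(registers) -> float:
    """Simplified estimate: partitions * harmonic mean of 2 ** register."""
    m = len(registers)
    return m * m / sum(2.0 ** -r for r in registers)


# With 3 index bits, the hash quoted for Alice lands in the second partition
# (index 1) with 2 leading zeros in the remaining 29 bits.
print(split_hash(645403841, index_bits=3))  # (1, 2)

# Three partitions holding 7, 5 and 12 give roughly 228.97.
print(combine([7, 5, 12]))
```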
  • 18. Error rate of HLL: typical error rate is 1.04 / sqrt(number of partitions). Memory need is number of partitions * log(log(max. value in hash space)) bits. HLL can estimate cardinalities well beyond 10^9 with a 1% error rate while using only 6 kilobytes of memory. Memory vs. accuracy tradeoff.
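The two formulas on this slide can be plugged into a short calculation. The sketch below assumes a 32-bit hash, so each partition needs about log2(32) = 5 bits; the function names are illustrative only, and real implementations typically round the partition count up to a power of two, which this sketch does not do.

```python
import math


def typical_error(partitions: int) -> float:
    """Typical relative error: 1.04 / sqrt(number of partitions)."""
    return 1.04 / math.sqrt(partitions)


def partitions_for_error(target: float) -> int:
    """Smallest partition count whose typical error is at most `target`."""
    return math.ceil((1.04 / target) ** 2)


def memory_bytes(partitions: int, hash_bits: int = 32) -> float:
    """partitions * log2(hash_bits) bits, converted to bytes."""
    bits_per_partition = math.ceil(math.log2(hash_bits))
    return partitions * bits_per_partition / 8


m = partitions_for_error(0.01)   # about 10816 partitions for a 1% error
print(m, typical_error(m), memory_bytes(m))
```

Under these assumptions a 1% error needs roughly 10,800 partitions and about 6.6 KB, which is in line with the "6 kilobytes" figure quoted in the talk. Halving the error requires four times as many partitions, which is the memory versus accuracy tradeoff.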
  • 19. Why does HLL work? It turns out that a combination of lots of bad observations is a good observation.
  • 20. Some interesting examples: the data is Alice, Alice, Alice, ..., Alice. hash(Alice) = 645403841, binary 00100110...001. With 8 partitions, one partition records 2 leading zeros and the other seven record 0, giving per-partition estimates 2^0, 2^2, 2^0, ... Harmonic mean: 1.103...
  • 21. Some interesting examples: the data is Charlie. hash(Charlie) = 0, binary 00000000...000. With 8 partitions, Partition 1 records 29 leading zeros and the other seven record 0, giving per-partition estimates 2^29, 2^0, 2^0, ... Harmonic mean: 1.142...
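Both examples can be reproduced with a toy sketch. `TinyHLL` below is a hypothetical, simplified structure written for this transcript (8 partitions, 32-bit hashes, plain harmonic mean with no bias correction); it is not the postgresql-hll implementation. Hash values are passed in directly so the numbers quoted on the slides can be reused.

```python
class TinyHLL:
    """Toy HLL: one max-leading-zeros register per partition (illustration only)."""

    def __init__(self, index_bits: int = 3, width: int = 32):
        self.rest_bits = width - index_bits
        self.registers = [0] * (1 << index_bits)

    def add_hash(self, h: int) -> None:
        partition = h >> self.rest_bits
        rest = h & ((1 << self.rest_bits) - 1)
        zeros = self.rest_bits - rest.bit_length()
        # Keep only the maximum, so duplicates never change the state.
        self.registers[partition] = max(self.registers[partition], zeros)

    def harmonic_mean(self) -> float:
        m = len(self.registers)
        return m / sum(2.0 ** -r for r in self.registers)


alice = TinyHLL()
for _ in range(1000):            # the same element, many times
    alice.add_hash(645403841)
print(alice.registers, alice.harmonic_mean())

charlie = TinyHLL()
charlie.add_hash(0)              # an extreme hash value of all zeros
print(charlie.registers, charlie.harmonic_mean())
```

In both cases the harmonic mean stays close to 1: repeated elements add nothing, and a single extreme register is damped by the seven unremarkable ones.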
  • 22. HLL in PostgreSQL: https://github.com/citusdata/postgresql-hll
  • 23. HLL in PostgreSQL: postgresql-hll uses a data structure, also called hll, to keep the maximum number of leading zeros of each partition. Use hll_hash_bigint to hash elements (there are similar functions for other common data types). Use hll_add_agg to aggregate hashed elements into the hll data structure. Use hll_cardinality to materialize the hll data structure into an actual distinct count.
  • 24. Real Time Dashboard with HyperLogLog
  • 25. What is Rollup? Precomputed aggregates for a period of time and a set of dimensions.
  • 26. What is Rollup?

      CREATE TABLE rollup_events_5min (
          customer_id bigint,
          event_type varchar,
          country varchar,
          browser varchar,
          event_count bigint,
          device_distinct_count bigint,
          session_distinct_count bigint,
          minute timestamp
      );

      CREATE TABLE events (
          id bigint,
          customer_id bigint,
          event_type varchar,
          country varchar,
          browser varchar,
          device_id bigint,
          session_id bigint,
          timestamp timestamp
      );
  • 30. What is Rollup?

      INSERT INTO rollup_events_5min
      SELECT
          customer_id,
          event_type,
          country,
          browser,
          COUNT(*) AS event_count,
          COUNT(DISTINCT device_id) AS device_distinct_count,
          COUNT(DISTINCT session_id) AS session_distinct_count,
          date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP 'epoch' AS minute
      FROM events
      WHERE timestamp >= $1 AND timestamp <= $2
      GROUP BY customer_id, event_type, country, browser, minute
  • 34. Benefit of Rollup Tables: fast, indexed lookups of aggregates; avoid expensive repeated computations; rollups are compact (use less space) and can be kept over longer periods; rollups can be further aggregated.
  • 35. What if I want to get the aggregation result for a 1-hour period? Summing the 5-minute rows works for event_count, but summing distinct counts counts a device or session once for every 5-minute bucket it appears in, so the hourly distinct counts come out too high:

      SELECT
          customer_id,
          event_type,
          country,
          browser,
          SUM(event_count) AS event_count,
          SUM(device_distinct_count) AS device_distinct_count,
          SUM(session_distinct_count) AS session_distinct_count,
          date_trunc('hour', minute) AS hour
      FROM rollup_events_5min
      GROUP BY customer_id, event_type, country, browser, hour
  • 38. Rollup Table with HLL. The two definitions differ only in the type of device_distinct_count and session_distinct_count (hll instead of bigint):

      CREATE TABLE rollup_events_5min (
          customer_id bigint,
          event_type varchar,
          country varchar,
          browser varchar,
          event_count bigint,
          device_distinct_count hll,
          session_distinct_count hll,
          minute timestamp
      );

      CREATE TABLE rollup_events_5min (
          customer_id bigint,
          event_type varchar,
          country varchar,
          browser varchar,
          event_count bigint,
          device_distinct_count bigint,
          session_distinct_count bigint,
          minute timestamp
      );
  • 39. Rollup Table with HLL:

      INSERT INTO rollup_events_5min
      SELECT
          customer_id,
          event_type,
          country,
          browser,
          COUNT(*) AS event_count,
          hll_add_agg(hll_hash_bigint(device_id)) AS device_distinct_count,
          hll_add_agg(hll_hash_bigint(session_id)) AS session_distinct_count,
          date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP 'epoch' AS minute
      FROM events
      WHERE timestamp >= $1 AND timestamp <= $2
      GROUP BY customer_id, event_type, country, browser, minute
  • 40. What if I want to get the aggregation result for a 1-hour period?

      SELECT
          customer_id,
          event_type,
          country,
          browser,
          SUM(event_count) AS event_count,
          hll_union_agg(device_distinct_count) AS device_distinct_count,
          hll_union_agg(session_distinct_count) AS session_distinct_count,
          date_trunc('hour', minute) AS hour
      FROM rollup_events_5min
      GROUP BY customer_id, event_type, country, browser, hour
  • 41. How to Merge COUNT(DISTINCT) with HLL: Interval 1 is split into partitions whose maximum numbers of leading zeros are Partition 1 = 7, Partition 2 = 5, Partition 3 = 12. Intermediate result: HLL(7, 5, 12).
  • 42. How to Merge COUNT(DISTINCT) with HLL: Interval 2 is split into partitions whose maximum numbers of leading zeros are Partition 1 = 11, Partition 2 = 7, Partition 3 = 8. Intermediate result: HLL(11, 7, 8).
  • 43. How to Merge COUNT(DISTINCT) with HLL: hll_union_agg combines HLL(11, 7, 8) and HLL(7, 5, 12) into HLL(11, 7, 12), keeping the larger value for each partition. Per-partition estimates: 2^11, 2^7, 2^12. Estimation: 1053.255
  • 44. How to Merge COUNT(DISTINCT) with HLL: Interval 1 + Interval 2. Interval 1 Partition 1 (7) + Interval 2 Partition 1 (11) gives 11; Interval 1 Partition 2 (5) + Interval 2 Partition 2 (7) gives 7; Interval 1 Partition 3 (12) + Interval 2 Partition 3 (8) gives 12. Estimation: 1053.255
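The merge shown on these slides amounts to taking the per-partition maximum of two sketches. A minimal Python illustration follows, with the HLL structures reduced to plain lists of register values; `merge` and `estimate` are illustrative names, and `estimate` is the simplified formula (partitions times harmonic mean) that reproduces the slide numbers, not the bias-corrected estimator of a production library.

```python
def merge(a, b):
    """Union of two sketches: keep the larger register for each partition."""
    if len(a) != len(b):
        raise ValueError("sketches must have the same number of partitions")
    return [max(x, y) for x, y in zip(a, b)]


def estimate(registers) -> float:
    """Simplified estimate: partitions * harmonic mean of 2 ** register."""
    m = len(registers)
    return m * m / sum(2.0 ** -r for r in registers)


interval_1 = [7, 5, 12]
interval_2 = [11, 7, 8]

merged = merge(interval_1, interval_2)
print(merged)            # [11, 7, 12]
print(estimate(merged))  # roughly 1053, as on the slide
```

Since a maximum of maxima equals the maximum over all the data, the merged sketch is the one that would have been built from both intervals together. This is why 5-minute rollups can be combined into hourly ones without rereading the raw events.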
  • 45. Rollup Table with HLL: What if ... Without hll, you would have to maintain 2^n - 1 rollup tables to cover all combinations of n columns (multiply this by the number of time intervals).
  • 46. What Happens in Distributed Scenario?
  • 47. postgresql-hll in distributed environment: 1. Separate data into shards (events_001, events_002, events_003).
  • 48. postgresql-hll in distributed environment: 2. Put shards into separate nodes. There is a Coordinator plus Worker Node 1 (events_001), Worker Node 2 (events_002) and Worker Node 3 (events_003).
  • 49. postgresql-hll in distributed environment: 3. For each shard, calculate hll (but do not materialize). Shard 1 is split into partitions whose maximum numbers of leading zeros are Partition 1 = 7, Partition 2 = 5, Partition 3 = 12. Intermediate result: HLL(7, 5, 12).
  • 50. postgresql-hll in distributed environment: 4. Pull intermediate results to a single node. The Coordinator receives the intermediate results of the three worker nodes: HLL(6, 4, 11), HLL(10, 6, 7), HLL(7, 12, 5).
  • 51. postgresql-hll in distributed environment: 5. Merge separate hll data structures and materialize them. HLL(11, 7, 8), HLL(7, 5, 12) and HLL(8, 13, 6) merge into HLL(11, 13, 12). Per-partition estimates: 2^11, 2^13, 2^12. Estimation: 10532.571...
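The five distributed steps can be imitated in plain Python to check the key property: merging per-shard sketches gives exactly the sketch of the whole data set. Everything here is an assumption made for illustration, including the SHA-1-based 32-bit hash, the 8 partitions, the overlapping integer ranges standing in for shards, and the simplified `estimate` formula; it does not show how Citus or postgresql-hll implement this internally.

```python
import hashlib
from functools import reduce


def sketch(values, index_bits: int = 3, width: int = 32):
    """Build a list of max-leading-zeros registers for `values` (toy HLL)."""
    rest_bits = width - index_bits
    registers = [0] * (1 << index_bits)
    for v in values:
        digest = hashlib.sha1(str(v).encode("utf-8")).digest()
        h = int.from_bytes(digest[:4], "big")      # hypothetical 32-bit hash
        p = h >> rest_bits
        rest = h & ((1 << rest_bits) - 1)
        registers[p] = max(registers[p], rest_bits - rest.bit_length())
    return registers


def merge(a, b):
    """Coordinator-side union: per-partition maximum."""
    return [max(x, y) for x, y in zip(a, b)]


def estimate(registers) -> float:
    """Simplified estimate: partitions * harmonic mean of 2 ** register."""
    m = len(registers)
    return m * m / sum(2.0 ** -r for r in registers)


# Steps 1-3: each "worker" sketches its own shard (shards overlap on purpose).
shards = [range(0, 400), range(300, 700), range(600, 1000)]
partials = [sketch(s) for s in shards]

# Steps 4-5: the "coordinator" merges the intermediate results.
merged = reduce(merge, partials)
print(merged == sketch(range(0, 1000)))  # True

# The register values quoted on the slide merge the same way.
slide = reduce(merge, [[11, 7, 8], [7, 5, 12], [8, 13, 6]])
print(slide, estimate(slide))
```

Only the small register arrays travel over the network, never the raw rows, and values that appear on several shards are not double counted.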
  • 52. Thanks & Questions. Burak Yücesoy, burak@citusdata.com, @byucesoy. www.citusdata.com, @citusdata