This document summarizes a presentation about building a real-time analytics API at scale with Citus, an open-source extension that turns PostgreSQL into a distributed database. It describes how Algolia moved its analytics pipeline from a self-hosted ElasticSearch cluster to Citus to serve sub-second analytics queries over billions of events per day: raw events are sharded across the cluster by application ID, rolled up into aggregate tables at several time intervals (5 minutes, 1 day), and served from those rollup tables with p99 API latency under 800 ms. Citus has since become the foundation for several of Algolia's analytics products.
DataXDay - Building a Real Time Analytics API at Scale
1. Building a Real Time Analytics API
at Scale
DataXDay, May 17th 2018
Sylvain Friquet
@sylvainfriquet
Software Engineer
2. Algolia: Search as a Service
As-you-type Speed
Results in milliseconds
at every keystroke.
Relevance
Finding the best content
for every intent.
User Experience
Delightfully engaging,
impressively intuitive.
@DataXDay
4. Algolia by the numbers
16 Regions
55 Data centers
40B Searches /mo
150B API calls /mo
2012 Founded
200 Employees
4500 Customers
$74M Funding
6. Where we started
> 4-year-old project
> ElasticSearch
> Self-hosted
> 500M to 40B searches/month
> Upgrading the ES cluster was too tedious
7. What we wanted
> Sub-second API response time
> Low latency
> Large retention
> Billions of events per day
> Scale with us
> Hosted solution
17. Rollup table
CREATE TABLE rollups_5min (
    timestamp timestamp,
    app_id text,
    query_count bigint,
    user_count hll,
    top_queries jsonb,
    PRIMARY KEY (app_id, timestamp)
);
> Aggregate metrics per time window
> TopN and HLL extensions
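The `hll` and `jsonb` TopN columns above come from two separate PostgreSQL extensions maintained alongside Citus (postgresql-hll and topn). As a sketch of the setup step the deck skips over, they would be enabled with something like:

```sql
-- Enable the approximation extensions used by the rollup tables
-- (on a Citus cluster, on the coordinator and every worker node)
CREATE EXTENSION IF NOT EXISTS hll;
CREATE EXTENSION IF NOT EXISTS topn;
```

HLL gives a fixed-size sketch for approximate distinct counts; TopN keeps an approximate list of the most frequent values in a JSONB value.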
18. Rollup table
CREATE TABLE rollups_5min (
    timestamp timestamp,
    app_id text,
    query_count bigint,
    user_count hll,
    top_queries jsonb,
    PRIMARY KEY (app_id, timestamp)
);
SELECT create_distributed_table('rollups_5min', 'app_id');
> Aggregate metrics per time window
> TopN and HLL extensions
> Co-located with raw event tables
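Co-location means shards of the raw event table and of the rollup table that share an `app_id` live on the same worker, so rollups never move data between nodes. A minimal sketch of the raw side (the `queries` table and its columns are assumptions inferred from the rollup query later in the deck):

```sql
-- Hypothetical raw events table, sharded on the same key as the
-- rollups so INSERT..SELECT aggregation stays local to each worker
CREATE TABLE queries (
    created_at timestamp,
    app_id text,
    user_id bigint,
    query text
);
SELECT create_distributed_table('queries', 'app_id',
                                colocate_with => 'rollups_5min');
```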
19. Rollup query
INSERT INTO rollups_5min
SELECT
    date_trunc('seconds', …) AS minute,
    app_id,
    count(*) AS query_count,
    hll_add_agg(hll_hash_bigint(user_id)) AS user_count,
    topn_add_agg(query) AS top_queries
FROM queries
WHERE created_at >= $1 AND created_at <= $2
GROUP BY app_id, minute
> Periodic rollup
> Concurrently executed across workers
> Out-of-order events
22. Rollup query
INSERT INTO rollups_5min
SELECT
date_trunc('seconds', …) AS minute,
app_id,
count(*) AS query_count,
hll_add_agg(hll_hash_bigint(user_id)) AS user_count,
topn_add_agg(query) AS top_queries
FROM queries
WHERE created_at >= $1 AND created_at <= $2
GROUP BY app_id, minute
ON CONFLICT (app_id, timestamp)
DO UPDATE SET
    query_count = rollups_5min.query_count + EXCLUDED.query_count,
    user_count = hll_union(rollups_5min.user_count, EXCLUDED.user_count),
    top_queries = topn_union(rollups_5min.top_queries, EXCLUDED.top_queries);
> Periodic rollup
> Concurrently executed across workers
> Out-of-order events
> Incremental or idempotent
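The deck does not say how the periodic rollup is triggered. One common way to do this inside PostgreSQL is the pg_cron extension; a sketch, assuming the upsert above is wrapped in a function named `do_rollup_5min()` (a hypothetical name):

```sql
-- Hypothetical scheduling with pg_cron:
-- run the 5-minute rollup function every 5 minutes
CREATE EXTENSION IF NOT EXISTS pg_cron;
SELECT cron.schedule('*/5 * * * *', 'SELECT do_rollup_5min()');
```

An external scheduler calling the same statement would work equally well; because the upsert is incremental and idempotent, a retried run is harmless.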
23. Rolling up the rollups
> Further aggregate 5min rollups into 1day rollups
> 200x to 50,000x compression ratio in our case
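Rolling 5-minute rows into daily rows reuses the same pattern, merging the pre-aggregated sketches instead of the raw events. A sketch, assuming a `rollups_1day` table with the same shape as `rollups_5min`:

```sql
-- Hypothetical daily rollup built from the 5-minute rollups;
-- hll_union_agg / topn_union_agg merge the stored sketches, so
-- the raw event table is never touched again
INSERT INTO rollups_1day
SELECT
    date_trunc('day', timestamp) AS day,
    app_id,
    sum(query_count),
    hll_union_agg(user_count),
    topn_union_agg(top_queries)
FROM rollups_5min
WHERE timestamp >= $1 AND timestamp < $2
GROUP BY app_id, day;
```

This mergeability of HLL and TopN states is what makes the 200x to 50,000x compression possible without losing the ability to answer distinct-count and top-k questions.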
24. API queries
Sample queries
> Count
SELECT sum(query_count) FROM … rollups_1day UNION ALL rollups_5min … WHERE …
> Approximate distinct count
SELECT hll_cardinality(sum(user_count))::bigint FROM ...
> TopN
SELECT (topn(topn_union_agg(top_queries), 10)).* FROM ...
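The truncated count query above combines coarse and fine rollups so an arbitrary time range reads whole days from `rollups_1day` and only the ragged edges from `rollups_5min`. A sketch of the idea (parameter positions and the subquery shape are assumptions):

```sql
-- Hypothetical: $2..$3 covers the whole days of the range,
-- $3..$4 the trailing partial day, summed together
SELECT sum(query_count) FROM (
    SELECT query_count FROM rollups_1day
    WHERE app_id = $1 AND timestamp >= $2 AND timestamp < $3
    UNION ALL
    SELECT query_count FROM rollups_5min
    WHERE app_id = $1 AND timestamp >= $3 AND timestamp < $4
) AS combined;
```

Because both tables are distributed on `app_id`, the whole query is routed to a single worker shard pair, which is what keeps latency low.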
25. Some numbers
> Every 5 min: ~20 s to roll up 20M rows
> 64 shards, 48 vCPUs, 366 GB RAM
> API latency: p99 < 800 ms, p95 < 500 ms
26. Conclusion
> Rollup approach working at scale
> Citus is becoming the foundation for several new products (Click Analytics, ...)