4.
- 100+ data centers globally
- 2.5B monthly unique visitors
- 10% of Internet requests every day
- 1.3M+ DNS queries/second
- 6M+ websites, apps & APIs in 150 countries
- 5.5M+ HTTP requests/second
5. What did we want?
- Multidimensional query analytics
- Complex ad-hoc queries
- Capable of current and expected future scale
- Gracefully handle late arriving log data
- Roll-ups/aggregations for long term storage
- Highly available and replicated architecture
Inserted rows/second: O(1M)
Edge Points of Presence: 100+
Query dimensions: 20+
Years of stored aggregation: 5+
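The late-arrival and roll-up requirements can be illustrated with a small sketch. It assumes that roll-ups are keyed by the event's own timestamp rather than by arrival time, so a record that shows up late still lands in the minute it belongs to. The names `MinuteRollup`, `minute_bucket`, and the `pop` dimension are hypothetical and chosen for illustration; the slides do not describe how the pipeline does this.

```python
from collections import defaultdict
from datetime import datetime, timezone


def minute_bucket(ts: datetime) -> datetime:
    """Truncate an event timestamp to the start of its minute."""
    return ts.replace(second=0, microsecond=0)


class MinuteRollup:
    """Per-minute event counts keyed by (minute, point of presence)."""

    def __init__(self) -> None:
        self.counts = defaultdict(int)

    def add(self, event_time: datetime, pop: str, n: int = 1) -> None:
        # Bucket by when the event happened, not by when the log line
        # arrived, so a late record updates the bucket it belongs to.
        self.counts[(minute_bucket(event_time), pop)] += n


def at(hour: int, minute: int, second: int = 0) -> datetime:
    """Helper for building example timestamps on one (made-up) day."""
    return datetime(2017, 8, 1, hour, minute, second, tzinfo=timezone.utc)


rollup = MinuteRollup()
rollup.add(at(10, 0, 5), "ams")
rollup.add(at(10, 1, 30), "ams")
rollup.add(at(10, 0, 59), "ams")  # arrives out of order, still counted in 10:00
```

Because the counters are additive, the order in which records arrive does not change the final totals.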
8. We tried a few things...
- Kafka + Go + Citus
- Kafka + Spark Streaming
- Kafka + Flink
- Kafka + Druid
- Kafka + ClickHouse
9. ClickHouse
- Tabular, column-oriented data store
- Single binary, clustered architecture
- Familiar SQL query interface
- Lots of very useful built-in aggregation functions
- Raw log data stored for 3 months (~7 trillion rows)
- Aggregated data for ∞ (1m, 1h aggregations across 3 dimensions)
13. Speeding up typical queries
- SUM() / COUNT() over a few low-cardinality dimensions
- Global overview (trends, monitoring)
- Storing intermediate state for non-additive functions
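The last point is worth a small illustration. Sums and counts can be re-aggregated freely, but averages and unique counts cannot: averaging per-minute averages or adding per-minute unique counts gives the wrong answer. One way around this is to store a mergeable intermediate state per bucket and finalize it only at query time. ClickHouse exposes this idea through its `-State` and `-Merge` aggregate function combinators and the AggregatingMergeTree engine; the slides do not say which mechanism was used, so the sketch below only shows the principle. `AvgState` and `UniqState` are hypothetical names, and the exact set is a stand-in for the approximate sketches a real system would keep.

```python
from dataclasses import dataclass, field


@dataclass
class AvgState:
    """Intermediate state for an average: keep sum and count, not the ratio."""
    total: float = 0.0
    count: int = 0

    def add(self, value: float) -> None:
        self.total += value
        self.count += 1

    def merge(self, other: "AvgState") -> "AvgState":
        return AvgState(self.total + other.total, self.count + other.count)

    def result(self) -> float:
        return self.total / self.count


@dataclass
class UniqState:
    """Intermediate state for a distinct count: keep the set, not its size."""
    seen: set = field(default_factory=set)

    def add(self, value) -> None:
        self.seen.add(value)

    def merge(self, other: "UniqState") -> "UniqState":
        return UniqState(self.seen | other.seen)

    def result(self) -> int:
        return len(self.seen)


# Two one-minute buckets with made-up values.
minute1, minute2 = AvgState(), AvgState()
for v in (10, 20, 30):
    minute1.add(v)
minute2.add(100)

naive_avg = (minute1.result() + minute2.result()) / 2  # average of averages
merged_avg = minute1.merge(minute2).result()           # average over all values

u1, u2 = UniqState(), UniqState()
for ip in ("a", "b"):
    u1.add(ip)
for ip in ("b", "c"):
    u2.add(ip)

naive_uniq = u1.result() + u2.result()  # double-counts "b"
merged_uniq = u1.merge(u2).result()
```

Here the naive average is 60 while the merged state gives 40, and the naive unique count is 4 while the merged state gives 3.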
15. Anatomy of a DNS query
$ dig www.cloudflare.com
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36582
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;www.cloudflare.com. IN A
;; ANSWER SECTION:
www.cloudflare.com. 5 IN A 198.41.215.162
www.cloudflare.com. 5 IN A 198.41.214.162
;; Query time: 34 msec
;; SERVER: 192.168.1.1#53(192.168.1.1)
;; WHEN: Sat Sep 2 10:48:30 2017
;; MSG SIZE rcvd: 68
Fields: 30+
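To make the idea of per-query fields concrete, here is a minimal sketch that splits one answer line from the dig output above into its five whitespace-separated columns. The slides do not list the 30+ logged fields, so `ResourceRecord` and its field names are hypothetical and cover only what is visible in the answer section.

```python
from typing import NamedTuple


class ResourceRecord(NamedTuple):
    """The five columns of a dig answer line."""
    name: str
    ttl: int
    rclass: str
    rtype: str
    rdata: str


def parse_answer_line(line: str) -> ResourceRecord:
    # Split on runs of whitespace; rdata keeps any remaining text intact.
    name, ttl, rclass, rtype, rdata = line.split(None, 4)
    return ResourceRecord(name, int(ttl), rclass, rtype, rdata)


rr = parse_answer_line("www.cloudflare.com. 5 IN A 198.41.215.162")
```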
18.
- October 2016: Began evaluating technologies and architecture, 1 instance in Docker
- November 2016: Prototype ClickHouse cluster with 3 nodes, inserting a sample of data
- December 2016: ClickHouse visualisations with Superset and Grafana
- Finalized schema, deployed a production ClickHouse cluster of 6 nodes
- Spring 2017: TopN, IP prefix matching, Go native driver, Analytics library, pkey in monotonic functions
- Growing interest among other Cloudflare engineering teams, worked on standard tooling
- August 2017: Migrated to a new cluster with multi-tenancy
19. Multi-tenant ClickHouse cluster
- Row insertion/s: 8M+
- RAID-0 spinning disks: 2PB+
- Insertion throughput/s: 4GB+
- Nodes: 33
21. Example
SELECT
    toStartOfMinute(datetime) AS t,
    count() / 60 AS qps,
    uniq(srcIPv4) AS ip4,
    uniq(srcIPv6) AS ip6,
    uniq(queryName) AS qn,
    countIf(queryType = 1) AS aCount,
    countIf(queryType = 28) AS aaaaCount
FROM open.dnslogs
WHERE date = '2017-08-01'
    AND ...
GROUP BY t
ORDER BY t
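For readers less familiar with ClickHouse functions, the query above can be read as the following in-memory sketch. It groups rows by the start of their minute, divides the row count by 60 to get queries per second, counts distinct source addresses and query names, and counts A (type 1) and AAAA (type 28) queries. Several things are assumptions here: the sample rows are made up, a missing address is represented as `None` and skipped, exact sets stand in for `uniq()` (which is approximate in ClickHouse), and the `WHERE` filter is left out.

```python
from collections import defaultdict
from datetime import datetime


def per_minute_stats(rows):
    """Group DNS log rows by minute and compute the query's aggregates."""
    buckets = defaultdict(
        lambda: {"n": 0, "ip4": set(), "ip6": set(), "qn": set(), "a": 0, "aaaa": 0}
    )
    for row in rows:
        # Equivalent of toStartOfMinute(datetime).
        t = row["datetime"].replace(second=0, microsecond=0)
        b = buckets[t]
        b["n"] += 1
        if row.get("srcIPv4") is not None:
            b["ip4"].add(row["srcIPv4"])
        if row.get("srcIPv6") is not None:
            b["ip6"].add(row["srcIPv6"])
        b["qn"].add(row["queryName"])
        b["a"] += row["queryType"] == 1      # countIf(queryType = 1)
        b["aaaa"] += row["queryType"] == 28  # countIf(queryType = 28)
    return [
        {
            "t": t,
            "qps": b["n"] / 60,
            "ip4": len(b["ip4"]),
            "ip6": len(b["ip6"]),
            "qn": len(b["qn"]),
            "aCount": b["a"],
            "aaaaCount": b["aaaa"],
        }
        for t, b in sorted(buckets.items(), key=lambda kv: kv[0])  # ORDER BY t
    ]


# Made-up rows using documentation address ranges.
rows = [
    {"datetime": datetime(2017, 8, 1, 10, 0, 1), "srcIPv4": "192.0.2.1",
     "srcIPv6": None, "queryName": "www.example.com.", "queryType": 1},
    {"datetime": datetime(2017, 8, 1, 10, 0, 30), "srcIPv4": "192.0.2.1",
     "srcIPv6": None, "queryName": "www.example.com.", "queryType": 28},
    {"datetime": datetime(2017, 8, 1, 10, 0, 59), "srcIPv4": None,
     "srcIPv6": "2001:db8::1", "queryName": "api.example.com.", "queryType": 1},
    {"datetime": datetime(2017, 8, 1, 10, 1, 0), "srcIPv4": "192.0.2.2",
     "srcIPv6": None, "queryName": "www.example.com.", "queryType": 1},
]
stats = per_minute_stats(rows)
```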
23. What we’re working on
- Go native driver (github.com/kshvakov/clickhouse)
- Grafana plugin (though the Vertamedia one looks nice)
- Kafka → ClickHouse inserter
- ClickHouse → API scaffolding
- ClickHouse: top K, IP trie dictionary, pkey optimisations, “pipelines”
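As an illustration of what IP prefix matching involves, the sketch below does a longest-prefix match over a small table of CIDR blocks. It is a linear scan using Python's standard `ipaddress` module, not a trie, and is not how ClickHouse implements it; a trie gives the same answers with far fewer comparisons on large tables. The prefixes use documentation address ranges and the labels are hypothetical.

```python
import ipaddress


def build_table(prefixes):
    """Turn a {cidr: label} mapping into a list of (network, label) pairs."""
    return [(ipaddress.ip_network(cidr), label) for cidr, label in prefixes.items()]


def longest_prefix_match(table, address):
    """Return the label of the most specific network containing address."""
    ip = ipaddress.ip_address(address)
    best = None
    for net, label in table:
        if ip.version == net.version and ip in net:
            # Prefer the network with the longer (more specific) prefix.
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, label)
    return best[1] if best else None


table = build_table({
    "198.51.100.0/24": "customer-a",
    "198.51.100.128/25": "customer-b",
    "2001:db8::/32": "customer-c",
})
```

For example, `longest_prefix_match(table, "198.51.100.200")` falls inside both IPv4 blocks and resolves to the more specific /25.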