SlideShare une entreprise Scribd logo
1  sur  47
ClickHouse Data Warehouse 101
The First Billion Rows
Alexander Zaitsev and Robert Hodges
About Us
Alexander Zaitsev - Altinity CTO
Expert in data warehouse with
petabyte-scale deployments.
Altinity Founder; Previously at
LifeStreet (Ad Tech business)
Robert Hodges - Altinity CEO
30+ years on DBMS plus
virtualization and security.
Previously at VMware and
Continuent
Altinity Background
● Premier provider of software and services for ClickHouse
● Incorporated in UK with distributed team in US/Canada/Europe
● Main US/Europe sponsor of ClickHouse community
● Offerings:
○ Enterprise support for ClickHouse and ecosystem projects
○ Software (Kubernetes, cluster manager, tools & utilities)
○ POCs/Training
ClickHouse
Overview
ClickHouse is a powerful data warehouse that handles
many use cases
Understands SQL
Runs on bare metal to cloud
Stores data in columns
Parallel and vectorized execution
Scales to many petabytes
Is Open source (Apache 2.0)
Is WAY fast!
a b c d
a b c d
a b c d
a b c d
Tables are split into indexed, sorted parts for fast queries
Table
Part
Index Columns
Indexed
Sorted
Compressed
Part
Index Columns
Part
If one server is not enough -- ClickHouse can scale out easily
SELECT ...
FROM
tripdata_dist
ClickHouse
tripdata_dist
(Distributed)
tripdata
(MergeTable)
ClickHouse
tripdata_dist tripdata
ClickHouse
tripdata_dist tripdata
Result Set
Getting Started:
Data Loading
Installation: Use packages on Linux host
$ sudo apt -y install clickhouse-client=19.6.2 
clickhouse-server=19.6.2 
clickhouse-common-static=19.6.2
...
$ sudo systemctl start clickhouse-server
...
$ clickhouse-client
11e99303c78e :) select version()
...
┌─version()─┐
│ 19.6.2.11 │
└───────────┘
Decision tree for ClickHouse basic schema design
Types
are
known?
Fields
are
fixed?
Yes
Use array
columns to
store key
value pairs
No
Use scalar
columns with
String type
No
Yes Use scalar
columns with
specific type
Select
partition key
and sort order
Tabular data structure typically gives the best results
CREATE TABLE tripdata (
`pickup_date` Date DEFAULT
toDate(tpep_pickup_datetime),
`id` UInt64,
`vendor_id` String,
`tpep_pickup_datetime` DateTime,
`tpep_dropoff_datetime` DateTime,
...
) ENGINE = MergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY (pickup_location_id, dropoff_location_id, vendor_id)
Time-based partition key
Scalar columns
Specific datatypes
Sort key to index parts
Use clickhouse-client to load data quickly from files
"Pickup_date","id","vendor_id","tpep_pickup_datetime"…
"2016-01-02",0,"1","2016-01-02 04:03:29","2016-01-02…
"2016-01-29",0,"1","2016-01-29 12:00:51","2016-01-29…
"2016-01-09",0,"1","2016-01-09 17:22:05","2016-01-09…
clickhouse-client --database=nyc_taxi_rides --query='INSERT
INTO tripdata FORMAT CSVWithNames' < data.csv
CSV Input Data
Reading CSV Input with Headers
gzip -d -c | clickhouse-client --database=nyc_taxi_rides
--query='INSERT INTO tripdata FORMAT CSVWithNames'
Reading Gzipped CSV Input with Headers
Wouldn’t it be nice to run in parallel over a lot of input files?
Altinity Datasets project does exactly that!
● Dump existing schema definitions and data to files
● Load files back into a database
● Data dump/load commands run in parallel
See https://github.com/Altinity/altinity-datasets
How long does it take to load 1.3B rows?
$ time ad-cli dataset load nyc_taxi_rides --repo_path=/data1/sample-data
Creating database if it does not exist: nyc_timed
Executing DDL: /data1/sample-data/nyc_taxi_rides/ddl/taxi_zones.sql
. . .
Loading data: table=tripdata, file=data-200901.csv.gz
. . .
Operation summary: succeeded=193, failed=0
real 11m4.827s
user 63m32.854s
sys 2m41.235s
(Amazon md5.2xlarge: Xeon(R) Platinum 8175M, 8vCPU, 30GB RAM, NVMe SSD)
Do we really have 1B+ table?
:) select count() from tripdata;
SELECT count()
FROM tripdata
┌────count()─┐
│ 1310903963 │
└────────────┘
1 rows in set. Elapsed: 0.324 sec. Processed 1.31 billion rows, 1.31 GB (4.05
billion rows/s., 4.05 GB/s.)
1,310,903,963/11m4s = 1,974,253 rows/sec!!!
Getting Started
on Queries
Let’s try to predict maximum performance
SELECT avg(number)
FROM
(
SELECT number
FROM system.numbers
LIMIT 1310903963
)
┌─avg(number)─┐
│ 655451981 │
└─────────────┘
1 rows in set. Elapsed: 3.420 sec. Processed 1.31 billion rows, 10.49 GB (383.29
million rows/s., 3.07 GB/s.)
system.numbers -- internal
generator for testing
Now we try with the real data
SELECT avg(passenger_count)
FROM tripdata
┌─avg(passenger_count)─┐
│ 1.6817462943317076 │
└──────────────────────┘
1 rows in set. Elapsed: ?
Guess how fast?
Now we try with the real data
SELECT avg(passenger_count)
FROM tripdata
┌─avg(passenger_count)─┐
│ 1.6817462943317076 │
└──────────────────────┘
1 rows in set. Elapsed: 1.084 sec. Processed 1.31 billion rows, 1.31 GB (1.21
billion rows/s., 1.21 GB/s.)
Even faster!!!!
Data type and cardinality matters
What if we add a filter
SELECT avg(passenger_count)
FROM tripdata
WHERE toYear(pickup_date) = 2016
┌─avg(passenger_count)─┐
│ 1.6571129913837774 │
└──────────────────────┘
1 rows in set. Elapsed: 0.162 sec. Processed 131.17 million rows, 393.50 MB (811.05
million rows/s., 2.43 GB/s.)
What if we add a group by
SELECT
pickup_location_id AS location_id,
avg(passenger_count),
count()
FROM tripdata
WHERE toYear(pickup_date) = 2016
GROUP BY location_id LIMIT 10
...
10 rows in set. Elapsed: 0.251 sec. Processed 131.17 million rows, 655.83 MB
(522.62 million rows/s., 2.61 GB/s.)
What if we add a join
SELECT
zone,
avg(passenger_count),
count()
FROM tripdata
INNER JOIN taxi_zones ON taxi_zones.location_id = pickup_location_id
WHERE toYear(pickup_date) = 2016
GROUP BY zone
LIMIT 10
10 rows in set. Elapsed: 0.803 sec. Processed 131.17 million rows, 655.83 MB (163.29
million rows/s., 816.44 MB/s.)
Yes, ClickHouse is FAST!
https://tech.marksblogg.com/benchmarks.html
Optimization
Techniques
How to make ClickHouse
even faster
You can optimize
Server settings
Schema
Column storage
Queries
You can optimize
SELECT avg(passenger_count)
FROM tripdata
SETTINGS max_threads = 1
...
1 rows in set. Elapsed: 4.855 sec. Processed 1.31 billion rows, 1.31 GB
(270.04 million rows/s., 270.04 MB/s.)
SELECT avg(passenger_count)
FROM tripdata
SETTINGS max_threads = 8
...
1 rows in set. Elapsed: 1.092 sec. Processed 1.31 billion rows, 1.31 GB (1.20
billion rows/s., 1.20 GB/s.)
Default is a half of
available cores --
good enough
Schema optimizations
Data types
Index
Dictionaries
Arrays
Materialized Views and aggregating engines
Data Types matter!
https://www.percona.com/blog/2019/02/15/clickhouse-performance-uint32-vs-uint64-vs-float32-vs-float64/
MaterializedView with SummingMergeTree
CREATE MATERIALIZED VIEW tripdata_mv
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY (pickup_location_id, dropoff_location_id, vendor_id) AS
SELECT
pickup_date,
vendor_id,
pickup_location_id,
dropoff_location_id,
sum(passenger_count) AS passenger_count_sum,
sum(trip_distance) AS trip_distance_sum,
sum(fare_amount) AS fare_amount_sum,
sum(tip_amount) AS tip_amount_sum,
sum(tolls_amount) AS tolls_amount_sum,
sum(total_amount) AS total_amount_sum,
count() AS trips_count
FROM tripdata
GROUP BY
pickup_date,
vendor_id,
pickup_location_id,
dropoff_location_id
MaterializedView
works as an INSERT
trigger
SummingMergeTree
automatically
aggregates data in
the background
MaterializedView with SummingMergeTree
INSERT INTO tripdata_mv SELECT
pickup_date,
vendor_id,
pickup_location_id,
dropoff_location_id,
passenger_count,
trip_distance,
fare_amount,
tip_amount,
tolls_amount,
total_amount,
1
FROM tripdata;
Ok.
0 rows in set. Elapsed: 303.664 sec. Processed 1.31 billion rows,
50.57 GB (4.32 million rows/s., 166.54 MB/s.)
Note, no group by!
SummingMergeTree
automatically
aggregates data in
the background
MaterializedView with SummingMergeTree
SELECT count()
FROM tripdata_mv
┌──count()─┐
│ 20742525 │
└──────────┘
1 rows in set. Elapsed: 0.015 sec. Processed 20.74 million rows, 41.49 MB (1.39 billion
rows/s., 2.78 GB/s.)
SELECT
zone,
sum(passenger_count_sum)/sum(trips_count),
sum(trips_count)
FROM tripdata_mv
INNER JOIN taxi_zones ON taxi_zones.location_id = pickup_location_id
WHERE toYear(pickup_date) = 2016
GROUP BY zone
LIMIT 10
10 rows in set. Elapsed: 0.036 sec. Processed 3.23 million rows, 64.57 MB (89.14 million
rows/s., 1.78 GB/s.)
Realtime Aggreation with Materialized Views
Raw Data
Summing
MergeTree
Summing
MergeTree
Summing
MergeTree
INSERTS
Column storage optimizations
Compression
LowCardinality
Column encodings
:) create table test_lc (
a String, a_lc LowCardinality(String) DEFAULT a) Engine = MergeTree
PARTITION BY tuple() ORDER BY tuple();
:) INSERT INTO test_lc (a) SELECT
concat('openconfig-interfaces:interfaces/interface/subinterfaces/subinter
face/state/index', toString(rand() % 1000))
FROM system.numbers LIMIT 1000000000;
┌─table───┬─name─┬─type───────────────────┬─compressed─┬─uncompressed─┐
│ test_lc │ a │ String │ 4663631515 │ 84889975226 │
│ test_lc │ a_lc │ LowCardinality(String) │ 2010472937 │ 2002717299 │
└─────────┴──────┴────────────────────────┴────────────┴──────────────┘
LowCardinality example. Another 1B rows.
Storage is
dramatically
reduced
LowCardinality
encodes column
with a dictionary
encoding
:) select a a, count(*) from test_lc group by a order by count(*) desc limit 10;
┌─a────────────────────────────────────────────────────────────────────────────────────┬─count()─┐
│ openconfig-interfaces:interfaces/interface/subinterfaces/subinterface/state/index396 │ 1002761 │
...
│ openconfig-interfaces:interfaces/interface/subinterfaces/subinterface/state/index5 │ 1002203 │
└──────────────────────────────────────────────────────────────────────────────────────┴─────────┘
10 rows in set. Elapsed: 11.627 sec. Processed 1.00 billion rows, 92.89 GB (86.00 million
rows/s., 7.99 GB/s.)
:) select a_lc a, count(*) from test_lc group by a order by count(*) desc limit 10;
...
10 rows in set. Elapsed: 1.569 sec. Processed 1.00 billion rows, 3.42 GB (637.50 million
rows/s., 2.18 GB/s.)
LowCardinality example. Another 1B rows
Faster
create table test_array (
s String,
a Array(LowCardinality(String)) default arrayDistinct(splitByChar(',', s))
) Engine = MergeTree PARTITION BY tuple() ORDER BY tuple();
INSERT INTO test_array (s)
WITH ['Percona', 'Live', 'Altinity', 'ClickHouse', 'MySQL', 'Oracle', 'Austin', 'Texas',
'PostgreSQL', 'MongoDB'] AS keywords
SELECT concat(keywords[((rand(1) % 10) + 1)], ',',
keywords[((rand(2) % 10) + 1)], ',',
keywords[((rand(3) % 10) + 1)], ',',
keywords[((rand(4) % 10) + 1)])
FROM system.numbers LIMIT 1000000000;
Array example. Another 1B rows
Arrays efficiently model 1-to-N
relationship
Note the use of complex default
expression
Data sample:
┌─s────────────────────────────────────┬─a────────────────────────────────────────────┐
│ Texas,ClickHouse,Live,MySQL │ ['Texas','ClickHouse','Live','MySQL'] │
│ Texas,Oracle,Altinity,PostgreSQL │ ['Texas','PostgreSQL','Oracle','Altinity'] │
│ Percona,MySQL,MySQL,Austin │ ['MySQL','Percona','Austin'] │
│ PostgreSQL,Austin,PostgreSQL,Percona │ ['PostgreSQL','Percona','Austin'] │
│ Altinity,Percona,Percona,Percona │ ['Altinity','Percona'] │
Storage:
┌─table──────┬─name─┬─type──────────────────────────┬────────comp─┬──────uncomp─┐
│ test_array │ s │ String │ 11239860686 │ 31200058000 │
│ test_array │ a │ Array(LowCardinality(String)) │ 4275679420 │ 11440948123 │
└────────────┴──────┴───────────────────────────────┴─────────────┴─────────────┘
Array example. Another 1B rows
Array efficiently models 1-to-N
relationship
:) select count() from test_array where s like '%ClickHouse%';
┌───count()─┐
│ 343877409 │
└───────────┘
1 rows in set. Elapsed: 7.363 sec. Processed 1.00 billion rows, 39.20 GB (135.81 million
rows/s., 5.32 GB/s.)
:) select count() from test_array where has(a,'ClickHouse');
┌───count()─┐
│ 343877409 │
└───────────┘
1 rows in set. Elapsed: 8.428 sec. Processed 1.00 billion rows, 11.44 GB (118.66 million
rows/s., 1.36 GB/s.)
Array example. Another 1B rows
Well, ‘like’ is very efficient, but we reduced I/O a lot.
* has() will be optimized by dev team
SELECT
zone,
avg(passenger_count),
count()
FROM tripdata
INNER JOIN taxi_zones ON taxi_zones.location_id =
pickup_location_id
WHERE toYear(pickup_date) = 2016
GROUP BY zone
LIMIT 10
10 rows in set. Elapsed: 0.803 sec. Processed 131.17 million rows,
655.83 MB (163.29 million rows/s., 816.44 MB/s.)
Query optimization example. JOIN optimization
Can we do it any faster?
SELECT
zone,
sum(pc_sum) / sum(pc_cnt) AS pc_avg,
sum(pc_cnt)
FROM
(
SELECT
pickup_location_id,
sum(passenger_count) AS pc_sum,
count() AS pc_cnt
FROM tripdata
WHERE toYear(pickup_date) = 2016
GROUP BY pickup_location_id
)
INNER JOIN taxi_zones ON taxi_zones.location_id = pickup_location_id
GROUP BY zone LIMIT 10
10 rows in set. Elapsed: 0.248 sec. Processed 131.17 million rows, 655.83
MB (529.19 million rows/s., 2.65 GB/s.)
Query optimization example. JOIN optimization
Subquery minimizes data
scanned in parallel; joins
on GROUP BY results
ClickHouse
Integrations
...And a nice set of supporting ecosystem tools
Client libraries: JDBC, ODBC, Python, Golang, ...
Kafka table engine to ingest from Kafka queues
Visualization tools: Grafana, Tableau, Tabix, SuperSet
Data science stack integration: Pandas, Jupyter Notebooks
Kubernetes ClickHouse operator
Integrations with MySQL
MySQL External Dictionaries (pull data from MySQL to CH)
MySQL Table Engine and Table Function (query/insert)
Binary Log Replication
ProxySQL supports ClickHouse
ClickHouse supports MySQL wire protocol (in June release)
..and with PostgreSQL
ODBC External Dictionaries (pull data from PostgreSQL to CH)
ODBC Table Engine and Table Function (query/insert)
Logical Replication: https://github.com/mkabilov/pg2ch
Foreign Data Wrapper:
https://github.com/Percona-Lab/clickhousedb_fdw
ClickHouse Operator -- an easy way to manage ClickHouse
DWH in Kubernetes
https://github.com/Altinity/clickhouse-operator
Where to get more information
● ClickHouse Docs: https://clickhouse.yandex/docs/en/
● Altinity Blog: https://www.altinity.com/blog
● Meetups and presentations: https://www.altinity.com/presentations
○ 2 April -- Madrid, Spain ClickHouse Meetup
○ 7 May -- Limassol, Cyprus ClickHouse Meetup
○ 28-30 May -- Austin, TX Percona Live 2019
○ 4 June -- San Francisco ClickHouse Meetup
○ 8 June -- Beijing ClickHouse Meetup
○ September -- ClickHouse Paris Meetup
Questions?
Thank you!
Contacts:
info@altinity.com
Visit us at:
https://www.altinity.com
Read Our Blog:
https://www.altinity.com/blog

Contenu connexe

Tendances

ClickHouse Deep Dive, by Aleksei Milovidov
ClickHouse Deep Dive, by Aleksei MilovidovClickHouse Deep Dive, by Aleksei Milovidov
ClickHouse Deep Dive, by Aleksei MilovidovAltinity Ltd
 
All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...
All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...
All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...Altinity Ltd
 
High Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouseHigh Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouseAltinity Ltd
 
Webinar: Secrets of ClickHouse Query Performance, by Robert Hodges
Webinar: Secrets of ClickHouse Query Performance, by Robert HodgesWebinar: Secrets of ClickHouse Query Performance, by Robert Hodges
Webinar: Secrets of ClickHouse Query Performance, by Robert HodgesAltinity Ltd
 
Better than you think: Handling JSON data in ClickHouse
Better than you think: Handling JSON data in ClickHouseBetter than you think: Handling JSON data in ClickHouse
Better than you think: Handling JSON data in ClickHouseAltinity Ltd
 
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevAltinity Ltd
 
ClickHouse Mark Cache, by Mik Kocikowski, Cloudflare
ClickHouse Mark Cache, by Mik Kocikowski, CloudflareClickHouse Mark Cache, by Mik Kocikowski, Cloudflare
ClickHouse Mark Cache, by Mik Kocikowski, CloudflareAltinity Ltd
 
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlareClickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlareAltinity Ltd
 
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...Altinity Ltd
 
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEOClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEOAltinity Ltd
 
Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEO
Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEODangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEO
Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEOAltinity Ltd
 
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...Altinity Ltd
 
Altinity Quickstart for ClickHouse
Altinity Quickstart for ClickHouseAltinity Quickstart for ClickHouse
Altinity Quickstart for ClickHouseAltinity Ltd
 
A Day in the Life of a ClickHouse Query Webinar Slides
A Day in the Life of a ClickHouse Query Webinar Slides A Day in the Life of a ClickHouse Query Webinar Slides
A Day in the Life of a ClickHouse Query Webinar Slides Altinity Ltd
 
ClickHouse Features for Advanced Users, by Aleksei Milovidov
ClickHouse Features for Advanced Users, by Aleksei MilovidovClickHouse Features for Advanced Users, by Aleksei Milovidov
ClickHouse Features for Advanced Users, by Aleksei MilovidovAltinity Ltd
 
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert HodgesA Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert HodgesAltinity Ltd
 
ClickHouse Monitoring 101: What to monitor and how
ClickHouse Monitoring 101: What to monitor and howClickHouse Monitoring 101: What to monitor and how
ClickHouse Monitoring 101: What to monitor and howAltinity Ltd
 
Adventures with the ClickHouse ReplacingMergeTree Engine
Adventures with the ClickHouse ReplacingMergeTree EngineAdventures with the ClickHouse ReplacingMergeTree Engine
Adventures with the ClickHouse ReplacingMergeTree EngineAltinity Ltd
 
Data Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UI
Data Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UIData Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UI
Data Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UIAltinity Ltd
 

Tendances (20)

ClickHouse Deep Dive, by Aleksei Milovidov
ClickHouse Deep Dive, by Aleksei MilovidovClickHouse Deep Dive, by Aleksei Milovidov
ClickHouse Deep Dive, by Aleksei Milovidov
 
All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...
All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...
All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...
 
High Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouseHigh Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouse
 
Webinar: Secrets of ClickHouse Query Performance, by Robert Hodges
Webinar: Secrets of ClickHouse Query Performance, by Robert HodgesWebinar: Secrets of ClickHouse Query Performance, by Robert Hodges
Webinar: Secrets of ClickHouse Query Performance, by Robert Hodges
 
Better than you think: Handling JSON data in ClickHouse
Better than you think: Handling JSON data in ClickHouseBetter than you think: Handling JSON data in ClickHouse
Better than you think: Handling JSON data in ClickHouse
 
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
 
ClickHouse Mark Cache, by Mik Kocikowski, Cloudflare
ClickHouse Mark Cache, by Mik Kocikowski, CloudflareClickHouse Mark Cache, by Mik Kocikowski, Cloudflare
ClickHouse Mark Cache, by Mik Kocikowski, Cloudflare
 
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlareClickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
 
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
 
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEOClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
 
Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEO
Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEODangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEO
Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEO
 
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...
 
Altinity Quickstart for ClickHouse
Altinity Quickstart for ClickHouseAltinity Quickstart for ClickHouse
Altinity Quickstart for ClickHouse
 
ClickHouse Intro
ClickHouse IntroClickHouse Intro
ClickHouse Intro
 
A Day in the Life of a ClickHouse Query Webinar Slides
A Day in the Life of a ClickHouse Query Webinar Slides A Day in the Life of a ClickHouse Query Webinar Slides
A Day in the Life of a ClickHouse Query Webinar Slides
 
ClickHouse Features for Advanced Users, by Aleksei Milovidov
ClickHouse Features for Advanced Users, by Aleksei MilovidovClickHouse Features for Advanced Users, by Aleksei Milovidov
ClickHouse Features for Advanced Users, by Aleksei Milovidov
 
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert HodgesA Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges
 
ClickHouse Monitoring 101: What to monitor and how
ClickHouse Monitoring 101: What to monitor and howClickHouse Monitoring 101: What to monitor and how
ClickHouse Monitoring 101: What to monitor and how
 
Adventures with the ClickHouse ReplacingMergeTree Engine
Adventures with the ClickHouse ReplacingMergeTree EngineAdventures with the ClickHouse ReplacingMergeTree Engine
Adventures with the ClickHouse ReplacingMergeTree Engine
 
Data Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UI
Data Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UIData Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UI
Data Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UI
 

Similaire à ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev and Robert Hodges, Altinity

ClickHouse Materialized Views: The Magic Continues
ClickHouse Materialized Views: The Magic ContinuesClickHouse Materialized Views: The Magic Continues
ClickHouse Materialized Views: The Magic ContinuesAltinity Ltd
 
ClickHouse and the Magic of Materialized Views, By Robert Hodges and Altinity...
ClickHouse and the Magic of Materialized Views, By Robert Hodges and Altinity...ClickHouse and the Magic of Materialized Views, By Robert Hodges and Altinity...
ClickHouse and the Magic of Materialized Views, By Robert Hodges and Altinity...Altinity Ltd
 
REX Hadoop et R
REX Hadoop et RREX Hadoop et R
REX Hadoop et Rpkernevez
 
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion Future of Data Meetup
 
MongoDB .local Bengaluru 2019: The Journey of Migration from Oracle to MongoD...
MongoDB .local Bengaluru 2019: The Journey of Migration from Oracle to MongoD...MongoDB .local Bengaluru 2019: The Journey of Migration from Oracle to MongoD...
MongoDB .local Bengaluru 2019: The Journey of Migration from Oracle to MongoD...MongoDB
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Amazon Web Services
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simpleDori Waldman
 
Beyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the codeBeyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the codeWim Godden
 
Scaling PostgreSQL With GridSQL
Scaling PostgreSQL With GridSQLScaling PostgreSQL With GridSQL
Scaling PostgreSQL With GridSQLJim Mlodgenski
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftAmazon Web Services
 
Amazon Redshift
Amazon RedshiftAmazon Redshift
Amazon RedshiftJeff Patti
 
Launching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWSLaunching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWSAmazon Web Services
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftAmazon Web Services
 
Handling 20 billion requests a month
Handling 20 billion requests a monthHandling 20 billion requests a month
Handling 20 billion requests a monthDmitriy Dumanskiy
 
Sql on hadoop the secret presentation.3pptx
Sql on hadoop  the secret presentation.3pptxSql on hadoop  the secret presentation.3pptx
Sql on hadoop the secret presentation.3pptxPaulo Alonso
 
BWC Supercomputing 2008 Presentation
BWC Supercomputing 2008 PresentationBWC Supercomputing 2008 Presentation
BWC Supercomputing 2008 Presentationlilyco
 
Building an Analytic Extension to MySQL with ClickHouse and Open Source
Building an Analytic Extension to MySQL with ClickHouse and Open SourceBuilding an Analytic Extension to MySQL with ClickHouse and Open Source
Building an Analytic Extension to MySQL with ClickHouse and Open SourceAltinity Ltd
 
Building an Analytic Extension to MySQL with ClickHouse and Open Source.pptx
Building an Analytic Extension to MySQL with ClickHouse and Open Source.pptxBuilding an Analytic Extension to MySQL with ClickHouse and Open Source.pptx
Building an Analytic Extension to MySQL with ClickHouse and Open Source.pptxAltinity Ltd
 

Similaire à ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev and Robert Hodges, Altinity (20)

ClickHouse Materialized Views: The Magic Continues
ClickHouse Materialized Views: The Magic ContinuesClickHouse Materialized Views: The Magic Continues
ClickHouse Materialized Views: The Magic Continues
 
ClickHouse and the Magic of Materialized Views, By Robert Hodges and Altinity...
ClickHouse and the Magic of Materialized Views, By Robert Hodges and Altinity...ClickHouse and the Magic of Materialized Views, By Robert Hodges and Altinity...
ClickHouse and the Magic of Materialized Views, By Robert Hodges and Altinity...
 
REX Hadoop et R
REX Hadoop et RREX Hadoop et R
REX Hadoop et R
 
Malstone KDD 2010
Malstone KDD 2010Malstone KDD 2010
Malstone KDD 2010
 
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
 
MongoDB .local Bengaluru 2019: The Journey of Migration from Oracle to MongoD...
MongoDB .local Bengaluru 2019: The Journey of Migration from Oracle to MongoD...MongoDB .local Bengaluru 2019: The Journey of Migration from Oracle to MongoD...
MongoDB .local Bengaluru 2019: The Journey of Migration from Oracle to MongoD...
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simple
 
Beyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the codeBeyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the code
 
Scaling PostgreSQL With GridSQL
Scaling PostgreSQL With GridSQLScaling PostgreSQL With GridSQL
Scaling PostgreSQL With GridSQL
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
Amazon Redshift
Amazon RedshiftAmazon Redshift
Amazon Redshift
 
Launching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWSLaunching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWS
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
Masterclass - Redshift
Masterclass - RedshiftMasterclass - Redshift
Masterclass - Redshift
 
Handling 20 billion requests a month
Handling 20 billion requests a monthHandling 20 billion requests a month
Handling 20 billion requests a month
 
Sql on hadoop the secret presentation.3pptx
Sql on hadoop  the secret presentation.3pptxSql on hadoop  the secret presentation.3pptx
Sql on hadoop the secret presentation.3pptx
 
BWC Supercomputing 2008 Presentation
BWC Supercomputing 2008 PresentationBWC Supercomputing 2008 Presentation
BWC Supercomputing 2008 Presentation
 
Building an Analytic Extension to MySQL with ClickHouse and Open Source
Building an Analytic Extension to MySQL with ClickHouse and Open SourceBuilding an Analytic Extension to MySQL with ClickHouse and Open Source
Building an Analytic Extension to MySQL with ClickHouse and Open Source
 
Building an Analytic Extension to MySQL with ClickHouse and Open Source.pptx
Building an Analytic Extension to MySQL with ClickHouse and Open Source.pptxBuilding an Analytic Extension to MySQL with ClickHouse and Open Source.pptx
Building an Analytic Extension to MySQL with ClickHouse and Open Source.pptx
 

Plus de Altinity Ltd

Cloud Native ClickHouse at Scale--Using the Altinity Kubernetes Operator-2022...
Cloud Native ClickHouse at Scale--Using the Altinity Kubernetes Operator-2022...Cloud Native ClickHouse at Scale--Using the Altinity Kubernetes Operator-2022...
Cloud Native ClickHouse at Scale--Using the Altinity Kubernetes Operator-2022...Altinity Ltd
 
Fun with ClickHouse Window Functions-2021-08-19.pdf
Fun with ClickHouse Window Functions-2021-08-19.pdfFun with ClickHouse Window Functions-2021-08-19.pdf
Fun with ClickHouse Window Functions-2021-08-19.pdfAltinity Ltd
 
Cloud Native Data Warehouses - Intro to ClickHouse on Kubernetes-2021-07.pdf
Cloud Native Data Warehouses - Intro to ClickHouse on Kubernetes-2021-07.pdfCloud Native Data Warehouses - Intro to ClickHouse on Kubernetes-2021-07.pdf
Cloud Native Data Warehouses - Intro to ClickHouse on Kubernetes-2021-07.pdfAltinity Ltd
 
Building High Performance Apps with Altinity Stable Builds for ClickHouse | A...
Building High Performance Apps with Altinity Stable Builds for ClickHouse | A...Building High Performance Apps with Altinity Stable Builds for ClickHouse | A...
Building High Performance Apps with Altinity Stable Builds for ClickHouse | A...Altinity Ltd
 
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...Altinity Ltd
 
Own your ClickHouse data with Altinity.Cloud Anywhere-2023-01-17.pdf
Own your ClickHouse data with Altinity.Cloud Anywhere-2023-01-17.pdfOwn your ClickHouse data with Altinity.Cloud Anywhere-2023-01-17.pdf
Own your ClickHouse data with Altinity.Cloud Anywhere-2023-01-17.pdfAltinity Ltd
 
ClickHouse ReplacingMergeTree in Telecom Apps
ClickHouse ReplacingMergeTree in Telecom AppsClickHouse ReplacingMergeTree in Telecom Apps
ClickHouse ReplacingMergeTree in Telecom AppsAltinity Ltd
 
Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot
Building a Real-Time Analytics Application with  Apache Pulsar and Apache PinotBuilding a Real-Time Analytics Application with  Apache Pulsar and Apache Pinot
Building a Real-Time Analytics Application with Apache Pulsar and Apache PinotAltinity Ltd
 
Altinity Webinar: Introduction to Altinity.Cloud-Platform for Real-Time Data.pdf
Altinity Webinar: Introduction to Altinity.Cloud-Platform for Real-Time Data.pdfAltinity Webinar: Introduction to Altinity.Cloud-Platform for Real-Time Data.pdf
Altinity Webinar: Introduction to Altinity.Cloud-Platform for Real-Time Data.pdfAltinity Ltd
 
OSA Con 2022 - What Data Engineering Can Learn from Frontend Engineering - Pe...
OSA Con 2022 - What Data Engineering Can Learn from Frontend Engineering - Pe...OSA Con 2022 - What Data Engineering Can Learn from Frontend Engineering - Pe...
OSA Con 2022 - What Data Engineering Can Learn from Frontend Engineering - Pe...Altinity Ltd
 
OSA Con 2022 - Welcome to OSA CON Version 2022 - Robert Hodges - Altinity.pdf
OSA Con 2022 - Welcome to OSA CON Version 2022 - Robert Hodges - Altinity.pdfOSA Con 2022 - Welcome to OSA CON Version 2022 - Robert Hodges - Altinity.pdf
OSA Con 2022 - Welcome to OSA CON Version 2022 - Robert Hodges - Altinity.pdfAltinity Ltd
 
OSA Con 2022 - Using ClickHouse Database to Power Analytics and Customer Enga...
OSA Con 2022 - Using ClickHouse Database to Power Analytics and Customer Enga...OSA Con 2022 - Using ClickHouse Database to Power Analytics and Customer Enga...
OSA Con 2022 - Using ClickHouse Database to Power Analytics and Customer Enga...Altinity Ltd
 
OSA Con 2022 - Tips and Tricks to Keep Your Queries under 100ms with ClickHou...
OSA Con 2022 - Tips and Tricks to Keep Your Queries under 100ms with ClickHou...OSA Con 2022 - Tips and Tricks to Keep Your Queries under 100ms with ClickHou...
OSA Con 2022 - Tips and Tricks to Keep Your Queries under 100ms with ClickHou...Altinity Ltd
 
OSA Con 2022 - The Open Source Analytic Universe, Version 2022 - Robert Hodge...
OSA Con 2022 - The Open Source Analytic Universe, Version 2022 - Robert Hodge...OSA Con 2022 - The Open Source Analytic Universe, Version 2022 - Robert Hodge...
OSA Con 2022 - The Open Source Analytic Universe, Version 2022 - Robert Hodge...Altinity Ltd
 
OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...
OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...
OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...Altinity Ltd
 
OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...
OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...
OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...Altinity Ltd
 
OSA Con 2022 - State of Open Source Databases - Peter Zaitsev - Percona.pdf
OSA Con 2022 - State of Open Source Databases - Peter Zaitsev - Percona.pdfOSA Con 2022 - State of Open Source Databases - Peter Zaitsev - Percona.pdf
OSA Con 2022 - State of Open Source Databases - Peter Zaitsev - Percona.pdfAltinity Ltd
 
OSA Con 2022 - Specifics of data analysis in Time Series Databases - Roman Kh...
OSA Con 2022 - Specifics of data analysis in Time Series Databases - Roman Kh...OSA Con 2022 - Specifics of data analysis in Time Series Databases - Roman Kh...
OSA Con 2022 - Specifics of data analysis in Time Series Databases - Roman Kh...Altinity Ltd
 
OSA Con 2022 - Signal Correlation, the Ho11y Grail - Michael Hausenblas - AWS...
OSA Con 2022 - Signal Correlation, the Ho11y Grail - Michael Hausenblas - AWS...OSA Con 2022 - Signal Correlation, the Ho11y Grail - Michael Hausenblas - AWS...
OSA Con 2022 - Signal Correlation, the Ho11y Grail - Michael Hausenblas - AWS...Altinity Ltd
 
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdfOSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdfAltinity Ltd
 

Plus de Altinity Ltd (20)

Cloud Native ClickHouse at Scale--Using the Altinity Kubernetes Operator-2022...
Cloud Native ClickHouse at Scale--Using the Altinity Kubernetes Operator-2022...Cloud Native ClickHouse at Scale--Using the Altinity Kubernetes Operator-2022...
Cloud Native ClickHouse at Scale--Using the Altinity Kubernetes Operator-2022...
 
Fun with ClickHouse Window Functions-2021-08-19.pdf
Fun with ClickHouse Window Functions-2021-08-19.pdfFun with ClickHouse Window Functions-2021-08-19.pdf
Fun with ClickHouse Window Functions-2021-08-19.pdf
 
Cloud Native Data Warehouses - Intro to ClickHouse on Kubernetes-2021-07.pdf
Cloud Native Data Warehouses - Intro to ClickHouse on Kubernetes-2021-07.pdfCloud Native Data Warehouses - Intro to ClickHouse on Kubernetes-2021-07.pdf
Cloud Native Data Warehouses - Intro to ClickHouse on Kubernetes-2021-07.pdf
 
Building High Performance Apps with Altinity Stable Builds for ClickHouse | A...
Building High Performance Apps with Altinity Stable Builds for ClickHouse | A...Building High Performance Apps with Altinity Stable Builds for ClickHouse | A...
Building High Performance Apps with Altinity Stable Builds for ClickHouse | A...
 
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...
 
Own your ClickHouse data with Altinity.Cloud Anywhere-2023-01-17.pdf
Own your ClickHouse data with Altinity.Cloud Anywhere-2023-01-17.pdfOwn your ClickHouse data with Altinity.Cloud Anywhere-2023-01-17.pdf
Own your ClickHouse data with Altinity.Cloud Anywhere-2023-01-17.pdf
 
ClickHouse ReplacingMergeTree in Telecom Apps
ClickHouse ReplacingMergeTree in Telecom AppsClickHouse ReplacingMergeTree in Telecom Apps
ClickHouse ReplacingMergeTree in Telecom Apps
 
Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot
Building a Real-Time Analytics Application with  Apache Pulsar and Apache PinotBuilding a Real-Time Analytics Application with  Apache Pulsar and Apache Pinot
Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot
 
Altinity Webinar: Introduction to Altinity.Cloud-Platform for Real-Time Data.pdf
Altinity Webinar: Introduction to Altinity.Cloud-Platform for Real-Time Data.pdfAltinity Webinar: Introduction to Altinity.Cloud-Platform for Real-Time Data.pdf
Altinity Webinar: Introduction to Altinity.Cloud-Platform for Real-Time Data.pdf
 
OSA Con 2022 - What Data Engineering Can Learn from Frontend Engineering - Pe...
OSA Con 2022 - What Data Engineering Can Learn from Frontend Engineering - Pe...OSA Con 2022 - What Data Engineering Can Learn from Frontend Engineering - Pe...
OSA Con 2022 - What Data Engineering Can Learn from Frontend Engineering - Pe...
 
OSA Con 2022 - Welcome to OSA CON Version 2022 - Robert Hodges - Altinity.pdf
OSA Con 2022 - Welcome to OSA CON Version 2022 - Robert Hodges - Altinity.pdfOSA Con 2022 - Welcome to OSA CON Version 2022 - Robert Hodges - Altinity.pdf
OSA Con 2022 - Welcome to OSA CON Version 2022 - Robert Hodges - Altinity.pdf
 
OSA Con 2022 - Using ClickHouse Database to Power Analytics and Customer Enga...
OSA Con 2022 - Using ClickHouse Database to Power Analytics and Customer Enga...OSA Con 2022 - Using ClickHouse Database to Power Analytics and Customer Enga...
OSA Con 2022 - Using ClickHouse Database to Power Analytics and Customer Enga...
 
OSA Con 2022 - Tips and Tricks to Keep Your Queries under 100ms with ClickHou...
OSA Con 2022 - Tips and Tricks to Keep Your Queries under 100ms with ClickHou...OSA Con 2022 - Tips and Tricks to Keep Your Queries under 100ms with ClickHou...
OSA Con 2022 - Tips and Tricks to Keep Your Queries under 100ms with ClickHou...
 
OSA Con 2022 - The Open Source Analytic Universe, Version 2022 - Robert Hodge...
OSA Con 2022 - The Open Source Analytic Universe, Version 2022 - Robert Hodge...OSA Con 2022 - The Open Source Analytic Universe, Version 2022 - Robert Hodge...
OSA Con 2022 - The Open Source Analytic Universe, Version 2022 - Robert Hodge...
 
OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...
OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...
OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...
 
OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...
OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...
OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...
 
OSA Con 2022 - State of Open Source Databases - Peter Zaitsev - Percona.pdf
OSA Con 2022 - State of Open Source Databases - Peter Zaitsev - Percona.pdfOSA Con 2022 - State of Open Source Databases - Peter Zaitsev - Percona.pdf
OSA Con 2022 - State of Open Source Databases - Peter Zaitsev - Percona.pdf
 
OSA Con 2022 - Specifics of data analysis in Time Series Databases - Roman Kh...
OSA Con 2022 - Specifics of data analysis in Time Series Databases - Roman Kh...OSA Con 2022 - Specifics of data analysis in Time Series Databases - Roman Kh...
OSA Con 2022 - Specifics of data analysis in Time Series Databases - Roman Kh...
 
OSA Con 2022 - Signal Correlation, the Ho11y Grail - Michael Hausenblas - AWS...
OSA Con 2022 - Signal Correlation, the Ho11y Grail - Michael Hausenblas - AWS...OSA Con 2022 - Signal Correlation, the Ho11y Grail - Michael Hausenblas - AWS...
OSA Con 2022 - Signal Correlation, the Ho11y Grail - Michael Hausenblas - AWS...
 
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdfOSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
 

Dernier

Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 

Dernier (20)

Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 

ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev and Robert Hodges, Altinity

  • 1. ClickHouse Data Warehouse 101 The First Billion Rows Alexander Zaitsev and Robert Hodges
  • 2. About Us Alexander Zaitsev - Altinity CTO Expert in data warehouse with petabyte-scale deployments. Altinity Founder; Previously at LifeStreet (Ad Tech business) Robert Hodges - Altinity CEO 30+ years on DBMS plus virtualization and security. Previously at VMware and Continuent
  • 3. Altinity Background ● Premier provider of software and services for ClickHouse ● Incorporated in UK with distributed team in US/Canada/Europe ● Main US/Europe sponsor of ClickHouse community ● Offerings: ○ Enterprise support for ClickHouse and ecosystem projects ○ Software (Kubernetes, cluster manager, tools & utilities) ○ POCs/Training
  • 5. ClickHouse is a powerful data warehouse that handles many use cases Understands SQL Runs on bare metal to cloud Stores data in columns Parallel and vectorized execution Scales to many petabytes Is Open source (Apache 2.0) Is WAY fast! a b c d a b c d a b c d a b c d
  • 6. Tables are split into indexed, sorted parts for fast queries Table Part Index Columns Indexed Sorted Compressed Part Index Columns Part
  • 7. If one server is not enough -- ClickHouse can scale out easily SELECT ... FROM tripdata_dist ClickHouse tripdata_dist (Distributed) tripdata (MergeTable) ClickHouse tripdata_dist tripdata ClickHouse tripdata_dist tripdata Result Set
  • 9. Installation: Use packages on Linux host $ sudo apt -y install clickhouse-client=19.6.2 clickhouse-server=19.6.2 clickhouse-common-static=19.6.2 ... $ sudo systemctl start clickhouse-server ... $ clickhouse-client 11e99303c78e :) select version() ... ┌─version()─┐ │ 19.6.2.11 │ └───────────┘
  • 10. Decision tree for ClickHouse basic schema design Types are known? Fields are fixed? Yes Use array columns to store key value pairs No Use scalar columns with String type No Yes Use scalar columns with specific type Select partition key and sort order
  • 11. Tabular data structure typically gives the best results CREATE TABLE tripdata ( `pickup_date` Date DEFAULT toDate(tpep_pickup_datetime), `id` UInt64, `vendor_id` String, `tpep_pickup_datetime` DateTime, `tpep_dropoff_datetime` DateTime, ... ) ENGINE = MergeTree PARTITION BY toYYYYMM(pickup_date) ORDER BY (pickup_location_id, dropoff_location_id, vendor_id) Time-based partition key Scalar columns Specific datatypes Sort key to index parts
  • 12. Use clickhouse-client to load data quickly from files "Pickup_date","id","vendor_id","tpep_pickup_datetime"… "2016-01-02",0,"1","2016-01-02 04:03:29","2016-01-02… "2016-01-29",0,"1","2016-01-29 12:00:51","2016-01-29… "2016-01-09",0,"1","2016-01-09 17:22:05","2016-01-09… clickhouse-client --database=nyc_taxi_rides --query='INSERT INTO tripdata FORMAT CSVWithNames' < data.csv CSV Input Data Reading CSV Input with Headers gzip -d -c | clickhouse-client --database=nyc_taxi_rides --query='INSERT INTO tripdata FORMAT CSVWithNames' Reading Gzipped CSV Input with Headers
  • 13. Wouldn’t it be nice to run in parallel over a lot of input files? Altinity Datasets project does exactly that! ● Dump existing schema definitions and data to files ● Load files back into a database ● Data dump/load commands run in parallel See https://github.com/Altinity/altinity-datasets
  • 14. How long does it take to load 1.3B rows? $ time ad-cli dataset load nyc_taxi_rides --repo_path=/data1/sample-data Creating database if it does not exist: nyc_timed Executing DDL: /data1/sample-data/nyc_taxi_rides/ddl/taxi_zones.sql . . . Loading data: table=tripdata, file=data-200901.csv.gz . . . Operation summary: succeeded=193, failed=0 real 11m4.827s user 63m32.854s sys 2m41.235s (Amazon md5.2xlarge: Xeon(R) Platinum 8175M, 8vCPU, 30GB RAM, NVMe SSD)
  • 15. Do we really have 1B+ table? :) select count() from tripdata; SELECT count() FROM tripdata ┌────count()─┐ │ 1310903963 │ └────────────┘ 1 rows in set. Elapsed: 0.324 sec. Processed 1.31 billion rows, 1.31 GB (4.05 billion rows/s., 4.05 GB/s.) 1,310,903,963/11m4s = 1,974,253 rows/sec!!!
  • 17. Let’s try to predict maximum performance SELECT avg(number) FROM ( SELECT number FROM system.numbers LIMIT 1310903963 ) ┌─avg(number)─┐ │ 655451981 │ └─────────────┘ 1 rows in set. Elapsed: 3.420 sec. Processed 1.31 billion rows, 10.49 GB (383.29 million rows/s., 3.07 GB/s.) system.numbers -- internal generator for testing
  • 18. Now we try with the real data SELECT avg(passenger_count) FROM tripdata ┌─avg(passenger_count)─┐ │ 1.6817462943317076 │ └──────────────────────┘ 1 rows in set. Elapsed: ? Guess how fast?
  • 19. Now we try with the real data SELECT avg(passenger_count) FROM tripdata ┌─avg(passenger_count)─┐ │ 1.6817462943317076 │ └──────────────────────┘ 1 rows in set. Elapsed: 1.084 sec. Processed 1.31 billion rows, 1.31 GB (1.21 billion rows/s., 1.21 GB/s.) Even faster!!!! Data type and cardinality matters
  • 20. What if we add a filter SELECT avg(passenger_count) FROM tripdata WHERE toYear(pickup_date) = 2016 ┌─avg(passenger_count)─┐ │ 1.6571129913837774 │ └──────────────────────┘ 1 rows in set. Elapsed: 0.162 sec. Processed 131.17 million rows, 393.50 MB (811.05 million rows/s., 2.43 GB/s.)
  • 21. What if we add a group by SELECT pickup_location_id AS location_id, avg(passenger_count), count() FROM tripdata WHERE toYear(pickup_date) = 2016 GROUP BY location_id LIMIT 10 ... 10 rows in set. Elapsed: 0.251 sec. Processed 131.17 million rows, 655.83 MB (522.62 million rows/s., 2.61 GB/s.)
  • 22. What if we add a join SELECT zone, avg(passenger_count), count() FROM tripdata INNER JOIN taxi_zones ON taxi_zones.location_id = pickup_location_id WHERE toYear(pickup_date) = 2016 GROUP BY zone LIMIT 10 10 rows in set. Elapsed: 0.803 sec. Processed 131.17 million rows, 655.83 MB (163.29 million rows/s., 816.44 MB/s.)
  • 23. Yes, ClickHouse is FAST! https://tech.marksblogg.com/benchmarks.html
  • 24. Optimization Techniques How to make ClickHouse even faster
  • 25. You can optimize Server settings Schema Column storage Queries
  • 26. You can optimize SELECT avg(passenger_count) FROM tripdata SETTINGS max_threads = 1 ... 1 rows in set. Elapsed: 4.855 sec. Processed 1.31 billion rows, 1.31 GB (270.04 million rows/s., 270.04 MB/s.) SELECT avg(passenger_count) FROM tripdata SETTINGS max_threads = 8 ... 1 rows in set. Elapsed: 1.092 sec. Processed 1.31 billion rows, 1.31 GB (1.20 billion rows/s., 1.20 GB/s.) Default is a half of available cores -- good enough
  • 29. MaterializedView with SummingMergeTree CREATE MATERIALIZED VIEW tripdata_mv ENGINE = SummingMergeTree PARTITION BY toYYYYMM(pickup_date) ORDER BY (pickup_location_id, dropoff_location_id, vendor_id) AS SELECT pickup_date, vendor_id, pickup_location_id, dropoff_location_id, sum(passenger_count) AS passenger_count_sum, sum(trip_distance) AS trip_distance_sum, sum(fare_amount) AS fare_amount_sum, sum(tip_amount) AS tip_amount_sum, sum(tolls_amount) AS tolls_amount_sum, sum(total_amount) AS total_amount_sum, count() AS trips_count FROM tripdata GROUP BY pickup_date, vendor_id, pickup_location_id, dropoff_location_id MaterializedView works as an INSERT trigger SummingMergeTree automatically aggregates data in the background
  • 30. MaterializedView with SummingMergeTree INSERT INTO tripdata_mv SELECT pickup_date, vendor_id, pickup_location_id, dropoff_location_id, passenger_count, trip_distance, fare_amount, tip_amount, tolls_amount, total_amount, 1 FROM tripdata; Ok. 0 rows in set. Elapsed: 303.664 sec. Processed 1.31 billion rows, 50.57 GB (4.32 million rows/s., 166.54 MB/s.) Note, no group by! SummingMergeTree automatically aggregates data in the background
  • 31. MaterializedView with SummingMergeTree SELECT count() FROM tripdata_mv ┌──count()─┐ │ 20742525 │ └──────────┘ 1 rows in set. Elapsed: 0.015 sec. Processed 20.74 million rows, 41.49 MB (1.39 billion rows/s., 2.78 GB/s.) SELECT zone, sum(passenger_count_sum)/sum(trips_count), sum(trips_count) FROM tripdata_mv INNER JOIN taxi_zones ON taxi_zones.location_id = pickup_location_id WHERE toYear(pickup_date) = 2016 GROUP BY zone LIMIT 10 10 rows in set. Elapsed: 0.036 sec. Processed 3.23 million rows, 64.57 MB (89.14 million rows/s., 1.78 GB/s.)
  • 32. Realtime Aggreation with Materialized Views Raw Data Summing MergeTree Summing MergeTree Summing MergeTree INSERTS
  • 34. :) create table test_lc ( a String, a_lc LowCardinality(String) DEFAULT a) Engine = MergeTree PARTITION BY tuple() ORDER BY tuple(); :) INSERT INTO test_lc (a) SELECT concat('openconfig-interfaces:interfaces/interface/subinterfaces/subinter face/state/index', toString(rand() % 1000)) FROM system.numbers LIMIT 1000000000; ┌─table───┬─name─┬─type───────────────────┬─compressed─┬─uncompressed─┐ │ test_lc │ a │ String │ 4663631515 │ 84889975226 │ │ test_lc │ a_lc │ LowCardinality(String) │ 2010472937 │ 2002717299 │ └─────────┴──────┴────────────────────────┴────────────┴──────────────┘ LowCardinality example. Another 1B rows. Storage is dramatically reduced LowCardinality encodes column with a dictionary encoding
  • 35. :) select a a, count(*) from test_lc group by a order by count(*) desc limit 10; ┌─a────────────────────────────────────────────────────────────────────────────────────┬─count()─┐ │ openconfig-interfaces:interfaces/interface/subinterfaces/subinterface/state/index396 │ 1002761 │ ... │ openconfig-interfaces:interfaces/interface/subinterfaces/subinterface/state/index5 │ 1002203 │ └──────────────────────────────────────────────────────────────────────────────────────┴─────────┘ 10 rows in set. Elapsed: 11.627 sec. Processed 1.00 billion rows, 92.89 GB (86.00 million rows/s., 7.99 GB/s.) :) select a_lc a, count(*) from test_lc group by a order by count(*) desc limit 10; ... 10 rows in set. Elapsed: 1.569 sec. Processed 1.00 billion rows, 3.42 GB (637.50 million rows/s., 2.18 GB/s.) LowCardinality example. Another 1B rows Faster
  • 36. create table test_array ( s String, a Array(LowCardinality(String)) default arrayDistinct(splitByChar(',', s)) ) Engine = MergeTree PARTITION BY tuple() ORDER BY tuple(); INSERT INTO test_array (s) WITH ['Percona', 'Live', 'Altinity', 'ClickHouse', 'MySQL', 'Oracle', 'Austin', 'Texas', 'PostgreSQL', 'MongoDB'] AS keywords SELECT concat(keywords[((rand(1) % 10) + 1)], ',', keywords[((rand(2) % 10) + 1)], ',', keywords[((rand(3) % 10) + 1)], ',', keywords[((rand(4) % 10) + 1)]) FROM system.numbers LIMIT 1000000000; Array example. Another 1B rows Arrays efficiently model 1-to-N relationship Note the use of complex default expression
  • 37. Data sample: ┌─s────────────────────────────────────┬─a────────────────────────────────────────────┐ │ Texas,ClickHouse,Live,MySQL │ ['Texas','ClickHouse','Live','MySQL'] │ │ Texas,Oracle,Altinity,PostgreSQL │ ['Texas','PostgreSQL','Oracle','Altinity'] │ │ Percona,MySQL,MySQL,Austin │ ['MySQL','Percona','Austin'] │ │ PostgreSQL,Austin,PostgreSQL,Percona │ ['PostgreSQL','Percona','Austin'] │ │ Altinity,Percona,Percona,Percona │ ['Altinity','Percona'] │ Storage: ┌─table──────┬─name─┬─type──────────────────────────┬────────comp─┬──────uncomp─┐ │ test_array │ s │ String │ 11239860686 │ 31200058000 │ │ test_array │ a │ Array(LowCardinality(String)) │ 4275679420 │ 11440948123 │ └────────────┴──────┴───────────────────────────────┴─────────────┴─────────────┘ Array example. Another 1B rows Array efficiently models 1-to-N relationship
  • 38. :) select count() from test_array where s like '%ClickHouse%'; ┌───count()─┐ │ 343877409 │ └───────────┘ 1 rows in set. Elapsed: 7.363 sec. Processed 1.00 billion rows, 39.20 GB (135.81 million rows/s., 5.32 GB/s.) :) select count() from test_array where has(a,'ClickHouse'); ┌───count()─┐ │ 343877409 │ └───────────┘ 1 rows in set. Elapsed: 8.428 sec. Processed 1.00 billion rows, 11.44 GB (118.66 million rows/s., 1.36 GB/s.) Array example. Another 1B rows Well, ‘like’ is very efficient, but we reduced I/O a lot. * has() will be optimized by dev team
  • 39. SELECT zone, avg(passenger_count), count() FROM tripdata INNER JOIN taxi_zones ON taxi_zones.location_id = pickup_location_id WHERE toYear(pickup_date) = 2016 GROUP BY zone LIMIT 10 10 rows in set. Elapsed: 0.803 sec. Processed 131.17 million rows, 655.83 MB (163.29 million rows/s., 816.44 MB/s.) Query optimization example. JOIN optimization Can we do it any faster?
  • 40. SELECT zone, sum(pc_sum) / sum(pc_cnt) AS pc_avg, sum(pc_cnt) FROM ( SELECT pickup_location_id, sum(passenger_count) AS pc_sum, count() AS pc_cnt FROM tripdata WHERE toYear(pickup_date) = 2016 GROUP BY pickup_location_id ) INNER JOIN taxi_zones ON taxi_zones.location_id = pickup_location_id GROUP BY zone LIMIT 10 10 rows in set. Elapsed: 0.248 sec. Processed 131.17 million rows, 655.83 MB (529.19 million rows/s., 2.65 GB/s.) Query optimization example. JOIN optimization Subquery minimizes data scanned in parallel; joins on GROUP BY results
  • 42. ...And a nice set of supporting ecosystem tools Client libraries: JDBC, ODBC, Python, Golang, ... Kafka table engine to ingest from Kafka queues Visualization tools: Grafana, Tableau, Tabix, SuperSet Data science stack integration: Pandas, Jupyter Notebooks Kubernetes ClickHouse operator
  • 43. Integrations with MySQL MySQL External Dictionaries (pull data from MySQL to CH) MySQL Table Engine and Table Function (query/insert) Binary Log Replication ProxySQL supports ClickHouse ClickHouse supports MySQL wire protocol (in June release)
  • 44. ..and with PostgreSQL ODBC External Dictionaries (pull data from PostgreSQL to CH) ODBC Table Engine and Table Function (query/insert) Logical Replication: https://github.com/mkabilov/pg2ch Foreign Data Wrapper: https://github.com/Percona-Lab/clickhousedb_fdw
  • 45. ClickHouse Operator -- an easy way to manage ClickHouse DWH in Kubernetes https://github.com/Altinity/clickhouse-operator
  • 46. Where to get more information ● ClickHouse Docs: https://clickhouse.yandex/docs/en/ ● Altinity Blog: https://www.altinity.com/blog ● Meetups and presentations: https://www.altinity.com/presentations ○ 2 April -- Madrid, Spain ClickHouse Meetup ○ 7 May -- Limassol, Cyprus ClickHouse Meetup ○ 28-30 May -- Austin, TX Percona Live 2019 ○ 4 June -- San Francisco ClickHouse Meetup ○ 8 June -- Beijing ClickHouse Meetup ○ September -- ClickHouse Paris Meetup
  • 47. Questions? Thank you! Contacts: info@altinity.com Visit us at: https://www.altinity.com Read Our Blog: https://www.altinity.com/blog