San Diego Cloud Native Computing Meetup, January 23, 2020
Presented by Robert Hodges, Altinity CEO
Data services are the latest wave of applications to catch the Kubernetes bug, but how many people would guess that includes data warehouses? We proved it works by developing the ClickHouse Kubernetes operator, which is now in production use at companies like Mux.com. It's an open source operator to stand up and run ClickHouse, a popular Apache 2.0 data warehouse that can answer queries on trillions of rows in seconds or less. This talk introduces ClickHouse and shows why it's a 'cloud friendly' DBMS. We'll go mano-a-mano with the ClickHouse operator, showing how you can spin up data warehouses in 60 seconds or less. We'll cover issues like storage management, monitoring and upgrade. In short, everything you need to know to try running your own ClickHouse data warehouses on Kubernetes.
Data warehouse on Kubernetes - gentle intro to Clickhouse Operator, by Robert Hodges
1. Data Warehouse on Kubernetes
A gentle introduction to the ClickHouse Kubernetes Operator
Robert Hodges
2. Brief Intros
www.altinity.com
Leading software and services provider for ClickHouse
Major committer and community sponsor in US and Western Europe
Robert Hodges - Altinity CEO
30+ years on DBMS plus virtualization and security.
ClickHouse is DBMS #20
3. Why run data warehouse on Kubernetes?
1. Same environment as other cloud native services
2. Portability
3. Fast deployment cycles
4. Flexible mapping to resources
...
5. It offers revolutionary capabilities for building analytic systems
5. Introduction to ClickHouse
Understands SQL
Runs on bare metal to cloud
Shared nothing architecture
Stores data in columns
Parallel and vectorized execution
Scales to many petabytes
Is open source (Apache 2.0)
And it’s really fast!
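A minimal sketch of why storing data in columns helps analytic queries. This is plain Python for illustration, not ClickHouse code; the table contents and the column names a, b, c, d are made up.

```python
# Illustrative only: contrasts row and column layouts for a tiny table.
rows = [
    {"a": 1, "b": 10, "c": 100, "d": 1000},
    {"a": 2, "b": 20, "c": 200, "d": 2000},
    {"a": 3, "b": 30, "c": 300, "d": 3000},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_store(rows, name):
    """A row layout walks every full row even when one field is needed."""
    return sum(row[name] for row in rows)


def sum_column_store(columns, name):
    """A column layout reads only the single column the query asks for."""
    return sum(columns[name])


columns = to_columns(rows)
```

Both functions return the same total; the difference is how much data each one has to touch to get there.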
6. ClickHouse structure is optimized for speed
[Diagram: a table is divided into parts; each part holds an index and columns. The columns are indexed, sorted, and compressed.]
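The idea of a part that is indexed, sorted, and compressed can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the ClickHouse on-disk format: GRANULE, build_part, read_column, and find_granule are hypothetical names, the input is assumed to be non-empty, and key values are assumed to be unique.

```python
import bisect
import json
import zlib

GRANULE = 4  # rows per index entry; a made-up value for illustration


def build_part(rows, key):
    """Sort rows by key, compress each column, and keep a sparse index."""
    ordered = sorted(rows, key=lambda r: r[key])
    columns = {
        name: zlib.compress(json.dumps([r[name] for r in ordered]).encode())
        for name in ordered[0]
    }
    # One index entry per granule: the key of the granule's first row.
    index = [ordered[i][key] for i in range(0, len(ordered), GRANULE)]
    return {"key": key, "index": index, "columns": columns}


def read_column(part, name):
    """Decompress one column without touching the others."""
    return json.loads(zlib.decompress(part["columns"][name]))


def find_granule(part, value):
    """Return the row range [start, stop) of the granule that may hold value."""
    pos = max(bisect.bisect_right(part["index"], value) - 1, 0)
    start = pos * GRANULE
    return start, start + GRANULE


rows = [{"id": i, "v": i * i} for i in (9, 3, 7, 1, 5, 0, 8, 2, 6, 4)]
part = build_part(rows, "id")
start, stop = find_granule(part, 5)
candidates = read_column(part, "v")[start:stop]
```

Because the rows are sorted, the sparse index narrows a lookup to one small range of rows, and only the requested column has to be decompressed.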
7. ClickHouse has built-in sharding & replication
[Diagram: six ClickHouse servers, each with an event_loc table; one server also has the event table, which receives the query below. Three Zookeeper nodes hold ZNodes.]
SELECT ...
FROM event
GROUP BY ...
Result Set
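A sharded GROUP BY can be pictured as scatter and gather: each shard aggregates its own rows, and the node that received the query merges the partial results. The sketch below is plain Python under that assumption, not ClickHouse internals; the shard contents, the loc column, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical shard contents: each shard holds a slice of the event rows.
shards = [
    [{"loc": "us"}, {"loc": "eu"}, {"loc": "us"}],
    [{"loc": "eu"}, {"loc": "eu"}],
    [{"loc": "ap"}, {"loc": "us"}],
]


def local_group_count(rows, column):
    """Runs on one shard: a partial count per distinct value of column."""
    return Counter(row[column] for row in rows)


def distributed_group_count(shards, column):
    """Runs on the initiating node: fan out, then merge partial counts."""
    merged = Counter()
    for rows in shards:
        merged.update(local_group_count(rows, column))
    return dict(merged)


result = distributed_group_count(shards, "loc")
```

Counts merge by simple addition, which is why this kind of aggregate spreads well across shards.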
8. What makes ClickHouse “cloud friendly?”
● Single process
● Relatively few configuration knobs
● Simple networking and storage
● Replication/high availability built in
● Already containerized!
11. Obligatory slide on Kubernetes
What does Kubernetes do for us?
● manage container-based systems
● build distributed applications declaratively
● allocate machine resources efficiently
● automate application deployment
12. A simple distributed data service
[Diagram: traffic enters a load balancer, which routes to Service #1, Service #2, and Service #3; each service has its own storage.]
13. Defined using Kubernetes resources
[Diagram: a Service "svc" in front of a Stateful Set with three pods (labelled "svc-1", "svc-2", and "svc-2" on the slide). Each pod has a Persistent Volume Claim bound to a Persistent Volume. Config Maps and Secrets complete the picture.]
16. ClickHouse on Kubernetes is complex!
[Diagram: Zookeeper Services in front of Zookeeper-0, Zookeeper-1, and Zookeeper-2. A Load Balancer Service in front of four ClickHouse replicas (Shard 1 Replica 1, Shard 1 Replica 2, Shard 2 Replica 1, Shard 2 Replica 2), each with its own Replica Service. Also shown: User Config Map, Common Config Map, Stateful Set, Pod, Persistent Volume Claim, Persistent Volume, and Per-replica Config Map.]
17. Operators encapsulate complex deployments
kube-system namespace
ClickHouse Operator
your-favorite namespace
Apache 2.0 source, distributed as Docker image
Single specification
Best practice deployment
ClickHouse Resource Definition
18. Installing and removing ClickHouse operator
Get operator custom resource definition:
wget https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator/clickhouse-operator-install.yaml
Install the operator:
kubectl apply -f clickhouse-operator-install.yaml
Remove the operator:
kubectl delete -f clickhouse-operator-install.yaml
19. You will also need Zookeeper
Simplest way is to use helm:
kubectl create namespace zk
helm install --namespace zk --name zookeeper incubator/zookeeper
(There’s also an operator for Zookeeper now)
22. Basic data warehouse topology
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "ch01"                # Name used to identify all resources
spec:
  configuration:
    clusters:                 # Definition of cluster
      - name: replicated
        layout:
          shardsCount: 2
          replicasCount: 2
    zookeeper:                # Location of service we depend on
      nodes:
        - host: zookeeper.zk
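A short Python sketch of what the layout implies: it builds the same resource as a dict and lists the shard/replica pairs the operator would have to create. The helper names build_installation and expected_replicas are hypothetical, and the zero-based numbering is an assumption chosen to match pod names such as chi-0-0 that appear later in the deck.

```python
import json


def build_installation(name, shards, replicas, zookeeper_host):
    """Return a ClickHouseInstallation resource as a plain dict."""
    return {
        "apiVersion": "clickhouse.altinity.com/v1",
        "kind": "ClickHouseInstallation",
        "metadata": {"name": name},
        "spec": {
            "configuration": {
                "clusters": [
                    {
                        "name": "replicated",
                        "layout": {
                            "shardsCount": shards,
                            "replicasCount": replicas,
                        },
                    }
                ],
                "zookeeper": {"nodes": [{"host": zookeeper_host}]},
            }
        },
    }


def expected_replicas(manifest):
    """List (shard, replica) index pairs implied by each cluster layout."""
    pairs = []
    for cluster in manifest["spec"]["configuration"]["clusters"]:
        layout = cluster["layout"]
        for shard in range(layout["shardsCount"]):
            for replica in range(layout["replicasCount"]):
                pairs.append((shard, replica))
    return pairs


manifest = build_installation("ch01", 2, 2, "zookeeper.zk")
text = json.dumps(manifest, indent=2)  # kubectl apply -f also accepts JSON
```

Two shards with two replicas each means four ClickHouse servers, which is the point of the single specification: four lines of layout stand in for all of the underlying resources.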
23. You can add users and change configuration
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "ch01"
spec:
  configuration:
    users:
      demo/default: secret
      demo/password: demo
      demo/profile: default
      demo/quota: default
      demo/networks/ip: "::/0"
    clusters:
      - name: replicated
Changes take a few minutes to propagate
24. Simplicity requires defaults
defaults:
  templates:
    volumeClaimTemplate: persistent
    podTemplate: clickhouse:19.6
    serviceTemplate: minikube
templates:
  volumeClaimTemplates:
    - name: persistent        # Name of template
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
Storage misconfigurations lead to insidious errors
25. Speaking of storage, we have options
● Cloud storage (complex, network access):
○ AWS
○ GKE
○ Other cloud providers
● Local storage (simple, fast):
○ emptyDir
○ hostPath
○ local
26. Use storageClassName to bind storage
Use kubectl to find available storage classes:
kubectl describe StorageClass
Bind to default storage:
spec:
  storageClassName: default
Bind to gp2 type:
spec:
  storageClassName: gp2
27. Templates can be simple
defaults:
  templates:
    volumeClaimTemplate: persistent
    podTemplate: clickhouse:19.6
    serviceTemplate: minikube
templates:
  podTemplates:
    - name: clickhouse:19.6   # Name of template
      spec:
        containers:
          - name: clickhouse-pod
            image: yandex/clickhouse-server:19.6.2.11
Most values take defaults
29. Versatile mapping to different deployments
[Diagram: one ClickHouse Resource Definition maps to different deployments made of pods and load balancers, from Minikube to a Multi-AZ Deployment.]
(Differences mostly in templates)
30. Changes are recognized automatically
defaults:
  templates:
    volumeClaimTemplate: persistent
    podTemplate: clickhouse:19.11     # Make new version the default
    serviceTemplate: minikube
templates:
  podTemplates:
    - name: clickhouse:19.11          # Define template for new version
      spec:
        containers:
          - name: clickhouse-pod
            image: yandex/clickhouse-server:19.11.3.11
31. Upgrade runs while service is online
Update resource definition
Apply the ClickHouse Resource Definition
ClickHouse Operator: plan
Compare resource to actual state
Upgrade pods sequentially
Pods: chi-0-0, chi-0-1, chi-1-0, chi-1-1
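The compare-then-upgrade flow can be sketched in Python as follows. This is a simplified model, not the operator's actual code: plan_upgrade, upgrade_sequentially, and the restart callback are hypothetical names, and a real operator would also wait for each pod to become ready before moving on.

```python
OLD = "yandex/clickhouse-server:19.6.2.11"
NEW = "yandex/clickhouse-server:19.11.3.11"


def plan_upgrade(desired_image, actual):
    """Compare the desired image with what each pod is running.

    `actual` maps pod name -> image. Returns the pods that differ,
    in a stable (sorted) order.
    """
    return [pod for pod in sorted(actual) if actual[pod] != desired_image]


def upgrade_sequentially(desired_image, actual, restart):
    """Upgrade one pod at a time so the other pods keep serving queries."""
    upgraded = []
    for pod in plan_upgrade(desired_image, actual):
        restart(pod, desired_image)  # hypothetical callback that replaces the pod
        actual[pod] = desired_image
        upgraded.append(pod)
    return upgraded


# Usage: chi-0-0 already runs the new image, so only three pods change.
actual = {"chi-0-0": NEW, "chi-0-1": OLD, "chi-1-0": OLD, "chi-1-1": OLD}
log = []
upgraded = upgrade_sequentially(NEW, actual, lambda pod, image: log.append(pod))
```

Running the plan a second time returns an empty list, which is the property that lets the operator re-apply the same resource definition safely.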
32. What’s going on inside Kubernetes?
[Diagram: kubectl apply submits the ClickHouse Resource Definition to the Kubernetes API, which keeps system state in Etcd. The ClickHouse Operator (a custom resource controller) and the native controllers exchange events and actions with the API.]
36. Con: DNS resolution is complex/error prone
[Diagram: pods chi-0-0, chi-0-1, chi-1-0, and chi-1-1 use DNS through the Core DNS Server; one pod restarts.]
Pod restart invalidates cluster DNS mappings
Name resolution deadlock at startup:
Must resolve host name to start up
Won’t resolve host until pod starts
37. Pro: Kubernetes overhead is minimal
[Charts: Cluster deploy and load; Query Comparison]
Redshift dc2.large vs. Kubernetes EC2 r5.xlarge with EBS (st1)
38. Con: error handling is complicated
[Diagram: the ClickHouse Operator works across the ClickHouse Resource Definition (complex specification), Kubernetes (asynchronous execution), and the storage provider (local semantics).]
40. Architectural challenge
Data warehouses are not cattle
Losing/compromising data can be really bad
Safety is paramount
Security, migration, and availability require logic above the level of the operator
41. Biggest opportunity
Kubernetes democratizes data warehouse access
Set up complex configurations in minutes
Run on any platform that Kubernetes runs on
Integrate easily with other services
42. Dashboards and predictive analytics
Most intriguing future benefit of Kubernetes
[Diagram: ClickHouse alongside Kafka, Apps, Content Delivery, Applications, and Grafana.]
Tailored analytic solution for every service that needs it