Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu

Storage Capacity Management
@Booking.com
Nurettin OMEROGLU
What happens when broker disk is FULL?
A) Only some producers fail
B) All producers fail
C) Kafka service fails
Streaming Infra Team
Nurettin OMEROGLU
Senior Software Engineer
I am a member of the Streaming Infra Team (10 people) and have more than
4 years of experience with Apache Kafka client- and server-side components.
We manage an on-prem Kafka solution serving clients that run on a variety
of platforms, such as bare metal, Kubernetes, and the cloud.
Agenda
1. Introduction
2. Before the project
3. Step by step capacity project
4. Future plans
Introduction
100M
monthly active
app users
155,000
destinations around the world
Car hire available in 140+ countries
and pre-booked taxis in
over 500 cities across 120+
countries
243M+
verified guest reviews
and 24/7
customer service
in 45
languages and dialects
Since 2010,
Booking.com has
welcomed
4.5B+
guest arrivals
28M
total reported
listings
worldwide
6.6M
options in homes,
apartments and
other unique
places to stay
30
different types of
places to stay,
including homes,
apartments, B&Bs,
hostels, farm stays,
bungalows, even
boats, igloos and
treehouses
140 offices in 70 countries; over
5,000 employees in Amsterdam
Payments
A/B Tests
MySQL
Cassandra
Hadoop
Cloud
...
Events
Logs
Online ML
Fraud detection
Personalization
Bookings FPA reporting
Data Streaming
Platform
MySQL
Cassandra
Hadoop
Cloud
...
● Transports and transposes data via pub/sub
● Connects applications through data pipelines
● Resilient, scalable, fault-tolerant, and secure, with SLO guarantees
Real-time
analytics
Scale of Streaming @Booking.com
How much data? ~2.2 PB produced and consumed per day
How many clusters? 62
How many topics? ~34K
How many partitions? ~138K
How many servers? 900 Kafka brokers + 75 ZooKeeper nodes
Before the project
Setup
● On-premises multi-tenant Kafka clusters running on bare metal
● Local SSD storage (~3.5 TB per broker)
● 32-thread CPU / 256 GB memory / 10 Gb network
Existing Components
● Custom Configuration validations
● Custom Quota validations
○ Topics per principal
○ Partitions per principal
…
● Topics
● Custom quotas
(booking-specific)
…
● Specific Configurations
● Custom Quotas
…
● Custom PrincipalBuilder
● Custom Policies
(AbstractPolicy)
○ AlterConfigPolicy
○ CreateTopicPolicy
…
MySQL (Metadata Store)
Bkstreaming CLI (self-service, home-built)
Kontrole (Control Center, home-built)
Kafka Cluster
Example Scenario for Custom Quota Validations
Components: Kontrole (Control Center), MySQL (Metadata Store), Kafka Cluster
(1) Add topic for a service
(2) Auth: OK
(3) Topics per principal quota: OK
(4) Partitions per principal quota: OK
(5) Create topic
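
The quota checks in this scenario can be sketched roughly as follows. This is a minimal Python illustration, not Kontrole's actual code: the PrincipalQuota and PrincipalUsage types, their field names, and validate_topic_request are hypothetical, and the authentication step (2) is assumed to have passed already.

```python
from dataclasses import dataclass


@dataclass
class PrincipalQuota:
    # Hypothetical per-principal limits, as they might be stored in the metadata store.
    max_topics: int
    max_partitions: int


@dataclass
class PrincipalUsage:
    # Hypothetical counters of what the principal already owns.
    topics: int
    partitions: int


def validate_topic_request(quota, usage, new_partitions):
    """Return the names of violated quota checks; an empty list means OK."""
    violations = []
    # Step (3): would one more topic exceed the topics-per-principal quota?
    if usage.topics + 1 > quota.max_topics:
        violations.append("topics per principal")
    # Step (4): would the new partitions exceed the partitions-per-principal quota?
    if usage.partitions + new_partitions > quota.max_partitions:
        violations.append("partitions per principal")
    return violations
```

Only when the returned list is empty would the request proceed to step (5) and create the topic.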
Reactive Approach
● Clients rely on the retention.ms configuration, which deletes messages
after a certain amount of time.
● Dangerous when traffic spikes, because time-based retention does not cap disk usage
● We were the middleman handling the toil /
issues between multiple tenants
○ Increase number of brokers, or
○ Determine noisy neighbors and
■ Throttle, or
■ Communicate with clients (night?)
● Lack of visibility and forecasting to plan ahead
[Figure: a shared broker disk divided among Topics 1 to 6, with reserved space for safety]
Step by Step
Capacity Project
IDEA?
retention.bytes: deletes the oldest messages when the total size of a
partition exceeds a threshold.
● Reserve storage per principal (quota)
● Let the clients manage their reserved storage
● Make retention.bytes mandatory on every topic
● Give clients feedback on their usage and growth
Discarded Options:
● Kubernetes elasticity
● Network attached or remote storage options
[Figure: broker disk divided into reserved quotas per principal, with reserved space for safety]
Determine cluster capacity
1) Periodically fetch information about the cluster from Cruise Control:
number of available brokers, disk information, …
2) Use the minimum disk capacity among brokers to calculate cluster capacity
3) Target 90% disk usage (headroom)
Total capacity = (min broker disk * number of brokers) * 0.9
Components: Kontrole, Cruise Control, Graphite
(1) Periodic cron job (Kontrole)
(2) Available brokers, disk information (Cruise Control)
(3) Calculate capacity, publish metrics (Graphite)
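
As a worked illustration of the capacity formula, here is a small Python sketch. The function name, the list-of-bytes input, and the integer percentage parameter are assumptions made for this example; fetching the broker list from Cruise Control is left out.

```python
def cluster_capacity_bytes(broker_disk_bytes, target_usage_pct=90):
    """Total capacity = (min broker disk * number of brokers) * 0.9.

    broker_disk_bytes: disk capacity in bytes of each available broker
    (hypothetical input shape). Integer math avoids float rounding.
    """
    if not broker_disk_bytes:
        raise ValueError("no available brokers")
    smallest = min(broker_disk_bytes)
    return smallest * len(broker_disk_bytes) * target_usage_pct // 100


TB = 10**12
# Three brokers, one of them with a smaller disk: the smallest disk drives the result.
print(cluster_capacity_bytes([3_500 * 10**9, 3_500 * 10**9, 3_200 * 10**9]) / TB)
```

Using the minimum disk size is conservative: brokers with larger disks are treated as if they were as small as the smallest one.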
New Quota + Topic level configuration
● Reserve storage per principal (quota) (default 500 MB)
● Add a `topic_capacity_bytes` property per Kafka topic (not visible to
Kafka brokers) to manage retention.bytes
● We base all calculations on this value (including retention.bytes)
topic_capacity_bytes = retention.bytes * partition_count * replica_count
● Whenever the partition count increases (i.e. done via Kontrole),
retention.bytes (per partition) is recalculated accordingly.
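
Rearranging the formula above gives the per-partition retention.bytes for a given topic capacity. The Python sketch below shows the arithmetic; the function name and the example sizes are assumptions, not values from the talk.

```python
def retention_bytes(topic_capacity_bytes, partition_count, replica_count):
    """Per-partition retention.bytes derived from
    topic_capacity_bytes = retention.bytes * partition_count * replica_count."""
    if partition_count <= 0 or replica_count <= 0:
        raise ValueError("partition_count and replica_count must be positive")
    # Integer division rounds down, so the topic never exceeds its capacity.
    return topic_capacity_bytes // (partition_count * replica_count)


GIB = 1024**3
# A hypothetical 60 GiB topic with 10 partitions and 3 replicas.
before = retention_bytes(60 * GIB, 10, 3)
# After increasing to 20 partitions, the per-partition limit is halved.
after = retention_bytes(60 * GIB, 20, 3)
print(before // GIB, after // GIB)
```

The topic's total footprint stays fixed while the per-partition limit shrinks as partitions are added.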
New Quota Creation
Components: Kontrole, Cruise Control, MySQL
(1) Create principal quota
(2) Get available brokers, disk information
(3) Get existing quotas
(4) Validate if new quota fits into cluster
(5) Save quota
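
Step (4) can be read as a simple sum check. The following Python sketch assumes that the quota "fits" when the existing quotas plus the new one do not exceed the calculated cluster capacity; the function and parameter names are hypothetical.

```python
def quota_fits(new_quota_bytes, existing_quota_bytes, cluster_capacity_bytes):
    """Return True if the new principal quota fits into the cluster.

    existing_quota_bytes: quotas already reserved by other principals
    (hypothetical input, e.g. loaded from the metadata store).
    """
    if new_quota_bytes <= 0:
        raise ValueError("quota must be positive")
    already_reserved = sum(existing_quota_bytes)
    return already_reserved + new_quota_bytes <= cluster_capacity_bytes
```

Only when the check passes would the quota be saved in step (5).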
New Topic Creation
Components: Kontrole, MySQL, Kafka Cluster
(1) Create topic with topic_capacity_bytes
(2) Get principal's quota
(3) Enough space for the new topic?
(4a) No: reject and ask for a quota increase
(4b) Yes: the topic fits, so create it on the Kafka cluster with the relevant retention.bytes
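
The topic creation decision can be sketched as below. This is an assumed Python shape for illustration only: QuotaExceeded, plan_topic, and the idea of tracking already allocated bytes per principal are hypothetical, and the returned dictionary simply holds the retention.bytes topic config as a string.

```python
class QuotaExceeded(Exception):
    """Raised when a topic does not fit into the principal's remaining quota."""


def plan_topic(topic_capacity_bytes, partition_count, replica_count,
               principal_quota_bytes, allocated_bytes):
    """Return the topic config to apply, or raise QuotaExceeded."""
    # Step (3): is there enough space left in the principal's quota?
    remaining = principal_quota_bytes - allocated_bytes
    if topic_capacity_bytes > remaining:
        # Step (4a): reject; the client has to ask for a quota increase.
        raise QuotaExceeded(
            f"requested {topic_capacity_bytes} bytes, only {remaining} left"
        )
    # Step (4b): derive the per-partition retention.bytes for the new topic.
    per_partition = topic_capacity_bytes // (partition_count * replica_count)
    return {"retention.bytes": str(per_partition)}
```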
Dashboards for Admins
Dashboards for Clients
Add Alerting
● Warn/notify clients before the retention.bytes limit derived from
topic_capacity_bytes kicks in and starts deleting data.
● Actions:
○ reduce the retention.ms configuration, or
○ increase the topic capacity.
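
A threshold check of this kind might look like the Python sketch below. The 80% warning level and the function name are assumptions for illustration; the talk does not state the actual threshold.

```python
def capacity_alert(used_bytes, topic_capacity_bytes, warn_pct=80):
    """Return "warn" when usage reaches warn_pct of the topic capacity, else "ok"."""
    if topic_capacity_bytes <= 0:
        raise ValueError("topic_capacity_bytes must be positive")
    # Integer comparison avoids float rounding around the threshold.
    if used_bytes * 100 >= topic_capacity_bytes * warn_pct:
        return "warn"
    return "ok"
```

A "warn" result gives the client time to pick one of the actions above before size-based deletion starts.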
Onboard Existing Clusters
● Simulating scenarios on test cluster
● Operational documentation
● Stakeholder management
● Documentation for clients
● Enable capacity project on a cluster
○ Calculate / Add topic_capacity_bytes to each topic (with extra headroom)
○ Calculate / Add quotas per principal
Migration Challenges
● Revert strategy
○ Dynamic flag to disable the project on cluster
● Sanity check if cluster is suitable
○ Brokers may have non-uniform storage capacity
○ With the extra headroom added, the quotas may not all fit into the available capacity
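
The two sanity checks can be combined into one pre-migration report, sketched here in Python. Everything in it is an assumption for illustration: the function name, the 20% extra, and the input shapes (current on-disk bytes per topic including replicas, and disk bytes per broker).

```python
def onboarding_check(topic_usage_bytes, broker_disk_bytes,
                     extra_pct=20, target_usage_pct=90):
    """Check whether a cluster looks suitable for enabling capacity management."""
    # Cluster capacity, using the smallest broker disk and 90% target usage.
    capacity = (min(broker_disk_bytes) * len(broker_disk_bytes)
                * target_usage_pct // 100)
    # Proposed topic_capacity_bytes: current usage plus some extra.
    proposed = {
        topic: used * (100 + extra_pct) // 100
        for topic, used in topic_usage_bytes.items()
    }
    return {
        "uniform_storage": len(set(broker_disk_bytes)) == 1,
        "fits": sum(proposed.values()) <= capacity,
        "capacity": capacity,
        "proposed": proposed,
    }
```

If "fits" is false, either the extra has to be reduced or brokers added before the project is enabled on that cluster.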
Future Work
What is next?
● Allow teams to extend their quota if there is enough capacity
(self service)
● Send usage report to the teams, with the capacity allocated to the
principal vs. their usage
(cost attribution)
Booking.com
Facebook: facebook.com/booking.com
Instagram: @bookingcom
Twitter: @booking.com; @bookingcomnews
Linkedin: nl.linkedin.com/company/booking.com
Youtube: youtube.com/booking
Join Booking.com as a partner
join.booking.com
Join the Booking.com team
careers.booking.com
Questions?
Thank you!
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu

  • 2. What happens when broker disk is FULL? A) Only some producers fail B) All producers fail C) Kafka service fails
  • 3. Streaming Infra Team Nurettin OMEROGLU Senior Software Engineer I am a member of the Streaming Infra Team (10 people) and have more than 4 years of experience with Apache Kafka client-side and server-side components. We manage an on-prem Kafka solution serving clients running on a variety of platforms such as bare metal, Kubernetes and Cloud
  • 4. Agenda 1. Introduction 2. Before the project 3. Step by step capacity project 4. Future plans
  • 6. 100M monthly active app users; 155,000 destinations around the world; Car hire available in 140+ countries and pre-booked taxis in over 500 cities across 120+ countries; 243M+ verified guest reviews and 24/7 customer service in 45 languages and dialects; Since 2010, Booking.com has welcomed 4.5B+ guest arrivals; 28M total reported listings worldwide; 6.6M options in homes, apartments and other unique places to stay; 30 different types of places to stay, including homes, apartments, B&Bs, hostels, farm stays, bungalows, even boats, igloos and treehouses; 140 offices in 70 countries; over 5,000 employees in Amsterdam
  • 7. Payments A/B Tests MySQL Cassandra Hadoop Cloud ... Events Logs Online ML Fraud detection Personalization Bookings FPA reporting Data Streaming Platform MySQL Cassandra Hadoop Cloud ... ● Transports and transposes data via pub/sub ● Connects applications through data pipelines ● Resilient, scalable, fault tolerant, secure, with SLO guarantees Real-time analytics
  • 8. Scale of Streaming @Booking.com How much data? ~2.2PB produced and consumed per day. How many clusters? 62. How many topics? ~34K. How many partitions? ~138K. How many servers? 900 Kafka brokers + 75 ZooKeeper nodes
  • 10. Setup ● On-premise multi-tenant Kafka clusters running on bare metal ● Local SSD storage (~3.5TB per broker) ● 32-thread CPU / 256MB memory / 10 Gb network
  • 11. Existing Components ● Custom Configuration validations ● Custom Quota validations ○ Topics per principal ○ Partitions per principal … ● Topics ● Custom quotas (booking-specific) … ● Specific Configurations ● Custom Quotas … ● Custom PrincipalBuilder ● Custom Policies (AbstractPolicy) ○ AlterConfigPolicy ○ CreateTopicPolicy … Components: MySQL (Metadata Store), Bkstreaming CLI (Self-service, home-built), Kontrole (Control Center, home-built), Kafka Cluster
  • 12. Example Scenario for Custom Quota Validations (1) Add topic for a service (2) Auth: OK (3) Topics per principal quota: OK (4) Partitions per principal quota: OK (5) Create topic. Components: MySQL (Metadata Store), Kontrole (Control Center), Kafka Cluster
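
The ordered checks in this scenario can be sketched roughly as follows. This is only an illustration, not Kontrole's actual code: `PrincipalLimits`, `validate_topic_request`, and the limit values are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class PrincipalLimits:
    """Hypothetical per-principal limits, e.g. as kept in the metadata store."""
    max_topics: int
    max_partitions: int


def validate_topic_request(authorized, current_topics, current_partitions,
                           new_partitions, limits):
    """Run checks (2)-(4) in order and return (ok, reason)."""
    if not authorized:
        return False, "auth failed"
    if current_topics + 1 > limits.max_topics:
        return False, "topics per principal quota exceeded"
    if current_partitions + new_partitions > limits.max_partitions:
        return False, "partitions per principal quota exceeded"
    # All checks passed; step (5) would create the topic on the cluster.
    return True, "ok"
```

Returning the first failing reason mirrors the slide's step order, so a client sees which quota to ask about.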
  • 13. Reactive Approach ● Clients use the retention.ms configuration, which deletes messages after a certain amount of time. ● Dangerous situations if traffic spikes ● We were the middleman handling the toil / issues between multiple tenants ○ Increase number of brokers, or ○ Determine noisy neighbors and ■ Throttle, or ■ Communicate with clients (night?) ● Lack of visibility and forecasting to plan ahead. Diagram: broker disk shared among topics (Topic 1 to Topic 6), with reserved space for safety
  • 15. IDEA? Use retention.bytes, which deletes the oldest messages when the total size of a partition exceeds a threshold. ● Reserve storage per principal (quota) ● Let the clients manage their reserved storage ● Make retention.bytes mandatory on topic ● Feedback to clients around their usage/growth Discarded Options: ● Kubernetes elasticity ● Network attached or remote storage options. Diagram: broker disk divided into reserved quotas per principal, with reserved space for safety
  • 16. Determine cluster capacity 1) Periodically fetch information about the cluster from Cruise Control (number of available brokers, disk information, …) 2) Use the minimum disk capacity among brokers to calculate cluster capacity 3) Target 90% disk usage (headroom): Total capacity = (min broker disk * number of brokers) * 0.9. Diagram (Kontrole, Cruise Control, Graphite): (1) Periodic cron job (2) Available brokers, disk information (3) Calculate capacity, publish metrics
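
As a rough sketch of the capacity formula on this slide: the `BrokerDisk` type and `cluster_capacity_bytes` function are hypothetical names, and the broker list stands in for whatever Cruise Control reports. Integer arithmetic is used here only to avoid floating-point rounding.

```python
from dataclasses import dataclass


@dataclass
class BrokerDisk:
    """Hypothetical view of one broker's disk, as reported by the cluster."""
    broker_id: int
    capacity_bytes: int


def cluster_capacity_bytes(brokers, headroom_percent=90):
    """Total capacity = (min broker disk * number of brokers) * 0.9."""
    if not brokers:
        raise ValueError("no available brokers reported")
    min_disk = min(b.capacity_bytes for b in brokers)
    return min_disk * len(brokers) * headroom_percent // 100
```

For example, three brokers with 3500, 3500 and 3000 units of disk give 3000 * 3 * 0.9 = 8100 units. Sizing by the smallest disk keeps the estimate conservative when brokers are not uniform.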
  • 17. New Quota + Topic level configuration ● Reserve storage per principal (quota) (default 500MB) ● Add property `topic_capacity_bytes` per Kafka topic (not visible to Kafka brokers) to manage retention.bytes ● All calculations are based on this value (including retention.bytes): topic_capacity_bytes = retention.bytes * partition_count * replica_count ● Whenever the partition count is increased (i.e. via Kontrole), retention.bytes (per partition) is re-calculated accordingly.
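
Rearranging the formula above gives the per-partition retention.bytes. The sketch below assumes the result is rounded down to a whole number of bytes, which the slides do not specify; `retention_bytes_per_partition` is a hypothetical helper name.

```python
def retention_bytes_per_partition(topic_capacity_bytes, partition_count, replica_count):
    """Derive retention.bytes from topic_capacity_bytes.

    topic_capacity_bytes = retention.bytes * partition_count * replica_count
    Flooring the division is an assumption made for this sketch.
    """
    if partition_count <= 0 or replica_count <= 0:
        raise ValueError("partition_count and replica_count must be positive")
    return topic_capacity_bytes // (partition_count * replica_count)


GiB = 1024 ** 3
before = retention_bytes_per_partition(9 * GiB, partition_count=3, replica_count=3)
after = retention_bytes_per_partition(9 * GiB, partition_count=6, replica_count=3)
```

Here `before` is 1 GiB per partition, and doubling the partition count halves it to 512 MiB, so the topic's total footprint stays within the same `topic_capacity_bytes`.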
  • 18. New Quota Creation (1) Create principal quota (2) Get available brokers, disk information (3) Get existing quotas (4) Validate if new quota fits into cluster (5) Save quota. Components: Kontrole, Cruise Control, MySQL
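
Step (4) could look roughly like the check below. The slides do not give the exact rule, so this sketch assumes a quota fits when the sum of all reserved quotas stays within the computed cluster capacity; `quota_fits` is a hypothetical name.

```python
def quota_fits(new_quota_bytes, existing_quota_bytes, cluster_capacity_bytes):
    """Assumed rule: existing quotas plus the new one must not exceed capacity."""
    if new_quota_bytes <= 0:
        raise ValueError("quota must be positive")
    return sum(existing_quota_bytes) + new_quota_bytes <= cluster_capacity_bytes
```

With existing quotas of 1000 and 2000 units and a capacity of 3500, a new quota of 500 fits exactly, while 501 would be rejected before anything is saved.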
  • 19. New Topic Creation (1) Create topic with topic_capacity_bytes (2) Get principal’s quota (3) Enough space for the new topic? (4) No: reject, ask for quota increase (4) Yes: topic fits, create topic with relevant retention.bytes. Components: Kontrole, MySQL, Kafka Cluster
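
The decision in steps (3) and (4) might be sketched as follows. The names `TopicDecision` and `decide_topic_creation`, and the idea of tracking the capacity already allocated to the principal's topics as a single number, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TopicDecision:
    accepted: bool
    retention_bytes: Optional[int]  # per-partition value, set only when accepted
    reason: str


def decide_topic_creation(requested_capacity_bytes, partition_count, replica_count,
                          principal_quota_bytes, allocated_bytes):
    """Accept the topic only if it fits into the principal's remaining quota."""
    remaining = principal_quota_bytes - allocated_bytes
    if requested_capacity_bytes > remaining:
        return TopicDecision(False, None, "quota exceeded; ask for a quota increase")
    # topic_capacity_bytes = retention.bytes * partition_count * replica_count
    retention = requested_capacity_bytes // (partition_count * replica_count)
    return TopicDecision(True, retention, "topic fits")
```

A principal with a quota of 1000 units and 300 already allocated can create a 600-unit topic with 3 partitions and 2 replicas (100 units of retention.bytes per partition), but not an 800-unit one.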
  • 22. Add Alerting ● Warn/notify before the topic_capacity_bytes configuration kicks in and starts deleting data. ● Actions: ○ reduce the retention.ms configuration, or ○ increase the topic capacity.
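
One simple way to express such a warning is a usage threshold, sketched below. The 80% threshold and the `should_warn` name are assumptions; the slides do not say how the alert is triggered.

```python
def should_warn(used_bytes, topic_capacity_bytes, warn_percent=80):
    """Return True once usage reaches warn_percent of the topic's capacity.

    warn_percent=80 is an illustrative default, not a value from the talk.
    """
    if topic_capacity_bytes <= 0:
        raise ValueError("topic_capacity_bytes must be positive")
    return used_bytes * 100 >= topic_capacity_bytes * warn_percent
```

Warning before the limit is reached gives the owning team time to shorten retention.ms or request more capacity before size-based retention starts deleting data.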
  • 23. Onboard Existing Clusters ● Simulating scenarios on test cluster ● Operational documentation ● Stakeholder management ● Documentation for clients ● Enable capacity project on a cluster ○ Calculate / Add topic_capacity_bytes to each topic (with extra) ○ Calculate / Add quotas per principal
  • 24. Migration Challenges ● Revert strategy ○ Dynamic flag to disable the project on cluster ● Sanity check if cluster is suitable ○ Brokers may have non-uniform storage capacity ○ With extras, the quotas may not all fit into the available capacity
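
The two sanity checks named on this slide could be combined in a small pre-migration check like the one below. It is a sketch under assumptions: `sanity_check` is a hypothetical name, capacity is computed as min broker disk times broker count with 90% headroom, and quotas are passed in with their extras already added.

```python
def sanity_check(broker_disk_bytes, quota_bytes_with_extras, headroom_percent=90):
    """Return a list of problems; an empty list means the cluster looks suitable."""
    if not broker_disk_bytes:
        return ["no broker disk information"]
    problems = []
    if len(set(broker_disk_bytes)) > 1:
        problems.append("brokers have non-uniform storage capacity")
    capacity = min(broker_disk_bytes) * len(broker_disk_bytes) * headroom_percent // 100
    if sum(quota_bytes_with_extras) > capacity:
        problems.append("quotas with extras do not fit into the available capacity")
    return problems
```

Collecting every problem, instead of stopping at the first, lets an operator see the whole picture before deciding whether to enable the project on a cluster.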
  • 26. What is next? ● Allow teams to extend their quota if there is enough capacity (self service) ● Send usage report to the teams, with the capacity allocated to the principal vs. their usage (cost attribution)
  • 27. Booking.com Facebook: facebook.com/booking.com Instagram: @bookingcom Twitter: @booking.com; @bookingcomnews Linkedin: nl.linkedin.com/company/booking.com Youtube: youtube.com/booking Join Booking.com as a partner join.booking.com Join the Booking.com team careers.booking.com Questions?