Apache Kafka is now nearly ubiquitous in modern data pipelines. While the Kafka development model is elegantly simple, operating Kafka clusters in production environments is a challenge. It is hard to troubleshoot misbehaving Kafka clusters, especially when there are potentially hundreds or thousands of topics, producers, and consumers, and billions of messages.
When a real-time application lags, the root cause may be an application problem (such as poor data partitioning or load imbalance) or a Kafka problem (such as resource exhaustion or suboptimal configuration). Getting the best performance, predictability, and reliability for Kafka-based applications can therefore be difficult. In the end, the operation of your Kafka-powered analytics pipelines could itself benefit from machine learning (ML).
Using Machine Learning to Understand Kafka Runtime Behavior (Shivnath Babu, Unravel Data and Nate Snapp, Adobe) Kafka Summit London 2019
1. 1
Using Machine
Learning to
Understand Kafka
Runtime Behavior
Shivnath Babu
Cofounder/CTO, Unravel Data
Adjunct Professor, Duke University
shivnath@unraveldata.com
Nate Snapp
Big Data Engineering
Adobe, Palo Alto Networks, Omniture
LinkedIn or nate.snapp@gmail.com
2. 2
Meet the speakers
• Cofounder/CTO at Unravel
• Adjunct Professor of Computer
Science at Duke University
• Focusing on ease-of-use and
manageability of data apps & platforms
• Recipient of US National Science
Foundation CAREER Award, three
IBM Faculty Awards, HP Labs
Innovation Research Award
Shivnath Babu
Nate Snapp
• Senior SRE from Adobe, Palo Alto
Networks, and Omniture
• 12 years experience in streaming
• First 6 years on proprietary
streaming analytics for 9/10 Fortune
500, 20B events daily, 10K+ servers
• Last 2 years have moved to Kafka
• Blogging on SRE, Hadoop, and data
streaming space at natesnapp.com
3. 3
MODERN DATA APPLICATIONS
Machine Learning, Predictive Analytics, AI, IoT
ENVIRONMENTS
On-Premises, Hybrid, Cloud
PLATFORMS & TECHNOLOGIES
NoSQL, SQL, MPP, API
01 uncover: ADAPTIVE DATA COLLECTION
02 understand: DATA MODEL & CORRELATION (ANALYTICS ENGINE, AUTOMATION ENGINE, TUNING ENGINE, INFERENCE ENGINE)
03 unravel: DASHBOARDS, AUTO-ACTIONS, SMART ALERTS, REPORTING, RECOMMENDATIONS
4. 4
• Clusters with 6-29 brokers
• Confluent Kafka 5.2.1, Apache Kafka 2.2.0-cp2
• 1700 topics across all clusters
• Largest topics top out at over 20K messages/sec
• Smaller topics are 300-500 messages/sec
• Large self-service components
• Ingress is a mix of separate Kafka Java client APIs and load-balanced
REST API frontends; some clusters use the Schema Registry
• Egress is a mix of custom endpoints, and in many cases, HDFS sink
What Kafka setups?
10. 10
• Runtime schema changes
• “Flexible-Rigid Schema”
• Timeouts causing rebalance storms
• Leader affinity and poor assignment
• Poor partition assignment
Elements of surprise!
11. 11
Anomaly
Detection
Most enterprises now have mission-critical
streaming apps
Predictive
Maintenance
Threat
Monitoring
Recommendation
Engines
Real-time customer
sentiment analysis
12. 12
Streaming data architecture must be reliable
[Diagram: IoT Sensors, Database, and Other Data flow into a STREAM STORE and REAL-TIME PROCESSOR layer (Kafka, HBase, Spark, Flink), which feeds a Dashboard, a Result Store, and Other Output]
13. 13
Many problems can cause unreliable performance
[Same streaming architecture diagram, annotated: Untimely results]
14. 14
Many problems can cause unreliable performance
[Same streaming architecture diagram, annotated: Poor partitioning + Inefficient configuration + Resource contention = Untimely results]
15. 15
DevOps teams face many challenges today
[Same streaming architecture diagram, annotated: Poor partitioning + Inefficient configuration + Resource contention = Untimely results]
• No single tool
• No correlation across the stack
• No application view
• No insights
• No recommendations
• No automated actions
16. 16
How we can empower DevOps teams
[Same streaming architecture diagram, now instrumented with Platform Metrics, App Metrics, and App-Platform Interaction Metrics]
Bring all performance data into
one complete & correlated view
17. 17
How we can empower DevOps teams
[Same instrumented architecture diagram]
Provide out-of-the-box
intelligence with
Machine Learning (ML)
18. 18
How we can empower DevOps teams
[Same instrumented architecture diagram]
Automate actions
smartly with
Artificial Intelligence (AI)
23. 23
1. Detecting load imbalance among Kafka brokers
2. Detecting load imbalance among Kafka partitions
Use Cases
24. 24
Detecting load imbalance among Kafka brokers
Brokers kabo2 and kabo3 have a much higher number of incoming messages than broker kabo1
25. 25
Algorithms for Outlier Detection
Picture credit: http://historum.com/asian-history/128081-aryan-migration-theory-update-128.html
• Based on one feature vs. multiple features
• Is a distribution of the data assumed?
1. Z-score: how many standard deviations a data point is from the mean
26. 26
• Based on one feature vs. multiple features
• Is a distribution of the data assumed?
1. Z-score: how many standard deviations a data point is from the mean
2. DBSCAN: density-based clustering
3. Isolation forests
4. Deep learning (e.g., autoencoders)
Algorithms for Outlier Detection
Picture credit: http://en.proft.me/2017/02/3/density-based-clustering-r/
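The z-score approach above can be sketched in a few lines. The broker names and per-broker rates below are hypothetical (echoing the kabo* brokers from the earlier slide), not measurements from the clusters described in this deck:

```python
import statistics

def zscore_outliers(rates, threshold=1.5):
    """Flag brokers whose incoming-message rate deviates from the
    cross-broker mean by more than `threshold` standard deviations."""
    mean = statistics.mean(rates.values())
    stdev = statistics.stdev(rates.values())
    return [
        broker for broker, rate in rates.items()
        if stdev > 0 and abs(rate - mean) / stdev > threshold
    ]

# Hypothetical per-broker incoming message rates (msgs/sec):
# kabo1 is handling far less traffic than its peers
rates = {"kabo1": 4_000, "kabo2": 21_000, "kabo3": 20_500,
         "kabo4": 19_800, "kabo5": 20_200}
print(zscore_outliers(rates))  # → ['kabo1']
```

A z-score check assumes roughly normally distributed rates; the density-based and tree-based methods on this slide drop that assumption.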
30. 30
Predicting when SLAs are in danger of being missed
[Chart: forecast of end-to-end latency; the latency SLA is 3 minutes, the current time is marked, and the forecast indicates when the SLA may be missed]
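One minimal way to make the prediction sketched above is a straight-line extrapolation of recent latency samples; the timestamps, latencies, and 3-minute SLA below are all illustrative, not the speakers' actual method:

```python
def predict_sla_miss(timestamps, latencies, sla_seconds):
    """Fit a least-squares line to recent latency samples and predict
    when latency will cross the SLA threshold (None if not rising)."""
    n = len(timestamps)
    mean_t = sum(timestamps) / n
    mean_y = sum(latencies) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in zip(timestamps, latencies))
    den = sum((t - mean_t) ** 2 for t in timestamps)
    slope = num / den
    intercept = mean_y - slope * mean_t
    if slope <= 0 or latencies[-1] >= sla_seconds:
        return None  # latency flat/falling, or SLA already missed
    return (sla_seconds - intercept) / slope  # predicted crossing time

# Latency climbing ~10 s per minute; SLA is 180 s (3 minutes)
ts = [0, 60, 120, 180]      # sample times (s)
lat = [100, 110, 120, 130]  # observed end-to-end latency (s)
print(predict_sla_miss(ts, lat, 180))  # → 480.0 (crossing at t = 480 s)
```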
31. 31
• Many standard time-series forecasting
techniques: ARIMA, Holt-Winters
• Deep-learning techniques (e.g., LSTM)
Algorithms for Forecasting
32. 32
• Many standard time-series forecasting
techniques: ARIMA, Holt-Winters
• Deep-learning techniques (e.g., LSTM)
• Facebook’s Prophet Algorithm: Mixes stats
methods & judgment from domain experts
• Uses a Generalized Additive Model (GAM)
• Decomposed time-series model: trend,
seasonality, holidays, and error term
Algorithms for Forecasting
y(t) = trend(t) + seasonality(t) + holidays(t) + error(t)
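This is not Prophet itself, but a toy additive decomposition in the same spirit: fit a linear trend, average the detrended values per seasonal position, and extrapolate both. The series and its period are made up:

```python
def additive_forecast(series, period, horizon):
    """Toy decomposed forecast, y(t) ~ trend(t) + seasonal(t):
    least-squares trend plus per-position seasonal means."""
    n = len(series)
    # Linear trend via least squares over t = 0..n-1
    mean_t = (n - 1) / 2
    mean_y = sum(series) / n
    slope = (sum((t - mean_t) * (y - mean_y) for t, y in enumerate(series))
             / sum((t - mean_t) ** 2 for t in range(n)))
    intercept = mean_y - slope * mean_t
    # Seasonal component: mean detrended value at each cycle position
    seasonal, counts = [0.0] * period, [0] * period
    for t, y in enumerate(series):
        seasonal[t % period] += y - (intercept + slope * t)
        counts[t % period] += 1
    seasonal = [s / c for s, c in zip(seasonal, counts)]
    return [intercept + slope * t + seasonal[t % period]
            for t in range(n, n + horizon)]

# Hypothetical message rates with an upward trend and a cycle of length 4
history = [10, 20, 30, 20, 12, 22, 32, 22, 14, 24, 34, 24]
print(additive_forecast(history, period=4, horizon=4))
```

The forecast continues both the upward trend and the seasonal peak at the third position of each cycle.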
33. 33
• Many standard time-series forecasting
techniques: ARIMA, Holt-Winters
• Deep-learning techniques (e.g., LSTM)
• Facebook’s Prophet Algorithm: Mixes stats
methods & judgment from domain experts
• Uses a Generalized Additive Model (GAM)
• Decomposed time-series model: trend,
seasonality, holidays, and error term
• Advantages:
• Fits faster than ARIMA
• Models various growth trends
• Can handle unevenly spaced data
• Defaults often produce accurate forecasts
Algorithms for Forecasting
35. 35
1. An unexpected change that needs your attention
2. Smart alerts:
• False negatives should be minimal
• False positives should be minimal
Use Cases
37. 37
Algorithms for Anomaly Detection
Picture credit: https://blog.statsbot.co/time-series-anomaly-detection-algorithms-1cef5519aef2
• Deviation from forecasts
38. 38
Algorithms for Anomaly Detection
Picture credit: https://blog.statsbot.co/time-series-anomaly-detection-algorithms-1cef5519aef2
• Deviation from forecasts
• ARIMA
• Regression trees
• Prophet
• STL: Seasonal and Trend
Decomposition using Loess
• Topic of intensive
research
• Deep learning (e.g., LSTM)
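The "deviation from forecasts" idea can be sketched as flagging points whose residual against the forecast exceeds k standard deviations. The forecast and actual series below are invented; a more robust variant would use the median absolute deviation, since a large spike inflates the standard deviation itself:

```python
import statistics

def anomalies(actual, forecast, k=2.0):
    """Return indices where the actual value deviates from the forecast
    by more than k standard deviations of the residuals."""
    residuals = [a - f for a, f in zip(actual, forecast)]
    sigma = statistics.stdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > k * sigma]

forecast = [100, 105, 110, 115, 120, 125]   # model's predicted values
actual   = [ 98, 107, 109, 400, 121, 126]   # observed values, spike at index 3
print(anomalies(actual, forecast))  # → [3]
```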
40. 40
• Fast root-causing of problems
• What lower-level cause led to the change in the
streaming application’s performance?
Use Cases
41. 41
What caused the unexpected change in performance?
Anomaly
What caused it?
100s of time series from every level of the stack!
LATENCY is 421.07% WORSE THAN THE BASELINE
42. 42
• Be aware of the many pitfalls
• E.g., trends can make arbitrary time
series look correlated!
• Pick robust time-series
similarity metrics
• E.g., Euclidean distance vs. Dynamic Time Warping
Algorithms for Correlation Analysis
Picture credit: https://izbicki.me/blog/converting-images-into-time-series-for-data-mining.html
[Figure: aligning two time series with Euclidean distance vs. Dynamic Time Warping]
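A minimal sketch of the classic dynamic-programming DTW distance. The two series are invented: the same spike shifted by two steps, where point-by-point Euclidean-style comparison reports a large distance but DTW does not:

```python
def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance,
    which tolerates the time shifts that break Euclidean distance."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

# Same spike shape, shifted by two steps
x = [0, 0, 1, 5, 1, 0, 0]
y = [0, 0, 0, 0, 1, 5, 1]
print(dtw_distance(x, y))  # → 2.0 (pointwise |diff| sums to 12)
```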
43. 43
• Be aware of the many pitfalls
• E.g., trends can make arbitrary time
series look correlated!
• Pick robust time-series
similarity metrics
• E.g., Euclidean distance vs. Dynamic Time Warping
• Carefully incorporate domain
knowledge
• E.g., what caused latency SLA miss?
• Application-level problem?
• Resource allocation problem?
• Platform-level problem?
• Data-level problem?
Algorithms for Correlation Analysis
Picture credit: https://izbicki.me/blog/converting-images-into-time-series-for-data-mining.html
[Figure: aligning two time series with Euclidean distance vs. Dynamic Time Warping]
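The trend pitfall above can be demonstrated directly: two independently generated series that both trend upward show a high raw Pearson correlation, which largely disappears after a simple first-difference detrend. All numbers are invented:

```python
import statistics

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length series."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den

def detrend(series):
    """First differences: replace each point with its step-to-step change."""
    return [b - a for a, b in zip(series, series[1:])]

# Two series that both trend upward but move independently otherwise
latency   = [10, 21, 29, 41, 52, 58, 71, 79]
unrelated = [ 5, 14, 27, 33, 46, 55, 64, 76]
print(round(pearson(latency, unrelated), 2))                    # looks highly correlated
print(round(pearson(detrend(latency), detrend(unrelated)), 2))  # much weaker
```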
45. 45
1. Helps answer what-if and optimization questions
• What is the best number of partitions?
• What is the best setting of timeouts to avoid rebalance storms?
• What is the best partition rebalancing action to take?
• What will the impact of adding a new broker be?
2. Enables Auto Actions for resource/cost efficiency & SLA management
Use Cases
47. 47
• Performance = Func(Input Features)
• Have to find the best set of input features
• Supervised learning is often possible: Training data is available
or easy to generate
Algorithms for Learning Models
Picture credit: https://myslide.cn/slides/8328#
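A minimal sketch of learning Performance = Func(Input Features): fit a linear model with least squares on training data and use it for a what-if question. The features, latencies, and resulting coefficients are all synthetic, and a real model would likely be nonlinear (e.g., a regression tree):

```python
import numpy as np

# Hypothetical training data: rows are (num_partitions, msgs_per_sec),
# target is observed p99 latency in ms. All numbers synthetic.
X = np.array([[8, 5_000], [16, 5_000], [8, 15_000],
              [16, 15_000], [32, 15_000]], dtype=float)
y = np.array([120.0, 80.0, 310.0, 190.0, 140.0])

# Fit Performance = Func(features) as a linear model via least squares
A = np.column_stack([X, np.ones(len(X))])  # append a bias column
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_latency(partitions, rate):
    """Evaluate the learned linear model at a new feature point."""
    return float(coeffs @ np.array([partitions, rate, 1.0]))

# What-if: predicted latency for a 15K msgs/sec topic at 24 partitions
print(round(predict_latency(24, 15_000), 1))
```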
48. 48
Summary: Meeting Kafka DevOps Goals with AI/ML
App-level Goals and Platform-level Goals: throughput, stability, latency, resource usage/cost, data loss tolerance, planning/growth
AI/ML Algorithms: Outlier Detection, Forecasting, Anomaly Detection, Correlation Analysis, Model Learning
49. 49
AIOps: Rich opportunities to address
distributed application performance
management as AI/ML problems
Start your free trial: unraveldata.com/free-trial
Visit us at the Unravel booth
And yes, we are hiring!
shivnath@unraveldata.com