
Flink Forward San Francisco 2019: Real-time Processing with Flink for Machine Learning at Netflix - Elliot Chow

Real-time Processing with Flink for Machine Learning at Netflix
Machine learning plays a critical role in providing a great Netflix member experience. It is used to drive many parts of the site including video recommendations, search results ranking, and selection of artwork images. Providing high-fidelity, near real-time data is increasingly important for these machine learning pipelines, especially as multi-armed bandit and reinforcement learning techniques, in addition to more "traditional" supervised learning, become more prevalent. With access to this data, models are able to converge more quickly, features can be updated more frequently, and analysis can be done in a more timely manner.

In this talk, we will focus on the practical details of leveraging Flink to process trillions of events per day, work with the time dimension, and manage large and frequently-changing state. We will discuss different processing schemes and dataflows, scalability and resiliency challenges we tackled, operational considerations, and instrumentation we added for monitoring job health in production.

1. Real-time Processing with Flink for Machine Learning at Netflix - Elliot Chow

2. Agenda
   - Recommendations @ Netflix
   - Data For Machine Learning
   - Processing with Flink
   - State/Join
   - Event-Time & Watermarks
   - Checkpointing
   - Monitoring and Understanding The Job

3. Recommendations

4. Recommendations

5. Scale
   - 139 million+ members
   - 190+ countries
   - 450 billion+ unique events/day
   - 700+ Kafka topics

6. Impressions

7. Member Activity
   - Log-in
   - Click
   - Play
   - Search
   - ...

8. Recommendations Data
   - Context
   - Features - inputs to recommendation algorithms
   - ...

9. Sessionization

10. Join with Recommendations Data

11. Output Data Format

12. Processing with Flink

13. Historically...
   - Spark + Spark Streaming
   - Some Challenges
     - Processing-time
     - Checkpointing performance and compatibility

14. Switching to Flink
   - Event-time Processing
   - Incremental Checkpointing
   - Custom Serializers
   - Internal Netflix Support
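Slide 14 lists the Flink features that motivated the switch. As a rough illustration, here is a minimal sketch of how an event-time job with incremental RocksDB checkpointing is typically enabled; the checkpoint path, interval, and object name are placeholders rather than Netflix's actual configuration.

```scala
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object JobSetup {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Event-time processing instead of processing-time
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    // RocksDB state backend with incremental checkpointing enabled (path is a placeholder)
    env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/flink-checkpoints", true))

    // Checkpoint interval is illustrative, not a recommendation
    env.enableCheckpointing(15 * 60 * 1000L)
  }
}
```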
15. High-level Data Flow

16. Challenges and Considerations

17. Challenges and Considerations
   - Many microservices involved
   - Different join keys
   - Different expiration policies
   - Scale

18. Join / Window Implementation

19. Attempt I

20. Attempt I

     class Event // ...
     class State // ...
     class Output // ...

     def insert(input: Event, state: State): State = // ...
     def emit(time: Timestamp): (State, List[Output]) = // ...

21. Attempt I

     class Event // ...
     class State // ...
     class Output // ...

     def insert(input: Event, state: State): State = // ...
     def emit(time: Timestamp): (State, List[Output]) = // ...

   - Store State in ValueState for each member
   - Call insert in processElement
   - Call emit in onTimer
   - Use custom Protobuf TypeSerializer
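Slide 21 describes wiring insert/emit into a keyed function with a single ValueState per member. Below is a minimal sketch of that wiring, assuming hypothetical Event/State/Output stand-ins and trivial insert/emit stubs (emit is given the current state explicitly here); it is not the actual Netflix implementation.

```scala
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

// Hypothetical stand-ins for the slide's Event / State / Output types.
case class Event(memberId: String, timestamp: Long)
case class Output(memberId: String, payload: String)
case class State(events: List[Event])
object State { val empty: State = State(Nil) }

// Attempt I sketch: a single (potentially large) State object per member.
class AttemptOneJoin extends KeyedProcessFunction[String, Event, Output] {

  @transient private var joinState: ValueState[State] = _

  override def open(parameters: Configuration): Unit = {
    // A custom Protobuf TypeSerializer could be supplied via the descriptor instead of classOf[State].
    joinState = getRuntimeContext.getState(new ValueStateDescriptor[State]("join-state", classOf[State]))
  }

  override def processElement(event: Event,
                              ctx: KeyedProcessFunction[String, Event, Output]#Context,
                              out: Collector[Output]): Unit = {
    val current = Option(joinState.value()).getOrElse(State.empty)
    // The whole State object is deserialized and reserialized on every insert (the cost called out on slide 22).
    joinState.update(insert(event, current))
    ctx.timerService().registerEventTimeTimer(event.timestamp + 10 * 60 * 1000L) // illustrative window end
  }

  override def onTimer(timestamp: Long,
                       ctx: KeyedProcessFunction[String, Event, Output]#OnTimerContext,
                       out: Collector[Output]): Unit = {
    val (remaining, outputs) = emit(Option(joinState.value()).getOrElse(State.empty), timestamp)
    joinState.update(remaining)
    outputs.foreach(out.collect)
  }

  // Slide 20's insert/emit reduced to trivial stubs (emit takes the state explicitly here).
  private def insert(input: Event, state: State): State = State(input :: state.events)
  private def emit(state: State, time: Long): (State, List[Output]) =
    (State.empty, state.events.map(e => Output(e.memberId, s"emitted at $time")))
}
```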
22. Attempt I - Issues
   - State object is too large
     - Out-of-memory, even with rate-limiting outliers
     - Serialization/deserialization of entire state for inserting events is too costly

23. Attempt I - Issues
   - State object is too large
     - Out-of-memory, even with rate-limiting outliers
     - Serialization/deserialization of entire state for inserting events is too costly
   - All windows get triggered simultaneously
     - Bursty resource usage

24. Attempt II
   - Use Flink's windowing API
   - Sliding Windows
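Slide 24's approach maps directly onto Flink's sliding event-time windows. The sketch below reuses the hypothetical Event/Output types from the previous sketch; the 24-hour size and 10-minute slide are made-up values.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Minimal stand-in for the per-window join logic.
class WindowJoin extends ProcessWindowFunction[Event, Output, String, TimeWindow] {
  override def process(key: String, context: Context,
                       elements: Iterable[Event], out: Collector[Output]): Unit =
    out.collect(Output(key, s"${elements.size} events in window ending ${context.window.getEnd}"))
}

object AttemptTwo {
  // events: a DataStream[Event] with timestamps and watermarks already assigned
  def join(events: DataStream[Event]): DataStream[Output] =
    events
      .keyBy(_.memberId)
      .window(SlidingEventTimeWindows.of(Time.hours(24), Time.minutes(10))) // size and slide are placeholders
      .process(new WindowJoin)
}
```

With a size-to-slide ratio like this, every event is logically assigned to size / slide windows, which is where the "many copies of each event" issue on the next slide comes from.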
25. Attempt II - Issues
   - Many copies of each event

26. Attempt II - Issues
   - Difficult to manage expiration for different events

27. Attempt III
   - Custom ProcessFunction
   - Manual window management
   - Break down state into many state objects
     - Use MapState, ListState, and ValueState where appropriate
   - Use a combination of event-time and processing-time timers

28. Attempt III
   - Maintain frequently-accessed metadata in ValueState
     - Minimum/maximum timestamps
     - Existing timers
     - Number of events and bytes (rate-limiting)

29. Attempt III
   - Optimize for writes (RocksDB backend)
     - Only read metadata during inserts
     - Insert (append) events to ListState
     - Deduplicate events at read time; write back deduplicated events
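A hedged sketch of the state layout slides 27-29 describe: small metadata in a ValueState that is read on every insert, events appended to a ListState so the write path never deserializes the full event list (with the RocksDB backend, ListState.add is an append/merge rather than a read-modify-write), and deduplication deferred to read time. The Metadata shape, names, and window length are assumptions, and Event/Output are the stand-ins from the earlier sketches.

```scala
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor, ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector
import scala.collection.JavaConverters._

// Hypothetical metadata kept small enough to read on every insert
// (byte counts for rate-limiting could be tracked the same way).
case class Metadata(minTs: Long, maxTs: Long, nextTimer: Long, eventCount: Long)

class AttemptThreeJoin extends KeyedProcessFunction[String, Event, Output] {

  @transient private var metadata: ValueState[Metadata] = _
  @transient private var events: ListState[Event] = _

  override def open(parameters: Configuration): Unit = {
    metadata = getRuntimeContext.getState(new ValueStateDescriptor[Metadata]("metadata", classOf[Metadata]))
    events = getRuntimeContext.getListState(new ListStateDescriptor[Event]("events", classOf[Event]))
  }

  override def processElement(event: Event,
                              ctx: KeyedProcessFunction[String, Event, Output]#Context,
                              out: Collector[Output]): Unit = {
    // Write path: append only; never read the (large) event list here.
    events.add(event)
    val meta = Option(metadata.value()).getOrElse(Metadata(Long.MaxValue, Long.MinValue, Long.MaxValue, 0L))
    var updated = meta.copy(
      minTs = math.min(meta.minTs, event.timestamp),
      maxTs = math.max(meta.maxTs, event.timestamp),
      eventCount = meta.eventCount + 1)
    if (updated.nextTimer == Long.MaxValue) {
      val timer = event.timestamp + 10 * 60 * 1000L // illustrative window end
      ctx.timerService().registerEventTimeTimer(timer)
      updated = updated.copy(nextTimer = timer)
    }
    metadata.update(updated)
  }

  override def onTimer(timestamp: Long,
                       ctx: KeyedProcessFunction[String, Event, Output]#OnTimerContext,
                       out: Collector[Output]): Unit = {
    // Read path: deduplicate lazily and write the deduplicated events back.
    // (Expiration/clean-up of old events and metadata is omitted in this sketch.)
    val deduped = Option(events.get()).map(_.asScala.toList).getOrElse(Nil).distinct
    events.update(deduped.asJava)
    deduped.foreach(e => out.collect(Output(e.memberId, s"joined at $timestamp")))
  }
}
```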
30. Attempt III
   - Randomly offset the windows

     Member | Window Start | Window End
     1      | __:00        | __:09
     1      | __:10        | __:19
     2      | __:01        | __:10
     2      | __:11        | __:20
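One way to get per-member boundaries like those in the table is to derive a stable offset from the key (a hash rather than a truly random value, so the offset is consistent across restarts). This is only an illustration of the idea, not the deck's implementation.

```scala
object WindowOffsets {
  // Stable per-member offset in [0, windowSizeMs), so that windows for different
  // members end, and therefore fire, at different times.
  def offsetFor(memberId: String, windowSizeMs: Long): Long =
    math.abs(memberId.hashCode.toLong) % windowSizeMs

  // Start of the member's offset window that contains eventTs.
  def windowStart(memberId: String, eventTs: Long, windowSizeMs: Long): Long = {
    val offset = offsetFor(memberId, windowSizeMs)
    eventTs - (((eventTs - offset) % windowSizeMs) + windowSizeMs) % windowSizeMs
  }
}
```

Flink's built-in window assigners (e.g. TumblingEventTimeWindows.of(size, offset)) only accept a single global offset, which is presumably why the offset is applied inside the custom ProcessFunction instead.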
31. Event-Time & Watermarks

32. Event-Time & Watermarks
   Watermarking Crash Course
   - Event-time: time associated with the actual event
   - Watermark: a time marker stating that all data prior to this time has been seen
   - Event-time triggers fire based on the watermark

33. Event-Time & Watermarks
   Watermarking Crash Course
   - Example: BoundedOutOfOrdernessTimestampExtractor, where outOfOrderness = 10 minutes

     Event-Time     | 10:00 | 10:08 | 10:05 | 10:06 | 10:15
     Max Event-Time | 10:00 | 10:08 | 10:08 | 10:08 | 10:15
     Watermark      | 09:50 | 09:58 | 09:58 | 09:58 | 10:05
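BoundedOutOfOrdernessTimestampExtractor is part of Flink's public API; a minimal sketch of attaching it to the stream, using the 10-minute bound from the table and the hypothetical Event type from the earlier sketches:

```scala
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

// Emits watermark = (max event-time seen so far) - 10 minutes, as in the table above.
class EventTimestampExtractor extends BoundedOutOfOrdernessTimestampExtractor[Event](Time.minutes(10)) {
  override def extractTimestamp(event: Event): Long = event.timestamp
}

object AssignWatermarks {
  def apply(events: DataStream[Event]): DataStream[Event] =
    events.assignTimestampsAndWatermarks(new EventTimestampExtractor)
}
```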
34. Event-Time & Watermarks
   Watermarking Crash Course
   - Watermark is maintained per partition
   - The watermark of an operator is computed as the minimum watermark of its inputs

     Partition 1 | 09:50 | 09:58 | 09:58 | 09:58 | 10:05
     Partition 2 | 09:53 | 09:57 | 09:58 | 10:03 | 10:08
     Operator    | 09:50 | 09:57 | 09:58 | 09:58 | 10:05

35. A Couple Quick Observations
   1. Event-time timestamps must be correct
   2. If the watermark of any partition stops progressing, time will stop

36. Why Has Time Stopped?

37. Why Has Time Stopped?
   - System is unhealthy
     - Delays in input data sources
     - Backpressure
     - Underprovisioned cluster
     - Even a single bad TM can drag down the entire job

38. Why Has Time Stopped?

39. Why Has Time Stopped?
   - System appears healthy - somewhere, there is not enough data

40. Why Has Time Stopped?
   - System appears healthy - somewhere, there is not enough data
     - Scheduled jobs

41. Why Has Time Stopped?
   - System appears healthy - somewhere, there is not enough data
     - Scheduled jobs
     - Region Failover

42. Why Has Time Stopped?
   - System appears healthy - somewhere, there is not enough data
     - Scheduled jobs
     - Region Failover
     - Kafka Skip-Partitions Feature

43. Why Has Time Stopped?
   - System appears healthy - somewhere, there is not enough data
     - Scheduled jobs
     - Region Failover
     - Kafka Skip-Partitions Feature
     - Topic is overprovisioned (# partitions : events/second > 1)

44. Why Has Time Stopped?
45. (Slightly) Custom Watermark Assigner
   - Based on BoundedOutOfOrdernessTimestampExtractor
   1. Detect inactivity
   2. Force time forward when inactive
   3. Record metrics per partition per source
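A hedged sketch of what such an assigner could look like, re-implementing the bounded-out-of-orderness behaviour on AssignerWithPeriodicWatermarks (rather than subclassing the extractor directly) and adding inactivity handling. The idle timeout and the choice to advance the watermark to "now minus the bound" are assumptions, not the actual Netflix logic; Event is the stand-in type from the earlier sketches.

```scala
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

// Behaves like BoundedOutOfOrdernessTimestampExtractor while data is flowing,
// but forces the watermark forward once a partition has been idle for too long.
class IdleAwareWatermarkAssigner(outOfOrdernessMs: Long, idleTimeoutMs: Long)
    extends AssignerWithPeriodicWatermarks[Event] {

  private var maxTimestamp: Long = Long.MinValue
  private var lastEventSystemTime: Long = System.currentTimeMillis()

  override def extractTimestamp(event: Event, previousElementTimestamp: Long): Long = {
    maxTimestamp = math.max(maxTimestamp, event.timestamp)
    lastEventSystemTime = System.currentTimeMillis()
    event.timestamp
  }

  override def getCurrentWatermark(): Watermark = {
    val now = System.currentTimeMillis()
    val idle = now - lastEventSystemTime > idleTimeoutMs
    val wm =
      if (idle) now - outOfOrdernessMs                      // force time forward while inactive
      else if (maxTimestamp == Long.MinValue) Long.MinValue // no data seen yet
      else maxTimestamp - outOfOrdernessMs
    // Per-partition / per-source metrics (step 3 on the slide) could be recorded here.
    new Watermark(wm)
  }
}

// usage:
// events.assignTimestampsAndWatermarks(
//   new IdleAwareWatermarkAssigner(outOfOrdernessMs = 10 * 60 * 1000L, idleTimeoutMs = 5 * 60 * 1000L))
```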
46. Possible Improvements
   - More sophisticated inactivity detection
   - More flexible forced-time-progression
   - Detect inactivity at the source

47. Checkpointing Large State

48. Checkpointing Large State
   - One unresponsive TM can cause slowness or even failure of the entire checkpoint

49. Checkpointing Large State
   - Resource intensive (2x-3x CPU/Network)

50. Checkpointing Large State
   - Reduce interval and add min-pause between checkpoints
     - Increases duplicates when restoring job
     - Large catch-up after restore
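The knobs on slide 50 are ordinary CheckpointConfig settings; a minimal sketch with placeholder values, not the production configuration:

```scala
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object CheckpointTuning {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Checkpoint interval (placeholder value)
    env.enableCheckpointing(30 * 60 * 1000L)

    // Guarantee a pause between checkpoints so the 2x-3x CPU/network cost of
    // checkpointing does not continuously compete with normal processing.
    env.getCheckpointConfig.setMinPauseBetweenCheckpoints(10 * 60 * 1000L)

    // Give slow, large-state checkpoints time to finish before they are declared failed.
    env.getCheckpointConfig.setCheckpointTimeout(30 * 60 * 1000L)
  }
}
```

As the slide notes, the price of spacing checkpoints out is more duplicate output on restore and a larger catch-up afterwards.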
51. An Observation About The State
   - Large portion of total state is recommendations data
   - Only ID and timestamp are needed for the join

52. Move Some State Out Of Flink
   - Keep only ID and timestamp in Flink
   - Move data to an external store
   - Fetching becomes an order of magnitude slower (network call vs. local disk)
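One way to realize slide 52's split is to keep only (ID, timestamp) references in Flink state and hydrate the full recommendations payload from an external store when emitting, for example via Flink's async I/O. Everything below (the RecRef/HydratedRec types and the store client) is hypothetical.

```scala
import java.util.concurrent.TimeUnit
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.async.{AsyncFunction, ResultFuture}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.util.{Failure, Success}

// What remains inside Flink state after moving the payload out.
case class RecRef(recId: String, timestamp: Long)
case class HydratedRec(recId: String, timestamp: Long, payload: String)

// Hypothetical non-blocking client for the external store.
trait RecStoreClient extends Serializable {
  def fetch(recId: String): Future[String]
}

class HydrateFromStore(client: RecStoreClient) extends AsyncFunction[RecRef, HydratedRec] {
  override def asyncInvoke(ref: RecRef, resultFuture: ResultFuture[HydratedRec]): Unit =
    client.fetch(ref.recId).onComplete {
      case Success(payload) => resultFuture.complete(Iterable(HydratedRec(ref.recId, ref.timestamp, payload)))
      case Failure(err)     => resultFuture.completeExceptionally(err)
    }
}

object Hydrate {
  // refs: the joined output that only carries IDs and timestamps
  def apply(refs: DataStream[RecRef], client: RecStoreClient): DataStream[HydratedRec] =
    AsyncDataStream.unorderedWait(refs, new HydrateFromStore(client), 5, TimeUnit.SECONDS, 100)
}
```

The slide's caveat applies directly here: each emit now pays a network round trip instead of a local RocksDB read.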
53. Possible Improvements
   - Checkpoint to/restore from persistent EBS
   - Incremental savepoint
   - Clean restart after checkpoint

54. Monitoring and Understanding The Job

55. Monitoring and Understanding The Job
   Flink Metrics
   - numberOfFailedCheckpoints, lastCheckpointDuration
   - inputQueueLength, outputQueueLength
   - currentLowWatermark
   - fullRestarts, downtime
   - ...

56. Monitoring and Understanding The Job
   Instance-/Container-Level Metrics
   - CPU, Network, Disk, Memory, GC, ...
   - Check for unbalanced processing

57. Monitoring and Understanding The Job
   Time and Watermarks
   - Event timestamps of inputs
     - Relative to wall-clock time & watermark
   - Watermark relative to wall-clock time
     - At different operators
     - Break down by task
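One lightweight way to get the "watermark relative to wall-clock time" view is to export it as an operator metric; a sketch using Flink's metric API, with a made-up metric name and a pass-through operator:

```scala
import org.apache.flink.configuration.Configuration
import org.apache.flink.metrics.Gauge
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.util.Collector

// Exposes how far this operator's current watermark lags behind wall-clock time.
class WatermarkLagMetric[T] extends ProcessFunction[T, T] {

  @volatile private var lastWatermark: Long = Long.MinValue

  override def open(parameters: Configuration): Unit = {
    getRuntimeContext.getMetricGroup.gauge[Long, Gauge[Long]]("watermarkLagMs", new Gauge[Long] {
      override def getValue: Long =
        if (lastWatermark == Long.MinValue) -1L else System.currentTimeMillis() - lastWatermark
    })
  }

  override def processElement(value: T, ctx: ProcessFunction[T, T]#Context, out: Collector[T]): Unit = {
    lastWatermark = ctx.timerService().currentWatermark()
    out.collect(value) // pass-through
  }
}

// usage: stream.process(new WatermarkLagMetric[Event])
```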
58. Monitoring and Understanding The Job
   Performance
   - Issues often only appear at scale
   - Time all parts of application
   - Look at CPU flamegraphs
   - Replay from earliest offset (Kafka)

59. Monitoring and Understanding The Job
   State
   - Difficult to get insights about entire state at a point in time
     - Take a savepoint
     - Manually schedule timer for every key to collect metrics

60. Wrap-Up
   - Job has been running well in production, especially after moving to 1.7
   - Continue to work on robustness, failure recovery, and operational ease
     - Trade-off some consistency for higher availability
     - Auto-scaling

61. Thanks! Questions?
