Creating millions of user sessions using Complex Event Processing
Every day, Yelp connects millions of consumers with great local businesses through the website and mobile apps. We strive to provide our users with an ever-evolving, excellent experience by constantly running a plethora of experiments based on user activity.
A user session encapsulates all of a single user’s activity until the user has been dormant for 30 minutes. Creating user sessions requires us to process hundreds of millions of log events every day and apply filters to them. At this volume, session creation presents several application-level challenges, including handling late events and filtering bot traffic. Flink features such as event-time processing and exactly-once processing made it possible to build a streaming application at this scale.
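To make the 30-minute rule concrete, here is a minimal sketch of gap-based sessionization in plain Python. It is not the Flink job described in the talk; the `sessionize` function, the `(user_id, timestamp)` event shape, and the choice to treat a gap of exactly 30 minutes as still inside the session are all assumptions made for illustration.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # inactivity gap taken from the text


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user_id, timestamp) events into per-user sessions.

    A new session starts whenever a user's next event is more than `gap`
    seconds after that user's previous event (assumed boundary rule).
    Returns a list of (user_id, start_ts, end_ts, event_count) tuples.
    """
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    sessions = []
    for user_id, timestamps in by_user.items():
        timestamps.sort()  # order by event time, not arrival order
        start = prev = timestamps[0]
        count = 1
        for ts in timestamps[1:]:
            if ts - prev > gap:
                # The user was dormant long enough: close the session.
                sessions.append((user_id, start, prev, count))
                start, count = ts, 0
            prev = ts
            count += 1
        sessions.append((user_id, start, prev, count))
    return sessions
```

Sorting by timestamp stands in for event-time processing: an event that arrives late is still placed into the session its timestamp belongs to.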
Our main motivation for moving from batch processing to streaming was that feedback from analysis based on user sessions was always a day late. As an added bonus, the move also meant integrating with our state-of-the-art data pipeline ecosystem.
In this talk we will discuss why Yelp moved from creating user sessions with batch jobs to generating them in near real time with Apache Flink. We will also highlight the issues we encountered along the way: continuous bot traffic that never closed the session window, adding custom triggers for long-running sessions, duplicate events when allowing late events to be processed, and auditing of the created sessions.
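The bot-traffic problem can be illustrated with a small sketch: a client that emits an event every minute never goes dormant, so a pure inactivity gap never closes its session. One possible remedy, assumed here and not confirmed by the abstract, is to also close a session once it exceeds a maximum duration. The function name `split_sessions` and the `max_duration` parameter are hypothetical.

```python
def split_sessions(timestamps, gap=30 * 60, max_duration=float("inf")):
    """Split one user's event timestamps into (start, end) sessions.

    A session is closed when the inactivity gap is exceeded or, as an
    assumed safeguard, when an event falls more than `max_duration`
    seconds after the session start.
    """
    sessions = []
    start = prev = None
    for ts in sorted(timestamps):
        if start is None:
            start = prev = ts
            continue
        gap_expired = ts - prev > gap
        cap_reached = ts - start > max_duration
        if gap_expired or cap_reached:
            sessions.append((start, prev))
            start = ts
        prev = ts
    if start is not None:
        sessions.append((start, prev))
    return sessions


# A hypothetical bot that sends one event per minute for three hours.
bot_events = list(range(0, 3 * 3600 + 1, 60))
```

With only the gap rule, `split_sessions(bot_events)` yields a single three-hour session that would stay open for as long as the bot keeps sending; with `max_duration=3600` the same traffic is cut into bounded sessions that can be emitted and filtered.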
25. Stale Topics
● Event time processing
● Event time watermark: to signal progress in event time.
● Watermarks are crucial when events can be out-of-order
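The link between watermarks and stale topics can be sketched as follows. When several topics feed one operator, event-time progress is the minimum of the per-topic watermarks, so a topic that stops producing holds everything back. The `WatermarkTracker` class, its parameters, and the idle-timeout remedy are assumptions for illustration; the slide does not say how the problem was solved.

```python
class WatermarkTracker:
    """Toy model of a combined watermark over several input topics."""

    def __init__(self, sources, max_out_of_orderness, idle_timeout):
        self.max_out_of_orderness = max_out_of_orderness
        self.idle_timeout = idle_timeout
        self.max_event_ts = {s: None for s in sources}  # highest event time seen
        self.last_seen = {s: None for s in sources}     # wall-clock time of last event

    def on_event(self, source, event_ts, now):
        current = self.max_event_ts[source]
        self.max_event_ts[source] = event_ts if current is None else max(current, event_ts)
        self.last_seen[source] = now

    def watermark(self, now, skip_idle):
        """Return the minimum per-source watermark, or None if unknown.

        With skip_idle=True, sources silent for longer than idle_timeout
        are ignored so that they cannot stall event-time progress.
        """
        candidates = []
        for source, max_ts in self.max_event_ts.items():
            seen = self.last_seen[source]
            idle = seen is None or now - seen > self.idle_timeout
            if skip_idle and idle:
                continue
            if max_ts is None:
                return None  # a counted source has never produced an event
            candidates.append(max_ts - self.max_out_of_orderness)
        return min(candidates) if candidates else None


tracker = WatermarkTracker(["busy", "stale"], max_out_of_orderness=5, idle_timeout=60)
tracker.on_event("stale", event_ts=100, now=0)
tracker.on_event("busy", event_ts=100, now=0)
tracker.on_event("busy", event_ts=1000, now=900)
```

Here the stale topic pins the combined watermark at 95 even though the busy topic has reached 995, which means session windows ending after 95 would never fire. Skipping idle sources lets the watermark advance, at the cost of treating any data that later shows up on the stale topic as late.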
31. S3 Throttling / Checkpointing
● Due to the large ingestion rate, we were checkpointing frequently.
● This caused us to be throttled by S3, causing checkpoint failures.
34. Flink Internal Timer State
● Stores timers in-memory
● Slow data structure to store timer state
○ HeapInternalTimerService: O(n) deletion operation
○ High CPU usage
○ 50 million session mark → super slow processing
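Why an O(n) timer deletion hurts session workloads can be shown with a toy comparison. Extending a session typically means deleting the old end-of-session timer and registering a new one, so deletion sits on the hot path. The two classes below are illustrative Python stand-ins, not Flink's HeapInternalTimerService; lazy deletion is one common way to avoid the linear scan and is an assumption here, not the fix described in the talk.

```python
import heapq


class NaiveTimerQueue:
    """Binary heap where deleting a timer requires a linear scan."""

    def __init__(self):
        self._heap = []

    def register(self, ts, key):
        heapq.heappush(self._heap, (ts, key))

    def delete(self, ts, key):
        self._heap.remove((ts, key))  # O(n) search through the list
        heapq.heapify(self._heap)     # O(n) rebuild to restore heap order

    def pop_due(self, watermark):
        due = []
        while self._heap and self._heap[0][0] <= watermark:
            due.append(heapq.heappop(self._heap))
        return due


class LazyTimerQueue:
    """Same interface, but delete only drops the timer from a live set."""

    def __init__(self):
        self._heap = []
        self._live = set()

    def register(self, ts, key):
        if (ts, key) not in self._live:
            self._live.add((ts, key))
            heapq.heappush(self._heap, (ts, key))

    def delete(self, ts, key):
        self._live.discard((ts, key))  # O(1); the heap entry is skipped later

    def pop_due(self, watermark):
        due = []
        while self._heap and self._heap[0][0] <= watermark:
            timer = heapq.heappop(self._heap)
            if timer in self._live:  # ignore entries that were deleted
                self._live.remove(timer)
                due.append(timer)
        return due
```

Both queues fire the same timers, but with tens of millions of open sessions the naive version pays a full scan for every session extension, which is consistent with the high CPU usage noted on the slide.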