Druid provides sub-second query latency, and Flink provides SQL on streams, allowing rich transformation and enrichment of events as they happen. In this talk we will learn how Lyft
uses Flink SQL and Druid together to support real-time analytics.
Meetup: https://www.meetup.com/druidio/events/252515792/
5. Example questions
Realtime
• How is the new pickup location in SFO airport affecting the market?
Geospatial
• Are the promos we deployed earlier in a sub-region effective at moving the metrics we thought they would move?
Anomaly
• Alert GMs when sub-region conversion is low because of a lack of supply.
8. Limitations
• Only yesterday’s data is queryable in the analytical DB
• P75 query latency in Presto is 30 seconds
Requirements
• Data freshness < 1 minute
• P95 query latency < 5 seconds
• Geospatial support
10. Apache Flink - Stream processor
• Scalable/performant distributed stream processor
• API heavily influenced by Google’s Dataflow Model
• Event time processing
• APIs
‒ Functional APIs
‒ Stream SQL
‒ Direct APIs
• Joins
• Windowing
• Supports batch execution
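To make "event time processing" and "windowing" concrete, here is a minimal plain-Python sketch of a tumbling event-time window. It is not Flink's API: the function name `tumbling_window_counts`, the `(event_time, key)` tuple layout, and the sample events are all assumptions for illustration. Each event is assigned to a window based on its own timestamp rather than its arrival order, which is the core idea; watermarks and late-data handling are deliberately left out.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) using each event's own timestamp.

    `events` is an iterable of (event_time_in_epoch_seconds, key) pairs.
    Arrival order does not matter: the window is derived from event time.
    """
    counts = defaultdict(int)
    for event_time, key in events:
        # Align the timestamp down to the start of its fixed-size window.
        window_start = event_time - (event_time % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical out-of-order events: (event time in seconds, pickup area).
events = [(125, "sfo"), (61, "sfo"), (119, "oak"), (59, "sfo")]
result = tumbling_window_counts(events, 60)
```

A real stream processor additionally has to decide when a window is complete, which is what watermarks are for; this sketch only shows the window assignment step.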
11. Druid - Columnar database
● Scalable in-memory columnar database
● Support for geospatial data
● Extensible
● Native integration with Superset
● Real time ingestion
12. Flink Stream SQL
● Familiarity with SQL
● Powerful semantics for data manipulation
● Streaming and batch mode
● Extensibility via UDF
● Joins
13. UDFs
● Geohash
● Geo region extraction
● URL cardinality reduction/normalization
○ /users/d9cca721a735d/location -> /users/{hash}/location
○ /v1//api// -> /v1/api
● User agent parsing
○ OS name / version
○ App Name / version
● Sampling
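The URL cardinality reduction examples above can be sketched as a small function. This is an illustrative Python version, not the UDF from the talk: the name `normalize_path` and the rule "a path segment of 8 or more hex characters is treated as a hash" are assumptions chosen so that the two examples on the slide come out as shown.

```python
import re

# Assumption: a segment made of 8+ hexadecimal characters is an opaque ID.
_HEX_SEGMENT = re.compile(r"^[0-9a-f]{8,}$", re.IGNORECASE)


def normalize_path(path: str) -> str:
    """Collapse repeated/trailing slashes and replace ID-like segments."""
    # Splitting on "/" and dropping empty parts removes "//" and trailing "/".
    segments = [s for s in path.split("/") if s]
    normalized = ["{hash}" if _HEX_SEGMENT.match(s) else s for s in segments]
    return "/" + "/".join(normalized)


print(normalize_path("/users/d9cca721a735d/location"))  # /users/{hash}/location
print(normalize_path("/v1//api//"))                      # /v1/api
```

The hex-length heuristic will also rewrite real words that happen to be all hex letters (for example `deadbeef`), so a production rule would likely need an allowlist or route templates.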
17. Validation of ingestion-spec
• Ingestion spec under source control
• Protobuf schema based compile time validation
‒ SQL
‒ Data type
‒ Column names
• Integration tests on sample data
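A rough sketch of what validating column names and data types against a schema could look like. This is not Lyft's tooling: `validate_spec`, the `SCHEMA` dictionary (a stand-in for fields read from a Protobuf descriptor), and the simplified spec layout are assumptions. The spec shape loosely follows Druid's `dataSchema` with `timestampSpec`, `dimensionsSpec`, and `metricsSpec`, reduced to the parts needed here.

```python
# Assumption: numeric type names as they might appear in the event schema.
NUMERIC_TYPES = {"int32", "int64", "float", "double"}


def validate_spec(spec, schema):
    """Return a list of error strings; an empty list means the spec is consistent."""
    errors = []
    data_schema = spec["dataSchema"]

    ts_column = data_schema["timestampSpec"]["column"]
    if ts_column not in schema:
        errors.append(f"timestamp column '{ts_column}' not in schema")

    for dim in data_schema["dimensionsSpec"]["dimensions"]:
        # Dimensions may be plain names or objects with a "name" key.
        name = dim if isinstance(dim, str) else dim["name"]
        if name not in schema:
            errors.append(f"dimension '{name}' not in schema")

    for metric in data_schema["metricsSpec"]:
        field = metric.get("fieldName")
        if field is None:
            continue  # e.g. a "count" metric reads no input column
        if field not in schema:
            errors.append(f"metric field '{field}' not in schema")
        elif schema[field] not in NUMERIC_TYPES:
            errors.append(f"metric field '{field}' is not numeric")

    return errors


# Hypothetical schema and specs for illustration.
SCHEMA = {"event_ts": "int64", "region": "string", "fare": "double"}
GOOD_SPEC = {"dataSchema": {
    "timestampSpec": {"column": "event_ts"},
    "dimensionsSpec": {"dimensions": ["region"]},
    "metricsSpec": [
        {"type": "count", "name": "events"},
        {"type": "doubleSum", "name": "fare_sum", "fieldName": "fare"},
    ],
}}
BAD_SPEC = {"dataSchema": {
    "timestampSpec": {"column": "event_ts"},
    "dimensionsSpec": {"dimensions": ["regoin"]},  # typo in column name
    "metricsSpec": [{"type": "doubleSum", "name": "r", "fieldName": "region"}],
}}
good_errors = validate_spec(GOOD_SPEC, SCHEMA)
bad_errors = validate_spec(BAD_SPEC, SCHEMA)
```

Run as part of a build or CI step, a check like this rejects a spec with a misspelled column or a sum over a string field before it reaches the cluster.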
20. Goal - all events in Druid in real time
• If you log it, you will find it
• Automagic Druid spec
‒ Offline analysis for dimensions/metrics
‒ Cardinality analysis
‒ Reasonable defaults
• Auto provisioning of various resources
‒ Kafka topic for a new event
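One way the offline dimension/metric analysis with cardinality checks might work, sketched in Python. Everything here is an assumption for illustration: the function name `propose_columns`, the rule that numeric fields become `doubleSum` metrics, the rule that string fields become dimensions only below a cardinality cap, and the sample events. Events are assumed to be flat dictionaries with scalar values.

```python
from collections import defaultdict


def propose_columns(events, max_cardinality=1000):
    """Propose dimensions and metrics from a sample of flat event dicts."""
    distinct = defaultdict(set)   # field -> distinct values seen
    non_numeric = set()           # fields with at least one non-numeric value
    for event in events:
        for field, value in event.items():
            distinct[field].add(value)
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                non_numeric.add(field)

    dimensions, metrics, skipped = [], [], []
    for field in sorted(distinct):
        if field not in non_numeric:
            # Default for numeric fields: aggregate them as a sum.
            metrics.append(
                {"type": "doubleSum", "name": f"{field}_sum", "fieldName": field}
            )
        elif len(distinct[field]) <= max_cardinality:
            dimensions.append(field)
        else:
            # Too many distinct values to be a useful default dimension.
            skipped.append(field)
    return {"dimensions": dimensions, "metrics": metrics, "skipped": skipped}


# Hypothetical sample events.
events = [
    {"region": "sfo", "ride_id": "a1", "fare": 12.5},
    {"region": "sfo", "ride_id": "b2", "fare": 8.0},
    {"region": "oak", "ride_id": "c3", "fare": 20.0},
]
proposal = propose_columns(events, max_cardinality=2)
```

With the deliberately tiny cap of 2, `region` is kept as a dimension, `ride_id` is skipped as too high-cardinality, and `fare` becomes a sum metric. A real system would tune the cap and the default aggregators per field type.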