Companies doing any kind of advertising typically have an attribution process that joins users’ conversions with the impressions they were served or clicked on. The standard workflow is a batch job that runs every few hours or once a day.
However, as advertising technology grows more sophisticated, advertisers want more real-time reporting and results. This talk presents an example of a foundational architecture for near real-time attribution and advanced analytics on real-time impression and conversion data using Structured Streaming and Databricks Delta.
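At its core, attribution is a windowed join: each conversion is matched to the impressions served to the same user within a lookback window. A minimal pure-Python sketch of that join logic follows; the field names (`user_id`, `ad_id`, `ts`) and the 24-hour window are illustrative assumptions, not details from the talk.

```python
from datetime import datetime, timedelta

# Illustrative lookback window; real pipelines make this configurable.
LOOKBACK = timedelta(hours=24)

def attribute(conversions, impressions):
    """Match each conversion to same-user impressions within the lookback window.

    Returns a list of (conversion, matched_impressions) pairs.
    """
    results = []
    for conv in conversions:
        window_start = conv["ts"] - LOOKBACK
        matched = [
            imp for imp in impressions
            if imp["user_id"] == conv["user_id"]
            and window_start <= imp["ts"] <= conv["ts"]
        ]
        results.append((conv, matched))
    return results
```

In the streaming version of this pipeline, the same logic becomes a stream-stream join keyed on the user, with the lookback window expressed as a watermark-bounded time-range condition.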
2. Introduction
#ExpSAIS13
• Goal: Provide tools and information that can help you build more real-time / lower-latency attribution pipelines
• Crawl, Walk, Run: Pull Model
Caryl: previously MediaMath SE / PM for Attribution; SA for Databricks
4. Introduction
What is Databricks Delta?
Delta is a data management capability that brings data reliability and performance optimizations to the cloud data lake.
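As a concrete illustration, a Delta table is created like any Spark SQL table, with `USING DELTA` selecting the storage format. The table and column names here are hypothetical:

```sql
-- Hypothetical impressions table stored in Delta format
CREATE TABLE impressions (
  user_id STRING,
  ad_id   STRING,
  ts      TIMESTAMP
) USING DELTA;
```

Because the same table can be both a streaming sink and a batch query source, it serves as the bridge between the streaming ingest and the analytics layer described later.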
8. Attribution Challenges
Scale
• Often dealing with millions to billions of data
points per attribution window
Complexity
• The simple last-click model is still common
• Multi-touch attribution (MTA) and more sophisticated models are on the rise
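The two models mentioned above can be stated in a few lines each: last-click gives all credit to the most recent touchpoint, while a simple linear MTA splits credit evenly. This is a minimal sketch with illustrative inputs (a list of `(ad_id, timestamp)` pairs preceding the conversion), not the talk's implementation:

```python
def last_click(touchpoints):
    """Assign full credit for a conversion to the most recent touchpoint."""
    if not touchpoints:
        return {}
    winner = max(touchpoints, key=lambda t: t[1])
    return {winner[0]: 1.0}

def linear_mta(touchpoints):
    """A simple multi-touch model: split credit evenly across touchpoints."""
    if not touchpoints:
        return {}
    share = 1.0 / len(touchpoints)
    credit = {}
    for ad_id, _ in touchpoints:
        credit[ad_id] = credit.get(ad_id, 0.0) + share
    return credit
```

More sophisticated MTA models (position-based, data-driven) change only the credit function; the expensive part at scale remains the impression-conversion join that produces the touchpoint lists.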
14. Managing Performance
• How can we optimize performance?
• Levers:
– Delta Tools
• Optimize
• ZOrder
• Caching
• Data Skipping
– Join on Stream
– Cluster Size
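The Delta levers listed above map to short SQL commands. For example, compaction plus Z-ordering co-locates related rows so that data skipping can prune files on common filters; the table and column names here are hypothetical:

```sql
-- Compact small files and co-locate rows by a frequently filtered column,
-- improving data skipping for user_id lookups during the attribution join
OPTIMIZE impressions
ZORDER BY (user_id);
```

Z-ordering by the join key tends to matter most here, since the attribution join repeatedly filters impressions by user.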
15. Handling Complexity
• Flexibility with Complex Logic
– Forking streams
– Logic on query vs. in-stream
• Late or Corrected Data
– Upserts
– Views automatically update when the raw data changes
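Late or corrected records can be applied as a Delta upsert with `MERGE INTO`: matched rows are updated in place and new rows are inserted. A sketch with hypothetical table and column names:

```sql
-- Upsert late-arriving or corrected conversions into the attribution table
MERGE INTO conversions AS t
USING conversion_updates AS s
ON t.conversion_id = s.conversion_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

Because downstream views read from the merged table, reports pick up the corrections on their next query without any reprocessing step.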
16. Conclusion
• Unification of Batch & Streaming
• Easy APIs for Managing Performance
• Flexible and Scalable Analytics on Near Real-Time Data