Playbuzz is an authoring platform for creating, distributing and monetising engaging content.
ETL (Extract, Transform and Load)
A process responsible for pulling data out of the source systems and placing it into a data warehouse.
- User interaction events
- A/B testing
- Advertisement metrics
- Notifications from other internal and external services
Pipeline design objectives
1. Freshness: under one hour
2. Minimal DevOps investment
3. Scalable
4. Extendable / Modular
Data warehouse tech and consumers
Operational analytics
ETL
Different types of ETL
Stream
Retrieve from database tables
Pull from a service
Streaming ETL : Data collection
~500 million events/day
Event types: View, Action, Close, Widget View, Scroll, Click, Swipe, Type, Play, Banner Loaded, Video started, Video playing, ...
Collector REST service
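To put the daily volume in perspective, the short calculation below converts it to an average per-second rate. It only uses the ~500 million events/day figure from the slide; real traffic is unlikely to be uniform over the day, so this says nothing about peak load.

```python
# Average event rate implied by the daily volume quoted on the slide.
EVENTS_PER_DAY = 500_000_000  # "~500Mil events/day"
SECONDS_PER_DAY = 24 * 60 * 60


def average_events_per_second(events_per_day: int) -> float:
    """Return the mean rate, assuming events were spread evenly over a day."""
    return events_per_day / SECONDS_PER_DAY


avg_rate = average_events_per_second(EVENTS_PER_DAY)
print(f"average rate: {avg_rate:.0f} events/second")
```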
Streaming ETL : Event sample
Streaming ETL : Collector REST service
ECS + Autoscale
1. Splits each batch into separate events
2. Adds a timestamp
3. Adds a unique ID to each event
4. Adds IP, user-agent and headers
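The four collector steps above can be sketched as a single function. This is a minimal illustration, not the actual service: the function name and the field names (`collector_timestamp`, `event_id`, `ip`, `user_agent`, `headers`) are assumptions, and a real collector would take the request metadata from its web framework.

```python
import time
import uuid
from typing import Any, Dict, List, Optional


def split_batch(
    batch: List[Dict[str, Any]],
    ip: str,
    user_agent: str,
    headers: Dict[str, str],
    received_at: Optional[float] = None,
) -> List[Dict[str, Any]]:
    """Turn one client batch into separate, annotated events.

    All field names below are hypothetical; the slides do not show the schema.
    """
    timestamp = time.time() if received_at is None else received_at
    return [
        {
            **event,                            # 1. one output record per event
            "collector_timestamp": timestamp,   # 2. time the collector saw it
            "event_id": str(uuid.uuid4()),      # 3. unique id per event
            "ip": ip,                           # 4. request metadata
            "user_agent": user_agent,
            "headers": dict(headers),
        }
        for event in batch
    ]


events = split_batch(
    [{"type": "View"}, {"type": "Click"}],
    ip="203.0.113.7",
    user_agent="ExampleAgent/1.0",
    headers={"Accept": "*/*"},
    received_at=1000.0,
)
```

Every event in a batch shares the same collector timestamp here, because the batch arrives in one request.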
Streaming ETL : Fault tolerance 1
ECS + Autoscale, AWS Kinesis
- Events are ordered within a single shard
- Keeps data for 24 to 72 hours
- Multiple applications can read the stream
- Spark can start reading from the beginning of the stream or from the latest record
Streaming ETL : Event log as received
AWS Kinesis streaming
By processed time: ..., 13:10, 13:20, 13:30, 13:40, 13:50, ...
Streaming ETL : Event distribution
The longer the backlog, the higher the event timestamp variance across Kinesis shards.
Streaming ETL : Event log by timestamp
By processed time: 13:10, 13:20, 13:30
By collector time: 13:10, 13:20, 13:30
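Re-bucketing events by collector time rather than by the time they were processed could look like the sketch below. The 10-minute partition width is an assumption read off the 13:10 / 13:20 / 13:30 labels, and the `collector_timestamp` field name is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Any, Dict, List

PARTITION_MINUTES = 10  # assumption, inferred from the 13:10, 13:20, 13:30 labels


def partition_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its partition."""
    minute = ts.minute - ts.minute % PARTITION_MINUTES
    return ts.replace(minute=minute, second=0, microsecond=0)


def group_by_collector_time(
    events: List[Dict[str, Any]],
) -> Dict[datetime, List[Dict[str, Any]]]:
    """Place each event in the partition of its collector timestamp,
    regardless of the order in which the events were processed."""
    partitions: Dict[datetime, List[Dict[str, Any]]] = defaultdict(list)
    for event in events:
        partitions[partition_start(event["collector_timestamp"])].append(event)
    return dict(partitions)


# Events read out of order still land in the partition they belong to.
sample = [
    {"id": "a", "collector_timestamp": datetime(2024, 1, 1, 13, 21, 5, tzinfo=timezone.utc)},
    {"id": "b", "collector_timestamp": datetime(2024, 1, 1, 13, 17, 42, tzinfo=timezone.utc)},
    {"id": "c", "collector_timestamp": datetime(2024, 1, 1, 13, 10, 0, tzinfo=timezone.utc)},
]
partitions = group_by_collector_time(sample)
```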
Streaming ETL : Session
A session starts with the first event from a page and ends 5 minutes after the last one.
Diagram labels: time axis, 5 min gap, 1 partition
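The session rule can be illustrated with a gap-based grouping. This is a sketch under assumptions: events carry a hypothetical `page_view_id` and a `timestamp` in seconds, and a gap of more than five minutes between consecutive events of the same page view starts a new session (the slide does not say how a gap of exactly five minutes is treated).

```python
from typing import Any, Dict, List

SESSION_GAP_SECONDS = 5 * 60  # "ends 5 min after the last" event


def sessionize(events: List[Dict[str, Any]]) -> List[List[Dict[str, Any]]]:
    """Group events into sessions per page view using a 5-minute inactivity gap."""
    sessions: List[List[Dict[str, Any]]] = []
    latest: Dict[Any, List[Dict[str, Any]]] = {}  # page_view_id -> most recent session
    for event in sorted(events, key=lambda e: e["timestamp"]):
        current = latest.get(event["page_view_id"])
        if (
            current is not None
            and event["timestamp"] - current[-1]["timestamp"] <= SESSION_GAP_SECONDS
        ):
            current.append(event)      # still within the gap: same session
        else:
            current = [event]          # first event, or the gap expired
            sessions.append(current)
            latest[event["page_view_id"]] = current
    return sessions


sessions = sessionize([
    {"page_view_id": "p1", "timestamp": 0},
    {"page_view_id": "p1", "timestamp": 100},
    {"page_view_id": "p2", "timestamp": 50},
    {"page_view_id": "p1", "timestamp": 500},  # 400 s after the previous p1 event
])
```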
Streaming ETL : Session
Diagram labels: Raw events, Incomplete sessions, Union, Enrichment, Enriched events, Save incomplete events (the Union, Enrichment and Save steps appear twice)
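One way to read this flow is that each partition is unioned with the events of sessions left open by the previous partition; sessions that have been idle long enough are passed on to enrichment, and the rest are saved for the next run. The sketch below follows that reading. It is an interpretation, not the deck's code: the names are hypothetical, and for brevity it treats all events of one `page_view_id` as a single session.

```python
from typing import Any, Dict, List, Tuple

SESSION_GAP_SECONDS = 5 * 60

Event = Dict[str, Any]


def process_partition(
    raw_events: List[Event],
    carried_over: List[Event],
    partition_end: float,
) -> Tuple[List[Event], List[Event]]:
    """Return (events of complete sessions, events to carry to the next partition)."""
    by_page: Dict[Any, List[Event]] = {}
    for event in carried_over + raw_events:  # the "Union" step
        by_page.setdefault(event["page_view_id"], []).append(event)

    complete: List[Event] = []
    incomplete: List[Event] = []
    for events in by_page.values():
        last_seen = max(e["timestamp"] for e in events)
        if partition_end - last_seen >= SESSION_GAP_SECONDS:
            complete.extend(events)    # session closed: ready for enrichment
        else:
            incomplete.extend(events)  # may still receive events: save for later
    return complete, incomplete


# First partition ends at t=600: session "a" is closed, "b" is still open.
done_1, open_1 = process_partition(
    raw_events=[
        {"page_view_id": "a", "timestamp": 10},
        {"page_view_id": "a", "timestamp": 20},
        {"page_view_id": "b", "timestamp": 500},
    ],
    carried_over=[],
    partition_end=600,
)
# Second partition ends at t=1200 and picks up the open session.
done_2, open_2 = process_partition(
    raw_events=[{"page_view_id": "b", "timestamp": 650}],
    carried_over=open_1,
    partition_end=1200,
)
```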
Streaming ETL : Session
Diagram labels: partitions by collector time (13:10, 13:20, 13:30), Group in Session, Enrich, Enriched partitions (13:10, 13:20, 13:30)
Streaming ETL : Enrichment
Group in Session, Enrich
- Smearing properties over all events in the session
- IP to location
- User agent to device capabilities
- Interaction KPI triggers calculation: Loaded, Started, Engaged, Complete
- Story metadata
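"Smearing" here presumably means copying a property that only some events carry onto every event of the same session. A minimal sketch of that idea follows; the property names are made up for illustration, and it assumes a smeared property has a single value per session (the first non-missing value wins).

```python
from typing import Any, Dict, Iterable, List

Event = Dict[str, Any]


def smear_properties(session: List[Event], keys: Iterable[str]) -> List[Event]:
    """Copy the first non-missing value of each key to all events in the session."""
    shared: Dict[str, Any] = {}
    for key in keys:
        for event in session:
            if event.get(key) is not None:
                shared[key] = event[key]
                break
    # Return new dicts so the input events are left unchanged.
    return [{**event, **shared} for event in session]


session = [
    {"type": "View", "country": "IL", "device": None},
    {"type": "Scroll"},
    {"type": "Click", "device": "mobile"},
]
smeared = smear_properties(session, keys=["country", "device", "story_title"])
```

Keys that no event in the session carries (`story_title` above) are simply left out rather than filled with a placeholder.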
DB to DB ETL
S3
Orchestration : Dependencies
Diagram labels: Raw events, Enriched sessions, Agg by page, Quiz stats, linked by three ETL steps
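Dependencies like these determine the order in which the ETL steps may run. The sketch below derives a valid order with Python's standard `graphlib` module (Python 3.9+). The edges are an assumption about the diagram: enriched sessions are built from raw events, and both aggregates are built from enriched sessions.

```python
from graphlib import TopologicalSorter

# Each dataset maps to the set of datasets it is built from.
# These edges are assumed; the flattened slide text does not show the arrows.
DEPENDENCIES = {
    "enriched_sessions": {"raw_events"},
    "agg_by_page": {"enriched_sessions"},
    "quiz_stats": {"enriched_sessions"},
}

# static_order() yields every node after all of its prerequisites.
order = list(TopologicalSorter(DEPENDENCIES).static_order())
print(order)
```

The two aggregates have no dependency on each other, so their relative order is arbitrary and they could run in parallel.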
Orchestration: Data atlas
Orchestration
Some of the tasks to orchestrate:
1. Load a single partition and insert events into a table according to their collector timestamp
2. Enrich a partition
3. Load to Vertica
4. Clear intermediate files
5. Wait for partition
6. Wait for file
7. Trigger another pipeline
Apache Airflow
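Airflow covers "wait for" tasks with sensors. The plain-Python sketch below is not Airflow code; it only illustrates the polling idea behind a "wait for file" task, with a hypothetical function name and parameters.

```python
import os
import time


def wait_for_file(path: str, timeout_seconds: float, poll_seconds: float = 0.1) -> bool:
    """Poll until `path` exists; return False if the timeout expires first."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

A downstream task would run only when the function returns `True`; on `False` the orchestrator can retry or fail the run.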
Orchestration : Airflow
Orchestration : Airflow
Directed Acyclic Graph (DAG) of tasks
Orchestration : Airflow
~ 50 DAGs
~ 1000 tasks
And growing
Orchestration : Airflow
Task example
Orchestration : Airflow
Diagram labels: Web server, Scheduler, DB, Broker (Redis/RabbitMQ), Celery executor, two Celery workers, each running several task processes
Orchestration : Airflow Deployment
Diagram labels: ECS Airflow worker cluster (four workers), ECS Airflow services cluster (scheduler, webserver), shared EFS for DAGs, Redis, RDS Postgres
Our Plans
Simplify the streaming part by using Spark Structured Streaming
Improve the delivery cycle of all the components
Speed layer

ETL in Playbuzz
