1. is an authoring platform which allows users to create,
distribute and monetise engaging content.
2. ETL (Extract, Transform and Load)
is a process responsible for pulling data out of the source systems and placing it into a data
warehouse.
User Interaction events
A/B testing
Advertisement metrics
Notifications from other internal and external services
3. Pipeline design objectives
1. Freshness: under one hour
2. Requires minimal DevOps investment
3. Scalable
4. Extendable / Modular
8. Streaming ETL : Collector REST service
ECS + Autoscale
1. Breaks the batch into separate events
2. Adds timestamp
3. Adds unique id to each event
4. Adds IP, user agent and headers
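The four collector steps above can be sketched in plain Python. This is only an illustration under assumptions: the function name `explode_batch`, the field names (`collector_timestamp`, `event_id`, `ip`, `user_agent`, `headers`) and the choice of a UUID as the unique id are hypothetical and not taken from the slides.

```python
import time
import uuid


def explode_batch(batch, ip, user_agent, headers, now=None):
    """Split a batch into separate events and stamp each one.

    All field names here are hypothetical; the slides only say that a
    timestamp, a unique id, the IP, the user agent and the headers are added.
    """
    received_at = time.time() if now is None else now
    events = []
    for event in batch:
        stamped = dict(event)                         # copy, do not mutate the input
        stamped["collector_timestamp"] = received_at  # step 2: timestamp
        stamped["event_id"] = str(uuid.uuid4())       # step 3: unique id (assumed UUID)
        stamped["ip"] = ip                            # step 4: request metadata
        stamped["user_agent"] = user_agent
        stamped["headers"] = dict(headers)
        events.append(stamped)
    return events
```

In a real REST handler the `ip`, `user_agent` and `headers` arguments would come from the incoming HTTP request; they are passed explicitly here so the sketch stays framework-independent.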
9. Streaming ETL : Fault tolerance 1
ECS + Autoscale -> AWS Kinesis
Events are ordered within a single shard
Keeps data for 24 - 72 hours
Multiple applications can read
Spark can read from the beginning of the stream or from the latest record
10. Streaming ETL : Event log as received
[Figure: events streamed from AWS Kinesis into an event log partitioned by processed time: ..., 13:10, 13:20, 13:30, 13:40, 13:50, ...]
11. Streaming ETL : event distribution
The longer the backlog, the higher the event timestamp
variance across Kinesis shards.
12. Streaming ETL : Event log by timestamp
[Figure: the event log partitioned by processed time (13:10, 13:20, 13:30) next to the event log partitioned by collector time (13:10, 13:20, 13:30)]
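Re-bucketing events from processed time to collector time could look roughly like the sketch below. The 10-minute bucket size is inferred from the tick labels on the slide, and the field name `collector_timestamp` (epoch seconds, UTC) is an assumption for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(ts_seconds, minutes=10):
    """Floor an epoch timestamp to its bucket, e.g. 13:12 -> '... 13:10'."""
    dt = datetime.fromtimestamp(ts_seconds, tz=timezone.utc)
    floored = dt.replace(minute=dt.minute - dt.minute % minutes,
                         second=0, microsecond=0)
    return floored.strftime("%Y-%m-%d %H:%M")


def repartition_by_collector_time(events):
    """Group events by the bucket of their (assumed) collector timestamp."""
    partitions = defaultdict(list)
    for event in events:
        partitions[partition_key(event["collector_timestamp"])].append(event)
    return dict(partitions)
```

Events that arrive late because of a backlog land in the partition of the time they were collected, not the time they happened to be processed.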
13. Streaming ETL : Session
A session starts with the first event from a page
and ends 5 min after the last.
[Figure: time axis marking the 5 min window after the last event]
1 partition
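The session rule above (first event from a page opens a session, 5 minutes of silence closes it) can be sketched as a gap-based grouping. The field names `page_id` and `ts` (epoch seconds) are assumptions, and this is a single-machine illustration rather than the Spark job the deck describes.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 5 * 60  # "ends 5 min after the last" event


def group_into_sessions(events, gap=SESSION_GAP_SECONDS):
    """Split each page's events into sessions separated by more than `gap`."""
    by_page = defaultdict(list)
    for event in events:
        by_page[event["page_id"]].append(event)

    sessions = []
    for page_events in by_page.values():
        page_events.sort(key=lambda e: e["ts"])
        current = [page_events[0]]
        for event in page_events[1:]:
            # An event arriving more than `gap` after the previous one
            # closes the current session and opens a new one.
            if event["ts"] - current[-1]["ts"] > gap:
                sessions.append(current)
                current = [event]
            else:
                current.append(event)
        sessions.append(current)
    return sessions
```

An event that arrives exactly 5 minutes after the previous one is kept in the same session here; the slides do not say which side of the boundary it belongs to.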
14. Streaming ETL : Session
[Figure: session pipeline with the labels Raw events, Incomplete sessions, Union, Enrichment, Enriched events, Save incomplete events]
15. Streaming ETL : Session
[Figure: partitions by collector time (13:10, 13:20, 13:30) pass through Group in Session and Enrich to produce enriched partitions (13:10, 13:20, 13:30)]
16. Streaming ETL : Enrichment
Group in Session -> Enrich
Smearing properties over all events in the session
IP to location
User agent to device capabilities
Interaction KPI triggers calculation:
- Loaded
- Started
- Engaged
- Complete
Story metadata
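"Smearing" properties over a session can be read as copying a value seen on one event onto every other event of the same session. The sketch below takes that reading; the function name, the fill-only-missing behaviour and the example keys are assumptions, not details from the slides.

```python
def smear_properties(session, keys):
    """Fill missing values of `keys` on every event from the first event that has them.

    Existing non-null values are left untouched (an assumed policy).
    """
    smeared = {}
    for key in keys:
        for event in session:
            if event.get(key) is not None:
                smeared[key] = event[key]
                break

    result = []
    for event in session:
        enriched = dict(event)  # copy so the input session is not mutated
        for key, value in smeared.items():
            if enriched.get(key) is None:
                enriched[key] = value
        result.append(enriched)
    return result
```

For example, if only the first event of a session carries a resolved `country`, every event in the returned session has it, which makes per-event aggregation in the warehouse simpler.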
20. Orchestration
Some of the tasks to orchestrate:
1. Load a single partition and insert events into a table according to its collector timestamp
2. Enrich a partition
3. Load to Vertica
4. Clear intermediate files
5. Wait for partition
6. Wait for file
7. Trigger another pipeline
Apache Airflow
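What an orchestrator does with such tasks is, at its core, run them in dependency order. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9+); it is a stand-in for illustration, not Airflow's API, and both the task names and the dependency edges are assumed rather than taken from the slides.

```python
from graphlib import TopologicalSorter

# task -> set of tasks it depends on (assumed ordering, for illustration only)
DEPENDENCIES = {
    "wait_for_partition": set(),
    "load_partition": {"wait_for_partition"},
    "enrich_partition": {"load_partition"},
    "load_to_vertica": {"enrich_partition"},
    "clear_intermediate_files": {"load_to_vertica"},
    "trigger_next_pipeline": {"load_to_vertica"},
}


def execution_order(dependencies):
    """Return one valid order in which every task runs after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())
```

A real scheduler adds retries, sensors for "wait for file" style tasks, and parallel execution of independent tasks on top of this ordering.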
25. Orchestration : Airflow
[Figure: Airflow architecture with the labels Web server, Scheduler, DB, Broker (Redis/RabbitMQ), Celery / Executor, and Celery workers each running several task processes]