13. DATA ANALYSIS AND REPORTING
2010-2012
PRODUCTION DB
ANALYTICS DB
AT Internet
14. DATA ANALYSIS AND REPORTING
Listens
Sounds
Users
Comments
Favorites
Shares
Reposts
Impressions
Clicks
Conversions
Suggestions
Downloads
Taggings
Uploads
15. DATA ANALYSIS AND REPORTING
Listens
timestamp
duration
sound
owner
listener
API-key
(location)
country
16. DATA ANALYSIS AND REPORTING
additional metadata:
• location within sound
• context (location on site)
• segmentation
Listening creates >6000 events/s
BIG DATA
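To put the listen volume in perspective, here is a minimal sketch of a listen record and the daily event count implied by the rate above. The field names follow the slides; the field types, the optional location, and the `events_per_day` helper are assumptions made for illustration, not the production schema.

```python
from dataclasses import dataclass
from typing import Optional

SECONDS_PER_DAY = 24 * 60 * 60


@dataclass(frozen=True)
class ListenEvent:
    """One listen, using the field names from the slides (types are assumed)."""
    timestamp: int                  # assumed: Unix epoch seconds
    duration: float                 # assumed: seconds listened
    sound: int                      # assumed: numeric sound id
    owner: int                      # assumed: user id of the sound's owner
    listener: Optional[int]         # assumed: user id, None for anonymous listeners
    api_key: str
    country: str
    location: Optional[str] = None  # parenthesized on the slide, so optional here


def events_per_day(events_per_second: int) -> int:
    """Scale a sustained per-second event rate to a daily event count."""
    return events_per_second * SECONDS_PER_DAY


print(events_per_day(6000))  # 518400000
```

At a sustained 6000 events/s this comes to roughly half a billion events per day, before any of the additional metadata is attached.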
18. HADOOP TO THE RESCUE
listen data
listen metadata
search data
recommender data
product testing data
backend production data
backend logs
19. HADOOP AND DATA DEMOCRATIZATION
Data is siloed on hadoop
No data governance in place
Technical hurdles for access
Not realtime
Slow access
20. AMAZON REDSHIFT
Fast fully managed DW service
Optimized for datasets of a petabyte or more
Fast query and I/O performance
Columnar storage technology
21. BI INFRASTRUCTURE
2013
Pipeline stages: Source Systems, Staging Area, DataWarehouse, Data Exploration
Source systems: Amazon EMR, Hadoop, MySQL (production db), AT Internet, External Systems
Data movement: Pig/Ruby Scripts, COPY, ETL Scripts
Job execution powered by:
23. ATI Data Query
Create query:
1. Filter on funnel pages
2. Select metrics and dimension
3. Add REST URL to ETL pipeline
25. DATA EXPLORATION
Simple and fast access to data
More time for “deep dives” into data
Individualized Reporting
Allows interactivity between users
Integrated with Redshift
26. DATA DEMOCRATIZATION
• Reports designed by end users
• Central repository for data analysis
• User interaction
• Data from one source only
• Scalable solution
• Data to the people!
30. IMPORT DATA FROM SOURCE SYSTEMS
First: gather data from the source systems into S3
Hadoop
Full/Daily Imports
MySQL (production db)
External Systems
MapReduce for:
- Listens
- Plays
- Impressions
- Affiliations
- ...
31. IMPORT DATA FROM SOURCE SYSTEMS
Second: rebuild the Staging Area tables for full imports
Based on YAML configuration files
Staging Area tables: tracks, users, client applications, ...
CREATE statements are generated
DISTKEYs and SORTKEYs are re-created
Full control over changes in the data model
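As an illustration of generating CREATE statements from configuration, here is a minimal sketch. The config layout, the column names, and the `rebuild_statements` helper are hypothetical; the deck does not show the real YAML files. A plain dict stands in for a parsed YAML document so the example runs without extra dependencies.

```python
from typing import Any, Dict, List

# Hypothetical table definition, shaped like a parsed YAML config file.
TRACKS_CONFIG: Dict[str, Any] = {
    "table": "staging.tracks",
    "columns": {"id": "BIGINT", "user_id": "BIGINT", "created_at": "TIMESTAMP"},
    "distkey": "user_id",
    "sortkeys": ["created_at"],
}


def rebuild_statements(config: Dict[str, Any]) -> List[str]:
    """Return the DROP and CREATE statements that rebuild one staging table."""
    table = config["table"]
    columns = ",\n".join(
        f"  {name} {sql_type}" for name, sql_type in config["columns"].items()
    )
    create = f"CREATE TABLE {table} (\n{columns}\n)"
    # Keys are re-declared on every rebuild, so changing them is a config change.
    if config.get("distkey"):
        create += f"\nDISTKEY ({config['distkey']})"
    if config.get("sortkeys"):
        create += f"\nSORTKEY ({', '.join(config['sortkeys'])})"
    return [f"DROP TABLE IF EXISTS {table}", create]


for statement in rebuild_statements(TRACKS_CONFIG):
    print(statement + ";")
```

Keeping the table definition in one config file means a data model change, such as a new column or a different sort key, only requires editing that file before the next full import.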
32. IMPORT DATA FROM SOURCE SYSTEMS
Third: import the data from S3 into Redshift
Full import: TRUNCATE & COPY
Daily import: COPY
Staging Area tables: tracks, users, client applications, ...
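The difference between the two import modes can be sketched as generated SQL. This is an illustrative sketch only: the `import_statements` helper, the S3 path, the IAM role ARN, and the file format options (gzip, tab-delimited) are assumptions, not details taken from the deck.

```python
from typing import List


def import_statements(table: str, s3_path: str, iam_role: str, full: bool) -> List[str]:
    """Build the SQL for loading one staging table from S3.

    A full import empties the table first; a daily import only appends.
    """
    copy = (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' GZIP DELIMITER '\\t'"  # file format is assumed
    )
    return [f"TRUNCATE {table}", copy] if full else [copy]


# Hypothetical bucket and role, for illustration only.
ROLE = "arn:aws:iam::123456789012:role/example-redshift-load"

for sql in import_statements("staging.tracks", "s3://example-bucket/tracks/", ROLE, full=True):
    print(sql + ";")
```

One point to keep in mind with this pattern: in Redshift, TRUNCATE commits the transaction it runs in, so the table is briefly empty between the TRUNCATE and the end of the COPY.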
34. ETL AND DW DATAMODEL
Stages: Staging Area, DataWarehouse, Data Exploration
DataWarehouse layers: ETL Layer 1 & 2, ETL Layer 3, ETL Layer 4
ETL steps: Data Cleaning, Data Transformation, Data Aggregation, Data Presentation
Implemented with: SQL, Ruby/SQL Scripts
35. JOB SCHEDULE AND EXECUTION
Job-scheduling tool developed internally
Set dependencies between jobs
Execution on multiple machines
Supports all the ETL layers
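The internal scheduling tool is not shown in the deck, but the idea of dependencies between jobs can be sketched with Python's standard `graphlib` module (Python 3.9+). The job names and the dependency graph below are hypothetical; each batch contains jobs whose dependencies are already done, so the jobs in one batch could run in parallel on different machines.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical jobs: each key maps to the set of jobs it depends on.
JOBS: Dict[str, Set[str]] = {
    "import_hadoop_to_s3": set(),
    "import_mysql_to_s3": set(),
    "copy_to_staging": {"import_hadoop_to_s3", "import_mysql_to_s3"},
    "etl_layer_1_2": {"copy_to_staging"},
    "etl_layer_3": {"etl_layer_1_2"},
    "etl_layer_4": {"etl_layer_3"},
}


def execution_batches(jobs: Dict[str, Set[str]]) -> List[List[str]]:
    """Group jobs into batches; every job in a batch has all dependencies done."""
    sorter = TopologicalSorter(jobs)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock dependents
    return batches


for number, batch in enumerate(execution_batches(JOBS), start=1):
    print(number, batch)
```

In this sketch the two imports form the first batch and can run side by side, while the ETL layers run strictly one after another.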
36. TIMELINE
Weeks 2 to 16, in two-week steps
Phases: Analysis Stage, Design & Build, Test & Deploy
Activities and milestones, in order:
• Gap Analysis
• Business Exploration (requirements interviews)
• Requirement Analysis
• Information Mapping
• Design
• Solution Design (Draft)
Milestone: End of Analysis Stage
• Define Infrastructure
• Design Data Model
Milestone: Infrastructure Ready!
• Build ETL
• Build Data Cubes
• Design Reports/Dashboards (Presentation Layer)
Milestone: BI 1.0 is built!
• System/Integration Tests
• User Acceptance
Milestone: BI 1.0 is tested!
• User Workshops
• BI 1.0 Evaluation
Milestone: BI 1.0 is ready to use!