SoundCloud - User & Partner Conference - AT Internet

Big Data with Amazon Redshift and ATI
November 27th, 2013

SOUNDCLOUD IS THE WORLD’S
LEADING AUDIO PLATFORM

Every minute, creators upload

12hrs
of audio

reaching over

250m
people every month

PRESIDENT OBAMA

FOO FIGHTERS

SNOOP LION

MADONNA

SKRILLEX

MACKLEMORE

JOHN OLIVER
(DAILY SHOW/BUGLE)

How’s the sales funnel performing in Brazil,
and what’s the split between products?

DATA DEMOCRATIZATION

• Avoid Silos
• Remove unnecessary restrictions
• Provide simple tools
• Teach people how to use data


In one sentence:
Deliver the right information to the
right person at the right time.

DATA ANALYSIS AND REPORTING
2010-2012
PRODUCTION DB

ANALYTICS DB

AT Internet


Listens
Sounds
Users
Comments
Favorites
Shares
Reposts

Impressions
Clicks
Conversions
Suggestions
Downloads
Taggings
Uploads


Listens

timestamp
duration
sound
owner
listener
API-key
(location)
country

additional metadata:
• location within sound
• context (location on site)
• segmentation
Listening creates >6000 events/s
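
To put that rate in perspective, here is a back-of-the-envelope sketch. It treats 6000 events/s as a lower bound and assumes the rate is sustained around the clock, which the slide does not state.

```python
# Lower bound from the slide: listening creates more than 6000 events per second.
EVENTS_PER_SECOND = 6000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def events_per_day(rate_per_second: int) -> int:
    """Events per day, assuming the rate is sustained for 24 hours."""
    return rate_per_second * SECONDS_PER_DAY


print(f"{events_per_day(EVENTS_PER_SECOND):,} listen events per day")
```

Under those assumptions, listens alone account for more than half a billion events per day.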

BIG DATA

HADOOP TO THE RESCUE

2 data centers in AMS
200+ Nodes

HADOOP TO THE RESCUE

listen data
listen metadata
search data
recommender data
product testing data
backend production data
backend logs

HADOOP AND DATA DEMOCRATIZATION

Data is siloed on Hadoop
No data governance in place
Technical hurdles for access
Not realtime
Slow access

AMAZON REDSHIFT
Fast, fully managed DW service
Optimized for datasets of a petabyte or more
Fast query and I/O performance
Columnar storage technology

BI INFRASTRUCTURE
2013

Source Systems
Staging Area

DataWarehouse

Data Exploration

Amazon EMR
Hadoop

Pig/Ruby Scripts

COPY
MySQL
(production db)

Pig/Ruby Scripts

AT Internet

ETL Scripts
External Systems

Job execution powered by:

ETL Scripts

ATI Data Query

Create query:
1. Filter on funnel pages
2. Select metrics and dimensions
3. Add REST URL to ETL pipeline
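
Step 3 can be pictured as a small pipeline stage that turns the JSON returned by the REST URL into a delimited file for the staging area. The sketch below is a minimal illustration only: the `rows` key, the column names, and the sample values are assumptions made up for this example, not AT Internet's actual response format.

```python
import csv
import io
import json


def rows_to_csv(payload: str, columns: list[str]) -> str:
    """Flatten a JSON export (a list of row objects) into CSV text for staging."""
    rows = json.loads(payload)["rows"]  # "rows" is an assumed layout, not the real API
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        # Missing fields become empty strings so every line has the same width.
        writer.writerow([row.get(col, "") for col in columns])
    return out.getvalue()


# Made-up sample standing in for the response of the saved query's REST URL.
sample = json.dumps({"rows": [
    {"page": "signup", "country": "BR", "visits": 120},
    {"page": "checkout", "country": "BR", "visits": 45},
]})
print(rows_to_csv(sample, ["page", "country", "visits"]))
```

Keeping the column list explicit means the staging file layout stays stable even if the query later returns extra fields.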

DATA EXPLORATION
Simple and fast access to data
More time for “deep dives” into
data
Individualized Reporting
Allows interactivity between users
Integrated with Redshift

• Reports designed by end users
• Central repository for data analysis
• User interaction
• Data from one source only
• Scalable solution
• Data to the people!

THANK YOU!
P.S. WE’RE HIRING.
SOUNDCLOUD.COM/JOBS

IMPORT DATA FROM SOURCE SYSTEMS
First: Gather data from the various source systems into S3

Hadoop

Full/Daily Imports
MySQL
(production db)

External Systems

MapReduce for:
- Listens
- Plays
- Impressions
- Affiliations
- ...
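
As a rough illustration of what such a MapReduce job computes, here is a minimal in-memory sketch that counts listens per sound and country. The field names follow the listen event described earlier; the sample events are made up, and the real jobs run on Hadoop rather than in a single Python process.

```python
from collections import defaultdict


def map_listen(event: dict):
    """Map step: emit ((sound, country), 1) for each listen event."""
    yield (event["sound"], event["country"]), 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Made-up events; real input would be the listen logs.
events = [
    {"sound": "s1", "country": "BR"},
    {"sound": "s1", "country": "BR"},
    {"sound": "s2", "country": "DE"},
]
pairs = (pair for event in events for pair in map_listen(event))
print(reduce_counts(pairs))
```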

Second: Rebuild staging area tables for full imports
Based on configuration files
tracks

users

client
applications

Create statements generated
...

Re-create DISTKEYS and SORTKEYS
Full control over changes to the data model

Staging Area

YAML config files
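
Generating the CREATE statements from configuration could look roughly like the sketch below. The config layout (`columns`, `distkey`, `sortkeys`) and the `tracks` columns are hypothetical; a plain dict stands in for a parsed YAML file so the example has no external dependencies.

```python
def create_statement(table: str, config: dict) -> str:
    """Build a Redshift CREATE TABLE statement from one table's config."""
    cols = ",\n".join(
        f"  {name} {sqltype}" for name, sqltype in config["columns"].items()
    )
    stmt = f"CREATE TABLE {table} (\n{cols}\n)"
    if config.get("distkey"):
        stmt += f"\nDISTKEY ({config['distkey']})"
    if config.get("sortkeys"):
        stmt += f"\nSORTKEY ({', '.join(config['sortkeys'])})"
    return stmt + ";"


def rebuild_statements(table: str, config: dict) -> list[str]:
    """Drop and re-create a staging table so key changes take effect."""
    return [f"DROP TABLE IF EXISTS {table};", create_statement(table, config)]


# Hypothetical config, as it might look after loading a YAML file.
tracks_config = {
    "columns": {"id": "BIGINT", "user_id": "BIGINT", "created_at": "TIMESTAMP"},
    "distkey": "user_id",
    "sortkeys": ["created_at"],
}
for statement in rebuild_statements("tracks", tracks_config):
    print(statement)
```

Because the table is dropped and re-created on every full import, a change to a DISTKEY or SORTKEY only requires editing the config file.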

Third: Import the data from S3 to Redshift

tracks

Full import: TRUNCATE & COPY
Daily import: COPY

users

Staging Area

client
applications

...
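
The difference between the two import modes can be sketched as a helper that emits the SQL for each. The bucket path, the IAM role, and the tab-delimited gzip file format are placeholders assumed for illustration; the slides do not say how the files are formatted or how the cluster authenticates to S3.

```python
def import_statements(table: str, s3_prefix: str, iam_role: str,
                      full_import: bool) -> list[str]:
    """Return the SQL for a full import (TRUNCATE + COPY) or a daily one (COPY)."""
    copy = (
        f"COPY {table} FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' DELIMITER '\\t' GZIP;"
    )
    if full_import:
        # Full import: empty the staging table first, then reload everything.
        return [f"TRUNCATE {table};", copy]
    # Daily import: append the new day's files to what is already there.
    return [copy]


# Placeholder values, not real resources.
for sql in import_statements(
    "tracks",
    "s3://example-bucket/staging/tracks/",
    "arn:aws:iam::123456789012:role/example-redshift-copy",
    full_import=True,
):
    print(sql)
```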

ETL AND DW DATAMODEL
ETL scripts divided into layers:
- Layer 1: Staging -> DW (dimensions)
- Layer 2: Staging -> DW (fact tables - raw data)
- Layer 3: DW -> DW (aggregated fact tables)
- Layer 4: DW -> Reporting Data Cubes (reporting data)
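
To make Layer 3 concrete, here is a minimal sketch of building an aggregated fact table from a raw one. SQLite stands in for Redshift so the example runs anywhere; the table names, columns, and rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the warehouse
conn.executescript("""
-- Layer 2 output: one row per raw listen (hypothetical columns).
CREATE TABLE fact_listens (listen_date TEXT, country TEXT, duration INTEGER);
INSERT INTO fact_listens VALUES
  ('2013-11-26', 'BR', 180),
  ('2013-11-26', 'BR', 240),
  ('2013-11-26', 'DE', 200);

-- Layer 3: DW -> DW, aggregate the raw facts per day and country.
CREATE TABLE agg_listens_daily AS
SELECT listen_date, country,
       COUNT(*) AS listens,
       SUM(duration) AS total_duration
FROM fact_listens
GROUP BY listen_date, country;
""")

rows = conn.execute(
    "SELECT country, listens, total_duration "
    "FROM agg_listens_daily ORDER BY country"
).fetchall()
print(rows)
```

Reports in Layer 4 can then read the small aggregated table instead of scanning the raw facts.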

ETL AND DW DATAMODEL
DataWarehouse
ETL Layer 1 & 2

ETL Layer 3

ETL Layer 4

Data Exploration

Staging Area

Data Cleaning
Data Transformation

Data Presentation
SQL

Ruby/SQL Scripts
Data Aggregation
Ruby/SQL Scripts

JOB SCHEDULE AND EXECUTION
Job-scheduling tool developed
internally
Set dependencies between jobs
Execution on multiple machines
Supports all the ETL layers
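
The internal scheduling tool is not shown in the slides. As a sketch of the underlying idea, job dependencies can be resolved with a topological sort. The job names and the dependency graph below are assumptions that loosely follow the import steps and ETL layers described in this deck.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical jobs mapped to the set of jobs each one depends on.
jobs = {
    "import_to_s3": set(),
    "rebuild_staging": {"import_to_s3"},
    "copy_to_redshift": {"rebuild_staging"},
    "layer1_dimensions": {"copy_to_redshift"},
    "layer2_facts": {"copy_to_redshift"},
    "layer3_aggregates": {"layer1_dimensions", "layer2_facts"},
    "layer4_cubes": {"layer3_aggregates"},
}


def execution_order(dependencies: dict) -> list:
    """Return one valid order in which every job runs after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(jobs)
print(order)
```

Jobs with no dependency on each other, such as the two Layer 1 and Layer 2 jobs here, are the ones a scheduler could run in parallel on different machines.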

TIMELINE
Weeks: 2, 4, 6, 8, 10, 12, 14, 16
Stages: Analysis Stage, Design & Build, Test & Deploy

• Gap Analysis
• Business Exploration (requirements interviews)
Milestone: Requirement Analysis

• Information Mapping Design
• Solution Design (Draft)
Milestone: End of Analysis Stage

• Define Infrastructure
• Design Data Model
Milestone: Infrastructure Ready!

• Build ETL
• Build Data Cubes
• Design Reports/Dashboards (Presentation Layer)
Milestone: BI 1.0 is built!

• System/Integration Tests
• User Acceptance
Milestone: BI 1.0 is tested!

• User Workshops
• BI 1.0 Evaluation
Milestone: BI 1.0 is ready to use!
