Big Data with Amazon Redshift and ATI
November 27th, 2013
HI, I’M OLE
SOUNDCLOUD IS THE WORLD’S
LEADING AUDIO PLATFORM
Every minute, creators upload 12 hours of audio,
reaching over 250 million people every month:
8% of the internet
PRESIDENT OBAMA

FOO FIGHTERS

SNOOP LION

MADONNA

SKRILLEX

MACKLEMORE

JOHN OLIVER
(DAILY SHOW/BUGLE)
How's the sales funnel performing in Brazil,
and what's the split between products?
DATA DEMOCRATIZATION

• Avoid Silos
• Remove unnecessary restrictions
• Provide simple tools
• Teach people how to use data
DATA DEMOCRATIZATION

In one sentence:
Deliver the right information to the
right person at the right time.
DATA ANALYSIS AND REPORTING
2010-2012
PRODUCTION DB

ANALYTICS DB

AT Internet
DATA ANALYSIS AND REPORTING

Listens
Sounds
Users
Comments
Favorites
Shares
Reposts

Impressions
Clicks
Conversions
Suggestions
Downloads
Taggings
Uploads
DATA ANALYSIS AND REPORTING

Listens

timestamp
duration
sound
owner
listener
API-key
(location)
country
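The listen fields above can be sketched as a record type. This is only an illustration: the field types, the millisecond unit, and the sample values are assumptions, and the slide does not say how the data is actually stored.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Listen:
    """One listen event, mirroring the fields named on the slide."""
    timestamp: datetime
    duration_ms: int          # unit is an assumption
    sound_id: int             # identifier types are assumptions
    owner_id: int
    listener_id: int
    api_key: str
    country: str
    # "(location)" is parenthesised on the slide, so it is treated as optional here
    location: Optional[str] = None


# Hypothetical sample values, for illustration only
example = Listen(
    timestamp=datetime(2013, 11, 27, 12, 0, tzinfo=timezone.utc),
    duration_ms=183_000,
    sound_id=1,
    owner_id=2,
    listener_id=3,
    api_key="hypothetical-key",
    country="BR",
)
```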
DATA ANALYSIS AND REPORTING
additional metadata:
• location within sound
• context (location on site)
• segmentation
Listening creates >6000 events/s

BIG DATA
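To put the event rate in perspective, here is a rough calculation. It assumes the 6000 events/s figure is sustained around the clock, which the slide does not state, so read the result as an order of magnitude.

```python
EVENTS_PER_SECOND = 6000          # figure from the slide (">6000 events/s")
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400

# Assumes a constant rate over the whole day
events_per_day = EVENTS_PER_SECOND * SECONDS_PER_DAY
print(f"{events_per_day:,} listen events per day")  # 518,400,000
```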
HADOOP TO THE RESCUE

2 data centers in AMS
200+ Nodes
HADOOP TO THE RESCUE

listen data
listen metadata
search data
recommender data
product testing data
backend production data
backend logs
HADOOP AND DATA DEMOCRATIZATION

Data is siloed on Hadoop
No data governance
Technical hurdles for access
Not real-time
Slow access
AMAZON REDSHIFT
Fast, fully managed DW service
Optimized for datasets of a petabyte or more
Fast query and I/O performance
Columnar storage technology
BI INFRASTRUCTURE
2013

Source Systems
Staging Area

Data Warehouse

Data Exploration

Amazon EMR
Hadoop

Pig/Ruby Scripts

COPY
MySQL
(production db)

Pig/Ruby Scripts

AT Internet

ETL Scripts
External Systems

Job execution powered by:

ETL Scripts
How's the sales funnel performing in Brazil,
and what's the split between products?
ATI Data Query

Create query:
1. Filter on funnel pages
2. Select metrics and dimensions
3. Add REST URL to ETL pipeline
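The three steps can be sketched as follows. Everything here is hypothetical: the base URL, the parameter names, and the metric and dimension names are placeholders, not AT Internet's actual REST API, and the list of sources only stands in for the ETL pipeline configuration.

```python
from urllib.parse import urlencode


def build_query_url(base_url, page_filter, metrics, dimensions):
    """Encode a saved query as a URL (parameter names are hypothetical)."""
    params = {
        "filter": page_filter,                # step 1: restrict to funnel pages
        "metrics": ",".join(metrics),         # step 2: choose metrics...
        "dimensions": ",".join(dimensions),   # ...and dimensions
    }
    return f"{base_url}?{urlencode(params)}"


# Step 3: register the URL as a source for the ETL pipeline
etl_sources = []
url = build_query_url(
    "https://example.invalid/query",          # placeholder host, not a real endpoint
    page_filter="funnel",
    metrics=["visits", "conversions"],
    dimensions=["country", "product"],
)
etl_sources.append(url)
```

Splitting by country and product is what would let the opening question about Brazil be answered from a single extract.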
Source Systems
Staging Area

Data Warehouse

Data Exploration

Amazon EMR
Hadoop

Pig/Ruby Scripts

COPY
MySQL
(production db)

Pig/Ruby Scripts

AT Internet

ETL Scripts
External Systems

Job execution powered by:

ETL Scripts
DATA EXPLORATION
Simple and fast access to data
More time for “deep dives” into
data
Individualized Reporting
Allows interactivity between users
Integrated with Redshift
DATA DEMOCRATIZATION
• Reports designed by end users
• Central repository for data analysis
• User interaction
• Data from one source only
• Scalable solution
• Data to the people!
QUESTIONS?
THANK YOU!
P.S. WE’RE HIRING.
SOUNDCLOUD.COM/JOBS
APPENDIX
IMPORT DATA FROM SOURCE SYSTEMS
First: Gather data from the various source systems into S3

Hadoop

Full/Daily Imports
MySQL
(production db)

External Systems

MapReduce for:
- Listens
- Plays
- Impressions
- Affiliations
- ...
IMPORT DATA FROM SOURCE SYSTEMS
Second: Rebuild staging area tables for full imports
Based on configuration files
tracks

users

client
applications

Create statements generated
...

Re-create DISTKEYs and SORTKEYs
Full control over changes to the data model

Staging Area

YAML config files
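A minimal sketch of generating the create statements from configuration. The config layout, the column names, and the drop-then-create approach are assumptions; the deck mentions Ruby scripts, so Python is used here only for illustration. A YAML file would be parsed into the same nested dictionary.

```python
# Hypothetical table definition, as it might look after loading a YAML file
config = {
    "tracks": {
        "columns": {"id": "BIGINT", "user_id": "BIGINT", "created_at": "TIMESTAMP"},
        "distkey": "user_id",
        "sortkeys": ["created_at"],
    },
}


def create_table_sql(name, spec):
    """Render a Redshift CREATE TABLE statement with DISTKEY and SORTKEY."""
    columns = ",\n".join(f"    {col} {ctype}" for col, ctype in spec["columns"].items())
    sql = f"CREATE TABLE {name} (\n{columns}\n)"
    if spec.get("distkey"):
        sql += f"\nDISTKEY ({spec['distkey']})"
    if spec.get("sortkeys"):
        sql += f"\nSORTKEY ({', '.join(spec['sortkeys'])})"
    return sql + ";"


def rebuild_sql(name, spec):
    """Statements to rebuild a staging table from scratch (assumed approach)."""
    return [f"DROP TABLE IF EXISTS {name};", create_table_sql(name, spec)]


statements = rebuild_sql("tracks", config["tracks"])
for statement in statements:
    print(statement)
```

Because the keys live in configuration, changing a distribution or sort key becomes a config edit followed by a full re-import.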
IMPORT DATA FROM SOURCE SYSTEMS
Third: Import the data from S3 to Redshift

tracks

Full import: TRUNCATE & COPY
Daily import: COPY

users

Staging Area

client
applications

...
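The two import modes can be sketched as a statement generator. The bucket path and IAM role are placeholders, and the COPY options a real load would need (file format, compression) are left out because the slides do not give them.

```python
def import_statements(table, s3_path, iam_role, full):
    """SQL for loading one staging table from S3 into Redshift."""
    copy = f"COPY {table} FROM '{s3_path}' IAM_ROLE '{iam_role}';"
    if full:
        # Full import: empty the table first, then reload everything
        return [f"TRUNCATE {table};", copy]
    # Daily import: append the new data only
    return [copy]


# Placeholder values, for illustration only
ROLE = "arn:aws:iam::000000000000:role/hypothetical-role"
full_import = import_statements("tracks", "s3://example-bucket/tracks/", ROLE, full=True)
daily_import = import_statements("tracks", "s3://example-bucket/tracks/2013-11-27/", ROLE, full=False)
```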
ETL AND DW DATAMODEL
ETL scripts divided into layers:
- Layer 1: Staging -> DW (dimensions)
- Layer 2: Staging -> DW (fact tables - raw data)
- Layer 3: DW -> DW (aggregated fact tables)
- Layer 4: DW -> Reporting Data Cubes (reporting data)
ETL AND DW DATAMODEL
Data Warehouse
ETL Layer 1 & 2

ETL Layer 3

ETL Layer 4

Data Exploration

Staging Area

Data Cleaning
Data Transformation

Data Presentation
SQL

Ruby/SQL Scripts
Data Aggregation
Ruby/SQL Scripts
JOB SCHEDULE AND EXECUTION
Job-scheduling tool developed
internally
Set dependencies between jobs
Execution on multiple machines
Supports all the ETL layers
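The internal scheduling tool is not described further, so the following is only a sketch of dependency-ordered execution using Python's standard `graphlib` module (Python 3.9+). The job names and the dependencies between the ETL layers are assumptions based on the layer descriptions.

```python
from graphlib import TopologicalSorter

# job -> set of jobs it depends on (hypothetical names and edges)
dependencies = {
    "layer1_dimensions": {"staging_import"},
    "layer2_raw_facts": {"staging_import"},
    "layer3_aggregates": {"layer1_dimensions", "layer2_raw_facts"},
    "layer4_reporting_cubes": {"layer3_aggregates"},
}


def run_in_batches(deps):
    """Group jobs into batches; jobs within a batch have no mutual dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()                       # raises CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose dependencies are all done
        batches.append(ready)               # a real scheduler would dispatch these to workers
        sorter.done(*ready)
    return batches


batches = run_in_batches(dependencies)
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

Jobs in the same batch are the ones that could be spread across multiple machines.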
TIMELINE
Week 2 to Week 16, in two-week steps
Phases: Analysis Stage, Design & Build, Test & Deploy

Work packages and the milestone each one leads to:

• Gap Analysis
• Business Exploration (requirements interviews)
Milestone: Requirement Analysis

• Information Mapping Design
• Solution Design (Draft)
Milestone: End of Analysis Stage

• Define Infrastructure
• Design Data Model
Milestone: Infrastructure Ready!

• Build ETL
• Build Data Cubes
• Design Reports/Dashboards (Presentation Layer)
Milestone: BI 1.0 is built!

• System/Integration Tests
• User Acceptance
Milestone: BI 1.0 is tested!

• User Workshops
• BI 1.0 Evaluation
Milestone: BI 1.0 is ready to use!

Sound cloud - User & Partner Conference - AT Internet
