Hadoop at Musicmetric


Overview of how we use Hadoop at Musicmetric as part of our data processing pipeline. Presented at the April 2012 Hadoop User Group London meetup as part of Big Data Week. Note regarding slide 14 (Data Storage & Processing): we have since switched to Oozie to coordinate Hadoop workflows.


Hadoop at Musicmetric

Dr Jameel Syed
April 2012

Music has moved online
• The world has changed
– Do you buy vinyl/tapes/CDs of music?
– Do you buy music downloads?
– Do you download illegal content from BitTorrent?
– Do you listen to music on YouTube?
– Do you “like” bands on Facebook?
– Do you subscribe to Spotify?
– Do you listen on the radio to the weekly charts on a
Sunday afternoon?
• What’s happening online?

Data Science in the Music Industry
• Raw Data
– Social media/networks (Facebook, YouTube,
Twitter, Last.fm...)
– BitTorrent
– Online reviews
• Raw Data -> Derived Data -> Insight
– Who is popular right now/in the immediate
future?
– What was the effect of appearing at a festival?
– Which artists are (becoming) popular with
listeners with certain demographics (in a
region)?
• Data processing, machine learning &
statistical methods
– Sentiment analysis
– Named Entity Recognition
– Ranking
– Segmentation

Data Pipeline - Overview

[Diagram: Raw Data -> Data Processing/Aggregation -> Anomaly Detection -> Key-Value Store -> API -> Web Application]

• Engineering approach
– KISS
– Decoupled components

Data Pipeline - Input

[Diagram: Raw Data -> Data Processing/Aggregation -> Anomaly Detection -> Key-Value Store -> API -> Web Application]

• Input
– Distributed data collection from public internet
sources
• Real-time system constraints: 24/7 hourly data
• Changing format, scope
– Customers providing private data feeds
• e.g. sales and streaming data

Data Pipeline - Output

[Diagram: Raw Data -> Data Processing/Aggregation -> Anomaly Detection -> Key-Value Store -> API -> Web Application]

• Output
– Sparse data requests about hundreds of thousands of artists
– Timeliness
– Lots of combinations (by country/city, by release/track,
diff/cumulative, hourly/daily/weekly, charts…)
– Need to reprocess over EVERYTHING (new metadata,
re-delivery of data, anomaly detection)
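The "diff/cumulative" and "hourly/daily" combinations above can be illustrated with a small sketch. It assumes a series arrives as hourly cumulative totals; the function names and sample numbers are hypothetical and not taken from the Musicmetric code base.

```python
from collections import defaultdict
from datetime import datetime


def cumulative_to_diff(points):
    """Turn (timestamp, cumulative_total) pairs into (timestamp, increase) pairs."""
    ordered = sorted(points)
    diffs = []
    # Pair each sample with its predecessor; the first sample has no predecessor.
    for (_, prev_total), (ts, total) in zip(ordered, ordered[1:]):
        diffs.append((ts, total - prev_total))
    return diffs


def rollup_daily(diffs):
    """Sum hourly increases into one value per calendar day."""
    daily = defaultdict(int)
    for ts, delta in diffs:
        daily[ts.date()] += delta
    return dict(daily)


# Made-up hourly cumulative counts for a single artist.
hourly_cumulative = [
    (datetime(2012, 4, 1, 22), 100),
    (datetime(2012, 4, 1, 23), 130),
    (datetime(2012, 4, 2, 0), 150),
    (datetime(2012, 4, 2, 1), 155),
]

hourly_diff = cumulative_to_diff(hourly_cumulative)
daily_diff = rollup_daily(hourly_diff)
```

Each extra dimension (country/city, release/track, weekly) multiplies the number of such series, which is one reason reprocessing over everything becomes expensive.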

Why Hadoop?
• Outgrew initial solution for data processing
over existing data
– How long should daily processing take?
– I/O (disk seeks)
• Additional data
– BitTorrent scale-up
– iTunes sales
– Spotify plays

Hadoop Cluster
• 12 physical servers + 2 KVM virtual machines
• Cloudera CDH3/Ubuntu 10.04 LTS
• 2x Quad Core Xeon E5620 2.4GHz (HT, 32nm)
• 24GB RAM, 4x 2TB WD
• Gb Ethernet (no link aggregation yet)
• ~2.5kW (max 4kW)

[Cluster diagram]
• Hosts: mm-addax, mm-impala, mm-gazelle, mm-rhino-01, mm-rhino-02, mm-rhino-03, …, mm-rhino-10, mm-rhino-11
• Roles: Edge Server, NFS Server, DHCP/PXE/DNS, Primary Name Node, Secondary Name Node, Job Tracker, ZooKeeper, Data Node 01, Data Node 02, …, Data Node 09
• Network: private Hadoop network
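As a rough back-of-envelope check on the hardware listed on this slide, the sketch below estimates HDFS capacity. It assumes all nine data nodes share the listed spec, that every disk is given to HDFS, and that the replication factor is the HDFS default of 3; none of these is stated in the talk.

```python
DATA_NODES = 9        # "Data Node 01" to "Data Node 09" on the slide
DISKS_PER_NODE = 4    # "4x 2TB WD"
DISK_TB = 2
REPLICATION = 3       # assumption: HDFS default dfs.replication, not stated in the talk

# Raw disk space across the data nodes.
raw_tb = DATA_NODES * DISKS_PER_NODE * DISK_TB

# Each block is stored REPLICATION times, so unique data fits in a fraction of that.
usable_tb = raw_tb / REPLICATION

print(f"raw: {raw_tb} TB, usable at {REPLICATION}x replication: {usable_tb:.0f} TB")
```

Real usable space would be lower again once the operating system, MapReduce scratch space and free-space headroom are taken into account.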

Data Storage & Processing
[Diagram]
• Data stores: Private Data, Public Data, Hadoop (Raw data, Processed, Time series), Voldemort
• Steps: Push To Hadoop, Preprocess, Generate timeseries, HDFS to KVS
• RabbitMQ: To Hadoop, Preprocess, Timeseries, To_KVS
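One way to read this slide is that each processing step is triggered by a message and announces completion by sending a message to the next step. The sketch below illustrates that idea only: it uses Python's in-process `queue.Queue` as a stand-in for RabbitMQ, runs single-threaded, and the step names, handler signature and batch identifier are hypothetical.

```python
import queue

# Step names loosely follow the labels on the slide.
STEPS = ["to_hadoop", "preprocess", "timeseries", "to_kvs"]


def run_pipeline(batch_id, handlers):
    """Pass one batch through every step, one queue per step."""
    queues = {name: queue.Queue() for name in STEPS}
    queues[STEPS[0]].put(batch_id)
    log = []
    for i, name in enumerate(STEPS):
        while not queues[name].empty():
            msg = queues[name].get()
            handlers[name](msg)  # do the work for this step
            log.append((name, msg))
            if i + 1 < len(STEPS):
                # Completion of this step triggers the next one.
                queues[STEPS[i + 1]].put(msg)
    return log


# Placeholder handlers; a real step would launch a Hadoop job or a copy to the key-value store.
handlers = {name: (lambda msg: None) for name in STEPS}
log = run_pipeline("bittorrent-2012-04-01", handlers)
```

In a real deployment each step would be a separate consumer process, so the steps stay decoupled and can be restarted independently.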

• E.g. BitTorrent input data: per 1TB
• Pre-processed: 200GB
• Raw time series: 37GB
• Filtered/artist data: 2.5GB
• KVS: 1.9GB
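The figures above imply a large reduction at each stage. A short sketch of the arithmetic, taking 1TB as 1000GB (an assumption; using 1024GB changes the ratios only slightly):

```python
# Sizes in GB, taken from the slide; 1TB is treated as 1000GB (assumption).
stages = [
    ("BitTorrent input", 1000.0),
    ("Pre-processed", 200.0),
    ("Raw time series", 37.0),
    ("Filtered/artist data", 2.5),
    ("KVS", 1.9),
]


def reduction_report(stages):
    """Return (name, size_gb, factor_vs_previous_stage, factor_vs_input) per stage."""
    first = stages[0][1]
    prev = first
    rows = []
    for name, size in stages:
        rows.append((name, size, prev / size, first / size))
        prev = size
    return rows


report = reduction_report(stages)
for name, size, step, overall in report:
    print(f"{name:22s} {size:7.1f} GB  x{step:5.1f} vs previous  x{overall:6.1f} vs input")
```

Under this assumption, what reaches the key-value store is roughly 1/500 of the raw input.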

Opportunities
• Hive/Pig/HBase
• Mahout
• Nutch

Open Questions & Challenges
• Organizational readiness
– Planning
– Access
– Experience
• Cluster maintenance
– Unlikely to replicate production setup
– 24/7 (ish)
– What can be switched off when (and is it handled automatically)?
• Resource scheduling
• Workflow
• Amazon EMR vs own hardware?
– Predictable workload/cost?
– In for a penny, in for a pound
– Hotel California
• DBA equivalent on Hadoop? HDA

We are hiring

jobs@musicmetric.com
@tilapia



Slides 3 to 7 (titles only)
• How popular am I?
• Who are my fans?
• Where are my fans?
• What is the press saying?
• Who is popular?
