Overview of how we use Hadoop at Musicmetric as part of our data processing pipeline. Presented at the April 2012 Hadoop User Group London meetup as part of Big Data Week.
Note regarding slide 14: we have since switched to Oozie to coordinate Hadoop workflows.
2. Music has moved online
• The world has changed
– Do you buy vinyl/tapes/CDs of music?
– Do you buy music downloads?
– Do you download illegal content from BitTorrent?
– Do you listen to music on YouTube?
– Do you “like” bands on Facebook?
– Do you subscribe to Spotify?
– Do you listen on the radio to the weekly charts on a Sunday afternoon?
• What’s happening online?
8. Data Science in the Music Industry
• Raw Data
– Social media/networks (Facebook, YouTube, Twitter, Last.fm...)
– BitTorrent
– Online reviews
• Raw Data -> Derived Data -> Insight
– Who is popular right now/in the immediate future?
– What was the effect of appearing at a festival?
– Which artists are (becoming) popular with listeners with certain demographics (in a region)?
• Data processing, machine learning & statistical methods
– Sentiment analysis
– Named Entity Recognition
– Ranking
– Segmentation
9. Data Pipeline - Overview
[Diagram: pipeline components Raw Data, Data Processing (Anomaly Detection, Aggregation), Key-Value Store, API, Web Application]
• Engineering approach
– KISS
– Decoupled components
10. Data Pipeline - Input
• Input
– Distributed data collection from public internet sources
• Real-time system constraints: 24/7 hourly data
• Changing format, scope
– Customers providing private data feeds
• e.g. sales and streaming data
11. Data Pipeline - Output
• Output
– Sparse data requests about hundreds of thousands of artists
– Timeliness
– Lots of combinations (by country/city, by release/track, diff/cumulative, hourly/daily/weekly, charts…)
– Need to reprocess over EVERYTHING (new metadata, re-delivery of data, anomaly detection)
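To illustrate the hourly/daily and diff/cumulative combinations listed above, here is a minimal Python sketch that rolls hourly counts up into daily totals and then into a running total. The function names and the sample counts are hypothetical and are not taken from the Musicmetric pipeline; a real job at this scale would presumably be distributed rather than run in memory.

```python
from collections import defaultdict
from datetime import datetime
from itertools import accumulate


def daily_totals(hourly):
    """Sum (timestamp, count) hourly observations into per-day totals."""
    totals = defaultdict(int)
    for ts, count in hourly:
        totals[ts.date()] += count
    # Sort by date so later cumulative sums run in time order.
    return dict(sorted(totals.items()))


def cumulative(series):
    """Turn a per-period (diff) series into a running (cumulative) series."""
    periods = list(series)
    return dict(zip(periods, accumulate(series[p] for p in periods)))


# Hypothetical hourly counts for a single artist.
hourly = [
    (datetime(2012, 4, 1, 22), 5),
    (datetime(2012, 4, 1, 23), 7),
    (datetime(2012, 4, 2, 0), 3),
    (datetime(2012, 4, 2, 1), 4),
]

daily = daily_totals(hourly)   # diff view, one value per day
running = cumulative(daily)    # cumulative view over the same days
```

Every extra dimension (country, release, track) multiplies the number of such series, which is one reason a full reprocessing run over everything is expensive.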
12. Why Hadoop?
• Outgrew initial solution for data processing over existing data
– How long should daily processing take?
– I/O (disk seeks)
• Additional data
– BitTorrent scale-up
– iTunes sales
– Spotify plays
13. Hadoop Cluster
• 12 physical servers + 2 KVM virtual machines
• Cloudera CDH3/Ubuntu 10.04 LTS
• 2x Quad Core Xeon E5620 2.4 GHz (HT, 32 nm)
• 24 GB RAM, 4x 2 TB WD drives
• Gigabit Ethernet (no link aggregation yet)
• ~2.5 kW (max 4 kW)
[Diagram: cluster layout on a private Hadoop network]
• mm-addax: Edge Server
• mm-rhino-01: Primary Name Node
• mm-rhino-02: Secondary Name Node
• mm-impala: NFS Server, DHCP/PXE/DNS
• mm-rhino-03 … mm-rhino-11: Data Node 01 … Data Node 09
• Other labels in the diagram: Job Tracker, ZooKeeper, mm-gazelle
14. Data Storage & Processing
[Diagram: Private Data and Public Data flow into Hadoop (Raw data, Processed, Time series) and on to Voldemort]
• Steps: Push To Hadoop, Preprocess, Generate timeseries, HDFS to KVS
• RabbitMQ: To Hadoop, Preprocess, Timeseries, To_KVS
• E.g. per 1 TB of BitTorrent input data:
– Pre-processed: 200 GB
– Raw time series: 37 GB
– Filtered/artist data: 2.5 GB
– KVS: 1.9 GB
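The figures above imply a steep reduction at each stage. As a rough worked example, the Python sketch below computes each stage's share of the input and its shrink factor relative to the previous stage. The sizes are the ones on the slide; treating 1 TB as 1000 GB and the name `reduction_report` are assumptions made for illustration.

```python
# Stage sizes in GB, as quoted on the slide (assumption: 1 TB = 1000 GB).
stages = [
    ("BitTorrent input", 1000.0),
    ("Pre-processed", 200.0),
    ("Raw time series", 37.0),
    ("Filtered/artist data", 2.5),
    ("KVS", 1.9),
]


def reduction_report(stages):
    """Return (name, size_gb, fraction_of_input, factor_vs_previous) per stage."""
    first = stages[0][1]
    prev = first
    report = []
    for name, size in stages:
        report.append((name, size, size / first, prev / size))
        prev = size
    return report


report = reduction_report(stages)
for name, size, fraction, factor in report:
    print(f"{name:22s} {size:8.1f} GB  {fraction:8.3%} of input  "
          f"{factor:5.1f}x smaller than previous stage")
```

On these numbers the data served from the key-value store is roughly 0.2% of the raw input, a reduction of about 500x.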
16. Open Questions & Challenges
• Organizational readiness
– Planning
– Access
– Experience
• Cluster maintenance
– Unlikely to replicate production setup
– 24/7 (ish)
– What can be switched off when (and is it handled automatically)?
• Resource scheduling
• Workflow
• Amazon EMR vs own hardware?
– Predictable workload/cost?
– In for a penny, in for a pound
– Hotel California
• DBA equivalent on Hadoop? HDA