Hadoop at Musicmetric

     Dr Jameel Syed
         April 2012
Music has moved online
• The world has changed
  –   Do you buy vinyl/tapes/CDs of music?
  –   Do you buy music downloads?
  –   Do you download illegal content from BitTorrent?
  –   Do you listen to music on YouTube?
  –   Do you “like” bands on Facebook?
  –   Do you subscribe to Spotify?
  –   Do you listen on the radio to the weekly charts on a
      Sunday afternoon?
• What’s happening online?
How popular am I?
Who are my fans?
Where are my fans?
What is the press saying?
Who is popular?
Data Science in the Music Industry
• Raw Data
    – Social media/networks (Facebook, YouTube,
      Twitter, Last.fm...)
    – BitTorrent
    – Online reviews
• Raw Data -> Derived Data -> Insight
    – Who is popular right now/in the immediate
      future?
    – What was the effect of appearing at a festival?
    – Which artists are (becoming) popular with
      listeners with certain demographics (in a
      region)?
• Data processing, machine learning &
  statistical methods
    –   Sentiment analysis
    –   Named Entity Recognition
    –   Ranking
    –   Segmentation
Data Pipeline - Overview

   [Diagram: Raw Data -> Data Processing (Anomaly Detection, Aggregation) -> Key-Value Store -> API -> Web Application]

• Engineering approach
  – KISS
  – Decoupled components
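
As a rough illustration of decoupled components, the stages in the diagram can be sketched as independent functions that only share plain data. This is a minimal sketch, not Musicmetric's implementation: the record fields, the median-based anomaly rule, the `factor` parameter, and the use of a dict as a stand-in key-value store are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import median


def detect_anomalies(records, factor=10.0):
    """Drop records whose count exceeds `factor` times the median count.

    The rule is a hypothetical placeholder for a real anomaly detector.
    """
    limit = factor * median(r["count"] for r in records)
    return [r for r in records if r["count"] <= limit]


def aggregate(records):
    """Sum counts per (artist, day)."""
    totals = defaultdict(int)
    for r in records:
        totals[(r["artist"], r["day"])] += r["count"]
    return dict(totals)


def to_key_value_store(totals, store):
    """Write aggregates under 'artist:day' keys; `store` is any dict-like object."""
    for (artist, day), total in totals.items():
        store[f"{artist}:{day}"] = total
    return store


# Hypothetical raw records; the last one is an obvious spike.
raw = [
    {"artist": "a1", "day": "2012-04-01", "count": 10},
    {"artist": "a1", "day": "2012-04-01", "count": 12},
    {"artist": "a2", "day": "2012-04-01", "count": 11},
    {"artist": "a2", "day": "2012-04-01", "count": 5000},
]

store = to_key_value_store(aggregate(detect_anomalies(raw)), {})
print(store)
```

Because each stage takes and returns ordinary data, any one of them could be replaced or rerun without touching the others, which is the point of keeping components decoupled.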
Data Pipeline - Input

   [Diagram: Raw Data -> Data Processing (Anomaly Detection, Aggregation) -> Key-Value Store -> API -> Web Application]

• Input
  – Distributed data collection from public internet
    sources
      • Real-time system constraints: 24/7 hourly data
      • Changing format, scope
  – Customers providing private data feeds
      • e.g. sales and streaming data
Data Pipeline - Output

   [Diagram: Raw Data -> Data Processing (Anomaly Detection, Aggregation) -> Key-Value Store -> API -> Web Application]

• Output
  – Sparse data requests about hundreds of thousands of artists
  – Timeliness
  – Lots of combinations (by country/city, by release/track,
    diff/cumulative, hourly/daily/weekly, charts…)
  – Need to reprocess over EVERYTHING (new metadata,
    re-delivery of data, anomaly detection)
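
To make the "lots of combinations" point concrete, here is a small sketch of how one hourly series can yield daily, cumulative, and diff views. The timestamp format, the function names, and the sample numbers are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import accumulate


def daily_totals(hourly):
    """Roll hourly (timestamp, count) pairs up to per-day totals.

    Timestamps are assumed to be ISO-like strings such as '2012-04-01T13',
    so the first 10 characters identify the day.
    """
    totals = defaultdict(int)
    for ts, count in hourly:
        totals[ts[:10]] += count
    return dict(sorted(totals.items()))


def cumulative(series):
    """Running total over an ordered {period: count} mapping."""
    periods = list(series)
    return dict(zip(periods, accumulate(series[p] for p in periods)))


def diff(cum):
    """Recover per-period differences from a cumulative series."""
    out, prev = {}, 0
    for period, value in cum.items():
        out[period] = value - prev
        prev = value
    return out


# Hypothetical hourly counts for one artist.
hourly = [
    ("2012-04-01T22", 3),
    ("2012-04-01T23", 4),
    ("2012-04-02T00", 5),
    ("2012-04-02T01", 1),
]

daily = daily_totals(hourly)
cum = cumulative(daily)
print(daily)
print(cum)
print(diff(cum))
```

Every extra dimension (country/city, release/track, hourly/daily/weekly) multiplies the number of such views, which is why the output side is dominated by many small, sparse requests.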
Why Hadoop?
• Outgrew initial solution for data processing
  over existing data
  – How long should daily processing take?
  – I/O (disk seeks)
• Additional data
  – BitTorrent scale-up
  – iTunes sales
  – Spotify plays
Hadoop Cluster
•    12 physical servers + 2 KVM virtual machines
•    Cloudera CDH3/Ubuntu 10.04 LTS
•    2x quad-core Xeon E5620, 2.4 GHz (HT, 32 nm)
•    24 GB RAM, 4x 2 TB WD disks
•    Gigabit Ethernet (no link aggregation yet)
•    ~2.5 kW (max 4 kW)

•    Hosts (from the cluster diagram):
     – mm-addax: Edge Server
     – mm-rhino-01: Primary Name Node, Job Tracker, ZooKeeper
     – mm-rhino-02: Secondary Name Node
     – mm-impala, mm-gazelle: NFS Server, DHCP/PXE/DNS
     – mm-rhino-03 … mm-rhino-11: Data Node 01 … Data Node 09
     – Private Hadoop network
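
A back-of-the-envelope capacity estimate follows from the figures on this slide. This is only a sketch under stated assumptions: it counts the disks of the nine data nodes only, uses decimal terabytes, ignores operating system and temporary space, and assumes the HDFS default replication factor of 3, which the slide does not confirm.

```python
DATA_NODES = 9       # Data Node 01 to Data Node 09
DISKS_PER_NODE = 4   # "4x 2 TB" per server
DISK_TB = 2
REPLICATION = 3      # assumption: HDFS default, not stated on the slide

raw_tb = DATA_NODES * DISKS_PER_NODE * DISK_TB
usable_tb = raw_tb / REPLICATION

print(f"Raw disk capacity on data nodes: {raw_tb} TB")
print(f"Approximate usable HDFS capacity at replication {REPLICATION}: {usable_tb:.0f} TB")
```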
Data Storage & Processing
   [Diagram]
   Inputs:    Private Data, Public Data
   Data:      Raw data -> Processed -> Time series (in Hadoop) -> Voldemort
   Jobs:      Push To Hadoop -> Preprocess -> Generate timeseries -> HDFS to KVS
   RabbitMQ:  To Hadoop, Preprocess, Timeseries, To_KVS


•   E.g. per 1 TB of BitTorrent input data:
•   Pre-processed: 200 GB
•   Raw time series: 37 GB
•   Filtered/artist data: 2.5 GB
•   KVS (key-value store): 1.9 GB
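
The sizes above imply a large reduction at each stage. The sketch below works out the ratios from the slide's figures, assuming decimal units (1 TB = 1000 GB); the stage names are taken from the bullets and the variable names are illustrative.

```python
GB_PER_TB = 1000  # assumption: decimal units

# (stage, size in GB) per 1 TB of BitTorrent input, from the slide
stages = [
    ("BitTorrent input", 1 * GB_PER_TB),
    ("Pre-processed", 200),
    ("Raw time series", 37),
    ("Filtered/artist data", 2.5),
    ("KVS", 1.9),
]

input_gb = stages[0][1]
for name, size_gb in stages:
    share = 100 * size_gb / input_gb
    print(f"{name:22s} {size_gb:8.1f} GB  {share:6.2f}% of input")

# Overall reduction from raw input to what the key-value store serves
overall = input_gb / stages[-1][1]
print(f"Overall reduction: about {overall:.0f}x")
```

Under these assumptions the key-value store holds roughly 0.2% of the raw input volume, so the serving layer stays small while Hadoop carries the bulk data.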
Opportunities
• Hive/Pig/HBase
• Mahout
• Nutch
Open Questions & Challenges
• Organizational readiness
    – Planning
    – Access
    – Experience
• Cluster maintenance
    – Unlikely to replicate production setup
    – 24/7 (ish)
    – What can be switched off when (and is it handled automatically)?
• Resource scheduling
• Workflow
• Amazon EMR vs own hardware?
    – Predictable workload/cost?
    – In for a penny, in for a pound
    – Hotel California
• DBA equivalent on Hadoop? HDA
We are hiring

jobs@musicmetric.com
      @tilapia
