Hive at Last.fm

•

5 j'aime•717 vues

This talk is about using Hive in practice. We will go through some of the specific use cases for which Hive is currently being used at Last.fm, highlighting its strengths and weaknesses along the way.

Technologie Formation

What is Last.fm?
A music community website, powered by scrobbling
that provides personalised radio.

We aggregate scrobbles. A single scrobble is the
smallest unit of music attention data.

1 scrobble = (track, artist, timestamp).

In numbers
• 40 million users visit the site every month
• 39 billion scrobbles (600 per second)
• 400k personalised radio stations per day

enter hadoop...

Hadoop cluster

• 44 nodes
• 8 cores per node
• 16 gig ram per node
• 4x 1TB 7200rpm disks per node

Hadoop what is it good for?

• Charts
• Reporting
• Corrections
• Site stats / metrics
• Neighbours
• Recommendations

But wait, can you tell us about <stuff/>?

• How many?
• When?
• Where?
• Who?
• Why? Why not?

Ad hoc questions

• We get them all the time.
• Questions are good things, but answers take up
time.
• We would typically write programs once, run
once.

enter Hive...

What is Hive?
"Hive is a data warehouse infrastructure built on
top of Hadoop"

You get an SQL-like language for queries.

Start queries from a shell, file, jdbc, thrift.

Why we chose Hive?

• SQL familiarity suits non data engineers.
• It integrates well with existing data sets.
• It worked.

eg: http://www.flickr.com/photos/lozzd/4203345000/

Example:

SELECT artistid, insertdate, count(1)
FROM scrobbles
WHERE (trackid = 10019 OR trackid = 368575614)
AND insertdate >= '2009-12-01'
AND insertdate <= '2009-12-31'
GROUP BY artistid, insertdate
ORDER BY artistid, insertdate;

Example:

Users that
scrobble
? Users that
use the radio

Example:

SELECT count(1) FROM scrobbles GROUP BY userid;

SELECT count(1) FROM radiologs GROUP BY userid;

SELECT count(1) FROM
radiologs r JOIN scrobbles s
ON r.userid = s.userid
GROUP BY r.userid;

Example:

Consider a user's scrobbles and radio listens for just one track
First scrobble!

Scrobbles

Radio

Time

Example:

SELECT r.userid, r.trackid, count(1)
FROM
(
SELECT userid, trackid, min(unixtime) as unixtime
FROM scrobbles GROUP BY userid, trackid
) s
JOIN
radiologs r
ON r.userid = s.userid AND r.trackid = r.trackid
WHERE s.unixtime < r.unixtime
GROUP BY r.userid, r.trackid

Other nice things about hive

• Joins are really really easy (most of the time).

Preparing a search index
The crowd Labels
cloud scrobbles

Charts corrections Catalogue

Hive

Solr

artists albums tracks

Not so great

• No recordio.
• Really huge joins can cause out of memory
exceptions.

Recommandé

20160708 データ処理のプラットフォームとしてのpython 札幌Ryuji Tamagawa

20161004 データ処理のプラットフォームとしてのpythonとpandas 東京Ryuji Tamagawa

800万人の"食べたい"をHadoopで分散処理Tatsuya Sasaki

Storm at SpotifyNeville Li

Scaling Your Team and Technology: The Agile Way - Erik DuindamAvisi B.V.

Scio - Moving to Google Cloud, A Spotify StoryNeville Li

Scala Data Pipelines @ SpotifyNeville Li

SparkR: Enabling Interactive Data Science at Scale on HadoopDataWorks Summit

Recommandé

20160708 データ処理のプラットフォームとしてのpython 札幌Ryuji Tamagawa

20161004 データ処理のプラットフォームとしてのpythonとpandas 東京Ryuji Tamagawa

800万人の"食べたい"をHadoopで分散処理Tatsuya Sasaki

Storm at SpotifyNeville Li

Scaling Your Team and Technology: The Agile Way - Erik DuindamAvisi B.V.

Scio - Moving to Google Cloud, A Spotify StoryNeville Li

Scala Data Pipelines @ SpotifyNeville Li

SparkR: Enabling Interactive Data Science at Scale on HadoopDataWorks Summit

Elephant in the cloudrhatr

Kick-R: Get your own R instance with 36 cores on AWSKiwamu Okabe

AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with DaskVíctor Zabalza

BDT201 AWS Data Pipeline - AWS re: Invent 2012Amazon Web Services

Lens: Data exploration with Dask and Jupyter widgetsVíctor Zabalza

Apache Spark: killer or savior of Apache Hadoop?rhatr

Making sense of performance and identifying stragglers in Data Analytics Fram...manish ranjan

Who’s Afraid of Graphs?Codemotion

Hadoop導入事例 in クックパッドTatsuya Sasaki

Sorry - How Bieber broke Google Cloud at SpotifyNeville Li

Spark - Alexis Seigneurin (English)Alexis Seigneurin

Hadoop in Data WarehousingAlexey Grigorev

SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17thAlton Alexander

Playlist Recommendations @ SpotifyNikhil Tibrewal

Hadoop Overview & Architecture EMC

Hadoop londonYahoo Developer Network

RDA BootcampHelen Linda

Frontera распределенный робот для обхода веба в больших объемах / Александр С...Ontico

Cloudera searchMark Kerzner

Devops kc meetup_5_20_2013Aaron Blythe

Yahoo compares Storm and SparkChicago Hadoop Users Group

Hadoop with PythonDonald Miner

Contenu connexe

Tendances

Elephant in the cloudrhatr

Kick-R: Get your own R instance with 36 cores on AWSKiwamu Okabe

AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with DaskVíctor Zabalza

BDT201 AWS Data Pipeline - AWS re: Invent 2012Amazon Web Services

Lens: Data exploration with Dask and Jupyter widgetsVíctor Zabalza

Apache Spark: killer or savior of Apache Hadoop?rhatr

Making sense of performance and identifying stragglers in Data Analytics Fram...manish ranjan

Who’s Afraid of Graphs?Codemotion

Hadoop導入事例 in クックパッドTatsuya Sasaki

Sorry - How Bieber broke Google Cloud at SpotifyNeville Li

Spark - Alexis Seigneurin (English)Alexis Seigneurin

Hadoop in Data WarehousingAlexey Grigorev

SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17thAlton Alexander

Tendances (13)

Elephant in the cloud

Kick-R: Get your own R instance with 36 cores on AWS

AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask

BDT201 AWS Data Pipeline - AWS re: Invent 2012

Lens: Data exploration with Dask and Jupyter widgets

Apache Spark: killer or savior of Apache Hadoop?

Making sense of performance and identifying stragglers in Data Analytics Fram...

Who’s Afraid of Graphs?

Hadoop導入事例 in クックパッド

Sorry - How Bieber broke Google Cloud at Spotify

Spark - Alexis Seigneurin (English)

Hadoop in Data Warehousing

SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th

Similaire à Hive at Last.fm

Playlist Recommendations @ SpotifyNikhil Tibrewal

Hadoop Overview & Architecture EMC

Hadoop londonYahoo Developer Network

RDA BootcampHelen Linda

Frontera распределенный робот для обхода веба в больших объемах / Александр С...Ontico

Cloudera searchMark Kerzner

Devops kc meetup_5_20_2013Aaron Blythe

Yahoo compares Storm and SparkChicago Hadoop Users Group

Hadoop with PythonDonald Miner

Hadoop Overview kdd2011Milind Bhandarkar

TriHUG Feb: Hive on sparktrihug

On the need for a W3C community group on RDF Stream ProcessingPlanetData Network of Excellence

OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...Oscar Corcho

The Evolution of Hadoop at Spotify - Through Failures and PainRafał Wojdyła

Cassandra Day SV 2014: Spark, Shark, and Apache CassandraDataStax Academy

Music Personalization : Real time Platforms.Esh Vckay

OCF.tw's talk about "Introduction to spark"Giivee The

Analyze one year of radio station songs aired with Spark SQL, Spotify, and Da...Paul Leclercq

Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar

Processing Large GraphsNishant Gandhi

Similaire à Hive at Last.fm (20)

Playlist Recommendations @ Spotify

Hadoop Overview & Architecture

Hadoop london

RDA Bootcamp

Frontera распределенный робот для обхода веба в больших объемах / Александр С...

Cloudera search

Devops kc meetup_5_20_2013

Yahoo compares Storm and Spark

Hadoop with Python

Hadoop Overview kdd2011

TriHUG Feb: Hive on spark

On the need for a W3C community group on RDF Stream Processing

OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...

The Evolution of Hadoop at Spotify - Through Failures and Pain

Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra

Music Personalization : Real time Platforms.

OCF.tw's talk about "Introduction to spark"

Analyze one year of radio station songs aired with Spark SQL, Spotify, and Da...

Practical Problem Solving with Apache Hadoop & Pig

Processing Large Graphs

Plus de Skills Matter

5 things cucumber is bad at by Richard LawrenceSkills Matter

Patterns for slick database applicationsSkills Matter

Scala e xchange 2013 haoyi li on metascala a tiny diy jvmSkills Matter

Oscar reiken jr on our success at manheimSkills Matter

Progressive f# tutorials nyc dmitry mozorov & jack pappas on code quotations ...Skills Matter

Cukeup nyc ian dees on elixir, erlang, and cucumberlSkills Matter

Cukeup nyc peter bell on getting started with cucumber.jsSkills Matter

Agile testing & bdd e xchange nyc 2013 jeffrey davidson & lav pathak & sam ho...Skills Matter

Progressive f# tutorials nyc rachel reese & phil trelford on try f# from zero...Skills Matter

Progressive f# tutorials nyc don syme on keynote f# in the open source worldSkills Matter

Agile testing & bdd e xchange nyc 2013 gojko adzic on bond villain guide to s...Skills Matter

Dmitry mozorov on code quotations code as-data for f#Skills Matter

A poet's guide_to_acceptance_testingSkills Matter

Russ miles-cloudfoundry-deep-diveSkills Matter

Serendipity-neo4jSkills Matter

Simon Peyton Jones: Managing parallelismSkills Matter

Plug 20110217Skills Matter

Lug presentationSkills Matter

I went to_a_communications_workshop_and_they_tSkills Matter

Plug saikuSkills Matter

Plus de Skills Matter (20)

5 things cucumber is bad at by Richard Lawrence

Patterns for slick database applications

Scala e xchange 2013 haoyi li on metascala a tiny diy jvm

Oscar reiken jr on our success at manheim

Progressive f# tutorials nyc dmitry mozorov & jack pappas on code quotations ...

Cukeup nyc ian dees on elixir, erlang, and cucumberl

Cukeup nyc peter bell on getting started with cucumber.js

Agile testing & bdd e xchange nyc 2013 jeffrey davidson & lav pathak & sam ho...

Progressive f# tutorials nyc rachel reese & phil trelford on try f# from zero...

Progressive f# tutorials nyc don syme on keynote f# in the open source world

Agile testing & bdd e xchange nyc 2013 gojko adzic on bond villain guide to s...

Dmitry mozorov on code quotations code as-data for f#

A poet's guide_to_acceptance_testing

Russ miles-cloudfoundry-deep-dive

Serendipity-neo4j

Simon Peyton Jones: Managing parallelism

Plug 20110217

Lug presentation

I went to_a_communications_workshop_and_they_t

Plug saiku

Dernier

The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j

Histor y of HAM Radio presentation slidevu2urc

08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls

Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK

08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls

Finology Group – Insurtech Innovation Award 2024The Digital Insurer

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science

08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays

Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge

Automating Google Workspace (GWS) & more with Apps Scriptwesley chun

GenCyber Cyber Security Day PresentationMichael W. Hawkins

A Call to Action for Generative AI in 2024Results

IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge

TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia

Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC

Slack Application Development 101 Slidespraypatel2

Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer

Dernier (20)

The Codex of Business Writing Software for Real-World Solutions 2.pptx

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...

Histor y of HAM Radio presentation slide

08448380779 Call Girls In Greater Kailash - I Women Seeking Men

Unblocking The Main Thread Solving ANRs and Frozen Frames

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men

Finology Group – Insurtech Innovation Award 2024

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx

08448380779 Call Girls In Friends Colony Women Seeking Men

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...

Driving Behavioral Change for Information Management through Data-Driven Gree...

Automating Google Workspace (GWS) & more with Apps Script

GenCyber Cyber Security Day Presentation

A Call to Action for Generative AI in 2024

IAC 2024 - IA Fast Track to Search Focused AI Solutions

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...

Breaking the Kubernetes Kill Chain: Host Path Mount

Slack Application Development 101 Slides

Tata AIG General Insurance Company - Insurer Innovation Award 2024

Hive at Last.fm

1. Hive at Last.fm March 2010

2. What is Last.fm? A music community website, powered by scrobbling that provides personalised radio. We aggregate scrobbles. A single scrobble is the smallest unit of music attention data. 1 scrobble = (track, artist, timestamp).

3. In numbers • 40 million users visit the site every month • 39 billion scrobbles (600 per second) • 400k personalised radio stations per day enter hadoop...

4. Hadoop cluster • 44 nodes • 8 cores per node • 16 gig ram per node • 4x 1TB 7200rpm disks per node

5. Hadoop what is it good for? • Charts • Reporting • Corrections • Site stats / metrics • Neighbours • Recommendations

6. But wait, can you tell us about <stuff/>? • How many? • When? • Where? • Who? • Why? Why not?

7. Ad hoc questions • We get them all the time. • Questions are good things, but answers take up time. • We would typically write programs once, run once. enter Hive...

8. What is Hive? "Hive is a data warehouse infrastructure built on top of Hadoop" You get an SQL-like language for queries. Start queries from a shell, file, jdbc, thrift.

9. Hive: SQL warehouse Hadoop

10. Why we chose Hive? • SQL familiarity suits non data engineers. • It integrates well with existing data sets. • It worked.

11. Johan set it up..

12. eg: http://www.flickr.com/photos/lozzd/4203345000/

13. Example: SELECT artistid, insertdate, count(1) FROM scrobbles WHERE (trackid = 10019 OR trackid = 368575614) AND insertdate >= '2009-12-01' AND insertdate <= '2009-12-31' GROUP BY artistid, insertdate ORDER BY artistid, insertdate;

14. Example: Users that scrobble ? Users that use the radio

15. Example: SELECT count(1) FROM scrobbles GROUP BY userid; SELECT count(1) FROM radiologs GROUP BY userid; SELECT count(1) FROM radiologs r JOIN scrobbles s ON r.userid = s.userid GROUP BY r.userid;

16. Example: Consider a user's scrobbles and radio listens for just one track First scrobble! Scrobbles Radio Time

17. Example: SELECT r.userid, r.trackid, count(1) FROM ( SELECT userid, trackid, min(unixtime) as unixtime FROM scrobbles GROUP BY userid, trackid ) s JOIN radiologs r ON r.userid = s.userid AND r.trackid = r.trackid WHERE s.unixtime < r.unixtime GROUP BY r.userid, r.trackid

18. Other nice things about hive • Joins are really really easy (most of the time).

19. Preparing a search index The crowd Labels cloud scrobbles Charts corrections Catalogue Hive Solr artists albums tracks

20. Not so great • No recordio. • Really huge joins can cause out of memory exceptions.