SlideShare une entreprise Scribd logo
1  sur  37
Télécharger pour lire hors ligne
DiscoRank: Optimizing Discoverability
on SoundCloud
Amélie Anglade
• Developer at SoundCloud
• SoundCloud is the
world’s largest social
sound platform
• Academic background in
Music Information
Retrieval (MIR)
• Design, prototype and
implement Machine
Learning algorithms for
music discovery
DISCOVERABILITY ?
PAGERANK
• The web is a graph:
• nodes = web pages
• edges = hyperlinks
• The (Page)rank of a node depends on the link
structure of the graph
WEB AND PAGERANK
RANDOM SURFER
RANDOM SURFER
A
B
C
D
1/3
1/3
1/3
RANDOM SURFER
A
B
C
D
1/3
1/3
1/3
Nodes visited more often:
• Nodes with many links
• Coming from frequently visited nodes
RANDOM SURFER
A
B
C
D
E
Adjacency matrix A
COMPUTING THE PAGERANK
A
B
C
D
E
Transition probability matrix M
Probability distribution
of surfer’s position
Adjacency matrix A
COMPUTING THE PAGERANK
A
B
C
D
E
Transition probability matrix M
Probability distribution
of surfer’s position
Adjacency matrix A
COMPUTING THE PAGERANK
A
B
C
D
E
Transition probability matrix M
Probability distribution
of surfer’s position
Adjacency matrix A
COMPUTING THE PAGERANK
A
B
C
D
E
Transition probability matrix M
Probability distribution
of surfer’s position
Adjacency matrix A
COMPUTING THE PAGERANK
A
B
C
D
E
Transition probability matrix M
Probability distribution
of surfer’s position
Adjacency matrix A
COMPUTING THE PAGERANK
A
B
C
D
E
Transition probability matrix M
Probability distribution
of surfer’s position
TELEPORT
A
B
C
D
E
TELEPORT
A
B
C
D
E
TELEPORT
A
B
C
D
E
If N nodes in graph,
probability to teleport
to any other node
(including self) = 1/N
TELEPORT
A
B
C
D
E
1/N
1/N
1/N
1/N
1/N
TELEPORT
A
B
C
D
E
1/N
1/N
1/N
1/N
α
?
1-α
1/N
At regular node: invoke
teleport operation with
probability α and
standard random walk
with probability (1 - α)
Probability distribution of the surfer at any time is a vector.
COMPUTING THE PAGERANK
That vector converges to a steady state:
the PageRank vector.
PAGERANK EQUATION
SOUNDCLOUD
DISCORANK
DISCORANK
A
B
C
D
EUser
User
Track
Playlist
favorite
follow
featured in
• Search across People, Sounds, Sets, Groups
• One unique rank vector that contains all entities
• Weight the links based on the type of event:
• User favorites Track
• Track is featured in Playlist
...
• New big (but sparse)
adjacency matrix:
UNIVERSAL SEARCH
• How do we identify content that is trending?
• The more recent a listen, favorite, etc. (event) the
higher the weight
• Multiply each event (=edge) by a time decay:
• New adjacency matrix:
BACK TO EXPLORE
PERFORMANCE
OPTIMIZATION
• Millions of entities(=nodes) and events(=edges)
• First DiscoRank: several hours of computation
• Trimmed down to a few minutes using:
• Sparse matrix
• Optimized storage of the graph in memory
• Versioned copies of the DiscoRank
• So technically we could compute the DiscoRank
realtime
A VERY LARGE GRAPH
•
• Re-mapping entity ids
• Memory optimization so the graph holds in memory:
• All edges details are stored in memory in a byte[]
• buffer the byte[] into an opaque byte block pool
• no object
• sort the buffered byte[] in place
• On disk and when computing the DiscoRank:
• Delta encoded ordered adjacency lists:
• One “from” node, several “to” nodes
• Delta encode the “to” node ids
USING SPARSITY
• We keep versioned copies of:
• the DiscoRank vector of results
• the DiscoRank graph
• We rebuild the entire DiscoRank graph from scratch
once a week
• In between:
• we create additional graph segments with new
entities and events
• and use as prior for the DiscoRank computation
the results of the previous DiscoRank run
• Side effect:
• Also allows for experimentation
VERSIONED DISCORANK
• MySQL batch jobs
• DiscoRank results stored in
HDFS
• At the end of every
DiscoRank run we re-load it
in ElasticSearch:
• For each item we combine
its Lucene score with its
DiscoRank
INTEGRATION IN
OUR INFRASTRUCTURE
Amélie Anglade
Sound/Music Information Retrieval Engineer
about.me/utstikkar
@utstikkar
We’re hiring!
www.soundcloud.com

Contenu connexe

Tendances

Text Classification/Categorization
Text Classification/CategorizationText Classification/Categorization
Text Classification/CategorizationOswal Abhishek
 
Naive Bayes Classifier
Naive Bayes ClassifierNaive Bayes Classifier
Naive Bayes ClassifierArunabha Saha
 
What Is Dynamic Programming? | Dynamic Programming Explained | Programming Fo...
What Is Dynamic Programming? | Dynamic Programming Explained | Programming Fo...What Is Dynamic Programming? | Dynamic Programming Explained | Programming Fo...
What Is Dynamic Programming? | Dynamic Programming Explained | Programming Fo...Simplilearn
 
Introduction to Approximation Algorithms
Introduction to Approximation AlgorithmsIntroduction to Approximation Algorithms
Introduction to Approximation AlgorithmsJhoirene Clemente
 
Random Forest In R | Random Forest Algorithm | Random Forest Tutorial |Machin...
Random Forest In R | Random Forest Algorithm | Random Forest Tutorial |Machin...Random Forest In R | Random Forest Algorithm | Random Forest Tutorial |Machin...
Random Forest In R | Random Forest Algorithm | Random Forest Tutorial |Machin...Simplilearn
 
Design and Analysis of Algorithms
Design and Analysis of AlgorithmsDesign and Analysis of Algorithms
Design and Analysis of AlgorithmsSwapnil Agrawal
 
Data Visualization using matplotlib
Data Visualization using matplotlibData Visualization using matplotlib
Data Visualization using matplotlibBruno Gonçalves
 
Multiple intelligences approach to Number Systems
Multiple intelligences approach to  Number SystemsMultiple intelligences approach to  Number Systems
Multiple intelligences approach to Number SystemsKatrin Becker
 
Graph coloring problem(DAA).pptx
Graph coloring problem(DAA).pptxGraph coloring problem(DAA).pptx
Graph coloring problem(DAA).pptxHome
 
Design and Analysis of Algorithm ppt for unit one
Design and Analysis of Algorithm ppt for unit oneDesign and Analysis of Algorithm ppt for unit one
Design and Analysis of Algorithm ppt for unit onessuserb7c8b8
 
Differences between rgb and cmyk color schemes
Differences between rgb and cmyk color schemesDifferences between rgb and cmyk color schemes
Differences between rgb and cmyk color schemesDhanasekar181
 
Interactive Recommender Systems
Interactive Recommender SystemsInteractive Recommender Systems
Interactive Recommender SystemsRoelof van Zwol
 

Tendances (20)

Text Classification/Categorization
Text Classification/CategorizationText Classification/Categorization
Text Classification/Categorization
 
Graph coloring
Graph coloringGraph coloring
Graph coloring
 
Naive Bayes Classifier
Naive Bayes ClassifierNaive Bayes Classifier
Naive Bayes Classifier
 
What Is Dynamic Programming? | Dynamic Programming Explained | Programming Fo...
What Is Dynamic Programming? | Dynamic Programming Explained | Programming Fo...What Is Dynamic Programming? | Dynamic Programming Explained | Programming Fo...
What Is Dynamic Programming? | Dynamic Programming Explained | Programming Fo...
 
Introduction to Approximation Algorithms
Introduction to Approximation AlgorithmsIntroduction to Approximation Algorithms
Introduction to Approximation Algorithms
 
Random Forest In R | Random Forest Algorithm | Random Forest Tutorial |Machin...
Random Forest In R | Random Forest Algorithm | Random Forest Tutorial |Machin...Random Forest In R | Random Forest Algorithm | Random Forest Tutorial |Machin...
Random Forest In R | Random Forest Algorithm | Random Forest Tutorial |Machin...
 
Design and Analysis of Algorithms
Design and Analysis of AlgorithmsDesign and Analysis of Algorithms
Design and Analysis of Algorithms
 
Backtracking
Backtracking  Backtracking
Backtracking
 
Data Visualization using matplotlib
Data Visualization using matplotlibData Visualization using matplotlib
Data Visualization using matplotlib
 
Multiple intelligences approach to Number Systems
Multiple intelligences approach to  Number SystemsMultiple intelligences approach to  Number Systems
Multiple intelligences approach to Number Systems
 
Locality sensitive hashing
Locality sensitive hashingLocality sensitive hashing
Locality sensitive hashing
 
Color theory
Color theoryColor theory
Color theory
 
Graph coloring problem(DAA).pptx
Graph coloring problem(DAA).pptxGraph coloring problem(DAA).pptx
Graph coloring problem(DAA).pptx
 
Graph coloring problem
Graph coloring problemGraph coloring problem
Graph coloring problem
 
Design and Analysis of Algorithm ppt for unit one
Design and Analysis of Algorithm ppt for unit oneDesign and Analysis of Algorithm ppt for unit one
Design and Analysis of Algorithm ppt for unit one
 
Differences between rgb and cmyk color schemes
Differences between rgb and cmyk color schemesDifferences between rgb and cmyk color schemes
Differences between rgb and cmyk color schemes
 
Interactive Recommender Systems
Interactive Recommender SystemsInteractive Recommender Systems
Interactive Recommender Systems
 
Greedy method
Greedy method Greedy method
Greedy method
 
Daa unit 4
Daa unit 4Daa unit 4
Daa unit 4
 
Naive Bayes Presentation
Naive Bayes PresentationNaive Bayes Presentation
Naive Bayes Presentation
 

Similaire à DiscoRank: optimizing discoverability on SoundCloud

Cassandra and Spark
Cassandra and SparkCassandra and Spark
Cassandra and Sparknickmbailey
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processingprajods
 
Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...
Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...
Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...DataStax Academy
 
«Scrapy internals» Александр Сибиряков, Scrapinghub
«Scrapy internals» Александр Сибиряков, Scrapinghub«Scrapy internals» Александр Сибиряков, Scrapinghub
«Scrapy internals» Александр Сибиряков, Scrapinghubit-people
 
Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...Ontico
 
Processing Large Graphs
Processing Large GraphsProcessing Large Graphs
Processing Large GraphsNishant Gandhi
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Chris Fregly
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsFlink Forward
 
Balboa Park Commons: Collaborative Digitization for a Public Resource
Balboa Park Commons: Collaborative Digitization for a Public ResourceBalboa Park Commons: Collaborative Digitization for a Public Resource
Balboa Park Commons: Collaborative Digitization for a Public ResourceAnna Chiaretta Lavatelli
 
JavaScript History
JavaScript HistoryJavaScript History
JavaScript HistoryRhio Kim
 
Solving Visibility and Streaming in The Witcher 3: Wild Hunt with Umbra 3
Solving Visibility and Streaming in The Witcher 3: Wild Hunt with Umbra 3Solving Visibility and Streaming in The Witcher 3: Wild Hunt with Umbra 3
Solving Visibility and Streaming in The Witcher 3: Wild Hunt with Umbra 3jasinb
 
TinkerPop: a story of graphs, DBs, and graph DBs
TinkerPop: a story of graphs, DBs, and graph DBsTinkerPop: a story of graphs, DBs, and graph DBs
TinkerPop: a story of graphs, DBs, and graph DBsJoshua Shinavier
 
WebServices_Grid.ppt
WebServices_Grid.pptWebServices_Grid.ppt
WebServices_Grid.pptEqinNiftalyev
 
LiveCoding Package for Pharo
LiveCoding Package for PharoLiveCoding Package for Pharo
LiveCoding Package for PharoESUG
 
Implementing a VO archive for datacubes of galaxies
Implementing a VO archive for datacubes of galaxiesImplementing a VO archive for datacubes of galaxies
Implementing a VO archive for datacubes of galaxiesJose Enrique Ruiz
 
Using the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data ProductUsing the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data ProductEvans Ye
 
RDA for Music: Scores
RDA for Music: ScoresRDA for Music: Scores
RDA for Music: ScoresALATechSource
 
Playlist Recommendations @ Spotify
Playlist Recommendations @ SpotifyPlaylist Recommendations @ Spotify
Playlist Recommendations @ SpotifyNikhil Tibrewal
 
Azure storage deep dive
Azure storage deep diveAzure storage deep dive
Azure storage deep diveYves Goeleven
 

Similaire à DiscoRank: optimizing discoverability on SoundCloud (20)

Cassandra and Spark
Cassandra and SparkCassandra and Spark
Cassandra and Spark
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
 
Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...
Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...
Cassandra Day Denver 2014: Using Cassandra to Support Crisis Informatics Rese...
 
«Scrapy internals» Александр Сибиряков, Scrapinghub
«Scrapy internals» Александр Сибиряков, Scrapinghub«Scrapy internals» Александр Сибиряков, Scrapinghub
«Scrapy internals» Александр Сибиряков, Scrapinghub
 
Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...
 
Processing Large Graphs
Processing Large GraphsProcessing Large Graphs
Processing Large Graphs
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
 
Balboa Park Commons: Collaborative Digitization for a Public Resource
Balboa Park Commons: Collaborative Digitization for a Public ResourceBalboa Park Commons: Collaborative Digitization for a Public Resource
Balboa Park Commons: Collaborative Digitization for a Public Resource
 
JavaScript History
JavaScript HistoryJavaScript History
JavaScript History
 
Solving Visibility and Streaming in The Witcher 3: Wild Hunt with Umbra 3
Solving Visibility and Streaming in The Witcher 3: Wild Hunt with Umbra 3Solving Visibility and Streaming in The Witcher 3: Wild Hunt with Umbra 3
Solving Visibility and Streaming in The Witcher 3: Wild Hunt with Umbra 3
 
TinkerPop: a story of graphs, DBs, and graph DBs
TinkerPop: a story of graphs, DBs, and graph DBsTinkerPop: a story of graphs, DBs, and graph DBs
TinkerPop: a story of graphs, DBs, and graph DBs
 
WebServices_Grid.ppt
WebServices_Grid.pptWebServices_Grid.ppt
WebServices_Grid.ppt
 
LiveCoding Package for Pharo
LiveCoding Package for PharoLiveCoding Package for Pharo
LiveCoding Package for Pharo
 
Implementing a VO archive for datacubes of galaxies
Implementing a VO archive for datacubes of galaxiesImplementing a VO archive for datacubes of galaxies
Implementing a VO archive for datacubes of galaxies
 
Using the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data ProductUsing the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data Product
 
Maa
MaaMaa
Maa
 
RDA for Music: Scores
RDA for Music: ScoresRDA for Music: Scores
RDA for Music: Scores
 
Playlist Recommendations @ Spotify
Playlist Recommendations @ SpotifyPlaylist Recommendations @ Spotify
Playlist Recommendations @ Spotify
 
Azure storage deep dive
Azure storage deep diveAzure storage deep dive
Azure storage deep dive
 

Dernier

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 

Dernier (20)

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

DiscoRank: optimizing discoverability on SoundCloud

  • 1. DiscoRank: Optimizing Discoverability on SoundCloud Amélie Anglade
  • 2. • Developer at SoundCloud • SoundCloud is the world’s largest social sound platform • Academic background in Music Information Retrieval (MIR) • Design, prototype and implement Machine Learning algorithms for music discovery
  • 4.
  • 5.
  • 6.
  • 8. • The web is a graph: • nodes = web pages • edges = hyperlinks • The (Page)rank of a node depends on the link structure of the graph WEB AND PAGERANK
  • 12. Nodes visited more often: • Nodes with many links • Coming from frequently visited nodes RANDOM SURFER A B C D E
  • 13. Adjacency matrix A COMPUTING THE PAGERANK A B C D E Transition probability matrix M Probability distribution of surfer’s position
  • 14. Adjacency matrix A COMPUTING THE PAGERANK A B C D E Transition probability matrix M Probability distribution of surfer’s position
  • 15. Adjacency matrix A COMPUTING THE PAGERANK A B C D E Transition probability matrix M Probability distribution of surfer’s position
  • 16. Adjacency matrix A COMPUTING THE PAGERANK A B C D E Transition probability matrix M Probability distribution of surfer’s position
  • 17. Adjacency matrix A COMPUTING THE PAGERANK A B C D E Transition probability matrix M Probability distribution of surfer’s position
  • 18. Adjacency matrix A COMPUTING THE PAGERANK A B C D E Transition probability matrix M Probability distribution of surfer’s position
  • 22. If N nodes in graph, probability to teleport to any other node (including self) = 1/N TELEPORT A B C D E 1/N 1/N 1/N 1/N 1/N
  • 23. TELEPORT A B C D E 1/N 1/N 1/N 1/N α ? 1-α 1/N At regular node: invoke teleport operation with probability α and standard random walk with probability (1 - α)
  • 24. Probability distribution of the surfer at any time is a vector. COMPUTING THE PAGERANK That vector converges to a steady state: the PageRank vector.
  • 27.
  • 29. • Search across People, Sounds, Sets, Groups • One unique rank vector that contains all entities • Weight the links based on the type of event: • User favorites Track • Track is featured in Playlist ... • New big (but sparse) adjacency matrix: UNIVERSAL SEARCH
  • 30.
  • 31. • How do we identify content that is trending? • The more recent a listen, favorite, etc. (event) the higher the weight • Multiply each event (=edge) by a time decay: • New adjacency matrix: BACK TO EXPLORE
  • 33. • Millions of entities(=nodes) and events(=edges) • First DiscoRank: several hours of computation • Trimmed down to a few minutes using: • Sparse matrix • Optimized storage of the graph in memory • Versioned copies of the DiscoRank • So technically we could compute the DiscoRank realtime A VERY LARGE GRAPH
  • 34. • • Re-mapping entity ids • Memory optimization so the graph holds in memory: • All edges details are stored in memory in a byte[] • buffer the byte[] into an opaque byte block pool • no object • sort the buffered byte[] in place • On disk and when computing the DiscoRank: • Delta encoded ordered adjacency lists: • One “from” node, several “to” nodes • Delta encode the “to” node ids USING SPARSITY
  • 35. • We keep versioned copies of: • the DiscoRank vector of results • the DiscoRank graph • We rebuild the entire DiscoRank graph from scratch once a week • In between: • we create additional graph segments with new entities and events • and use as prior for the DiscoRank computation the results of the previous DiscoRank run • Side effect: • Also allows for experimentation VERSIONED DISCORANK
  • 36. • MySQL batch jobs • DiscoRank results stored in HDFS • At the end of every DiscoRank run we re-load it in ElasticSearch: • For each item we combine its Lucene score with its DiscoRank INTEGRATION IN OUR INFRASTRUCTURE
  • 37. Amélie Anglade Sound/Music Information Retrieval Engineer about.me/utstikkar @utstikkar We’re hiring! www.soundcloud.com