An Approach to R Package Recommendation Engine
Alex Lin
alin@intelligentmining.com
Twitter: @alinatwork
Initial Thoughts
  • The data set is expected to have very strong package-package relationships (dependencies and related package functionality).
  • The data set (training + test) is not sparse.
  • Most matrix factorization (MF) techniques in the recommender field minimize squared error on the predicted user ratings; they do not directly optimize for AUC.
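
The gap between squared error and AUC can be illustrated with a small sketch. AUC is computed here through its Wilcoxon-Mann-Whitney form (the fraction of positive/negative pairs ranked correctly, ties counted as half). The function names `auc` and `mse` and the toy scores are illustrative assumptions, not part of the original deck.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A tie between a positive and a negative counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mse(labels, scores):
    """Mean squared error between 0/1 labels and predicted scores."""
    return sum((y - s) ** 2 for y, s in zip(labels, scores)) / len(labels)


labels = [1, 1, 0, 0]
confident = [0.9, 0.8, 0.2, 0.1]    # scores close to the labels
squashed = [0.6, 0.55, 0.5, 0.45]   # same ordering, much worse squared error

print(auc(labels, confident), mse(labels, confident))
print(auc(labels, squashed), mse(labels, squashed))
```

Both score vectors rank every installed package above every non-installed one, so they share the same AUC even though their squared errors differ. A model tuned only for squared error is therefore not necessarily tuned for the contest metric.
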
Steps
  • Modified k-nearest neighbor (kNN) algorithm.
  • User average and package average as prior bias.
  • User-specific package maintainer affinity.
  • Matrix factorization (MF) to post-process the residuals.
  • Other rules.
Modified k-Nearest Neighbor algorithm
  • Calculate the cosine similarity for each pkg-pkg pair.
  • Scale the cosine similarity by the "squared user support", i.e. cosine * (support / ttl_user_cnt)**2
  • Unlike regular kNN, which is only additive, we use the same kNN rules to penalize a package if other related packages were not installed.
  • For unknown records, we take the ZAN approach: we treat the unknown entries as negative.
  • k = all
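
The kNN step above can be sketched roughly as follows. Several details are assumptions, since the slide does not spell them out: `support` is taken to be the number of users who installed both packages, the install vectors are binary, and the penalty is implemented by subtracting the similarity of every related package the user did not install (unknown entries count as not installed). The names `scaled_similarities` and `knn_score` are hypothetical.

```python
import math
from itertools import combinations


def scaled_similarities(installs, total_users):
    """installs: dict mapping package -> set of user ids that installed it."""
    sims = {}
    for p, q in combinations(sorted(installs), 2):
        # Assumption: support = number of users who installed both packages.
        support = len(installs[p] & installs[q])
        denom = math.sqrt(len(installs[p]) * len(installs[q]))
        cosine = support / denom if denom else 0.0
        # Scaling from the slide: cosine * (support / ttl_user_cnt)**2
        scaled = cosine * (support / total_users) ** 2
        sims[(p, q)] = sims[(q, p)] = scaled
    return sims


def knn_score(user_pkgs, target, sims, packages):
    """Score `target` for a user, using all other packages (k = all)."""
    score = 0.0
    for q in packages:
        if q == target:
            continue
        # Installed neighbors add their similarity; missing or unknown
        # neighbors are treated as negative and subtract it.
        sign = 1.0 if q in user_pkgs else -1.0
        score += sign * sims.get((target, q), 0.0)
    return score
```

With `installs = {"a": {1, 2, 3}, "b": {1, 2}, "c": {4}}` and four users in total, a user who installed only `a` gets a positive score for `b`, while a user who installed only `c` gets a negative one.
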
User average and Package average as prior bias
  • User average = user installed pkg count / user observation count
  • Package average = pkg installed by users count / pkg observation count
  • Add them to the kNN result score.
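
A minimal sketch of the two averages, assuming the training data is available as `(user, package, installed)` triples with `installed` equal to 0 or 1. The slide only says the averages are added to the kNN score, so the unweighted sum in `add_priors` is an assumption, as are both function names.

```python
from collections import defaultdict


def prior_biases(observations):
    """observations: iterable of (user, package, installed) with installed in {0, 1}."""
    user_inst, user_obs = defaultdict(int), defaultdict(int)
    pkg_inst, pkg_obs = defaultdict(int), defaultdict(int)
    for user, pkg, installed in observations:
        user_obs[user] += 1
        user_inst[user] += installed
        pkg_obs[pkg] += 1
        pkg_inst[pkg] += installed
    # installed count / observation count, per user and per package
    user_avg = {u: user_inst[u] / n for u, n in user_obs.items()}
    pkg_avg = {p: pkg_inst[p] / n for p, n in pkg_obs.items()}
    return user_avg, pkg_avg


def add_priors(knn_score, user, pkg, user_avg, pkg_avg):
    """Assumed combination: a plain sum; unseen users or packages contribute 0."""
    return knn_score + user_avg.get(user, 0.0) + pkg_avg.get(pkg, 0.0)
```
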
User-specific Package Maintainer Affinity
  • This metric is measured as the percentage of a given maintainer's packages that a user has installed.
  • We use the percentage to predict how likely the user is to install other packages from the same maintainer.
  • Combine it with the kNN result score with a weight of 0.25.
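
One possible reading of the maintainer affinity, sketched under assumptions: the affinity is the fraction (between 0 and 1) of a user's observed packages from a maintainer that the user installed, and "combine with weight 0.25" means adding 0.25 times the affinity to the kNN score. The `maintainer_of` mapping and both function names are hypothetical.

```python
from collections import defaultdict


def maintainer_affinity(observations, maintainer_of):
    """Return dict (user, maintainer) -> fraction of observed packages installed.

    observations: iterable of (user, package, installed) with installed in {0, 1}.
    maintainer_of: dict mapping package -> maintainer name.
    """
    installed = defaultdict(int)
    seen = defaultdict(int)
    for user, pkg, flag in observations:
        key = (user, maintainer_of[pkg])
        seen[key] += 1
        installed[key] += flag
    return {key: installed[key] / seen[key] for key in seen}


def combine(knn_score, affinity, weight=0.25):
    """Assumed combination rule: kNN score plus the weighted affinity."""
    return knn_score + weight * affinity
```
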
So Far – baseline model
  • Very heuristic
  • Public AUC = 0.976x
Matrix Factorization
  • Analyze the residuals only. The goal is to find structural errors in our baseline prediction.
  • prediction := baseline_output + residual
  • residual := pkg_bias + user_bias + pkgFactors . userFactors
  • The residuals are related to the Wilcoxon-Mann-Whitney (WMW) statistic.
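
The two definitions above translate directly into a small function. This is only a sketch of the prediction formula; how the biases and factor vectors are learned is not shown here, and the function name and the example numbers are made up for illustration.

```python
def predict(baseline_output, pkg_bias, user_bias, pkg_factors, user_factors):
    """prediction := baseline_output + residual, with
    residual := pkg_bias + user_bias + pkgFactors . userFactors"""
    dot = sum(p * u for p, u in zip(pkg_factors, user_factors))
    residual = pkg_bias + user_bias + dot
    return baseline_output + residual


# Illustrative values only.
score = predict(0.6, 0.05, -0.02, [0.1, 0.2], [0.5, -0.5])
print(score)
```
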
Matrix Factorization – cont.
  • Minimizes truncated squared error with batch gradient descent (BGD)
  • Pairwise comparison
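
The slide does not define the truncated squared error. One plausible reading, borrowed from common WMW-style AUC surrogates, is a pairwise loss that penalizes a (positive, negative) pair only when the positive score does not exceed the negative score by some margin. The sketch below follows that reading; the `margin` parameter, the function name, and the exact form of the loss are assumptions rather than the author's stated method.

```python
def pairwise_truncated_sq_loss(pos_scores, neg_scores, margin=1.0):
    """Return (loss, grad_pos, grad_neg) accumulated over all pairs.

    A pair contributes (margin - (s_pos - s_neg))**2 when the positive score
    fails to beat the negative score by `margin`, and nothing otherwise.
    """
    loss = 0.0
    grad_pos = [0.0] * len(pos_scores)
    grad_neg = [0.0] * len(neg_scores)
    for i, sp in enumerate(pos_scores):
        for j, sn in enumerate(neg_scores):
            gap = margin - (sp - sn)
            if gap > 0:  # truncation: well-separated pairs are ignored
                loss += gap ** 2
                grad_pos[i] -= 2 * gap  # d(gap**2)/d(sp)
                grad_neg[j] += 2 * gap  # d(gap**2)/d(sn)
    return loss, grad_pos, grad_neg
```

In a batch gradient descent setting, these per-score gradients would be accumulated over the whole training set and pushed back through the bias and factor terms before each parameter update.
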
Other Rules
  • For duplicate records that exist in both the testing and training sets, copy the answers from the training set.
  • Assume that when a user installs a package P, the user also installs the packages that P depends on.
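
The two rules above could be applied as a post-processing pass over the model scores, roughly as sketched here. The data layout is assumed: predictions and training answers keyed by `(user, package)`, a per-user set of installed packages, and a `depends_on` mapping with direct dependencies only. Setting a forced install to a score of 1.0 is also an assumption, as is the function name.

```python
def apply_rules(predictions, train_answers, user_installs, depends_on):
    """Return a copy of `predictions` with the two override rules applied.

    predictions:   dict (user, pkg) -> model score
    train_answers: dict (user, pkg) -> 0 or 1 known from the training set
    user_installs: dict user -> set of packages the user is known to install
    depends_on:    dict pkg -> set of packages it directly depends on
    """
    out = dict(predictions)
    for key in out:
        user, pkg = key
        # Rule 1: duplicates of training records copy the known answer.
        if key in train_answers:
            out[key] = float(train_answers[key])
            continue
        # Rule 2: a dependency of an installed package is assumed installed.
        for installed in user_installs.get(user, ()):
            if pkg in depends_on.get(installed, ()):
                out[key] = 1.0
                break
    return out
```
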
Final Result
  • Public AUC = 0.984914
  • Final AUC = 0.979565


Editor's Notes

  1. AUC = 0.85
  2. AUC = 0.95x
  3. Xui is residual and Residual = pkg_bias + user_bias + (pkgFactors . userFactors)