Juliet Hougland and Jonathan Natkins

REAL-TIME RECOMMENDATIONS FOR RETAIL:
ARCHITECTURE, ALGORITHMS, AND DESIGN
Who Are We?
Jonathan Natkins
Field Engineer at WibiData
Before that, Cloudera
Software Engineer
Before that, Vertica
Software/Field Engineer

Juliet Hougland
Data Scientist, previously at WibiData
MS in Applied Math
BA in Math-Physics
Recommendations in Retail
Personalized versus Non-Personalized
Recommender Contexts
Taste History
Based on everything you know about a user
Interests over months/years

Current Taste
Based on a user’s immediate history
Interests over minutes/hours

Ephemeral
Extreme version of current taste
For example, location

Demographic*
Similar to taste history, but less subjective
Geographic region, age bracket, etc.
Why Does Real-Time Matter?

Relevancy
I am a Special Snowflake

Natty
Requirements for a Real-Time System
General System Requirements
Handle millions of customers/users
Support collection and storage of complex data
Static and event-series

Real-Time System Requirements
Quickly retrieve subsets of data for a single user
Aggregate/derive new, first-class data per user
What is Kiji?

kiji.org
github.com/kijiproject

The Kiji project is a modular, open-source framework for building real-time applications that collect, store, and analyze entity-centric data
Three Challenges
Developing models for use in real-time
Scoring models in real-time
Deploying models into a production environment
How Can We Make Real-Time Models?
Population interests change slowly
Models don’t need to be retrained frequently

Individual interests change quickly
Application of a model should be fast
A Common Workflow
Train a model over the entire dataset
Save fitted model parameters to a file or another table
Access the model parameters when generating new recommendations based on new data

This is EXPENSIVE
Developing Models
KijiExpress
Scala interface for interacting with Kiji data
Uses Scalding for designing complex dataflows

Model Lifecycle
Allows analysts and data scientists to break apart a model into phases
Scoring Models in Real-Time
Batch isn’t real-time

[Figure: number of users versus number of interactions. A lot of users have few interactions; a few users have many interactions.]
Fresheners Compute Lazily
Client reads a column through the KijiScoring Server
KijiScoring Server gets the value from HBase
Freshness Policy decides whether the stored value is fresh
Yes: return to client
No: run the Scorer, return the new value to the client, and write it back to HBase for next time
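The lazy read path described on this slide can be sketched roughly as follows. This is a minimal illustration, not the KijiScoring API: the class LazyFreshener, the age-based freshness policy, the max_age_seconds parameter, and the in-memory dict standing in for HBase are all assumptions made for the example.

```python
import time
from dataclasses import dataclass


@dataclass
class Cell:
    value: object
    written_at: float  # timestamp of the last write, in seconds


class LazyFreshener:
    """Hypothetical read-through scorer: recompute only when a read finds stale data."""

    def __init__(self, scorer, max_age_seconds, clock=time.time):
        self.store = {}  # stands in for the HBase-backed column (assumption)
        self.scorer = scorer  # callable: user_id -> fresh value
        self.max_age_seconds = max_age_seconds
        self.clock = clock

    def is_fresh(self, cell):
        # Assumed freshness policy: the value was written recently enough.
        if cell is None:
            return False
        return self.clock() - cell.written_at <= self.max_age_seconds

    def read(self, user_id):
        cell = self.store.get(user_id)  # "Get from HBase"
        if self.is_fresh(cell):
            return cell.value  # fresh: return to client
        value = self.scorer(user_id)  # stale or missing: run the scorer
        self.store[user_id] = Cell(value, self.clock())  # write back for next time
        return value
```

With this shape, users who never issue a read never cost any scoring work, which is the point of computing lazily when most users have few interactions.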
Kiji Application Stack
Deployment Challenges
Kiji Model Repository
Link between application and models
Stores Freshener metadata
FreshnessPolicy, Scorer, attached column
Location of trained model

Stores Scorer code
Code repository makes model scoring code available to the application from a central location

New models can be deployed to the Model Repository and made immediately available to the application
Kiji Model Repository
Retail Recommendation
Types of Recommenders
Recommendation Algorithms
  Collaborative Filtering Methods
    Memory Based
    Model Based
  Content Based Methods
Content-Based Recommenders
Build models around entities using features that we think reflect inherent characteristics

Orange-Nosed

Lab Assistant
Meeps a lot
Content-Based Recommenders

safer

faster

knife
Pandora: Content-Based

Expertly-Characterized
Music
Collaborative Filtering
Represent user-item affinities as a sparse matrix

Users ≈ Rows
Items ≈ Columns

[Figure: example matrix with the user Beaker as a row and the items Banana Slicer and Pineapple Slicer as columns]
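One simple way to hold such a sparse matrix is a nested mapping that stores only the observed user-item pairs. The sketch below is illustrative only: the function name build_sparse_matrix, the second user "bunsen", and the rating values are made up for the example and do not come from the talk.

```python
from collections import defaultdict


def build_sparse_matrix(events):
    """Turn (user, item, rating) events into {user: {item: rating}}.

    Only observed pairs are stored, so memory grows with the number of
    events rather than with users x items.
    """
    matrix = defaultdict(dict)
    for user, item, rating in events:
        matrix[user][item] = rating  # a later event overwrites an earlier one
    return dict(matrix)


# Made-up example data: rows are users, columns are items.
events = [
    ("beaker", "banana_slicer", 5.0),
    ("beaker", "pineapple_slicer", 3.0),
    ("bunsen", "banana_slicer", 4.0),
]
ratings = build_sparse_matrix(events)
```

A missing entry (here, bunsen and the pineapple slicer) means "no observation", which is different from a rating of zero.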
Aspirational Ratings
I put in my queue…

I actually watch
Collaborative Filtering: How It Works
Similar Users

Similar Products

Simple aggregate predictors
Similar Entities
What do we mean by similar?
Jaccard Index: a measure of set similarity
Cosine Similarity: the cosine of the angle between two vectors
Pearson Correlation: statistical measure, similar to cosine

Naively, we could compare every entity to every other entity
…But that would not scale well with increasing numbers of entities
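The three measures named above can be written in a few lines each. This is a minimal sketch over plain Python sets and sparse dict vectors; the function names and the choice to return 0.0 for degenerate inputs are assumptions for the example, not something the talk specifies.

```python
import math


def jaccard(a, b):
    """Set similarity: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as {key: value}."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pearson(u, v):
    """Pearson correlation over the keys present in both vectors."""
    common = sorted(u.keys() & v.keys())
    n = len(common)
    if n < 2:
        return 0.0
    mean_u = sum(u[k] for k in common) / n
    mean_v = sum(v[k] for k in common) / n
    du = [u[k] - mean_u for k in common]
    dv = [v[k] - mean_v for k in common]
    denom = math.sqrt(sum(x * x for x in du)) * math.sqrt(sum(y * y for y in dv))
    if denom == 0:
        return 0.0
    return sum(x * y for x, y in zip(du, dv)) / denom
```

Pearson here is the cosine of the two vectors after each is mean-centered over the shared keys, which is why the slide calls it similar to cosine.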
Building the Similarity Matrix
Collaborative Filtering: Is This Useful?
Problem: Too much data!
Tracking user preferences and all their events generates huge amounts of data

Problem: Too little data!
Dimensions of user-space and item-space are usually very large
More variables make it more difficult to generate user preferences

Problem: Cold start
If you don’t know anything about a user, what should you recommend?

Problem: More ratings mean slower computations
Identifying neighborhoods of entities is expensive
Collaborative Filtering: Why Is It Useful?
Because it works
Content-agnostic
All that matters is co-occurrence of events
Amazon: Item-Item Collaborative Filtering

Used for personalized recommendations
Fill screen real estate with related items
Produces specific, but non-creepy recommendations
Linden, G.; Smith, B.; York, J., "Amazon.com recommendations: item-to-item collaborative filtering," IEEE Internet Computing, vol. 7, no. 1, pp. 76-80, Jan/Feb 2003
Item-Item Collaborative Filtering

Beaker buys a banana slicer
Then:
Generate list of candidate items to predict ratings for
Predict ratings for candidate items
Select Top-N items
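The three steps above might look like the following sketch. It assumes a precomputed item-item similarity table and uses a similarity-weighted average as the predictor; the function name recommend, the table layout, and that choice of predictor are assumptions for illustration, not details confirmed by the talk.

```python
import heapq


def recommend(user_ratings, item_sims, n=3):
    """user_ratings: {item: rating}; item_sims: {item: {neighbor: similarity}}."""
    # 1. Candidates: neighbors of the items the user has rated, minus rated items.
    candidates = set()
    for item in user_ratings:
        candidates.update(item_sims.get(item, {}))
    candidates.difference_update(user_ratings)

    # 2. Predict each candidate's rating as a similarity-weighted average
    #    of the user's own ratings (assumed predictor).
    predictions = {}
    for cand in candidates:
        num = den = 0.0
        for item, rating in user_ratings.items():
            sim = item_sims.get(item, {}).get(cand, 0.0)
            num += sim * rating
            den += abs(sim)
        if den > 0:
            predictions[cand] = num / den

    # 3. Keep the N candidates with the highest predicted rating.
    return heapq.nlargest(n, predictions.items(), key=lambda kv: kv[1])
```

Because candidates come only from neighbors of items the user has touched, a single purchase such as Beaker's banana slicer yields a small candidate list rather than the whole catalog.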
Accessing External Data
KeyValueStore API enables external data access when applying a model
External data might be…
Trained model parameters
Hierarchical/Taxonomic data
Geo-lookup

Store external data flexibly
Text files, sequence files, Kiji tables, etc.
Data access is decoupled from use during execution

If the data doesn’t fit in memory, put it in a table
How Much Less Work Can We Do?
We can choose a predictor that allows us to truncate a sum
There are two ways terms in the sum of our predictor can be small
No rating: ignore unrated items
Small similarity: ignore dissimilar items

If we only present a few recommendations, we don’t need to predict ratings for all items
Choose wisely the candidate set of items to estimate ratings for, or infer it from nearest neighbors
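A truncated predictor of this kind could be sketched as below. It assumes the candidate item's neighbors arrive sorted by similarity in descending order, so the loop can stop early; the function name predict_truncated and the parameters k and min_sim are hypothetical, and the weighted-average form is an assumed choice of predictor.

```python
def predict_truncated(user_ratings, neighbors, k=20, min_sim=0.1):
    """Estimate a rating for one candidate item from a truncated weighted sum.

    user_ratings: {item: rating} for a single user.
    neighbors: [(item, similarity)] for the candidate, sorted by similarity
               descending (assumption; min_sim is expected to be positive).
    """
    num = den = 0.0
    used = 0
    for item, sim in neighbors:
        if sim < min_sim or used == k:
            break  # every remaining term is at least as small: truncate the sum
        rating = user_ratings.get(item)
        if rating is None:
            continue  # no rating: this term contributes nothing
        num += sim * rating
        den += sim
        used += 1
    return num / den if den > 0 else None
```

The work per prediction is bounded by k and by how quickly similarities fall below min_sim, not by the size of the catalog.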
Organizing Data in Item-Item CF
Accessing Data During Freshening
Want to Know More?
The Kiji Project
kiji.org
github.com/kijiproject

Questions about this presentation?
Twitter: @JulietHougland or @nattyice
Email: natty@wibidata.com

Contenu connexe

Tendances

Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...Sri Ambati
 
FrugalML: Using ML APIs More Accurately and Cheaply
FrugalML: Using ML APIs More Accurately and CheaplyFrugalML: Using ML APIs More Accurately and Cheaply
FrugalML: Using ML APIs More Accurately and CheaplyDatabricks
 
Machine Learning at Hand with Power BI
Machine Learning at Hand with Power BIMachine Learning at Hand with Power BI
Machine Learning at Hand with Power BIIvo Andreev
 
Machine learning in action at Pipedrive
Machine learning in action at PipedriveMachine learning in action at Pipedrive
Machine learning in action at PipedriveAndré Karpištšenko
 
Guiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning PipelineGuiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning PipelineMichael Gerke
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
Feature drift monitoring as a service for machine learning models at scale
Feature drift monitoring as a service for machine learning models at scaleFeature drift monitoring as a service for machine learning models at scale
Feature drift monitoring as a service for machine learning models at scaleNoriaki Tatsumi
 
Moving from BI to AI : For decision makers
Moving from BI to AI : For decision makersMoving from BI to AI : For decision makers
Moving from BI to AI : For decision makerszekeLabs Technologies
 
MAALBS Big Data agile framwork
MAALBS Big Data agile framwork MAALBS Big Data agile framwork
MAALBS Big Data agile framwork balvis_ms
 
Mohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with KubeflowMohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with KubeflowLviv Startup Club
 
Spark Summit EU 2017 - Preventing revenue leakage and monitoring distributed ...
Spark Summit EU 2017 - Preventing revenue leakage and monitoring distributed ...Spark Summit EU 2017 - Preventing revenue leakage and monitoring distributed ...
Spark Summit EU 2017 - Preventing revenue leakage and monitoring distributed ...Flavio Clesio
 
Productionising Machine Learning Models
Productionising Machine Learning ModelsProductionising Machine Learning Models
Productionising Machine Learning ModelsTash Bickley
 
Fast Data Intelligence in the IoT - real-time data analytics with Spark
Fast Data Intelligence in the IoT - real-time data analytics with SparkFast Data Intelligence in the IoT - real-time data analytics with Spark
Fast Data Intelligence in the IoT - real-time data analytics with SparkBas Geerdink
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data SciencePouria Amirian
 
MLOps - The Assembly Line of ML
MLOps - The Assembly Line of MLMLOps - The Assembly Line of ML
MLOps - The Assembly Line of MLJordan Birdsell
 
Rahul Bhuman, Tech Mahindra - Truck roll prediction using Driverless AI - H2O...
Rahul Bhuman, Tech Mahindra - Truck roll prediction using Driverless AI - H2O...Rahul Bhuman, Tech Mahindra - Truck roll prediction using Driverless AI - H2O...
Rahul Bhuman, Tech Mahindra - Truck roll prediction using Driverless AI - H2O...Sri Ambati
 
Megan Kurka, H2O.ai - AutoDoc with H2O Driverless AI - H2O World 2019 NYC
Megan Kurka, H2O.ai - AutoDoc with H2O Driverless AI - H2O World 2019 NYCMegan Kurka, H2O.ai - AutoDoc with H2O Driverless AI - H2O World 2019 NYC
Megan Kurka, H2O.ai - AutoDoc with H2O Driverless AI - H2O World 2019 NYCSri Ambati
 
Scaling AutoML-Driven Anomaly Detection With Luminaire
Scaling AutoML-Driven Anomaly Detection With LuminaireScaling AutoML-Driven Anomaly Detection With Luminaire
Scaling AutoML-Driven Anomaly Detection With LuminaireDatabricks
 
Skill up in machine learning using Azure ML
Skill up in machine learning using Azure MLSkill up in machine learning using Azure ML
Skill up in machine learning using Azure MLMostafa
 
Towards Personalization in Global Digital Health
Towards Personalization in Global Digital HealthTowards Personalization in Global Digital Health
Towards Personalization in Global Digital HealthDatabricks
 

Tendances (20)

Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...
 
FrugalML: Using ML APIs More Accurately and Cheaply
FrugalML: Using ML APIs More Accurately and CheaplyFrugalML: Using ML APIs More Accurately and Cheaply
FrugalML: Using ML APIs More Accurately and Cheaply
 
Machine Learning at Hand with Power BI
Machine Learning at Hand with Power BIMachine Learning at Hand with Power BI
Machine Learning at Hand with Power BI
 
Machine learning in action at Pipedrive
Machine learning in action at PipedriveMachine learning in action at Pipedrive
Machine learning in action at Pipedrive
 
Guiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning PipelineGuiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning Pipeline
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
Feature drift monitoring as a service for machine learning models at scale
Feature drift monitoring as a service for machine learning models at scaleFeature drift monitoring as a service for machine learning models at scale
Feature drift monitoring as a service for machine learning models at scale
 
Moving from BI to AI : For decision makers
Moving from BI to AI : For decision makersMoving from BI to AI : For decision makers
Moving from BI to AI : For decision makers
 
MAALBS Big Data agile framwork
MAALBS Big Data agile framwork MAALBS Big Data agile framwork
MAALBS Big Data agile framwork
 
Mohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with KubeflowMohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with Kubeflow
 
Spark Summit EU 2017 - Preventing revenue leakage and monitoring distributed ...
Spark Summit EU 2017 - Preventing revenue leakage and monitoring distributed ...Spark Summit EU 2017 - Preventing revenue leakage and monitoring distributed ...
Spark Summit EU 2017 - Preventing revenue leakage and monitoring distributed ...
 
Productionising Machine Learning Models
Productionising Machine Learning ModelsProductionising Machine Learning Models
Productionising Machine Learning Models
 
Fast Data Intelligence in the IoT - real-time data analytics with Spark
Fast Data Intelligence in the IoT - real-time data analytics with SparkFast Data Intelligence in the IoT - real-time data analytics with Spark
Fast Data Intelligence in the IoT - real-time data analytics with Spark
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
 
MLOps - The Assembly Line of ML
MLOps - The Assembly Line of MLMLOps - The Assembly Line of ML
MLOps - The Assembly Line of ML
 
Rahul Bhuman, Tech Mahindra - Truck roll prediction using Driverless AI - H2O...
Rahul Bhuman, Tech Mahindra - Truck roll prediction using Driverless AI - H2O...Rahul Bhuman, Tech Mahindra - Truck roll prediction using Driverless AI - H2O...
Rahul Bhuman, Tech Mahindra - Truck roll prediction using Driverless AI - H2O...
 
Megan Kurka, H2O.ai - AutoDoc with H2O Driverless AI - H2O World 2019 NYC
Megan Kurka, H2O.ai - AutoDoc with H2O Driverless AI - H2O World 2019 NYCMegan Kurka, H2O.ai - AutoDoc with H2O Driverless AI - H2O World 2019 NYC
Megan Kurka, H2O.ai - AutoDoc with H2O Driverless AI - H2O World 2019 NYC
 
Scaling AutoML-Driven Anomaly Detection With Luminaire
Scaling AutoML-Driven Anomaly Detection With LuminaireScaling AutoML-Driven Anomaly Detection With Luminaire
Scaling AutoML-Driven Anomaly Detection With Luminaire
 
Skill up in machine learning using Azure ML
Skill up in machine learning using Azure MLSkill up in machine learning using Azure ML
Skill up in machine learning using Azure ML
 
Towards Personalization in Global Digital Health
Towards Personalization in Global Digital HealthTowards Personalization in Global Digital Health
Towards Personalization in Global Digital Health
 

En vedette

Shopzilla On Concurrency
Shopzilla On ConcurrencyShopzilla On Concurrency
Shopzilla On ConcurrencyRodney Barlow
 
Bizrate Insights iMedia Conference Presentation
Bizrate Insights iMedia Conference PresentationBizrate Insights iMedia Conference Presentation
Bizrate Insights iMedia Conference PresentationConnexity
 
LA Salesforce.com User Group: Shopzilla and Informatica Cloud
LA Salesforce.com User Group: Shopzilla and Informatica CloudLA Salesforce.com User Group: Shopzilla and Informatica Cloud
LA Salesforce.com User Group: Shopzilla and Informatica CloudDarren Cunningham
 
Better Living Through Messaging - Leveraging the HornetQ Message Broker at Sh...
Better Living Through Messaging - Leveraging the HornetQ Message Broker at Sh...Better Living Through Messaging - Leveraging the HornetQ Message Broker at Sh...
Better Living Through Messaging - Leveraging the HornetQ Message Broker at Sh...Joshua Long
 
Retail Reference Architecture Part 3: Scalable Insight Component Providing Us...
Retail Reference Architecture Part 3: Scalable Insight Component Providing Us...Retail Reference Architecture Part 3: Scalable Insight Component Providing Us...
Retail Reference Architecture Part 3: Scalable Insight Component Providing Us...MongoDB
 
An Architecture for Agile Machine Learning in Real-Time Applications
An Architecture for Agile Machine Learning in Real-Time ApplicationsAn Architecture for Agile Machine Learning in Real-Time Applications
An Architecture for Agile Machine Learning in Real-Time ApplicationsJohann Schleier-Smith
 
Machine Learning for Recommender Systems MLSS 2015 Sydney
Machine Learning for Recommender Systems MLSS 2015 SydneyMachine Learning for Recommender Systems MLSS 2015 Sydney
Machine Learning for Recommender Systems MLSS 2015 SydneyAlexandros Karatzoglou
 
Shopzilla - Performance By Design
Shopzilla - Performance By DesignShopzilla - Performance By Design
Shopzilla - Performance By DesignTim Morrow
 
Working Effectively With Legacy Code
Working Effectively With Legacy CodeWorking Effectively With Legacy Code
Working Effectively With Legacy CodeNaresh Jain
 
Retail Reference Architecture Part 2: Real-Time, Geo Distributed Inventory
Retail Reference Architecture Part 2: Real-Time, Geo Distributed InventoryRetail Reference Architecture Part 2: Real-Time, Geo Distributed Inventory
Retail Reference Architecture Part 2: Real-Time, Geo Distributed InventoryMongoDB
 
5 Conversion Rate Hacks That Yield Massive 3-5x Conversion Rate Improvements ...
5 Conversion Rate Hacks That Yield Massive 3-5x Conversion Rate Improvements ...5 Conversion Rate Hacks That Yield Massive 3-5x Conversion Rate Improvements ...
5 Conversion Rate Hacks That Yield Massive 3-5x Conversion Rate Improvements ...Internet Marketing Software - WordStream
 
Big Data for the Retail Business I Swan Insights I Solvay Business School
Big Data for the Retail Business I Swan Insights I Solvay Business SchoolBig Data for the Retail Business I Swan Insights I Solvay Business School
Big Data for the Retail Business I Swan Insights I Solvay Business SchoolLaurent Kinet
 
Big data retail_industry_by VivekChutke
Big data retail_industry_by VivekChutkeBig data retail_industry_by VivekChutke
Big data retail_industry_by VivekChutkevchutke
 
How Lazada ranks products to improve customer experience and conversion
How Lazada ranks products to improve customer experience and conversionHow Lazada ranks products to improve customer experience and conversion
How Lazada ranks products to improve customer experience and conversionEugene Yan Ziyou
 
Building a Machine Learning App with AWS Lambda
Building a Machine Learning App with AWS LambdaBuilding a Machine Learning App with AWS Lambda
Building a Machine Learning App with AWS LambdaSri Ambati
 
AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)
AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)
AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)Amazon Web Services
 
Big Data in Retail: too big to ignore
Big Data in Retail: too big to ignoreBig Data in Retail: too big to ignore
Big Data in Retail: too big to ignorevalantic NL
 
Continuous Performance Testing and Monitoring in Agile Development
Continuous Performance Testing and Monitoring in Agile DevelopmentContinuous Performance Testing and Monitoring in Agile Development
Continuous Performance Testing and Monitoring in Agile DevelopmentDynatrace
 
Connected Retail Reference Architecture
Connected Retail Reference ArchitectureConnected Retail Reference Architecture
Connected Retail Reference ArchitectureWSO2
 

En vedette (20)

Shopzilla On Concurrency
Shopzilla On ConcurrencyShopzilla On Concurrency
Shopzilla On Concurrency
 
Bizrate Insights iMedia Conference Presentation
Bizrate Insights iMedia Conference PresentationBizrate Insights iMedia Conference Presentation
Bizrate Insights iMedia Conference Presentation
 
LA Salesforce.com User Group: Shopzilla and Informatica Cloud
LA Salesforce.com User Group: Shopzilla and Informatica CloudLA Salesforce.com User Group: Shopzilla and Informatica Cloud
LA Salesforce.com User Group: Shopzilla and Informatica Cloud
 
Better Living Through Messaging - Leveraging the HornetQ Message Broker at Sh...
Better Living Through Messaging - Leveraging the HornetQ Message Broker at Sh...Better Living Through Messaging - Leveraging the HornetQ Message Broker at Sh...
Better Living Through Messaging - Leveraging the HornetQ Message Broker at Sh...
 
Retail Reference Architecture Part 3: Scalable Insight Component Providing Us...
Retail Reference Architecture Part 3: Scalable Insight Component Providing Us...Retail Reference Architecture Part 3: Scalable Insight Component Providing Us...
Retail Reference Architecture Part 3: Scalable Insight Component Providing Us...
 
An Architecture for Agile Machine Learning in Real-Time Applications
An Architecture for Agile Machine Learning in Real-Time ApplicationsAn Architecture for Agile Machine Learning in Real-Time Applications
An Architecture for Agile Machine Learning in Real-Time Applications
 
Machine Learning for Recommender Systems MLSS 2015 Sydney
Machine Learning for Recommender Systems MLSS 2015 SydneyMachine Learning for Recommender Systems MLSS 2015 Sydney
Machine Learning for Recommender Systems MLSS 2015 Sydney
 
Shopzilla - Performance By Design
Shopzilla - Performance By DesignShopzilla - Performance By Design
Shopzilla - Performance By Design
 
Working Effectively With Legacy Code
Working Effectively With Legacy CodeWorking Effectively With Legacy Code
Working Effectively With Legacy Code
 
Retail Reference Architecture Part 2: Real-Time, Geo Distributed Inventory
Retail Reference Architecture Part 2: Real-Time, Geo Distributed InventoryRetail Reference Architecture Part 2: Real-Time, Geo Distributed Inventory
Retail Reference Architecture Part 2: Real-Time, Geo Distributed Inventory
 
5 Conversion Rate Hacks That Yield Massive 3-5x Conversion Rate Improvements ...
5 Conversion Rate Hacks That Yield Massive 3-5x Conversion Rate Improvements ...5 Conversion Rate Hacks That Yield Massive 3-5x Conversion Rate Improvements ...
5 Conversion Rate Hacks That Yield Massive 3-5x Conversion Rate Improvements ...
 
Big Data for the Retail Business I Swan Insights I Solvay Business School
Big Data for the Retail Business I Swan Insights I Solvay Business SchoolBig Data for the Retail Business I Swan Insights I Solvay Business School
Big Data for the Retail Business I Swan Insights I Solvay Business School
 
Big data retail_industry_by VivekChutke
Big data retail_industry_by VivekChutkeBig data retail_industry_by VivekChutke
Big data retail_industry_by VivekChutke
 
How Lazada ranks products to improve customer experience and conversion
How Lazada ranks products to improve customer experience and conversionHow Lazada ranks products to improve customer experience and conversion
How Lazada ranks products to improve customer experience and conversion
 
Building a Machine Learning App with AWS Lambda
Building a Machine Learning App with AWS LambdaBuilding a Machine Learning App with AWS Lambda
Building a Machine Learning App with AWS Lambda
 
AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)
AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)
AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)
 
The Big Data Revolution in Retail
The Big Data Revolution in RetailThe Big Data Revolution in Retail
The Big Data Revolution in Retail
 
Big Data in Retail: too big to ignore
Big Data in Retail: too big to ignoreBig Data in Retail: too big to ignore
Big Data in Retail: too big to ignore
 
Continuous Performance Testing and Monitoring in Agile Development
Continuous Performance Testing and Monitoring in Agile DevelopmentContinuous Performance Testing and Monitoring in Agile Development
Continuous Performance Testing and Monitoring in Agile Development
 
Connected Retail Reference Architecture
Connected Retail Reference ArchitectureConnected Retail Reference Architecture
Connected Retail Reference Architecture
 

Similaire à Real-Time Recommendations for Retail: Architecture, Algorithms, and Design

BI on Big Data Presentation
BI on Big Data PresentationBI on Big Data Presentation
BI on Big Data PresentationArcadia Data
 
Nuts and bolts
Nuts and boltsNuts and bolts
Nuts and boltsNBER
 
Getting the Most Out of Your E-Resources: Measuring Success
Getting the Most Out of Your E-Resources: Measuring SuccessGetting the Most Out of Your E-Resources: Measuring Success
Getting the Most Out of Your E-Resources: Measuring Successkramsey
 
The Data Warehouse Lifecycle
The Data Warehouse LifecycleThe Data Warehouse Lifecycle
The Data Warehouse Lifecyclebartlowe
 
Web analytics webinar
Web analytics webinarWeb analytics webinar
Web analytics webinarJim Jansen
 
Web analytics presentation
Web analytics presentationWeb analytics presentation
Web analytics presentationJim Jansen
 
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...Amazon Web Services
 
Datawarehousing
DatawarehousingDatawarehousing
Datawarehousingwork
 
Moneyball, Libraries, and more - Ithaka collections presentation
Moneyball, Libraries, and more - Ithaka collections presentationMoneyball, Libraries, and more - Ithaka collections presentation
Moneyball, Libraries, and more - Ithaka collections presentationGreg Raschke
 
Data Quality: principles, approaches, and best practices
Data Quality: principles, approaches, and best practicesData Quality: principles, approaches, and best practices
Data Quality: principles, approaches, and best practicesCarl Anderson
 
Caching Business Logic in the Database
Caching Business Logic in the DatabaseCaching Business Logic in the Database
Caching Business Logic in the DatabaseJonathan Levin
 
Search Analytics at Enterprise Search Summit Fall 2011
Search Analytics at Enterprise Search Summit Fall 2011Search Analytics at Enterprise Search Summit Fall 2011
Search Analytics at Enterprise Search Summit Fall 2011Sematext Group, Inc.
 
SpeedTrack Tech Overview 2015
SpeedTrack Tech Overview 2015SpeedTrack Tech Overview 2015
SpeedTrack Tech Overview 2015Michael Zoltowski
 
Simplifying Analytics - by Novoniel Deb
Simplifying Analytics - by Novoniel DebSimplifying Analytics - by Novoniel Deb
Simplifying Analytics - by Novoniel DebNovoniel Deb
 
Data Mining with SQL Server 2008
Data Mining with SQL Server 2008Data Mining with SQL Server 2008
Data Mining with SQL Server 2008Peter Gfader
 
Creating a Single Source of Truth: Leverage all of your data with powerful an...
Creating a Single Source of Truth: Leverage all of your data with powerful an...Creating a Single Source of Truth: Leverage all of your data with powerful an...
Creating a Single Source of Truth: Leverage all of your data with powerful an...Looker
 
bookrecommendations-230615063942-3b1016c9 (1).pdf
bookrecommendations-230615063942-3b1016c9 (1).pdfbookrecommendations-230615063942-3b1016c9 (1).pdf
bookrecommendations-230615063942-3b1016c9 (1).pdf13DikshaDatir
 
Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning CCG
 

Real-Time Recommendations for Retail: Architecture, Algorithms, and Design

  • 1. Juliet Hougland and Jonathan Natkins REAL-TIME RECOMMENDATIONS FOR RETAIL: ARCHITECTURE, ALGORITHMS, AND DESIGN
  • 2. Who Are We? Jonathan Natkins Field Engineer at WibiData Before that, Cloudera Software Engineer Before that, Vertica Software/Field Engineer Juliet Hougland Data Scientist, previously at WibiData MS in Applied Math BA in Math-Physics
  • 3. Recommendations in Retail Personalized versus Non-Personalized
  • 6. Recommender Contexts Taste History Based on everything you know about a user Interests over months/years Current Taste Based on a user’s immediate history Interests over minutes/hours Ephemeral Extreme version of current taste For example, location Demographic* Similar to taste history, but less subjective Geographic region, age bracket, etc.
  • 7. Why Does Real-Time Matter? Relevancy
  • 8. I am a Special Snowflake Natty
  • 9. Requirements for a Real-Time System General System Requirements Handle millions of customers/users Support collection and storage of complex data Static and event-series Real-Time System Requirements Quickly retrieve subsets of data for a single user Aggregate/derive new, first-class data per user
  • 10. What is Kiji? kiji.org github.com/kijiproject The Kiji project is a modular, open-source framework for building real-time applications that collect, store, and analyze entity-centric data
  • 12. Three Challenges Developing models for use in real-time Scoring models in real-time Deploying models into a production environment
  • 13. How Can We Make Real-Time Models? Population interests change slowly Individual interests change quickly
  • 14. How Can We Make Real-Time Models? Population interests change slowly Models don’t need to be retrained frequently Individual interests change quickly
  • 15. How Can We Make Real-Time Models? Population interests change slowly Models don’t need to be retrained frequently Application of a model should be fast Individual interests change quickly
  • 16. A Common Workflow Train a model over the entire dataset Save fitted model parameters to a file or another table Access the model parameters when generating new recommendations based on new data This is EXPENSIVE
  • 17. Developing Models KijiExpress Scala interface for interacting with Kiji data Uses Scalding for designing complex dataflows Model Lifecycle Allows analysts and data scientists to break apart a model into phases
  • 18. Scoring Models in Real-Time Batch isn’t real-time
  • 19. Scoring Models in Real-Time Batch isn’t real-time Number of Users Number of Interactions
  • 20. Scoring Models in Real-Time Batch isn’t real-time Number of Users A few users with many interactions Number of Interactions
  • 21. Scoring Models in Real-Time Batch isn’t real-time A lot of users with few interactions Number of Users A few users with many interactions Number of Interactions
  • 22. Fresheners Compute Lazily Read a column Get from HBase Client KijiScoring Server HBase
  • 23. Fresheners Compute Lazily Read a column Get from HBase Client Freshness Policy KijiScoring Server HBase
  • 24. Fresheners Compute Lazily Read a column Get from HBase Client Yes, return to client KijiScoring Server Freshness Policy HBase
  • 25. Fresheners Compute Lazily Read a column Get from HBase Client NO Freshness Policy Scorer KijiScoring Server HBase
  • 26. Fresheners Compute Lazily Read a column Get from HBase Client Freshness Policy Scorer Yes, return to client Write back for next time KijiScoring Server HBase
  • 29. Kiji Model Repository Link between application and models Stores Freshener metadata FreshnessPolicy, Scorer, attached column Location of trained model Stores Scorer code Code repository makes model scoring code available to the application from a central location New models can be deployed to the Model Repository and made immediately available to the application
  • 33. Content-Based Recommenders Build models around entities using features that we think reflect inherent characteristics Orange-Nosed Lab Assistant Meeps a lot
  • 36. Collaborative Filtering Represent user-item affinities as a sparse matrix Users ≈ Rows Items ≈ Columns Beaker Banana Slicer Pineapple Slicer
  • 37. Aspirational Ratings I put in my queue… I actually watch
  • 39. Collaborative Filtering: How It Works Similar Users Similar Products Simple aggregate predictors
  • 40. Similar Entities What do we mean by similar? Jaccard Index: a measure of set similarity Cosine Similarity: the angle between two vectors Pearson Correlation: statistical measure, similar to cosine Naively, we could compare every entity to each other …But that would not scale well with increasing numbers of entities
  • 42. Collaborative Filtering: Is This Useful? Problem: Too much data! Tracking user preferences and all their events generates huge amounts of data Problem: Too little data! Dimensions of user-space and item-space are usually very large More variables make it more difficult to generate user preferences Problem: Cold start If you don’t know anything about a user, what should you recommend? Problem: More ratings means slower computations Identifying neighborhoods of entities is expensive
  • 43. Collaborative Filtering: Why Is It Useful? Because it works Content-agnostic All that matters is co-occurrence of events
  • 44. Amazon: Item-Item Collaborative Filtering > Used for personalized recommendations Fill screen real estate with related items Produces specific, but non-creepy recommendations Linden, G.; Smith, B.; York, J., "Amazon.com recommendations: item-to-item collaborative filtering," IEEE Internet Computing, vol. 7, no. 1, pp. 76-80, Jan/Feb 2003
  • 45. Item-Item Collaborative Filtering Beaker buys a banana slicer Then: Generate list of candidate items to predict ratings for Predict ratings for candidate items Select Top-N items
  • 46. Accessing External Data KeyValueStore API enables external data access when applying a model External data might be… Trained model parameters Hierarchical/Taxonomic data Geo-lookup Store external data flexibly Text files, sequence files, Kiji tables, etc. Data access is decoupled from use during execution If the data doesn’t fit in memory, put it in a table
  • 47. How Much Less Work Can We Do? We can choose a predictor that allows us to truncate a sum There are two ways terms in the sum of our predictor can be small No rating Small similarity
  • 49. How Much Less Work Can We Do? We can choose a predictor that allows us to truncate a sum Ignore unrated items There are two ways terms in the sum of our predictor can be small No rating Small similarity
  • 50. How Much Less Work Can We Do? We can choose a predictor that allows us to truncate a sum Ignore dissimilar items There are two ways terms in the sum of our predictor can be small No rating Small similarity
  • 51. How Much Less Work Can We Do? If we only present a few recommendations, we don’t need to predict ratings for all items Choose your candidate set to estimate ratings wisely or infer from nearest neighbors
  • 52. Organizing Data in Item-Item CF
  • 53. Accessing Data During Freshening
  • 54. Want to Know More? The Kiji Project kiji.org github.com/kijiproject Questions about this presentation? Twitter: @JulietHougland or @nattyice Email: natty@wibidata.com
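
The lazy-compute flow on the Fresheners slides (22 to 26) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the KijiScoring API: `LazyFreshener`, `Cell`, the age-based freshness check, and the in-memory dict standing in for the HBase-backed table are all hypothetical names and choices made for this example.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class Cell:
    """A stored value plus the time it was written."""
    value: Any
    written_at: float


class LazyFreshener:
    """Hypothetical stand-in for the read path on the Fresheners slides."""

    def __init__(self, scorer: Callable[[str], Any], max_age_seconds: float,
                 clock: Callable[[], float] = time.time) -> None:
        self.store: Dict[str, Cell] = {}  # plain dict standing in for the table
        self.scorer = scorer              # recomputes a value for one key
        self.max_age_seconds = max_age_seconds
        self.clock = clock

    def is_fresh(self, cell: Optional[Cell]) -> bool:
        # Assumed freshness policy: a value is fresh if it exists and is young enough.
        return cell is not None and self.clock() - cell.written_at <= self.max_age_seconds

    def read(self, key: str) -> Any:
        cell = self.store.get(key)        # "Get from HBase"
        if self.is_fresh(cell):
            return cell.value             # "Yes, return to client"
        value = self.scorer(key)          # "NO": run the scorer
        self.store[key] = Cell(value, self.clock())  # "Write back for next time"
        return value
```

The scorer only runs when a client actually reads a stale or missing value, so users who never come back cost nothing to keep up to date.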

Editor's notes

  1. Natty, thanks for that great description of the infrastructure that Kiji provides. Now we want to take the next step: go from infrastructure to actually showing our customers items they may be interested in. (1) Recommending a group of items. If you are taking the time to make predictions, you may as well present people with many options. This means recommendations aren’t about finding the single best item to recommend; they are about predicting the best group of items. This gives us some leeway in terms of exact values. (2) Users often will not be logged in at first. If we want to present good recs, we need to be able to do it based on their current session’s browsing history. You can’t have personalized recs for non-logged-in users without real-time recs. (3) Online retailers want to present vast catalogs to their large user base. For most online retail sites, the point of recommendations is organizing information in a way that is relevant and useful to their customers.
  2. I want to give us a broad overview of the types of approaches to recommendations available. We have a finite amount of time here, so I will focus a lot of what I say on a simple implementation of collaborative filtering. I don’t want to give the impression that it is the only, or absolute best, solution to the recommendations problem. Like with any prediction problem, there are many ways to tackle retail recommendation. There are two main types of recommendation algorithms. In a realistic system, they will always be used together.
  3. Use item descriptions, user-generated tags, and expert-generated tags (Pandora) in order to build representations. A major pro of content-based models is that they better handle unrated items. They are a good way to get around the cold start problem for a rec system, and a good way to bootstrap your way to getting user ratings or to augment other methods of recommendation. The downside is that processing and building models around textual information can be very challenging.
  4. Just look at it. In a content-based system, the hope is that the content you are basing your recs on is a good indicator of other items that are related in a relevant way. So, if we had a good content-based recommender, after observing an interest in a banana slicer, it would quickly recommend that you try using a butter knife.
  5. Pandora is the first recommendation system I remember consciously interacting with. Pandora: expert tagging. They have a team of musicians (domain expertise is invaluable) who listen to music and apply tags to songs. From a seeded station, they begin to present to you variations on the original attributes of the song you started with. Your likes and dislikes, as expressed to their system, help it learn what attributes you like and dislike. Expert tagging is expensive and is a bottleneck in introducing new items.
  6. In collaborative filtering recommendation algorithms, we base our predictions purely on expressed preferences. We think of storing user-item ratings as a matrix where the rows correspond to users and the columns correspond to items. We collect explicit ratings and record them. Unfortunately, people don’t provide many ratings. Also, people lie.
  7. Gather feedback as explicit ratings or implicitly through user behavior (page views, put in shopping cart, starred/saved for later, bought). Lots of work in rec systems has been done around explicitly rated items. People lie about their preferences; their ratings are aspirational. I put Ken Burns in my queue, but I watch a lot of Deadliest Catch. New data can be added incrementally to the model.
  8. We can rely on implicit affinities for items instead of explicit ratings. We can track viewed or bought items and use a unary representation. Meaning, the event happened, or… nothing, null, the void.
  9. Users that have expressed similar taste in the past should express similar taste in the future. (We represent users as a vector of ratings for items.) Items that have had similar profiles of user interest should continue to appeal to a similar collection of users. (We represent items as a vector of ratings by user.) For a target entity, we predict the unknown rating using information from other similar entities. A simple and common approach is to take the weighted average of ratings, where the weights are some function of the similarity between entities: r_{i} = \sum_{j} w_{ij} r_{j}
  10. Identify items to select recommendations from. Generate predicted scores, often through weighted averages in neighborhoods. Return a list of top-rated items.
  11. We had the cold start problem => content-based recs. Where do we keep our data? What do we need access to when we generate recs? Design tables. Train model.
  12. Too much data! Too little data! Data sparsity: a large number of users and items makes it hard to get a good sample of the potential “taste space.” Did you see Gravity? Did the emptiness of space strike you? This is like that, but much, much emptier. Cold start: troublesome because collaborative filtering requires ratings. If your system has no recorded interactions or explicit ratings, you can’t do this. Usually systems will be bootstrapped from recommenders that don’t require having user interactions recorded, such as content-based recommenders: use item descriptions or tags to infer similarity between items, or generate profiles for users. Alternatively, use data volunteered by users during registration or pulled in from Facebook profiles to begin recommending items. The more ratings data you have, the slower your computation goes.
  13. Useful because it is content-agnostic. It can be used for any variety of content; you just need items, users, and ratings. It can be used across languages. (If you are Google, this is very important.) It performs well. It is used in many successful commercial applications: Amazon, Netflix, Google. It just works well.
  14. Conceptual reasons Amazon uses item-item CF: Amazon has more users than items, so it is computationally cheaper to focus model building around item relationships, since there are fewer items. The relationships between items are also often simpler. It can be used for personalized recs, and it is especially useful when the only information you have about your user is a few items they have viewed in their current session. Item-based CF is specific in the types of items it recommends; user-based CF is more serendipitous. It is less creepy to get recommended items very similar to the one you are currently looking at than to have an accurate prediction when people don’t think… You can fill screen real estate with similar items easily: “Customers who bought X also bought...” You are already doing the needed computation during the model training phase.
  15. Use banana slicer + banana slicer pile pic and eq here. Steps in generating a rec: 1. We don’t need to estimate a rating for every item in the catalog if we will only present a few recs. Choose your candidate set to estimate ratings wisely or infer from nearest neighbors. 2. Precompute item-item similarities: an (N^2)M operation. In practice NM, since ratings are sparse. At scoring time, aggregate ratings and similarities to predict ratings for unrated items. How can you organize the model data in such a way that what you need to generate predictions is accessible at scoring time?
  16. We need to be able to quickly access item-item similarities on a per-item basis. How can we do this quickly? Kiji provides the ability to access outside data sources while freshening, running a MapReduce job, or testing, through KeyValueStores. The KeyValueStore interface has many existing useful implementations you may use, or you may define your own custom one. Depending on your access needs and the total size of the data you need access to, you may use a file-backed KeyValueStore or another Kiji table. Since item-item CF as we have stated it requires that we are able to access all item-item similarity pairs, our best choice is to use another Kiji table to store this information. Accessing this information quickly then becomes an issue of table layout design.
  17. No rating => we should be able to query item-item similarities on a per-item basis and
  18. Organization of data in your tables depends on your prediction function. We can see that standard neighborhood-based interpolation in CF requires access to all of a user’s ratings. Two tables: the users table contains user info, product ratings, views, purchases, etc.; the products table contains product info, and will be augmented with related/similar products.
  19. Organization of data in your tables depends on your prediction function. We can see that standard neighborhood-based interpolation in CF requires access to all of a user’s ratings.
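
The item-item steps in notes 9, 10, and 15 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the presenters' implementation: the function names, the choice of cosine similarity as the weight, and the toy data are all assumptions of this example. The score follows the predictor as written in note 9, r_{i} = \sum_{j} w_{ij} r_{j}, without normalizing the weights, and it is truncated in the two ways the slides describe: only items the user has rated contribute, and only item pairs with similarity above a threshold are stored.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Ratings = Dict[str, Dict[str, float]]  # user -> item -> rating


def item_vectors(ratings: Ratings) -> Dict[str, Dict[str, float]]:
    """Transpose user -> item ratings into item -> user vectors."""
    vectors: Dict[str, Dict[str, float]] = defaultdict(dict)
    for user, items in ratings.items():
        for item, rating in items.items():
            vectors[item][user] = rating
    return dict(vectors)


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[u] * b[u] for u in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def item_similarities(ratings: Ratings, min_sim: float = 0.0) -> Dict[str, Dict[str, float]]:
    """Batch step: precompute similarities, keeping only pairs above min_sim."""
    vectors = item_vectors(ratings)
    items = sorted(vectors)
    sims: Dict[str, Dict[str, float]] = defaultdict(dict)
    for idx, a in enumerate(items):
        for b in items[idx + 1:]:
            s = cosine(vectors[a], vectors[b])
            if s > min_sim:  # drop small-similarity terms up front
                sims[a][b] = s
                sims[b][a] = s
    return dict(sims)


def recommend(user_ratings: Dict[str, float], sims: Dict[str, Dict[str, float]],
              top_n: int = 3) -> List[Tuple[str, float]]:
    """Scoring step: candidates are the neighbors of items the user has rated."""
    scores: Dict[str, float] = defaultdict(float)
    for rated_item, rating in user_ratings.items():  # unrated items never enter the sum
        for candidate, weight in sims.get(rated_item, {}).items():
            if candidate not in user_ratings:
                scores[candidate] += weight * rating
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Toy data (hypothetical): who rated which kitchen gadget.
ratings = {
    "beaker": {"banana_slicer": 1.0, "pineapple_slicer": 1.0},
    "bunsen": {"banana_slicer": 1.0, "pineapple_slicer": 1.0, "butter_knife": 1.0},
    "animal": {"butter_knife": 1.0},
}
sims = item_similarities(ratings)
recs = recommend({"banana_slicer": 1.0}, sims, top_n=2)
```

With this toy data, a new visitor who has only looked at the banana slicer gets the pineapple slicer ranked ahead of the butter knife. The split mirrors the notes: `item_similarities` is the expensive batch step, while `recommend` only needs one user's ratings and the similarity rows for those items, which is the kind of per-item lookup a similarity table would serve at scoring time.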