SlideShare une entreprise Scribd logo
1  sur  41
© 2014 MapR Technologies 1© 2014 MapR Technologies
© 2014 MapR Technologies 2
Who I am
Ted Dunning, Chief Applications Architect, MapR Technologies
Email tdunning@mapr.com tdunning@apache.org
Twitter @Ted_Dunning
Apache Mahout https://mahout.apache.org/
Twitter @ApacheMahout
© 2014 MapR Technologies 3
Agenda
• Background – recommending with puppies and ponies
• Speed tricks
• Accuracy tricks
• Moving to real-time
© 2014 MapR Technologies 4
Puppies and Ponies
© 2014 MapR Technologies 5
Cooccurrence AnalysisCooccurrence Analysis
© 2014 MapR Technologies 6
How Often Do Items Co-occur
How often do items co-occur?
¢A A
A
© 2014 MapR Technologies 7
Which Co-occurrences are Interesting?
Which cooccurences are interesting?
Each row of indicators becomes a field in a
search engine document
¢A A
sparsify ¢A A( )
© 2014 MapR Technologies 8
Recommendations
Alice got an apple and
a puppyAlice
© 2014 MapR Technologies 9
Recommendations
Alice got an apple and
a puppyAlice
Charles got a bicycleCharles
© 2014 MapR Technologies 10
Recommendations
Alice got an apple and
a puppyAlice
Charles got a bicycleCharles
Bob Bob got an apple
© 2014 MapR Technologies 11
Recommendations
Alice got an apple and
a puppyAlice
Charles got a bicycleCharles
Bob What else would Bob like?
© 2014 MapR Technologies 12
Recommendations
Alice got an apple and
a puppyAlice
Charles got a bicycleCharles
Bob A puppy!
© 2014 MapR Technologies 13
By the way, like me, Bob also
wants a pony…
© 2014 MapR Technologies 14
Recommendations
?
Alice
Bob
Charles
Amelia
What if everybody gets a
pony?
What else would you recommend for
new user Amelia?
© 2014 MapR Technologies 15
Recommendations
?
Alice
Bob
Charles
Amelia
If everybody gets a pony, it’s
not a very good indicator of
what to else predict...
© 2014 MapR Technologies 16
Problems with Raw Co-occurrence
• Very popular items co-occur with everything or why it’s not very
helpful to know that everybody wants a pony…
– Examples: Welcome document; Elevator music
• Very widespread occurrence is not interesting to generate indicators
for recommendation
– Unless you want to offer an item that is constantly desired, such as
razor blades (or ponies)
• What we want is anomalous co-occurrence
– This is the source of interesting indicators of preference on which to
base recommendation
© 2014 MapR Technologies 17
Overview: Get Useful Indicators from Behaviors
1. Use log files to build history matrix of users x items
– Remember: this history of interactions will be sparse compared to all
potential combinations
2. Transform to a co-occurrence matrix of items x items
3. Look for useful indicators by identifying anomalous co-occurrences to
make an indicator matrix
– Log Likelihood Ratio (LLR) can be helpful to judge which co-
occurrences can with confidence be used as indicators of preference
– ItemSimilarityJob in Apache Mahout uses LLR
© 2014 MapR Technologies 18
Which one is the anomalous co-occurrence?
A not A
B 13 1000
not B 1000 100,000
A not A
B 1 0
not B 0 10,000
A not A
B 10 0
not B 0 100,000
A not A
B 1 0
not B 0 2
© 2014 MapR Technologies 19
Which one is the anomalous co-occurrence?
A not A
B 13 1000
not B 1000 100,000
A not A
B 1 0
not B 0 10,000
A not A
B 10 0
not B 0 100,000
A not A
B 1 0
not B 0 2
0.90 1.95
4.52 14.3
Dunning Ted, Accurate Methods for the Statistics of Surprise and Coincidence,
Computational Linguistics vol 19 no. 1 (1993)
© 2014 MapR Technologies 20
Collection of Documents: Insert Meta-Data
Search Engine
Item
meta-data
Document for
“puppy”
id: t4
title: puppy
desc: The sweetest little puppy
ever.
keywords: puppy, dog, pet
Ingest easily via NFS
✔ indicators: (t1)
© 2014 MapR Technologies 22
Cooccurrence Mechanics
• Cooccurrence is just a self-join
for each user, i
for each history item j1 in Ai*
for each history item j2 in Ai*
count pair (j1, j2)
aij1
aij2
= ¢A A
i
å
© 2014 MapR Technologies 23
Cross-occurrence Mechanics
• Cross occurrence is just a self-join of adjoined matrices
for each user, i
for each history item j1 in Ai*
for each history item j2 in Bi*
count pair (j1, j2)
aij1
bij2
= ¢A B
i
å A | B[ ]¢ A | B[ ] =
¢A A ¢A B
¢B A ¢B B
é
ë
ê
ù
û
ú
© 2014 MapR Technologies 24
A word about scaling
© 2014 MapR Technologies 25
A few pragmatic tricks
• Downsample all user histories to max length (interaction cut)
– Can be random or most-recent (no apparent effect on accuracy)
– Prolific users are often pathological anyway
– Common limit is 300 items (no apparent effect on accuracy)
• Downsample all items to limit max viewers (frequency limit)
– Can be random or earliest (no apparent effect)
– Ubiquitous items are uninformative
– Common limit is 500 users (no apparent effect)
Schelter, et al. Scalable similarity-based neighborhood methods with MapReduce.
Proceedings of the sixth ACM conference on Recommender systems. 2012
© 2014 MapR Technologies 26
But note!
• Number of pairs for a user history with ki distinct items is ≈ ki
2/2
• Average size of user history increases with increasing dataset
– Average may grow more slowly than N (or not!)
– Full cooccurrence cost grows strictly faster than N
– i.e. it just doesn’t scale
• Downsampling interactions places bounds on per user cost
– Cooccurrence with interaction cut is scalable
t µ ki
2
i
å Î o(N)
t µ min kmax
2
,ki
2
( )< Nkmax
2
i
å Î O(N)
© 2014 MapR Technologies 27
0 200 400 600 800 1000
0123456
Benefit of down−sampling
User Limit
Pairs(x109
)
Without down−sampling
Track limit = 1000
500
200
Computed on 48,373,586 pair−wise triples
from the million song dataset
●
●
© 2014 MapR Technologies 28
Batch Scaling in Time Implies Scaling in Space
• Note:
– With frequency limit sampling, max cooccurrence count is small (<1000)
– With interaction cut, total number of non-zero pairs is relatively small
– Entire cooccurrence matrix can be stored in memory in ~10-15 GB
• Specifically:
– With interaction cut, cooccurrence scales in size
– Without interaction cut, cooccurrence does not scale size-wise
¢A A 0
Î O(N)
¢A A 0
Îw(N)
© 2014 MapR Technologies 29
Impact of Interaction Cut Downsampling
• Interaction cut allows batch cooccurrence analysis to be O(N) in
time and space
• This is intriguing
– Amortized cost is low
– Could this be extended to an on-line form?
• Incremental matrix factorization is hard
– Could cooccurrence be a key alternative?
• Scaling matters most at scale
– Cooccurrence is very accurate at large scale
– Factorization shows benefits at smaller scales
© 2014 MapR Technologies 30
Online update
© 2014 MapR Technologies 31
Requirements for Online Algorithms
• Each unit of input must require O(1) work
– Theoretical bound
• The constants have to be small enough on average
– Pragmatic constraint
• Total accumulated data must be small (enough)
– Pragmatic constraint
© 2014 MapR Technologies 32
Log Files
Search
Technology
Item
Meta-Data
via
NFS
MapR Cluster
via
NFS PostPre
Recommendations
New User
History
Web
Tier
Recommendations
happen in real-time
Batch co-
occurrence
Want this to be real-time
Real-time recommendations using MapR data platform
© 2014 MapR Technologies 33
Space Bound Implies Time Bound
• Because user histories are pruned, only a limited number of
value updates need be made with each new observation
• This bound is just twice the interaction cut kmax
– Which is a constant
• Bounding the number of updates trivially bounds the time
© 2014 MapR Technologies 34
Implications for Online Update
A+eij( )¢ A+eij( )- ¢A A = ¢eij A+ ¢A eij +djj
=
0
Ai*
0
é
ë
ê
ê
ê
ù
û
ú
ú
ú
+ 0 Ai*( )¢ 0
é
ë
ê
ù
û
ú+djj
© 2014 MapR Technologies 35
Ai* + Ai*( )¢ +dii
0
£ 2kmax -1Î O(1)
With interaction cut at kmax
© 2014 MapR Technologies 36
But Wait, It Gets Better
• The new observation may be pruned
– For users at the interaction cut, we can ignore updates
– For items at the frequency cut, we can ignore updates
– Ignoring updates only affects indicators, not recommendation query
– At million song dataset size, half of all updates are pruned
• On average ki is much less than the interaction cut
– For million song dataset, average appears to grow with log of frequency
limit, with little dependency on values of interaction cut > 200
• LLR cutoff avoids almost all updates to index
• Average grows slowly with frequency cut
© 2014 MapR Technologies 37
0 200 400 600 800 1000
05101520253035
Interaction cut (kmax)
kave
Frequency cut = 1000
= 500
= 200
© 2014 MapR Technologies 38
0 200 400 600 800 1000
05101520253035
Frequency cut
kave
© 2014 MapR Technologies 39
Recap
• Cooccurrence-based recommendations are simple
– Deploying with a search engine is even better
• Interaction cut and frequency cut are key to batch scalability
• Similar effect occurs in online form of updates
– Only dozens of updates per transaction needed
– Data structure required is relatively small
– Very, very few updates cause search engine updates
• Fully online recommendation is very feasible, almost easy
© 2014 MapR Technologies 40
More Details Available
available for free at
available for free at
http://www.mapr.com/practical-machine-learning
© 2014 MapR Technologies 41
Who I am
Ted Dunning, Chief Applications Architect, MapR Technologies
Email tdunning@mapr.com tdunning@apache.org
Twitter @Ted_Dunning
Apache Mahout https://mahout.apache.org/
Twitter @ApacheMahout
Apache Drill http://incubator.apache.org/drill/
Twitter @ApacheDrill
© 2014 MapR Technologies 42
Q&A
@mapr maprtech
tdunning@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies

Contenu connexe

Tendances

Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015Ted Dunning
 
Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0Ted Dunning
 
Which Algorithms Really Matter
Which Algorithms Really MatterWhich Algorithms Really Matter
Which Algorithms Really MatterTed Dunning
 
What is the past future tense of data?
What is the past future tense of data?What is the past future tense of data?
What is the past future tense of data?Ted Dunning
 
What's new in Apache Mahout
What's new in Apache MahoutWhat's new in Apache Mahout
What's new in Apache MahoutTed Dunning
 
How to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterHow to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterDataWorks Summit
 
How to tell which algorithms really matter
How to tell which algorithms really matterHow to tell which algorithms really matter
How to tell which algorithms really matterDataWorks Summit
 
Mahout and Recommendations
Mahout and RecommendationsMahout and Recommendations
Mahout and RecommendationsTed Dunning
 
Polyvalent recommendations
Polyvalent recommendationsPolyvalent recommendations
Polyvalent recommendationsTed Dunning
 
MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016 MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016 Albert Bifet
 
Artificial intelligence and data stream mining
Artificial intelligence and data stream miningArtificial intelligence and data stream mining
Artificial intelligence and data stream miningAlbert Bifet
 
Hadoop and R Go to the Movies
Hadoop and R Go to the MoviesHadoop and R Go to the Movies
Hadoop and R Go to the MoviesDataWorks Summit
 
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...tboubez
 

Tendances (13)

Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015
 
Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0
 
Which Algorithms Really Matter
Which Algorithms Really MatterWhich Algorithms Really Matter
Which Algorithms Really Matter
 
What is the past future tense of data?
What is the past future tense of data?What is the past future tense of data?
What is the past future tense of data?
 
What's new in Apache Mahout
What's new in Apache MahoutWhat's new in Apache Mahout
What's new in Apache Mahout
 
How to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterHow to Determine which Algorithms Really Matter
How to Determine which Algorithms Really Matter
 
How to tell which algorithms really matter
How to tell which algorithms really matterHow to tell which algorithms really matter
How to tell which algorithms really matter
 
Mahout and Recommendations
Mahout and RecommendationsMahout and Recommendations
Mahout and Recommendations
 
Polyvalent recommendations
Polyvalent recommendationsPolyvalent recommendations
Polyvalent recommendations
 
MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016 MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016
 
Artificial intelligence and data stream mining
Artificial intelligence and data stream miningArtificial intelligence and data stream mining
Artificial intelligence and data stream mining
 
Hadoop and R Go to the Movies
Hadoop and R Go to the MoviesHadoop and R Go to the Movies
Hadoop and R Go to the Movies
 
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
 

En vedette

Chicago Hadoop in Finance - Ted Dunning
Chicago Hadoop in Finance - Ted DunningChicago Hadoop in Finance - Ted Dunning
Chicago Hadoop in Finance - Ted DunningMapR Technologies
 
Hadoop Summit EU - Crowd Sourcing Reflected Intelligence
Hadoop Summit EU - Crowd Sourcing Reflected IntelligenceHadoop Summit EU - Crowd Sourcing Reflected Intelligence
Hadoop Summit EU - Crowd Sourcing Reflected IntelligenceMapR Technologies
 
The Last Traffic Jam - LatAm Spanish
The Last Traffic Jam - LatAm SpanishThe Last Traffic Jam - LatAm Spanish
The Last Traffic Jam - LatAm SpanishConnected Futures
 
เรื่องที่ 2 แหล่งสารสนเทศ
เรื่องที่ 2 แหล่งสารสนเทศเรื่องที่ 2 แหล่งสารสนเทศ
เรื่องที่ 2 แหล่งสารสนเทศMarg Kok
 
Strata+Hadoop 2015 Keynote: Impacting Business as it Happens
Strata+Hadoop 2015 Keynote: Impacting Business as it HappensStrata+Hadoop 2015 Keynote: Impacting Business as it Happens
Strata+Hadoop 2015 Keynote: Impacting Business as it HappensMapR Technologies
 
USO DE LAS ESTRATEGIAS EXPOSITIVAS PARA EL DESARROLLO DE ACTITUDES Y VALORES
USO DE LAS ESTRATEGIAS EXPOSITIVAS PARA EL DESARROLLO DE ACTITUDES Y VALORESUSO DE LAS ESTRATEGIAS EXPOSITIVAS PARA EL DESARROLLO DE ACTITUDES Y VALORES
USO DE LAS ESTRATEGIAS EXPOSITIVAS PARA EL DESARROLLO DE ACTITUDES Y VALORESChus Fernández de la Fuente
 

En vedette (9)

Chicago Hadoop in Finance - Ted Dunning
Chicago Hadoop in Finance - Ted DunningChicago Hadoop in Finance - Ted Dunning
Chicago Hadoop in Finance - Ted Dunning
 
Atlhug 20150625
Atlhug 20150625Atlhug 20150625
Atlhug 20150625
 
Hadoop Summit EU - Crowd Sourcing Reflected Intelligence
Hadoop Summit EU - Crowd Sourcing Reflected IntelligenceHadoop Summit EU - Crowd Sourcing Reflected Intelligence
Hadoop Summit EU - Crowd Sourcing Reflected Intelligence
 
The Last Traffic Jam - LatAm Spanish
The Last Traffic Jam - LatAm SpanishThe Last Traffic Jam - LatAm Spanish
The Last Traffic Jam - LatAm Spanish
 
เรื่องที่ 2 แหล่งสารสนเทศ
เรื่องที่ 2 แหล่งสารสนเทศเรื่องที่ 2 แหล่งสารสนเทศ
เรื่องที่ 2 แหล่งสารสนเทศ
 
Drill njhug -19 feb2013
Drill njhug -19 feb2013Drill njhug -19 feb2013
Drill njhug -19 feb2013
 
Drill 1.0
Drill 1.0Drill 1.0
Drill 1.0
 
Strata+Hadoop 2015 Keynote: Impacting Business as it Happens
Strata+Hadoop 2015 Keynote: Impacting Business as it HappensStrata+Hadoop 2015 Keynote: Impacting Business as it Happens
Strata+Hadoop 2015 Keynote: Impacting Business as it Happens
 
USO DE LAS ESTRATEGIAS EXPOSITIVAS PARA EL DESARROLLO DE ACTITUDES Y VALORES
USO DE LAS ESTRATEGIAS EXPOSITIVAS PARA EL DESARROLLO DE ACTITUDES Y VALORESUSO DE LAS ESTRATEGIAS EXPOSITIVAS PARA EL DESARROLLO DE ACTITUDES Y VALORES
USO DE LAS ESTRATEGIAS EXPOSITIVAS PARA EL DESARROLLO DE ACTITUDES Y VALORES
 

Similaire à Dunning ml-conf-2014

Ted Dunning, Chief Application Architect, MapR at MLconf SF
Ted Dunning, Chief Application Architect, MapR at MLconf SFTed Dunning, Chief Application Architect, MapR at MLconf SF
Ted Dunning, Chief Application Architect, MapR at MLconf SFMLconf
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with HadoopDataWorks Summit
 
Practical Computing with Chaos
Practical Computing with ChaosPractical Computing with Chaos
Practical Computing with ChaosMapR Technologies
 
Practical Computing With Chaos
Practical Computing With ChaosPractical Computing With Chaos
Practical Computing With ChaosDataWorks Summit
 
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen ChinaAllen Day, PhD
 
Cognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approachesCognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approachesTed Dunning
 
Network-Wide Heavy-Hitter Detection with Commodity Switches
Network-Wide Heavy-Hitter Detection with Commodity SwitchesNetwork-Wide Heavy-Hitter Detection with Commodity Switches
Network-Wide Heavy-Hitter Detection with Commodity SwitchesAJAY KHARAT
 
Deep Learning vs. Cheap Learning
Deep Learning vs. Cheap LearningDeep Learning vs. Cheap Learning
Deep Learning vs. Cheap LearningMapR Technologies
 
Goto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedGoto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedTed Dunning
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownTed Dunning
 
Dealing with an Upside Down Internet With High Performance Time Series Database
Dealing with an Upside Down Internet  With High Performance Time Series DatabaseDealing with an Upside Down Internet  With High Performance Time Series Database
Dealing with an Upside Down Internet With High Performance Time Series DatabaseDataWorks Summit
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to MahoutTed Dunning
 
Introduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGIntroduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGMapR Technologies
 
Real-Time Big Data Stream Analytics
Real-Time Big Data Stream AnalyticsReal-Time Big Data Stream Analytics
Real-Time Big Data Stream AnalyticsAlbert Bifet
 
CMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceCMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceMapR Technologies
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forTed Dunning
 
DFW Big Data talk on Mahout Recommenders
DFW Big Data talk on Mahout RecommendersDFW Big Data talk on Mahout Recommenders
DFW Big Data talk on Mahout RecommendersTed Dunning
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningTed Dunning
 

Similaire à Dunning ml-conf-2014 (20)

Ted Dunning, Chief Application Architect, MapR at MLconf SF
Ted Dunning, Chief Application Architect, MapR at MLconf SFTed Dunning, Chief Application Architect, MapR at MLconf SF
Ted Dunning, Chief Application Architect, MapR at MLconf SF
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with Hadoop
 
Practical Computing with Chaos
Practical Computing with ChaosPractical Computing with Chaos
Practical Computing with Chaos
 
Practical Computing With Chaos
Practical Computing With ChaosPractical Computing With Chaos
Practical Computing With Chaos
 
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
 
Cognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approachesCognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approaches
 
Network-Wide Heavy-Hitter Detection with Commodity Switches
Network-Wide Heavy-Hitter Detection with Commodity SwitchesNetwork-Wide Heavy-Hitter Detection with Commodity Switches
Network-Wide Heavy-Hitter Detection with Commodity Switches
 
Deep Learning vs. Cheap Learning
Deep Learning vs. Cheap LearningDeep Learning vs. Cheap Learning
Deep Learning vs. Cheap Learning
 
Goto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedGoto amsterdam-2013-skinned
Goto amsterdam-2013-skinned
 
GoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 SkinnedGoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 Skinned
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
 
Dealing with an Upside Down Internet With High Performance Time Series Database
Dealing with an Upside Down Internet  With High Performance Time Series DatabaseDealing with an Upside Down Internet  With High Performance Time Series Database
Dealing with an Upside Down Internet With High Performance Time Series Database
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to Mahout
 
Introduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGIntroduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUG
 
Polyvalent Recommendations
Polyvalent RecommendationsPolyvalent Recommendations
Polyvalent Recommendations
 
Real-Time Big Data Stream Analytics
Real-Time Big Data Stream AnalyticsReal-Time Big Data Stream Analytics
Real-Time Big Data Stream Analytics
 
CMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceCMU Lecture on Hadoop Performance
CMU Lecture on Hadoop Performance
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
 
DFW Big Data talk on Mahout Recommenders
DFW Big Data talk on Mahout RecommendersDFW Big Data talk on Mahout Recommenders
DFW Big Data talk on Mahout Recommenders
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine Learning
 

Plus de MapR Technologies

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscapeMapR Technologies
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationMapR Technologies
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataMapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureMapR Technologies
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...MapR Technologies
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsMapR Technologies
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMapR Technologies
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action MapR Technologies
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsMapR Technologies
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageMapR Technologies
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionMapR Technologies
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformMapR Technologies
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...MapR Technologies
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareMapR Technologies
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsMapR Technologies
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Technologies
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data AnalyticsMapR Technologies
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsMapR Technologies
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR Technologies
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLMapR Technologies
 

Plus de MapR Technologies (20)

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
 

Dunning ml-conf-2014

  • 1. © 2014 MapR Technologies 1© 2014 MapR Technologies
  • 2. © 2014 MapR Technologies 2 Who I am Ted Dunning, Chief Applications Architect, MapR Technologies Email tdunning@mapr.com tdunning@apache.org Twitter @Ted_Dunning Apache Mahout https://mahout.apache.org/ Twitter @ApacheMahout
  • 3. © 2014 MapR Technologies 3 Agenda • Background – recommending with puppies and ponies • Speed tricks • Accuracy tricks • Moving to real-time
  • 4. © 2014 MapR Technologies 4 Puppies and Ponies
  • 5. © 2014 MapR Technologies 5 Cooccurrence AnalysisCooccurrence Analysis
  • 6. © 2014 MapR Technologies 6 How Often Do Items Co-occur How often do items co-occur? ¢A A A
  • 7. © 2014 MapR Technologies 7 Which Co-occurrences are Interesting? Which cooccurences are interesting? Each row of indicators becomes a field in a search engine document ¢A A sparsify ¢A A( )
  • 8. © 2014 MapR Technologies 8 Recommendations Alice got an apple and a puppyAlice
  • 9. © 2014 MapR Technologies 9 Recommendations Alice got an apple and a puppyAlice Charles got a bicycleCharles
  • 10. © 2014 MapR Technologies 10 Recommendations Alice got an apple and a puppyAlice Charles got a bicycleCharles Bob Bob got an apple
  • 11. © 2014 MapR Technologies 11 Recommendations Alice got an apple and a puppyAlice Charles got a bicycleCharles Bob What else would Bob like?
  • 12. © 2014 MapR Technologies 12 Recommendations Alice got an apple and a puppyAlice Charles got a bicycleCharles Bob A puppy!
  • 13. © 2014 MapR Technologies 13 By the way, like me, Bob also wants a pony…
  • 14. © 2014 MapR Technologies 14 Recommendations ? Alice Bob Charles Amelia What if everybody gets a pony? What else would you recommend for new user Amelia?
  • 15. © 2014 MapR Technologies 15 Recommendations ? Alice Bob Charles Amelia If everybody gets a pony, it’s not a very good indicator of what to else predict...
  • 16. © 2014 MapR Technologies 16 Problems with Raw Co-occurrence • Very popular items co-occur with everything or why it’s not very helpful to know that everybody wants a pony… – Examples: Welcome document; Elevator music • Very widespread occurrence is not interesting to generate indicators for recommendation – Unless you want to offer an item that is constantly desired, such as razor blades (or ponies) • What we want is anomalous co-occurrence – This is the source of interesting indicators of preference on which to base recommendation
  • 17. © 2014 MapR Technologies 17 Overview: Get Useful Indicators from Behaviors 1. Use log files to build history matrix of users x items – Remember: this history of interactions will be sparse compared to all potential combinations 2. Transform to a co-occurrence matrix of items x items 3. Look for useful indicators by identifying anomalous co-occurrences to make an indicator matrix – Log Likelihood Ratio (LLR) can be helpful to judge which co- occurrences can with confidence be used as indicators of preference – ItemSimilarityJob in Apache Mahout uses LLR
  • 18. © 2014 MapR Technologies 18 Which one is the anomalous co-occurrence? A not A B 13 1000 not B 1000 100,000 A not A B 1 0 not B 0 10,000 A not A B 10 0 not B 0 100,000 A not A B 1 0 not B 0 2
  • 19. © 2014 MapR Technologies 19 Which one is the anomalous co-occurrence? A not A B 13 1000 not B 1000 100,000 A not A B 1 0 not B 0 10,000 A not A B 10 0 not B 0 100,000 A not A B 1 0 not B 0 2 0.90 1.95 4.52 14.3 Dunning Ted, Accurate Methods for the Statistics of Surprise and Coincidence, Computational Linguistics vol 19 no. 1 (1993)
  • 20. © 2014 MapR Technologies 20 Collection of Documents: Insert Meta-Data Search Engine Item meta-data Document for “puppy” id: t4 title: puppy desc: The sweetest little puppy ever. keywords: puppy, dog, pet Ingest easily via NFS ✔ indicators: (t1)
  • 21. © 2014 MapR Technologies 22 Cooccurrence Mechanics • Cooccurrence is just a self-join for each user, i for each history item j1 in Ai* for each history item j2 in Ai* count pair (j1, j2) aij1 aij2 = ¢A A i å
  • 22. © 2014 MapR Technologies 23 Cross-occurrence Mechanics • Cross occurrence is just a self-join of adjoined matrices for each user, i for each history item j1 in Ai* for each history item j2 in Bi* count pair (j1, j2) aij1 bij2 = ¢A B i å A | B[ ]¢ A | B[ ] = ¢A A ¢A B ¢B A ¢B B é ë ê ù û ú
  • 23. © 2014 MapR Technologies 24 A word about scaling
  • 24. © 2014 MapR Technologies 25 A few pragmatic tricks • Downsample all user histories to max length (interaction cut) – Can be random or most-recent (no apparent effect on accuracy) – Prolific users are often pathological anyway – Common limit is 300 items (no apparent effect on accuracy) • Downsample all items to limit max viewers (frequency limit) – Can be random or earliest (no apparent effect) – Ubiquitous items are uninformative – Common limit is 500 users (no apparent effect) Schelter, et al. Scalable similarity-based neighborhood methods with MapReduce. Proceedings of the sixth ACM conference on Recommender systems. 2012
  • 25. © 2014 MapR Technologies 26 But note! • Number of pairs for a user history with ki distinct items is ≈ ki 2/2 • Average size of user history increases with increasing dataset – Average may grow more slowly than N (or not!) – Full cooccurrence cost grows strictly faster than N – i.e. it just doesn’t scale • Downsampling interactions places bounds on per user cost – Cooccurrence with interaction cut is scalable t µ ki 2 i å Î o(N) t µ min kmax 2 ,ki 2 ( )< Nkmax 2 i å Î O(N)
  • 26. © 2014 MapR Technologies 27 0 200 400 600 800 1000 0123456 Benefit of down−sampling User Limit Pairs(x109 ) Without down−sampling Track limit = 1000 500 200 Computed on 48,373,586 pair−wise triples from the million song dataset ● ●
  • 27. © 2014 MapR Technologies 28 Batch Scaling in Time Implies Scaling in Space • Note: – With frequency limit sampling, max cooccurrence count is small (<1000) – With interaction cut, total number of non-zero pairs is relatively small – Entire cooccurrence matrix can be stored in memory in ~10-15 GB • Specifically: – With interaction cut, cooccurrence scales in size – Without interaction cut, cooccurrence does not scale size-wise ¢A A 0 Î O(N) ¢A A 0 Îw(N)
  • 28. © 2014 MapR Technologies 29 Impact of Interaction Cut Downsampling • Interaction cut allows batch cooccurrence analysis to be O(N) in time and space • This is intriguing – Amortized cost is low – Could this be extended to an on-line form? • Incremental matrix factorization is hard – Could cooccurrence be a key alternative? • Scaling matters most at scale – Cooccurrence is very accurate at large scale – Factorization shows benefits at smaller scales
  • 29. © 2014 MapR Technologies 30 Online update
  • 30. © 2014 MapR Technologies 31 Requirements for Online Algorithms • Each unit of input must require O(1) work – Theoretical bound • The constants have to be small enough on average – Pragmatic constraint • Total accumulated data must be small (enough) – Pragmatic constraint
  • 31. © 2014 MapR Technologies 32 Log Files Search Technology Item Meta-Data via NFS MapR Cluster via NFS PostPre Recommendations New User History Web Tier Recommendations happen in real-time Batch co- occurrence Want this to be real-time Real-time recommendations using MapR data platform
  • 32. © 2014 MapR Technologies 33 Space Bound Implies Time Bound • Because user histories are pruned, only a limited number of value updates need be made with each new observation • This bound is just twice the interaction cut kmax – Which is a constant • Bounding the number of updates trivially bounds the time
  • 33. © 2014 MapR Technologies 34 Implications for Online Update A+eij( )¢ A+eij( )- ¢A A = ¢eij A+ ¢A eij +djj = 0 Ai* 0 é ë ê ê ê ù û ú ú ú + 0 Ai*( )¢ 0 é ë ê ù û ú+djj
  • 34. © 2014 MapR Technologies 35 Ai* + Ai*( )¢ +dii 0 £ 2kmax -1Î O(1) With interaction cut at kmax
  • 35. © 2014 MapR Technologies 36 But Wait, It Gets Better • The new observation may be pruned – For users at the interaction cut, we can ignore updates – For items at the frequency cut, we can ignore updates – Ignoring updates only affects indicators, not recommendation query – At million song dataset size, half of all updates are pruned • On average ki is much less than the interaction cut – For million song dataset, average appears to grow with log of frequency limit, with little dependency on values of interaction cut > 200 • LLR cutoff avoids almost all updates to index • Average grows slowly with frequency cut
  • 36. © 2014 MapR Technologies 37 0 200 400 600 800 1000 05101520253035 Interaction cut (kmax) kave Frequency cut = 1000 = 500 = 200
  • 37. © 2014 MapR Technologies 38 0 200 400 600 800 1000 05101520253035 Frequency cut kave
  • 38. © 2014 MapR Technologies 39 Recap • Cooccurrence-based recommendations are simple – Deploying with a search engine is even better • Interaction cut and frequency cut are key to batch scalability • Similar effect occurs in online form of updates – Only dozens of updates per transaction needed – Data structure required is relatively small – Very, very few updates cause search engine updates • Fully online recommendation is very feasible, almost easy
  • 39. © 2014 MapR Technologies 40 More Details Available available for free at available for free at http://www.mapr.com/practical-machine-learning
  • 40. © 2014 MapR Technologies 41 Who I am Ted Dunning, Chief Applications Architect, MapR Technologies Email tdunning@mapr.com tdunning@apache.org Twitter @Ted_Dunning Apache Mahout https://mahout.apache.org/ Twitter @ApacheMahout Apache Drill http://incubator.apache.org/drill/ Twitter @ApacheDrill
  • 41. © 2014 MapR Technologies 42 Q&A @mapr maprtech tdunning@mapr.com Engage with us! MapR maprtech mapr-technologies

Notes de l'éditeur

  1. Mention that the Pony book said “RowSimilarityJob”…