SlideShare une entreprise Scribd logo
1  sur  56
Télécharger pour lire hors ligne
Which Algorithms Really Matter?

©MapR Technologies 2013

1
Me, Us


Ted Dunning, Chief Application Architect, MapR
Committer PMC member, Mahout, Zookeeper, Drill
Bought the beer at the first HUG



MapR
Distributes more open source components for Hadoop
Adds major technology for performance, HA, industry standard API’s



Info
Hash tag - #mapr
See also - @ApacheMahout @ApacheDrill
@ted_dunning and @mapR

©MapR Technologies 2013

2
Topic For Today


What is important? What is not?



Why?



What is the difference from academic research?



Some examples

©MapR Technologies 2013

4
What is Important?


Deployable



Robust



Transparent



Skillset and mindset matched?



Proportionate

©MapR Technologies 2013

5
What is Important?


Deployable
–

Clever prototypes don’t count if they can’t be standardized



Robust



Transparent



Skillset and mindset matched?



Proportionate

©MapR Technologies 2013

6
What is Important?


Deployable
–



Robust
–



Clever prototypes don’t count
Mishandling is common

Transparent
–

Will degradation be obvious?



Skillset and mindset matched?



Proportionate

©MapR Technologies 2013

7
What is Important?


Deployable
–



Robust
–



Will degradation be obvious?

Skillset and mindset matched?
–



Mishandling is common

Transparent
–



Clever prototypes don’t count

How long will your fancy data scientist enjoy doing standard ops tasks?

Proportionate
–

Where is the highest value per minute of effort?

©MapR Technologies 2013

8
Academic Goals vs Pragmatics


Academic goals
–
–

–



Reproducible
Isolate theoretically important aspects
Work on novel problems

Pragmatics
–
–
–
–
–

Highest net value
Available data is constantly changing
Diligence and consistency have larger impact than cleverness
Many systems feed themselves, exploration and exploitation are both
important
Engineering constraints on budget and schedule

©MapR Technologies 2013

9
Example 1:
Making Recommendations Better

©MapR Technologies 2013

10
Recommendation Advances


What are the most important algorithmic advances in
recommendations over the last 10 years?



Cooccurrence analysis?



Matrix completion via factorization?



Latent factor log-linear models?



Temporal dynamics?

©MapR Technologies 2013

11
The Winner – None of the Above


What are the most important algorithmic advances in
recommendations over the last 10 years?

1. Result dithering
2. Anti-flood

©MapR Technologies 2013

12
The Real Issues


Exploration



Diversity



Speed



Not the last fraction of a percent

©MapR Technologies 2013

13
Result Dithering


Dithering is used to re-order recommendation results
–

Re-ordering is done randomly



Dithering is guaranteed to make off-line performance worse



Dithering also has a near perfect record of making actual
performance much better

©MapR Technologies 2013

14
Result Dithering


Dithering is used to re-order recommendation results
–

Re-ordering is done randomly



Dithering is guaranteed to make off-line performance worse



Dithering also has a near perfect record of making actual
performance much better

“Made more difference than any other change”
©MapR Technologies 2013

15
Simple Dithering Algorithm


Generate synthetic score from log rank plus Gaussian

s = logr + N(0, e )


Pick noise scale to provide desired level of mixing

Dr µ r exp e


Typically

e Î [ 0.4, 0.8]


Oh… use floor(t/T) as seed

©MapR Technologies 2013

16
Example … ε = 0.5
1
1
1
1
1
1
1
2
4
2
3
2
©MapR Technologies 2013

2
2
4
2
6
2
2
1
1
1
1
1

6
3
3
4
2
3
3
3
2
5
5
3

5
8
2
3
3
5
4
5
7
3
4
4

3
5
6
15
4
24
6
7
3
4
2
7
17

4
7
7
7
16
7
12
6
9
7
7
12

13
6
11
13
9
17
5
4
8
13
8
17

16
34
10
19
5
13
14
17
5
6
6
16
Example … ε = log 2 = 0.69
1
1
1
1
1
1
1
2
2
3
11
1
©MapR Technologies 2013

2
8
3
2
5
2
3
4
3
4
1
8

8
14
8
10
33
7
5
11
1
1
2
7

3
15
2
7
15
3
23
8
4
2
4
3

9
3
10
3
2
5
9
3
6
10
5
22
18

15
2
5
8
9
4
7
1
7
11
7
11

7
22
7
6
11
19
4
44
8
15
3
2

6
10
4
14
29
6
2
9
33
14
14
33
Exploring The Second Page

©MapR Technologies 2013

19
Lesson 1:
Exploration is good

©MapR Technologies 2013

20
Example 2:
Bayesian Bandits

©MapR Technologies 2013

21
Bayesian Bandits


Based on Thompson sampling



Very general sequential test



Near optimal regret



Trade-off exploration and exploitation



Possibly best known solution for exploration/exploitation



Incredibly simple

©MapR Technologies 2013

22
Thompson Sampling


Select each shell according to the probability that it is the best



Probability that it is the best can be computed using posterior

é
ù
P(i is best) = ò I êE[ri | q ] = max E[rj | q ]ú P(q | D) dq
ë
û
j


But I promised a simple answer

©MapR Technologies 2013

23
Thompson Sampling – Take 2


Sample θ

q ~ P(q | D)


Pick i to maximize reward

i = argmax E[rj | q ]
j



Record result from using i

©MapR Technologies 2013

24
Fast Convergence
0.12
0.11
0.1
0.09
0.08

regret

0.07
0.06

ε- greedy, ε = 0.05
0.05
0.04

Bayesian Bandit with Gam m a- Norm al

0.03
0.02
0.01
0
0

100

200

300

400

500

600
n

©MapR Technologies 2013

25

700

800

900

1000

1100
Thompson Sampling on Ads

An Empirical Evaluation of Thompson Sampling - Chapelle and Li, 2011
©MapR Technologies 2013

26
Bayesian Bandits versus Result Dithering


Many useful systems are difficult to frame in fully Bayesian form



Thompson sampling cannot be applied without posterior sampling



Can still do useful exploration with dithering



But better to use Thompson sampling if possible

©MapR Technologies 2013

27
Lesson 2:
Exploration is pretty
easy to do and pays
big benefits.

©MapR Technologies 2013

28
Example 3:
On-line Clustering

©MapR Technologies 2013

29
The Problem


K-means clustering is useful for feature extraction or compression



At scale and at high dimension, the desirable number of clusters
increases



Very large number of clusters may require more passes through
the data



Super-linear scaling is generally infeasible

©MapR Technologies 2013

30
The Solution


Sketch-based algorithms produce a sketch of the data



Streaming k-means uses adaptive dp-means to produce this sketch
in the form of many weighted centroids which approximate the
original distribution



The size of the sketch grows very slowly with increasing data size



Many operations such as clustering are well behaved on sketches

Fast and Accurate k-means For Large Datasets. Michael Shindler, Alex Wong, Adam Meyerson.
Revisiting k-means: New Algorithms via Bayesian Nonparametrics . Brian Kulis, Michael Jordan.

©MapR Technologies 2013

31
An Example

©MapR Technologies 2013

32
An Example

©MapR Technologies 2013

33
The Cluster Proximity Features


Every point can be described by the nearest cluster
–
–



Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign
bit + 2 proximities)
–
–



4.3 bits per point in this case
Significant error that can be decreased (to a point) by increasing number of
clusters

Error is negligible
Unwinds the data into a simple representation

Or we can increase the number of clusters (n fold increase adds log
n bits per point, decreases error by sqrt(n)

©MapR Technologies 2013

34
Diagonalized Cluster Proximity

©MapR Technologies 2013

35
Lots of Clusters Are Fine

©MapR Technologies 2013

36
Typical k-means Failure

Selecting two seeds
here cannot be
fixed with Lloyds
Result is that these two
clusters get glued
together

©MapR Technologies 2013

37
Streaming k-means Ideas


By using a sketch with lots (k log N) of centroids, we avoid
pathological cases



We still get a very good result if the sketch is created
–
–

in one pass
with approximate search



In fact, adaptive dp-means works just fine



In the end, the sketch can be used for clustering or …

©MapR Technologies 2013

38
Lesson 3:
Sketches make big
data small.

©MapR Technologies 2013

39
Example 4:
Search Abuse

©MapR Technologies 2013

40
Recommendations

Alice

Charles

©MapR Technologies 2013

Alice got an apple and a
puppy

Charles got a bicycle

41
Recommendations

Alice

Bob

Charles

©MapR Technologies 2013

Alice got an apple and a
puppy

Bob got an apple

Charles got a bicycle

42
Recommendations

Alice

Bob

?

What else would Bob like?

Charles

©MapR Technologies 2013

43
Log Files
Alice
Charles
Charles
Alice

Alice
Bob
Bob
©MapR Technologies 2013

44
History Matrix: Users by Items

Alice

✔

Bob

✔

Charles

©MapR Technologies 2013

✔

✔
✔
✔

45

✔
Co-occurrence Matrix: Items by Items
How do you tell which co-occurrences are useful?.

1

2

1

1

2

©MapR Technologies 2013

1

0

-

0

1

1
46

0
0
Co-occurrence Binary Matrix

not
not

©MapR Technologies 2013

1
1

47

1
Indicator Matrix: Anomalous Co-Occurrence
Result: The marked row will be added to the indicator
field in the item document…

✔

✔

©MapR Technologies 2013

48
Indicator Matrix
That one row from indicator matrix becomes the indicator field in the Solr
document used to deploy the recommendation engine.

✔
id: t4
title: puppy
desc: The sweetest little puppy ever.
keywords: puppy, dog, pet
indicators:

(t1)

Note: data for the indicator field is added directly to meta-data for a document in
Solr index. You don’t need to create a separate index for the indicators.
©MapR Technologies 2013

49
Internals of the Recommender Engine

50

©MapR Technologies 2013

50
Internals of the Recommender Engine

51

©MapR Technologies 2013

51
Looking Inside LucidWorks
Real-time recommendation query and results: Evaluation

What to recommend if new user listened to 2122: Fats Domino & 303: Beatles?
Recommendation is “1710 : Chuck Berry”
52

©MapR Technologies 2013

52
Real-life example

©MapR Technologies 2013

53
Lesson 4:
Recursive search abuse pays
Search can implement recs
Which can implement search

©MapR Technologies 2013

54
Summary

©MapR Technologies 2013

55
©MapR Technologies 2013

56
Me, Us


Ted Dunning, Chief Application Architect, MapR
Committer PMC member, Mahout, Zookeeper, Drill
Bought the beer at the first HUG



MapR
Distributes more open source components for Hadoop
Adds major technology for performance, HA, industry standard API’s



Info
Hash tag - #mapr
See also - @ApacheMahout @ApacheDrill
@ted_dunning and @mapR

©MapR Technologies 2013

57

Contenu connexe

Tendances

اثر تطبيق رمز الاستجابة السريعة QR code في تعزيز خدمة الرفوف المفتوحة بالمكتب...
اثر تطبيق رمز الاستجابة السريعة QR code في تعزيز خدمة الرفوف المفتوحة بالمكتب...اثر تطبيق رمز الاستجابة السريعة QR code في تعزيز خدمة الرفوف المفتوحة بالمكتب...
اثر تطبيق رمز الاستجابة السريعة QR code في تعزيز خدمة الرفوف المفتوحة بالمكتب...abedelaziz benzine
 
Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...
Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...
Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...Sunil Nair
 
Algorithmes machine learning/ neural network / deep learning
Algorithmes machine learning/ neural network / deep learningAlgorithmes machine learning/ neural network / deep learning
Algorithmes machine learning/ neural network / deep learningBassem Brayek
 
Mémoire de fin d’études : Master II Big Data et fouille de données
Mémoire de fin d’études : Master II Big Data et fouille de donnéesMémoire de fin d’études : Master II Big Data et fouille de données
Mémoire de fin d’études : Master II Big Data et fouille de donnéesCamelia Mastani
 
Hierarchical clustering
Hierarchical clustering Hierarchical clustering
Hierarchical clustering Ashek Farabi
 
Machine learning Algorithms
Machine learning AlgorithmsMachine learning Algorithms
Machine learning AlgorithmsWalaa Hamdy Assy
 
Machine learning for customer classification
Machine learning for customer classificationMachine learning for customer classification
Machine learning for customer classificationAndrew Barnes
 
optimisation cours.pdf
optimisation cours.pdfoptimisation cours.pdf
optimisation cours.pdfMouloudi1
 
Ppt techniques de colectes de donnees en suivi evaluation
Ppt techniques de colectes de donnees en suivi evaluationPpt techniques de colectes de donnees en suivi evaluation
Ppt techniques de colectes de donnees en suivi evaluationUSIGGENEVE
 
Introduction to Machine learning
Introduction to Machine learningIntroduction to Machine learning
Introduction to Machine learningQuentin Ambard
 
Multiclass classification of imbalanced data
Multiclass classification of imbalanced dataMulticlass classification of imbalanced data
Multiclass classification of imbalanced dataSaurabhWani6
 
Racing for unbalanced methods selection
Racing for unbalanced methods selectionRacing for unbalanced methods selection
Racing for unbalanced methods selectionAndrea Dal Pozzolo
 
Supervised Machine Learning in R
Supervised  Machine Learning  in RSupervised  Machine Learning  in R
Supervised Machine Learning in RBabu Priyavrat
 
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...Simplilearn
 
Toxic Comment Classification
Toxic Comment ClassificationToxic Comment Classification
Toxic Comment Classificationijtsrd
 
ورشة تطبيقات الواقع المعزز
ورشة تطبيقات الواقع المعززورشة تطبيقات الواقع المعزز
ورشة تطبيقات الواقع المعززnourah abdulrahman
 
22 Machine Learning Feature Selection
22 Machine Learning Feature Selection22 Machine Learning Feature Selection
22 Machine Learning Feature SelectionAndres Mendez-Vazquez
 
Slides de présentation de la thèse du doctorat
Slides de présentation de la thèse du doctoratSlides de présentation de la thèse du doctorat
Slides de présentation de la thèse du doctoratZyad Elkhadir
 

Tendances (20)

اثر تطبيق رمز الاستجابة السريعة QR code في تعزيز خدمة الرفوف المفتوحة بالمكتب...
اثر تطبيق رمز الاستجابة السريعة QR code في تعزيز خدمة الرفوف المفتوحة بالمكتب...اثر تطبيق رمز الاستجابة السريعة QR code في تعزيز خدمة الرفوف المفتوحة بالمكتب...
اثر تطبيق رمز الاستجابة السريعة QR code في تعزيز خدمة الرفوف المفتوحة بالمكتب...
 
Machine Learning_Unit 2_Full.ppt.pdf
Machine Learning_Unit 2_Full.ppt.pdfMachine Learning_Unit 2_Full.ppt.pdf
Machine Learning_Unit 2_Full.ppt.pdf
 
Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...
Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...
Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind...
 
Algorithmes machine learning/ neural network / deep learning
Algorithmes machine learning/ neural network / deep learningAlgorithmes machine learning/ neural network / deep learning
Algorithmes machine learning/ neural network / deep learning
 
Mémoire de fin d’études : Master II Big Data et fouille de données
Mémoire de fin d’études : Master II Big Data et fouille de donnéesMémoire de fin d’études : Master II Big Data et fouille de données
Mémoire de fin d’études : Master II Big Data et fouille de données
 
Hierarchical clustering
Hierarchical clustering Hierarchical clustering
Hierarchical clustering
 
Machine learning Algorithms
Machine learning AlgorithmsMachine learning Algorithms
Machine learning Algorithms
 
Machine learning for customer classification
Machine learning for customer classificationMachine learning for customer classification
Machine learning for customer classification
 
optimisation cours.pdf
optimisation cours.pdfoptimisation cours.pdf
optimisation cours.pdf
 
Ppt techniques de colectes de donnees en suivi evaluation
Ppt techniques de colectes de donnees en suivi evaluationPpt techniques de colectes de donnees en suivi evaluation
Ppt techniques de colectes de donnees en suivi evaluation
 
Introduction to Machine learning
Introduction to Machine learningIntroduction to Machine learning
Introduction to Machine learning
 
Multiclass classification of imbalanced data
Multiclass classification of imbalanced dataMulticlass classification of imbalanced data
Multiclass classification of imbalanced data
 
Racing for unbalanced methods selection
Racing for unbalanced methods selectionRacing for unbalanced methods selection
Racing for unbalanced methods selection
 
Supervised Machine Learning in R
Supervised  Machine Learning  in RSupervised  Machine Learning  in R
Supervised Machine Learning in R
 
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
 
Toxic Comment Classification
Toxic Comment ClassificationToxic Comment Classification
Toxic Comment Classification
 
Veille Informationnelle
Veille InformationnelleVeille Informationnelle
Veille Informationnelle
 
ورشة تطبيقات الواقع المعزز
ورشة تطبيقات الواقع المعززورشة تطبيقات الواقع المعزز
ورشة تطبيقات الواقع المعزز
 
22 Machine Learning Feature Selection
22 Machine Learning Feature Selection22 Machine Learning Feature Selection
22 Machine Learning Feature Selection
 
Slides de présentation de la thèse du doctorat
Slides de présentation de la thèse du doctoratSlides de présentation de la thèse du doctorat
Slides de présentation de la thèse du doctorat
 

Similaire à Which Algorithms Really Matter

How to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterHow to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterDataWorks Summit
 
Mahout and Recommendations
Mahout and RecommendationsMahout and Recommendations
Mahout and RecommendationsTed Dunning
 
DFW Big Data talk on Mahout Recommenders
DFW Big Data talk on Mahout RecommendersDFW Big Data talk on Mahout Recommenders
DFW Big Data talk on Mahout RecommendersTed Dunning
 
How to tell which algorithms really matter
How to tell which algorithms really matterHow to tell which algorithms really matter
How to tell which algorithms really matterDataWorks Summit
 
CMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceCMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceMapR Technologies
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with HadoopDataWorks Summit
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutTed Dunning
 
What's Right and Wrong with Apache Mahout
What's Right and Wrong with Apache MahoutWhat's Right and Wrong with Apache Mahout
What's Right and Wrong with Apache MahoutMapR Technologies
 
Goto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedGoto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedTed Dunning
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to MahoutTed Dunning
 
Introduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGIntroduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGMapR Technologies
 
Using Mahout and a Search Engine for Recommendation
Using Mahout and a Search Engine for RecommendationUsing Mahout and a Search Engine for Recommendation
Using Mahout and a Search Engine for RecommendationTed Dunning
 
Mathematical bridges From Old to New
Mathematical bridges From Old to NewMathematical bridges From Old to New
Mathematical bridges From Old to NewMapR Technologies
 
Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012MapR Technologies
 
DataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsDataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsEllen Friedman
 
Big Data LDN 2018: DATA OPERATIONS PROBLEMS CREATED BY DEEP LEARNING, AND HOW...
Big Data LDN 2018: DATA OPERATIONS PROBLEMS CREATED BY DEEP LEARNING, AND HOW...Big Data LDN 2018: DATA OPERATIONS PROBLEMS CREATED BY DEEP LEARNING, AND HOW...
Big Data LDN 2018: DATA OPERATIONS PROBLEMS CREATED BY DEEP LEARNING, AND HOW...Matt Stubbs
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeTed Dunning
 
Ted Dunning, Chief Application Architect, MapR at MLconf SF
Ted Dunning, Chief Application Architect, MapR at MLconf SFTed Dunning, Chief Application Architect, MapR at MLconf SF
Ted Dunning, Chief Application Architect, MapR at MLconf SFMLconf
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forTed Dunning
 

Similaire à Which Algorithms Really Matter (20)

How to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterHow to Determine which Algorithms Really Matter
How to Determine which Algorithms Really Matter
 
Mahout and Recommendations
Mahout and RecommendationsMahout and Recommendations
Mahout and Recommendations
 
DFW Big Data talk on Mahout Recommenders
DFW Big Data talk on Mahout RecommendersDFW Big Data talk on Mahout Recommenders
DFW Big Data talk on Mahout Recommenders
 
How to tell which algorithms really matter
How to tell which algorithms really matterHow to tell which algorithms really matter
How to tell which algorithms really matter
 
CMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceCMU Lecture on Hadoop Performance
CMU Lecture on Hadoop Performance
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with Hadoop
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache Mahout
 
What's Right and Wrong with Apache Mahout
What's Right and Wrong with Apache MahoutWhat's Right and Wrong with Apache Mahout
What's Right and Wrong with Apache Mahout
 
GoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 SkinnedGoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 Skinned
 
Goto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedGoto amsterdam-2013-skinned
Goto amsterdam-2013-skinned
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to Mahout
 
Introduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGIntroduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUG
 
Using Mahout and a Search Engine for Recommendation
Using Mahout and a Search Engine for RecommendationUsing Mahout and a Search Engine for Recommendation
Using Mahout and a Search Engine for Recommendation
 
Mathematical bridges From Old to New
Mathematical bridges From Old to NewMathematical bridges From Old to New
Mathematical bridges From Old to New
 
Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012
 
DataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsDataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven Organizations
 
Big Data LDN 2018: DATA OPERATIONS PROBLEMS CREATED BY DEEP LEARNING, AND HOW...
Big Data LDN 2018: DATA OPERATIONS PROBLEMS CREATED BY DEEP LEARNING, AND HOW...Big Data LDN 2018: DATA OPERATIONS PROBLEMS CREATED BY DEEP LEARNING, AND HOW...
Big Data LDN 2018: DATA OPERATIONS PROBLEMS CREATED BY DEEP LEARNING, AND HOW...
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
 
Ted Dunning, Chief Application Architect, MapR at MLconf SF
Ted Dunning, Chief Application Architect, MapR at MLconf SFTed Dunning, Chief Application Architect, MapR at MLconf SF
Ted Dunning, Chief Application Architect, MapR at MLconf SF
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
 

Plus de Ted Dunning

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxTed Dunning
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with KubernetesTed Dunning
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in KubernetesTed Dunning
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningTed Dunning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning LogisticsTed Dunning
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTed Dunning
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logisticsTed Dunning
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real DataTed Dunning
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteTed Dunning
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoopTed Dunning
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Ted Dunning
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data SecurelyTed Dunning
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownTed Dunning
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopTed Dunning
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015Ted Dunning
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossibleTed Dunning
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningTed Dunning
 
Cognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approachesCognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approachesTed Dunning
 
Recommendation Techn
Recommendation TechnRecommendation Techn
Recommendation TechnTed Dunning
 

Plus de Ted Dunning (20)

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptx
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with Kubernetes
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in Kubernetes
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine Learning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning Logistics
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworks
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logistics
 
T digest-update
T digest-updateT digest-update
T digest-update
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real Data
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC Keynote
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoop
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data Securely
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossible
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine Learning
 
Cognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approachesCognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approaches
 
Recommendation Techn
Recommendation TechnRecommendation Techn
Recommendation Techn
 

Dernier

A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFMichael Gough
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Jeffrey Haguewood
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Nikki Chapple
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Mark Simos
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessWSO2
 
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...itnewsafrica
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxAna-Maria Mihalceanu
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Karmanjay Verma
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentMahmoud Rabie
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialJoão Esperancinha
 

Dernier (20)

A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with Platformless
 
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance Toolbox
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career Development
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorial
 

Which Algorithms Really Matter

  • 1. Which Algorithms Really Matter? ©MapR Technologies 2013 1
  • 2. Me, Us  Ted Dunning, Chief Application Architect, MapR Committer PMC member, Mahout, Zookeeper, Drill Bought the beer at the first HUG  MapR Distributes more open source components for Hadoop Adds major technology for performance, HA, industry standard API’s  Info Hash tag - #mapr See also - @ApacheMahout @ApacheDrill @ted_dunning and @mapR ©MapR Technologies 2013 2
  • 3. Topic For Today  What is important? What is not?  Why?  What is the difference from academic research?  Some examples ©MapR Technologies 2013 4
  • 4. What is Important?  Deployable  Robust  Transparent  Skillset and mindset matched?  Proportionate ©MapR Technologies 2013 5
  • 5. What is Important?  Deployable – Clever prototypes don’t count if they can’t be standardized  Robust  Transparent  Skillset and mindset matched?  Proportionate ©MapR Technologies 2013 6
  • 6. What is Important?  Deployable –  Robust –  Clever prototypes don’t count Mishandling is common Transparent – Will degradation be obvious?  Skillset and mindset matched?  Proportionate ©MapR Technologies 2013 7
  • 7. What is Important?  Deployable –  Robust –  Will degradation be obvious? Skillset and mindset matched? –  Mishandling is common Transparent –  Clever prototypes don’t count How long will your fancy data scientist enjoy doing standard ops tasks? Proportionate – Where is the highest value per minute of effort? ©MapR Technologies 2013 8
  • 8. Academic Goals vs Pragmatics  Academic goals – – –  Reproducible Isolate theoretically important aspects Work on novel problems Pragmatics – – – – – Highest net value Available data is constantly changing Diligence and consistency have larger impact than cleverness Many systems feed themselves, exploration and exploitation are both important Engineering constraints on budget and schedule ©MapR Technologies 2013 9
  • 9. Example 1: Making Recommendations Better ©MapR Technologies 2013 10
  • 10. Recommendation Advances  What are the most important algorithmic advances in recommendations over the last 10 years?  Cooccurrence analysis?  Matrix completion via factorization?  Latent factor log-linear models?  Temporal dynamics? ©MapR Technologies 2013 11
  • 11. The Winner – None of the Above  What are the most important algorithmic advances in recommendations over the last 10 years? 1. Result dithering 2. Anti-flood ©MapR Technologies 2013 12
  • 12. The Real Issues  Exploration  Diversity  Speed  Not the last fraction of a percent ©MapR Technologies 2013 13
  • 13. Result Dithering  Dithering is used to re-order recommendation results – Re-ordering is done randomly  Dithering is guaranteed to make off-line performance worse  Dithering also has a near perfect record of making actual performance much better ©MapR Technologies 2013 14
  • 14. Result Dithering  Dithering is used to re-order recommendation results – Re-ordering is done randomly  Dithering is guaranteed to make off-line performance worse  Dithering also has a near perfect record of making actual performance much better “Made more difference than any other change” ©MapR Technologies 2013 15
  • 15. Simple Dithering Algorithm  Generate synthetic score from log rank plus Gaussian s = logr + N(0, e )  Pick noise scale to provide desired level of mixing Dr µ r exp e  Typically e Î [ 0.4, 0.8]  Oh… use floor(t/T) as seed ©MapR Technologies 2013 16
  • 16. Example … ε = 0.5 1 1 1 1 1 1 1 2 4 2 3 2 ©MapR Technologies 2013 2 2 4 2 6 2 2 1 1 1 1 1 6 3 3 4 2 3 3 3 2 5 5 3 5 8 2 3 3 5 4 5 7 3 4 4 3 5 6 15 4 24 6 7 3 4 2 7 17 4 7 7 7 16 7 12 6 9 7 7 12 13 6 11 13 9 17 5 4 8 13 8 17 16 34 10 19 5 13 14 17 5 6 6 16
  • 17. Example … ε = log 2 = 0.69 1 1 1 1 1 1 1 2 2 3 11 1 ©MapR Technologies 2013 2 8 3 2 5 2 3 4 3 4 1 8 8 14 8 10 33 7 5 11 1 1 2 7 3 15 2 7 15 3 23 8 4 2 4 3 9 3 10 3 2 5 9 3 6 10 5 22 18 15 2 5 8 9 4 7 1 7 11 7 11 7 22 7 6 11 19 4 44 8 15 3 2 6 10 4 14 29 6 2 9 33 14 14 33
  • 18. Exploring The Second Page ©MapR Technologies 2013 19
  • 19. Lesson 1: Exploration is good ©MapR Technologies 2013 20
  • 20. Example 2: Bayesian Bandits ©MapR Technologies 2013 21
  • 21. Bayesian Bandits  Based on Thompson sampling  Very general sequential test  Near optimal regret  Trade-off exploration and exploitation  Possibly best known solution for exploration/exploitation  Incredibly simple ©MapR Technologies 2013 22
  • 22. Thompson Sampling  Select each shell according to the probability that it is the best  Probability that it is the best can be computed using posterior é ù P(i is best) = ò I êE[ri | q ] = max E[rj | q ]ú P(q | D) dq ë û j  But I promised a simple answer ©MapR Technologies 2013 23
  • 23. Thompson Sampling – Take 2  Sample θ q ~ P(q | D)  Pick i to maximize reward i = argmax E[rj | q ] j  Record result from using i ©MapR Technologies 2013 24
  • 24. Fast Convergence 0.12 0.11 0.1 0.09 0.08 regret 0.07 0.06 ε- greedy, ε = 0.05 0.05 0.04 Bayesian Bandit with Gam m a- Norm al 0.03 0.02 0.01 0 0 100 200 300 400 500 600 n ©MapR Technologies 2013 25 700 800 900 1000 1100
  • 25. Thompson Sampling on Ads An Empirical Evaluation of Thompson Sampling - Chapelle and Li, 2011 ©MapR Technologies 2013 26
  • 26. Bayesian Bandits versus Result Dithering  Many useful systems are difficult to frame in fully Bayesian form  Thompson sampling cannot be applied without posterior sampling  Can still do useful exploration with dithering  But better to use Thompson sampling if possible ©MapR Technologies 2013 27
  • 27. Lesson 2: Exploration is pretty easy to do and pays big benefits. ©MapR Technologies 2013 28
  • 28. Example 3: On-line Clustering ©MapR Technologies 2013 29
  • 29. The Problem  K-means clustering is useful for feature extraction or compression  At scale and at high dimension, the desirable number of clusters increases  Very large number of clusters may require more passes through the data  Super-linear scaling is generally infeasible ©MapR Technologies 2013 30
  • 30. The Solution  Sketch-based algorithms produce a sketch of the data  Streaming k-means uses adaptive dp-means to produce this sketch in the form of many weighted centroids which approximate the original distribution  The size of the sketch grows very slowly with increasing data size  Many operations such as clustering are well behaved on sketches Fast and Accurate k-means For Large Datasets. Michael Shindler, Alex Wong, Adam Meyerson. Revisiting k-means: New Algorithms via Bayesian Nonparametrics . Brian Kulis, Michael Jordan. ©MapR Technologies 2013 31
  • 33. The Cluster Proximity Features  Every point can be described by the nearest cluster – –  Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities) – –  4.3 bits per point in this case Significant error that can be decreased (to a point) by increasing number of clusters Error is negligible Unwinds the data into a simple representation Or we can increase the number of clusters (n fold increase adds log n bits per point, decreases error by sqrt(n) ©MapR Technologies 2013 34
  • 35. Lots of Clusters Are Fine ©MapR Technologies 2013 36
  • 36. Typical k-means Failure Selecting two seeds here cannot be fixed with Lloyds Result is that these two clusters get glued together ©MapR Technologies 2013 37
  • 37. Streaming k-means Ideas  By using a sketch with lots (k log N) of centroids, we avoid pathological cases  We still get a very good result if the sketch is created – – in one pass with approximate search  In fact, adaptive dp-means works just fine  In the end, the sketch can be used for clustering or … ©MapR Technologies 2013 38
  • 38. Lesson 3: Sketches make big data small. ©MapR Technologies 2013 39
  • 39. Example 4: Search Abuse ©MapR Technologies 2013 40
  • 40. Recommendations Alice Charles ©MapR Technologies 2013 Alice got an apple and a puppy Charles got a bicycle 41
  • 41. Recommendations Alice Bob Charles ©MapR Technologies 2013 Alice got an apple and a puppy Bob got an apple Charles got a bicycle 42
  • 42. Recommendations Alice Bob ? What else would Bob like? Charles ©MapR Technologies 2013 43
  • 44. History Matrix: Users by Items Alice ✔ Bob ✔ Charles ©MapR Technologies 2013 ✔ ✔ ✔ ✔ 45 ✔
  • 45. Co-occurrence Matrix: Items by Items How do you tell which co-occurrences are useful?. 1 2 1 1 2 ©MapR Technologies 2013 1 0 - 0 1 1 46 0 0
  • 46. Co-occurrence Binary Matrix not not ©MapR Technologies 2013 1 1 47 1
  • 47. Indicator Matrix: Anomalous Co-Occurrence Result: The marked row will be added to the indicator field in the item document… ✔ ✔ ©MapR Technologies 2013 48
  • 48. Indicator Matrix That one row from indicator matrix becomes the indicator field in the Solr document used to deploy the recommendation engine. ✔ id: t4 title: puppy desc: The sweetest little puppy ever. keywords: puppy, dog, pet indicators: (t1) Note: data for the indicator field is added directly to meta-data for a document in Solr index. You don’t need to create a separate index for the indicators. ©MapR Technologies 2013 49
  • 49. Internals of the Recommender Engine 50 ©MapR Technologies 2013 50
  • 50. Internals of the Recommender Engine 51 ©MapR Technologies 2013 51
  • 51. Looking Inside LucidWorks Real-time recommendation query and results: Evaluation What to recommend if new user listened to 2122: Fats Domino & 303: Beatles? Recommendation is “1710 : Chuck Berry” 52 ©MapR Technologies 2013 52
  • 53. Lesson 4: Recursive search abuse pays Search can implement recs Which can implement search ©MapR Technologies 2013 54
  • 56. Me, Us  Ted Dunning, Chief Application Architect, MapR Committer PMC member, Mahout, Zookeeper, Drill Bought the beer at the first HUG  MapR Distributes more open source components for Hadoop Adds major technology for performance, HA, industry standard API’s  Info Hash tag - #mapr See also - @ApacheMahout @ApacheDrill @ted_dunning and @mapR ©MapR Technologies 2013 57

Notes de l'éditeur

  1. * A history of what everybody has done. Obviously this is just a cartoon because large numbers of users and interactions with items would be required to build a recommender* Next step will be to predict what a new user might like…
  2. *Bob is the “new user” and getting apple is his history
  3. *Here is where the recommendation engine needs to go to work…Note to trainer: you might see if audience calls out the answer before revealing next slide…
  4. Note to trainer: This is the situation similar to that in which we started, with three users in our history. The difference is that now everybody got a pony. Bob has apple and pony but not a puppy…yet
  5. *Binary matrix is stored sparsely
  6. *Convert by MapReduce into a binary matrixNote to trainer: Whether consider apple to have occurred with self is open question
  7. Old joke: all the world can be divided into 2 categories: Scotch tape and non-Scotch tape… This is a way to think about the co-occurrence
  8. Only important co-occurrence is puppy follows apple
  9. *Take that row of matrix and combine with all the meta data we might have…*Important thing to get from the co-occurrence matrix is this indicator..Cool thing: analogous to what a lot of recommendation engines do*This row forms the indicator field in a Solr document containing meta-data (you do NOT have to build a separate index for the indicators)Find the useful co-occurrence and get rid of the rest. Sparsify and get the anomalous co-occurrence
  10. Note to trainer: take a little time to explore this here and on the next couple of slides. Details enlarged on next slide
  11. *This indicator field is where the output of the Mahout recommendation engine are stored (the row from the indicator matrix that identified significant or interesting co-occurrence. *Keep in mind that this recommendation indicator data is added to the same original document in the Solr index that contains meta data for the item in question
  12. This is a diagnostics window in the LucidWorksSolr index (not the web interface a user would see). It’s a way for the developer to do a rough evaluation (laugh test) of the choices offered by the recommendation engine.In other words, do these indicator artists represented by their indicator Id make reasonable recommendations Note to trainer: artist 303 happens to be The Beatles. Is that a good match for Chuck Berry?