1©MapR Technologies - Confidential
News From Mahout
whoami – Ted Dunning
▪ Chief Application Architect, MapR Technologies
▪ Committer, member, Apache Software Foundation
– particularly Mahout, Zookeeper and Drill
(we’re hiring)
▪ Contact me at
tdunning@maprtech.com
tdunning@apache.com
ted.dunning@gmail.com
@ted_dunning
▪ Slides and such (available late tonight):
– http://www.mapr.com/company/events/nyhug-03-05-2013
▪ Hash tags: #mapr #nyhug #mahout
New in Mahout
▪ 0.8 is coming soon (1-2 months)
▪ gobs of fixes
▪ QR decomposition is 10x faster
– makes ALS 2-3 times faster
▪ May include Bayesian Bandits
▪ Super fast k-means
– fast
– online (!?!)
– fast
▪ Possible new edition of MiA (Mahout in Action) coming
– Japanese and Korean editions released, Chinese coming
Real-time Learning
We have a product to sell … from a web-site
Bogus Dog Food is the Best!
Now available in handy 1 ton bags!
Buy 5!
What picture?
What tag-line?
What call to action?
The Challenge
▪ Design decisions affect probability of success
– Cheesy web-sites don’t even sell cheese
▪ The best designers do better when allowed to fail
– Exploration juices creativity
▪ But failing is expensive
– If only because we could have succeeded
– But also because offending or disappointing customers is bad
More Challenges
▪ Too many designs
– 5 pictures
– 10 tag-lines
– 4 calls to action
– 3 background colors
=> 5 x 10 x 4 x 3 = 600 designs
▪ It gets worse quickly
– What about changes on the back-end?
– Search engine variants?
– Checkout process variants?
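The design count is the size of a Cartesian product, which is why every extra choice multiplies the total. A minimal sketch, in which the option names are placeholders and only the counts (5, 10, 4, 3) come from the slide:

```python
from itertools import product

# Placeholder option names; only the counts (5, 10, 4, 3) come from the slide.
pictures = [f"picture-{i}" for i in range(5)]
tag_lines = [f"tag-line-{i}" for i in range(10)]
calls_to_action = [f"call-{i}" for i in range(4)]
backgrounds = [f"color-{i}" for i in range(3)]

# Every design is one choice from each list.
designs = list(product(pictures, tag_lines, calls_to_action, backgrounds))
print(len(designs))  # 600
```

Adding, say, two checkout variants would double the total again, which is the sense in which it gets worse quickly.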
Example – A/B testing in real-time
▪ I have 15 versions of my landing page
▪ Each visitor is assigned to a version
– Which version?
▪ A conversion or sale or whatever can happen
– How long to wait?
▪ Some versions of the landing page are horrible
– Don’t want to give them traffic
A Quick Diversion
▪ You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?
▪ I flip the coin and while it is in the air ask again
▪ I catch the coin and ask again
▪ I look at the coin (and you don’t) and ask again
▪ Why does the answer change?
– And did it ever have a single value?
A Philosophical Conclusion
▪ Probability as expressed by humans is subjective and depends on information and experience
I Dunno
5 heads out of 10 throws
2 heads out of 12 throws
So now you understand Bayesian probability
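One way to make the three coin states concrete, assuming a uniform Beta(1, 1) prior on the probability of heads (the slides do not state a prior; this is an assumption): after h heads in n throws the posterior is Beta(h + 1, n - h + 1). The helper names `posterior` and `mean` are hypothetical.

```python
import random

def posterior(heads, throws):
    """Beta posterior parameters for P(heads), assuming a uniform Beta(1, 1) prior."""
    return heads + 1, throws - heads + 1

def mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# "I dunno" (no data), 5 heads out of 10, 2 heads out of 12.
for heads, throws in [(0, 0), (5, 10), (2, 12)]:
    a, b = posterior(heads, throws)
    draw = random.betavariate(a, b)  # one plausible value of P(heads)
    print(f"{heads}/{throws}: Beta({a}, {b}), mean {mean(a, b):.3f}, sample {draw:.3f}")
```

No data and 5 heads out of 10 both have mean 0.5; what changes is how tightly the distribution concentrates around that value.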
Another Quick Diversion
▪ Let’s play a shell game
▪ This is a special shell game
▪ It costs you nothing to play
▪ The pea has constant probability of being under each shell (trust me)
▪ How do you find the best shell?
▪ How do you find it while maximizing the number of wins?
Pause for short con-game
Interim Thoughts
▪ Can you identify winners or losers without trying them out?
▪ Can you ever completely eliminate a shell with a bad streak?
▪ Should you keep trying apparent losers?
So now you understand multi-armed bandits
Conclusions
▪ Can you identify winners or losers without trying them out?
No
▪ Can you ever completely eliminate a shell with a bad streak?
No
▪ Should you keep trying apparent losers?
Yes, but at a decreasing rate
Is there an optimum strategy?
Bayesian Bandit
▪ Compute distributions based on data so far
▪ Sample p1, p2 and p3 from these distributions
▪ Pick shell i where i = argmax_i p_i
▪ Lemma 1: The probability of picking shell i will match the probability it is the best shell
▪ Lemma 2: This is as good as it gets
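The sample-then-argmax rule can be sketched for the shell game as follows. This is an illustration rather than the Mahout implementation: it assumes win/lose payoffs, a uniform Beta prior on each shell, and made-up hidden win rates; the function names `select` and `run` are hypothetical.

```python
import random

def select(counts):
    """Sample a win rate for each shell from its posterior and pick the largest.

    counts[i] = [losses, wins] for shell i; the +1 terms assume a uniform prior.
    """
    samples = [random.betavariate(wins + 1, losses + 1) for losses, wins in counts]
    return max(range(len(counts)), key=samples.__getitem__)

def run(true_rates, steps, seed=0):
    """Play `steps` rounds; true_rates is hidden from select()."""
    random.seed(seed)
    counts = [[0, 0] for _ in true_rates]
    for _ in range(steps):
        i = select(counts)
        won = random.random() < true_rates[i]
        counts[i][1 if won else 0] += 1
    return counts

# Made-up win rates for three shells.
counts = run([0.1, 0.3, 0.6], steps=2000)
plays = [losses + wins for losses, wins in counts]
print(plays)
```

Losing shells are never ruled out entirely; they are simply sampled as the maximum less and less often as their posteriors tighten.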
And it works!
[Figure: regret versus n, for n up to 1000 and regret from 0 to 0.12; series: ε-greedy with ε = 0.05, and Bayesian Bandit with Gamma-Normal]
Video Demo
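The Code
▪ Select an alternative
# k is an n x 2 matrix of counts, one row per alternative:
# column 1 counts losses, column 2 counts wins
select = function(k) {
  n = dim(k)[1]
  p0 = rep(0, length.out=n)
  for (i in 1:n) {
    # sample a plausible success rate from the Beta posterior
    p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
  }
  return (which.max(p0))
}
▪ Select and learn
# test(i) is assumed to return the column (1 or 2) to increment;
# the function name learn is an assumption
learn = function(k, steps) {
  for (z in 1:steps) {
    i = select(k)
    j = test(i)
    k[i,j] = k[i,j]+1
  }
  return (k)
}
▪ But we already know how to count!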
The Basic Idea
▪ We can encode a distribution by sampling
▪ Sampling allows unification of exploration and exploitation
▪ Can be extended to more general response models
The Original Problem
Bogus Dog Food is the Best!
Now available in handy 1 ton bags!
Buy 5!
x1, x2, x3
Response Function
$p(\text{win}) = w\left(\sum_i \theta_i x_i\right)$
[Figure: plot of y against x; x axis from -6 to 6, y axis from 0 to 1]
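A small sketch of such a response function, assuming w is the logistic function (the slide does not name w; the 0-to-1 range of the plot is the only hint) and using made-up weights θ for the three design features x1, x2, x3:

```python
import math

def logistic(t):
    """Assumed link function w; the slide does not name it."""
    return 1.0 / (1.0 + math.exp(-t))

def p_win(theta, x):
    """p(win) = w(sum_i theta_i * x_i) for one design encoded as the vector x."""
    return logistic(sum(t * xi for t, xi in zip(theta, x)))

# Hypothetical weights for three design features (1 = feature present).
theta = [-2.0, 0.5, 1.0]
print(p_win(theta, [1, 0, 0]))
print(p_win(theta, [1, 1, 1]))
```

With a model like this, 600 designs share a handful of weights instead of each needing its own conversion estimate.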
Generalized Banditry
▪ Suppose we have an infinite number of bandits
– suppose they are each labeled by two real numbers x and y in [0,1]
– also that expected payoff is a parameterized function of x and y
$E[z] = f(x, y \mid \theta)$
– now assume a distribution for θ that we can learn online
▪ Selection works by sampling θ, then computing f
▪ Learning works by propagating updates back to θ
– If f is linear, this is very easy
– For certain other kinds of f it isn’t too hard
▪ Not limited to two labels; could have labels and context
Context Variables
Bogus Dog Food is the Best!
Now available in handy 1 ton bags!
Buy 5!
x1, x2, x3
user.geo, env.time, env.day_of_week, env.weekend
Caveats
▪ Original Bayesian Bandit only requires real-time data
▪ Generalized Bandit may require access to long history for learning
– Pseudo online learning may be easier than true online
▪ Bandit variables can include content, time of day, day of week
▪ Context variables can include user id, user features
▪ Bandit × context variables provide the real power
You can do this yourself!
Super-fast k-means Clustering
Rationale
What is Quality?
▪ Robust clustering not a goal
– we don’t care if the same clustering is replicated
▪ Generalization is critical
▪ Agreement to “gold standard” is a non-issue
An Example
Diagonalized Cluster Proximity
Clusters as Distribution Surrogate
Theory
For Example
Grouping these two clusters seriously hurts squared distance
$\Delta_4^2(X) > \frac{1}{\sigma^2} \Delta_5^2(X)$
Algorithms
Typical k-means Failure
Selecting two seeds here cannot be fixed with Lloyd’s algorithm
Result is that these two clusters get glued together
Ball k-means
▪ Provably better for highly clusterable data
▪ Tries to find initial centroids in the “core” of each real cluster
▪ Avoids outliers in centroid computation
initialize centroids randomly with distance maximizing tendency
for each of a very few iterations:
    for each data point:
        assign point to nearest cluster
    recompute centroids using only points much closer to their own centroid than to the closest other cluster
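The pseudocode can be sketched in plain Python as below. Several details are assumptions rather than the Mahout implementation: the seeding uses k-means++ style distance-squared weighting as the "distance maximizing tendency", and "much closer" is taken to mean within `trim` times the distance to the nearest other centroid, with `trim = 0.5` as a hypothetical default. It needs k of at least 2.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def seed_centroids(points, k, rng):
    """k-means++ style seeding (assumed): far-away points are more likely picks."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(dist(p, c) for c in centroids) ** 2 for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

def ball_kmeans(points, k, iterations=5, trim=0.5, seed=0):
    rng = random.Random(seed)
    centroids = seed_centroids(points, k, rng)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Recompute each centroid from the points inside its "ball" only.
        updated = []
        for j, members in enumerate(clusters):
            gap = min(dist(centroids[j], c)
                      for i, c in enumerate(centroids) if i != j)
            ball = [p for p in members if dist(p, centroids[j]) <= trim * gap]
            if ball:
                dim = len(ball[0])
                updated.append(tuple(sum(p[d] for p in ball) / len(ball)
                                     for d in range(dim)))
            else:
                updated.append(centroids[j])  # keep the old centroid
        centroids = updated
    return centroids

# Two tight groups plus one stray point in between.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (100.0, 100.0), (100.1, 100.0), (100.0, 100.1),
          (40.0, 40.0)]
print(ball_kmeans(points, k=2))
```

When both seeds land in distinct groups, the stray point falls outside either ball and does not drag a centroid toward it; if a seed lands on the stray point itself, the trimmed update cannot move it away, so the result depends heavily on seeding.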
Still Not a Win
▪ Ball k-means is nearly guaranteed with k = 2
▪ Probability of successful seeding drops exponentially with k
▪ Alternative strategy has high probability of success, but takes O(nkd + k^3 d) time
▪ But for big data, k gets large
Surrogate Method
▪ Start with sloppy clustering into lots of clusters
κ = k log n clusters
▪ Use this sketch as a weighted surrogate for the data
▪ Results are provably good for highly clusterable data
Algorithm Costs
▪ Surrogate methods
– fast, sloppy single pass clustering with κ = k log n
– fast sloppy search for nearest cluster,
O(d log κ) = O(d (log k + log log n)) per point
– fast, in-memory, high-quality clustering of κ weighted centroids
O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality
O(κ d log k) or O(d log k (log k + log log n)) for larger k, looser quality
– result is k high-quality centroids
• For many purposes, even the sloppy surrogate may suffice
Algorithm Costs
▪ How much faster for the sketch phase?
– take k = 2000, d = 10, n = 100,000
– k d log n = 2000 x 10 x 26 = 520,000
– d (log k + log log n) = 10(11 + 5) = 160
– roughly 3,000 times faster is a bona fide big deal
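The comparison can be checked numerically. The sketch below assumes base-2 logarithms (the slide does not say which base) and a hypothetical helper `sketch_speedup`. A value of log n = 26 corresponds to n = 2^26, about 67 million, so the function is evaluated both for that n and for the stated n = 100,000.

```python
import math

def sketch_speedup(k, d, n):
    """Per-point cost of exhaustive search over kappa = k log n centroids
    versus the fast search, using base-2 logs (an assumption)."""
    kappa = k * math.log2(n)
    exhaustive = kappa * d               # k d log n
    fast_search = d * math.log2(kappa)   # d (log k + log log n)
    return exhaustive, fast_search, exhaustive / fast_search

for n in (100_000, 2 ** 26):
    slow, fast, ratio = sketch_speedup(k=2000, d=10, n=n)
    print(f"n = {n}: {slow:,.0f} vs {fast:,.0f}, about {ratio:,.0f}x")
```

Either way the ratio is in the low thousands, so the conclusion does not hinge on the exact n.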
How It Works
▪ For each point
– Find approximately nearest centroid (distance = d)
– If (d > threshold) new centroid
– Else if (u < d/threshold), with u uniform random in [0, 1), new centroid
– Else add to nearest centroid
▪ If centroids > κ ≈ C log N
– Recursively cluster centroids with higher threshold
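A single-pass sketch of this loop, written under stated assumptions: a far point always starts a new centroid, a nearer point starts one with probability d/threshold, and otherwise it is merged into the nearest centroid as a weighted mean. The exhaustive nearest-centroid search, the `growth` factor for raising the threshold, and C = 5 in the demo are all hypothetical stand-ins, not Mahout's choices.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def add_point(centroids, point, weight, threshold, rng):
    """Merge a weighted point into the nearest centroid or start a new one."""
    if not centroids:
        centroids.append([list(point), weight])
        return
    # Exhaustive search stands in for the approximate nearest-centroid search.
    nearest = min(centroids, key=lambda c: dist(c[0], point))
    d = dist(nearest[0], point)
    if d > threshold or rng.random() < d / threshold:
        centroids.append([list(point), weight])
    else:
        center, w = nearest
        total = w + weight
        nearest[0] = [(c * w + p * weight) / total for c, p in zip(center, point)]
        nearest[1] = total

def streaming_kmeans(points, max_centroids, threshold=1.0, growth=1.5, seed=0):
    rng = random.Random(seed)
    centroids = []
    for p in points:
        add_point(centroids, p, 1.0, threshold, rng)
        while len(centroids) > max_centroids:
            # Too many centroids: raise the threshold and re-cluster the sketch.
            threshold *= growth
            old, centroids = centroids, []
            rng.shuffle(old)
            for center, w in old:
                add_point(centroids, center, w, threshold, rng)
    return centroids

# Demo data: three made-up Gaussian blobs, 200 points each.
gen = random.Random(1)
data = [(gen.gauss(cx, 0.5), gen.gauss(cy, 0.5))
        for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(200)]
gen.shuffle(data)
sketch = streaming_kmeans(data, max_centroids=int(5 * math.log(len(data))))
print(len(sketch), sum(w for _, w in sketch))
```

The weights in the sketch always add up to the number of points seen, which is what lets the centroids act as a weighted surrogate for the data.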
Implementation
But Wait, …
▪ Finding nearest centroid is inner loop
▪ This could take O(d κ) per point and κ can be big
▪ Happily, approximate nearest centroid works fine
Projection Search
total ordering!
LSH Bit-match Versus Cosine
[Figure: x axis from 0 to 64, y axis from -1 to 1]
Results
Parallel Speedup?
[Figure: time per point (μs) versus number of threads; series: Threaded version, Non-threaded, Perfect Scaling]
Quality
▪ Ball k-means implementation appears significantly better than simple k-means
▪ Streaming k-means + ball k-means appears to be about as good as ball k-means alone
▪ All evaluations on 20 newsgroups with held-out data
▪ Figure of merit is mean and median squared distance to nearest cluster
Contact Me!
▪ We’re hiring at MapR in US and Europe
▪ MapR software available for research use
▪ Get the code as part of Mahout trunk (or 0.8 very soon)
▪ Contact me at tdunning@maprtech.com or @ted_dunning
▪ Share news with @apachemahout
Contenu connexe

Tendances

How to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterHow to Determine which Algorithms Really Matter
How to Determine which Algorithms Really Matter
DataWorks Summit
 

Tendances (15)

pycon2018 "RL Adventure : DQN 부터 Rainbow DQN까지"
pycon2018 "RL Adventure : DQN 부터 Rainbow DQN까지"pycon2018 "RL Adventure : DQN 부터 Rainbow DQN까지"
pycon2018 "RL Adventure : DQN 부터 Rainbow DQN까지"
 
[한국어] Safe Multi-Agent Reinforcement Learning for Autonomous Driving
[한국어] Safe Multi-Agent Reinforcement Learning for Autonomous Driving[한국어] Safe Multi-Agent Reinforcement Learning for Autonomous Driving
[한국어] Safe Multi-Agent Reinforcement Learning for Autonomous Driving
 
Ralf Herbrich - Introduction to Graphical models in Industry
Ralf Herbrich - Introduction to Graphical models in IndustryRalf Herbrich - Introduction to Graphical models in Industry
Ralf Herbrich - Introduction to Graphical models in Industry
 
Self-Learning Systems for Cyber Security
Self-Learning Systems for Cyber SecuritySelf-Learning Systems for Cyber Security
Self-Learning Systems for Cyber Security
 
How to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterHow to Determine which Algorithms Really Matter
How to Determine which Algorithms Really Matter
 
Sleep Period Optimization Model For Layered Video Service Delivery Over eMBMS...
Sleep Period Optimization Model For Layered Video Service Delivery Over eMBMS...Sleep Period Optimization Model For Layered Video Service Delivery Over eMBMS...
Sleep Period Optimization Model For Layered Video Service Delivery Over eMBMS...
 
[251] implementing deep learning using cu dnn
[251] implementing deep learning using cu dnn[251] implementing deep learning using cu dnn
[251] implementing deep learning using cu dnn
 
Photo-realistic Single Image Super-resolution using a Generative Adversarial ...
Photo-realistic Single Image Super-resolution using a Generative Adversarial ...Photo-realistic Single Image Super-resolution using a Generative Adversarial ...
Photo-realistic Single Image Super-resolution using a Generative Adversarial ...
 
SECRYPT 2018 Presentation: 15th International Conference on Security and Cry...
SECRYPT 2018  Presentation: 15th International Conference on Security and Cry...SECRYPT 2018  Presentation: 15th International Conference on Security and Cry...
SECRYPT 2018 Presentation: 15th International Conference on Security and Cry...
 
Data science
Data scienceData science
Data science
 
Introduction to Deep Learning (NVIDIA)
Introduction to Deep Learning (NVIDIA)Introduction to Deep Learning (NVIDIA)
Introduction to Deep Learning (NVIDIA)
 
How DeepMind Mastered The Game Of Go
How DeepMind Mastered The Game Of GoHow DeepMind Mastered The Game Of Go
How DeepMind Mastered The Game Of Go
 
PyConZA 2019 Keynote - Deep Neural Networks for Video Applications
PyConZA 2019 Keynote - Deep Neural Networks for Video ApplicationsPyConZA 2019 Keynote - Deep Neural Networks for Video Applications
PyConZA 2019 Keynote - Deep Neural Networks for Video Applications
 
Deep Convolutional GANs - meaning of latent space
Deep Convolutional GANs - meaning of latent spaceDeep Convolutional GANs - meaning of latent space
Deep Convolutional GANs - meaning of latent space
 
Access strategies ppt_ind
Access strategies ppt_indAccess strategies ppt_ind
Access strategies ppt_ind
 

En vedette

En vedette (9)

Drill at the Chicago Hug
Drill at the Chicago HugDrill at the Chicago Hug
Drill at the Chicago Hug
 
Drill Lightning London Big Data
Drill Lightning London Big DataDrill Lightning London Big Data
Drill Lightning London Big Data
 
Summit EU Machine Learning
Summit EU Machine LearningSummit EU Machine Learning
Summit EU Machine Learning
 
New directions for mahout
New directions for mahoutNew directions for mahout
New directions for mahout
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache Drill
 
Maintaining Low Latency While Maximizing Throughput on a Single Cluster
Maintaining Low Latency While Maximizing Throughput on a Single ClusterMaintaining Low Latency While Maximizing Throughput on a Single Cluster
Maintaining Low Latency While Maximizing Throughput on a Single Cluster
 
Dealing with an Upside Down Internet
Dealing with an Upside Down InternetDealing with an Upside Down Internet
Dealing with an Upside Down Internet
 
Apache drill
Apache drillApache drill
Apache drill
 
Cmu 2011 09.pptx
Cmu 2011 09.pptxCmu 2011 09.pptx
Cmu 2011 09.pptx
 

Similaire à News From Mahout

Goto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedGoto amsterdam-2013-skinned
Goto amsterdam-2013-skinned
Ted Dunning
 

Similaire à News From Mahout (20)

CMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceCMU Lecture on Hadoop Performance
CMU Lecture on Hadoop Performance
 
Strata New York 2012
Strata New York 2012Strata New York 2012
Strata New York 2012
 
London hug
London hugLondon hug
London hug
 
Graphlab Ted Dunning Clustering
Graphlab Ted Dunning  ClusteringGraphlab Ted Dunning  Clustering
Graphlab Ted Dunning Clustering
 
Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012
 
Boston hug-2012-07
Boston hug-2012-07Boston hug-2012-07
Boston hug-2012-07
 
Goto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedGoto amsterdam-2013-skinned
Goto amsterdam-2013-skinned
 
GoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 SkinnedGoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 Skinned
 
Buzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learningBuzz words-dunning-real-time-learning
Buzz words-dunning-real-time-learning
 
News from Mahout
News from MahoutNews from Mahout
News from Mahout
 
Mathematical bridges From Old to New
Mathematical bridges From Old to NewMathematical bridges From Old to New
Mathematical bridges From Old to New
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clustering
 
Real-time and long-time together
Real-time and long-time togetherReal-time and long-time together
Real-time and long-time together
 
Strata new-york-2012
Strata new-york-2012Strata new-york-2012
Strata new-york-2012
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache Mahout
 
What's Right and Wrong with Apache Mahout
What's Right and Wrong with Apache MahoutWhat's Right and Wrong with Apache Mahout
What's Right and Wrong with Apache Mahout
 
Real Time Learning
Real Time LearningReal Time Learning
Real Time Learning
 
Which Algorithms Really Matter
Which Algorithms Really MatterWhich Algorithms Really Matter
Which Algorithms Really Matter
 
London Data Science - Super-Fast Clustering Report
London Data Science - Super-Fast Clustering ReportLondon Data Science - Super-Fast Clustering Report
London Data Science - Super-Fast Clustering Report
 
Devoxx Real-time Learning
Devoxx Real-time LearningDevoxx Real-time Learning
Devoxx Real-time Learning
 

Plus de MapR Technologies

Plus de MapR Technologies (20)

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
 

Dernier

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Dernier (20)

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 

News From Mahout

  • 1. 1©MapR Technologies - Confidential News From Mahout
  • 2. 2©MapR Technologies - Confidential whoami – Ted Dunning  Chief Application Architect, MapR Technologies  Committer, member, Apache Software Foundation – particularly Mahout, Zookeeper and Drill (we’re hiring)  Contact me at tdunning@maprtech.com tdunning@apache.com ted.dunning@gmail.com @ted_dunning
  • 3. 3©MapR Technologies - Confidential  Slides and such (available late tonight): – http://www.mapr.com/company/events/nyhug-03-05-2013  Hash tags: #mapr #nyhug #mahout
  • 4. 4©MapR Technologies - Confidential New in Mahout  0.8 is coming soon (1-2 months)  gobs of fixes  QR decomposition is 10x faster – makes ALS 2-3 times faster  May include Bayesian Bandits  Super fast k-means – fast – online (!?!)
  • 5. 5©MapR Technologies - Confidential New in Mahout  0.8 is coming soon (1-2 months)  gobs of fixes  QR decomposition is 10x faster – makes ALS 2-3 times faster  May include Bayesian Bandits  Super fast k-means – fast – online (!?!) – fast  Possible new edition of MiA coming – Japanese and Korean editions released, Chinese coming
  • 6. 6©MapR Technologies - Confidential New in Mahout  0.8 is coming soon (1-2 months)  gobs of fixes  QR decomposition is 10x faster – makes ALS 2-3 times faster  May include Bayesian Bandits  Super fast k-means – fast – online (!?!) – fast  Possible new edition of MiA coming – Japanese and Korean editions released, Chinese coming
  • 7. 7©MapR Technologies - Confidential Real-time Learning
  • 8. 8©MapR Technologies - Confidential We have a product to sell … from a web-site
  • 9. 9©MapR Technologies - Confidential Bogus Dog Food is the Best! Now available in handy 1 ton bags! Buy 5! What picture? What tag- line? What call to action?
  • 10. 10©MapR Technologies - Confidential The Challenge  Design decisions affect probability of success – Cheesy web-sites don’t even sell cheese  The best designers do better when allowed to fail – Exploration juices creativity  But failing is expensive – If only because we could have succeeded – But also because offending or disappointing customers is bad
  • 11. 11©MapR Technologies - Confidential More Challenges  Too many designs – 5 pictures – 10 tag-lines – 4 calls to action – 3 back-ground colors => 5 x 10 x 4 x 3 = 600 designs  It gets worse quickly – What about changes on the back-end? – Search engine variants? – Checkout process variants?
  • 12. 12©MapR Technologies - Confidential Example – AB testing in real-time  I have 15 versions of my landing page  Each visitor is assigned to a version – Which version?  A conversion or sale or whatever can happen – How long to wait?  Some versions of the landing page are horrible – Don’t want to give them traffic
  • 13. A Quick Diversion
      ▪ You see a coin
          – What is the probability of heads?
          – Could it be larger or smaller than that?
      ▪ I flip the coin and while it is in the air ask again
      ▪ I catch the coin and ask again
      ▪ I look at the coin (and you don’t) and ask again
      ▪ Why does the answer change?
          – And did it ever have a single value?
  • 14. A Philosophical Conclusion
      ▪ Probability as expressed by humans is subjective and depends on information and experience
  • 15. I Dunno
  • 16. 5 heads out of 10 throws
  • 17. 2 heads out of 12 throws
  • 18. So now you understand Bayesian probability
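The three coin slides ("I Dunno", 5 heads out of 10, 2 heads out of 12) can be read as one belief being updated by data. A minimal sketch of that update, assuming a uniform Beta(1, 1) prior (the same "+1" that appears in the `rbeta` call on the code slide); the function name `beta_posterior` is made up for this illustration:

```python
def beta_posterior(heads, throws):
    """Return (alpha, beta, mean) of the Beta posterior for P(heads).

    Assumes a uniform Beta(1, 1) prior, i.e. "I dunno" before any throws.
    """
    tails = throws - heads
    alpha = heads + 1
    beta = tails + 1
    return alpha, beta, alpha / (alpha + beta)


# No data, then the two outcomes shown on the slides.
for heads, throws in [(0, 0), (5, 10), (2, 12)]:
    a, b, mean = beta_posterior(heads, throws)
    print(f"{heads} heads out of {throws}: Beta({a}, {b}), mean {mean:.3f}")
```

With no data the belief is flat; each throw only sharpens or shifts it, which is the sense in which the probability never had a single fixed value.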
  • 19. Another Quick Diversion
      ▪ Let’s play a shell game
      ▪ This is a special shell game
      ▪ It costs you nothing to play
      ▪ The pea has constant probability of being under each shell (trust me)
      ▪ How do you find the best shell?
      ▪ How do you find it while maximizing the number of wins?
  • 20. Pause for short con-game
  • 21. Interim Thoughts
      ▪ Can you identify winners or losers without trying them out?
      ▪ Can you ever completely eliminate a shell with a bad streak?
      ▪ Should you keep trying apparent losers?
  • 22. So now you understand multi-armed bandits
  • 23. Conclusions
      ▪ Can you identify winners or losers without trying them out? No
      ▪ Can you ever completely eliminate a shell with a bad streak? No
      ▪ Should you keep trying apparent losers? Yes, but at a decreasing rate
  • 24. Is there an optimum strategy?
  • 25. Bayesian Bandit
      ▪ Compute distributions based on data so far
      ▪ Sample $p_1$, $p_2$ and $p_3$ from these distributions
      ▪ Pick shell i where $i = \arg\max_i p_i$
      ▪ Lemma 1: The probability of picking shell i will match the probability it is the best shell
      ▪ Lemma 2: This is as good as it gets
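The selection rule on this slide can be sketched in a few lines. This is an illustrative simulation, not the Mahout implementation: the payoff rates, the step count and the names `select` and `run` are assumptions, and a uniform Beta(1, 1) prior is assumed for every shell.

```python
import random


def select(counts, rng):
    """Sample one win rate per shell from its Beta posterior; pick the largest."""
    samples = [rng.betavariate(wins + 1, losses + 1) for losses, wins in counts]
    return max(range(len(counts)), key=lambda i: samples[i])


def run(true_rates, steps, seed=0):
    """Play the shell game against simulated shells and return the counts."""
    rng = random.Random(seed)
    counts = [[0, 0] for _ in true_rates]    # [losses, wins] per shell
    for _ in range(steps):
        i = select(counts, rng)
        win = rng.random() < true_rates[i]   # simulated outcome, not real traffic
        counts[i][1 if win else 0] += 1
    return counts


counts = run([0.1, 0.2, 0.8], steps=2000)    # made-up payoff rates
pulls = [losses + wins for losses, wins in counts]
print(pulls)
```

Because a shell is chosen only when its sampled rate is the largest, apparent losers keep getting occasional trials, but less and less often as their posteriors narrow.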
  • 26. And it works! [Figure: regret versus n (0 to 1000) for ε-greedy with ε = 0.05 and for the Bayesian Bandit with Gamma-Normal]
  • 27. Video Demo
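  • 28. The Code
      ▪ Select an alternative

            select = function(k) {
              # k is an n x 2 matrix of counts, one row per alternative;
              # column 2 is read here as wins and column 1 as losses
              n = dim(k)[1]
              p0 = rep(0, length.out=n)
              for (i in 1:n) {
                # sample a win rate from the Beta posterior (uniform prior)
                p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
              }
              return (which(p0 == max(p0)))
            }

      ▪ Select and learn

            # the name "learn" and its arguments are assumed; the slide shows only the body
            learn = function(k, steps) {
              for (z in 1:steps) {
                i = select(k)
                j = test(i)    # test() is not shown on the slide; assumed to return column 1 or 2
                k[i,j] = k[i,j]+1
              }
              return (k)
            }

      ▪ But we already know how to count!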
  • 29. The Basic Idea
      ▪ We can encode a distribution by sampling
      ▪ Sampling allows unification of exploration and exploitation
      ▪ Can be extended to more general response models
  • 30. The Original Problem
      ▪ Bogus Dog Food is the Best! Now available in handy 1 ton bags! Buy 5! (elements labeled x1, x2, x3)
  • 31. Response Function
      $p(\mathrm{win}) = w\left(\sum_i \theta_i x_i\right)$
      [Figure: plot of w, with x from -6 to 6 and y from 0 to 1]
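A small sketch of this response function. The slide does not name the link function w; the plotted range (0 to 1 over roughly -6 to 6) suggests a logistic curve, so that is assumed here. The weights and the design vector are made-up values.

```python
import math


def logistic(t):
    """Assumed form of the link function w: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def p_win(theta, x):
    """p(win) = w(sum_i theta_i * x_i) for weights theta and design features x."""
    return logistic(sum(t * xi for t, xi in zip(theta, x)))


theta = [0.5, -1.0, 2.0]   # hypothetical weights for picture, tag-line, call to action
x = [1, 0, 1]              # hypothetical indicator features of one design
print(p_win(theta, x))
```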
  • 32. Generalized Banditry
      ▪ Suppose we have an infinite number of bandits
          – suppose they are each labeled by two real numbers x and y in [0,1]
          – also that expected payoff is a parameterized function of x and y: $E[z] = f(x, y \mid \theta)$
          – now assume a distribution for θ that we can learn online
      ▪ Selection works by sampling θ, then computing f
      ▪ Learning works by propagating updates back to θ
          – If f is linear, this is very easy
          – For special other kinds of f it isn’t too hard
      ▪ Don’t just have to have two labels, could have labels and context
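The selection half of this scheme ("sample θ, then compute f") might look like the sketch below. Several things are assumptions made only for illustration: f is taken to be a logistic function of a linear score with an intercept, the belief about θ is an independent Gaussian per coordinate, and the candidate (x, y) labels are a small finite list. The learning step that updates the belief is not shown.

```python
import math
import random


def f(x, y, theta):
    """Assumed payoff model: logistic of theta0 + theta1 * x + theta2 * y."""
    score = theta[0] + theta[1] * x + theta[2] * y
    return 1.0 / (1.0 + math.exp(-score))


def select(candidates, theta_mean, theta_sd, rng):
    """Draw one theta from the current belief, then pick the best-looking bandit."""
    theta = [rng.gauss(m, s) for m, s in zip(theta_mean, theta_sd)]
    return max(candidates, key=lambda xy: f(xy[0], xy[1], theta))


rng = random.Random(0)
candidates = [(0.1, 0.9), (0.9, 0.1), (0.5, 0.5)]   # hypothetical (x, y) labels
choice = select(candidates, [0.0, 2.0, -1.0], [1.0, 1.0, 1.0], rng)
print(choice)
```

A wide belief (large standard deviations) makes the choice vary from call to call, which is the exploration; as the belief narrows, the same code settles on the candidate with the highest expected payoff.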
  • 33. Context Variables
      ▪ Bogus Dog Food is the Best! Now available in handy 1 ton bags! Buy 5! (elements labeled x1, x2, x3)
      ▪ Context: user.geo, env.time, env.day_of_week, env.weekend
  • 34. Caveats
      ▪ Original Bayesian Bandit only requires real-time
      ▪ Generalized Bandit may require access to long history for learning
          – Pseudo online learning may be easier than true online
      ▪ Bandit variables can include content, time of day, day of week
      ▪ Context variables can include user id, user features
      ▪ Bandit × context variables provide the real power
  • 35. You can do this yourself!
  • 36. Super-fast k-means Clustering
  • 37. Rationale
  • 38. What is Quality?
      ▪ Robust clustering not a goal
          – we don’t care if the same clustering is replicated
      ▪ Generalization is critical
      ▪ Agreement to “gold standard” is a non-issue
  • 39. An Example
  • 40. An Example
  • 41. Diagonalized Cluster Proximity
  • 42. Clusters as Distribution Surrogate
  • 43. Clusters as Distribution Surrogate
  • 44. Theory
  • 45. For Example
      ▪ Grouping these two clusters seriously hurts squared distance
      $\Delta_4^2(X) > \frac{1}{\sigma^2} \Delta_5^2(X)$
  • 46. Algorithms
  • 47. Typical k-means Failure
      ▪ Selecting two seeds here cannot be fixed with Lloyd’s algorithm
      ▪ Result is that these two clusters get glued together
  • 48. Ball k-means
      ▪ Provably better for highly clusterable data
      ▪ Tries to find initial centroids in the “core” of each real cluster
      ▪ Avoids outliers in centroid computation

            initialize centroids randomly with distance maximizing tendency
            for each of a very few iterations:
                for each data point:
                    assign point to nearest cluster
                recompute centroids using only points much closer than closest cluster
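The pseudocode above can be turned into a small runnable sketch. It is not Mahout's implementation and simplifies two things: seeding uses deterministic farthest-first selection as a stand-in for the randomized, distance-maximizing seeding on the slide, and "much closer than closest cluster" is read as "within `trim` times the distance to the nearest other centroid". The `trim` parameter, its default of 0.25, and the function names are assumptions. The sketch needs k of at least 2.

```python
import math


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def farthest_first_seeds(data, k):
    """Deterministic stand-in for randomized, distance-maximizing seeding."""
    seeds = [data[0]]
    while len(seeds) < k:
        seeds.append(max(data, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return seeds


def ball_kmeans(data, k, iterations=3, trim=0.25):
    centroids = farthest_first_seeds(data, k)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in data:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # Recompute each centroid from the points inside its ball only.
        updated = []
        for j, members in enumerate(clusters):
            nearest_other = min(
                math.dist(centroids[j], centroids[c]) for c in range(k) if c != j
            )
            radius = trim * nearest_other
            core = [p for p in members if math.dist(p, centroids[j]) <= radius]
            updated.append(mean(core) if core else centroids[j])
        centroids = updated
    return centroids


data = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5),
        (10, 10), (10, 11), (11, 10), (11, 11)]
print(ball_kmeans(data, k=2))
```

In this toy data the point (5, 5) sits between the two groups. Plain Lloyd iterations would let it pull the first centroid toward the middle; the ball restriction leaves it out of the centroid computation.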
  • 49. Still Not a Win
      ▪ Ball k-means is nearly guaranteed with k = 2
      ▪ Probability of successful seeding drops exponentially with k
      ▪ Alternative strategy has high probability of success, but takes $O(nkd + k^3 d)$ time
  • 50. Still Not a Win
      ▪ Ball k-means is nearly guaranteed with k = 2
      ▪ Probability of successful seeding drops exponentially with k
      ▪ Alternative strategy has high probability of success, but takes $O(nkd + k^3 d)$ time
      ▪ But for big data, k gets large
  • 51. Surrogate Method
      ▪ Start with sloppy clustering into lots of clusters: κ = k log n clusters
      ▪ Use this sketch as a weighted surrogate for the data
      ▪ Results are provably good for highly clusterable data
  • 52. Algorithm Costs
      ▪ Surrogate methods
          – fast, sloppy single pass clustering with κ = k log n
          – fast sloppy search for nearest cluster, $O(d \log \kappa) = O(d(\log k + \log\log n))$ per point
          – fast, in-memory, high-quality clustering of κ weighted centroids
              $O(\kappa k d + k^3 d) = O(k^2 d \log n + k^3 d)$ for small k, high quality
              $O(\kappa d \log k)$ or $O(d \log \kappa \log k)$ for larger k, looser quality
          – result is k high-quality centroids
              • Even the sloppy surrogate may suffice
  • 53. Algorithm Costs
      ▪ Surrogate methods
          – fast, sloppy single pass clustering with κ = k log n
          – fast sloppy search for nearest cluster, $O(d \log \kappa) = O(d(\log k + \log\log n))$ per point
          – fast, in-memory, high-quality clustering of κ weighted centroids
              $O(\kappa k d + k^3 d) = O(k^2 d \log n + k^3 d)$ for small k, high quality
              $O(\kappa d \log k)$ or $O(d \log k (\log k + \log\log n))$ for larger k, looser quality
          – result is k high-quality centroids
              • For many purposes, even the sloppy surrogate may suffice
  • 54. Algorithm Costs
      ▪ How much faster for the sketch phase?
          – take k = 2000, d = 10, n = 100,000
          – k d log n = 2000 x 10 x 26 ≈ 500,000
          – d (log k + log log n) = 10(11 + 5) = 160
          – 3,000 times faster is a bona fide big deal
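The comparison on this slide is plain arithmetic and can be checked directly. The sketch below takes the rounded logarithms exactly as the slide uses them (26, 11 and 5) rather than deriving them from n; note that the base-2 logarithm of 100,000 is closer to 17, so the value 26 is treated here as given. The variable names are made up.

```python
k, d = 2000, 10
log_n, log_k, log_log_n = 26, 11, 5    # rounded values as used on the slide

# Exhaustive search over kappa = k log n centroids costs about kappa * d per point.
brute_force = k * d * log_n

# Approximate search costs about d * log(kappa) = d * (log k + log log n) per point.
fast_search = d * (log_k + log_log_n)

print(brute_force, fast_search, brute_force / fast_search)
```

With these inputs the ratio comes out at a few thousand, in line with the "3,000 times faster" figure quoted on the slide.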
  • 56. How It Works
      ▪ For each point
          – Find approximately nearest centroid (distance = d)
          – If (d > threshold) new centroid
          – Else if (u < d/threshold) new centroid, where u is uniform random in [0, 1)
          – Else add to nearest centroid
      ▪ If centroids > κ ≈ C log N
          – Recursively cluster centroids with higher threshold
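A compact sketch of this single-pass procedure, under stated assumptions: the random test is taken to start a new centroid with probability d/threshold (so d > threshold always starts one), the nearest centroid is found by exhaustive search rather than the approximate search used in practice, and the `growth` factor for raising the threshold is a made-up parameter. Function names are illustrative and this is not Mahout's code.

```python
import math
import random


def insert(centroids, x, w, threshold, rng):
    """Add a weighted point: start a new centroid or fold it into the nearest one."""
    if not centroids:
        centroids.append([w, list(x)])
        return
    j = min(range(len(centroids)), key=lambda i: math.dist(x, centroids[i][1]))
    d = math.dist(x, centroids[j][1])
    if rng.random() < d / threshold:          # far points usually start a centroid
        centroids.append([w, list(x)])
    else:                                      # otherwise take a weighted mean
        cw, c = centroids[j]
        total = cw + w
        centroids[j] = [total, [(cw * a + w * b) / total for a, b in zip(c, x)]]


def streaming_sketch(points, max_centroids, threshold=1.0, growth=1.5, seed=0):
    rng = random.Random(seed)
    centroids = []                             # entries are [weight, coordinates]
    for x in points:
        insert(centroids, x, 1.0, threshold, rng)
        while len(centroids) > max_centroids:
            # Too many centroids: raise the threshold and re-cluster the centroids.
            threshold *= growth
            old, centroids = centroids, []
            for w, c in old:
                insert(centroids, c, w, threshold, rng)
    return centroids


data_rng = random.Random(1)
points = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(100)]
points += [(data_rng.gauss(10, 1), data_rng.gauss(10, 1)) for _ in range(100)]
sketch = streaming_sketch(points, max_centroids=20)
print(len(sketch), sum(w for w, _ in sketch))
```

The weights carried by the centroids always add up to the number of points seen, which is what lets the sketch stand in for the full data in the later, higher-quality clustering step.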
  • 57. Implementation
  • 58. But Wait, …
      ▪ Finding nearest centroid is inner loop
      ▪ This could take O(d κ) per point and κ can be big
      ▪ Happily, approximate nearest centroid works fine
  • 59. Projection Search
      ▪ total ordering!
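One way to read "total ordering" is that projecting every centroid onto a random direction gives a sortable key, so candidates near a query can be found by binary search and then checked with the true distance. The sketch below uses a single projection and a fixed window of neighbors; the class name, the `window` parameter and the use of only one projection are simplifying assumptions rather than a description of Mahout's search classes.

```python
import bisect
import math
import random


class ProjectionSearch:
    """Approximate nearest-neighbor search using one random projection."""

    def __init__(self, points, seed=0):
        rng = random.Random(seed)
        dim = len(points[0])
        self.direction = [rng.gauss(0, 1) for _ in range(dim)]
        # Sorting by projected value gives a total ordering of the points.
        self.entries = sorted((self._project(p), p) for p in points)
        self.keys = [key for key, _ in self.entries]

    def _project(self, p):
        return sum(a * b for a, b in zip(self.direction, p))

    def nearest(self, query, window=4):
        """Check only the points whose projections are closest to the query's."""
        i = bisect.bisect_left(self.keys, self._project(query))
        lo = max(0, i - window)
        hi = min(len(self.entries), i + window + 1)
        candidates = (p for _, p in self.entries[lo:hi])
        return min(candidates, key=lambda p: math.dist(p, query))


points = [(float(i), float(i * i % 7)) for i in range(50)]
index = ProjectionSearch(points)
print(index.nearest((10.2, 1.8)))
```

Each query costs a binary search plus a handful of distance computations instead of a scan over every centroid. The answer may not be the true nearest point, which is the trade the previous slide accepts.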
  • 60. LSH Bit-match Versus Cosine [Figure: plot of LSH bit matches (0 to 64) against cosine (-1 to 1)]
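The relationship in this figure can be reproduced with random-hyperplane hashing: each of 64 random hyperplanes contributes one bit (which side the vector falls on), and the number of matching bits between two vectors tracks the angle between them. The sketch assumes 64-bit signatures and Gaussian hyperplanes; the function names and the 10-dimensional test vectors are made up.

```python
import random


def signature(v, planes):
    """One bit per hyperplane: True if v lies on its non-negative side."""
    return [sum(a * b for a, b in zip(v, p)) >= 0 for p in planes]


def bit_matches(u, v, planes):
    """Count the signature bits on which u and v agree."""
    return sum(a == b for a, b in zip(signature(u, planes), signature(v, planes)))


rng = random.Random(0)
planes = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(64)]

u = [rng.gauss(0, 1) for _ in range(10)]
v = [x + 0.3 * rng.gauss(0, 1) for x in u]   # a noisy copy of u
print(bit_matches(u, u, planes), bit_matches(u, v, planes))
```

For random hyperplanes, two vectors at angle θ agree on each bit with probability 1 − θ/π, so the expected match count falls steadily from 64 for identical directions to 0 for opposite ones.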
  • 61. Results
  • 62. Parallel Speedup? [Figure: time per point (μs) versus number of threads for the threaded and non-threaded versions, with a perfect-scaling reference]
  • 63. Quality
      ▪ Ball k-means implementation appears significantly better than simple k-means
      ▪ Streaming k-means + ball k-means appears to be about as good as ball k-means alone
      ▪ All evaluations on 20 newsgroups with held-out data
      ▪ Figure of merit is mean and median squared distance to nearest cluster
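The figure of merit named on this slide is simple to compute. A minimal sketch, with made-up held-out points and centroids standing in for the 20 newsgroups data; the function names are illustrative.

```python
import math
import statistics


def squared_distances(points, centroids):
    """Squared distance from each point to its nearest centroid."""
    return [min(math.dist(p, c) ** 2 for c in centroids) for p in points]


def figure_of_merit(held_out, centroids):
    """Mean and median squared distance to the nearest cluster, on held-out data."""
    d2 = squared_distances(held_out, centroids)
    return statistics.mean(d2), statistics.median(d2)


held_out = [(0, 0), (1, 0), (10, 10)]    # toy held-out points
centroids = [(0, 0), (10, 11)]           # toy cluster centers
print(figure_of_merit(held_out, centroids))
```

Scoring on held-out points rather than the training points is what makes this a measure of generalization, the quality criterion set out earlier in the talk.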
  • 64. Contact Me!
      ▪ We’re hiring at MapR in US and Europe
      ▪ MapR software available for research use
      ▪ Get the code as part of Mahout trunk (or 0.8 very soon)
      ▪ Contact me at tdunning@maprtech.com or @ted_dunning
      ▪ Share news with @apachemahout

Editor's notes

  1. The basic idea here is that I have colored slides to be presented by you in blue. You should substitute and reword those slides as you like. In a few places, I imagined that we would have fast back and forth as in the introduction or final slide where we can each say we are hiring in turn.
     The overall thrust of the presentation is for you to make these points:
       – Amex does lots of modeling
       – it is expensive
       – having a way to quickly test models and new variables would be awesome
       – so we worked on a new project with MapR
     My part will say the following:
       – Knn basic pictorial motivation (could move to you if you like)
       – describe knn quality metric of overlap
       – show how bad metric breaks knn (optional)
       – quick description of LSH and projection search
       – picture of why k-means search is cool
       – motivate k-means speed as tool for k-means search
       – describe single pass k-means algorithm
       – describe basic data structures
       – show parallel speedup
     Our summary should state that we have achieved:
       – super-fast k-means clustering
       – initial version of super-fast knn search with good overlap