How to Tell Which Algorithms Really
Matter
Ted Dunning
MapR Technologies
© 2014 MapR Technologies 2
00:01
1.65TB
WITH 298 SERVERS
129K
RECOMMENDATIONS
00:02
Advertising Automation Cloud
Sellers Cloud
Buyers Cloud
63M
AD AUCTIONS
00:03
00:04
422.2K
GENETIC SEQUENCES
Largest Biometric
Database
00:05
4.73M
AUTHENTICATIONS
But How is This Done?
What really matters?
Topic For Today
• What is important? What is not?
• Why?
• What is the difference from academic research?
• Some examples
What is Important?
• Deployable
– Clever prototypes don’t count if they can’t be standardized
• Robust
– Mishandling is common
• Transparent
– Will degradation be obvious?
• Skillset and mindset matched?
– How long will your fancy data scientist enjoy doing standard ops tasks?
• Proportionate
– Where is the highest value per minute of effort?
Academic Goals vs Pragmatics
• Academic goals
– Reproducible
– Isolate theoretically important aspects
– Work on novel problems
• Pragmatics
– Highest net value
– Available data is constantly changing
– Diligence and consistency have larger impact than cleverness
– Many systems feed themselves, so exploration and exploitation are both important
– Engineering constraints on budget and schedule
Example 1:
Making Recommendations Better
Recommendation Advances
• What are the most important algorithmic advances in
recommendations over the last 10 years?
• Cooccurrence analysis?
• Matrix completion via factorization?
• Latent factor log-linear models?
• Temporal dynamics?
The Winner – None of the Above
• What are the most important algorithmic advances in
recommendations over the last 10 years?
1. Result dithering (random noise)
2. Anti-flood (don’t repeat yourself)
The Real Issues
• Exploration
• Diversity
• Speed
• Not the last fraction of a percent
Result Dithering
• Dithering is used to re-order recommendation results
– Re-ordering is done randomly
• Dithering is guaranteed to make off-line performance worse
• Dithering also has a near perfect record of making actual
performance much better
“Made more difference than any other change”
Example … ε = 0.5
1 2 6 5 3 4 13 16
1 2 3 8 5 7 6 34
1 4 3 2 6 7 11 10
1 2 4 3 15 7 13 19
1 6 2 3 4 16 9 5
1 2 3 5 24 7 17 13
1 2 3 4 6 12 5 14
2 1 3 5 7 6 4 17
4 1 2 7 3 9 8 5
2 1 5 3 4 7 13 6
3 1 5 4 2 7 8 6
2 1 3 4 7 12 17 16
Example … ε = log 2 = 0.69
1 2 8 3 9 15 7 6
1 8 14 15 3 2 22 10
1 3 8 2 10 5 7 4
1 2 10 7 3 8 6 14
1 5 33 15 2 9 11 29
1 2 7 3 5 4 19 6
1 3 5 23 9 7 4 2
2 4 11 8 3 1 44 9
2 3 1 4 6 7 8 33
3 4 1 2 10 11 15 14
11 1 2 4 5 7 3 14
1 8 7 3 22 11 2 33
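The slides show dithered orderings for two values of ε but do not spell out how the noise is applied. The sketch below assumes one common formulation: score each item by the logarithm of its original rank plus Gaussian noise with standard deviation ε, then sort by that score. The function name `dither` and its parameters are illustrative, not taken from the talk.

```python
import math
import random


def dither(ranked_items, epsilon, rng=None):
    """Return a randomly re-ordered copy of ranked_items (best item first).

    Assumption: score = log(rank) + N(0, epsilon). Noise on a log scale
    perturbs the top ranks only slightly and lets deeper items surface
    occasionally.
    """
    rng = rng or random.Random()
    scored = []
    for rank, item in enumerate(ranked_items, start=1):
        noisy_score = math.log(rank) + rng.gauss(0.0, epsilon)
        # The original rank breaks ties so the sort stays deterministic.
        scored.append((noisy_score, rank, item))
    scored.sort(key=lambda entry: (entry[0], entry[1]))
    return [item for _, _, item in scored]


if __name__ == "__main__":
    results = list(range(1, 51))  # stand-in for 50 ranked recommendations
    for _ in range(3):
        print(dither(results, epsilon=0.5)[:8])
```

With ε = 0 the order is unchanged; larger values of ε pull more items from deep in the list onto the first page, which is the exploration the talk argues for.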
Exploring The Second Page
Lesson 1:
Exploration is good
Example 2:
Bayesian Bandits
Bayesian Bandits
• Based on Thompson sampling
• Very general sequential test
• Near optimal regret
• Trade-off exploration and exploitation
• Possibly best known solution for exploration/exploitation
• Incredibly simple
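To illustrate how simple Thompson sampling can be, here is a minimal sketch for the Bernoulli (click or no click) case with Beta posteriors. This is a simplification chosen for brevity and is an assumption, not the model used in the talk; the class name `BernoulliBandit` and the `simulate` helper are hypothetical.

```python
import random


class BernoulliBandit:
    """Thompson sampling with a Beta(1, 1) prior on each arm's success rate."""

    def __init__(self, n_arms, rng=None):
        self.wins = [0] * n_arms
        self.losses = [0] * n_arms
        self.rng = rng or random.Random()

    def choose(self):
        # Draw one sample from each arm's posterior and play the best draw.
        samples = [
            self.rng.betavariate(wins + 1, losses + 1)
            for wins, losses in zip(self.wins, self.losses)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        if reward:
            self.wins[arm] += 1
        else:
            self.losses[arm] += 1


def simulate(true_rates, rounds, seed=0):
    """Run the bandit against simulated arms and count pulls per arm."""
    rng = random.Random(seed)
    bandit = BernoulliBandit(len(true_rates), rng)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        arm = bandit.choose()
        reward = rng.random() < true_rates[arm]
        bandit.update(arm, reward)
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    print(simulate([0.05, 0.10, 0.30], rounds=2000))
```

Uncertain arms produce widely spread samples and so still get tried now and then, while arms with a well-established low rate are rarely chosen. That is how a single sampling step trades off exploration and exploitation.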
Fast Convergence
[Figure: regret versus n for n up to 1000, comparing ε-greedy (ε = 0.05) with a Bayesian Bandit with Gamma-Normal]
Thompson Sampling on Ads
An Empirical Evaluation of Thompson Sampling - Chapelle and Li, 2011
Bayesian Bandits versus Result Dithering
• Many useful systems are difficult to frame in fully Bayesian form
• Thompson sampling cannot be applied without posterior
sampling
• Can still do useful exploration with dithering
• But better to use Thompson sampling if possible
Lesson 2:
Exploration is easy to do and
pays big benefits.
Example 3:
On-line Clustering
The Problem
• K-means clustering is useful for feature extraction or
compression
• At scale and at high dimension, the desirable number of clusters
increases
• Very large number of clusters may require more passes through
the data
• Super-linear scaling is generally infeasible
The Solution
• Sketch-based algorithms produce a sketch of the data
• Streaming k-means uses adaptive dp-means to produce this
sketch in the form of many weighted centroids which
approximate the original distribution
• The size of the sketch grows very slowly with increasing data
size
• Many operations such as clustering are well behaved on
sketches
Fast and Accurate k-means For Large Datasets. Michael Shindler, Alex Wong, Adam Meyerson.
Revisiting k-means: New Algorithms via Bayesian Nonparametrics. Brian Kulis, Michael Jordan.
An Example
Streaming k-means Ideas
• By using a sketch with lots (k log N) of centroids, we avoid
pathological cases
• We still get a very good result if the sketch is created
– in one pass
– with approximate search
• In fact, adaptive dp-means works just fine
• In the end, the sketch can be used for clustering or …
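The one-pass sketch idea can be outlined as follows. This is a simplified, deterministic variant written for illustration: a point farther than a distance threshold from every centroid starts a new centroid, otherwise it is merged into the nearest one, and when the sketch grows past a size limit the threshold is raised and the centroids are re-clustered. The names `streaming_sketch` and `_absorb`, the growth factor, and the exhaustive nearest-centroid search are all assumptions; the cited papers use probabilistic rules, and the slides mention approximate search.

```python
import math
import random


def _absorb(centroids, point, weight, threshold):
    """Merge a weighted point into its nearest centroid, or start a new one."""
    if centroids:
        nearest = min(centroids, key=lambda c: math.dist(c[0], point))
        if math.dist(nearest[0], point) <= threshold:
            total = nearest[1] + weight
            # Weighted mean keeps the centroid at the centre of its mass.
            nearest[0] = [
                (a * nearest[1] + b * weight) / total
                for a, b in zip(nearest[0], point)
            ]
            nearest[1] = total
            return
    centroids.append([list(point), weight])


def streaming_sketch(points, max_centroids, threshold=1.0, growth=2.0):
    """Single pass over points; returns ([[centroid, weight], ...], threshold)."""
    centroids = []
    for point in points:
        _absorb(centroids, point, 1.0, threshold)
        while len(centroids) > max_centroids:
            # Too many centroids: loosen the threshold and re-cluster them.
            threshold *= growth
            old, centroids = centroids, []
            for centre, weight in old:
                _absorb(centroids, centre, weight, threshold)
    return centroids, threshold


if __name__ == "__main__":
    rng = random.Random(42)
    data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
    data += [[rng.gauss(10, 1), rng.gauss(10, 1)] for _ in range(500)]
    sketch, final_threshold = streaming_sketch(data, max_centroids=20)
    print(len(sketch), final_threshold)
```

The weighted centroids that come out can then be fed to an ordinary in-memory clustering step, since the sketch is far smaller than the original data.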
Lesson 3:
Sketches make big data small.
Example 4:
Search Abuse
Recommendations
Alice got an apple and a puppy
Charles got a bicycle
Bob got an apple
Alice
Bob
Charles
Recommendations
What else would Bob like?
Alice
Bob
Charles
Log Files
[Figure: log entries from Alice, Bob, and Charles]
History Matrix: Users by Items
Alice
Bob
Charles
✔ ✔ ✔
✔ ✔
✔ ✔
Co-occurrence Matrix: Items by Items
[Figure: item-by-item matrix of co-occurrence counts]
How do you tell which co-occurrences are useful?
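As a small illustration of how an item-by-item co-occurrence matrix is derived from user histories, the sketch below counts, for each pair of items, how many users have both. The data is the Alice, Bob, and Charles example from the earlier slides; the function name `cooccurrence` and the dictionary layout are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(history):
    """Count, for every pair of items, the number of users who have both.

    history maps a user name to the items that user interacted with.
    Pairs are stored once, with the two item names in sorted order.
    """
    counts = Counter()
    for items in history.values():
        for first, second in combinations(sorted(set(items)), 2):
            counts[(first, second)] += 1
    return counts


history = {
    "Alice": ["apple", "puppy"],
    "Bob": ["apple"],
    "Charles": ["bicycle"],
}
counts = cooccurrence(history)

if __name__ == "__main__":
    for pair, count in counts.items():
        print(pair, count)
```

Raw counts like these are dominated by popular items, which is why the next step is to keep only the co-occurrences that are anomalous.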
Indicator Matrix: Anomalous Co-Occurrence
✔
✔
Result: The marked row will be added to the indicator field in the
item document…
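The slides do not say which test decides that a co-occurrence is anomalous. One test often used for this purpose is the log-likelihood ratio (G²) on a 2×2 contingency table; treating it as the test here is an assumption. In the sketch, `k11` is the number of users with both items, `k12` and `k21` the number with only one of them, and `k22` the number with neither. The function name and any cut-off applied to the score are hypothetical.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G-squared statistic for a 2x2 table of co-occurrence counts.

    Values near zero mean the two items occur together about as often as
    chance predicts; large values mark the co-occurrence as anomalous.
    """
    n = k11 + k12 + k21 + k22
    row_sums = (k11 + k12, k21 + k22)
    col_sums = (k11 + k21, k12 + k22)
    cells = (
        (k11, row_sums[0], col_sums[0]),
        (k12, row_sums[0], col_sums[1]),
        (k21, row_sums[1], col_sums[0]),
        (k22, row_sums[1], col_sums[1]),
    )
    total = 0.0
    for observed, row_sum, col_sum in cells:
        if observed > 0:  # an empty cell contributes nothing
            total += observed * math.log(observed * n / (row_sum * col_sum))
    return 2.0 * total


if __name__ == "__main__":
    print(log_likelihood_ratio(10, 10, 10, 10))   # independent items
    print(log_likelihood_ratio(100, 0, 0, 100))   # strongly associated items
```

Item pairs whose score exceeds a chosen cut-off would be the ones marked in the indicator matrix.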
Indicator Matrix
✔
id: t4
title: puppy
desc: The sweetest little puppy ever.
keywords: puppy, dog, pet
indicators: (t1)
That one row from indicator matrix becomes the indicator field in the
Solr document used to deploy the recommendation engine.
Note: data for the indicator field is added directly to meta-data for a document in Solr
index. You don’t need to create a separate index for the indicators.
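To make the role of the indicator field concrete, here is an in-memory stand-in for the search step. It does not call Solr: it only mimics a query of the user's recent item ids against each document's indicators field and ranks documents by the number of matches. The `recommend` function, the overlap scoring, and the `t1` apple document are assumptions for illustration; a real search engine would apply its own relevance scoring.

```python
def recommend(docs, history, limit=10):
    """Rank documents by how many of their indicators appear in history.

    docs is a list of dicts with "id" and "indicators" keys, mirroring the
    item document on the slide. Items already in history are skipped.
    """
    seen = set(history)
    scored = []
    for doc in docs:
        if doc["id"] in seen:
            continue
        overlap = len(seen.intersection(doc.get("indicators", ())))
        if overlap:
            scored.append((overlap, doc["id"]))
    # Highest overlap first; the id breaks ties for a stable order.
    scored.sort(key=lambda entry: (-entry[0], entry[1]))
    return [doc_id for _, doc_id in scored[:limit]]


docs = [
    {"id": "t1", "title": "apple", "indicators": []},  # hypothetical document
    {"id": "t4", "title": "puppy", "indicators": ["t1"]},
]

if __name__ == "__main__":
    print(recommend(docs, history=["t1"]))
```

A user whose history contains `t1` gets `t4` back, because `t1` appears in the puppy document's indicators.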
Internals of the Recommender Engine
Real-life example
Lesson 4:
Recursive search abuse pays
Search can implement recs
Which can implement search
How Does This Apply?
How Can I Start?
Q&A
@ted_dunning @mapr maprtech
tdunning@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies
  • 1. How to Tell Which Algorithms Really Matter Ted Dunning MapR Technologies
  • 2. © 2014 MapR Technologies 2
  • 3. © 2014 MapR Technologies 3 00:011.65TB WITH 298 SERVERS
  • 4. © 2014 MapR Technologies 4 129K RECCOMENDATIONS 00:02
  • 5. © 2014 MapR Technologies 5 Advertising Automation Cloud Sellers Cloud Buyers Cloud 63M AD AUCTIONS 00:03
  • 6. © 2014 MapR Technologies 6 00:04422.2K GENETIC SEQUENCES
  • 7. © 2014 MapR Technologies 7 Largest Biometric Database 00:054.73M AUTHENTICATIONS
  • 8. © 2014 MapR Technologies 8© 2014 MapR Technologies But How is This Done? What really matters?
  • 9. Topic For Today • What is important? What is not? • Why? • What is the difference from academic research? • Some examples
  • 10. What is Important? • Deployable • Robust • Transparent • Skillset and mindset matched? • Proportionate
  • 11. What is Important? • Deployable – Clever prototypes don’t count if they can’t be standardized • Robust • Transparent • Skillset and mindset matched? • Proportionate
  • 12. What is Important? • Deployable – Clever prototypes don’t count • Robust – Mishandling is common • Transparent – Will degradation be obvious? • Skillset and mindset matched? • Proportionate
  • 13. What is Important? • Deployable – Clever prototypes don’t count • Robust – Mishandling is common • Transparent – Will degradation be obvious? • Skillset and mindset matched? – How long will your fancy data scientist enjoy doing standard ops tasks? • Proportionate – Where is the highest value per minute of effort?
  • 14. Academic Goals vs Pragmatics • Academic goals – Reproducible – Isolate theoretically important aspects – Work on novel problems • Pragmatics – Highest net value – Available data is constantly changing – Diligence and consistency have larger impact than cleverness – Many systems feed themselves, so exploration and exploitation are both important – Engineering constraints on budget and schedule
  • 15. Example 1: Making Recommendations Better
  • 16. Recommendation Advances • What are the most important algorithmic advances in recommendations over the last 10 years? • Cooccurrence analysis? • Matrix completion via factorization? • Latent factor log-linear models? • Temporal dynamics?
  • 17. The Winner – None of the Above • What are the most important algorithmic advances in recommendations over the last 10 years? 1. Result dithering (random noise) 2. Anti-flood (don’t repeat yourself)
  • 18. The Real Issues • Exploration • Diversity • Speed • Not the last fraction of a percent
  • 19. Result Dithering • Dithering is used to re-order recommendation results – Re-ordering is done randomly • Dithering is guaranteed to make off-line performance worse • Dithering also has a near perfect record of making actual performance much better
  • 20. Result Dithering • Dithering is used to re-order recommendation results – Re-ordering is done randomly • Dithering is guaranteed to make off-line performance worse • Dithering also has a near perfect record of making actual performance much better “Made more difference than any other change”
  • 21. Example … ε = 0.5 (rows of 8 ranks): 1 2 6 5 3 4 13 16; 1 2 3 8 5 7 6 34; 1 4 3 2 6 7 11 10; 1 2 4 3 15 7 13 19; 1 6 2 3 4 16 9 5; 1 2 3 5 24 7 17 13; 1 2 3 4 6 12 5 14; 2 1 3 5 7 6 4 17; 4 1 2 7 3 9 8 5; 2 1 5 3 4 7 13 6; 3 1 5 4 2 7 8 6; 2 1 3 4 7 12 17 16
  • 22. Example … ε = log 2 = 0.69 (rows of 8 ranks): 1 2 8 3 9 15 7 6; 1 8 14 15 3 2 22 10; 1 3 8 2 10 5 7 4; 1 2 10 7 3 8 6 14; 1 5 33 15 2 9 11 29; 1 2 7 3 5 4 19 6; 1 3 5 23 9 7 4 2; 2 4 11 8 3 1 44 9; 2 3 1 4 6 7 8 33; 3 4 1 2 10 11 15 14; 11 1 2 4 5 7 3 14; 1 8 7 3 22 11 2 33
  • 23. Exploring The Second Page
  • 24. Lesson 1: Exploration is good
  • 25. Example 2: Bayesian Bandits
  • 26. Bayesian Bandits • Based on Thompson sampling • Very general sequential test • Near optimal regret • Trade-off exploration and exploitation • Possibly best known solution for exploration/exploitation • Incredibly simple
  • 27. Fast Convergence [Plot: regret (0 to 0.12) versus n (0 to 1000); curves: ε-greedy with ε = 0.05, and Bayesian Bandit with Gamma-Normal]
  • 28. Thompson Sampling on Ads. An Empirical Evaluation of Thompson Sampling, Chapelle and Li, 2011
  • 29. Bayesian Bandits versus Result Dithering • Many useful systems are difficult to frame in fully Bayesian form • Thompson sampling cannot be applied without posterior sampling • Can still do useful exploration with dithering • But better to use Thompson sampling if possible
  • 30. Lesson 2: Exploration is easy to do and pays big benefits.
  • 31. Example 3: On-line Clustering
  • 32. The Problem • K-means clustering is useful for feature extraction or compression • At scale and at high dimension, the desirable number of clusters increases • Very large number of clusters may require more passes through the data • Super-linear scaling is generally infeasible
  • 33. The Solution • Sketch-based algorithms produce a sketch of the data • Streaming k-means uses adaptive dp-means to produce this sketch in the form of many weighted centroids which approximate the original distribution • The size of the sketch grows very slowly with increasing data size • Many operations such as clustering are well behaved on sketches. References: Fast and Accurate k-means For Large Datasets. Michael Shindler, Alex Wong, Adam Meyerson. Revisiting k-means: New Algorithms via Bayesian Nonparametrics. Brian Kulis, Michael Jordan.
  • 34. An Example
  • 35. An Example
  • 36. Streaming k-means Ideas • By using a sketch with lots (k log N) of centroids, we avoid pathological cases • We still get a very good result if the sketch is created – in one pass – with approximate search • In fact, adaptive dp-means works just fine • In the end, the sketch can be used for clustering or …
  • 37. Lesson 3: Sketches make big data small.
  • 38. Example 4: Search Abuse
  • 39. Recommendations: Alice got an apple and a puppy. Charles got a bicycle. [Figure labels: Alice, Charles]
  • 40. Recommendations: Alice got an apple and a puppy. Charles got a bicycle. Bob got an apple. [Figure labels: Alice, Bob, Charles]
  • 41. Recommendations: What else would Bob like? [Figure labels: Alice, Bob, Charles]
  • 42. Log Files [Figure: log entries for Alice, Bob, Charles, Alice, Bob, Charles, Alice]
  • 43. History Matrix: Users by Items [Figure: rows Alice, Bob, Charles; seven ✔ entries]
  • 44. Co-occurrence Matrix: Items by Items [Figure: matrix of co-occurrence counts with entries 0, 1 and 2] How do you tell which co-occurrences are useful?
  • 45. Indicator Matrix: Anomalous Co-Occurrence ✔ ✔ Result: The marked row will be added to the indicator field in the item document…
  • 46. Indicator Matrix ✔ id: t4 title: puppy desc: The sweetest little puppy ever. keywords: puppy, dog, pet indicators: (t1) That one row from the indicator matrix becomes the indicator field in the Solr document used to deploy the recommendation engine. Note: data for the indicator field is added directly to the meta-data for a document in the Solr index. You don’t need to create a separate index for the indicators.
  • 47. Internals of the Recommender Engine
  • 48. Real-life example
  • 49. Lesson 4: Recursive search abuse pays. Search can implement recs, which can implement search
  • 50. How Does This Apply?
  • 51. How Can I Start?
  • 52. Q&A @ted_dunning @mapr maprtech tdunning@mapr.com Engage with us! MapR maprtech mapr-technologies
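The result dithering described in the slides (random re-ordering that still favors the top of the list) can be sketched in a few lines. The transcript does not preserve the scoring formula, so this is a minimal sketch under an assumption: each result gets a synthetic score of log(rank) plus Gaussian noise with standard deviation ε, and results are re-sorted by that score. The function name `dither` and its parameters are hypothetical.

```python
import math
import random


def dither(ranked_items, epsilon=0.5, rng=None):
    """Randomly re-order a ranked list while mostly keeping top items near the top.

    Assumed formula (not taken from the slides): score = log(rank) + N(0, epsilon).
    """
    rng = rng or random.Random()
    scored = [
        (math.log(rank) + rng.gauss(0.0, epsilon), item)
        for rank, item in enumerate(ranked_items, start=1)
    ]
    # Lower synthetic score means earlier position in the dithered list.
    scored.sort(key=lambda pair: pair[0])
    return [item for _, item in scored]


if __name__ == "__main__":
    rng = random.Random(42)
    original = list(range(1, 51))  # items named by their original rank
    for _ in range(5):
        # Show the first 8 positions of each dithered ordering.
        print(dither(original, epsilon=0.5, rng=rng)[:8])
```

Because the noise is added on a log scale, the first few results are swapped only occasionally, while items from deep in the list sometimes surface on the first page. A larger ε mixes more aggressively.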

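The "incredibly simple" claim about Bayesian bandits can be illustrated with Thompson sampling on binary rewards. This is a sketch under assumptions: it uses a Beta-Bernoulli model with a uniform prior rather than the Gamma-Normal model named on the convergence plot, and the class name `BetaBernoulliBandit`, the `simulate` helper, and the conversion rates are all hypothetical.

```python
import random


class BetaBernoulliBandit:
    """Thompson sampling for arms with binary (success/failure) rewards."""

    def __init__(self, n_arms, rng=None):
        self.rng = rng or random.Random()
        # Beta(1, 1) prior per arm, i.e. uniform over the unknown success rate.
        self.successes = [1.0] * n_arms
        self.failures = [1.0] * n_arms

    def choose(self):
        # Draw one plausible success rate per arm from its posterior,
        # then play the arm whose draw is largest.
        draws = [
            self.rng.betavariate(a, b)
            for a, b in zip(self.successes, self.failures)
        ]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, arm, reward):
        if reward:
            self.successes[arm] += 1.0
        else:
            self.failures[arm] += 1.0


def simulate(true_rates, steps=2000, seed=0):
    """Run the bandit against simulated arms and count pulls per arm."""
    rng = random.Random(seed)
    bandit = BetaBernoulliBandit(len(true_rates), rng)
    pulls = [0] * len(true_rates)
    for _ in range(steps):
        arm = bandit.choose()
        reward = rng.random() < true_rates[arm]
        bandit.update(arm, reward)
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    print(simulate([0.1, 0.3, 0.7]))
```

Exploration happens automatically: an arm with little data has a wide posterior, so it occasionally produces the largest draw and gets tried again. As evidence accumulates, pulls concentrate on the best arm.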
Editor's Notes

  1. I just have 5 minutes for this talk. Given the short time I thought I’d share with you some of the more interesting things you can do with Hadoop in 5 minutes or less…
  2. The Minutesort benchmark is technology agnostic. Until recently the record was held by Microsoft using custom software and dedicated high-end hardware. Yahoo broke the record and sorted 1.6TB in one minute using 2200 servers. This is not limited to just the originators of Hadoop: over a weekend, one of our customers recently broke this record by sorting 1.65TB in one minute with 298 servers. This performance is key to their use of Hadoop, as it is a critical part of their business operations.
  3. After just a few minutes of work here with Hadoop I could use a minute to relax. Beats headphones by Dr. Dre have swept the audio market. Beats has launched a new Beats Music service that is able to personalize music selections and select the perfect song in a minute from over 20 million songs. It joins a crowded space for online music, but by using Hadoop Beats is able to provide a completely new personalized service from the over 20 million songs in their library. On their very first day, they were processing 129,000 music interactions per minute, a number that is only growing. 11 million events per day.
  4. In our third minute you could perform 63M ad auctions with the Rubicon Project. The company’s pioneering technology created a new model for the advertising industry, similar to what NASDAQ did for stock trading. Rubicon Project’s automated advertising platform is used by more than 500 of the world’s premium publishers to transact with over 100,000 ad brands globally. 63M might seem like a lot, but that is just the average, not the peak performance, of the Rubicon Project, which performs 90B ad auctions each day, providing the most extensive ad reach in the industry and touching 96% of internet users in the US. You might ask how we know who has the largest ad reach. Well, this was measured by comScore.
  5. You can use a second minute to change healthcare. Doctors, particularly oncologists, are faced with an enormous amount of data regarding patient treatments, outcomes, and disease states. Hadoop is having an impact across the health care industry, but for this minute we will focus on its use for developing better treatments. In one minute Hadoop can analyze more than 20,000 genes across hundreds of thousands of patients. The outcome of this analysis is a better understanding of genomic factors, integrated with imaging and clinical analytics to better understand, predict, and impact survival. In any given minute our cluster is sequencing 422,000 genes.
  6. In 1 minute you can perform 4.73 million concurrent authentications in the largest biometric database in the world. In India, there is no social security card. It’s difficult for the average citizen to set up a bank account, access benefit programs, and enjoy economic mobility. It’s difficult for the government as well, with over $1B of government aid classified as leakage, the result of fraud and corruption. The Aadhaar program is poised to change all that by leveraging the unique IDs that all people are born with. The program aims to get fingerprints and retina scans for all 1.2 billion citizens. The scale of this project required an integrated in-Hadoop database that was capable of 200 millisecond response times while supporting millions of concurrent look-ups.
  7. Hadoop is making CIOs rethink their data architecture. It is a fundamental shift in the economics of data storage/processing/analytics, and is opening up entirely new business opportunities. Let’s talk about 3 key trends we are seeing, as well as 3 realities or implications on your business and “readiness” to harness the power of big data and Hadoop.
  8. A history of what everybody has done. Obviously this is just a cartoon, because large numbers of users and interactions with items would be required to build a recommender. The next step will be to predict what a new user might like…
  9. Bob is the “new user”, and getting an apple is his history.
  10. Here is where the recommendation engine needs to go to work… Note to trainer: you might see if the audience calls out the answer before revealing the next slide…
  11. Note to trainer: This situation is similar to the one in which we started, with three users in our history. The difference is that now everybody got a pony. Bob has an apple and a pony but not a puppy… yet.
  12. The binary matrix is stored sparsely.
  13. Convert by MapReduce into a binary matrix. Note to trainer: Whether to consider apple to have occurred with itself is an open question.
  14. Old joke: all the world can be divided into 2 categories: Scotch tape and non-Scotch tape… This is a way to think about the co-occurrence.
  15. The only important co-occurrence is that puppy follows apple.
  16. Take that row of the matrix and combine it with all the meta-data we might have… The important thing to get from the co-occurrence matrix is this indicator. Cool thing: this is analogous to what a lot of recommendation engines do. This row forms the indicator field in a Solr document containing meta-data (you do NOT have to build a separate index for the indicators). Find the useful co-occurrence and get rid of the rest. Sparsify and get the anomalous co-occurrence.
  17. Note to trainer: take a little time to explore this here and on the next couple of slides. Details enlarged on next slide
  18. This indicator field is where the output of the Mahout recommendation engine is stored (the row from the indicator matrix that identified significant or interesting co-occurrence). Keep in mind that this recommendation indicator data is added to the same original document in the Solr index that contains meta-data for the item in question.
  19. This is a diagnostics window in the LucidWorks Solr index (not the web interface a user would see). It’s a way for the developer to do a rough evaluation (laugh test) of the choices offered by the recommendation engine. In other words, do these indicator artists, represented by their indicator IDs, make reasonable recommendations? Note to trainer: artist 303 happens to be The Beatles. Is that a good match for Chuck Berry?
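The step from the co-occurrence matrix to the indicator matrix ("find the useful co-occurrence and get rid of the rest") can be sketched as follows. The notes do not say how anomalous co-occurrence is scored, so this sketch assumes a log-likelihood ratio (G²) test on a 2x2 table of user counts. The threshold of 3.0, the function names, and the toy data are hypothetical; the toy data echoes the notes' scenario in which everybody got a pony.

```python
import math
from collections import Counter
from itertools import combinations


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) score for a 2x2 contingency table of counts."""
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = ((k11, row1, col1), (k12, row1, col2),
             (k21, row2, col1), (k22, row2, col2))
    # Cells with a zero count contribute nothing to the sum.
    return 2.0 * sum(k * math.log(k * n / (r * c)) for k, r, c in cells if k > 0)


def indicators(histories, threshold=3.0):
    """Map each item to the items that co-occur with it anomalously often."""
    n_users = len(histories)
    item_counts = Counter()   # number of users who have each item
    pair_counts = Counter()   # number of users who have both items of a pair
    for items in histories.values():
        unique = sorted(set(items))
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    result = {item: [] for item in item_counts}
    for (a, b), k11 in pair_counts.items():
        k12 = item_counts[a] - k11          # users with a but not b
        k21 = item_counts[b] - k11          # users with b but not a
        k22 = n_users - k11 - k12 - k21     # users with neither
        # Keep only pairs seen together more often than chance would predict.
        more_than_chance = k11 * n_users > item_counts[a] * item_counts[b]
        if more_than_chance and llr(k11, k12, k21, k22) >= threshold:
            result[a].append(b)
            result[b].append(a)
    return result


def toy_histories():
    """Hypothetical data: everybody has a pony; apple and puppy go together."""
    histories = {}
    for i in range(4):
        histories[f"fruit_fan_{i}"] = ["apple", "puppy", "pony"]
        histories[f"cyclist_{i}"] = ["bicycle", "pony"]
    return histories


if __name__ == "__main__":
    for item, related in sorted(indicators(toy_histories()).items()):
        print(item, "indicators:", related)
```

In this toy data the pony co-occurs with every other item, yet it scores zero because everybody has one, so it carries no information. Only the apple and puppy pair survives, and each surviving list plays the role of the indicator field that the notes describe adding to the item's Solr document.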