SlideShare une entreprise Scribd logo
1  sur  34
1©MapR Technologies - Confidential
Real-time Learning
for Fun and Profit
2©MapR Technologies - Confidential
 Contact:
– tdunning@maprtech.com
– @ted_dunning
 Slides and such (available late tonight):
– http://slideshare.net/tdunning
 Hash tags: #mapr #storm #bbuzz
3©MapR Technologies - Confidential
The Challenge
 Hadoop is great of processing vats of data
– But sucks for real-time (by design!)
 Storm is great for real-time processing
– But lacks any way to deal with batch processing
 It sounds like there isn’t a solution
– Neither fashionable solution handles everything
4©MapR Technologies - Confidential
This is not a problem.
It’s an opportunity!
5©MapR Technologies - Confidential
t
now
Hadoop is Not Very Real-time
Unprocessed
Data
Fully
processed
Latest full
period
Hadoop job
takes this
long for this
data
6©MapR Technologies - Confidential
t
now
Hadoop works
great back here
Storm
works
here
Real-time and Long-time together
Blended
view
Blended
view
Blended
View
7©MapR Technologies - Confidential
One Alternative
Search
Engine
NoSql
de Jour
Consumer
Real-time Long-time
?
8©MapR Technologies - Confidential
Problems
 Simply dumping into noSql engine doesn’t quite work
 Insert rate is limited
 No load isolation
– Big retrospective jobs kill real-time
 Low scan performance
– Hbase pretty good, but not stellar
 Difficult to set boundaries
– where does real-time end and long-time begin?
9©MapR Technologies - Confidential
Almost a Solution
 Lambda architecture talks about function of long-time state
– Real-time approximate accelerator adjusts previous result to current state
 Sounds good, but …
– How does the real-time accelerator combine with long-time?
– What algorithms can do this?
– How can we avoid gaps and overlaps and other errors?
 Needs more work
10©MapR Technologies - Confidential
A Simple Example
 Let’s start with the simplest case … counting
 Counting = addition
– Addition is associative
– Addition is on-line
– We can generalize these results to all associative, on-line functions
– But let’s start simple
11©MapR Technologies - Confidential
Data
Sources
Catcher
Cluster
Rough Design – Data Flow
Catcher
Cluster
Query Event
Spout
Logger
Bolt
Counter
Bolt
Raw
Logs
Logger
Bolt
Semi
Agg
Hadoop
Aggregator
Snap
Long
agg
ProtoSpout
Counter
Bolt
Logger
Bolt
Data
Sources
12©MapR Technologies - Confidential
Closer Look – Catcher Protocol
Data
Sources
Catcher
Cluster
Catcher
Cluster
Data
Sources
The data sources and catchers
communicate with a very simple
protocol.
Hello() => list of catchers
Log(topic,message) =>
(OK|FAIL, redirect-to-catcher)
13©MapR Technologies - Confidential
Closer Look – Catcher Queues
Catcher
Cluster
Catcher
Cluster
The catchers forward log requests
to the correct catcher and return
that host in the reply to allow the
client to avoid the extra hop.
Each topic file is appended by
exactly one catcher.
Topic files are kept in shared file
storage.
Topic
File
Topic
File
14©MapR Technologies - Confidential
Closer Look – ProtoSpout
The ProtoSpout tails the topic files,
parses log records into tuples and
injects them into the Storm
topology.
Last fully acked position stored in
shared, transactionally correct file
system.
Topic
File
Topic
File
ProtoSpout
15©MapR Technologies - Confidential
Closer Look – Counter Bolt
 Critical design goals:
– fast ack for all tuples
– fast restart of counter
 Ack happens when tuple hits the replay log (10’s of milliseconds,
group commit)
 Restart involves replaying semi-agg’s + replay log (very fast)
 Replay log only lasts until next semi-aggregate goes out
Counter
Bolt
Replay
Log
Semi-
aggregated
records
Incoming
records
Real-time Long-time
16©MapR Technologies - Confidential
A Frozen Moment in Time
 Snapshot defines the dividing line
 All data in the snap is long-time, all
after is real-time
 Semi-agg strategy allows clean
combination of both kinds of data
 Data synchronized snap not
needed
Semi
Agg
Hadoop
Aggregator
Snap
Long
agg
17©MapR Technologies - Confidential
Guarantees
 Counter output volume is small-ish
– the greater of k tuples per 100K inputs or k tuple/s
– 1 tuple/s/label/bolt for this exercise
 Persistence layer must provide guarantees
– distributed against node failure
– must have either readable flush or closed-append
 HDFS is distributed, but provides no guarantees and strange
semantics
 MapRfs is distributed, provides all necessary guarantees
18©MapR Technologies - Confidential
Presentation Layer
 Presentation must
– read recent output of Logger bolt
– read relevant output of Hadoop jobs
– combine semi-aggregated records
 User will see
– counts that increment within 0-2 s of events
– seamless and accurate meld of short and long-term data
19©MapR Technologies - Confidential
The Basic Idea
 Online algorithms generally have relatively small state (like
counting)
 Online algorithms generally have a simple update (like counting)
 If we can do this with counting, we can do it with all kinds of
algorithms
20©MapR Technologies - Confidential
Summary – Part 1
 Semi-agg strategy + snapshots allows correct real-time counts
– because addition is on-line and associative
 Other on-line associative operations include:
– k-means clustering (see Dan Filimon’s talk at 16.)
– count distinct (see hyper-log-log counters from streamlib or kmv from
Brickhouse)
– top-k values
– top-k (count(*)) (see streamlib)
– contextual Bayesian bandits (see part 2 of this talk)
21©MapR Technologies - Confidential
Example 2 – AB testing in real-time
 I have 15 versions of my landing page
 Each visitor is assigned to a version
– Which version?
 A conversion or sale or whatever can happen
– How long to wait?
 Some versions of the landing page are horrible
– Don’t want to give them traffic
22©MapR Technologies - Confidential
A Quick Diversion
 You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?
 I flip the coin and while it is in the air ask again
 I catch the coin and ask again
 I look at the coin (and you don’t) and ask again
 Why does the answer change?
– And did it ever have a single value?
23©MapR Technologies - Confidential
A Philosophical Conclusion
 Probability as expressed by humans is subjective and depends on
information and experience
24©MapR Technologies - Confidential
I Dunno
25©MapR Technologies - Confidential
5 heads out of 10 throws
26©MapR Technologies - Confidential
2 heads out of 12 throws
Mean
Using any single number as a “best”
estimate denies the uncertain nature of
a distribution
Adding confidence bounds still loses most of
the information in the distribution and
prevents good modeling of the tails
27©MapR Technologies - Confidential
Bayesian Bandit
 Compute distributions based on data
 Sample p1 and p2 from these distributions
 Put a coin in bandit 1 if p1 > p2
 Else, put the coin in bandit 2
28©MapR Technologies - Confidential
And it works!
11000 100 200 300 400 500 600 700 800 900 1000
0.12
0
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
0.1
0.11
n
regret
ε- greedy, ε = 0.05
Bayesian Bandit with Gamma- Normal
29©MapR Technologies - Confidential
Video Demo
30©MapR Technologies - Confidential
The Code
 Select an alternative
 Select and learn
 But we already know how to count!
n = dim(k)[1]
p0 = rep(0, length.out=n)
for (i in 1:n) {
p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
}
return (which(p0 == max(p0)))
for (z in 1:steps) {
i = select(k)
j = test(i)
k[i,j] = k[i,j]+1
}
return (k)
31©MapR Technologies - Confidential
The Basic Idea
 We can encode a distribution by sampling
 Sampling allows unification of exploration and exploitation
 Can be extended to more general response models
 Note that learning here = counting = on-line algorithm
33©MapR Technologies - Confidential
Caveats
 Original Bayesian Bandit only requires real-time
 Generalized Bandit may require access to long history for learning
– Pseudo online learning may be easier than true online
 Bandit variables can include content, time of day, day of week
 Context variables can include user id, user features
 Bandit × context variables provide the real power
34©MapR Technologies - Confidential
 Contact:
– tdunning@maprtech.com
– @ted_dunning
 Slides and such (available late tonight):
– http://slideshare.net/tdunning
 Hash tags: #mapr #storm #bbuzz
35©MapR Technologies - Confidential
Thank You

Contenu connexe

Tendances

Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015Ted Dunning
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeTed Dunning
 
Cognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approachesCognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approachesTed Dunning
 
New Directions for Mahout
New Directions for MahoutNew Directions for Mahout
New Directions for MahoutTed Dunning
 
predictive-analytics-san-diego-2013-02-21
predictive-analytics-san-diego-2013-02-21predictive-analytics-san-diego-2013-02-21
predictive-analytics-san-diego-2013-02-21Ted Dunning
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data SecurelyTed Dunning
 
My talk about recommendation and search to the Hive
My talk about recommendation and search to the HiveMy talk about recommendation and search to the Hive
My talk about recommendation and search to the HiveTed Dunning
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossibleTed Dunning
 
What's new in Apache Mahout
What's new in Apache MahoutWhat's new in Apache Mahout
What's new in Apache MahoutTed Dunning
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningTed Dunning
 
Mathematical bridges From Old to New
Mathematical bridges From Old to NewMathematical bridges From Old to New
Mathematical bridges From Old to NewMapR Technologies
 
Building multi-modal recommendation engines using search engines
Building multi-modal recommendation engines using search enginesBuilding multi-modal recommendation engines using search engines
Building multi-modal recommendation engines using search enginesTed Dunning
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real DataTed Dunning
 
Which Algorithms Really Matter
Which Algorithms Really MatterWhich Algorithms Really Matter
Which Algorithms Really MatterTed Dunning
 
Strata new-york-2012
Strata new-york-2012Strata new-york-2012
Strata new-york-2012Ted Dunning
 
Strata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionStrata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionTed Dunning
 
Boston hug-2012-07
Boston hug-2012-07Boston hug-2012-07
Boston hug-2012-07Ted Dunning
 

Tendances (18)

Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
 
Cognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approachesCognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approaches
 
New Directions for Mahout
New Directions for MahoutNew Directions for Mahout
New Directions for Mahout
 
predictive-analytics-san-diego-2013-02-21
predictive-analytics-san-diego-2013-02-21predictive-analytics-san-diego-2013-02-21
predictive-analytics-san-diego-2013-02-21
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data Securely
 
My talk about recommendation and search to the Hive
My talk about recommendation and search to the HiveMy talk about recommendation and search to the Hive
My talk about recommendation and search to the Hive
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossible
 
What's new in Apache Mahout
What's new in Apache MahoutWhat's new in Apache Mahout
What's new in Apache Mahout
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine Learning
 
Mathematical bridges From Old to New
Mathematical bridges From Old to NewMathematical bridges From Old to New
Mathematical bridges From Old to New
 
Building multi-modal recommendation engines using search engines
Building multi-modal recommendation engines using search enginesBuilding multi-modal recommendation engines using search engines
Building multi-modal recommendation engines using search engines
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real Data
 
Which Algorithms Really Matter
Which Algorithms Really MatterWhich Algorithms Really Matter
Which Algorithms Really Matter
 
Strata new-york-2012
Strata new-york-2012Strata new-york-2012
Strata new-york-2012
 
Strata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionStrata 2014 Anomaly Detection
Strata 2014 Anomaly Detection
 
Deep Learning for Fraud Detection
Deep Learning for Fraud DetectionDeep Learning for Fraud Detection
Deep Learning for Fraud Detection
 
Boston hug-2012-07
Boston hug-2012-07Boston hug-2012-07
Boston hug-2012-07
 

Similaire à Buzz words-dunning-real-time-learning

Real-time and long-time together
Real-time and long-time togetherReal-time and long-time together
Real-time and long-time togetherTed Dunning
 
Real-time and Long-time Together
Real-time and Long-time TogetherReal-time and Long-time Together
Real-time and Long-time TogetherMapR Technologies
 
Storm Users Group Real Time Hadoop
Storm Users Group Real Time HadoopStorm Users Group Real Time Hadoop
Storm Users Group Real Time HadoopMapR Technologies
 
Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012MapR Technologies
 
CMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceCMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceMapR Technologies
 
Buzz Words Dunning Real-Time Learning
Buzz Words Dunning Real-Time LearningBuzz Words Dunning Real-Time Learning
Buzz Words Dunning Real-Time LearningMapR Technologies
 
Graphlab Ted Dunning Clustering
Graphlab Ted Dunning  ClusteringGraphlab Ted Dunning  Clustering
Graphlab Ted Dunning ClusteringMapR Technologies
 
The Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a MissionThe Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a Missioninside-BigData.com
 
Lambda kappa architecture - the jury are still out
Lambda   kappa architecture - the jury are still outLambda   kappa architecture - the jury are still out
Lambda kappa architecture - the jury are still outYoav chernobroda
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clusteringTed Dunning
 
Droidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imaginationDroidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imaginationDroidcon Berlin
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Data Con LA
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Intel® Software
 
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...StampedeCon
 

Similaire à Buzz words-dunning-real-time-learning (20)

Real-time and long-time together
Real-time and long-time togetherReal-time and long-time together
Real-time and long-time together
 
Real-time and Long-time Together
Real-time and Long-time TogetherReal-time and Long-time Together
Real-time and Long-time Together
 
New directions for mahout
New directions for mahoutNew directions for mahout
New directions for mahout
 
Storm Users Group Real Time Hadoop
Storm Users Group Real Time HadoopStorm Users Group Real Time Hadoop
Storm Users Group Real Time Hadoop
 
London hug
London hugLondon hug
London hug
 
GoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 SkinnedGoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 Skinned
 
Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012
 
Devoxx Real-Time Learning
Devoxx Real-Time LearningDevoxx Real-Time Learning
Devoxx Real-Time Learning
 
CMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceCMU Lecture on Hadoop Performance
CMU Lecture on Hadoop Performance
 
Buzz Words Dunning Real-Time Learning
Buzz Words Dunning Real-Time LearningBuzz Words Dunning Real-Time Learning
Buzz Words Dunning Real-Time Learning
 
News From Mahout
News From MahoutNews From Mahout
News From Mahout
 
Graphlab Ted Dunning Clustering
Graphlab Ted Dunning  ClusteringGraphlab Ted Dunning  Clustering
Graphlab Ted Dunning Clustering
 
The Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a MissionThe Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a Mission
 
Lambda kappa architecture - the jury are still out
Lambda   kappa architecture - the jury are still outLambda   kappa architecture - the jury are still out
Lambda kappa architecture - the jury are still out
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clustering
 
Droidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imaginationDroidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imagination
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
 
An Optics Life
An Optics LifeAn Optics Life
An Optics Life
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*
 
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
 

Plus de Ted Dunning

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxTed Dunning
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with KubernetesTed Dunning
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in KubernetesTed Dunning
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forTed Dunning
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningTed Dunning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning LogisticsTed Dunning
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTed Dunning
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logisticsTed Dunning
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownTed Dunning
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopTed Dunning
 
Recommendation Techn
Recommendation TechnRecommendation Techn
Recommendation TechnTed Dunning
 
Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0Ted Dunning
 
Using Mahout and a Search Engine for Recommendation
Using Mahout and a Search Engine for RecommendationUsing Mahout and a Search Engine for Recommendation
Using Mahout and a Search Engine for RecommendationTed Dunning
 

Plus de Ted Dunning (14)

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptx
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with Kubernetes
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in Kubernetes
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine Learning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning Logistics
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworks
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logistics
 
T digest-update
T digest-updateT digest-update
T digest-update
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
 
Recommendation Techn
Recommendation TechnRecommendation Techn
Recommendation Techn
 
Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0
 
Using Mahout and a Search Engine for Recommendation
Using Mahout and a Search Engine for RecommendationUsing Mahout and a Search Engine for Recommendation
Using Mahout and a Search Engine for Recommendation
 

Dernier

Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 

Dernier (20)

Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 

Buzz words-dunning-real-time-learning

  • 1. 1©MapR Technologies - Confidential Real-time Learning for Fun and Profit
  • 2. 2©MapR Technologies - Confidential  Contact: – tdunning@maprtech.com – @ted_dunning  Slides and such (available late tonight): – http://slideshare.net/tdunning  Hash tags: #mapr #storm #bbuzz
  • 3. 3©MapR Technologies - Confidential The Challenge  Hadoop is great of processing vats of data – But sucks for real-time (by design!)  Storm is great for real-time processing – But lacks any way to deal with batch processing  It sounds like there isn’t a solution – Neither fashionable solution handles everything
  • 4. 4©MapR Technologies - Confidential This is not a problem. It’s an opportunity!
  • 5. 5©MapR Technologies - Confidential t now Hadoop is Not Very Real-time Unprocessed Data Fully processed Latest full period Hadoop job takes this long for this data
  • 6. 6©MapR Technologies - Confidential t now Hadoop works great back here Storm works here Real-time and Long-time together Blended view Blended view Blended View
  • 7. 7©MapR Technologies - Confidential One Alternative Search Engine NoSql de Jour Consumer Real-time Long-time ?
  • 8. 8©MapR Technologies - Confidential Problems  Simply dumping into noSql engine doesn’t quite work  Insert rate is limited  No load isolation – Big retrospective jobs kill real-time  Low scan performance – Hbase pretty good, but not stellar  Difficult to set boundaries – where does real-time end and long-time begin?
  • 9. 9©MapR Technologies - Confidential Almost a Solution  Lambda architecture talks about function of long-time state – Real-time approximate accelerator adjusts previous result to current state  Sounds good, but … – How does the real-time accelerator combine with long-time? – What algorithms can do this? – How can we avoid gaps and overlaps and other errors?  Needs more work
  • 10. 10©MapR Technologies - Confidential A Simple Example  Let’s start with the simplest case … counting  Counting = addition – Addition is associative – Addition is on-line – We can generalize these results to all associative, on-line functions – But let’s start simple
  • 11. 11©MapR Technologies - Confidential Data Sources Catcher Cluster Rough Design – Data Flow Catcher Cluster Query Event Spout Logger Bolt Counter Bolt Raw Logs Logger Bolt Semi Agg Hadoop Aggregator Snap Long agg ProtoSpout Counter Bolt Logger Bolt Data Sources
  • 12. 12©MapR Technologies - Confidential Closer Look – Catcher Protocol Data Sources Catcher Cluster Catcher Cluster Data Sources The data sources and catchers communicate with a very simple protocol. Hello() => list of catchers Log(topic,message) => (OK|FAIL, redirect-to-catcher)
  • 13. 13©MapR Technologies - Confidential Closer Look – Catcher Queues Catcher Cluster Catcher Cluster The catchers forward log requests to the correct catcher and return that host in the reply to allow the client to avoid the extra hop. Each topic file is appended by exactly one catcher. Topic files are kept in shared file storage. Topic File Topic File
  • 14. 14©MapR Technologies - Confidential Closer Look – ProtoSpout The ProtoSpout tails the topic files, parses log records into tuples and injects them into the Storm topology. Last fully acked position stored in shared, transactionally correct file system. Topic File Topic File ProtoSpout
  • 15. 15©MapR Technologies - Confidential Closer Look – Counter Bolt  Critical design goals: – fast ack for all tuples – fast restart of counter  Ack happens when tuple hits the replay log (10’s of milliseconds, group commit)  Restart involves replaying semi-agg’s + replay log (very fast)  Replay log only lasts until next semi-aggregate goes out Counter Bolt Replay Log Semi- aggregated records Incoming records Real-time Long-time
  • 16. 16©MapR Technologies - Confidential A Frozen Moment in Time  Snapshot defines the dividing line  All data in the snap is long-time, all after is real-time  Semi-agg strategy allows clean combination of both kinds of data  Data synchronized snap not needed Semi Agg Hadoop Aggregator Snap Long agg
  • 17. 17©MapR Technologies - Confidential Guarantees  Counter output volume is small-ish – the greater of k tuples per 100K inputs or k tuple/s – 1 tuple/s/label/bolt for this exercise  Persistence layer must provide guarantees – distributed against node failure – must have either readable flush or closed-append  HDFS is distributed, but provides no guarantees and strange semantics  MapRfs is distributed, provides all necessary guarantees
  • 18. 18©MapR Technologies - Confidential Presentation Layer  Presentation must – read recent output of Logger bolt – read relevant output of Hadoop jobs – combine semi-aggregated records  User will see – counts that increment within 0-2 s of events – seamless and accurate meld of short and long-term data
  • 19. 19©MapR Technologies - Confidential The Basic Idea  Online algorithms generally have relatively small state (like counting)  Online algorithms generally have a simple update (like counting)  If we can do this with counting, we can do it with all kinds of algorithms
  • 20. 20©MapR Technologies - Confidential Summary – Part 1  Semi-agg strategy + snapshots allows correct real-time counts – because addition is on-line and associative  Other on-line associative operations include: – k-means clustering (see Dan Filimon’s talk at 16.) – count distinct (see hyper-log-log counters from streamlib or kmv from Brickhouse) – top-k values – top-k (count(*)) (see streamlib) – contextual Bayesian bandits (see part 2 of this talk)
  • 21. 21©MapR Technologies - Confidential Example 2 – AB testing in real-time  I have 15 versions of my landing page  Each visitor is assigned to a version – Which version?  A conversion or sale or whatever can happen – How long to wait?  Some versions of the landing page are horrible – Don’t want to give them traffic
  • 22. 22©MapR Technologies - Confidential A Quick Diversion  You see a coin – What is the probability of heads? – Could it be larger or smaller than that?  I flip the coin and while it is in the air ask again  I catch the coin and ask again  I look at the coin (and you don’t) and ask again  Why does the answer change? – And did it ever have a single value?
  • 23. 23©MapR Technologies - Confidential A Philosophical Conclusion  Probability as expressed by humans is subjective and depends on information and experience
  • 24. 24©MapR Technologies - Confidential I Dunno
  • 25. 25©MapR Technologies - Confidential 5 heads out of 10 throws
  • 26. 26©MapR Technologies - Confidential 2 heads out of 12 throws Mean Using any single number as a “best” estimate denies the uncertain nature of a distribution Adding confidence bounds still loses most of the information in the distribution and prevents good modeling of the tails
  • 27. 27©MapR Technologies - Confidential Bayesian Bandit  Compute distributions based on data  Sample p1 and p2 from these distributions  Put a coin in bandit 1 if p1 > p2  Else, put the coin in bandit 2
  • 28. 28©MapR Technologies - Confidential And it works! 11000 100 200 300 400 500 600 700 800 900 1000 0.12 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 n regret ε- greedy, ε = 0.05 Bayesian Bandit with Gamma- Normal
  • 29. 29©MapR Technologies - Confidential Video Demo
  • 30. 30©MapR Technologies - Confidential The Code  Select an alternative  Select and learn  But we already know how to count! n = dim(k)[1] p0 = rep(0, length.out=n) for (i in 1:n) { p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1) } return (which(p0 == max(p0))) for (z in 1:steps) { i = select(k) j = test(i) k[i,j] = k[i,j]+1 } return (k)
  • 31. 31©MapR Technologies - Confidential The Basic Idea  We can encode a distribution by sampling  Sampling allows unification of exploration and exploitation  Can be extended to more general response models  Note that learning here = counting = on-line algorithm
  • 32. 33©MapR Technologies - Confidential Caveats  Original Bayesian Bandit only requires real-time  Generalized Bandit may require access to long history for learning – Pseudo online learning may be easier than true online  Bandit variables can include content, time of day, day of week  Context variables can include user id, user features  Bandit × context variables provide the real power
  • 33. 34©MapR Technologies - Confidential  Contact: – tdunning@maprtech.com – @ted_dunning  Slides and such (available late tonight): – http://slideshare.net/tdunning  Hash tags: #mapr #storm #bbuzz
  • 34. 35©MapR Technologies - Confidential Thank You