Distributed Online Machine Learning Framework for Big Data
1. Distributed Online Machine Learning Framework for Big Data
Shohei Hido
Preferred Infrastructure, Inc. Japan.
XLDB Asia, June 22nd, 2012
2. Overview:
Big Data analytics will go real-time and deeper
1. Bigger data
2. More in real-time
3. Deep analysis
No storage
No data sharing
Only mix model
3. Jubatus: OSS platform for Big Data analytics
l Joint development with NTT laboratory in Japan
l Project started April 2011
l Released as an open source software
l Just released 0.3.0
l You can download it from
l http://github.com/jubatus/
l Waiting for your contribution and collaboration
4. Agenda
l What’s missing for Big Data analytics
l Comparison with existing software
l Inside Jubatus: Update, Analyze, and Mix
l Jubatus demo
l Summary
5. Increasing demand in Big Data applications:
Real-time deeper analysis
l Current focus: aggregation and rule processing on bigger data
l CEP (Complex Event Processing) for real-time processing
l Hadoop/MapReduce for distributed computation
l Future: deeper analysis for rapid decisions and actions
l Ex. 1: Defect detection on NY power grid [Rudin+, TPAMI2012]
l Ex. 2: Proactive algorithmic trading [ComputerWorldUK, 2011]
[Figure: chart with axes "Data size" and "Deep analysis"; labeled regions: Hadoop, CEP, and "What will come?"]
References:
http://web.mit.edu/rudin/www/TPAMIPreprint.pdf
http://www.computerworlduk.com/news/networking/3302464/
6. Key technology: Machine learning
l Example applications that need rapid decisions under uncertainty
l Anomaly detection from M2M sensor data
l Energy demand forecast / Smart grid optimization
l Security monitoring on raw Internet traffic
l What is missing for fast & deep analytics on Big Data?
l Online/real-time machine learning platform
l + Scale-out distributed machine learning platform
1. Bigger data
2. More in real-time
3. Deep analysis
7. Online machine learning in Jubatus
l Batch learning
l Scan all data before building a model
l Data must be stored in memory or storage
l Online learning
l The model is updated by each data sample
l In some cases, theory guarantees that the online model converges to the batch model
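The contrast between batch and online learning can be illustrated with the simplest possible "model", a mean. This is only a sketch for intuition, not a Jubatus algorithm; the names `batch_mean` and `OnlineMean` are hypothetical. The batch version needs all stored samples, while the online version updates on each sample and reaches the same value.

```python
def batch_mean(samples):
    # Batch: every sample must be available (stored) before the model is built.
    return sum(samples) / len(samples)


class OnlineMean:
    # Online: the model is updated by each sample, which can then be dropped.
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n  # incremental mean update


stream = [2.0, 4.0, 6.0, 8.0]
model = OnlineMean()
for x in stream:
    model.update(x)  # no need to keep x after this call
```

In this toy case the online model matches the batch model exactly; for real learning algorithms, convergence holds only where the corresponding theory applies.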
8. Jubatus focuses on latest online algorithms
l Advantage: fast and not memory-intensive
l Low latency & high throughput
l No need for storing large datasets
l E.g., linear classification algorithms
l Perceptron (1958)
l Passive Aggressive (PA) (2003)
l Confidence Weighted Learning (CW) (2008)
l AROW (2009)
l Normal HERD (NHERD) (2010)
[Slide annotation beside the newer algorithms: "Very recent progress"]
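As an illustration of why such algorithms are fast and need no stored dataset, here is a minimal sketch of the basic Passive Aggressive update for binary labels (the variant without a slack parameter). It is written from the published algorithm, not taken from the Jubatus source; the function name `pa_update` is hypothetical.

```python
def pa_update(w, x, y):
    """One Passive Aggressive step; y must be -1 or +1. Returns the new weights."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)           # hinge loss on this sample
    sq_norm = sum(xi * xi for xi in x)
    if loss == 0.0 or sq_norm == 0.0:
        return w                             # passive: margin already satisfied
    tau = loss / sq_norm                     # aggressive: smallest step that fixes it
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


w = [0.0, 0.0]
w = pa_update(w, [1.0, 0.0], +1)   # first sample
w = pa_update(w, [0.0, 2.0], -1)   # second sample; the first is no longer needed
```

Each update touches only the current sample and the weight vector, so memory use is independent of how many samples have been seen.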
9. Online learning or distributed learning:
No unified solution has been available
l Jubatus combines them into a unified computation framework
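[Figure: 2x2 map of existing software]
Real-time/online, small scale & stand-alone: online ML algorithms, e.g. PA [2003], CW [2008]
Real-time/online, large scale (distributed/parallel computing): Jubatus (2011-)
Batch, small scale & stand-alone: WEKA (1993-), SPSS (1988-)
Batch, large scale (distributed/parallel computing): Mahout (2006-)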
10. What Jubatus currently supports
l Classification (multi-class)
l Perceptron / PA / CW / AROW
l Regression
l PA-based regression
l Nearest neighbor
l LSH / MinHash / Euclid LSH
l Recommendation
l Based on nearest neighbor
l Anomaly detection*
l LOF based on nearest neighbor
l Graph analysis*
l Shortest path / Centrality (PageRank)
l Some simple statistics
12. Hadoop and Mahout: Not good for online learning
l Hadoop
l Advantage
l Many extensions for a variety of applications
l Good for distributed data storing and aggregation
l Disadvantage
l No direct support for machine learning and online processing
l Mahout
l Advantage
l Popular machine learning algorithms are implemented
l Disadvantage
l Some implementations are less mature
l Still not capable of online machine learning
13. Jubatus vs. Hadoop, RDB-based, and Storm:
Advantage in online AND distributed ML
l Only Jubatus satisfies both of them at the same time
                      Jubatus           Hadoop       RDB               Storm
Storing Big Data      ✓ (External DB)   ✓✓ (HDFS)    ✓                 ✓ (Ext. DB)
Batch learning        ✓                 ✓ (Mahout)   ✓✓ (SPSS, etc.)   ✕
Stream processing     ✓                 ✕            ✕                 ✓✓
Distributed learning  ✓✓                ✓ (Mahout)   ✕                 ✕
Online learning       ✓✓                ✕            ✕                 ✕
(The "Online learning" row is marked as high importance.)
15. How to make online algorithms distributed?
=> Not trivial!
[Figure: timelines of batch vs. online learning. Batch learning: a Learn phase followed by a Model update; easy to parallelize the update. Online learning: Learn and Model update alternate repeatedly over time; hard to parallelize the update due to frequent updates.]
l Online learning requires frequent model updates
l A naïve distributed architecture leads to too many synchronization operations
l This causes problems in both network communication and accuracy
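A back-of-the-envelope sketch of the synchronization cost follows. The numbers (one million updates, a sync every 10,000 updates) are made up for illustration and are not from the talk; `sync_rounds` is a hypothetical helper.

```python
def sync_rounds(num_updates, updates_per_sync):
    """Number of synchronization rounds needed for a run of model updates."""
    return -(-num_updates // updates_per_sync)  # ceiling division


num_updates = 1_000_000                      # hypothetical workload
naive = sync_rounds(num_updates, 1)          # synchronize after every update
periodic = sync_rounds(num_updates, 10_000)  # synchronize only occasionally
```

Under these assumed numbers, synchronizing on every update costs 10,000 times as many rounds as occasional synchronization.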
16. Solution: Loose model sharing
l Jubatus only shares the local models in a loose manner
l Model size << Data size
l Jubatus DOES NOT share datasets
l Unique approach compared to existing frameworks
l Local models can be different on the servers
l Different models will be gradually merged
[Figure: three servers, each holding its own model; after merging, each holds a mixed model]
17. Three fundamental operations on Jubatus:
UPDATE, ANALYZE, and MIX
1. UPDATE
l Receive a sample, learn and update the local model
2. ANALYZE
l Receive a sample, apply the local model, return result
3. MIX (called automatically in backend)
l Exchange and merge the local models between servers
l Cf. Map-Shuffle-Reduce operations on Hadoop
l Algorithms can be implemented independently of
l Distribution logic
l Data sharing
l Failover
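One way to picture this separation is an interface that the algorithm implements while the framework drives MIX. The sketch below is hypothetical throughout: `Mixable`, `LabelCounter`, and `mix` are illustrative names, not the actual Jubatus API, and the toy model simply counts labels.

```python
from abc import ABC, abstractmethod


class Mixable(ABC):
    """Hypothetical contract between an algorithm and the framework."""

    @abstractmethod
    def update(self, features, label): ...
    @abstractmethod
    def analyze(self, features): ...
    @abstractmethod
    def get_diff(self): ...
    @abstractmethod
    def merge(self, a, b): ...
    @abstractmethod
    def put_diff(self, merged): ...


class LabelCounter(Mixable):
    """Toy algorithm: predicts the most frequent label seen so far."""

    def __init__(self):
        self.base = {}    # counts agreed at the last MIX
        self.local = {}   # counts learned locally since then

    def update(self, features, label):
        self.local[label] = self.local.get(label, 0) + 1

    def analyze(self, features):
        totals = self.merge(self.base, self.local)
        return max(totals, key=totals.get) if totals else None

    def get_diff(self):
        return dict(self.local)

    def merge(self, a, b):
        out = dict(a)
        for label, n in b.items():
            out[label] = out.get(label, 0) + n
        return out

    def put_diff(self, merged):
        self.base = self.merge(self.base, merged)
        self.local = {}


def mix(servers):
    """Framework side: knows nothing about the algorithm beyond the interface."""
    diffs = [s.get_diff() for s in servers]
    merged = diffs[0]
    for d in diffs[1:]:
        merged = servers[0].merge(merged, d)
    for s in servers:
        s.put_diff(merged)


a, b = LabelCounter(), LabelCounter()
a.update("tweet 1", "sports")
a.update("tweet 2", "sports")
b.update("tweet 3", "news")
mix([a, b])   # after this call both servers hold the same counts
```

The algorithm class contains no networking or failover code; in this sketch everything distribution-related lives in `mix`.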
18. UPDATE
l Each server starts from an initial model
l Each data sample is sent to one (or two) servers
l Local models are updated based on the sample
l Data samples are NEVER shared
[Figure: samples are distributed randomly or consistently to two servers; on each server the initial model becomes local model 1 or local model 2]
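The two routing modes can be sketched as follows. Reading "consistently" as "the same key always reaches the same server" is an assumption, and hashing modulo the server count is a simplification rather than a full consistent-hashing ring. The function names and server labels are hypothetical.

```python
import hashlib
import random


def route_randomly(servers, rng=random):
    # Any server may receive the sample.
    return rng.choice(servers)


def route_consistently(key, servers):
    # Assumption: a stable hash of the key picks the server, so repeated
    # keys always land on the same one. Not a full consistent-hash ring.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]


servers = ["server-1", "server-2"]
first = route_consistently("user-42", servers)
second = route_consistently("user-42", servers)   # same server as `first`
anywhere = route_randomly(servers)
```

In both modes only the sample's destination is decided; the sample itself is never copied between servers afterwards.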
19. MIX
l Each server sends its model diff
l Model diffs are merged and distributed
l Only model diffs are transmitted
[Figure: MIX with two servers]
On each server i (i = 1, 2): Local model i - Initial model = Model diff i
Across servers: Model diff 1 + Model diff 2 = Merged diff
On each server: Initial model + Merged diff = Mixed model
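The diff arithmetic can be worked through on small weight vectors. This sketch assumes models are plain lists of floats and that diffs are merged by element-wise addition, following the "+" in the slide's diagram; averaging would be an equally plausible choice. All function names and values are illustrative.

```python
def model_diff(local, initial):
    # What this server learned since the last MIX.
    return [l - i for l, i in zip(local, initial)]


def merge_diffs(diffs):
    # Assumption: merge by element-wise sum.
    return [sum(column) for column in zip(*diffs)]


def apply_diff(initial, merged):
    return [i + m for i, m in zip(initial, merged)]


initial = [0.0, 1.0]
local_1 = [0.5, 1.0]    # server 1 changed the first weight
local_2 = [0.0, 0.0]    # server 2 changed the second weight

diff_1 = model_diff(local_1, initial)
diff_2 = model_diff(local_2, initial)
merged = merge_diffs([diff_1, diff_2])
mixed = apply_diff(initial, merged)
```

Only `diff_1`, `diff_2`, and `merged` would cross the network; the training samples that produced them stay on their servers.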
20. UPDATE (iteration)
l Locally updated models after MIX are discarded
l Each server starts updating from the mixed model
l The mixed model improves gradually thanks to all of the servers
[Figure: samples are distributed randomly or consistently to two servers; on each server the mixed model becomes local model 1 or local model 2]
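The UPDATE/MIX cycle can be simulated in a single process. The sketch below makes several assumptions: a perceptron as the local learner, two servers, diffs merged by summation, and the same tiny stream replayed each round purely to keep the example short. None of the names come from Jubatus.

```python
def perceptron_update(w, x, y):
    """Update w in place if (x, y) is misclassified; y is -1 or +1."""
    if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
        for j, xj in enumerate(x):
            w[j] += y * xj


def mix(local_models, base):
    """Base model plus the sum of every server's diff (local - base)."""
    return [b + sum(m[j] - b for m in local_models) for j, b in enumerate(base)]


# Samples never leave the server they were routed to.
streams = [
    [([1.0, 0.0], +1), ([2.0, 1.0], +1)],   # routed to server 1
    [([0.0, 1.0], -1), ([1.0, 3.0], -1)],   # routed to server 2
]

mixed = [0.0, 0.0]
for _ in range(3):                      # three UPDATE/MIX rounds
    local_models = []
    for stream in streams:
        w = list(mixed)                 # restart from the mixed model
        for x, y in stream:
            perceptron_update(w, x, y)
        local_models.append(w)
    mixed = mix(local_models, mixed)    # previous local models are dropped


def predict(w, x):
    return +1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
```

After the first round the mixed model already classifies both servers' samples correctly, although neither server has seen the other's data.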
21. ANALYZE
l For prediction, each sample randomly goes to a server
l Server applies the current mixed model to the sample
l The prediction will be returned to the client
[Figure: samples are distributed randomly to the servers; each server applies its mixed model and returns the prediction]
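A sketch of ANALYZE under the assumption that every server holds an identical copy of the mixed model right after MIX: whichever server is picked, the prediction is the same. The linear model, the weights, and the names `predict` and `analyze` are all illustrative.

```python
import random


def predict(w, x):
    # Linear binary classifier: sign of the dot product.
    return +1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


mixed_model = [1.0, -1.0]                         # hypothetical mixed weights
servers = [list(mixed_model) for _ in range(3)]   # one copy per server


def analyze(x, rng=random):
    w = rng.choice(servers)   # the sample goes to a random server
    return predict(w, x)      # computed locally, no communication needed


result_a = analyze([2.0, 1.0])
result_b = analyze([0.0, 1.0])
```

Because prediction needs only the local copy of the model, adding servers increases prediction throughput without adding network traffic.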
22. Why can Jubatus work in real time?
l Focus on online machine learning
l Make online machine learning algorithms distributed
l Update locally
l Online training without communication with others
l Mix only models globally
l Small communication cost, low latency, good performance
l Advantage compared to costly Shuffle in MapReduce
l Analyze locally
l Each server has the mixed model
l Low latency for making predictions
l Everything in-memory
l Process data on-the-fly
24. Demo: Twitter analysis using natural language
processing and machine learning
Jubatus classifies each tweet from the Twitter data stream into predefined
categories. A single Jubatus server is enough to classify over 5,000 QPS,
which is close to the rate of the raw Twitter stream. We provide a browser-based GUI.
25. Experiment: Estimation of power consumption
Jubatus learns the power usage and network data flow pattern of
certain servers. The power consumption of individual servers can be
estimated in real-time by monitoring and analyzing packets without
having to install power measurement modules on all servers.
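[Figure: packet data is captured via a TAP in a data center / office; some servers have a power meter and others have none, and consumption is estimated for the latter. Plot axes: Predicted value (W), Actual value (W). Note: consumption differs for different types of packets.]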
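The talk does not say which algorithm was used in this experiment. As one possibility, the sketch below shows a PA-style online regression step with an epsilon-insensitive loss. The features (packet counts per packet type), the target value, and the function names are made up for illustration.

```python
def estimate(w, x):
    # Predicted power (W) as a weighted sum of per-type packet counts.
    return sum(wi * xi for wi, xi in zip(w, x))


def pa_regression_update(w, x, y, epsilon=0.1):
    """One Passive Aggressive regression step; returns the new weights."""
    error = y - estimate(w, x)
    loss = max(0.0, abs(error) - epsilon)   # errors within epsilon are ignored
    sq_norm = sum(xi * xi for xi in x)
    if loss == 0.0 or sq_norm == 0.0:
        return w
    tau = loss / sq_norm
    sign = 1.0 if error > 0 else -1.0
    return [wi + sign * tau * xi for wi, xi in zip(w, x)]


w = [0.0, 0.0]
x = [1.0, 2.0]       # hypothetical counts for two packet types
measured = 10.0      # hypothetical reading from a metered server (W)
w = pa_regression_update(w, x, measured)
```

After one step the estimate for this sample lies exactly `epsilon` away from the measured value, which is the defining property of the PA regression update.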
27. Summary
l Jubatus is the first OSS platform for online
distributed machine learning on Big Data streams.
l Download it from http://github.com/jubatus/
l We welcome your contribution and collaboration
1. Bigger data
2. More in real-time
3. Deep analysis
No storage
No data sharing
Only mix model