Distributed Online Machine Learning
        Framework for Big Data




                 Shohei Hido
     Preferred Infrastructure, Inc. Japan.
        XLDB Asia, June 22nd, 2012
Preferred Infrastructure (PFI): to bring
cutting-edge research advances to products
l    Founded:           March, 2006, located in Tokyo, Japan
l    Employees:         28
      l    Top university graduates including ICPC world finalists
      l    Mid-career engineers from Sony, IBM, Yahoo!, Sun


[Figure: four technology areas: information retrieval, distributed computing, natural language processing, machine learning]
Overview:
Big Data analytics will go real-time and deeper

1. Bigger data
2. More in real-time
3. Deep analysis

- No storage
- No data sharing
- Only mix model
Jubatus: OSS platform for Big Data analytics




l    Joint development with NTT laboratory in Japan
      l    Project started April 2011
l    Released as an open source software
      l    Just released 0.3.0
l    You can download it from
l    http://github.com/jubatus/
l    Waiting for your contribution and collaboration

                                         5
Agenda

l    What’s missing for Big Data analytics


l    Comparison with existing software


l    Inside Jubatus: Update, Analyze, and Mix


l    Jubatus demo


l    Summary




                                    6
Increasing demand in Big Data applications:
    Real-time deeper analysis
    l  Current focus: aggregation and rule processing on bigger data
         l  CEP (Complex Event Processing) for real-time processing

         l  Hadoop/MapReduce for distributed computation

    l  Future: deeper analysis for rapid decisions and actions
         l  Ex. 1: Defect detection on NY power grid [Rubin+,TPAMI2012]

         l  Ex. 2: Proactive algorithmic trading [ComputerWorldUK, 2011]


Data size	

                                                               What will
                                        Hadoop                  come?
                  CEP
                                                                        Deep
    Reference:http://web.mit.edu/rudin/www/TPAMIPreprint.pdf

                                             7	
                        analysis	
        
    
http://www.computerworlduk.com/news/networking/3302464/
Key technology: Machine learning

l    Examples need rapid decisions under uncertainty
      l    Anomaly detection from M2M sensor data
      l    Energy demand forecast / Smart grid optimization
      l    Security monitoring on raw Internet traffic
l    What is missing for fast & deep analytics on Big Data?
      l    Online/real-time machine learning platform
      l    + Scale-out distributed machine learning platform



            1. Bigger data

      2. More in real-time

       3. Deep analysis
Online machine learning in Jubatus
l    Batch learning
       l  Scan all data before building a model
       l  Data must be stored in memory or storage


                                          Model


l    Online learning
       l  Model will be updated by each data sample
       l  Sometimes with theory that the online model
           converges to the batch model
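
To make the contrast concrete, here is a minimal sketch of online learning with the classic perceptron rule: the model is updated from one sample at a time, and no sample is kept afterwards. The sparse-dictionary representation and the names `perceptron_update` and `stream` are illustrative assumptions, not Jubatus code.

```python
def perceptron_update(w, x, y):
    """Update weights w in place from one sample (x, y), with y in {-1, +1}."""
    score = sum(w.get(f, 0.0) * v for f, v in x.items())
    if y * score <= 0:  # misclassified (or undecided): move w toward y * x
        for f, v in x.items():
            w[f] = w.get(f, 0.0) + y * v
    return w

# Hypothetical stream of (features, label) pairs; nothing is stored after use.
stream = [
    ({"a": 1.0}, +1),
    ({"b": 1.0}, -1),
    ({"a": 1.0, "b": 0.5}, +1),
]

w = {}
for x, y in stream:
    perceptron_update(w, x, y)
```

A batch learner would instead need the whole `stream` available before it could produce any model.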


Jubatus focuses on latest online algorithms

l    Advantage: fast and not memory-intensive
       l  Low latency & high throughput
       l  No need for storing large datasets


l    Eg. Linear classification algorithms
      l    Perceptron (1958)
      l    Passive Aggressive (PA) (2003)             Very recent
                                                        progress
      l    Confidence Weighted Learning (CW) (2008)
      l    AROW (2009)
      l    Normal HERD (NHERD) (2010)




Online learning or distributed learning:
   No unified solution has been available
   l    Jubatus combines them into a unified computation framework
                                  Real-time/
                                    Online
                Online ML alg.:                Jubatus
                  PA [2003]                    2011-
                  CW[2008]

                                                                  Large scale
Small scale                                                             &
Stand-alone                                                       Distributed/
                                                                    Parallel
                WEKA                                     Mahout    computing
                   1993-                                  2006-
                SPSS
                   1988-
                                    Batch
                                      11
What Jubatus currently supports

l    Classification (multi-class)
       l  Perceptron / PA / CW / AROW

l    Regression
       l  PA-based regression

l    Nearest neighbor
       l  LSH / MinHash / Euclid LSH

l    Recommendation
       l  Based on nearest neighbor

l    Anomaly detection*
       l  LOF based on nearest neighbor

l  Graph analysis*
     l  Shortest path / Centrality (PageRank)

l  Some simple statistics
                                    12
Hadoop and Mahout: Not good for online learning

l    Hadoop
       l  Advantage

              l    Many extensions for a variety of applications
              l    Good for distributed data storing and aggregation
       l    Disadvantage
              l    No direct support for machine learning and online processing
l    Mahout
       l  Advantage

              l    Popular machine learning algorithms are implemented
       l    Disadvantage
              l    Some implementation are less mature
              l    Still not capable of online machine learning

                                              14
Jubatus vs. Hadoop, RDB-based, and Storm:
    Advantage in online AND distributed ML
    l    Only Jubatus satisfies both of them at the same time

                            Jubatus       Hadoop           RDB        Storm
                Storing          ✓               ✓✓                     ✓
                                                             ✓
                Big Data    External DB          HDFS                 Ext. DB
                 Batch                             ✓        ✓✓
                                ✓                                       ✕
                learning                         Mahout   SPSS, etc
                 Stream
                                ✓                  ✕         ✕         ✓✓
               processing
             Distributed                           ✓
                               ✓✓                            ✕          ✕
              learning                           Mahout
   High
         Online
importance	
                   ✓✓                  ✕         ✕          ✕
                learning
                                          15
How to make online algorithms distributed?
=> Not trivial!

[Figure: timelines comparing batch and online learning. Batch learning ("Learn the update", then "Model update") is easy to parallelize; online learning ("Learn" and "Model update" repeated for each sample) is hard to parallelize due to frequent updates.]

- Online learning requires frequent model updates
- A naïve distributed architecture leads to too many synchronization operations
- This causes performance problems in terms of network communication and accuracy
Solution: Loose model sharing

l  Jubatus only shares the local models in a loose manner
     l  Model size << Data size

l  Jubatus DOES NOT share datasets
     l  Unique approach compared to existing framework

l  Local models can be different on the servers
     l  Different models will be gradually merged




                  Model      Model       Model




                  Mixed      Mixed       Mixed
                  model      model       model
Three fundamental operations on Jubatus:
UPDATE, ANALYZE, and MIX
1. UPDATE
    - Receive a sample, learn, and update the local model
2. ANALYZE
    - Receive a sample, apply the local model, return the result
3. MIX (called automatically in the backend)
    - Exchange and merge the local models between servers

- Cf. Map-Shuffle-Reduce operations on Hadoop
- Algorithms can be implemented independently from
    - Distribution logic
    - Data sharing
    - Failover

UPDATE

   l  Each server starts from an initial model
   l  Each data sample are sent to one (or two) servers
   l  Local models updated based on the sample
   l  Data samples are NEVER shared




[Figure: samples are distributed randomly or consistently to servers 1 and 2; on each server the initial model becomes a local model]
MIX

l  Each server sends its model diff
l  Model diffs are merged and distributed
l  Only model diffs are transmitted




[Figure: MIX across servers 1 and 2]
    Local model i - Initial model = Model diff i    (on each server i = 1, 2)
    Model diff 1 + Model diff 2 = Merged diff
    Initial model + Merged diff = Mixed model       (on each server)
UPDATE (iteration)

   l  Locally updated models after MIX are discarded
   l  Each server starts updating from the mixed model
   l  The mixed model improves gradually thanks to all of the servers




Distributed

randomly
                                            Local
or consistently 	
                                             Mixed
                                                     model
                                                               model
                                                       1

                                                     Local
                                                     model     Mixed
                                                               model
                                                       2
                                   22
ANALYZE

   l  For prediction, each sample randomly goes to a server
   l  Server applies the current mixed model to the sample
   l  The prediction will be returned to the client




Distributed

randomly
                                                      Mixed
                                                               model

                                Return prediction
                                                               Mixed
                                                               model
                                Return prediction
                                   23
Why can Jubatus work in real time?

- Focus on online machine learning
    - Make online machine learning algorithms distributed
- Update locally
    - Online training without communication with other servers
- Mix only models globally
    - Small communication cost, low latency, good performance
    - Advantage compared to the costly Shuffle in MapReduce
- Analyze locally
    - Each server has the mixed model
    - Low latency for making predictions
- Everything in-memory
    - Process data on-the-fly
Demo: Twitter analysis using natural language
processing and machine learning
Jubatus classifies each tweet from the Twitter data stream into predefined
categories. A single Jubatus server can classify over 5,000 queries per second (QPS),
which is close to the rate of the raw Twitter stream. We provide a browser-based GUI.
Experiment: Estimation of power consumption
Jubatus learns the power usage and network data flow pattern of
certain servers. The power consumption of individual servers can be
estimated in real-time by monitoring and analyzing packets without
having to install power measurement modules on all servers.




[Figure: servers in a data center or office, some without a power meter, whose consumption is estimated from TAP packet data; scatter plot of predicted value (W) against actual value (W). Consumption differs for different types of packets.]
Summary

l    Jubatus is the first OSS platform for online
      distributed machine learning on Big Data streams.
l    Download it from http://github.com/jubatus/
l    We welcome your contribution and collaboration
               1. Bigger data

            2. More in real-time

              3. Deep analysis
                                      No storage
                                      No data sharing
                                      Only mix model

Contenu connexe

Similaire à Jubatus Invited Talk at XLDB Asia

Jubatus talk at HadoopSummit 2013
Jubatus talk at HadoopSummit 2013Jubatus talk at HadoopSummit 2013
Jubatus talk at HadoopSummit 2013
Preferred Networks
 
Introduction to Distributed Computing Engines for Data Processing - Simone Ro...
Introduction to Distributed Computing Engines for Data Processing - Simone Ro...Introduction to Distributed Computing Engines for Data Processing - Simone Ro...
Introduction to Distributed Computing Engines for Data Processing - Simone Ro...
Data Science Milan
 
Hadoop
HadoopHadoop
Hadoop
Aarti Bedre
 
Hadoop programming
Hadoop programmingHadoop programming
Hadoop programming
Muthusamy Manigandan
 
BIG DATA
BIG DATABIG DATA
BIG DATA
Shashank Shetty
 
Cloud computing and Hadoop introduction
Cloud computing and Hadoop introductionCloud computing and Hadoop introduction
Cloud computing and Hadoop introduction
christian.perez
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
KrishnenduKrishh
 
Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handy
Praveen Sripati
 
Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talks
yhadoop
 
Hadoop tutorial for Freshers,
Hadoop tutorial for Freshers, Hadoop tutorial for Freshers,
Hadoop tutorial for Freshers,
TIB Academy
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Ahmed Elsayed
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
Christopher Pezza
 
IRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET Journal
 
Map reduce advantages over parallel databases report
Map reduce advantages over parallel databases reportMap reduce advantages over parallel databases report
Map reduce advantages over parallel databases report
Ahmad El Tawil
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Mahantesh Angadi
 
Pivotal: Virtualize Big Data to Make the Elephant Dance
Pivotal: Virtualize Big Data to Make the Elephant DancePivotal: Virtualize Big Data to Make the Elephant Dance
Pivotal: Virtualize Big Data to Make the Elephant Dance
EMC
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
Dony Riyanto
 
Big Data
Big DataBig Data
Big Data
NGDATA
 
Hadoop.mapreduce
Hadoop.mapreduceHadoop.mapreduce
Hadoop.mapreduce
Michael Hepburn
 
Notes on data-intensive processing with Hadoop Mapreduce
Notes on data-intensive processing with Hadoop MapreduceNotes on data-intensive processing with Hadoop Mapreduce
Notes on data-intensive processing with Hadoop Mapreduce
Evert Lammerts
 

Similaire à Jubatus Invited Talk at XLDB Asia (20)

Jubatus talk at HadoopSummit 2013
Jubatus talk at HadoopSummit 2013Jubatus talk at HadoopSummit 2013
Jubatus talk at HadoopSummit 2013
 
Introduction to Distributed Computing Engines for Data Processing - Simone Ro...
Introduction to Distributed Computing Engines for Data Processing - Simone Ro...Introduction to Distributed Computing Engines for Data Processing - Simone Ro...
Introduction to Distributed Computing Engines for Data Processing - Simone Ro...
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop programming
Hadoop programmingHadoop programming
Hadoop programming
 
BIG DATA
BIG DATABIG DATA
BIG DATA
 
Cloud computing and Hadoop introduction
Cloud computing and Hadoop introductionCloud computing and Hadoop introduction
Cloud computing and Hadoop introduction
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handy
 
Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talks
 
Hadoop tutorial for Freshers,
Hadoop tutorial for Freshers, Hadoop tutorial for Freshers,
Hadoop tutorial for Freshers,
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
IRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOP
 
Map reduce advantages over parallel databases report
Map reduce advantages over parallel databases reportMap reduce advantages over parallel databases report
Map reduce advantages over parallel databases report
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
 
Pivotal: Virtualize Big Data to Make the Elephant Dance
Pivotal: Virtualize Big Data to Make the Elephant DancePivotal: Virtualize Big Data to Make the Elephant Dance
Pivotal: Virtualize Big Data to Make the Elephant Dance
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
 
Big Data
Big DataBig Data
Big Data
 
Hadoop.mapreduce
Hadoop.mapreduceHadoop.mapreduce
Hadoop.mapreduce
 
Notes on data-intensive processing with Hadoop Mapreduce
Notes on data-intensive processing with Hadoop MapreduceNotes on data-intensive processing with Hadoop Mapreduce
Notes on data-intensive processing with Hadoop Mapreduce
 

Plus de Preferred Networks

PodSecurityPolicy からGatekeeper に移行しました / Kubernetes Meetup Tokyo #57
PodSecurityPolicy からGatekeeper に移行しました / Kubernetes Meetup Tokyo #57PodSecurityPolicy からGatekeeper に移行しました / Kubernetes Meetup Tokyo #57
PodSecurityPolicy からGatekeeper に移行しました / Kubernetes Meetup Tokyo #57
Preferred Networks
 
Optunaを使ったHuman-in-the-loop最適化の紹介 - 2023/04/27 W&B 東京ミートアップ #3
Optunaを使ったHuman-in-the-loop最適化の紹介 - 2023/04/27 W&B 東京ミートアップ #3Optunaを使ったHuman-in-the-loop最適化の紹介 - 2023/04/27 W&B 東京ミートアップ #3
Optunaを使ったHuman-in-the-loop最適化の紹介 - 2023/04/27 W&B 東京ミートアップ #3
Preferred Networks
 
Kubernetes + containerd で cgroup v2 に移行したら "failed to create fsnotify watcher...
Kubernetes + containerd で cgroup v2 に移行したら "failed to create fsnotify watcher...Kubernetes + containerd で cgroup v2 に移行したら "failed to create fsnotify watcher...
Kubernetes + containerd で cgroup v2 に移行したら "failed to create fsnotify watcher...
Preferred Networks
 
深層学習の新しい応用と、 それを支える計算機の進化 - Preferred Networks CEO 西川徹 (SEMICON Japan 2022 Ke...
深層学習の新しい応用と、 それを支える計算機の進化 - Preferred Networks CEO 西川徹 (SEMICON Japan 2022 Ke...深層学習の新しい応用と、 それを支える計算機の進化 - Preferred Networks CEO 西川徹 (SEMICON Japan 2022 Ke...
深層学習の新しい応用と、 それを支える計算機の進化 - Preferred Networks CEO 西川徹 (SEMICON Japan 2022 Ke...
Preferred Networks
 
Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55
Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55
Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55
Preferred Networks
 
Kaggle Happywhaleコンペ優勝解法でのOptuna使用事例 - 2022/12/10 Optuna Meetup #2
Kaggle Happywhaleコンペ優勝解法でのOptuna使用事例 - 2022/12/10 Optuna Meetup #2Kaggle Happywhaleコンペ優勝解法でのOptuna使用事例 - 2022/12/10 Optuna Meetup #2
Kaggle Happywhaleコンペ優勝解法でのOptuna使用事例 - 2022/12/10 Optuna Meetup #2
Preferred Networks
 
最新リリース:Optuna V3の全て - 2022/12/10 Optuna Meetup #2
最新リリース:Optuna V3の全て - 2022/12/10 Optuna Meetup #2最新リリース:Optuna V3の全て - 2022/12/10 Optuna Meetup #2
最新リリース:Optuna V3の全て - 2022/12/10 Optuna Meetup #2
Preferred Networks
 
Optuna Dashboardの紹介と設計解説 - 2022/12/10 Optuna Meetup #2
Optuna Dashboardの紹介と設計解説 - 2022/12/10 Optuna Meetup #2Optuna Dashboardの紹介と設計解説 - 2022/12/10 Optuna Meetup #2
Optuna Dashboardの紹介と設計解説 - 2022/12/10 Optuna Meetup #2
Preferred Networks
 
スタートアップが提案する2030年の材料開発 - 2022/11/11 QPARC講演
スタートアップが提案する2030年の材料開発 - 2022/11/11 QPARC講演スタートアップが提案する2030年の材料開発 - 2022/11/11 QPARC講演
スタートアップが提案する2030年の材料開発 - 2022/11/11 QPARC講演
Preferred Networks
 
Deep Learningのための専用プロセッサ「MN-Core」の開発と活用(2022/10/19東大大学院「 融合情報学特別講義Ⅲ」)
Deep Learningのための専用プロセッサ「MN-Core」の開発と活用(2022/10/19東大大学院「 融合情報学特別講義Ⅲ」)Deep Learningのための専用プロセッサ「MN-Core」の開発と活用(2022/10/19東大大学院「 融合情報学特別講義Ⅲ」)
Deep Learningのための専用プロセッサ「MN-Core」の開発と活用(2022/10/19東大大学院「 融合情報学特別講義Ⅲ」)
Preferred Networks
 
PFNにおける研究開発(2022/10/19 東大大学院「融合情報学特別講義Ⅲ」)
PFNにおける研究開発(2022/10/19 東大大学院「融合情報学特別講義Ⅲ」)PFNにおける研究開発(2022/10/19 東大大学院「融合情報学特別講義Ⅲ」)
PFNにおける研究開発(2022/10/19 東大大学院「融合情報学特別講義Ⅲ」)
Preferred Networks
 
自然言語処理を 役立てるのはなぜ難しいのか(2022/10/25東大大学院「自然言語処理応用」)
自然言語処理を 役立てるのはなぜ難しいのか(2022/10/25東大大学院「自然言語処理応用」)自然言語処理を 役立てるのはなぜ難しいのか(2022/10/25東大大学院「自然言語処理応用」)
自然言語処理を 役立てるのはなぜ難しいのか(2022/10/25東大大学院「自然言語処理応用」)
Preferred Networks
 
Kubernetes にこれから入るかもしれない注目機能!(2022年11月版) / TechFeed Experts Night #7 〜 コンテナ技術を語る
Kubernetes にこれから入るかもしれない注目機能!(2022年11月版) / TechFeed Experts Night #7 〜 コンテナ技術を語るKubernetes にこれから入るかもしれない注目機能!(2022年11月版) / TechFeed Experts Night #7 〜 コンテナ技術を語る
Kubernetes にこれから入るかもしれない注目機能!(2022年11月版) / TechFeed Experts Night #7 〜 コンテナ技術を語る
Preferred Networks
 
Matlantis™のニューラルネットワークポテンシャルPFPの適用範囲拡張
Matlantis™のニューラルネットワークポテンシャルPFPの適用範囲拡張Matlantis™のニューラルネットワークポテンシャルPFPの適用範囲拡張
Matlantis™のニューラルネットワークポテンシャルPFPの適用範囲拡張
Preferred Networks
 
PFNのオンプレ計算機クラスタの取り組み_第55回情報科学若手の会
PFNのオンプレ計算機クラスタの取り組み_第55回情報科学若手の会PFNのオンプレ計算機クラスタの取り組み_第55回情報科学若手の会
PFNのオンプレ計算機クラスタの取り組み_第55回情報科学若手の会
Preferred Networks
 
続・PFN のオンプレML基盤の取り組み / オンプレML基盤 on Kubernetes 〜PFN、ヤフー〜 #2
続・PFN のオンプレML基盤の取り組み / オンプレML基盤 on Kubernetes 〜PFN、ヤフー〜 #2続・PFN のオンプレML基盤の取り組み / オンプレML基盤 on Kubernetes 〜PFN、ヤフー〜 #2
続・PFN のオンプレML基盤の取り組み / オンプレML基盤 on Kubernetes 〜PFN、ヤフー〜 #2
Preferred Networks
 
Kubernetes Service Account As Multi-Cloud Identity / Cloud Native Security Co...
Kubernetes Service Account As Multi-Cloud Identity / Cloud Native Security Co...Kubernetes Service Account As Multi-Cloud Identity / Cloud Native Security Co...
Kubernetes Service Account As Multi-Cloud Identity / Cloud Native Security Co...
Preferred Networks
 
KubeCon + CloudNativeCon Europe 2022 Recap / Kubernetes Meetup Tokyo #51 / #k...
KubeCon + CloudNativeCon Europe 2022 Recap / Kubernetes Meetup Tokyo #51 / #k...KubeCon + CloudNativeCon Europe 2022 Recap / Kubernetes Meetup Tokyo #51 / #k...
KubeCon + CloudNativeCon Europe 2022 Recap / Kubernetes Meetup Tokyo #51 / #k...
Preferred Networks
 
KubeCon + CloudNativeCon Europe 2022 Recap - Batch/HPCの潮流とScheduler拡張事例 / Kub...
KubeCon + CloudNativeCon Europe 2022 Recap - Batch/HPCの潮流とScheduler拡張事例 / Kub...KubeCon + CloudNativeCon Europe 2022 Recap - Batch/HPCの潮流とScheduler拡張事例 / Kub...
KubeCon + CloudNativeCon Europe 2022 Recap - Batch/HPCの潮流とScheduler拡張事例 / Kub...
Preferred Networks
 
独断と偏見で選んだ Kubernetes 1.24 の注目機能と今後! / Kubernetes Meetup Tokyo 50
独断と偏見で選んだ Kubernetes 1.24 の注目機能と今後! / Kubernetes Meetup Tokyo 50独断と偏見で選んだ Kubernetes 1.24 の注目機能と今後! / Kubernetes Meetup Tokyo 50
独断と偏見で選んだ Kubernetes 1.24 の注目機能と今後! / Kubernetes Meetup Tokyo 50
Preferred Networks
 

Plus de Preferred Networks (20)

PodSecurityPolicy からGatekeeper に移行しました / Kubernetes Meetup Tokyo #57
PodSecurityPolicy からGatekeeper に移行しました / Kubernetes Meetup Tokyo #57PodSecurityPolicy からGatekeeper に移行しました / Kubernetes Meetup Tokyo #57
PodSecurityPolicy からGatekeeper に移行しました / Kubernetes Meetup Tokyo #57
 
Optunaを使ったHuman-in-the-loop最適化の紹介 - 2023/04/27 W&B 東京ミートアップ #3
Optunaを使ったHuman-in-the-loop最適化の紹介 - 2023/04/27 W&B 東京ミートアップ #3Optunaを使ったHuman-in-the-loop最適化の紹介 - 2023/04/27 W&B 東京ミートアップ #3
Optunaを使ったHuman-in-the-loop最適化の紹介 - 2023/04/27 W&B 東京ミートアップ #3
 
Kubernetes + containerd で cgroup v2 に移行したら "failed to create fsnotify watcher...
Kubernetes + containerd で cgroup v2 に移行したら "failed to create fsnotify watcher...Kubernetes + containerd で cgroup v2 に移行したら "failed to create fsnotify watcher...
Kubernetes + containerd で cgroup v2 に移行したら "failed to create fsnotify watcher...
 
深層学習の新しい応用と、 それを支える計算機の進化 - Preferred Networks CEO 西川徹 (SEMICON Japan 2022 Ke...
深層学習の新しい応用と、 それを支える計算機の進化 - Preferred Networks CEO 西川徹 (SEMICON Japan 2022 Ke...深層学習の新しい応用と、 それを支える計算機の進化 - Preferred Networks CEO 西川徹 (SEMICON Japan 2022 Ke...
深層学習の新しい応用と、 それを支える計算機の進化 - Preferred Networks CEO 西川徹 (SEMICON Japan 2022 Ke...
 
Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55
Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55
Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55
 
Kaggle Happywhaleコンペ優勝解法でのOptuna使用事例 - 2022/12/10 Optuna Meetup #2
Kaggle Happywhaleコンペ優勝解法でのOptuna使用事例 - 2022/12/10 Optuna Meetup #2Kaggle Happywhaleコンペ優勝解法でのOptuna使用事例 - 2022/12/10 Optuna Meetup #2
Kaggle Happywhaleコンペ優勝解法でのOptuna使用事例 - 2022/12/10 Optuna Meetup #2
 
最新リリース:Optuna V3の全て - 2022/12/10 Optuna Meetup #2
最新リリース:Optuna V3の全て - 2022/12/10 Optuna Meetup #2最新リリース:Optuna V3の全て - 2022/12/10 Optuna Meetup #2
最新リリース:Optuna V3の全て - 2022/12/10 Optuna Meetup #2
 
Optuna Dashboardの紹介と設計解説 - 2022/12/10 Optuna Meetup #2
Optuna Dashboardの紹介と設計解説 - 2022/12/10 Optuna Meetup #2Optuna Dashboardの紹介と設計解説 - 2022/12/10 Optuna Meetup #2
Optuna Dashboardの紹介と設計解説 - 2022/12/10 Optuna Meetup #2
 
スタートアップが提案する2030年の材料開発 - 2022/11/11 QPARC講演
スタートアップが提案する2030年の材料開発 - 2022/11/11 QPARC講演スタートアップが提案する2030年の材料開発 - 2022/11/11 QPARC講演
スタートアップが提案する2030年の材料開発 - 2022/11/11 QPARC講演
 
Deep Learningのための専用プロセッサ「MN-Core」の開発と活用(2022/10/19東大大学院「 融合情報学特別講義Ⅲ」)
Jubatus Invited Talk at XLDB Asia

  • 1. Distributed Online Machine Learning Framework for Big Data. Shohei Hido, Preferred Infrastructure, Inc., Japan. XLDB Asia, June 22nd, 2012
  • 2. Preferred Infrastructure (PFI): bringing cutting-edge research advances to products
      • Founded: March 2006, located in Tokyo, Japan
      • Employees: 28
          • Top university graduates, including ICPC world finalists
          • Mid-career engineers from Sony, IBM, Yahoo!, Sun
      • Information retrieval, distributed computing, natural language processing, machine learning
  • 3. (no text)
  • 4. Overview: Big Data analytics will go real-time and deeper
      • 1. Bigger data
      • 2. More in real-time
      • 3. Deep analysis
      • No storage; no data sharing; only mix model
  • 5. Jubatus: OSS platform for Big Data analytics
      • Joint development with NTT laboratory in Japan
          • Project started in April 2011
      • Released as open source software
          • Version 0.3.0 was just released
      • You can download it from http://github.com/jubatus/
      • We welcome your contributions and collaboration
  • 6. Agenda
      • What's missing for Big Data analytics
      • Comparison with existing software
      • Inside Jubatus: Update, Analyze, and Mix
      • Jubatus demo
      • Summary
  • 7. Increasing demand in Big Data applications: real-time, deeper analysis
      • Current focus: aggregation and rule processing on bigger data
          • CEP (Complex Event Processing) for real-time processing
          • Hadoop/MapReduce for distributed computation
      • Future: deeper analysis for rapid decisions and actions
          • Ex. 1: Defect detection on the NY power grid [Rudin+, TPAMI2012]
          • Ex. 2: Proactive algorithmic trading [ComputerWorldUK, 2011]
      • [Figure: Hadoop and CEP positioned by data size and depth of analysis, with the question "What will come?"]
      • References: http://web.mit.edu/rudin/www/TPAMIPreprint.pdf, http://www.computerworlduk.com/news/networking/3302464/
  • 8. Key technology: Machine learning
      • Examples that need rapid decisions under uncertainty
          • Anomaly detection from M2M sensor data
          • Energy demand forecast / Smart grid optimization
          • Security monitoring on raw Internet traffic
      • What is missing for fast and deep analytics on Big Data?
          • Online/real-time machine learning platform
          • + Scale-out distributed machine learning platform
      • 1. Bigger data; 2. More in real-time; 3. Deep analysis
  • 9. Online machine learning in Jubatus
      • Batch learning
          • Scans all data before building a model
          • Data must be stored in memory or storage
      • Online learning
          • The model is updated by each data sample
          • Some algorithms come with a theoretical guarantee that the online model converges to the batch model
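
To make the batch/online contrast concrete, here is a minimal sketch of an online learner that updates its model from each sample and never stores the data. It uses a plain perceptron as a stand-in; the function name `online_perceptron` and the sparse `dict` feature format are illustrative assumptions, not part of Jubatus.

```python
from typing import Dict, Iterable, Tuple

Features = Dict[str, float]  # sparse feature vector (assumed format)


def online_perceptron(stream: Iterable[Tuple[Features, int]]) -> Dict[str, float]:
    """Consume (features, label) pairs one at a time; labels are +1 or -1."""
    weights: Dict[str, float] = {}
    for features, label in stream:
        # Score the sample with the current model.
        score = sum(weights.get(k, 0.0) * v for k, v in features.items())
        # Update only on a mistake (or a zero score); the sample is then dropped.
        if label * score <= 0:
            for k, v in features.items():
                weights[k] = weights.get(k, 0.0) + label * v
    return weights
```

Because `stream` can be any iterable, including a generator over a live feed, memory use depends on the model size rather than on the amount of data seen.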
  • 10. Jubatus focuses on the latest online algorithms
      • Advantage: fast and not memory-intensive
          • Low latency and high throughput
          • No need to store large datasets
      • E.g., linear classification algorithms
          • Perceptron (1958)
          • Passive Aggressive (PA) (2003)
          • Confidence Weighted Learning (CW) (2008)
          • AROW (2009)
          • Normal HERD (NHERD) (2010)
      • (Slide annotation: "Very recent progress")
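
As an illustration of one algorithm from this list, the following sketch implements the basic Passive Aggressive update for binary classification: if the hinge loss on a sample is positive, the weights move just far enough to bring that loss to zero. The function name `pa_update` and the sparse `dict` representation are assumptions for this example and do not reflect how Jubatus implements PA.

```python
from typing import Dict

Features = Dict[str, float]


def pa_update(weights: Dict[str, float], features: Features, label: int) -> Dict[str, float]:
    """One basic PA step in place; label is +1 or -1."""
    score = sum(weights.get(k, 0.0) * v for k, v in features.items())
    loss = max(0.0, 1.0 - label * score)          # hinge loss
    sq_norm = sum(v * v for v in features.values())
    if loss > 0.0 and sq_norm > 0.0:
        tau = loss / sq_norm                      # smallest step that zeroes the loss
        for k, v in features.items():
            weights[k] = weights.get(k, 0.0) + tau * label * v
    return weights
```

The step is "passive" when the sample is already classified with margin at least 1 (no change) and "aggressive" otherwise.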
  • 11. Online learning or distributed learning: no unified solution has been available
      • Jubatus combines them into a unified computation framework
      • Real-time/online, small scale and stand-alone: online ML algorithms such as PA [2003] and CW [2008]
      • Real-time/online, large scale and distributed/parallel computing: Jubatus (2011-)
      • Batch, small scale and stand-alone: WEKA (1993-), SPSS (1988-)
      • Batch, large scale and distributed/parallel computing: Mahout (2006-)
  • 12. What Jubatus currently supports
      • Classification (multi-class): Perceptron / PA / CW / AROW
      • Regression: PA-based regression
      • Nearest neighbor: LSH / MinHash / Euclid LSH
      • Recommendation: based on nearest neighbor
      • Anomaly detection*: LOF based on nearest neighbor
      • Graph analysis*: shortest path / centrality (PageRank)
      • Some simple statistics
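
For readers unfamiliar with MinHash, listed above under nearest neighbor, this is a generic sketch of the technique: each set is reduced to a short signature, and the fraction of matching signature positions estimates the Jaccard similarity. The helper names, the use of MD5 as the hash family, and `num_hashes=64` are assumptions chosen for the example, not Jubatus internals.

```python
import hashlib
from typing import Iterable, List


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of an item under a given seed."""
    digest = hashlib.md5(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items: Iterable[str], num_hashes: int = 64) -> List[int]:
    """Signature of a non-empty set: the minimum hash value per seed."""
    unique = set(items)
    return [min(_hash(seed, item) for item in unique) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

Signatures have a fixed length regardless of set size, which is what makes them cheap to compare and to keep in memory.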
  • 13. Agenda
      • What's missing for Big Data analytics
      • Comparison with existing software
      • Inside Jubatus: Update, Analyze, and Mix
      • Jubatus demo
      • Summary
  • 14. Hadoop and Mahout: not good for online learning
      • Hadoop
          • Advantages
              • Many extensions for a variety of applications
              • Good for distributed data storage and aggregation
          • Disadvantage
              • No direct support for machine learning or online processing
      • Mahout
          • Advantage
              • Popular machine learning algorithms are implemented
          • Disadvantages
              • Some implementations are less mature
              • Still not capable of online machine learning
  • 15. Jubatus vs. Hadoop, RDB-based, and Storm: advantage in online AND distributed ML
      • Only Jubatus provides both at the same time

                              Jubatus            Hadoop         RDB               Storm
      Storing Big Data        ✓ (external DB)    ✓✓ (HDFS)      ✓                 ✓ (ext. DB)
      Batch learning          ✓                  ✓✓ (Mahout)    ✓ (SPSS, etc.)    ✕
      Stream processing       ✓                  ✕              ✕                 ✓✓
      Distributed learning    ✓                  ✓✓ (Mahout)    ✕                 ✕
      Online learning         ✓✓                 ✕              ✕                 ✕

      • The online learning row is marked "High importance"
  • 16. Agenda
      • What's missing for Big Data analytics
      • Comparison with existing software
      • Inside Jubatus: Update, Analyze, and Mix
      • Jubatus demo
      • Summary
  • 17. How to make online algorithms distributed? => Not trivial!
      • [Figure: timelines of batch learning (labeled "Easy to parallelize") and online learning (labeled "Hard to parallelize due to frequent updates"), built from "Learn the update" and "Model update" steps]
      • Online learning requires frequent model updates
      • A naïve distributed architecture leads to too many synchronization operations
      • This causes performance problems in terms of network communication and accuracy
  • 18. Solution: Loose model sharing
      • Jubatus shares only the local models, and only in a loose manner
          • Model size << Data size
          • Jubatus DOES NOT share datasets
          • A unique approach compared to existing frameworks
      • Local models can differ between servers
          • Different models are gradually merged
      • [Figure: three servers, each with a local model that becomes a mixed model]
  • 19. Three fundamental operations on Jubatus: UPDATE, ANALYZE, and MIX
      • 1. UPDATE
          • Receive a sample, learn, and update the local model
      • 2. ANALYZE
          • Receive a sample, apply the local model, and return the result
      • 3. MIX (called automatically in the backend)
          • Exchange and merge the local models between servers
      • Cf. Map-Shuffle-Reduce operations on Hadoop
      • Algorithms can be implemented independently of
          • Distribution logic
          • Data sharing
          • Failover
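
The three operations can be sketched as a toy in-process simulation. Everything here is hypothetical: `ToyServer`, its perceptron-style `update`, and a `mix` that simply averages the local weights are assumptions chosen to keep the example short, and none of it is the Jubatus API.

```python
from typing import Dict, List

Features = Dict[str, float]


class ToyServer:
    """Hypothetical stand-in for one server holding a local linear model."""

    def __init__(self) -> None:
        self.model: Dict[str, float] = {}

    def _score(self, features: Features) -> float:
        return sum(self.model.get(k, 0.0) * v for k, v in features.items())

    def update(self, features: Features, label: int) -> None:
        """UPDATE: learn from one sample (label is +1 or -1), local model only."""
        if label * self._score(features) <= 0:
            for k, v in features.items():
                self.model[k] = self.model.get(k, 0.0) + label * v

    def analyze(self, features: Features) -> int:
        """ANALYZE: apply the local model and return a prediction."""
        return 1 if self._score(features) >= 0 else -1


def mix(servers: List[ToyServer]) -> None:
    """MIX: merge the local models (here: average) and push the result to all."""
    keys = set().union(*(s.model for s in servers))
    mixed = {k: sum(s.model.get(k, 0.0) for s in servers) / len(servers) for k in keys}
    for s in servers:
        s.model = dict(mixed)
```

Note that `update` and `analyze` never touch another server; only `mix` does, and it moves model weights, not samples.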
  • 20. UPDATE
      • Each server starts from an initial model
      • Each data sample is sent to one (or two) servers
      • Local models are updated based on the sample
      • Data samples are NEVER shared
      • [Figure: samples are distributed randomly or consistently to the servers; on each server the initial model becomes local model 1 or local model 2]
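
The slide says samples are distributed "randomly or consistently". One possible reading of "consistently" is consistent hashing on a sample key, sketched below alongside random routing. The `HashRing` class, the MD5-based hash, and `replicas=100` are assumptions for illustration; the slides do not say how Jubatus routes samples.

```python
import bisect
import hashlib
import random
from typing import List, Sequence


def _hash(text: str) -> int:
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, servers: Sequence[str], replicas: int = 100) -> None:
        # Each server owns `replicas` points on the ring.
        self._ring = sorted((_hash(f"{s}#{i}"), s) for s in servers for i in range(replicas))
        self._points = [point for point, _ in self._ring]

    def route(self, key: str) -> str:
        """Return the server owning the first ring point after the key's hash."""
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


def route_randomly(servers: List[str]) -> str:
    """Random routing: any server may receive any sample."""
    return random.choice(servers)
```

With the ring, the same key always reaches the same server, and removing a server only moves the keys that server owned.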
  • 21. MIX
      • Each server sends its model diff
      • Model diffs are merged and distributed
      • Only model diffs are transmitted
      • Local model 1 - Initial model = Model diff 1
      • Local model 2 - Initial model = Model diff 2
      • Model diff 1 + Model diff 2 = Merged diff
      • Initial model + Merged diff = Mixed model (on each server)
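
The diff arithmetic on this slide can be sketched with sparse weight dictionaries. The function names are hypothetical, and the merge rule is an assumption: the slide draws the merge as a "+", while this sketch averages the diffs so the mixed model stays on the same scale as the local ones.

```python
from typing import Dict, List

Model = Dict[str, float]


def model_diff(local: Model, initial: Model) -> Model:
    """Model diff = local model - initial model."""
    keys = set(local) | set(initial)
    return {k: local.get(k, 0.0) - initial.get(k, 0.0) for k in keys}


def merge_diffs(diffs: List[Model]) -> Model:
    """Merged diff; averaging is an assumed merge rule (see lead-in)."""
    keys = set().union(*diffs)
    return {k: sum(d.get(k, 0.0) for d in diffs) / len(diffs) for k in keys}


def apply_diff(initial: Model, merged: Model) -> Model:
    """Mixed model = initial model + merged diff."""
    keys = set(initial) | set(merged)
    return {k: initial.get(k, 0.0) + merged.get(k, 0.0) for k in keys}
```

Only the outputs of `model_diff` would need to cross the network in this scheme, which matches the slide's point that only model diffs are transmitted.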
  • 22. UPDATE (iteration)
      • After MIX, the locally updated models are discarded
      • Each server resumes updating from the mixed model
      • The mixed model improves gradually thanks to the contributions of all servers
      • [Figure: samples are distributed randomly or consistently to the servers; on each server the mixed model becomes local model 1 or local model 2]
  • 23. ANALYZE
      • For prediction, each sample goes to a randomly chosen server
      • The server applies the current mixed model to the sample
      • The prediction is returned to the client
      • [Figure: samples are distributed randomly; each server applies its mixed model and returns a prediction]
  • 24. Why can Jubatus work in real time?
      • Focus on online machine learning
          • Make online machine learning algorithms distributed
      • Update locally
          • Online training without communicating with other servers
      • Mix only models globally
          • Small communication cost, low latency, good performance
          • Advantage compared to the costly Shuffle in MapReduce
      • Analyze locally
          • Each server has the mixed model
          • Low latency for making predictions
      • Everything in memory
          • Process data on the fly
  • 25. Agenda
      • What's missing for Big Data analytics
      • Comparison with existing software
      • Inside Jubatus: Update, Analyze, and Mix
      • Jubatus demo
      • Summary
  • 26. Demo: Twitter analysis using natural language processing and machine learning
      • Jubatus classifies each tweet from the Twitter data stream into pre-defined categories.
      • A single Jubatus server is enough to classify over 5,000 queries per second (QPS), which is close to the rate of the raw Twitter stream.
      • We provide a browser-based GUI.
  • 27. Experiment: Estimation of power consumption
      • Jubatus learns the power usage and network data flow patterns of certain servers.
      • The power consumption of individual servers can be estimated in real time by monitoring and analyzing packets, without having to install power measurement modules on all servers.
      • Consumption differs for different types of packets.
      • [Figure: packet data captured via a TAP in a data center / office is used to estimate power for servers with no power meter; a plot compares predicted value (W) with actual value (W)]
  • 28. Agenda
      • What's missing for Big Data analytics
      • Comparison with existing software
      • Inside Jubatus: Update, Analyze, and Mix
      • Jubatus demo
      • Summary
  • 29. Summary
      • Jubatus is the first OSS platform for online distributed machine learning on Big Data streams.
      • Download it from http://github.com/jubatus/
      • We welcome your contributions and collaboration
      • 1. Bigger data; 2. More in real-time; 3. Deep analysis
      • No storage; no data sharing; only mix model