SlideShare une entreprise Scribd logo
1  sur  35
KNITTING BOAR
    Machine Learning, Mahout, and Parallel Iterative Algorithms




    Josh Patterson
    Principal Solutions Architect




1
✛ Josh Patterson
   > Master’s Thesis: self-organizing mesh networks
       ∗   Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
   > Conceived, built, and led Hadoop integration for openPDC project
      at Tennessee Valley Authority (TVA)
   > Twitter: @jpatanooga

   > Email:    josh@cloudera.com
✛ Introduction to Machine Learning
✛ Mahout
✛ Knitting Boar and YARN
✛ Parting Thoughts
Introduction to
    MACHINE LEARNING




4
✛ What is Data Mining?
  > “the process of extracting patterns from data”
✛ Why are we interested in Data Mining?
  > Raw data essentially useless
      ∗ Data is simply recorded facts
      ∗ Information is the patterns underlying the data

✛ Machine Learning
  > Algorithms for acquiring structural descriptions from
    data “examples”
      ∗ Process of learning “concepts”
✛ Information Retrieval
   > information science, information architecture,
     cognitive psychology, linguistics, and statistics.
✛ Natural Language Processing
  > grounded in machine learning, especially statistical
    machine learning
✛ Statistics
  > Math and stuff
✛ Machine Learning
  > Considered a branch of artificial intelligence
✛ ETL
✛ Joining multiple disparate data sources
✛ Filtering data
✛ Aggregation
✛ Cube materialization



        “Descriptive Statistics”
✛ Don’t always assume you need “scale” and
  parallelization
  >   Try it out on a single machine first
  >   See if it becomes a bottleneck!
✛ Will the data fit in memory on a beefy
  machine?
✛ We can always use the constructed model
  back in MapReduce to score a ton of new
  data
✛   http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIG
    MOD2012.pdf
    >   Looks to study data with descriptive statistics in the hopes of building models for
        predictive analytics

✛   Does majority of ML work via Pig custom integrations
    >   Pipeline is very “Pig-centric”
    >   Example: https://github.com/tdunning/pig-vector
    >   They use SGD and Ensemble methods mostly being conducive
        to large scale data mining
✛   Questions they try to answer
    >   Is this tweet spam?
    >   What star rating might this user give this movie?
✛ Data collection performed w Flume
✛ Data cleansing / ETL performed with Hive
  or Pig
✛ ML work performed with
  >   SAS
  >   SPSS
  >   R
  >   Mahout
Introduction to
11
     MAHOUT
✛ Classification
   > “Fraud detection”
 ✛ Recommendation
   > “Collaborative
     Filtering”
 ✛ Clustering
   > “Segmentation”
 ✛ Frequent Itemset
     Mining


12                       Copyright 2010 Cloudera Inc. All rights reserved
✛ Stochastic Gradient Descent
   > Single process
   > Logistic Regression Model Construction
 ✛ Naïve Bayes
   > MapReduce-based
   > Text Classification
 ✛ Random Forests
   > MapReduce-based




13                    Copyright 2010 Cloudera Inc. All rights reserved
✛ An algorithm that looks at a user’s past actions
  and suggests
   > Products
   > Services
   > People
✛ Advertisement
  > Cloudera has a great Data Science training course on
    this topic
  > http://university.cloudera.com/training/data_science/in
    troduction_to_data_science_-
    _building_recommender_systems.html
✛ Cluster words across docs to identify topics
✛ Latent Dirichlet Allocation
✛   Why Machine Learning?
    >   Growing interest in predictive modeling

✛   Linear Models are Simple, Useful
    >   Stochastic Gradient Descent is a very popular tool for
        building linear models like Logistic Regression

✛   Building Models Still is Time Consuming
    >   The “Need for speed”
    >   “More data beats a cleverer algorithm”
Introducing
KNITTING BOAR




 17
✛ Parallelize Mahout’s Stochastic Gradient Descent
  >   With as few extra dependencies as possible

✛ Wanted to explore parallel iterative algorithms
   using YARN
  >   Wanted a first class Hadoop-Yarn citizen
  >   Work through dev progressions towards a stable state
  >   Worry about “frameworks” later
✛ Training                        Training Data

    > Simple gradient descent
      procedure
    > Loss functions needs to be
      convex
 ✛ Prediction                         SGD

   > Logistic Regression:
       ∗ Sigmoid function using
         parameter vector (dot)
         example as exponential
                                     Model
         parameter


19
Current Limitations
 ✛ Sequential algorithms on a single node only
   goes so far
 ✛ The “Data Deluge”
     > Presents algorithmic challenges when combined with
       large data sets
     > need to design algorithms that are able to perform in
       a distributed fashion
 ✛ MapReduce only fits certain types of algorithms




20
Distributed Learning Strategies
 ✛ Langford, 2007
    > Vowpal Wabbit
 ✛ McDonald 2010
   > Distributed Training Strategies for the Structured
     Perceptron
 ✛ Dekel 2010
   > Optimal Distributed Online Prediction Using Mini-
     Batches




21
Input             Processor    Processor    Processor



                                         Superstep 1
     Map      Map      Map

                             Processor    Processor    Processor



     Reduce         Reduce               Superstep 2

                                             . . .
           Output


22
“Are the gains gotten from using X worth the integration costs incurred in
     building the end-to-end solution?

     If no, then operationally, we can consider the Hadoop stack …

     there are substantial costs in knitting together a patchwork of different
     frameworks, programming models, etc.”

     –– Lin, 2012




23
✛ Parallel Iterative implementation of SGD on
     YARN

 ✛ Workers work on partitions of the data
 ✛ Master keeps global copy of merged parameter
     vector




24
✛ Each given a split of the total dataset
   > Similar to a map task
 ✛ Using a modified OLR
   > process N samples in a batch (subset of split)
 ✛ Batched gradient accumulation updates sent to
     master node
     > Gradient influences future models vectors towards
       better predictions




25
✛ Accumulates gradient updates
   > From batches of worker OLR runs
 ✛ Produces new global parameter vector
   > By averaging workers’ vectors
 ✛ Sends update to all workers
   > Workers replace local parameter vector with new
     global parameter vector




26
OnlineLogisticRegression
                                              Knitting Boar’s POLR
                                    Split 1             Split 2             Split 3
           Training Data




                                 Worker 1             Worker 2
                                                                     …   Worker N




                                Partial Model        Partial Model       Partial Model
     OnlineLogisticRegression


                                                     Master



             Model
                                                    Global Model

27
300


     250


     200


     150                                                                     OLR
                                                                             POLR
     100


      50


       0
           4.1   8.2   12.3   16.4   20.5   24.6   28.7   32.8   36.9   41




                 Input Size vs Processing Time


28
Knitting Boar
     PARTING THOUGHTS




29
✛ Parallel SGD
   > The Boar is temperamental, experimental
       ∗ Linear speedup (roughly)

 ✛ Developing YARN Applications
   > More complex the just MapReduce
   > Requires lots of “plumbing”
 ✛ IterativeReduce
    > Great native-Hadoop way to implement algorithms
    > Easy to use and well integrated



30
✛ Knitting Boar
   > https://github.com/jpatanooga/KnittingBoar
   > 100% Java
   > ASF 2.0 Licensed
   > Quick Start
       ∗ https://github.com/jpatanooga/KnittingBoar/wiki/Quick-Start

 ✛ IterativeReduce
    > https://github.com/emsixteeen/IterativeReduce
    > 100% Java
    > ASF 2.0 Licensed


31
✛ Machine Learning is hard
       > Don’t believe the hype
       > Do the work
     ✛ Model development takes
       time
       > Lots of iterations
       > Speed is key here


        Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg



32
✛ Strata / Hadoop World 2012 Slides
   > http://www.cloudera.com/content/cloudera/en/resourc
     es/library/hadoopworld/strata-hadoop-world-2012-
     knitting-boar_slide_deck.html
 ✛ Mahout’s SGD implementation
   > http://lingpipe.files.wordpress.com/2008/04/lazysgdre
     gression.pdf
 ✛ MapReduce is Good Enough? If All You Have is
     a Hammer, Throw Away Everything That’s Not a
     Nail!
     > http://arxiv.org/pdf/1209.2191v1.pdf


33
✛ Langford
    > http://hunch.net/~vw/
 ✛ McDonald, 2010
   > http://dl.acm.org/citation.cfm?id=1858068




34
✛ http://eteamjournal.files.wordpress.com/2011/03/
   photos-of-mount-everest-pictures.jpg
 ✛ http://images.fineartamerica.com/images-
   medium-large/-say-hello-to-my-little-friend--luis-
   ludzska.jpg
 ✛ http://freewallpaper.in/wallpaper2/2202-2-
   2001_space_odyssey_-_5.jpg




35

Contenu connexe

Tendances

Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15MLconf
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkJen Aman
 
Applying your Convolutional Neural Networks
Applying your Convolutional Neural NetworksApplying your Convolutional Neural Networks
Applying your Convolutional Neural NetworksDatabricks
 
Optimizing Machine Learning Pipelines in Collaborative Environments
Optimizing Machine Learning Pipelines in Collaborative EnvironmentsOptimizing Machine Learning Pipelines in Collaborative Environments
Optimizing Machine Learning Pipelines in Collaborative EnvironmentsBehrouz Derakhshan
 
Implementation of linear regression and logistic regression on Spark
Implementation of linear regression and logistic regression on SparkImplementation of linear regression and logistic regression on Spark
Implementation of linear regression and logistic regression on SparkDalei Li
 
HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010Cloudera, Inc.
 
Hadoop live online training
Hadoop live online trainingHadoop live online training
Hadoop live online trainingHarika583
 
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ..."Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...Edge AI and Vision Alliance
 
Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016
Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016
Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016MLconf
 
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...MLconf
 
Scaling out logistic regression with Spark
Scaling out logistic regression with SparkScaling out logistic regression with Spark
Scaling out logistic regression with SparkBarak Gitsis
 
"Trade-offs in Implementing Deep Neural Networks on FPGAs," a Presentation fr...
"Trade-offs in Implementing Deep Neural Networks on FPGAs," a Presentation fr..."Trade-offs in Implementing Deep Neural Networks on FPGAs," a Presentation fr...
"Trade-offs in Implementing Deep Neural Networks on FPGAs," a Presentation fr...Edge AI and Vision Alliance
 
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander UlanovA Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander UlanovSpark Summit
 
Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!DataWorks Summit
 
Developing a Map Reduce Application
Developing a Map Reduce ApplicationDeveloping a Map Reduce Application
Developing a Map Reduce ApplicationDr. C.V. Suresh Babu
 
04 accelerating dl inference with (open)capi and posit numbers
04 accelerating dl inference with (open)capi and posit numbers04 accelerating dl inference with (open)capi and posit numbers
04 accelerating dl inference with (open)capi and posit numbersYutaka Kawai
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to MahoutTed Dunning
 
Modern classification techniques
Modern classification techniquesModern classification techniques
Modern classification techniquesmark_landry
 
Generalized Linear Models with H2O
Generalized Linear Models with H2O Generalized Linear Models with H2O
Generalized Linear Models with H2O Sri Ambati
 
Optimization in deep learning
Optimization in deep learningOptimization in deep learning
Optimization in deep learningJeremy Nixon
 

Tendances (20)

Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
 
Applying your Convolutional Neural Networks
Applying your Convolutional Neural NetworksApplying your Convolutional Neural Networks
Applying your Convolutional Neural Networks
 
Optimizing Machine Learning Pipelines in Collaborative Environments
Optimizing Machine Learning Pipelines in Collaborative EnvironmentsOptimizing Machine Learning Pipelines in Collaborative Environments
Optimizing Machine Learning Pipelines in Collaborative Environments
 
Implementation of linear regression and logistic regression on Spark
Implementation of linear regression and logistic regression on SparkImplementation of linear regression and logistic regression on Spark
Implementation of linear regression and logistic regression on Spark
 
HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010
 
Hadoop live online training
Hadoop live online trainingHadoop live online training
Hadoop live online training
 
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ..."Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
 
Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016
Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016
Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016
 
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
 
Scaling out logistic regression with Spark
Scaling out logistic regression with SparkScaling out logistic regression with Spark
Scaling out logistic regression with Spark
 
"Trade-offs in Implementing Deep Neural Networks on FPGAs," a Presentation fr...
"Trade-offs in Implementing Deep Neural Networks on FPGAs," a Presentation fr..."Trade-offs in Implementing Deep Neural Networks on FPGAs," a Presentation fr...
"Trade-offs in Implementing Deep Neural Networks on FPGAs," a Presentation fr...
 
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander UlanovA Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
 
Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!
 
Developing a Map Reduce Application
Developing a Map Reduce ApplicationDeveloping a Map Reduce Application
Developing a Map Reduce Application
 
04 accelerating dl inference with (open)capi and posit numbers
04 accelerating dl inference with (open)capi and posit numbers04 accelerating dl inference with (open)capi and posit numbers
04 accelerating dl inference with (open)capi and posit numbers
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to Mahout
 
Modern classification techniques
Modern classification techniquesModern classification techniques
Modern classification techniques
 
Generalized Linear Models with H2O
Generalized Linear Models with H2O Generalized Linear Models with H2O
Generalized Linear Models with H2O
 
Optimization in deep learning
Optimization in deep learningOptimization in deep learning
Optimization in deep learning
 

En vedette

En vedette (6)

Anuncio de la empresa sensient prorroga
Anuncio de la empresa sensient prorrogaAnuncio de la empresa sensient prorroga
Anuncio de la empresa sensient prorroga
 
Tarea # 3: Foro de Discusión # 2
Tarea # 3: Foro de Discusión # 2Tarea # 3: Foro de Discusión # 2
Tarea # 3: Foro de Discusión # 2
 
Our holidays
Our holidaysOur holidays
Our holidays
 
Etude Marketing
Etude MarketingEtude Marketing
Etude Marketing
 
Sarah scazzi
Sarah scazziSarah scazzi
Sarah scazzi
 
Washington
WashingtonWashington
Washington
 

Similaire à MACHINE LEARNING, MAHOUT, AND PARALLEL ITERATIVE ALGORITHMS

Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNHadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNJosh Patterson
 
Parallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNParallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNDataWorks Summit
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkAndy Petrella
 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2Mohit Garg
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioAlluxio, Inc.
 
Training Neural Networks
Training Neural NetworksTraining Neural Networks
Training Neural NetworksDatabricks
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
Scalable machine learning
Scalable machine learningScalable machine learning
Scalable machine learningArnaud Rachez
 
A Database-Hadoop Hybrid Approach to Scalable Machine Learning
A Database-Hadoop Hybrid Approach to Scalable Machine LearningA Database-Hadoop Hybrid Approach to Scalable Machine Learning
A Database-Hadoop Hybrid Approach to Scalable Machine LearningMakoto Yui
 
A simulation-based approach for straggler tasks detection in Hadoop MapReduce
A simulation-based approach for straggler tasks detection in Hadoop MapReduceA simulation-based approach for straggler tasks detection in Hadoop MapReduce
A simulation-based approach for straggler tasks detection in Hadoop MapReduceIRJET Journal
 
Optimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data PerspectiveOptimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data Perspectiveপল্লব রায়
 
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...Jason Hearne-McGuiness
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019VMware Tanzu
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5RojaT4
 
Distributed Database practicals
Distributed Database practicals Distributed Database practicals
Distributed Database practicals Vrushali Lanjewar
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Data Con LA
 
Introduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGIntroduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGMapR Technologies
 
Cloud computing_processing frameworks
Cloud computing_processing frameworksCloud computing_processing frameworks
Cloud computing_processing frameworksReem Abdel-Rahman
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudRevolution Analytics
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce FrameworkBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce FrameworkMahantesh Angadi
 

Similaire à MACHINE LEARNING, MAHOUT, AND PARALLEL ITERATIVE ALGORITHMS (20)

Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNHadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
 
Parallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNParallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARN
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
 
Training Neural Networks
Training Neural NetworksTraining Neural Networks
Training Neural Networks
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Scalable machine learning
Scalable machine learningScalable machine learning
Scalable machine learning
 
A Database-Hadoop Hybrid Approach to Scalable Machine Learning
A Database-Hadoop Hybrid Approach to Scalable Machine LearningA Database-Hadoop Hybrid Approach to Scalable Machine Learning
A Database-Hadoop Hybrid Approach to Scalable Machine Learning
 
A simulation-based approach for straggler tasks detection in Hadoop MapReduce
A simulation-based approach for straggler tasks detection in Hadoop MapReduceA simulation-based approach for straggler tasks detection in Hadoop MapReduce
A simulation-based approach for straggler tasks detection in Hadoop MapReduce
 
Optimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data PerspectiveOptimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data Perspective
 
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
 
Distributed Database practicals
Distributed Database practicals Distributed Database practicals
Distributed Database practicals
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
 
Introduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGIntroduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUG
 
Cloud computing_processing frameworks
Cloud computing_processing frameworksCloud computing_processing frameworks
Cloud computing_processing frameworks
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce FrameworkBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
 

Plus de Adam Muise

2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_finalAdam Muise
 
Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015Adam Muise
 
Paytm labs soyouwanttodatascience
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascienceAdam Muise
 
2015 feb 24_paytm_labs_intro_ashwin_armandoadam
2015 feb 24_paytm_labs_intro_ashwin_armandoadam2015 feb 24_paytm_labs_intro_ashwin_armandoadam
2015 feb 24_paytm_labs_intro_ashwin_armandoadamAdam Muise
 
Next Generation Hadoop Introduction
Next Generation Hadoop IntroductionNext Generation Hadoop Introduction
Next Generation Hadoop IntroductionAdam Muise
 
Hadoop at the Center: The Next Generation of Hadoop
Hadoop at the Center: The Next Generation of HadoopHadoop at the Center: The Next Generation of Hadoop
Hadoop at the Center: The Next Generation of HadoopAdam Muise
 
2014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part12014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part1Adam Muise
 
2014 sept 4_hadoop_security
2014 sept 4_hadoop_security2014 sept 4_hadoop_security
2014 sept 4_hadoop_securityAdam Muise
 
2014 july 24_what_ishadoop
2014 july 24_what_ishadoop2014 july 24_what_ishadoop
2014 july 24_what_ishadoopAdam Muise
 
May 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLMay 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLAdam Muise
 
2014 feb 24_big_datacongress_hadoopsession1_hadoop101
2014 feb 24_big_datacongress_hadoopsession1_hadoop1012014 feb 24_big_datacongress_hadoopsession1_hadoop101
2014 feb 24_big_datacongress_hadoopsession1_hadoop101Adam Muise
 
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitectureAdam Muise
 
2014 feb 5_what_ishadoop_mda
2014 feb 5_what_ishadoop_mda2014 feb 5_what_ishadoop_mda
2014 feb 5_what_ishadoop_mdaAdam Muise
 
2013 Dec 9 Data Marketing 2013 - Hadoop
2013 Dec 9 Data Marketing 2013 - Hadoop2013 Dec 9 Data Marketing 2013 - Hadoop
2013 Dec 9 Data Marketing 2013 - HadoopAdam Muise
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0Adam Muise
 
What is Hadoop? Nov 20 2013 - IRMAC
What is Hadoop? Nov 20 2013 - IRMACWhat is Hadoop? Nov 20 2013 - IRMAC
What is Hadoop? Nov 20 2013 - IRMACAdam Muise
 
What is Hadoop? Oct 17 2013
What is Hadoop? Oct 17 2013What is Hadoop? Oct 17 2013
What is Hadoop? Oct 17 2013Adam Muise
 
Sept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionAdam Muise
 
2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive TuningAdam Muise
 
2013 march 26_thug_etl_cdc_talking_points
2013 march 26_thug_etl_cdc_talking_points2013 march 26_thug_etl_cdc_talking_points
2013 march 26_thug_etl_cdc_talking_pointsAdam Muise
 

Plus de Adam Muise (20)

2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final
 
Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015
 
Paytm labs soyouwanttodatascience
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascience
 
2015 feb 24_paytm_labs_intro_ashwin_armandoadam
2015 feb 24_paytm_labs_intro_ashwin_armandoadam2015 feb 24_paytm_labs_intro_ashwin_armandoadam
2015 feb 24_paytm_labs_intro_ashwin_armandoadam
 
Next Generation Hadoop Introduction
Next Generation Hadoop IntroductionNext Generation Hadoop Introduction
Next Generation Hadoop Introduction
 
Hadoop at the Center: The Next Generation of Hadoop
Hadoop at the Center: The Next Generation of HadoopHadoop at the Center: The Next Generation of Hadoop
Hadoop at the Center: The Next Generation of Hadoop
 
2014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part12014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part1
 
2014 sept 4_hadoop_security
2014 sept 4_hadoop_security2014 sept 4_hadoop_security
2014 sept 4_hadoop_security
 
2014 july 24_what_ishadoop
2014 july 24_what_ishadoop2014 july 24_what_ishadoop
2014 july 24_what_ishadoop
 
May 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLMay 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETL
 
2014 feb 24_big_datacongress_hadoopsession1_hadoop101
2014 feb 24_big_datacongress_hadoopsession1_hadoop1012014 feb 24_big_datacongress_hadoopsession1_hadoop101
2014 feb 24_big_datacongress_hadoopsession1_hadoop101
 
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture
 
2014 feb 5_what_ishadoop_mda
2014 feb 5_what_ishadoop_mda2014 feb 5_what_ishadoop_mda
2014 feb 5_what_ishadoop_mda
 
2013 Dec 9 Data Marketing 2013 - Hadoop
2013 Dec 9 Data Marketing 2013 - Hadoop2013 Dec 9 Data Marketing 2013 - Hadoop
2013 Dec 9 Data Marketing 2013 - Hadoop
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
 
What is Hadoop? Nov 20 2013 - IRMAC
What is Hadoop? Nov 20 2013 - IRMACWhat is Hadoop? Nov 20 2013 - IRMAC
What is Hadoop? Nov 20 2013 - IRMAC
 
What is Hadoop? Oct 17 2013
What is Hadoop? Oct 17 2013What is Hadoop? Oct 17 2013
What is Hadoop? Oct 17 2013
 
Sept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical Introduction
 
2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning
 
2013 march 26_thug_etl_cdc_talking_points
2013 march 26_thug_etl_cdc_talking_points2013 march 26_thug_etl_cdc_talking_points
2013 march 26_thug_etl_cdc_talking_points
 

Dernier

A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 

Dernier (20)

A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 

MACHINE LEARNING, MAHOUT, AND PARALLEL ITERATIVE ALGORITHMS

  • 1. KNITTING BOAR Machine Learning, Mahout, and Parallel Iterative Algorithms Josh Patterson Principal Solutions Architect 1
  • 2. ✛ Josh Patterson > Master’s Thesis: self-organizing mesh networks ∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm > Conceived, built, and led Hadoop integration for openPDC project at Tennessee Valley Authority (TVA) > Twitter: @jpatanooga > Email: josh@cloudera.com
  • 3. ✛ Introduction to Machine Learning ✛ Mahout ✛ Knitting Boar and YARN ✛ Parting Thoughts
  • 4. Introduction to MACHINE LEARNING 4
  • 5. ✛ What is Data Mining? > “the process of extracting patterns from data” ✛ Why are we interested in Data Mining? > Raw data essentially useless ∗ Data is simply recorded facts ∗ Information is the patterns underlying the data ✛ Machine Learning > Algorithms for acquiring structural descriptions from data “examples” ∗ Process of learning “concepts”
  • 6. ✛ Information Retrieval > information science, information architecture, cognitive psychology, linguistics, and statistics. ✛ Natural Language Processing > grounded in machine learning, especially statistical machine learning ✛ Statistics > Math and stuff ✛ Machine Learning > Considered a branch of artificial intelligence
  • 7. ✛ ETL ✛ Joining multiple disparate data sources ✛ Filtering data ✛ Aggregation ✛ Cube materialization “Descriptive Statistics”
  • 8. ✛ Don’t always assume you need “scale” and parallelization > Try it out on a single machine first > See if it becomes a bottleneck! ✛ Will the data fit in memory on a beefy machine? ✛ We can always use the constructed model back in MapReduce to score a ton of new data
  • 9. http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIG MOD2012.pdf > Looks to study data with descriptive statistics in the hopes of building models for predictive analytics ✛ Does majority of ML work via Pig custom integrations > Pipeline is very “Pig-centric” > Example: https://github.com/tdunning/pig-vector > They use SGD and Ensemble methods mostly being conducive to large scale data mining ✛ Questions they try to answer > Is this tweet spam? > What star rating might this user give this movie?
  • 10. ✛ Data collection performed w Flume ✛ Data cleansing / ETL performed with Hive or Pig ✛ ML work performed with > SAS > SPSS > R > Mahout
  • 12. ✛ Classification > “Fraud detection” ✛ Recommendation > “Collaborative Filtering” ✛ Clustering > “Segmentation” ✛ Frequent Itemset Mining 12 Copyright 2010 Cloudera Inc. All rights reserved
  • 13. ✛ Stochastic Gradient Descent > Single process > Logistic Regression Model Construction ✛ Naïve Bayes > MapReduce-based > Text Classification ✛ Random Forests > MapReduce-based 13 Copyright 2010 Cloudera Inc. All rights reserved
  • 14. ✛ An algorithm that looks at a user’s past actions and suggests > Products > Services > People ✛ Advertisement > Cloudera has a great Data Science training course on this topic > http://university.cloudera.com/training/data_science/in troduction_to_data_science_- _building_recommender_systems.html
  • 15. ✛ Cluster words across docs to identify topics ✛ Latent Dirichlet Allocation
  • 16. Why Machine Learning? > Growing interest in predictive modeling ✛ Linear Models are Simple, Useful > Stochastic Gradient Descent is a very popular tool for building linear models like Logistic Regression ✛ Building Models Still is Time Consuming > The “Need for speed” > “More data beats a cleverer algorithm”
  • 18. ✛ Parallelize Mahout’s Stochastic Gradient Descent > With as few extra dependencies as possible ✛ Wanted to explore parallel iterative algorithms using YARN > Wanted a first class Hadoop-Yarn citizen > Work through dev progressions towards a stable state > Worry about “frameworks” later
  • 19. ✛ Training Training Data > Simple gradient descent procedure > Loss functions needs to be convex ✛ Prediction SGD > Logistic Regression: ∗ Sigmoid function using parameter vector (dot) example as exponential Model parameter 19
  • 20. Current Limitations ✛ Sequential algorithms on a single node only goes so far ✛ The “Data Deluge” > Presents algorithmic challenges when combined with large data sets > need to design algorithms that are able to perform in a distributed fashion ✛ MapReduce only fits certain types of algorithms 20
  • 21. Distributed Learning Strategies ✛ Langford, 2007 > Vowpal Wabbit ✛ McDonald 2010 > Distributed Training Strategies for the Structured Perceptron ✛ Dekel 2010 > Optimal Distributed Online Prediction Using Mini- Batches 21
  • 22. Input Processor Processor Processor Superstep 1 Map Map Map Processor Processor Processor Reduce Reduce Superstep 2 . . . Output 22
  • 23. “Are the gains gotten from using X worth the integration costs incurred in building the end-to-end solution? If no, then operationally, we can consider the Hadoop stack … there are substantial costs in knitting together a patchwork of different frameworks, programming models, etc.” –– Lin, 2012 23
  • 24. ✛ Parallel Iterative implementation of SGD on YARN ✛ Workers work on partitions of the data ✛ Master keeps global copy of merged parameter vector 24
  • 25. ✛ Each given a split of the total dataset > Similar to a map task ✛ Using a modified OLR > process N samples in a batch (subset of split) ✛ Batched gradient accumulation updates sent to master node > Gradient influences future models vectors towards better predictions 25
  • 26. ✛ Accumulates gradient updates > From batches of worker OLR runs ✛ Produces new global parameter vector > By averaging workers’ vectors ✛ Sends update to all workers > Workers replace local parameter vector with new global parameter vector 26
  • 27. OnlineLogisticRegression Knitting Boar’s POLR Split 1 Split 2 Split 3 Training Data Worker 1 Worker 2 … Worker N Partial Model Partial Model Partial Model OnlineLogisticRegression Master Model Global Model 27
  • 28. 300 250 200 150 OLR POLR 100 50 0 4.1 8.2 12.3 16.4 20.5 24.6 28.7 32.8 36.9 41 Input Size vs Processing Time 28
  • 29. Knitting Boar PARTING THOUGHTS 29
  • 30. ✛ Parallel SGD > The Boar is temperamental, experimental ∗ Linear speedup (roughly) ✛ Developing YARN Applications > More complex the just MapReduce > Requires lots of “plumbing” ✛ IterativeReduce > Great native-Hadoop way to implement algorithms > Easy to use and well integrated 30
  • 31. ✛ Knitting Boar > https://github.com/jpatanooga/KnittingBoar > 100% Java > ASF 2.0 Licensed > Quick Start ∗ https://github.com/jpatanooga/KnittingBoar/wiki/Quick-Start ✛ IterativeReduce > https://github.com/emsixteeen/IterativeReduce > 100% Java > ASF 2.0 Licensed 31
  • 32. ✛ Machine Learning is hard > Don’t believe the hype > Do the work ✛ Model development takes time > Lots of iterations > Speed is key here Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg 32
  • 33. ✛ Strata / Hadoop World 2012 Slides > http://www.cloudera.com/content/cloudera/en/resourc es/library/hadoopworld/strata-hadoop-world-2012- knitting-boar_slide_deck.html ✛ Mahout’s SGD implementation > http://lingpipe.files.wordpress.com/2008/04/lazysgdre gression.pdf ✛ MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail! > http://arxiv.org/pdf/1209.2191v1.pdf 33
  • 34. ✛ Langford > http://hunch.net/~vw/ ✛ McDonald, 2010 > http://dl.acm.org/citation.cfm?id=1858068 34
  • 35. ✛ http://eteamjournal.files.wordpress.com/2011/03/ photos-of-mount-everest-pictures.jpg ✛ http://images.fineartamerica.com/images- medium-large/-say-hello-to-my-little-friend--luis- ludzska.jpg ✛ http://freewallpaper.in/wallpaper2/2202-2- 2001_space_odyssey_-_5.jpg 35

Notes de l'éditeur

  1. Examples of key information: selecting embryos based on 60 featuresYou may be asking “why arent we talking about mahout?”What we want to do here is look at the fundamentals that will underly all of the systems, not just mahoutSome of the wording may be different, but it’s the same
  2. Yeah? Ok let’s look at doing ETL in HadoopAnd then running the model construction phase in another tool like RNo?We need to think of a way to either Refactor the algorithm into MapReducePartition the data such that a reducer can work on each subset
  3. Frequent itemset mining – what appears together
  4. “What do other people w/ similar tastes like?”“strength of associations”
  5. “say hello to my leeeeetle friend….”
  6. Vorpal: doesn’t natively run on HadoopSpark: scala, overhead, integration issues
  7. “Unlikely optimization algorithms such as stochastic gradient descent show  amazing performance for large-scale problems.“Bottou, 2010SGD has been around for decadesyet recently Langford, Bottou, others have shown impressive speed increasesSGD has been shown to train multiple orders of magnitude faster than batch style learnerswith no loss on model accuracy
  8. At current disk bandwidth and capacity (2TB at 100MB/s throughput) 6 hours to read the content of a single HD
  9. Bottou similar to Xu2010 in the 2010 paper
  10. Benefits of data flow: runtime can decide where to run tasks and can automatically recover from failuresAcyclic data flow is a powerful abstraction, but is not efficient for applications that repeatedly reuse a working set of data:Iterative algorithms (many in machine learning)• No single programming model or framework can excel atevery problem; there are always tradeoffs between simplicity, expressivity, fault tolerance, performance, etc.
  11. Some of these are in progress towards being ready on YARN, some not; wanted to focus on OLR and not framework for now
  12. POLR: Parallel Online Logistic RegressionTalking points:wanted to start with a known tool to the hadoop community, with expected characteristicsMahout’s SGD is well known, and so we used that as a base point
  13. 3 major costs of BSP style computations:Max unit compute timeCost of global communicationCost of barrier sync at end of super step
  14. Multi-dimensional: need to constantly think about the Client, the Master, and the Worker, how they interact and the implications of failures, etc.
  15. Basecamp: use story of how we get to basecamp to see how to climb some more