Session 2: Modeling with Hadoop
Algorithms in MapReduce

Vijay K Narayanan
Principal Scientist, Yahoo! Labs, Yahoo!
Outline
•   Why learn models in MapReduce framework?
•   Types of learning in MapReduce
•   Statistical Query Model (SQM)
•   SQM Algorithms in MapReduce
•   Sequential learning methods and MapReduce
•   Challenges and Enhancements
•   Apache Mahout
Why learn models in MapReduce?
• High data throughput
  – Stream about 100 TB per hour using 500 mappers
• Framework provides fault tolerance
  – Monitors mappers and reducers and re-starts tasks on
    other machines should one of the machines fail
• Excels in counting patterns over data records
• Built on relatively cheap, commodity hardware
  – No special purpose computing hardware
• Large volumes of data are being increasingly
  stored on Grid clusters running MapReduce
  – Especially in the internet domain
Why learn models in MapReduce?
• Learning can become limited by computation
  time and not data volume
  – With large enough data and number of machines
  – Reduces the need to down-sample data
  – More accurate parameter estimates compared to
    learning on a single machine for the same amount of
    time
Learning models in MapReduce
• A primer for learning models in MapReduce (MR)
   – Illustrate techniques for distributing the learning algorithm
     in a MapReduce framework
   – Focus on the mapper and reducer computations
• Data parallel algorithms are most appropriate for
  MapReduce implementations
• Not necessarily the optimal implementation for a
  specific algorithm
   – Other specialized non-MapReduce implementations exist
     for some algorithms, which may be better
• MR may not be the appropriate framework for exact
  solutions of non data parallel/sequential algorithms
   – Approximate solutions using MapReduce may be good
     enough
Outline
•   Why learn models in MapReduce framework?
•   Types of learning in MapReduce
•   Statistical Query Model (SQM)
•   SQM Algorithms in MapReduce
•   Sequential learning methods and MapReduce
•   Challenges and Enhancements
•   Apache Mahout
Types of learning in MapReduce
•    Three common types of learning models using the
     MapReduce framework
    1. Parallel training of multiple models
       –   Train either in mappers or reducers
    2. Ensemble training methods
       –   Train multiple models and combine them
    3. Distributed learning algorithms
       –   Learn using both mappers and reducers

    The first two use the Grid as a large cluster of
    independent machines (with fault tolerance)
Parallel training of multiple models

• Train multiple models simultaneously, each with a
  learning algorithm that can run in memory
• Useful when individual models are trained on a subset,
  a filtered version, or a transformation of the raw data
• Can train thousands of models simultaneously
• Train 1 model in each reducer
  – Map:
     • Input: All data
     • Filters subset of data relevant for each model training
     • Output: <model_index, subset of data for training this model>
  – Reduce
     • Train model on data corresponding to that model_index
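A minimal pure-Python sketch of this flow, with the shuffle simulated in memory; the FILTERS table and the train() stand-in are illustrative assumptions, not part of the original design:

    from collections import defaultdict

    # Hypothetical per-model filters: model_index -> predicate over a record.
    FILTERS = {
        "model_A": lambda rec: rec["country"] == "US",
        "model_B": lambda rec: rec["country"] == "UK",
    }

    def mapper(record):
        # Emit <model_index, record> for every model whose filter matches.
        for model_index, keep in FILTERS.items():
            if keep(record):
                yield model_index, record

    def train(records):
        # Toy stand-in for any in-memory learner: the mean of the targets.
        ys = [r["y"] for r in records]
        return sum(ys) / len(ys)

    def reducer(model_index, records):
        # Train one model on all records routed to this model_index.
        return model_index, train(records)

    # Simulate the shuffle phase.
    data = [{"country": "US", "y": 1.0}, {"country": "UK", "y": 3.0},
            {"country": "US", "y": 2.0}]
    groups = defaultdict(list)
    for rec in data:
        for key, value in mapper(rec):
            groups[key].append(value)
    models = dict(reducer(k, v) for k, v in groups.items())
    print(models)   # {'model_A': 1.5, 'model_B': 3.0}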
Parallel training of multiple models
• Train 1 model in each reducer

  [Diagram: each mapper filters its data subgroup and emits
   <"model_1", data for model_1>, <"model_2", data for model_2>, ...;
   each reducer trains Model_1, Model_2, ... on the data routed to
   its model_index key.]
Parallel training of multiple models
• Train 1 model in each mapper

  [Diagram: training data {x, (c_i, c_j, ..., c_k)}, with labels
   c_i in {c_1, c_2, ..., c_M}, is sent to M mappers; Map_1 trains
   Model(c_1), Map_2 trains Model(c_2), ..., Map_M trains Model(c_M).]

  • All data is sent to each mapper (as a cache archive)
  • The mapper partition file determines the training
    configuration and labeling strategy
     – e.g., training one-vs-rest models in multi-class
       classification
     – Can train 1000s of classes in parallel
Ensemble methods
• Train 1 base model in each mapper on a data partition
• Combine the base models using ensemble methods
  (primarily, bagging) in the reducer
• Strictly, bagging requires the data to be sampled with
  replacement
   – However, if the data set is very large, sampling without
     replacement may be ok
• Base models are typically decision trees, SVMs etc.
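A minimal pure-Python sketch of this pattern, assuming a trivial majority-class base learner and simple majority voting in place of a real decision tree or SVM:

    from collections import Counter

    def train_base_model(partition):
        # Stand-in base learner: always predict the partition's majority class.
        majority = Counter(y for _, y in partition).most_common(1)[0][0]
        return lambda x: majority

    def mapper(partition):
        # Train 1 base model on this data partition.
        yield "ensemble", train_base_model(partition)

    def reducer(key, base_models):
        # Bagging-style combination: majority vote of the base models.
        def ensemble(x):
            return Counter(m(x) for m in base_models).most_common(1)[0][0]
        return ensemble

    partitions = [[(0, "spam"), (1, "spam")],
                  [(2, "ham"), (3, "ham")],
                  [(4, "spam"), (5, "spam")]]
    models = [m for part in partitions for _, m in mapper(part)]
    vote = reducer("ensemble", models)
    print(vote(x=42))   # 'spam' (2 of the 3 base models vote spam)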
Ensemble Methods: Random
     Subspace Bagging (RSBag)
• Assume that the training data is partitioned randomly into
  blocks
   – Class distributions are roughly the same across all blocks
• Algorithm (Yan et al. 2007)
   – Learn 1 base model h_c(x) per data sub-group, with
     labels y_c in {-1, +1}
   – Optionally, use a random subset of features to train each model
   – Combine the multiple base models into a composite classifier as
     the final output

                      F_c^i(x) = F_c^(i-1)(x) + h_c^i(x)
RSBag in MapReduce

  [Diagram: the data and a random subset of the features are sent to
   each mapper; Map_1 ... Map_4 each learn a base model
   h_c^1(x) ... h_c^4(x); the reducer combines the base models into
   the final classifier.]
RSBag in MapReduce
• Provides coarse level parallelism at the level of base models
    – Base models can be decision trees, SVMs etc.
• Speed-up with SVM base models

             Speedup ~ 1 / (N r_d^2 r_f),     r_d, r_f = data, feature sampling ratios
             N = 5, r_d = 0.2, r_f = 0.5  =>  Speedup ~ 10

• Can achieve similar performance as a single classifier, with a
  theoretical guarantee, in less learning time

      E*(F_c) <= rho_bar (1 - s_c^2) / s_c^2                    Upper bound on
                                                                generalization error

      rho_bar = E_{theta, theta'}[ rho_x( h(x, theta), h(x, theta') ) ]
                                                                Correlation between
                                                                classifiers

      s_c = 2 E_{x, y_c}[ P_theta( h(x, theta) = y_c ) ] - 1
                                                                Strength of classifier
Robust Subspace Bagging
             (RB-SBag)
• Sometimes the base models may over-fit the
  training data
  – Correlation between base models may be high
• Add a forward selection step for models
  – Iteratively add base models based on their
    performance on a validation dataset (Yan et al. 2009)
• Adds another MapReduce job
  – Select the base models using forward selection based
    on performance metrics on a validation dataset Vc
RB-SBag in MapReduce

  [Diagram: the validation data and the base models h_c^1(x), h_c^2(x),
   ..., h_c^N(x) are sent to N mappers; each mapper emits
   <"c", {h_c, Prediction_c(V)}>; the reducer (1) performs forward
   selection of the base models and (2) combines the selected base
   models into the composite classifier.]
COMET: Cloud of Massive
                   Ensemble Trees
  • Similar to RSBag, but uses Importance-Sampled Voting
    (IVoting) in each base model
  • Samples are weighted with non-uniform probability
  • Each mapper creates a set of data to train on
  • Ensemble after k iterations = E(k)
        – Add new sample to training set:
               • Always if E(k) incorrectly classifies new sample
               • With probability e(k) / (1 - e(k)) if E(k) correctly classifies the new
                 sample, where e(k) = error on the training dataset
  • Variant of Random Forests, in which IVoting generates
    the training samples instead of bagging
  • Use lazy evaluation during prediction
J.D. Basilico, M.A. Munson, T.G. Kolda, K.R. Dixon, W.P. Kegelmeyer, COMET: A Recipe for Learning and Using
Large Ensembles on Massive Data, 2011, http://arxiv.org/PS_cache/arxiv/pdf/1103/1103.2068v1.pdf
Distributed learning algorithms
• Use multiple mappers and reducers to learn 1 model
• Suitable for learning algorithms that
   – Have heavy computing per data record
   – One or few iterations for learning
   – Do not transfer much data between iterations
• Typical algorithms
   – Fit the Statistical query model (SQM)
      • One/few iterations
          – Linear regression, Naïve Bayes, k-means clustering, pair-wise similarity etc.
      • More iterations have high overheads, e.g.,
          – SVM, Logistic regression etc.
   – Divide and conquer
      • Frequent item-set mining, Approximate matrix factorization etc.
Outline
•   Why learn models in MapReduce framework?
•   Types of learning in MapReduce
•   Statistical Query Model (SQM)
•   SQM Algorithms in MapReduce
•   Sequential learning methods and MapReduce
•   Challenges and Enhancements
•   Apache Mahout
Statistical Query Model (SQM)
• Learning algorithm can access the
  learning problem only through a statistical
  query oracle (Kearns 1998)

• Given a function f(x,y) over data instances,
  the statistical query oracle returns an
  estimate of the expectation of f(x,y)
  (averaged over the data distribution).
Statistical Query Model (SQM)

  [Diagram: the learning algorithm sends a query function f(x, y) to a
   statistics oracle, which computes its expectation over the raw data
   samples (X, Y).]

• Learning algorithms that calculate sufficient statistics of data,
  gradients of a function, etc. fit this model

• These calculations can be expressed in a "summation form"
  over subgroups of data (Chu et al. 2006)

                        sum_{subgroup} f(x, y)
SQM in MapReduce
• Distribute the summation calculations over each
  data sub-group
• Map:
  – Calculate function estimates over sub-groups of data
• Reduce
  – Aggregate the function estimates from various sub-
    groups
• Learning algorithm should be able to work with
  these summaries alone
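A minimal pure-Python sketch of the summation form, assuming an arbitrary query function f(x, y); each mapper returns a partial sum over its subgroup and the reducer combines them into the global estimate:

    def f(x, y):
        # Any statistical query function of a single record, e.g. a gradient term.
        return x * y

    def mapper(subgroup):
        # Partial sum of f over this subgroup, plus the subgroup size.
        return sum(f(x, y) for x, y in subgroup), len(subgroup)

    def reducer(partials):
        # Combine the per-subgroup sums into the global average of f(x, y).
        total = sum(s for s, _ in partials)
        count = sum(n for _, n in partials)
        return total / count

    subgroups = [[(1, 2), (3, 4)], [(5, 6)]]
    print(reducer([mapper(g) for g in subgroups]))   # (2 + 12 + 30) / 3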
SQM in MapReduce
• Assume the algorithm depends on 2 functions f(x, y) and g(x, y)

  [Diagram: each mapper processes one data subgroup and emits
   <"f", sum_{subgroup} f(x, y)> and <"g", sum_{subgroup} g(x, y)>;
   the reducer aggregates these into the totals over all N subgroups,
   sum_{N subgroups} f(x, y) and sum_{N subgroups} g(x, y).]
Outline
•   Why learn models in MapReduce framework?
•   Types of learning in MapReduce
•   Statistical Query Model (SQM)
•   SQM Algorithms in MapReduce
•   Sequential learning methods and MapReduce
•   Challenges and Enhancements
•   Apache Mahout
Algorithms in MapReduce
• Many common algorithms can be formulated in
  the SQM framework (Chu et al. 2006)
  – Classification and Regression
     • Linear Regression, Naïve Bayes, Logistic regression,
       Support Vector Machine, Decision Trees
  – Clustering
     • K-means, Canopy clustering, Co-clustering
  – Back-propagation neural network
  – Expectation Maximization
  – PCA
• Recommendations and Frequent Itemset mining
• Graph Algorithms
Classification and Regression
       algorithms in MapReduce
•   Linear Regression
•   Naïve Bayes
•   Logistic Regression
•   Support Vector Machine
•   Decision Trees
Linear regression
•   Data vector:              x_i = (x_i1, x_i2, ..., x_in)^T
•   Real valued target:       y_i
•   Weight of data point:     w_i
•   Data set of points:       {x, y, w}^m

                     y = theta^T x
                     theta* = A^(-1) b

                     A = sum_{i=1}^{m} w_i (x_i x_i^T)
                                                          Summation form
                     b = sum_{i=1}^{m} w_i (x_i y_i)
Linear Regression in MapReduce
• Map:
  – Input data <index, {x, y, w}> from a subgroup of data
  – Output
     • 2 types of keys
         – K1 – for matrix A
             » Value1 = N x N matrix
         – K2 – for vector b
             » Value2 = N x 1 vector

• Reducer:
  – Aggregate the individual mapper outputs for each key
  – Estimate theta* = A^(-1) b
Linear Regression in MapReduce
• A: N x N matrix, b: N x 1 vector

  [Diagram: each mapper processes one subgroup of {x, y, w} and emits
   <"A", sum_{subgroup} w_i x_i x_i^T> and <"b", sum_{subgroup} w_i x_i y_i>;
   the reducer aggregates sum A, sum b and computes theta* = A^(-1) b.]
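A minimal NumPy sketch of this computation; the shuffle is simulated by a plain list of mapper outputs and the data values are illustrative:

    import numpy as np

    def mapper(subgroup):
        # subgroup: list of (x, y, w); emit partial A = sum w x x^T and b = sum w x y.
        n = len(subgroup[0][0])
        A, b = np.zeros((n, n)), np.zeros(n)
        for x, y, w in subgroup:
            x = np.asarray(x, dtype=float)
            A += w * np.outer(x, x)
            b += w * x * y
        return A, b

    def reducer(partials):
        # Aggregate the partial A, b and solve A theta = b.
        A = sum(p[0] for p in partials)
        b = sum(p[1] for p in partials)
        return np.linalg.solve(A, b)

    data = [([1.0, 2.0], 5.0, 1.0), ([1.0, 3.0], 7.0, 1.0), ([1.0, 4.0], 9.0, 1.0)]
    subgroups = [data[:2], data[2:]]
    print(reducer([mapper(g) for g in subgroups]))   # ~[1. 2.]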
Naïve Bayes
• Input data:           x = (x_1, x_2, ..., x_n);    x_j in {a_1^j, a_2^j, ..., a_Pj^j}
                                                     (the domain of x_j)
• Categorical target:   y in {c_1, c_2, ..., c_L}

• Class prediction (class prior and conditional probability table (CPT)):

          y* = argmax_y  P(y = c_k) prod_j P(x_j = a_p^j | y = c_k)

• Two types of sufficient statistics

                       P(x_j = a_p^j | y = c_k)
                                                           Sum counts
                       P(y = c_k)                        over sub-groups
Naïve Bayes in MapReduce
• Map
  – Input data {x, y} from a subgroup of data
  – Output: 3 types of keys

        key = (x_j = a_p^j, y = c_k),  value = sum_{subgroup} 1(x_j = a_p^j, y = c_k)    CPT

        key = (y = c_k),               value = sum_{subgroup} 1(y = c_k)                 Class prior

        key = "samples",               value = sum_{subgroup} 1                          Normalization

• Reduce
  – Sum all the values of each key
  – Compute the class prior and the conditional probabilities
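A minimal pure-Python sketch of the counting step, assuming categorical features and Counter-based partial counts per subgroup (smoothing is omitted; the toy data is illustrative):

    from collections import Counter

    def mapper(subgroup):
        # Emit the 3 key types as counts: (feature, value, class), (class,), and total samples.
        counts = Counter()
        for x, y in subgroup:
            counts[("samples",)] += 1
            counts[("class", y)] += 1
            for j, v in enumerate(x):
                counts[("cpt", j, v, y)] += 1
        return counts

    def reducer(partial_counts):
        # Sum the counts per key, then normalize into class priors and CPTs.
        totals = Counter()
        for c in partial_counts:
            totals.update(c)
        n = totals[("samples",)]
        priors = {k[1]: v / n for k, v in totals.items() if k[0] == "class"}
        cpt = {k[1:]: v / totals[("class", k[3])]
               for k, v in totals.items() if k[0] == "cpt"}
        return priors, cpt

    subgroups = [[(("sunny", "hot"), "no")],
                 [(("rainy", "mild"), "yes"), (("sunny", "mild"), "yes")]]
    priors, cpt = reducer([mapper(g) for g in subgroups])
    print(priors)   # {'no': 0.33..., 'yes': 0.66...}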
Logistic Regression
• Features:               x = (x_1, x_2, ..., x_n)
• Binary target:          y in {0, 1}
• Data:                   {x, y}^m
• Conditional probability:

                 P(y | x, theta) = 1 / (1 + exp(-theta^T x))

• Equivalently

                 log( p / (1 - p) ) = theta^T x

    – Log odds is a linear function of the features
Logistic Regression
• Estimate the parameters by maximizing the log
  conditional likelihood of the observed data

       LCL = sum_{i: y_i = 1} log p_i  +  sum_{i: y_i = 0} log(1 - p_i)

• Optimize using Newton-Raphson to update theta

              theta <- theta - H^(-1) grad(LCL)

              Gradient:  (grad LCL)_j = sum_i (y^i - p^i) x_j^i
                                                                        Summation form
              Hessian:   H_jk = sum_i p^i (p^i - 1) x_j^i x_k^i

              i in [1, m]  (data);   j, k in [1, n]  (features)
Logistic Regression in MapReduce
• A control program sets up the MapReduce iterations
• Map
   – Input: {x, y}
   – Output:

         key = g,  value = ( j,    sum_{i in subgroup} (y^i - p^i) x_j^i )

         key = h,  value = ( j, k, sum_{i in subgroup} p^i (p^i - 1) x_j^i x_k^i )

• Reduce
   – Aggregate the values of (grad LCL)_j and H_jk from all mappers
   – Compute H^(-1) grad(LCL)
   – Update
                 theta <- theta - H^(-1) grad(LCL)

• Stop when updates become small                1 update per iteration
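A minimal NumPy sketch of this setup; the driver loop, the toy data, and the split into two subgroups are illustrative assumptions:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def mapper(subgroup, theta):
        # Partial gradient and Hessian of the log conditional likelihood.
        X = np.array([x for x, _ in subgroup], dtype=float)
        y = np.array([t for _, t in subgroup], dtype=float)
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)                          # sum_i (y_i - p_i) x_i
        hess = -(X * (p * (1 - p))[:, None]).T @ X    # sum_i p_i (p_i - 1) x_i x_i^T
        return grad, hess

    def reducer(partials, theta):
        # Aggregate the partials and apply one Newton-Raphson update.
        grad = sum(g for g, _ in partials)
        hess = sum(h for _, h in partials)
        return theta - np.linalg.solve(hess, grad)

    data = [([1.0, 0.5], 1), ([1.0, 1.5], 1), ([1.0, -1.0], 0),
            ([1.0, -2.0], 0), ([1.0, 0.2], 0), ([1.0, -0.3], 1)]
    theta = np.zeros(2)
    for _ in range(5):   # control program: one MapReduce job per Newton step
        theta = reducer([mapper(data[:3], theta), mapper(data[3:], theta)], theta)
    print(theta)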
Support Vector Machine
• Features:          x in R^n
• Binary target:     y in {-1, +1}
• Objective function in primal form

         min_{w,b}  ||w||^2 + C sum_i xi_i^p
         s.t.       y_i (w^T x_i + b) >= 1 - xi_i,    xi_i >= 0

                p = 1 (hinge loss), p = 2 (quadratic loss)

• For quadratic loss, batch gradient descent to estimate w

              G_w = 2w + 2C sum_i (w . x_i - y_i) x_i
              (the sum runs over the margin-violating points, y_i w . x_i < 1)

                                       Summation form
Support Vector Machine in
            MapReduce
• Map
  – Input: ( x, y}
  – Output:
            key  GGW , value  2w  2C       w.x  y  x
                                          subgroup
                                                     i   i   i


• Reduce
  – Aggregate the values of gradient from all mappers
  – Update
                w  w  * Gw
• Driver program sets up the iterations and checks
  for convergence
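A minimal NumPy sketch of the batch gradient step, assuming the quadratic-loss gradient is accumulated only over margin-violating points; the learning rate eta, C, and the toy data are illustrative:

    import numpy as np

    def mapper(subgroup, w, C):
        # Partial term 2C * sum (w.x_i - y_i) x_i over points with y_i * w.x_i < 1;
        # the 2w regularizer term is added once in the reducer.
        g = np.zeros_like(w)
        for x, y in subgroup:
            x = np.asarray(x, dtype=float)
            if y * (w @ x) < 1:
                g += 2 * C * ((w @ x) - y) * x
        return g

    def reducer(partials, w, eta):
        # Aggregate the partial gradients, add the regularizer, take a descent step.
        grad = 2 * w + sum(partials)
        return w - eta * grad

    data = [([1.0, 2.0], 1), ([1.0, -2.0], -1), ([1.0, 1.5], 1), ([1.0, -1.0], -1)]
    w, C, eta = np.zeros(2), 1.0, 0.01
    for _ in range(200):   # driver program iterates until convergence
        w = reducer([mapper(data[:2], w, C), mapper(data[2:], w, C)], w, eta)
    print(w)   # learned weight vector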
Decision Trees
•   Features:    x = (x_1, x_2, ..., x_n)
•   Targets:     y in {0, 1}  or  y in R
•   Data:        D = {x, y}^m
•   Construct Tree
    – Each node splits the data by feature value
    – Start from root
        • Select best feature, value to split the node
            – Based on reduction in data impurity between the child and
              parent nodes
    – Select the next child node
    – Repeat the process till some stopping criterion
        • Pure node, or data is below some threshold etc.
Decision Trees

  [Diagram: growing the tree by repeatedly finding the best split at
   each node - the expensive step for large datasets.]

B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo, PLANET: Massively Parallel
Learning of Tree Ensembles with MapReduce, 2009, Proceedings of the VLDB
Endowment (PVLDB), vol. 2, no. 2, pp. 1426-1437
PLANET for Decision Trees
•   Parallel Learner for Assembling Numerous Ensemble
    Trees (PLANET- Panda et al. 2009)
    – Main idea is to use MapReduce to determine the best feature
      value splits for nodes from large datasets
•   Each intermediate node has a sub-set of all data falling
    into it
•   If this sub-set is small enough to fit in memory,
    – Grow remaining sub-tree in memory
•   Else,
    – Launch a MapReduce job to find candidate feature value splits
    – Select the best feature split from among the candidates
PLANET for Decision Trees
• 5 main components
  1. Controller
     • Monitors and controls the growth of the tree
  2. Initialization Task (MapReduce task)
     • Identifies all feature values to be considered for splits
  3. FindBestSplit Task (MapReduce task)
     • Finds the best split when there is too much data to fit in memory
  4. InMemoryGrow Task (MapReduce task)
     • Grows an entire sub-tree once the data fits in memory
  5. Model File
     • File describing the state of the model
PLANET for Decision Trees
• Maintain 2 queues
   – MapReduceQueue (MRQ)
      • Contains nodes for which data is too large to fit in memory
   – InMemoryQueue (InMemQ)
      • Contains nodes for which data fits in memory
• 2 main MapReduce jobs
   – MR_ExpandNodes
      • Process nodes from the MRQ to find best split
      • Output for each node:
           – Candidate split positions for node along with
               » Quality of split (using summary statistics)
               » Predictions in left and right branches
               » Size of data going into left and right branches
   – MR_InMemory
      • Process nodes from the InMemQ.
      • For a given set of nodes N, complete tree induction at nodes in N using the
        InMemoryGrow algorithm.
PLANET for Decision Trees
• Map function in MR_ExpandNodes
   – Load the current model file and set of nodes N from MRQ
   – For each record
       • Determine if record is relevant to any of the nodes in N
       • Add record to the summary statistics (SS) for node
       • For each feature-value in record
            – Add record to the summary statistics for node for split points “s” less than the
              value in record “v”
   – Output

       key = (n in N, x = ordered feature, s);   value = T_{n, x < s}        SS of candidate
       key = (n in N, x = categorical feature);  value = (v, T_{n, x = v})   splits (Split ID)

       key = (n in N);                           value = SS                  SS of parent node

       T_{n, x < s} = SS = ( sum_{subgroup} y,  sum_{subgroup} y^2,  sum_{subgroup} 1 )
                                                             SS for variance impurity
PLANET for Decision Trees
• Reduce function in MR_ExpandNodes
  – For each node
      • Aggregate the summary statistics for that node
  – For each split (which is node specific)
      • Aggregate the summary statistics for that Split ID from all map
        outputs of summary statistics
      • Compute impurity of data going into left and right branches
      • Total impurity = Impurity in left branch + Impurity in right branch
      • If Total impurity < Best split impurity so far
          – Best split = Current split
  – Output the best split found
Clustering algorithms in
            MapReduce
• k-means clustering
• Canopy clustering
• Co-clustering
k-means clustering
• Choose k samples as initial cluster centroids
• Iterate till convergence
  – Assign membership of each point to closest cluster
                                                         MR
  – Re-compute new cluster centroids using assigned
    members
• Control program to
  – Initialize the centroids
     • random, initial clustering on sample etc.
  – Run the MapReduce iterations
  – Determine stopping criterion
k-means clustering in MapReduce
• Map
  –   Input data points: x1 , x2 ...xN
  –   Input cluster centroids: C  (c1 , c2 ,...cK )
  –   Assign each data point to closest cluster
  –   Output
                                                                         
            key = c_i,  value = ( sum_{x_j -> c_i} x_j ,  sum_{x_j -> c_i} 1 )

• Reduce
  – Compute the new centroid for each cluster c_i

            new centroid of c_i = ( sum over all mappers of sum_{x_j -> c_i} x_j )
                                  / ( sum over all mappers of sum_{x_j -> c_i} 1 )
Complexity of k-means clustering
• Each point is compared with each cluster centroid
• Complexity = N * K * O(d ) where O(d ) is the complexity
  of the distance metric
• Typical Euclidean distance is not a cheap operation
• Can reduce complexity using an initial canopy clustering
  to partition data cheaply
   – Preliminary step to help reduce expensive distance calculations
   – Group data into (possibly overlapping) canopies using a cheap
     distance metric (McCallum et al. 2000)
   – Compute the distance metric between a point and a cluster
     centroid only if they share a canopy.
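A minimal NumPy sketch of the k-means mapper/reducer iteration from the "k-means clustering in MapReduce" slide above (the canopy optimization is omitted; the toy points and fixed number of driver iterations are illustrative):

    import numpy as np

    def mapper(points, centroids):
        # For each point, add it to the partial sum and count of its closest centroid.
        partial = {}
        for x in points:
            i = int(np.argmin([np.linalg.norm(x - c) for c in centroids]))
            s, n = partial.get(i, (np.zeros_like(x), 0))
            partial[i] = (s + x, n + 1)
        return partial

    def reducer(partials, centroids):
        # New centroid = (sum of assigned points) / (count of assigned points).
        new = list(centroids)
        for i in range(len(centroids)):
            sums = [p[i] for p in partials if i in p]
            if sums:
                new[i] = sum(s for s, _ in sums) / sum(n for _, n in sums)
        return new

    points = [np.array(p, dtype=float) for p in [(0, 0), (0, 1), (10, 10), (10, 11)]]
    centroids = [points[0], points[2]]
    for _ in range(5):   # control program runs the MapReduce iterations
        centroids = reducer([mapper(points[:2], centroids),
                             mapper(points[2:], centroids)], centroids)
    print(centroids)   # ~[ [0, 0.5], [10, 10.5] ]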
Canopy clustering
•    Every point in the dataset is in a canopy
•    A point can belong to multiple canopies
•    Canopy size = T1
•    Algorithm
      – Keep a list of canopies, initially an empty list
      – Scan each data point:
            • If it is within T2 < T1 distance of existing canopies, discard it.
              Otherwise, add this point into the list of canopies
      – Use a cheap distance metric to construct the
        canopies
            • e.g., Manhattan (L1) distance
      – Assign points to the closest canopy

A. McCallum, K. Nigam, L. Ungar. Efficient Clustering of High Dimensional Data Sets with Application
to Reference Matching, SIGKDD 2000
Canopy clustering

  [Figure omitted; image from
   http://horicky.blogspot.com/2011/04/k-means-clustering-in-map-reduce.html]
Canopy clustering in MapReduce
• Map
  – Input data points: x_1, x_2, ..., x_N
  – If a data point is not within distance T2 of an existing
    candidate canopy point, add it as a candidate canopy point
  – Output
         key = 1,  value = { x_i | x_i is a candidate canopy point }
• Reduce
  – Keep a list of final canopy points, initially an empty list
  – If a candidate canopy point is not within distance 0.5 * T2 of an
    existing final canopy point, add it as a final canopy point
  – Output
            key = 1,  value = { x_i | x_i is a final canopy point }
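A minimal pure-Python sketch of the candidate/final canopy selection above, using Manhattan distance as the cheap metric; the threshold T2 and the toy points are illustrative:

    import numpy as np

    T2 = 2.0   # tight threshold; the reducer uses 0.5 * T2 as on the slide above

    def cheap_distance(a, b):
        # Manhattan (L1) distance as the cheap metric.
        return float(np.sum(np.abs(np.asarray(a) - np.asarray(b))))

    def mapper(points):
        # Emit candidate canopy points for this data partition.
        candidates = []
        for x in points:
            if all(cheap_distance(x, c) > T2 for c in candidates):
                candidates.append(x)
        return [(1, c) for c in candidates]

    def reducer(candidates):
        # Merge candidates from all mappers into the final canopy points.
        finals = []
        for c in candidates:
            if all(cheap_distance(c, f) > 0.5 * T2 for f in finals):
                finals.append(c)
        return finals

    partitions = [[(0, 0), (0.5, 0.5), (9, 9)], [(0.2, 0.1), (9.5, 9.2)]]
    candidates = [c for part in partitions for _, c in mapper(part)]
    print(reducer(candidates))   # [(0, 0), (9, 9)]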
Canopy + k-means clustering
• Final step in canopy clustering assigns all points
  to the closest final canopy point
  – Map only operation
• Speeding up k-means using canopy clustering
  – Initial run of canopy clustering on the data (or on a
      sample of data)
     • Pick canopy centers
     • Assign points to canopies
  – Pick initial k-means cluster centroids
     • Run k-means iterations
  – Compute distance between point and centroid only if
    they are in the same canopy
Co-clustering
• Cluster pair-wise relationships in dyadic data
• Simultaneously cluster both rows and columns,
  based on certain criteria
• Identify sub-matrices of rows and columns that
  are inter-related
• Commonly used in text mining, recommendation
  systems and graph mining
Co-clustering
• Given an m x n matrix
   – Find group assignments of rows and columns such that the resulting
     sub-matrices are smooth (Papadimitriou & Sun, 2008)
   – Assign rows and columns to clusters

          r in {1, 2, ..., k}^m ,    c in {1, 2, ..., l}^n ,    k <= m, l <= n

   Example (original matrix, row/column assignments, permuted matrix):

   0 1 0 1 1         r = (2 1 2 1)^T            1 1 0 0 0
   1 0 1 0 0                                    1 1 0 0 0
   0 1 0 1 1         c = (2 1 2 1 1)^T          0 0 1 1 1
   1 0 1 0 0                                    0 0 1 1 1
Co-clustering
• Iteratively re-arrange rows and columns as long as an
  error function keeps decreasing
• Algorithm: Input A_{m x n}, k, l
    – Initialize r and c
    – Compute a group statistics/cost matrix G_{k x l}
    – While the cost decreases
          • For each row i = 1 ... m do
                – For each row group label p = 1 ... k do
                    »   r(i) <- p    if the cost decreases
          • Update G, r
          • Do the same for the columns
    – Return r and c

S. Papadimitriou, J. Sun, DisCo: Distributed Co-clustering with Map-Reduce, 2008,
  ICDM '08. Eighth IEEE International Conference on Data Mining, pp 512-521
Co-clustering in MapReduce
• Assumptions
  – Error can be computed using r, c, G only (sufficient statistics)
  – Row assignments can be based on r, c, G, a_i: (greedy search)
• Map:
  – The cost matrix and the column cluster assignments are available in all mappers
  – Input:
      • Key = row index i
      • Value = adjacency list for row i  (a_i:)
  – Compute:
      • Row statistics for the current column cluster assignment, g_i(a_i:, c)
      • Assign the row to the row cluster r(i) in {1 ... k} with the lowest cost
  – Output:
                    key = r(i)                  Row cluster label for the row
                    value = (g_i, {i})          Cost of the cluster assignment, row index
Co-clustering in MapReduce
• Reduce
   – For each row cluster label p, merge the rows and the total cost

          p = r(i)        g_p = sum_{j: r(j) = p} g_j        I_p = I_p ∪ {i}

          (row cluster label,  total cost,  rows in this row cluster)
   – Output
                         ( p, (g_p, I_p) )

• Collect the results for each row cluster
   – For each reduce output
                            g_{p,:} = g_p
                            r(i) = p  for all i in I_p
Co-clustering in MapReduce -
                  Example
• Assume a row and column partitioning for the matrix, with k = 2, l = 2

         0 1 0 1 1           r = (1, 1, 1, 2)   ->   r = (1, 2, 1, 2)
         1 0 1 0 0           c = (1, 1, 1, 2, 2)
         0 1 0 1 1           Cost function = number of non-zeros per group
         1 0 1 0 0
                                 | 4 4 |                | 2 4 |
                             G = | 2 0 |     ->     G = | 4 0 |

              Map:                                        Reduce:
   Input:  (2, <1, 3>)                            Input:  (2, <(2, 0), {2}>)
   Output: (r(2) = 2, (g_2 = (2, 0), {2}))        Output: g_2 = (2, 0)
                                                          I_2 = I_2 ∪ {2}
S. Papadimitriou, J. Sun, DisCo: Distributed Co-clustering with Map-Reduce, 2008,
  ICDM '08. Eighth IEEE International Conference on Data Mining, pp 512-521
Recommendations and Frequent
          Itemset mining
•   Item-based collaborative filtering
•   Pair-wise similarity
•   Low-rank matrix factorization
•   Frequent Itemset mining
Item-based collaborative filtering
• Given a user-item ratings matrix, fill in the ratings of the missing
  items for each user
                        ITEM RATINGS
                  U        5  1  4
                  S
                  E        ?  2  5
                  R
                           4  3  2

• Infer missing ratings from the available item ratings for the user, weighted
  by the similarity between the items

                   R(u, i) = sum_{j: R(u,j) != ?} sim(i, j) * R(u, j)
                             ----------------------------------------
                                 sum_{j: R(u,j) != ?} sim(i, j)
Item-based collaborative filtering
• Estimate the similarity between items as the Pearson
  correlation of ratings from users who have
  rated both items.

                  sum_{u in U_ij} ( R(u,i) - R_bar(i) ) ( R(u,j) - R_bar(j) )
  sim(i, j) = ---------------------------------------------------------------------------------
              sqrt( sum_{u in U_ij} ( R(u,i) - R_bar(i) )^2 ) sqrt( sum_{u in U_ij} ( R(u,j) - R_bar(j) )^2 )

  U_ij = { u | R(u,i) != ?,  R(u,j) != ? }
Item-based collaborative filtering
        using MapReduce
• Map
  – Input:
      key = u
      value = { (i, R(u,i)) | R(u,i) != ? }
  – Output: ratings for item pairs
      key = (i, j)
      value = ( R(u,i), R(u,j) )
• Reduce
  – Input:
      key = (i, j)
      value = [ ( R(u,i), R(u,j) ) ]
  – Output:
      key = (i, j)
      value = sim(i, j)
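A minimal pure-Python sketch of this job, with the shuffle simulated in memory; for brevity the item means in the Pearson correlation are computed over the co-rating users only, and the toy ratings are illustrative:

    from collections import defaultdict
    from itertools import combinations
    from math import sqrt

    def mapper(user, ratings):
        # ratings: {item: rating} for one user; emit co-ratings for each item pair.
        for (i, ri), (j, rj) in combinations(sorted(ratings.items()), 2):
            yield (i, j), (ri, rj)

    def reducer(pair, co_ratings):
        # Pearson correlation over the users who rated both items of the pair.
        n = len(co_ratings)
        mi = sum(a for a, _ in co_ratings) / n
        mj = sum(b for _, b in co_ratings) / n
        num = sum((a - mi) * (b - mj) for a, b in co_ratings)
        den = (sqrt(sum((a - mi) ** 2 for a, _ in co_ratings)) *
               sqrt(sum((b - mj) ** 2 for _, b in co_ratings)))
        return pair, (num / den if den else 0.0)

    users = {"u1": {"A": 5, "B": 1, "C": 4},
             "u2": {"B": 2, "C": 5},
             "u3": {"A": 4, "B": 3, "C": 2}}
    groups = defaultdict(list)
    for u, r in users.items():
        for key, value in mapper(u, r):
            groups[key].append(value)
    print(dict(reducer(k, v) for k, v in groups.items()))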
Pair-wise Similarity
• Compute similarity between pairs of
  documents in a corpus
        S(d_i, d_j) = sum_{t in V} w_{t,di} * w_{t,dj}  =  sum_{t in d_i ∩ d_j} w_{t,di} * w_{t,dj}

• Generate a postings list for each t in V

        P(t) = { (d_i, w_{t,di}) | w_{t,di} > 0 }

  – This is an easy MapReduce job
Pair-wise Similarity in MapReduce
• Generating a postings list (inverted index)
  – Map
          Input: d_i
          For each t in d_i
            Emit { t, (d_i, w_{t,di}) }

  – Reduce
          Emit { t, [(d_i, w_{t,di})] }
Pair-wise Similarity in MapReduce
• Map
  – Input: term postings list <t, P(t)>
  – Take the Cartesian product of the postings list with
    itself
     • For each pair (d_i, d_j) in P(t)

                     Emit < (i, j), sim(i, j) = w_{t,di} * w_{t,dj} >
• Reduce
  – For each key (i, j)

                  Sim(d_i, d_j) = sum over terms of sim(i, j)
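A minimal pure-Python sketch of the two jobs above (inverted index, then pair scoring via the Cartesian product of each postings list), with illustrative term weights:

    from collections import defaultdict
    from itertools import combinations

    def index_mapper(doc_id, weights):
        # weights: {term: w_{t,d}}; emit postings <term, (doc_id, weight)>.
        for t, w in weights.items():
            if w > 0:
                yield t, (doc_id, w)

    def similarity_mapper(term, postings):
        # Cartesian product of a term's postings list with itself.
        for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
            yield (di, dj), wi * wj

    def similarity_reducer(pair, partial_scores):
        # Sum the per-term partial scores for each document pair.
        return pair, sum(partial_scores)

    docs = {"d1": {"apache": 2, "hadoop": 1},
            "d2": {"hadoop": 3, "mapreduce": 1},
            "d3": {"apache": 1, "mapreduce": 2}}
    postings = defaultdict(list)
    for d, w in docs.items():
        for t, p in index_mapper(d, w):
            postings[t].append(p)
    pairs = defaultdict(list)
    for t, plist in postings.items():
        for pair, score in similarity_mapper(t, plist):
            pairs[pair].append(score)
    print(dict(similarity_reducer(k, v) for k, v in pairs.items()))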
Pair-wise Similarity in MapReduce
• Cartesian product of postings list with itself may produce a large
  set of intermediate keys
• Modify the above algorithm as follows
   – Split the corpus into blocks of documents and query against postings list
    – Map
       • Input: term postings list <t, P(t)>
       • Load a block of documents in memory
       • For each document d_i in the block
            – If t is in d_i, compute the partial similarity score of d_i against each
              document in the postings list
    – Reduce
        • For each document, aggregate the partial scores from the mappers for all other
          documents
• Can reduce intermediate keys by implementing term limits when
  documents are loaded into memory
Low-rank matrix factorizations
• Useful for analyzing patterns in dyadic data
          V_{m x n} ~ W_{m x d} H_{d x n},        d << min(m, n)

• Given an application dependent loss function, find

                  argmin_{W,H} L(V, W, H)

• Most loss functions are sums of local losses

                    L = sum_{(i,j) in Z} l(V_ij, W_i*, H_*j)

• Use stochastic gradient descent (SGD) for this
  factorization
SGD for matrix factorization
Training set Z = { V_ij | V_ij != ? },  initial values W_0, H_0
While not converged, do
  Select a training point (i, j) in Z uniformly at random

  W'_i* = W_i* - eps_n N (d/dW_i*) l(V_ij, W_i*, H_*j)       For local losses, the gradients
                                                             depend only on
  H_*j  = H_*j - eps_n N (d/dH_*j) l(V_ij, W_i*, H_*j)       V_ij, W_i*, H_*j

  W_i*  = W'_i*

end while
     R. Gemulla, P.J. Haas, E. Nijkamp, Y. Sismanis, IBM Tech Report , 2011
SGD for matrix factorization in
            MapReduce
• Main ideas
   – The local loss depends only on V_ij, W_i*, H_*j
   – If sub-matrices do not share rows and columns, they can be
     factored independently and the factors combined:

                                  Z^b = W^b H^b

        Z = | Z^11   0    ...   0  |       W = | W^1 |       H = ( H^1, H^2, ..., H^d )
            |  0    Z^22  ...      |           | W^2 |
            |             ...   0  |           | ... |
            |  0     ...      Z^dd |           | W^d |

   – Stratify the input matrix such that each stratum can be processed
     in a distributed manner
SGD for matrix factorization in
             MapReduce
• Stratify the input matrix (dropping missing values) into subsets
  Z_s^1, Z_s^2, ..., Z_s^d such that

          i != i' and j != j'  for all (i, j) in Z_s^b1, (i', j') in Z_s^b2,  b1 != b2

• Stratification
    – Randomly permute the rows and columns of the input matrix
    – Block the m x n matrix into d x d blocks Z^11, ..., Z^dd, each of size (m/d) x (n/d)
    – For a permutation j_1, j_2, ..., j_d of 1 ... d:

          Z_s = Z^{1 j_1} ∪ Z^{2 j_2} ∪ Z^{3 j_3} ∪ ... ∪ Z^{d j_d}

                                             R. Gemulla, P.J. Haas, E. Nijkamp, Y. Sismanis, IBM Tech Report, 2011
SGD for matrix factorization in
         MapReduce
Training set Z, initial values W_0, H_0, cluster size d
W = W_0, H = H_0
Block Z / W / H into (d x d) / (d x 1) / (1 x d) blocks
While not converged, do                                      Epochs
  Pick step size eps
  For s = 1 ... d do                                         Sub-epochs
     Pick d blocks ( Z^{1 j_1}, Z^{2 j_2}, ..., Z^{d j_d} ) to form a stratum Z_s
     For b = 1 ... d do (in parallel)                        Machines
        Run SGD on the points in Z^{b j_b} with step size eps
     end for
  end for
end while

                             R. Gemulla, P.J. Haas, E. Nijkamp, Y. Sismanis, IBM Tech Report, 2011
Frequent Itemset Mining
• Set of items I = {a_1, a_2, ..., a_M} and transactions D = {T_1, T_2, ..., T_N},
  where each T_i is a subset of I
• A pattern A ⊆ I is frequent if
                    support(A) >= minimum support
• Problem
    – Find all complete frequent item-sets of D
• Divide and conquer approach
    – Patterns containing A can be found using only transactions
      containing A
    – Filter the transactions containing A - the conditional database (CDB) of A
    – Find patterns containing A in CDB(A)
Frequent Itemset Mining
• Construct a Frequent Pattern (FP) Tree
   –   Keep only items with frequency above the minimum support
   –   Sort each transaction in descending order of frequent items
   –   Add each sorted transaction to an item prefix tree
   –   Each node in the FP tree is an item
         • Node has count of transactions with that item in that path
         • Nodes of same items in different paths are linked together


• FPGrowth algorithm
   – Start from CDB of single frequent item
   – Build FP Tree of CDB
   – Mine frequent patterns from CDBs using recursion
         • Recursion terminates when CDB has a single path
         • Frequent pattern = Union of all nodes in this tree with support = min. support
           of nodes in this tree

  Mining frequent patterns without candidate generation, J. Han, J. Pei,Y. Yin. 2000, In SIGMOD, 2000.
Frequent Itemset Mining

  Original        Frequent items      Sorted          Conditional databases of
  transactions    (with counts)       transactions    frequent items
  -------------   -----------------   ------------    --------------------------------
  facdgimp        f:4  c:4  a:3       fcamp           p: { f c a m / f c a m / c b }
  abcflmo         b:3  m:3  p:3       fcabm           m: { f c a / f c a / f c a b }
  bfhjo           (infrequent:        fb              b: { f c a / f c }
  bcksp            o:2 d:1 e:1 g:1    cbp             a: { f c / f c / f c }
  afcelpmn         h:1 i:1 k:1        fcamp           c: { f / f / f }
                   l:1 n:1)                           f: {}
Frequent Itemset Mining in
             MapReduce
• Identifying frequent items = 1 MapReduce job
   – Find the set of items and the associated frequency
• Prune this frequent items list keeping only items more
  frequent than minimum support
• Mine subsequent projected CDBs in MapReduce
  iterations (Li et al. 2008)
   – Project transactions in CDB by least frequent item in the mapper
   – Breadth first search of the FP Tree using a MapReduce iteration
   – Once projected CDB fits in memory of reducer
       • Run FPGrowth algorithm in reducer
       • No more growth of the sub-tree
Frequent Itemset Mining in
                MapReduce

  [Example: starting from D, MR iteration 1 builds the projected
   conditional databases D|p, D|m, D|b, D|a, D|c; iteration 2 projects
   further (e.g., D|am, D|cm, D|ca); iteration 3 continues (e.g., D|cam)
   until each CDB fits in a reducer's memory, where FPGrowth emits the
   frequent patterns, e.g.,
      p: { f c a m / f c a m / c b }   ->  pc:3
      m: { f c a / f c a / f c a b }   ->  mf:3, mc:3, ma:3, mfc:3, mfa:3, mca:3, mfca:3
      b: { f c a / f c }
      a: { f c / f c / f c }           ->  af:3, ac:3, afc:3
      c: { f / f / f }                 ->  cf:3 ]
Graph Algorithms
• Ubiquitous in web applications
  – Web-graph, Social network graph, User-item
    graph
• Typical problems
  – Popularity (e.g. PageRank)
  – Shortest paths
  – Clustering, semi-clustering etc.
Graph algorithms in MapReduce
•    Vertex centric approach
    –   Work with the adjacency list of each vertex
    –   Especially useful for sparse adjacency matrices
•    Breadth first search
    –   Each MR iteration advances the horizon by one
        level
•    In each iteration
    1. Compute on each vertex
    2. Pass values to connected vertices for aggregation
       in the reducer
    3. Pass the adjacency list of each node to the reducer
Breadth first search on Graphs in
           MapReduce

  [Diagram: starting from a source node at level 1, each MR iteration
   advances the frontier by one level - nodes at level 2 after
   MR iteration 1, nodes at level 3 after MR iteration 2.]

• Easy (iterative) implementations exist for
  some common algorithms
   – Single source shortest path
   – PageRank
Single source shortest path in
              MapReduce
• Find the shortest path from a given node to any reachable node
• Given a start node:
   – Distance to adjacent nodes = 1
   – Distance to any other node reachable from a set of nodes S
       DistanceTo(n) = 1 + min( DistanceTo(m), m in S )

• Map
    – Input:
        • Node "n"
        • D, adjacency list of "n"
    – Output:
       • For each node "p" in the adjacency list: <p, (D+1)>
       • <n, adjacency list of "n">
         (passes the graph from one iteration to the next)
• Reduce
    – Input:
        • "p", "D+1" from all nodes pointing to "p"
        • "n", adjacency list of "n"
    – Output:
       • <p, min("D+1" from all nodes pointing to "p")>
       • <n, adjacency list of "n">
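A minimal pure-Python sketch of this iteration on a toy unweighted graph; tagged values stand in for the two output types (graph structure vs. candidate distances), and the fixed number of iterations is illustrative:

    from collections import defaultdict

    INF = float("inf")

    def mapper(node, state):
        dist, adj = state
        yield node, ("graph", adj)           # pass the graph structure to the next iteration
        yield node, ("dist", dist)           # keep the node's current distance
        if dist < INF:
            for p in adj:
                yield p, ("dist", dist + 1)  # candidate distance for each neighbor

    def reducer(node, values):
        # Keep the adjacency list and the minimum candidate distance.
        adj = next((v for tag, v in values if tag == "graph"), [])
        dists = [v for tag, v in values if tag == "dist"]
        return node, (min(dists), adj)

    graph = {"A": (0, ["B", "C"]), "B": (INF, ["D"]), "C": (INF, ["D"]), "D": (INF, [])}
    for _ in range(3):   # one MapReduce iteration per BFS level
        shuffled = defaultdict(list)
        for n, s in graph.items():
            for key, value in mapper(n, s):
                shuffled[key].append(value)
        graph = dict(reducer(k, v) for k, v in shuffled.items())
    print({n: d for n, (d, _) in graph.items()})   # A: 0, B: 1, C: 1, D: 2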
PageRank
• Given a node A:

           PR(A) = d + (1 - d) * sum_{T_i : T_i -> A} PR(T_i) / C(T_i)

           d      = random jump probability
           T_i    = a node pointing to A
           C(T_i) = out-degree of T_i

• Iterate this equation till convergence
• Driver program checks whether the PageRank of each
  node has converged
PageRank in MapReduce
• In each iteration (i)
  • Map
     – Input:
        • Node "n", PR_{i-1}(n)
        • Adjacency list of "n"
     – Compute
        • V = PR_{i-1}(n) / |adjacency list of "n"|
     – Output:
        • For each node "p" in the adjacency list: <p, V>
        • <n, adjacency list of "n">
  • Reduce
     – Input:
        • <"p", V from all nodes "n" pointing to "p">
        • Adjacency list of "n"
     – Compute
        • PR_i(p) = Sum(V)
     – Output:
        • <p, PR_i(p)>
        • <n, adjacency list of "n">
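A minimal pure-Python sketch of this iteration on a 3-node toy graph, folding in the random-jump term d from the PageRank formula on the previous slide (the reducer above shows only Sum(V)); dangling nodes are not handled:

    from collections import defaultdict

    D = 0.15   # random jump probability, as defined on the PageRank slide

    def mapper(node, state):
        pr, adj = state
        yield node, ("graph", adj)                  # pass the structure along
        for p in adj:
            yield p, ("pr", pr / len(adj))          # this node's contribution to p

    def reducer(node, values):
        adj = next((v for tag, v in values if tag == "graph"), [])
        contrib = sum(v for tag, v in values if tag == "pr")
        return node, (D + (1 - D) * contrib, adj)

    graph = {"A": (1.0, ["B", "C"]), "B": (1.0, ["C"]), "C": (1.0, ["A"])}
    for _ in range(20):   # a driver program would check convergence instead
        shuffled = defaultdict(list)
        for n, s in graph.items():
            for key, value in mapper(n, s):
                shuffled[key].append(value)
        graph = dict(reducer(k, v) for k, v in shuffled.items())
    print({n: round(pr, 3) for n, (pr, _) in graph.items()})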
Frameworks for graph algorithms
• MapReduce is not a good fit for graph algorithms
   – 1 iteration for each level of the graph has large overheads
• “Bulk synchronous processing model” for graph
  processing.
   – Components – for either compute or storage
   – Router – to deliver point to point messages
   – Synchronization at periodic intervals (called supersteps) that are
     atomic
• In each superstep, vertex can
   – Receive messages sent by other vertices in previous superstep
   – Compute using the data in that vertex and the received
     messages
   – Send messages to other vertices
Frameworks for graph algorithms
• Vertex can vote to go to halt state
• Computation stops when all vertices have voted to halt.
• Vertices can also mutate the graph
   – Add/remove edges and other vertices
   – Mutations implemented in next superstep
• Framework also supports aggregators
   – Can maintain global summaries over the graph
   – Values communicated to all vertices before the next
     superstep
• Large scale graph processing tools leveraging Grid
   – Pregel (in Google)
   – Open source implementation Giraph
     https://github.com/aching/Giraph
Outline
•   Why learn models in MapReduce framework?
•   Types of learning in MapReduce
•   Statistical Query Model (SQM)
•   SQM Algorithms in MapReduce
•   Sequential learning methods and MapReduce
•   Challenges and Enhancements
•   Apache Mahout
Sequential learning methods
• Some learning algorithms are inherently sequential in nature, e.g.,
   – Stochastic Gradient Descent (SGD) minimization
   – Conditional Maximum Entropy using SGD
   – Perceptron

• Difficult to distribute sequential algorithms over data partitions
   – Need frequent communication of intermediate parameter values

• Some sequential algorithms can be trained in a cluster environment.
   – Theoretical and empirical analysis show that parameters
     converge to the values from sequential training over all data
Sequential learning methods in
           MapReduce
• Types of sequential learning in MapReduce
  – Single M/R job:
     • Learn parameters on each data partition in mappers over
       multiple epochs
     • Average the model parameters from all mappers in a reducer
  – Multiple M/R jobs:
     • Learn parameters on each data partition in each mapper for
       1 epoch
     • Average the model parameters from all mappers in a
       reducer
     • Start the next iteration for next epoch in the mapper with the
       average parameter values from previous iteration
  – Communicate between nodes
     • Launch MPI on Hadoop cluster
Stochastic Gradient Descent (SGD)
             methods
• Many learning algorithms involve optimizing an objective
  function (maximizing log likelihood, minimizing root
  mean square error etc.) over the training data to
  determine the optimal parameters
                w* = argmin_w  sum_{i in training data} L(x_i, y_i, w)

                w <- w - eta * sum_{i in training data} grad_w L(x_i, y_i, w)

• Stochastic gradient techniques update the parameters
  one example at a time

                w <- w - eta * grad_w L(x_i, y_i, w)

• Parameter updates are inherently sequential
• Parameter updates are inherently sequential
Parallelized SGD
•    Partition the training data into multiple partitions, each
     with T examples chosen at random
•    Perform stochastic gradient updates on each data
     partition separately with a constant learning rate
•    Average the solutions from the different machines
•    For large scale data, Zinkevich et al. (2010) show that
    – Parameter values converge to the sequential estimates
    – For k partitions, averaging the parameters reduces the
      variance by O(k^{-1/2})
    – The bias in the parameter estimates decreases as well
Parallelized SGD in MapReduce

  Map: in each mapper i = 1 ... k   (machines)
     w_{i,0} = 0
     For t = 1 ... T   (data)
        w_{i,t} = w_{i,t-1} - eta * grad_w L(x_t, y_t, w_{i,t-1})
     end for

  Reduce: aggregate from all mappers
     v = (1/k) sum_{i=1}^{k} w_{i,T}     (average across all machines)
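A minimal NumPy sketch of this single-job scheme, assuming a squared-loss linear model as the example objective; the learning rate, epoch count, and toy data are illustrative:

    import numpy as np

    def sgd_mapper(partition, eta, epochs):
        # Run SGD on this partition alone, starting from w = 0.
        w = np.zeros(len(partition[0][0]))
        for _ in range(epochs):
            for x, y in partition:
                x = np.asarray(x, dtype=float)
                w = w - eta * 2 * ((w @ x) - y) * x   # squared-loss gradient step
        return w

    def average_reducer(weights):
        # v = (1/k) * sum_i w_i : average the k per-partition solutions.
        return sum(weights) / len(weights)

    data = [([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0), ([1.0, 4.0], 9.0)]
    partitions = [data[:2], data[2:]]
    w = average_reducer([sgd_mapper(p, eta=0.02, epochs=500) for p in partitions])
    print(w)   # close to [1., 2.]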
Parallelized SGD in MapReduce
     • Multi-pass parallel SGD (Weimer, Rao, Zinkevich 2010)
           – Divide the data randomly among all k machines
                 c_t^j = t-th example sent to the j-th machine
           – Initialize the weight vector w*
           – For i in {1 ... T} iterations do
                  • For each machine j in {1 ... k} do                 (machines)
                         w^j = w*                                      (initial value for the next iteration)
                         Shuffle the data uniformly at random:  p : {1 ... m} -> {1 ... m}
                         For each t in {1 ... m} do                    (data)
                                w^j = w^j - eta * grad c^j_{p(t)}(w^j)
                         end for
                     end for
                  • w* = (1/k) sum_{j=1}^{k} w^j                       (average across all
                                                                        machines in each iteration)
             end for
Conditional MaxEnt models
• Used in both binary and multi-class classification problems
• Commonly used in NLP and computer vision

     S = { (x_1, y_1), (x_2, y_2), ..., (x_m, y_m) }

     p_w(y | x) = (1 / Z(x)) exp( w . Phi(x, y) ),      Phi(x, y) = feature vector

     Z(x) = sum_{y in Y} exp( w . Phi(x, y) )

     w = argmin_w F_S(w) = argmin_w  lambda ||w||^2 - (1/m) sum_{i=1}^{m} log p_w(y_i | x_i)

     y_hat = argmax_y p_w(y | x)
Conditional MaxEnt in MapReduce
• Mixture weighting method (Mann et al. 2009)
   – Train a model in each of the M mappers using standard gradient descent on a
     subsample of the data:

                               k-th mapper:  w_k = 0
                                for t = 1 ... T do
                                   w_k = w_k - eta * grad_{w_k} F_S(w_k)
                                return w_k

    – Average the weights from all the mappers in 1 reducer:

                        w = sum_{k=1}^{M} mu_k w_k ,     mu_k >= 0,  sum_k mu_k = 1
                        (e.g., uniform mixing, mu_k = 1/M)

    – Mann et al. (2009) show that the mixture weighting estimate converges to the
      sequential estimate
Perceptron algorithm
• Online algorithm used in NLP for structure prediction e.g.,
    – Parsing, Named entity recognition, Machine translation etc.

     Perceptron( D = {x_i, y_i} )
       w^(0) = 0;  k = 0
       for n = 1 ... N                                      (N epochs)
         for t = 1 ... |D|                                  (data)

           y' = argmax_{y'} w^(k) . f(x_t, y')              (predict using the
                                                             current weights)
           if ( y' != y_t )
                                                            (add weight to the features of the
              w^(k+1) = w^(k) + f(x_t, y_t) - f(x_t, y')     correct output, remove weight from
              k = k + 1                                      the features of the incorrect output)

       return w^(k)
Perceptron in MapReduce
• Iterative parameter mixing
   – Train using data sub-group for 1 epoch in each mapper
   – Average the weights in reducer
   – Communicate back to mapper
   – Train the next epoch in the mappers (a sketch follows this slide)

     Driver:
       w = 0
       for n = 1...N
         w^(i,n) = OneEpochPerceptron(D_i, w)        (in mapper i)
         w = Σ_i μ_{i,n} w^(i,n)                      ← average across all machines
                                                        in each iteration
       return w

     OneEpochPerceptron(D, w):
       w^(0) = w; k = 0
       for t = 1...|D|
         y' = argmax_{y'} w^(k) · f(x_t, y')
         if (y' ≠ y_t)
           w^(k+1) = w^(k) + f(x_t, y_t) − f(x_t, y')
           k = k + 1
       return w^(k)
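A minimal single-process simulation of this loop (illustrative names; feature_fn and labels stand in for the task-specific feature map and output space):

    import numpy as np

    def one_epoch_perceptron(D, w, feature_fn, labels):
        """One perceptron epoch over a shard D = [(x, y), ...] starting from weights w."""
        w = w.copy()
        for x, y in D:
            # multiclass/structured prediction with the current weights
            y_pred = max(labels, key=lambda yy: w @ feature_fn(x, yy))
            if y_pred != y:
                w += feature_fn(x, y) - feature_fn(x, y_pred)
        return w

    def iterative_parameter_mixing(shards, dim, feature_fn, labels, n_epochs=10):
        """Simulates the MapReduce loop: one epoch per shard ("mapper"), then average ("reducer")."""
        w = np.zeros(dim)
        for _ in range(n_epochs):
            per_shard = [one_epoch_perceptron(D_i, w, feature_fn, labels) for D_i in shards]
            w = np.mean(per_shard, axis=0)   # uniform mixing weights mu_i = 1/len(shards)
        return w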
Perceptron in MapReduce
• McDonald et al. (2010) show that averaging
  parameters after each epoch:
   – Has performance as good as or better than sequential
     training on all the data
   – Trains better classifiers more quickly than sequential
     training on all the data
   – Performs better than averaging the parameters of models
     trained to convergence (multiple epochs) on their own
     partitions
Outline
•   Why learn models in MapReduce framework?
•   Types of learning in MapReduce
•   Statistical Query Model (SQM)
•   SQM Algorithms in MapReduce
•   Sequential learning methods and MapReduce
•   Challenges and Enhancements
•   Apache Mahout
Challenges for ML algorithms on
              Hadoop
• Hadoop is optimized for large batch data processing
   – Assumes data parallelism
   – Ideal for shared nothing computing
• Many learning algorithms are iterative
   – Incur significant overheads per iteration
• Multiple scans of the same data
   – Typically once per iteration → high I/O overhead reading data
     into mappers in every iteration
   – In some algorithms static data is read into mappers in each
     iteration
       • e.g. input data in k-means clustering.
• Need a separate controller (driver) outside the framework, sketched below, to:
   – coordinate the multiple MapReduce jobs, one or more per iteration
   – perform some computations between iterations and at the end
   – measure and implement the stopping criterion
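Such a driver is usually a small script outside Hadoop. The sketch below is one hypothetical shape of it; the streaming jar path, script names, and convergence test are placeholders, not a real pipeline:

    import subprocess

    def has_converged(prev_path, curr_path, tol):
        """Placeholder convergence test: a real driver would load both serialized models
        (from HDFS or local disk) and compare their parameters against tol."""
        return False

    def run_mapreduce_job(mapper, reducer, input_path, output_path):
        """Placeholder: submit one Hadoop streaming job and block until it finishes."""
        subprocess.check_call([
            "hadoop", "jar", "hadoop-streaming.jar",
            "-mapper", mapper, "-reducer", reducer,
            "-input", input_path, "-output", output_path,
        ])

    def iterate_until_converged(max_iters=20, tol=1e-3):
        state_path = "model/iter_0"
        for i in range(1, max_iters + 1):
            out_path = "model/iter_%d" % i
            # one full MapReduce job per iteration; the static input is re-read each time
            run_mapreduce_job("train_map.py", "train_reduce.py", "data/train", out_path)
            if has_converged(state_path, out_path, tol):   # compare successive models
                return out_path
            state_path = out_path
        return state_path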
Challenges for ML algorithms on
              Hadoop
• Incur multiple task initialization overheads
   – Setup and tear down mapper and reducer tasks per iteration
• Transfer/shuffle static data between mapper and reducer
  repeatedly
   – Intermediate data is transferred through index/data files on local
     disks of mappers and pulled by reducers
• Blocking architecture
   – Reducers cannot start till all map jobs complete
• Availability of nodes in a shared environment
   – Wait for mapper and reducer nodes to become available in each
     iteration in a shared computing cluster
Iterative algorithms in MapReduce
[Figure: each pass re-reads the data and writes out a pass result. Overhead per iteration: job setup, data loading, disk I/O.]
Enhancements to Hadoop
• Many proposals to overcome these challenges
• All try to retain the core strengths of data partitioning and
  fault tolerance of Hadoop to various degrees
• Proposed enhancements and alternatives to Hadoop
   –   Worker/Aggregator framework
   –   HaLoop
   –   MapReduce Online
   –   iMapReduce
   –   Spark
   –   Twister
   –   Hadoop ML
   –   …..
Worker/Aggregator framework
•   Worker
    –     Loads its data partition into memory
    –     In each pass:
           ›   iterates over the data using user-specified functions
           ›   communicates its state to the aggregator
           ›   waits for the input state of the next pass
•   Aggregator
    –     Receives state from the workers
    –     Aggregates state using user-specified functions
    –     Sends the aggregated state to all workers
•   Workers and aggregators communicate over TCP/IP
•   Leverages the fault tolerance and data locality of Hadoop




        M. Weimer, S. Rao, M. Zinkevich, 2010, NIPS 2010 Workshop on Learning
        on Cores, Clusters and Clouds
Parallelized SGD in Worker/Aggregator
[Figure: parallelized SGD in the Worker/Aggregator framework, from initial data to final result. Advantages: schedule once per job, data stays in memory, P2P communication. A minimal sketch of the pattern follows.]
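A minimal sketch of the pattern, with cluster nodes replaced by plain Python objects and SGD on a squared loss used only for illustration (these class names are not the framework's API):

    import numpy as np

    class Worker:
        """Keeps its data shard in memory across passes (no re-reading per iteration)."""
        def __init__(self, X, y, lr=0.01):
            self.X, self.y, self.lr = X, y, lr

        def one_pass(self, w):
            w = w.copy()
            for xi, yi in zip(self.X, self.y):
                w -= self.lr * (w @ xi - yi) * xi   # SGD step, squared loss for illustration
            return w

    class Aggregator:
        """Combines the workers' states and broadcasts the result back each pass."""
        def combine(self, states):
            return np.mean(states, axis=0)

    def train(workers, dim, n_passes=10):
        agg = Aggregator()
        w = np.zeros(dim)
        for _ in range(n_passes):
            states = [wk.one_pass(w) for wk in workers]   # workers run on cached, in-memory data
            w = agg.combine(states)                       # aggregator sends w back to all workers
        return w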
HaLoop
 • Programming model and architecture for iterations
       – New APIs to express iterations in the framework
 • Loop-aware task scheduling
       – Physically co-locate tasks that use the same data in different
         iterations
       – Remember association between data and node
       – Assign task to node that uses data cached in that node
 • Caching for loop invariant data:
       – Detect invariants in first iteration, cache on local disk to reduce
         I/O and shuffling cost in subsequent iterations
       – Cache for Mapper inputs, Reducer Inputs, Reducer outputs
 • Caching to support fixpoint evaluation:
       – Avoids the need for a dedicated MR step on each iteration


HaLoop: Efficient Iterative Data Processing on Large Clusters by Yingyi Bu, Bill Howe, Magdalena Balazinska,
Michael D. Ernst. In VLDB'10
HaLoop vs. MapReduce
[Figure: application vs. framework layering in MapReduce and in HaLoop.]
    • The HaLoop framework controls the loop
        • The first iteration is similar to that on Hadoop
        • The framework identifies data → node mappings, caches and indexes data for
          fast access, and controls looping
    • Subsequent iterations leverage the above optimizations

 HaLoop: Efficient Iterative Data Processing on Large Clusters by Yingyi Bu, Bill Howe, Magdalena Balazinska,
 Michael D. Ernst. In VLDB'10
HaLoop Design
[Figure: HaLoop architecture, with diagram annotations: "new, additional API", "leverage data locality", "caching for fast access", "starts new MR jobs repeatedly".]

HaLoop: Efficient Iterative Data Processing on Large Clusters by Yingyi Bu, Bill Howe, Magdalena Balazinska,
Michael D. Ernst. In VLDB'10
HaLoop Programming API
Name                          Functionality                              Group
Map() & Reduce()              Specify a map & reduce function
AddMap() & AddReduce()        Specify a step in the loop                 Iteration inputs
SetDistanceMeasure()          Specify a distance for results             Iteration inputs
SetInput()                    Specify inputs to iterations               Iteration inputs
AddInvariantTable()           Specify loop-invariant data                Iteration inputs
SetFixedPointThreshold()      A loop termination condition               Loop control
SetMaxNumberOfIterations()    Specify the max number of iterations       Loop control
SetReducerInputCache()        Enable/disable the reducer input cache     Cache control
SetReducerOutputCache()       Enable/disable the reducer output cache    Cache control
SetMapperInputCache()         Enable/disable the mapper input cache      Cache control

HaLoop: Efficient Iterative Data Processing on Large Clusters by Yingyi Bu, Bill Howe, Magdalena Balazinska,
Michael D. Ernst. In VLDB'10
k-means clustering in HaLoop
•   k-means in HaLoop (a conceptual sketch of Map_Kmeans/Reduce_Kmeans follows this slide)
    1. Job job = new Job();
    2. job.AddMap(Map_Kmeans, 1);   → assign each data point to the closest cluster
    3. job.AddReduce(Reduce_Kmeans, 1);   → re-compute the centroids
    4. job.SetDistanceMeasure(ResultDistance);
        –   # of changes in cluster membership
    5. job.SetFixedPointThreshold(0.01);
    6. job.SetMaxNumOfIterations(12);   → stopping criteria
    7. job.SetInput(IterationInput);   → same input data to each iteration
    8. job.SetMapperInputCache(true);
        –   Enable mapper input caching so mappers read data from the local disk of their node
    9. job.Submit();
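The slide only names Map_Kmeans and Reduce_Kmeans; the following Python sketch shows what those two functions conceptually compute (this is not HaLoop code, and the signatures are illustrative):

    import numpy as np

    def map_kmeans(points, centroids):
        """Map: assign each point to its closest centroid and emit (cluster_id, (point, 1))."""
        for p in points:
            cid = int(np.argmin([np.linalg.norm(p - c) for c in centroids]))
            yield cid, (p, 1)

    def reduce_kmeans(cluster_id, values):
        """Reduce: re-compute one centroid as the mean of the points assigned to it."""
        total = np.zeros_like(values[0][0], dtype=float)
        count = 0
        for p, n in values:
            total += p
            count += n
        return cluster_id, total / count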
MapReduce Online
   • Pipeline data between operators as it is produced
         – Decouple computation and data transfer schedules
         – Intra-job:
               • between mapper and reducer
         – Inter-job:
               • schedule multiple dependent jobs simultaneously
               • between reducer of one job and mapper of next job
   • “Push” data from producers instead of a “pull” by consumers
   • Intermediate data is considered tentative till map job completes
         – Also stored on disk for fault tolerance/recovery
   • Reducer starts as soon as some data is available from mappers
         – Can compute approximate answers from partial data
   • Mappers and Reducers can also run continuously
         – Enables stream processing



Mapreduce online, T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, R. Sears, 2010, NSDI'10,
Proceedings of the 7th USENIX conference on Networked systems design and implementation
iMapReduce
• Iterative processing
      – Persistent map/reduce tasks
      – Each reduce task has a locally connected
        corresponding map task
• Maintain static data locally
      – On local disk of mapper
• Asynchronous map execution
      – Persistent socket between reduce → map
      – Completion of reduce triggers map
      – Mappers do not need to wait

iMapReduce: A Distributed Computing Framework for Iterative Computation, Y. Zhang, Q. Gao, L. Gao,
C. Wang, DataCloud 2011
iMapReduce – Iterative Processing
[Figure: iMapReduce iterative processing with persistent, locally connected map/reduce task pairs.]
iMapReduce – Asynchronous map execution
[Figure: task timelines (time on the vertical axis) comparing MapReduce and iMapReduce; in iMapReduce a map task starts as soon as its corresponding reduce completes, rather than waiting for a global barrier.]
Spark
• Open source cluster computing model:
    – Different from MapReduce, but retains some basic character
• Optimized for:
    – iterative computations
          • Applies to many learning algorithms
    – interactive data mining
          • Load data once into multiple mappers and run multiple queries
• Programming model using working sets
    – applications reuse intermediate results in multiple parallel operations
    – preserves the fault tolerance of MapReduce
• Supports
    – Parallel loops over distributed datasets
          • Loads data into memory for (re)use in multiple iterations
    – Access to shared variables accessible from multiple machines
• Implemented in Scala
• www.spark-project.org

Spark: Cluster Computing with Working Sets. M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker,
I. Stoica. 2010, USENIX HotCloud 2010.
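The working-set idea is easiest to see in code. Below is a hedged sketch in PySpark syntax, used here only for consistency with the other Python sketches (Spark's native API described above is Scala); the input path and record layout are assumptions:

    from pyspark import SparkContext
    import numpy as np

    sc = SparkContext(appName="IterativeSketch")

    # Parse the input once, cache it in memory, and reuse it in every iteration.
    data = (sc.textFile("hdfs:///data/train.txt")      # hypothetical path; last column is the target
              .map(lambda line: np.array([float(v) for v in line.split()]))
              .cache())

    n = data.count()
    dim = len(data.first()) - 1
    w = np.zeros(dim)
    for _ in range(10):
        # One distributed pass per iteration; the cached dataset is not re-read from disk.
        grad = data.map(lambda r: (w @ r[:-1] - r[-1]) * r[:-1]) \
                   .reduce(lambda a, b: a + b)
        w -= (0.1 / n) * grad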
Outline
•   Why learn models in MapReduce framework?
•   Types of learning in MapReduce
•   Statistical Query Model (SQM)
•   SQM Algorithms in MapReduce
•   Sequential learning methods and MapReduce
•   Challenges and Enhancements
•   Apache Mahout
Mahout
• Goal
   – Create scalable, machine learning algorithms under the Apache license.
• Scalable:
   – to large datasets
   – business use cases
   – community
• Contains both:
   – Hadoop implementations of algorithms that scale linearly with data.
   – Fast sequential (non MapReduce) algorithms
• Latest release (as of Aug 4, 2011): Mahout 0.5, released on 27 May 2011
• Wiki:
   – https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+Wiki
• Mailing lists
   – User, Developer, Commit notification lists
   – https://cwiki.apache.org/confluence/display/MAHOUT/Mailing+Lists
Algorithms in Mahout
• Classification:
    – Logistic Regression
    – Naïve Bayes, Complementary Naïve Bayes
    – Random Forests
• Clustering
    –   K-means, Fuzzy k-means
    –   Canopy
    –   Mean-shift clustering
    –   Dirichlet Process clustering
    –   Latent Dirichlet allocation
    –   Spectral clustering
• Parallel FP growth
• Item based recommendations
• Stochastic Gradient Descent (sequential)
Acknowledgment
    Numerous wonderful colleagues!

Questions?
Model Training Exercise
Exercise problem
• Problem:
   – Predict the age of abalone as a function of physical attributes
   – Useful for ecological and commercial fishing purposes
• Dataset:
   – Dataset from the Marine Resources Division at the Department of
     Primary Industry and Fisheries, Tasmania
   – Attributes:
       • Gender, Length, Diameter, Height, 4 different weights – 8 attributes
   – Target:
       • Number of Rings in shell
       • Age (in years) = 1.5 + number of rings in shell
   – At: http://www.stat.duke.edu/data-sets/rlw/abalone.dat
• Learn a linear relation between the age and the physical
  attributes
Exercise dataset
• Original data sample size = 4177
• Generate a larger dataset by replicating each record (a sketch of the replication step follows this slide)
   – Add Gaussian noise to each feature, with variance equal to that feature's sample variance
   – Do not add noise to Gender and # of rings
   – Replicate by factors of:
       • 10x, 1kx, 8kx, 16kx, 32kx
       • Datasets of about 40k, 4MM, 32MM, 64MM and 128MM records
• For all attributes, compared to the original dataset, the
  larger datasets have:
   – same mean
   – higher sample variance
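A hedged sketch of that replication step (pandas-based; column names and the exact noise convention are assumptions inferred from the bullets above):

    import numpy as np
    import pandas as pd

    def replicate_with_noise(df, factor, noisy_cols, rng=None):
        """Replicate each record `factor` times, adding Gaussian noise with the
        per-column sample variance to the noisy columns only (Gender and Rings untouched)."""
        rng = rng or np.random.default_rng(0)
        reps = pd.concat([df] * factor, ignore_index=True)
        for col in noisy_cols:
            std = df[col].std()                 # sample standard deviation of the original column
            reps[col] += rng.normal(0.0, std, size=len(reps))
        return reps

    # Example: 10x replication of the abalone attributes (column names assumed):
    # abalone = pd.read_csv("abalone.dat", ...)
    # big = replicate_with_noise(abalone, 10,
    #                            noisy_cols=["Length", "Diameter", "Height",
    #                                        "Whole", "Shucked", "Viscera", "Shell"])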
Exercise: Model training
• Train a linear regression model

     Rings ≈ Σ_{i=0..8} w_i x_i,   with x_0 = 1

     w* = A⁻¹ b,   where A = Σ_j x_j x_jᵀ  and  b = Σ_j x_j y_j   (sums over training records j)

• Split the training data into 10 parts
• Mapper:
    – Compute the matrix A and vector b on each partition
• Reducer:
    – Aggregate the values of A and b from all mappers
    – Compute the weights w* = A⁻¹ b   (a sketch follows this slide)
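A hedged Python sketch of that mapper/reducer pair, in Hadoop-streaming style (the text I/O format and field layout are assumptions):

    import sys
    import numpy as np

    def mapper(lines, dim=9):
        """Accumulate A = sum(x x^T) and b = sum(x y) over this partition, then emit them once."""
        A = np.zeros((dim, dim))
        b = np.zeros(dim)
        for line in lines:
            vals = [float(v) for v in line.split()]
            x = np.array([1.0] + vals[:-1])        # x_0 = 1 plus the 8 attributes
            y = vals[-1]                           # number of rings
            A += np.outer(x, x)
            b += x * y
        print("stats\t" + ",".join(map(str, np.concatenate([A.ravel(), b]))))

    def reducer(lines, dim=9):
        """Sum the partial A, b from all mappers and solve for w* = A^{-1} b."""
        A = np.zeros((dim, dim))
        b = np.zeros(dim)
        for line in lines:
            vec = np.array([float(v) for v in line.split("\t", 1)[1].split(",")])
            A += vec[:dim * dim].reshape(dim, dim)
            b += vec[dim * dim:]
        w = np.linalg.solve(A, b)                  # numerically preferable to forming A^{-1}
        print(",".join("%.4f" % v for v in w))

    # mapper(sys.stdin) or reducer(sys.stdin) would be selected by the streaming job's
    # -mapper / -reducer scripts.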
Exercise: Model Results
• For replication factor of 10x
  –   w[Sex] = 0.747
  –   w[Length] = 1.894
  –   w[Diameter] = 2.844
  –   w[Height] = 7.213
  –   w[Whole] = 0.311
  –   w[Shucked] = -0.558
  –   w[Viscera] = 0.840
  –   w[Shell] = 3.288
  –   w[1] = 5.046
Training Times: Sequential vs Hadoop
[Chart: training time in seconds (y-axis, 0 to ~9000) versus data size in MM records (x-axis, 0 to 140) for the Hadoop and Sequential implementations.]
References
1.   M. Kearns. Efficient noise-tolerant learning from
     statistical queries. Journal of the ACM, Vol. 45, No. 6,
     November 1998, pp. 983–1006.
2.   C. Chu, S.K.Kim, Y. Lin, Y. Yu, G. Bradski, A.Y. Ng, K.
     Olukotun, Map-Reduce for Machine Learning on
     Multicore. In Proceedings of NIPS 2006, pp. 281-288.
3.   W. Zhao, H. Ma, Q. He. Parallel K-Means Clustering
     Based on MapReduce. CloudCom '09 Proceedings of
     the 1st International Conference on Cloud Computing
     2009, pp. 674-679
4.   R. Ho. http://horicky.blogspot.com/2011/04/k-means-
     clustering-in-map-reduce.html
References
5.   Cluster Computing and MapReduce, Lecture 4.
     http://www.youtube.com/watch?v=1ZDybXl212Q
6.   A. McCallum, K. Nigam, L. Ungar. Efficient Clustering
     of High Dimensional Data Sets with Application to
     Reference Matching, Proceedings of the sixth ACM
     SIGKDD international conference on Knowledge
     discovery and data mining, 2000, pp.169-178
7.   C. Elkan, 2011.
     http://cseweb.ucsd.edu/~elkan/250B/logreg.pdf
8.   B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo,
     PLANET: Massively Parallel Learning of Tree
     Ensembles with MapReduce, 2009, Proceedings of
     The Vldb Endowment - PVLDB, vol. 2, no. 2, pp. 1426-
     1437.
References
9.  J.S. Herbach, 2009.
    http://fora.tv/2009/08/12/Josh_Herbach_PLANET_MapReduce_and_Tree_Learning#fullprogram
10. R. Yan, J. Tesic, and J. R. Smith. Model-shared
    subspace boosting for multi-label classification, 2007, In
    Proceedings of the 13th ACM SIGKDD Intl. Conf. on
    Knowledge discovery and data mining, pp 834-843.
11. R. Yan, M. Fleury, M. Merler, A. Natsev, J.R. Smith,
    2009, Proceedings of the First ACM workshop on Large-
    scale multimedia retrieval and mining, pp 35-42
12. J.D. Basilico, M.A. Munson, T.G. Kolda, K.R. Dixon,
    W.P. Kegelmeyer, COMET: A Recipe for Learning and
    Using Large Ensembles on Massive Data, 2011,
    http://arxiv.org/PS_cache/arxiv/pdf/1103/1103.2068v1.pdf
References
13. S. Papadimitriou, J. Sun, DisCo: Distributed Co-
    clustering with Map-Reduce, 2008,ICDM '08. Eighth
    IEEE International Conference on Data Mining, pp
    512-521
14. M.A. Zinkevich, M. Weimer, A. Smola, A., L. Li,
    Parallelized Stochastic Gradient Descent, 2010, NIPS.
15. T. Elsayed, J. Lin, and D. Oard. Pairwise document
    similarity in large collections with MapReduce, 2008, In
    ACL, Companion Volume, pp 265-268, 2008
16. J. Lin, Brute Force and Indexed Approaches to
    Pairwise Document Similarity Comparisons with
    MapReduce., Proceedings of the 32nd Annual
    International ACM SIGIR Conference on Research and
    Development in Information Retrieval (SIGIR) 2009.
References
17. M. Weimer, S. Rao, M. Zinkevich, 2010, NIPS 2010
    Workshop on Learning on Cores, Clusters and Clouds
18. HaLoop: Efficient Iterative Data Processing on Large
    Clusters by Yingyi Bu, Bill Howe, Magdalena
    Balazinska, Michael D. Ernst. In VLDB'10: The 36th
    International Conference on Very Large Data Bases,
    Singapore, 24-30 September, 2010.
19. G. Mann, R. McDonald, M. Mohri, N. Silberman, D. D.
    Walker, 2009, in Advances in Neural Information
    Processing Systems 22 (2009), edited by: Y. Bengio, D.
    Schuurmans, J. Lafferty, C. K. I. Williams, A. Culotta
    pp. 1231-1239.
20. R. McDonald, K. Hall, G. Mann, Distributed training
    strategies for the structured perceptron , 2010, In
    Human Language Technologies: The 2010 Annual
    Conference of the North American Chapter of the
    Association for Computational Linguistics (2010), pp.
    456-464.
References
21. H. Li, Y. Wang, D. Zhang, M. Zhang, E.Y. Chang,
    2008, In Proceedings of the 2008 ACM conference on
    Recommender systems (2008), pp. 107-114.
22. R. Gemulla, P.J. Haas, E. Nijkamp, Y. Sismanis, IBM
    Tech Report, 2011,
    http://www.almaden.ibm.com/cs/people/peterh/dsgdTechRep.pdf
23. Pregel: a system for large-scale graph processing, G.
    Malewicz, M. H. Austern, A. J.C Bik, J. C. Dehnert, A.H
    Horn, N. Leiser, G. Czajkowski, 2010, SIGMOD
    '10 Proceedings of the 2010 international conference
    on Management of data
References
24. Mapreduce online, T. Condie, N. Conway, P. Alvaro, J.
    M. Hellerstein, K. Elmeleegy, R. Sears, 2010, NSDI'10,
    Proceedings of the 7th USENIX conference on
    Networked systems design and implementation
25. iMapReduce: A Distributed Computing Framework for
    Iterative Computation, Y. Zhang, Q. Gao, L. Gao, C.
    Wang, 2011, DataCloud 2011
26. Spark: Cluster Computing with Working Sets. M.
    Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker, I.
    Stoica. 2010, USENIX HotCloud 2010.
27. Mining frequent patterns without candidate generation,
    J. Han, J. Pei,Y. Yin. 2000, In SIGMOD, 2000.
Backup
Decision Trees
•   Features:  x = (x_1, x_2, ..., x_n)
•   Targets:   y ∈ [0,1]  or  y ∈ R
•   Data:      D = {(x, y)}, m records
•   Construct Tree
    – Each node splits the data by feature value
    – Start from root
        • Select best feature, value to split the node
            – Based on reduction in data impurity between the child and
              parent nodes
    – Select the next child node
    – Repeat the process till some stopping criterion
        • Pure node, or data is below some threshold etc.
Decision Trees

[Figure: the greedy tree-growing procedure — finding the best split at each node is the expensive step for large datasets.]




B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo, PLANET: Massively Parallel
Learning of Tree Ensembles with MapReduce, 2009, Proceedings of The Vldb
Endowment - PVLDB, vol. 2, no. 2, pp. 1426-1437
PLANET for Decision Trees
•   Parallel Learner for Assembling Numerous Ensemble
    Trees (PLANET- Panda et al. 2009)
    – Main idea is to use MapReduce to determine the best feature
      value splits for nodes from large datasets
•   Each intermediate node has a sub-set of all data falling
    into it
•   If this sub-set is small enough to fit in memory,
    – Grow remaining sub-tree in memory
•   Else,
    – Launch a MapReduce job to find candidate feature value splits
    – Select the best feature split from among the candidates
PLANET for Decision Trees
• 5 main components
  1. Controller
     • Monitors and controls the growth of the tree
  2. Initialization Task (MapReduce)
     • Identifies all feature values to be considered for splits
  3. FindBestSplit Task (MapReduce)
     • Finds the best split when there is too much data to fit in memory
  4. InMemoryGrow Task (MapReduce)
     • Grows an entire sub-tree once the data fits in memory
  5. Model File
     • A file describing the state of the model
PLANET for Decision Trees
• Controller
   – Determines the state of the tree and grows it
       • Decides if nodes are pure or have small data to become leaves
        • Data fits in memory → launch a MapReduce job to grow the entire sub-tree in memory
        • Data does not fit in memory → launch a MapReduce job to find candidate best splits
       • Collect results from MR jobs and choose the best split for a node
       • Update the Model File
   – Periodically checkpoints the system

• Model File
   – Contains the state of the tree constructed so far
   – Used by the controller to check which nodes to split or grow next
PLANET for Decision Trees
• Maintain 2 queues
   – MapReduceQueue (MRQ)
       • Contains nodes for which data is too large to fit in memory
   – InMemoryQueue (InMemQ)
       • Contains nodes for which data fits in memory

• Initialization Task (MapReduce)
   – Identifies candidate attribute values for node splits
   – Continuous attributes
       • Compute an approximate equi-depth histogram
       • Boundary points of histogram used for potential splits
   – Categorical attributes
       • Identify attribute's domain
       • Sort values by average values of Y and use this for ordering
   – Generate a file with list of attributes to be used by other tasks
PLANET for Decision Trees
• 2 main MapReduce jobs
  – MR_ExpandNodes
    • Process nodes from the MRQ to find best split
    • Output for each node:
        – Candidate split positions for node along with
            » Quality of split (using summary statistics)
            » Predictions in left and right branches
            » Size of data going into left and right branches
  – MR_InMemory
    • Process nodes from the InMemQ.
    • For a given set of nodes N, complete tree induction at nodes
      in N using the InMemoryGrow algorithm.
PLANET for Decision Trees
• Map function in MR_ExpandNodes
   – Load the current model file M and set of nodes N
   – For each record
       • Determine if record is relevant to any of the nodes in N
       • Add record to the summary statistics (SS) for node
       • For each feature-value in record
            – Add record to the summary statistics for node for split points “s” less than the
              value in record “v”
    – Output
        • key = (n ∈ N, x = ordered feature, s);     value = T_{n, x<s}          ← SS of candidate splits
        • key = (n ∈ N, x = categorical feature);    value = (v, T_{n, x=v})        (the key acts as the split ID)
        • key = (n ∈ N);                             value = SS                  ← SS of the parent node

        T_{n, x<s} = SS = ( Σ_subgroup y,  Σ_subgroup y²,  Σ_subgroup 1 )         ← SS needed for variance impurity
PLANET for Decision Trees
• Reduce function in MR_ExpandNodes
  – For each node
      • Aggregate the summary statistics for that node
  – For each split (which is node specific)
      • Aggregate the summary statistics for that Split ID from all map
        outputs of summary statistics
      • Compute impurity of data going into left and right branches
      • Total impurity = Impurity in left branch + Impurity in right branch
      • If Total impurity < Best split impurity so far
          – Best split = Current split
   – Output the best split found (scoring a split from the summary statistics is sketched below)
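To make the use of the summary statistics concrete, here is a sketch of how a reducer can score one candidate split from the (Σy, Σy², Σ1) triples; the function names are illustrative, not PLANET's code:

    def variance_impurity(sum_y, sum_y2, n):
        """n * Var(y) computed from the summary statistics (sum_y, sum_y2, n)."""
        if n == 0:
            return 0.0
        return sum_y2 - (sum_y ** 2) / n

    def split_impurity(parent_ss, left_ss):
        """Total impurity of a candidate split; right-branch SS = parent SS - left SS."""
        p_y, p_y2, p_n = parent_ss
        l_y, l_y2, l_n = left_ss
        right_ss = (p_y - l_y, p_y2 - l_y2, p_n - l_n)
        return variance_impurity(*left_ss) + variance_impurity(*right_ss)

    # The reducer keeps the split with the smallest total impurity, e.g.:
    # best = min(candidate_splits, key=lambda s: split_impurity(parent_ss, s.left_ss))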
PLANET for Decision Trees
• InMemoryGrow
  – Task to grow the entire subtree once the data for it fits
    in memory
  – Similar to parallel training
  – Map
     • Load the current model file
     • For each record identify the node that needs to be grown,
     • Output <Node_id, Record>
  – Reduce
     • Initialize the feature value file from Initialization task
     • For each <Node_id, List<Record>> run the basic tree
       growing algorithm on the records
     • Output the best split for each node in the subtree

Modeling with Hadoop kdd2011

  • 1. Session 2: Modeling with Hadoop Algorithms in MapReduce Vijay K Narayanan Principal Scientist, Yahoo! Labs, Yahoo!
  • 2. Outline • Why learn models in MapReduce framework? • Types of learning in MapReduce • Statistical Query Model (SQM) • SQM Algorithms in MapReduce • Sequential learning methods and MapReduce • Challenges and Enhancements • Apache Mahout
  • 3. Why learn models in MapReduce? • High data throughput – Stream about 100 Tb per hour using 500 mappers • Framework provides fault tolerance – Monitors mappers and reducers and re-starts tasks on other machines should one of the machines fail • Excels in counting patterns over data records • Built on relatively cheap, commodity hardware – No special purpose computing hardware • Large volumes of data are being increasingly stored on Grid clusters running MapReduce – Especially in the internet domain
  • 4. Why learn models in MapReduce? • Learning can become limited by computation time and not data volume – With large enough data and number of machines – Reduces the need to down-sample data – More accurate parameter estimates compared to learning on a single machine for the same amount of time
  • 5. Learning models in MapReduce • A primer for learning models in MapReduce (MR) – Illustrate techniques for distributing the learning algorithm in a MapReduce framework – Focus on the mapper and reducer computations • Data parallel algorithms are most appropriate for MapReduce implementations • Not necessarily the most optimal implementation for a specific algorithm – Other specialized non-MapReduce implementations exist for some algorithms, which may be better • MR may not be the appropriate framework for exact solutions of non data parallel/sequential algorithms – Approximate solutions using MapReduce may be good enough
  • 6. Outline • Why learn models in MapReduce framework? • Types of learning in MapReduce • Statistical Query Model (SQM) • SQM Algorithms in MapReduce • Sequential learning methods and MapReduce • Challenges and Enhancements • Apache Mahout
  • 7. Types of learning in MapReduce • Three common types of learning models using MapReduce framework 1. Parallel training of multiple models Use the Grid as a – Train either in mappers or reducers large cluster of independent 2. Ensemble training methods machines – Train multiple models and combine them (with fault tolerance) 3. Distributed learning algorithms – Learn using both mappers and reducers
  • 8. Parallel training of multiple models • Train multiple models simultaneously using a learning algorithm that can be learnt in memory • Useful when individual models are trained using a subset, filtered or modification of raw data • Can train 1000’s of models simultaneously • Train 1 model in each reducer – Map: • Input: All data • Filters subset of data relevant for each model training • Output: <model_index, subset of data for training this model> – Reduce • Train model on data corresponding to that model_index
  • 9. Parallel training of multiple models • Train 1 model in each reducer • (Diagram: each mapper emits <"model_1", data for model 1>, <"model_2", data for model 2>, ... for its data subgroup; the reducer that receives "model_1" trains Model_1, the reducer that receives "model_2" trains Model_2, and so on — a minimal sketch follows below)
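A minimal, framework-agnostic Python sketch of the map/reduce logic on slides 8-9, simulating the shuffle with in-memory grouping. The helper names relevant_models and train_model, and the toy records, are illustrative assumptions rather than anything from the deck.

    from collections import defaultdict

    def mapper(records, relevant_models):
        # Emit <model_index, record> for every model whose training set this record belongs to
        for record in records:
            for model_index in relevant_models(record):   # hypothetical filter per slide 8
                yield model_index, record

    def reducer(keyed_records, train_model):
        # Group records by model_index (the shuffle) and train one model per key
        groups = defaultdict(list)
        for model_index, record in keyed_records:
            groups[model_index].append(record)
        return {m: train_model(data) for m, data in groups.items()}

    # Toy usage: two "models", each trained only on the records routed to it
    records = [{"label": 0, "x": 1.0}, {"label": 1, "x": 2.0}, {"label": 1, "x": 3.0}]
    models = reducer(mapper(records, lambda r: [r["label"]]),
                     lambda data: ("mean_x", sum(r["x"] for r in data) / len(data)))
    print(models)   # {0: ('mean_x', 1.0), 1: ('mean_x', 2.5)}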
  • 10. Parallel training of multiple models • Train 1 model in each mapper • All data is sent to each mapper (as a cache archive) • Mapper partition file determines the training configuration and labeling strategy {x, (c_i, c_j ... c_k)}, with c_i ∈ {c_1, c_2 ... c_M} – e.g., training one-vs-rest models in multi-class classification – Can train 1000s of classes in parallel • (Diagram: Map_1 trains Model(c_1), Map_2 trains Model(c_2), ... Map_M trains Model(c_M) from the same training data)
  • 11. Ensemble methods • Train 1 base model in each mapper on a data partition • Combine the base models using ensemble methods (primarily, bagging) in the reducer • Strictly, bagging requires the data to be sampled with replacement – However, if the data set is very large, sampling without replacement may be ok • Base models are typically decision trees, SVMs etc.
  • 12. Ensemble Methods: Random Subspace Bagging (RSBag) • Assume that the training data is partitioned randomly into blocks – Class distributions are roughly the same across all blocks • Algorithm (Yan et al. 2007) – Learn 1 base model h_c(x) per data sub-group, with labels y_c ∈ {−1, +1} – Optionally, use a random subset of features to train each model – Combine the multiple base models into a composite classifier as the final output: F_c^i(x) = F_c^{i−1}(x) + h_c^i(x)
  • 13. RSBag in MapReduce • (Diagram: the data and features are split across mappers Map_1 ... Map_4; each mapper trains one base model h_c^b(x) on its block, and the reducer combines the base models into the final classifier)
  • 14. RSBag in MapReduce • Provides coarse level parallelism at the level of base models – Base models can be decision trees, SVMs etc. • Speed-up with SVM base models ≈ 1 / (N · r_d² · r_f), where r_d, r_f are the data and feature sampling ratios (e.g., N = 5, r_d = 0.2, r_f = 0.5 → speedup ≈ 10) • Can achieve similar performance as a single classifier, with a theoretical guarantee, in less learning time – Upper bound on generalization error: E*(F_c) ≤ ρ (1 − s_c²) / s_c², where ρ = E_{θ,θ'}[correlation between classifiers h(x, θ) and h(x, θ')] and s_c = 2 E_{x,y_c} P(h(x, θ) = y_c) − 1 is the strength of the classifier
  • 15. Robust Subspace Bagging (RB-SBag) • Sometimes the base models may over-fit the training data – Correlation between base models may be high • Add a Forward selection step for models – Iteratively add base models based on their performance on a validation data (Yan et al. 2009) • Adds another MapReduce job – Select the base models using forward selection based on performance metrics on a validation dataset Vc
  • 16. RB-SBag in MapReduce • (Diagram: mappers Map_1 ... Map_N score the base models h_c^1(x), h_c^2(x), ... h_c^N(x) on the validation data and emit <"c", {h_c, Prediction_c(V)}>; the reducer 1. runs forward selection of base models and 2. combines the selected base models into the composite classifier)
  • 17. COMET: Cloud of Massive Ensemble Trees • Similar to RSBag, but uses Importance-Sampled Voting (IVoting) in each base model • Samples are weighted with non-uniform probability • Each mapper creates a set of data to train on • Ensemble after k iterations = E(k) – Add new sample to training set: • Always, if E(k) incorrectly classifies the new sample • With a lower probability, e(k) / (1 − e(k)), if E(k) correctly classifies the new sample, where e(k) = error on the training dataset • Variant of Random Forests, in which IVoting generates the training samples instead of bagging • Use lazy evaluation during prediction J.D Basilico, M.A. Munson, T.G. Kolda, K.R. Dixon, W.P.Kegelmeyer, COMET: A Recipe for Learning and Using Large Ensembles on massive data, 2011, http://arxiv.org/PS_cache/arxiv/pdf/1103/1103.2068v1.pdf
  • 18. Distributed learning algorithms • Use multiple mappers and reducers to learn 1 model • Suitable for learning algorithms that – Have heavy computing per data record – One or few iterations for learning – Do not transfer much data between iterations • Typical algorithms – Fit the Statistical query model (SQM) • One/few iterations – Linear regression, Naïve Bayes, k-means clustering, pair-wise similarity etc. • More iterations have high overheads, e.g., – SVM, Logistic regression etc. – Divide and conquer • Frequent item-set mining, Approximate matrix factorization etc.
  • 19. Outline • Why learn models in MapReduce framework? • Types of learning in MapReduce • Statistical Query Model (SQM) • SQM Algorithms in MapReduce • Sequential learning methods and MapReduce • Challenges and Enhancements • Apache Mahout
  • 20. Statistical Query Model (SQM) • Learning algorithm can access the learning problem only through a statistical query oracle (Kearns 1998) • Given a function f(x,y) over data instances, the statistical query oracle returns an estimate of the expectation of f(x,y) (averaged over the data distribution).
  • 21. Statistical Query Model (SQM) • (Diagram: the learning algorithm sees samples (X, Y) from the raw data only through the statistics oracle, which it queries with a function f(x, y)) • Learning algorithms that calculate sufficient statistics of data, gradients of a function, etc. fit this model • These calculations can be expressed in a “summation form” over subgroups of data (Chu et al. 2006): Σ_subgroup f(x, y)
  • 22. SQM in MapReduce • Distribute the summation calculations over each data sub-group • Map: – Calculate function estimates over sub-groups of data • Reduce – Aggregate the function estimates from various sub- groups • Learning algorithm should be able to work with these summaries alone
  • 23. SQM in MapReduce • Assume the algorithm depends on 2 functions f(x,y) and g(x,y) • (Diagram: each of the N mappers emits <"f", Σ_subgroup f(x, y)> and <"g", Σ_subgroup g(x, y)> for its data subgroup; the reducer aggregates them into Σ_N Σ_subgroup f(x, y) and Σ_N Σ_subgroup g(x, y))
  • 24. Outline • Why learn models in MapReduce framework? • Types of learning in MapReduce • Statistical Query Model (SQM) • SQM Algorithms in MapReduce • Sequential learning methods and MapReduce • Challenges and Enhancements • Apache Mahout
  • 25. Algorithms in MapReduce • Many common algorithms can be formulated in the SQM framework (Chu et al. 2006) – Classification and Regression • Linear Regression, Naïve Bayes, Logistic regression, Support Vector Machine, Decision Trees – Clustering • K-means, Canopy clustering, Co-clustering – Back-propagation neural network – Expectation Maximization – PCA • Recommendations and Frequent Itemset mining • Graph Algorithms
  • 26. Classification and Regression algorithms in MapReduce • Linear Regression • Naïve Bayes • Logistic Regression • Support Vector Machine • Decision Trees
  • 27. Linear regression • Data vector: x_i = (x_i1, x_i2, ... x_in)^T • Real valued target: y_i • Weight of data point: w_i • Data set of m points: {x, y, w}_m • Model: y = θ^T x, with θ* = A⁻¹ b, where A = Σ_{i=1..m} w_i (x_i x_i^T) and b = Σ_{i=1..m} w_i (x_i y_i) — both in summation form
  • 28. Linear Regression in MapReduce • Map: – Input: <index, {x, y, w}> from a subgroup of data – Output: 2 types of keys • K1 – for matrix A (value = n x n matrix) • K2 – for vector b (value = n x 1 vector) • Reducer: – Aggregate the individual mapper outputs for each key – Estimate θ* = A⁻¹ b
  • 29. Linear Regression in MapReduce • A: n x n matrix, b: n x 1 vector • (Diagram: each mapper k reads its subgroup {x, y, w} and emits <"A", Σ_subgroup w_i x_i x_i^T> and <"b", Σ_subgroup w_i x_i y_i>; the reducer aggregates Σ A and Σ b and computes θ* = A⁻¹ b — a minimal sketch follows below)
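A small numpy sketch of the summation form on slides 27-29; the four-way split of a synthetic dataset stands in for the data subgroups seen by the mappers and is not the deck's implementation.

    import numpy as np

    def map_partition(X, y, w):
        # Per-subgroup partial sums: A = sum_i w_i x_i x_i^T, b = sum_i w_i x_i y_i
        A = (X * w[:, None]).T @ X
        b = X.T @ (w * y)
        return A, b

    def reduce_partials(partials):
        # Aggregate the partial A and b from all mappers and solve A theta = b
        A = sum(p[0] for p in partials)
        b = sum(p[1] for p in partials)
        return np.linalg.solve(A, b)

    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(1000), rng.normal(size=(1000, 2))])
    y = X @ np.array([5.0, 1.9, 2.8]) + 0.1 * rng.normal(size=1000)
    w = np.ones(1000)                                                      # unit data-point weights
    parts = [map_partition(X[i::4], y[i::4], w[i::4]) for i in range(4)]   # 4 "mappers"
    print(reduce_partials(parts))                                          # approx. [5.0, 1.9, 2.8]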
  • 30. Naïve Bayes • Input data: x = (x_1, x_2, ... x_n), with x_j ∈ {a_1^j, a_2^j ... a_Pj^j}, the domain of x_j • Categorical target: y ∈ {c_1, c_2 ... c_L} • Class prediction: y* = argmax_y P(y = c_k) Π_j P(x_j = a_p^j | y = c_k), combining the class prior and the conditional probability tables (CPT) • Two types of sufficient statistics – the CPT entries P(x_j = a_p^j | y = c_k) and the class priors P(y = c_k) – both obtained by summing counts over sub-groups
  • 31. Naïve Bayes in MapReduce • Map – Input data {x, y} from a subgroup of data – Output: 3 types of keys • CPT counts: key = (x_j = a_p^j, y = c_k), value = Σ_subgroup 1(x_j = a_p^j, y = c_k) • Class prior counts: key = (y = c_k), value = Σ_subgroup 1(y = c_k) • Normalization: key = "samples", value = Σ_subgroup 1 • Reduce – Sum all the values of each key – Compute the class prior and the conditional probabilities (a minimal sketch follows below)
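A pure-Python sketch of the three count keys on slide 31; the dict-of-features record format and the toy data are assumptions made for illustration.

    from collections import Counter

    def mapper(records):
        # records: iterable of (x, y), where x maps feature name -> categorical value
        for x, y in records:
            yield ("class", y), 1                       # class prior counts
            yield ("samples",), 1                       # normalization count
            for j, v in x.items():
                yield ("cpt", j, v, y), 1               # counts for P(x_j = v | y)

    def reducer(pairs):
        # Sum the values for each key
        counts = Counter()
        for key, value in pairs:
            counts[key] += value
        return counts

    data = [({"color": "red", "size": "big"}, "pos"),
            ({"color": "red", "size": "small"}, "neg"),
            ({"color": "blue", "size": "big"}, "pos")]
    counts = reducer(mapper(data))
    prior_pos = counts[("class", "pos")] / counts[("samples",)]
    p_red_given_pos = counts[("cpt", "color", "red", "pos")] / counts[("class", "pos")]
    print(prior_pos, p_red_given_pos)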
  • 32. Logistic Regression • Features: x = (x_1, x_2, ... x_n) • Categorical target: y ∈ {0, 1} • Data: {x, y}_m • Conditional probability: P(y | x, θ) = 1 / (1 + exp(−θ^T x)) • Equivalently log(p / (1 − p)) = θ^T x – the log odds is a linear function of the features
  • 33. Logistic Regression • Estimate the parameters by maximizing the log conditional likelihood of observed data: LCL = Σ_{i: y_i = 1} log p_i + Σ_{i: y_i = 0} log(1 − p_i) • Optimize using Newton-Raphson to update θ ← θ − H⁻¹ ∇LCL • Gradient: ∂LCL/∂θ_j = Σ_i (y_i − p_i) x_ij (summation form) • Hessian: H_jk = Σ_i p_i (p_i − 1) x_ij x_ik, with i ∈ [1, m] over the data and j, k ∈ [1, n] over the features
  • 34. Logistic Regression in MapReduce • A control program sets up the MapReduce iterations (1 parameter update per iteration) • Map – Input: {x, y} – Output: key = g, value = (j, Σ_subgroup (y_i − p_i) x_ij) and key = h, value = (j, k, Σ_subgroup p_i (p_i − 1) x_ij x_ik) • Reduce – Aggregate the values of ∂LCL/∂θ_j and H_jk from all mappers – Compute H⁻¹ ∇LCL – Update θ ← θ − H⁻¹ ∇LCL • Stop when updates become small (a sketch of one Newton step follows below)
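A numpy sketch of one Newton-Raphson update assembled from per-partition gradient and Hessian sums, as on slide 34; the synthetic data, the four-way partitioning, and the fixed 10 iterations play the role of the driver program and are assumptions.

    import numpy as np

    def map_partition(X, y, theta):
        # Per-subgroup contributions to the gradient and Hessian of the LCL
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (y - p)                              # sum_i (y_i - p_i) x_i
        hess = -(X * (p * (1 - p))[:, None]).T @ X        # sum_i p_i (p_i - 1) x_i x_i^T
        return grad, hess

    def reduce_and_update(theta, partials):
        grad = sum(p[0] for p in partials)
        hess = sum(p[1] for p in partials)
        return theta - np.linalg.solve(hess, grad)        # theta <- theta - H^{-1} grad(LCL)

    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
    true_theta = np.array([0.5, 1.0, -1.0])
    y = (rng.random(500) < 1.0 / (1.0 + np.exp(-X @ true_theta))).astype(float)
    theta = np.zeros(3)
    for _ in range(10):                                   # one MapReduce job per Newton step
        theta = reduce_and_update(theta, [map_partition(X[i::4], y[i::4], theta) for i in range(4)])
    print(theta)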
  • 35. Support Vector Machine • Features: x ∈ R^n • Binary target: y ∈ {−1, +1} • Objective function in primal form: min_{w,b} ||w||² + C Σ_i ξ_i^p, s.t. y_i (w^T x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, with p = 1 (hinge loss) or p = 2 (quadratic loss); for margin violators with quadratic loss, ξ_i² = (w^T x_i − y_i)² • For quadratic loss, batch gradient descent to estimate w, with gradient G_w = 2w + 2C Σ_{i: margin violated} (w·x_i − y_i) x_i (summation form)
  • 36. Support Vector Machine in MapReduce • Map – Input: {x, y} – Output: key = G_w, value = 2w + 2C Σ_subgroup (w·x_i − y_i) x_i over the margin violators in the subgroup • Reduce – Aggregate the values of the gradient from all mappers – Update w ← w − η · G_w • Driver program sets up the iterations and checks for convergence (a rough sketch follows below)
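A rough numpy sketch of the quadratic-loss batch gradient step on slides 35-36. Unlike the slide, the regularizer term 2w is added once in the reducer rather than by every mapper, and the step size, synthetic data, and iteration count are illustrative assumptions.

    import numpy as np

    def map_partition(X, y, w, C):
        # Per-subgroup gradient term: 2C * sum over margin violators of (w.x_i - y_i) x_i
        viol = y * (X @ w) < 1.0
        return 2.0 * C * X[viol].T @ (X[viol] @ w - y[viol])

    def reduce_and_update(w, partial_grads, eta):
        grad = 2.0 * w + sum(partial_grads)               # regularizer term added once here
        return w - eta * grad

    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 2)) + np.array([[1.5, 1.5]]) * np.sign(rng.normal(size=(400, 1)))
    y = np.sign(X[:, 0] + X[:, 1])
    w = np.zeros(2)
    for _ in range(200):                                  # driver loops until convergence
        grads = [map_partition(X[i::4], y[i::4], w, C=1.0) for i in range(4)]
        w = reduce_and_update(w, grads, eta=0.0005)
    print(w)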
  • 37. Decision Trees • Features: x  ( x1 , x2 ,...xn ) • Targets: y [0,1] or yR Data: D   x, y  m • • Construct Tree – Each node splits the data by feature value – Start from root • Select best feature, value to split the node – Based on reduction in data impurity between the child and parent nodes – Select the next child node – Repeat the process till some stopping criterion • Pure node, or data is below some threshold etc.
  • 38. Decision Trees Expensive step for Large datasets B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo, PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce, 2009, Proceedings of The Vldb Endowment - PVLDB, vol. 2, no. 2, pp. 1426-1437
  • 39. PLANET for Decision Trees • Parallel Learner for Assembling Numerous Ensemble Trees (PLANET- Panda et al. 2009) – Main idea is to use MapReduce to determine the best feature value splits for nodes from large datasets • Each intermediate node has a sub-set of all data falling into it • If this sub-set is small enough to fit in memory, – Grow remaining sub-tree in memory • Else, – Launch a MapReduce job to find candidate feature value splits – Select the best feature split from among the candidates
  • 40. PLANET for Decision Trees • 5 main components 1. Controller • Monitors and controls the growth of tree 2. Initialization Task (MapReduce task) • Identifies all feature values to be considered for splits 3. FindBestSplit Task (MapReduce task) • Finds best split when there is too much data to fit in memory 4. InMemoryGrow Task (MapReduce task) • Grow an entire sub-tree once the data fits in memory 5. Model File • File describing the state of the model
  • 41. PLANET for Decision Trees • Maintain 2 queues – MapReduceQueue (MRQ) • Contains nodes for which data is too large to fit in memory – InMemoryQueue (InMemQ) • Contains nodes for which data fits in memory • 2 main MapReduce jobs – MR_ExpandNodes • Process nodes from the MRQ to find best split • Output for each node: – Candidate split positions for node along with » Quality of split (using summary statistics) » Predictions in left and right branches » Size of data going into left and right branches – MR_InMemory • Process nodes from the InMemQ. • For a given set of nodes N, complete tree induction at nodes in N using the InMemoryGrow algorithm.
  • 42. PLANET for Decision Trees • Map function in MR_ExpandNodes – Load the current model file and set of nodes N from MRQ – For each record • Determine if record is relevant to any of the nodes in N • Add record to the summary statistics (SS) for that node • For each feature-value in record – Add record to the summary statistics of the node for split points “s” less than the value “v” in the record – Output (Split ID → SS of candidate splits): • key = (n ∈ N, x = ordered feature, s); value = T_{n, x<s} • key = (n ∈ N, x = categorical feature); value = (v, T_{n, x=v}) • key = (n ∈ N); value = SS of the parent node – For variance impurity, SS = (Σ_subgroup y, Σ_subgroup y², Σ_subgroup 1)
  • 43. PLANET for Decision Trees • Reduce function in MR_ExpandNodes – For each node • Aggregate the summary statistics for that node – For each split (which is node specific) • Aggregate the summary statistics for that Split ID from all map outputs of summary statistics • Compute impurity of data going into left and right branches • Total impurity = Impurity in left branch + Impurity in right branch • If Total impurity < Best split impurity so far – Best split = Current split – Output the best split found
  • 44. Clustering algorithms in MapReduce • k-means clustering • Canopy clustering • Co-clustering
  • 45. k-means clustering • Choose k samples as initial cluster centroids • Iterate till convergence – Assign membership of each point to closest cluster MR – Re-compute new cluster centroids using assigned members • Control program to – Initialize the centroids • random, initial clustering on sample etc. – Run the MapReduce iterations – Determine stopping criterion
  • 46. k-means clustering in MapReduce • Map – Input data points: x_1, x_2 ... x_N – Input cluster centroids: C = (c_1, c_2, ... c_K) – Assign each data point to its closest cluster – Output: key = c_i, value = (Σ_subgroup x_j | x_j → c_i, Σ_subgroup 1 | x_j → c_i) • Reduce – Compute new centroids for each cluster: c_i = (Σ_{key = c_i} Σ_subgroup x_j | x_j → c_i) / (Σ_{key = c_i} Σ_subgroup 1 | x_j → c_i) (a minimal sketch follows below)
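A small numpy sketch of one k-means iteration in the style of slide 46: mappers emit per-cluster partial sums and counts, the reducer recomputes the centroids, and a toy driver loops for a fixed number of iterations (the data, the split into 4 partitions, and the iteration count are illustrative assumptions).

    import numpy as np
    from collections import defaultdict

    def mapper(points, centroids):
        # Assign each point to the closest centroid; emit <cluster, (partial sum, count)>
        sums, counts = defaultdict(lambda: 0.0), defaultdict(int)
        for x in points:
            c = int(np.argmin(((centroids - x) ** 2).sum(axis=1)))
            sums[c] = sums[c] + x
            counts[c] += 1
        for c in sums:
            yield c, (sums[c], counts[c])

    def reducer(pairs, k, dim):
        total = {c: (np.zeros(dim), 0) for c in range(k)}
        for c, (s, n) in pairs:
            total[c] = (total[c][0] + s, total[c][1] + n)
        return np.array([s / n if n else np.zeros(dim) for s, n in total.values()])

    rng = np.random.default_rng(0)
    points = np.concatenate([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
    centroids = points[rng.choice(len(points), 2, replace=False)]
    for _ in range(10):                                   # driver: one MapReduce job per iteration
        pairs = [kv for part in np.array_split(points, 4) for kv in mapper(part, centroids)]
        centroids = reducer(pairs, k=2, dim=2)
    print(centroids)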
  • 47. Complexity of k-means clustering • Each point is compared with each cluster centroid • Complexity = N * K * O(d ) where O(d ) is the complexity of the distance metric • Typical Euclidean distance is not a cheap operation • Can reduce complexity using an initial canopy clustering to partition data cheaply – Preliminary step to help reduce expensive distance calculations – Group data into (possibly overlapping) canopies using a cheap distance metric (McCallum et al. 2000) – Compute the distance metric between a point and a cluster centroid only if they share a canopy.
  • 48. Canopy clustering • Every point in the dataset is in a canopy • A point can belong to multiple canopies • Canopy size = T1 • Algorithm – Keep a list of canopies, initially an empty list – Scan each data point: • If it is within T2 < T1 distance of existing canopies, discard it. Otherwise, add this point into the list of canopies – Use a cheap distance metric to construct the canopies • e.g. Manhattan distance, L – Assign points to the closest canopy A. McCallum, K. Nigam, L. Ungar. Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching, SIGKDD 2000
  • 49. Canopy clustering Image from: http://horicky.blogspot.com/2011/04/k-means-clustering-in-map-reduce.html
  • 50. Canopy clustering in MapReduce • Map – Input data points: x_1, x_2 ... x_N – If a data point is not within distance T2 of an existing candidate canopy, add it as a candidate canopy point – Output: key = 1, value = x_i | x_i ∈ candidate canopy • Reduce – Keep a list of final canopy points, initially an empty list – If a canopy point is not within distance 0.5*T2 of an existing final canopy point, add it as a final canopy point – Output: key = 1, value = x_i | x_i ∈ final canopy (a minimal sketch follows below)
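A pure-Python sketch of the canopy construction on slides 48-50, using Manhattan distance as the cheap metric; the thresholds and toy points are assumptions, and the single reducer key mirrors the slide's key = 1.

    def cheap_distance(a, b):
        # Manhattan (L1) distance as the cheap metric
        return sum(abs(u - v) for u, v in zip(a, b))

    def canopy_mapper(points, t2):
        candidates = []
        for x in points:
            if all(cheap_distance(x, c) > t2 for c in candidates):
                candidates.append(x)
        for c in candidates:
            yield 1, c                                   # one key, so one reducer sees all candidates

    def canopy_reducer(candidates, t2):
        finals = []
        for c in candidates:
            if all(cheap_distance(c, f) > 0.5 * t2 for f in finals):
                finals.append(c)
        return finals

    points = [(0, 0), (0.2, 0.1), (5, 5), (5.3, 4.9), (10, 0)]
    cands = [c for part in (points[:3], points[3:]) for _, c in canopy_mapper(part, t2=1.0)]
    print(canopy_reducer(cands, t2=1.0))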
  • 51. Canopy + k-means clustering • Final step in canopy clustering assigns all points to the closest final canopy point – Map only operation • Speeding up k-means using canopy clustering – Initial run of canopy clustering on the data (or on a sample of data) • Pick canopy centers • Assign points to canopies – Pick initial k-means cluster centroids • Run k-means iterations – Compute distance between point and centroid only if they are in the same canopy
  • 52. Co-clustering • Cluster pair-wise relationships in dyadic data • Simultaneously cluster both rows and clusters, based on certain criteria • Identify sub-matrices of rows and columns that are inter-related • Commonly used in text mining, recommendation systems and graph mining
  • 53. Co-clustering • Given an m x n matrix – Find group assignments of rows and columns such that the resulting sub-matrices are smooth (Papadimitriou & Sun, 2008) – Assign rows and columns to clusters: r ∈ {1, 2, ... k}^m, c ∈ {1, 2, ... l}^n, with k ≪ m, l ≪ n • (Slide shows a small binary matrix before and after reordering, with example assignments r = (2 1 2 1)^T and c = (2 1 2 1 1)^T)
  • 54. Co-clustering • Iteratively re-arrange rows and columns as long as an error function keeps decreasing • Algorithm: Input A_{m x n}, k, l – Initialize r and c – Compute a group statistics/cost matrix G_{k x l} – While cost decreases: • For each row i = 1...m do – For each row group label p = 1...k do: set r(i) ← p if the cost decreases • Update G, r • Do the same for columns – Return r and c S. Papadimitriou, J. Sun, DisCo: Distributed Co-clustering with Map-Reduce, 2008, ICDM '08. Eighth IEEE International Conference on Data Mining, pp 512-521
  • 55. Co-clustering in MapReduce • Assumptions – Error can be computed using r , c, G only (sufficient statistics) – Row assignments can be based on r , c, G, ai: (greedy search) • Map: – Cost matrix and column cluster assignments are in all mappers – Input: • Key = row index i • Value = adjacency list for row i  ai: – Compute: • Row statistics for current column cluster assignment gi (ai: , c) • Assign row to row cluster r (i) {1 k} that has the lowest cost – Output: key  r (i ) Row cluster label for row value  ( gi ,{i}) Cost of cluster assignment, row
  • 56. Co-clustering in MapReduce • Reduce – For each row cluster label p = r(i), merge the rows and the total cost: g_p = Σ_{j: r(j) = p} g_j (total cost) and I_p = I_p ∪ {i} (rows in this row cluster) – Output <p, (g_p, I_p)> • Collect the results for each row cluster – For each reduce output, set g_{p:} = g_p and r(i) = p for all i ∈ I_p
  • 57. Co-clustering in MapReduce – Example • Assume a row and column partitioning of a small 4 x 5 binary matrix with k = 2, l = 2, and cost function = number of non-zeros per group • Moving one row between row clusters changes r from (1, 1, 1, 2) to (1, 2, 1, 2), with c = (1, 1, 1, 2, 2), and updates the group statistics from G = [[4 4], [2 0]] to G = [[2 4], [4 0]] • Map: input (2, <1, 3>) → output (r(2) = 2, (g_2 = (2, 0), {2})) • Reduce: input (2, <(2, 0), {2}>) → output g_2 = (2, 0), I_2 = I_2 ∪ {2} S. Papadimitriou, J. Sun, DisCo: Distributed Co-clustering with Map-Reduce, 2008, ICDM '08. Eighth IEEE International Conference on Data Mining, pp 512-521
  • 58. Recommendations and Frequent Itemset mining • Item-based collaborative filtering • Pair-wise similarity • Low-rank matrix factorization • Frequent Itemset mining
  • 59. Item-based collaborative filtering • Given a user-item ratings matrix, fill in the ratings of the missing items for each user (the slide shows a user-item table with some ratings missing, marked “?”) • Infer missing ratings from the available item ratings for the user, weighted by the similarity between items: R(u, i) = Σ_{j: R(u,j) ≠ ?} sim(i, j) · R(u, j) / Σ_{j: R(u,j) ≠ ?} sim(i, j)
  • 60. Item-based collaborative filtering • Estimate similarity between items as the Pearson correlation of ratings from users who have rated both items: sim(i, j) = Σ_{U_ij} (R(u, i) − R̄(i)) (R(u, j) − R̄(j)) / sqrt( Σ_{U_ij} (R(u, i) − R̄(i))² · Σ_{U_ij} (R(u, j) − R̄(j))² ), where U_ij = {u | R(u, i) ≠ ?, R(u, j) ≠ ?}
  • 61. Item-based collaborative filtering using MapReduce • Map – Input: key = u, value = {(i, R(i)) | R(i) ≠ ?} – Output (ratings for item pairs): key = (i, j), value = (R(i), R(j)) • Reduce – Input: key = (i, j), value = [(R(i), R(j))] – Output: key = (i, j), value = sim(i, j) (a simplified sketch follows below)
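A simplified pure-Python sketch of slides 60-61: the mapper emits co-ratings per item pair and the reducer computes a Pearson-style similarity. Centering the ratings on the co-rated means (one reasonable reading of R̄) and the toy ratings are assumptions made for illustration.

    from collections import defaultdict
    from itertools import combinations
    from math import sqrt

    def mapper(user_ratings):
        # user_ratings: iterable of (user, {item: rating}); emit <(i, j), (R(i), R(j))> per user
        for _, ratings in user_ratings:
            for (i, ri), (j, rj) in combinations(sorted(ratings.items()), 2):
                yield (i, j), (ri, rj)

    def reducer(pairs):
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        sims = {}
        for (i, j), co in groups.items():
            ri = [a for a, _ in co]; rj = [b for _, b in co]; n = len(co)
            mi, mj = sum(ri) / n, sum(rj) / n
            num = sum((a - mi) * (b - mj) for a, b in co)
            den = sqrt(sum((a - mi) ** 2 for a in ri) * sum((b - mj) ** 2 for b in rj))
            sims[(i, j)] = num / den if den else 0.0
        return sims

    data = [("u1", {"A": 5, "B": 1, "C": 4}),
            ("u2", {"A": 4, "B": 2, "C": 5}),
            ("u3", {"A": 1, "B": 5, "C": 2})]
    print(reducer(mapper(data)))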
  • 62. Pair-wise Similarity • Compute similarity between pairs of documents in a corpus: S(d_i, d_j) = Σ_{t ∈ V} w_{t,di} · w_{t,dj} = Σ_{t ∈ di ∩ dj} w_{t,di} · w_{t,dj} • Generate a postings list for each t ∈ V: P(t) = {(d_i, w_{t,di}) | w_{t,di} > 0} – This is an easy MapReduce job
  • 63. Pair-wise Similarity in MapReduce • Generating a postings list (inverted index) – Map: input d_i; for each t ∈ d_i, emit {t, (d_i, w_{t,di})} – Reduce: emit {t, [(d_i, w_{t,di})]}
  • 64. Pair-wise Similarity in MapReduce • Map – Input term postings list <t, P(t)> – Take the Cartesian product of the postings list with itself • For each pair (d_i, d_j) ∈ P(t), emit <(i, j), sim(i, j) = w_{t,di} · w_{t,dj}> • Reduce – For each key (i, j), Sim(d_i, d_j) = Σ sim(i, j) (a sketch of both jobs follows below)
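A pure-Python sketch of the two jobs on slides 63-64 (inverted index, then pairwise partial scores); the tiny document weights are made up for illustration, and in-memory dicts stand in for the MapReduce shuffle.

    from collections import defaultdict
    from itertools import combinations

    def build_postings(docs):
        # Job 1: inverted index, term -> [(doc_id, weight)]
        postings = defaultdict(list)
        for doc_id, weights in docs.items():
            for term, w in weights.items():
                postings[term].append((doc_id, w))
        return postings

    def similarity_mapper(postings):
        # Job 2 map: for each term, emit a partial score for every co-occurring doc pair
        for term, plist in postings.items():
            for (di, wi), (dj, wj) in combinations(plist, 2):
                yield tuple(sorted((di, dj))), wi * wj

    def similarity_reducer(pairs):
        # Job 2 reduce: sum the partial scores per doc pair
        scores = defaultdict(float)
        for key, partial in pairs:
            scores[key] += partial
        return dict(scores)

    docs = {"d1": {"hadoop": 0.8, "mapreduce": 0.5},
            "d2": {"hadoop": 0.4, "spark": 0.9},
            "d3": {"mapreduce": 0.7}}
    print(similarity_reducer(similarity_mapper(build_postings(docs))))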
  • 65. Pair-wise Similarity in MapReduce • Cartesian product of postings list with itself may produce a large set of intermediate keys • Modify the above algorithm as follows – Split the corpus into blocks of documents and query against postings list – Map • Input term postings list t , P(t ) • Load blocks of documents in memory • For each document d i in block – If t  di compute partial score for each element – Reduce • For each document, aggregate the partial scores from mappers for all other documents • Can reduce intermediate keys by implementing term limits when documents are loaded into memory
  • 66. Low-rank matrix factorizations • Useful for analyzing patterns in dyadic data: V_{m x n} ≈ W_{m x d} H_{d x n}, with d ≪ min(m, n) • Given an application dependent loss function, find argmin_{W,H} L(V, W, H) • Most loss functions are sums of local losses: L = Σ_{(i,j) ∈ Z} l(V_ij, W_i*, H_*j) • Use stochastic gradient descent (SGD) for this factorization
  • 67. SGD for matrix factorization • Training set Z = {V_ij | V_ij ≠ ?}, initial values W_0, H_0 • While not converged, do – Select a training point (i, j) ∈ Z uniformly at random – W'_i* ← W_i* − ε_n N ∂/∂W_i* l(V_ij, W_i*, H_*j) – H_*j ← H_*j − ε_n N ∂/∂H_*j l(V_ij, W_i*, H_*j) – W_i* ← W'_i* • end while • For local losses, the updates depend only on l(V_ij, W_i*, H_*j) R. Gemulla, P.J. Haas, E. Nijkamp, Y. Sismanis, IBM Tech Report, 2011
  • 68. SGD for matrix factorization in MapReduce • Main ideas – Local loss depends only on V_ij, W_i*, H_*j – If sub-matrices do not share rows and columns, they can be factored independently and the factors combined: for a block-diagonal Z = diag(Z^11, Z^22, ... Z^dd), factor each block Z^b ≈ W^b H^b and combine W = [W^1; W^2; ...; W^d] (stacked) and H = [H^1, H^2, ... H^d] (concatenated) – Stratify the input matrix such that each stratum can be processed in a distributed manner
  • 69. SGD for matrix factorization in MapReduce • Stratify the input matrix (dropping missing values) into subsets Z_s^1, Z_s^2, ... Z_s^d such that (i, j) ∈ Z_s^b1 and (i', j') ∈ Z_s^b2 with b1 ≠ b2 implies i ≠ i' and j ≠ j' • Stratification – Randomly permute the rows and columns of the input matrix – Block it into d x d blocks of size m/d x n/d – For a permutation (j_1, j_2 ... j_d) of 1...d, a stratum is Z_s = Z^{1 j_1} ∪ Z^{2 j_2} ∪ Z^{3 j_3} ... ∪ Z^{d j_d} R. Gemulla, P.J. Haas, E. Nijkamp, Y. Sismanis, IBM Tech Report, 2011
  • 70. SGD for matrix factorization in MapReduce • Training set Z, initial values W_0, H_0, cluster size d • W ← W_0, H ← H_0; block Z / W / H into d x d / d x 1 / 1 x d blocks • While not converged, do (epochs) – Pick step size ε – For s = 1...d do (sub-epochs) • Pick d blocks {Z^{1 j_1}, Z^{2 j_2}, ... Z^{d j_d}} to form a stratum Z_s • For b = 1...d do (in parallel across machines): run SGD on the points in Z^{b j_b} with step size ε – end for • end while R. Gemulla, P.J. Haas, E. Nijkamp, Y. Sismanis, IBM Tech Report, 2011
  • 71. Frequent Itemset Mining • Set of items I  {a1 , a2 ...aM } and D  {T1 , T2 ...TN } where Ti  subsets of I • Pattern A  I is frequent if support( A)   • Problem – Find all complete frequent item-sets of D • Divide and conquer approach – Patterns containing A can be found using only transactions containing A. – Filter transactions with A – conditional database (CDB) of A – Find patterns containing A in CDB(A)
  • 72. Frequent Itemset Mining • Construct a Frequent Pattern (FP) Tree – Keep only items with frequency above the minimum support – Sort each transaction in descending order of frequent items – Add each sorted transaction to an item prefix tree – Each node in the FP tree is an item • Node has count of transactions with that item in that path • Nodes of same items in different paths are linked together • FPGrowth algorithm – Start from CDB of single frequent item – Build FP Tree of CDB – Mine frequent patterns from CDBs using recursion • Recursion terminates when CDB has a single path • Frequent pattern = Union of all nodes in this tree with support = min. support of nodes in this tree Mining frequent patterns without candidate generation, J. Han, J. Pei,Y. Yin. 2000, In SIGMOD, 2000.
  • 73. Frequent Itemset Mining – Example • (Slide shows a worked example: the original transactions, the transactions sorted by item frequency, the frequent items with their counts f:4, c:4, a:3, b:3, m:3, p:3, and the conditional databases of the frequent items, e.g. p: {f c a m / f c a m / c b}, m: {f c a / f c a / f c a b}, b: {f c a / f c}, a: {f c / f c / f c}, c: {f / f / f}, f: {})
  • 74. Frequent Itemset Mining in MapReduce • Identifying frequent items = 1 MapReduce job – Find the set of items and the associated frequency • Prune this frequent items list keeping only items more frequent than minimum support • Mine subsequent projected CDBs in MapReduce iterations (Li et al. 2008) – Project transactions in CDB by least frequent item in the mapper – Breadth first search of the FP Tree using a MapReduce iteration – Once projected CDB fits in memory of reducer • Run FPGrowth algorithm in reducer • No more growth of the sub-tree
  • 75. Frequent Itemset Mining in MapReduce – Example • (Slide shows the conditional databases being projected and mined over 3 MR iterations: D is projected into D|p, D|m, D|b, D|a, D|c in iteration 1, further into D|cm, D|am, D|ca in iteration 2, and into D|cam in iteration 3, yielding frequent patterns such as pc:3, mf:3, mc:3, ma:3, mfc:3, mfa:3, mca:3, mfca:3, af:3, ac:3, afc:3, cf:3)
  • 76. Graph Algorithms • Ubiquitous in web applications – Web-graph, Social network graph, User-item graph • Typical problems – Popularity (e.g. PageRank) – Shortest paths – Clustering, semi-clustering etc.
  • 77. Graph algorithms in MapReduce • Vertex centric approach – Work with the adjacency list of each vertex – Especially useful for sparse adjacency matrices • Breadth first search – Each MR iteration advances the horizon by one level • In each iteration 1. Compute on each vertex 2. Pass values to connected vertices for aggregation in the reducer 3. Pass the adjacency list of each node to the reducer
  • 78. Breadth first search on Graphs in MapReduce • Easy (iterative) implementations exist for some common algorithms – Single source shortest path – PageRank • (Diagram: a small graph with nodes labeled by their depth 1, 2, 3; each MR iteration advances the search frontier by one more level)
  • 79. Single source shortest path in MapReduce • Find the shortest path from a given node to any reachable node • Given a start node: – Distance to adjacent nodes = 1 – Distance to any other node reachable from a set of nodes S: DistanceTo(n) = 1 + min(DistanceTo(m), m ∈ S) • Map – Input: node “n”, its distance D, and the adjacency list of “n” – Output: <p, D+1> for each node “p” in the adjacency list, plus <n, adjacency list of “n”> to pass the graph from one iteration to the next • Reduce – Input: “p” with “D+1” from all nodes pointing to “p”, and the adjacency list of “n” – Output: <p, min(“D+1” over all nodes pointing to “p”)> and <n, adjacency list of “n”> (a minimal sketch follows below)
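A pure-Python sketch of one shortest-path iteration as on slide 79, with unit edge weights; the in-memory driver loop and the tiny graph are assumptions. Each node carries (distance, adjacency list) so the graph is passed from one iteration to the next.

    import sys
    from collections import defaultdict

    def mapper(nodes):
        # nodes: node -> (tentative distance, adjacency list)
        for n, (d, adj) in nodes.items():
            yield n, ("graph", adj)                      # pass the structure along
            yield n, ("dist", d)                         # keep the node's current distance
            if d < sys.maxsize:
                for p in adj:
                    yield p, ("dist", d + 1)             # candidate distance via n

    def reducer(pairs):
        dist, graph = defaultdict(lambda: sys.maxsize), {}
        for n, (tag, value) in pairs:
            if tag == "graph":
                graph[n] = value
            else:
                dist[n] = min(dist[n], value)
        return {n: (dist[n], graph[n]) for n in graph}

    nodes = {"a": (0, ["b", "c"]), "b": (sys.maxsize, ["d"]),
             "c": (sys.maxsize, ["d"]), "d": (sys.maxsize, [])}
    for _ in range(3):                                   # one MapReduce iteration per BFS level
        nodes = reducer(mapper(nodes))
    print({n: d for n, (d, _) in nodes.items()})         # {'a': 0, 'b': 1, 'c': 1, 'd': 2}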
  • 80. PageRank • Given a node A: PR(A) = d + (1 − d) · Σ_{T_i: T_i → A} PR(T_i) / C(T_i), where d = random jump probability, T_i = node pointing to A, C(T_i) = out-degree of T_i • Iterate this equation till convergence • Driver program to check if the page rank for each node has converged
  • 81. PageRank in MapReduce • In each iteration (i) • Map – Input: node “n”, PR_{i-1}(n), adjacency list of “n” – Compute V = PR_{i-1}(n) / |adjacency list of n| – Output: <p, V> for each node “p” in the adjacency list, plus <n, adjacency list of “n”> • Reduce – Input: <p, V from all nodes “n” pointing to “p”>, and the adjacency list of “n” – Compute PR_i(p) = Sum(V) – Output: <p, PR_i(p)> and <n, adjacency list of “n”> (a minimal sketch follows below)
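A pure-Python sketch of one PageRank iteration following the formula on slide 80, with d used as the random jump probability as the deck defines it; the unnormalized toy graph, the fixed 20 iterations, and the in-memory driver are assumptions.

    from collections import defaultdict

    def mapper(nodes):
        # nodes: node -> (pagerank, adjacency list)
        for n, (pr, adj) in nodes.items():
            yield n, ("graph", adj)                      # pass the structure along
            for p in adj:
                yield p, ("pr", pr / len(adj))           # share of n's rank sent to p

    def reducer(pairs, d=0.15):
        mass, graph = defaultdict(float), {}
        for n, (tag, value) in pairs:
            if tag == "graph":
                graph[n] = value
            else:
                mass[n] += value
        # PR(A) = d + (1 - d) * sum of incoming PR(T)/C(T), as on slide 80
        return {n: (d + (1 - d) * mass[n], graph[n]) for n in graph}

    nodes = {"a": (1.0, ["b", "c"]), "b": (1.0, ["c"]), "c": (1.0, ["a"])}
    for _ in range(20):                                  # driver checks convergence between jobs
        nodes = reducer(mapper(nodes))
    print({n: round(pr, 3) for n, (pr, _) in nodes.items()})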
  • 82. Frameworks for graph algorithms • MapReduce is not a good fit for graph algorithms – 1 iteration for each level of the graph has large overheads • “Bulk synchronous processing model” for graph processing. – Components – for either compute or storage – Router – to deliver point to point messages – Synchronization at periodic intervals (called supersteps) that are atomic • In each superstep, vertex can – Receive messages sent by other vertices in previous superstep – Compute using the data in that vertex and the received messages – Send messages to other vertices
  • 83. Frameworks for graph algorithms • Vertex can vote to go to halt state • Computation stops when all vertices have voted to halt. • Vertices can also mutate the graph – Add/remove edges and other vertices – Mutations implemented in next superstep • Framework also supports aggregators – Can maintain global summaries over the graph – Values communicated to all vertices before the next superstep • Large scale graph processing tools leveraging Grid – Pregel (in Google) – Open source implementation Giraph https://github.com/aching/Giraph
  • 84. Outline • Why learn models in MapReduce framework? • Types of learning in MapReduce • Statistical Query Model (SQM) • SQM Algorithms in MapReduce • Sequential learning methods and MapReduce • Challenges and Enhancements • Apache Mahout
  • 85. Sequential learning methods • Some learning algorithms are inherently sequential in nature, e.g., – Stochastic Gradient Descent (SGD) minimization – Conditional Maximum Entropy using SGD – Perceptron • Difficult to distribute sequential algorithms over data partitions – Need frequent communication of intermediate parameter values • Some sequential algorithms can be trained in a cluster environment. – Theoretical and empirical analysis show that parameters converge to the values from sequential training over all data
  • 86. Sequential learning methods in MapReduce • Types of sequential learning in MapReduce – Single M/R job: • Learn parameters on each data partition in mappers over multiple epochs • Average the model parameters from all mappers in a reducer – Multiple M/R jobs: • Learn parameters on each data partition in each mapper for 1 epoch • Average the model parameters from all mappers in a reducer • Start the next iteration for next epoch in the mapper with the average parameter values from previous iteration – Communicate between nodes • Launch MPI on Hadoop cluster
  • 87. Stochastic Gradient Descent (SGD) methods • Many learning algorithms involve optimizing an objective function (maximizing log likelihood, minimizing root mean square error etc.) over the training data to determine the optimal parameters: w* = argmin_w Σ_{i ∈ training data} L(x_i, y_i, w), with batch updates w ← w − η · Σ_{i ∈ training data} ∇_w L(x_i, y_i, w) • Stochastic gradient techniques update the parameters one example at a time: w ← w − η · ∇_w L(x_i, y_i, w) • Parameter updates are inherently sequential
  • 88. Parallelized SGD • Partition the training data into multiple partitions, each with T examples chosen at random • Perform stochastic gradient updates on each data partition separately with constant learning rate. • Average the solutions between different machines. • For large scale data, (Zinkevich et al. 2010) show that – Parameter values converge to sequential estimates – For k partitions, averaging the parameters reduces variance by O(k 1 2 ) – Bias in parameter estimates decreases as well
  • 89. Parallelized SGD in MapReduce • Map: in each mapper i = 1...k (machines), initialize w_{i,0} = 0 and, for t = 1...T (data points), update w_{i,t} ← w_{i,t−1} − η · ∇_w L(x, y, w_{i,t−1}) • Reduce: aggregate from all mappers, v = (1/k) Σ_{i=1..k} w_{i,T} (average across all machines) — a minimal sketch follows below
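A numpy sketch of the single-job scheme on slide 89: each “mapper” runs plain SGD on its own partition with a constant step size, and the “reducer” averages the resulting weight vectors. Squared loss, the step size, and the synthetic data are assumptions made for illustration.

    import numpy as np

    def sgd_on_partition(X, y, eta, epochs=1, seed=0):
        # One mapper: sequential SGD for squared loss over its own data partition
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for i in rng.permutation(len(y)):
                w -= eta * (X[i] @ w - y[i]) * X[i]
        return w

    def average_models(weights):
        # One reducer: average the per-partition parameter vectors
        return np.mean(weights, axis=0)

    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(4000), rng.normal(size=(4000, 2))])
    y = X @ np.array([1.0, 2.0, -1.0]) + 0.01 * rng.normal(size=4000)
    parts = np.array_split(np.arange(4000), 4)                     # 4 data partitions
    w = average_models([sgd_on_partition(X[p], y[p], eta=0.01, seed=k)
                        for k, p in enumerate(parts)])
    print(w)                                                       # approx. [1.0, 2.0, -1.0]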
  • 90. Parallelized SGD in MapReduce • Multi-pass parallel SGD (Weimer, Rao, Zinkevich 2010) – Divide the data randomly among all machines, with c_t^j = t-th example sent to the j-th machine – Initialize weight vector w* – For i ∈ {1...T} iterations do • For each machine j ∈ {1...k} do (in parallel): w^j ← w*; shuffle the local data uniformly at random with a permutation p: {1...m} → {1...m}; for each t ∈ {1...m} do w^j ← w^j − η ∂/∂w c_{p(t)}^j(w^j) • w* ← (1/k) Σ_{j=1..k} w^j (average across all machines; the average is the initial value for the next iteration) – end for
  • 91. Conditional MaxEnt models • Used in both binary and multi-class classification problems • Commonly used in NLP and computer vision • S = {(x_1, y_1), (x_2, y_2) ... (x_m, y_m)} • p_w(y | x) = (1 / Z(x)) exp(w · Φ(x, y)), with feature map Φ(x, y) and Z(x) = Σ_{y ∈ Y} exp(w · Φ(x, y)) • w = argmin_w F_S(w) = argmin_w λ ||w||² − (1/m) Σ_{i=1..m} log p_w(y_i | x_i) • Prediction: y = argmax_y p_w(y | x)
  • 92. Conditional MaxEnt in MapReduce • Mixture weighting method (Mann et al. 2009) – Train a model in each of the M mappers using standard gradient descent on a subsample of the data: in the k-th mapper, w_k ← 0; for t = 1...T do w_k ← w_k − η · ∇_{w_k} F_S(w_k); return w_k – Average the weights from all the mappers in 1 reducer: w = Σ_{k=1..M} μ_k w_k, with μ_k ≥ 0 and Σ_{k=1..M} μ_k = 1 – Mann et al. (2009) show that the mixture weighting estimate converges to the sequential estimate
  • 93. Perceptron algorithm • Online algorithm used in NLP for structure prediction, e.g., parsing, named entity recognition, machine translation etc. • Perceptron(D = {x_i, y_i}): w^(0) ← 0; k ← 0; for n = 1...N epochs: for t = 1...|D| over the data: y' = argmax_{y'} w^k · f(x_t, y') (predict using current weights); if (y' ≠ y_t): w^(k+1) ← w^k + f(x_t, y_t) − f(x_t, y') (add weight to features of the correct output, remove weight from features of the incorrect output); k ← k + 1; return w^k
  • 94. Perceptron in MapReduce • Iterative parameter mixing – Train using a data sub-group for 1 epoch in each mapper – Average the weights in the reducer – Communicate the average back to the mappers – Train the next epoch in the mappers • OneEpochPerceptron(D, w): w^(0) ← w; k ← 0; for t = 1...|D|: y' = argmax_{y'} w^k · f(x_t, y'); if (y' ≠ y_t): w^(k+1) ← w^k + f(x_t, y_t) − f(x_t, y'); k ← k + 1; return w^k • Driver: w ← 0; for n = 1...N: w^(i,n) = OneEpochPerceptron(D_i, w) in each mapper i; w ← Σ_i μ_{i,n} w^(i,n) (average across all machines in each iteration) — a sketch of the binary case follows below
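A numpy sketch of iterative parameter mixing for the binary perceptron; the structured argmax on slides 93-94 is replaced by a sign decision to keep the example self-contained, and the uniform mixing weights, toy data, and epoch count are assumptions.

    import numpy as np

    def one_epoch_perceptron(X, y, w):
        # One mapper: a single perceptron epoch over its partition, starting from w
        w = w.copy()
        for xi, yi in zip(X, y):
            if yi * (xi @ w) <= 0:                       # mistake: move towards the correct label
                w += yi * xi
        return w

    def iterative_parameter_mixing(partitions, epochs):
        w = np.zeros(partitions[0][0].shape[1])
        for _ in range(epochs):                          # one MapReduce job per epoch
            w = np.mean([one_epoch_perceptron(X, y, w) for X, y in partitions], axis=0)
        return w                                         # averaged weights fed back each epoch

    rng = np.random.default_rng(0)
    X = rng.normal(size=(800, 2)) + np.array([[2.0, 2.0]]) * np.sign(rng.normal(size=(800, 1)))
    y = np.sign(X @ np.array([1.0, 1.0]))
    parts = [(X[i::4], y[i::4]) for i in range(4)]
    print(iterative_parameter_mixing(parts, epochs=5))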
  • 95. Perceptron in MapReduce • McDonald et al. (2010) show that averaging parameters after each epoch: – Has as good or better performance as sequential training on all data – Trains better classifiers quicker than training sequentially on all data – Performs better than averaging parameters from training model in each partition for multiple epochs to convergence
  • 96. Outline • Why learn models in MapReduce framework? • Types of learning in MapReduce • Statistical Query Model (SQM) • SQM Algorithms in MapReduce • Sequential learning methods and MapReduce • Challenges and Enhancements • Apache Mahout
  • 97. Challenges for ML algorithms on Hadoop • Hadoop is optimized for large batch data processing – Assumes data parallelism – Ideal for shared nothing computing • Many learning algorithms are iterative – Incur significant overheads per iteration • Multiple scans of the same data – Typically once per iteration  high I/O overhead reading data into mappers per iteration – In some algorithms static data is read into mappers in each iteration • e.g. input data in k-means clustering. • Need a separate controller outside the framework to: – coordinate the multiple MapReduce jobs for each iteration – perform some computations between iterations and at the end – measure and implement stopping criterion
  • 98. Challenges for ML algorithms on Hadoop • Incur multiple task initialization overheads – Setup and tear down mapper and reducer tasks per iteration • Transfer/shuffle static data between mapper and reducer repeatedly – Intermediate data is transferred through index/data files on local disks of mappers and pulled by reducers • Blocking architecture – Reducers cannot start till all map jobs complete • Availability of nodes in a shared environment – Wait for mapper and reducer nodes to become available in each iteration in a shared computing cluster
  • 99. Iterative algorithms in MapReduce • (Diagram: the input data is re-read on each pass and a pass result is written between iterations) • Overhead per iteration: job setup, data loading, disk I/O
  • 100. Enhancements to Hadoop • Many proposals to overcome these challenges • All try to retain the core strengths of data partitioning and fault tolerance of Hadoop to various degrees • Proposed enhancements and alternatives to Hadoop – Worker/Aggregator framework – HaLoop – MapReduce Online – iMapReduce – Spark – Twister – Hadoop ML – …..
  • 101. Worker/Aggregator framework • Worker - Load data in memory - Iterate: › Iterates over data using user specified functions › Communicates state › Waits for input state of next pass • Aggregator – Receive state from the workers – Aggregate state using user specified functions – Send state to all workers • Communicate between workers and aggregators using TCP/IP • Leverage the fault tolerance, and data locality of Hadoop M. Weimer, S. Rao, M. Zinkevich, 2010, NIPS 2010 Workshop on Learning on Cores, Clusters and Clouds
  • 102. Parallelized SGD in Worker/Aggregator • (Diagram: the initial data is loaded once into the workers and the aggregator produces the final result) • Advantages: schedule once per job, data stays in memory, P2P communication
  • 103. HaLoop • Programming model and architecture for iterations – New APIs to express iterations in the framework • Loop-aware task scheduling – Physically co-locate tasks that use the same data in different iterations – Remember association between data and node – Assign task to node that uses data cached in that node • Caching for loop invariant data: – Detect invariants in first iteration, cache on local disk to reduce I/O and shuffling cost in subsequent iterations – Cache for Mapper inputs, Reducer Inputs, Reducer outputs • Caching to support fixpoint evaluation: – Avoids the need for a dedicated MR step on each iteration HaLoop: Efficient Iterative Data Processing on Large Clusters by Yingyi Bu, Bill Howe, Magdalena Balazinska, Michael D. Ernst. In VLDB'10
  • 104. HaLoop vs. MapReduce • (Diagram contrasting where the loop lives: in the application for MapReduce vs. in the framework for HaLoop) • HaLoop framework controls the loop • First iteration is similar to that on Hadoop • Framework identifies data → node mappings, caches and indexes for fast access, and controls looping • Subsequent iterations leverage the above optimizations HaLoop: Efficient Iterative Data Processing on Large Clusters by Yingyi Bu, Bill Howe, Magdalena Balazinska, Michael D. Ernst. In VLDB'10
  • 105. HaLoop Design • (Architecture diagram, annotated: a new, additional API expresses the loop; the framework leverages data locality and caching for fast access, while the master still starts new MR jobs repeatedly) HaLoop: Efficient Iterative Data Processing on Large Clusters by Yingyi Bu, Bill Howe, Magdalena Balazinska, Michael D. Ernst. In VLDB'10
  • 106. HaLoop Programming API • Loop body: Map() & Reduce() – specify a map & reduce function; AddMap() & AddReduce() – specify a step in the loop • Iteration inputs: SetDistanceMeasure() – specify a distance for results; SetInput() – specify inputs to iterations; AddInvariantTable() – specify loop-invariant data • Loop control: SetFixedPointThreshold() – a loop termination condition; SetMaxNumberOfIterations() – specify the max number of iterations • Cache control: SetReducerInputCache() – enable/disable reducer input cache; SetReducerOutputCache() – enable/disable reducer output cache; SetMapperInputCache() – enable/disable mapper input cache HaLoop: Efficient Iterative Data Processing on Large Clusters by Yingyi Bu, Bill Howe, Magdalena Balazinska, Michael D. Ernst. In VLDB'10
  • 107. k-means clustering in HaLoop • k-means in HaLoop 1. Job job = new Job(); 2. job.AddMap(Map_Kmeans,1);  Assign data point to closest cluster 3. job.AddReduce(Reduce_Kmeans,1);  Re-compute centroids 4. job.SetDistanceMeasure(ResultDistance); – # of changes in cluster membership 5. job.SetFixedPointThreshold(0.01); 6. job.SetMaxNumOfIterations(12);  Stopping criteria 7. job.SetInput(IterationInput);  Same input data to each iteration 8. job.SetMapperInputCache(true); – Enable mapper input caching for mappers to read data from local disk node 9. job.Submit();
  • 108. MapReduce Online • Pipeline data between operators as it is produced – Decouple computation and data transfer schedules – Intra-job: • between mapper and reducer – Inter-job: • schedule multiple dependent jobs simultaneously • between reducer of one job and mapper of next job • “Push” data from producers instead of a “pull” by consumers • Intermediate data is considered tentative till map job completes – Also stored on disk for fault tolerance/recovery • Reducer starts as soon as some data is available from mappers – Can compute approximate answers from partial data • Mappers and Reducers can also run continuously – Enables stream processing Mapreduce online, T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, R. Sears, 2010, NSDI'10, Proceedings of the 7th USENIX conference on Networked systems design and implementation
  • 109. iMapReduce • Iterative processing – Persistent map/reduce tasks – Each reduce task has a locally connected corresponding map task • Maintain static data locally – On local disk of mapper • Asynchronous map execution – Persistent socket between reducemap – Completion of reduce triggers map – Mappers do not need to wait iMapReduce: A Distributed Computing Framework for Iterative Computation, Y. Zhang, Q. Gao, L. Gao, C. Wang, DataCloud 2011
  • 111. iMapReduce – Asynchronous map execution • (Timeline diagram contrasting MapReduce, where map tasks wait for all reduces of the previous iteration, with iMapReduce, where a persistent map task starts as soon as its corresponding reduce completes)
  • 112. Spark • Open source cluster computing model: – Different from MapReduce, but retains some basic character • Optimized for: – iterative computations • Applies to many learning algorithms – interactive data mining • Load data once into multiple mappers and run multiple queries • Programming model using working sets – applications reuse intermediate results in multiple parallel operations – preserves the fault tolerance of MapReduce • Supports – Parallel loops over distributed datasets • Loads data into memory for (re)use in multiple iterations – Access to shared variables accessible from multiple machines • Implemented in Scala, • www.spark-project.org Spark: Cluster Computing with Working Sets. M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker, I. Stoica. 2010, USENIX HotCloud 2010.
  • 113. Outline • Why learn models in MapReduce framework? • Types of learning in MapReduce • Statistical Query Model (SQM) • SQM Algorithms in MapReduce • Sequential learning methods and MapReduce • Challenges and Enhancements • Apache Mahout
  • 114. Mahout • Goal – Create scalable, machine learning algorithms under the Apache license. • Scalable: – to large datasets – business use cases – community • Contains both: – Hadoop implementations of algorithms that scale linearly with data. – Fast sequential (non MapReduce) algorithms • Latest release is Mahout 0.5 on 27th May 2011 (circa Aug 4, 2011) • Wiki: – https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+Wiki • Mailing lists – User, Developer, Commit notification lists – https://cwiki.apache.org/confluence/display/MAHOUT/Mailing+Lists
  • 115. Algorithms in Mahout • Classification: – Logistic Regression – Naïve Bayes, Complementary Naïve Bayes – Random Forests • Clustering – K-means, Fuzzy k-means – Canopy – Mean-shift clustering – Dirichlet Process clustering – Latent Dirichlet allocation – Spectral clustering • Parallel FP growth • Item based recommendations • Stochastic Gradient Descent (sequential)
  • 116. Acknowledgment Numerous wonderful colleagues! Questions?
  • 118. Exercise problem • Problem: – Predict the age of abalone as a function of physical attributes – Useful for ecological and commercial fishing purposes • Dataset: – Dataset from the Marine Resources Division at the Department of Primary Industry and Fisheries, Tasmania – Attributes: • Gender, Length, Diameter, Height, 4 different weights – 8 attributes – Target: • Number of Rings in shell • Age (in years) = 1.5 + number of rings in shell – At: http://www.stat.duke.edu/data-sets/rlw/abalone.dat • Learn a linear relation between the age and the physical attributes
  • 119. Exercise dataset • Original data sample size = 4177 • Generate larger dataset by replicating each record – Add Gaussian noise for each feature with the sample variance – Do not add variance for Gender and # of rings – Replicate by factors of: • 10x, 1k x, 8k x, 16k x, 32k x • Datasets of about 40k, 4MM, 32 MM, 64MM and 128 MM records. • For all attributes, compared to the original dataset, the larger datasets have: – same mean – higher sample variance
  • 120. Exercise: Model training • Train a linear regression model: Rings = Σ_{i=0..8} w_i x_i, with x_0 = 1 • w* = A⁻¹ b, where A = Σ (x_i x_i^T) and b = Σ (x_i y_i) • Split the training data into 10 parts • Mapper: – Compute the matrix A and vector b on each partition • Reducer: – Aggregate the values of A and b from all mappers – Compute the weights w* = A⁻¹ b
  • 121. Exercise: Model Results • For replication factor of 10x – w[Sex] = 0.747 – w[Length] = 1.894 – w[Diameter] = 2.844 – w[Height] = 7.213 – w[Whole] = 0.311 – w[Shucked] = -0.558 – w[Viscera] = 0.840 – w[Shell] = 3.288 – w[1] = 5.046
  • 122. Training Times: Sequential vs Hadoop • (Chart: training time in seconds vs. data size in MM records for Hadoop and sequential training; the y-axis runs to about 9000 seconds and the x-axis to about 140 MM records)
  • 123. References 1. M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, Vol. 45, No. 6, November 1998, pp. 983–1006. 2. C. Chu, S.K.Kim, Y. Lin, Y. Yu, G. Bradski, A.Y. Ng, K. Olukotun, Map-Reduce for Machine Learning on Multicore. In Proceedings of NIPS 2006, pp. 281-288. 3. W. Zhao, H. Ma, Q. He. Parallel K-Means Clustering Based on MapReduce. CloudCom '09 Proceedings of the 1st International Conference on Cloud Computing 2009, pp. 674-679 4. R. Ho. http://horicky.blogspot.com/2011/04/k-means- clustering-in-map-reduce.html
  • 124. References 5. Cluster Computing and MapReduce, Lecture 4. http://www.youtube.com/watch?v=1ZDybXl212Q 6. A. McCallum, K. Nigam, L. Ungar. Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, 2000, pp.169-178 7. C. Elkan, 2011. http://cseweb.ucsd.edu/~elkan/250B/logreg.pdf 8. B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo, PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce, 2009, Proceedings of The Vldb Endowment - PVLDB, vol. 2, no. 2, pp. 1426- 1437.
  • 125. References 9. J.S. Herbach, 2009. http://fora.tv/2009/08/12/Josh_Herbach_PLANET_MapR educe_and_Tree_Learning#fullprogram 10. R. Yan, J. Tesic, and J. R. Smith. Model-shared subspace boosting for multi-label classification, 2007, In Proceedings of the 13th ACM SIGKDD Intl. Conf. on Knowledge discovery and data mining, pp 834-843. 11. R. Yan, M. Fleury, M. Merler, A. Natsev, J.R. Smith, 2009, Proceedings of the First ACM workshop on Large- scale multimedia retrieval and mining, pp 35-42 12. J.D Basilico, M.A. Munson, T.G. Kolda, K.R. Dixon, W.P.Kegelmeyer, COMET: A Recipe for Learning and Using Large Ensembles on massive data, 2011, http://arxiv.org/PS_cache/arxiv/pdf/1103/1103.2068v1.pd f
  • 126. References 13. S. Papadimitriou, J. Sun, DisCo: Distributed Co- clustering with Map-Reduce, 2008,ICDM '08. Eighth IEEE International Conference on Data Mining, pp 512-521 14. M.A. Zinkevich, M. Weimer, A. Smola, A., L. Li, Parallelized Stochastic Gradient Descent, 2010, NIPS. 15. T. Elsayed, J. Lin, and D. Oard. Pairwise document similarity in large collections with MapReduce, 2008, In ACL, Companion Volume, pp 265-268, 2008 16. J. Lin, Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce., Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) 2009.
  • 127. References 17. M. Weimer, S. Rao, M. Zinkevich, 2010, NIPS 2010 Workshop on Learning on Cores, Clusters and Clouds 18. HaLoop: Efficient Iterative Data Processing on Large Clusters by Yingyi Bu, Bill Howe, Magdalena Balazinska, Michael D. Ernst. In VLDB'10: The 36th International Conference on Very Large Data Bases, Singapore, 24-30 September, 2010. 19. G. Mann, R. McDonald, M. Mohri, N. Silberman, D. D. Walker, 2009, in Advances in Neural Information Processing Systems 22 (2009), edited by: Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, A. Culotta pp. 1231-1239. 20. R. McDonald, K. Hall, G. Mann, Distributed training strategies for the structured perceptron , 2010, In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (2010), pp. 456-464.
  • 128. References 21. H. Li, Y. Wang, D. Zhang, M. Zhang, E.Y. Chang, 2008, In Proceedings of the 2008 ACM conference on Recommender systems (2008), pp. 107-114. 22. R. Gemulla, P.J. Haas, E. Nijkamp, Y. Sismanis, IBM Tech Report , 2011 http://www.almaden.ibm.com/cs/people/peterh/dsgdTe chRep.pdf 23. Pregel: a system for large-scale graph processing, G. Malewicz, M. H. Austern, A. J.C Bik, J. C. Dehnert, A.H Horn, N. Leiser, G. Czajkowski, 2010, SIGMOD '10 Proceedings of the 2010 international conference on Management of data
  • 129. References 24. Mapreduce online, T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, R. Sears, 2010, NSDI'10, Proceedings of the 7th USENIX conference on Networked systems design and implementation 25. iMapReduce: A Distributed Computing Framework for Iterative Computation, Y. Zhang, Q. Gao, L. Gao, C. Wang, 2011, DataCloud 2011 26. Spark: Cluster Computing with Working Sets. M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker, I. Stoica. 2010, USENIX HotCloud 2010. 27. Mining frequent patterns without candidate generation, J. Han, J. Pei,Y. Yin. 2000, In SIGMOD, 2000.
  • 130. Backup
  • 131. Decision Trees • Features: x  ( x1 , x2 ,...xn ) • Targets: y [0,1] or yR Data: D   x, y  m • • Construct Tree – Each node splits the data by feature value – Start from root • Select best feature, value to split the node – Based on reduction in data impurity between the child and parent nodes – Select the next child node – Repeat the process till some stopping criterion • Pure node, or data is below some threshold etc.
  • 132. Decision Trees Expensive step for Large datasets B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo, PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce, 2009, Proceedings of The Vldb Endowment - PVLDB, vol. 2, no. 2, pp. 1426-1437
  • 133. PLANET for Decision Trees • Parallel Learner for Assembling Numerous Ensemble Trees (PLANET- Panda et al. 2009) – Main idea is to use MapReduce to determine the best feature value splits for nodes from large datasets • Each intermediate node has a sub-set of all data falling into it • If this sub-set is small enough to fit in memory, – Grow remaining sub-tree in memory • Else, – Launch a MapReduce job to find candidate feature value splits – Select the best feature split from among the candidates
  • 134. PLANET for Decision Trees • 5 main components 1. Controller • Monitors and controls the growth of tree 2. Initialization Task (MapReduce task) • Identifies all feature values to be considered for splits 3. FindBestSplit Task (MapReduce task) • Finds best split when there is too much data to fit in memory 4. InMemoryGrow Task (MapReduce task) • Grow an entire sub-tree once the data fits in memory 5. Model File • File describing the state of the model
  • 135. PLANET for Decision Trees • Controller – Determines the state of the tree and grows it • Decides if nodes are pure or have small data to become leaves • Data fits in memory  Launch a MapReduce job to grow the entire sub-tree in memory • Data does not fit in memory  Launch a MapReduce job to find candidate best splits • Collect results from MR jobs and choose the best split for a node • Update the Model File – Periodically checkpoints the system • Model File – Contains the state of the tree constructed so far – Used by the controller to check which nodes to split or grow next
  • 136. PLANET for Decision Trees • Maintain 2 queues – MapReduceQueue (MRQ) • Contains nodes for which data is too large to fit in memory – InMemoryQueue (InMemQ) • Contains nodes for which data fits in memory • Initialization Task (MapReduce) – Identifies candidate attribute values for node splits – Continuous attributes • Compute an approximate equi-depth histogram • Boundary points of histogram used for potential splits – Categorical attributes • Identify attribute's domain • Sort values by average values of Y and use this for ordering – Generate a file with list of attributes to be used by other tasks
  • 137. PLANET for Decision Trees • 2 main MapReduce jobs – MR_ExpandNodes • Process nodes from the MRQ to find best split • Output for each node: – Candidate split positions for node along with » Quality of split (using summary statistics) » Predictions in left and right branches » Size of data going into left and right branches – MR_InMemory • Process nodes from the InMemQ. • For a given set of nodes N, complete tree induction at nodes in N using the InMemoryGrow algorithm.
  • 138. PLANET for Decision Trees • Map function in MR_ExpandNodes – Load the current model file M and set of nodes N – For each record • Determine if record is relevant to any of the nodes in N • Add record to the summary statistics (SS) for that node • For each feature-value in record – Add record to the summary statistics of the node for split points “s” less than the value “v” in the record – Output (Split ID → SS of candidate splits): • key = (n ∈ N, x = ordered feature, s); value = T_{n, x<s} • key = (n ∈ N, x = categorical feature); value = (v, T_{n, x=v}) • key = (n ∈ N); value = SS of the parent node – For variance impurity, SS = (Σ_subgroup y, Σ_subgroup y², Σ_subgroup 1)
  • 139. PLANET for Decision Trees • Reduce function in MR_ExpandNodes – For each node • Aggregate the summary statistics for that node – For each split (which is node specific) • Aggregate the summary statistics for that Split ID from all map outputs of summary statistics • Compute impurity of data going into left and right branches • Total impurity = Impurity in left branch + Impurity in right branch • If Total impurity < Best split impurity so far – Best split = Current split – Output the best split found
  • 140. PLANET for Decision Trees • InMemoryGrow – Task to grow the entire subtree once the data for it fits in memory – Similar to parallel training – Map • Load the current model file • For each record identify the node that needs to be grown, • Output <Node_id, Record> – Reduce • Initialize the feature value file from Initialization task • For each <Node_id, List<Record>> run the basic tree growing algorithm on the records • Output the best split for each node in the subtree