Scalding

Mario Pastorelli (Mario.Pastorelli@eurecom.fr)

                  EURECOM


            September 27, 2012




What is Scalding




   Scalding is a Scala library written on top of Cascading that makes
   it easy to define MapReduce programs




Summary




  Hadoop MapReduce Programming Model



  Cascading



  Scalding




Map and Reduce

  At a high level, a MapReduce job is described by two functions
  operating over lists of key/value pairs.
      Map: a function from an input key/value pair to a list of
      intermediate key/value pairs

            map : (key_input, value_input) → list(key_map, value_map)

      Reduce: a function from an intermediate key and its list of
      values to a list of output key/value pairs

            reduce : (key_map, list(value_map)) → list(key_reduce, value_reduce)
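The two functions can be simulated locally with plain Java collections, with the shuffle between them played by grouping the intermediate pairs by key (a sketch only; `MapReduceSketch` and its methods are hypothetical names, not Hadoop code):

```java
import java.util.*;
import java.util.stream.*;

class MapReduceSketch {

    // map : (key_input, value_input) -> list(key_map, value_map)
    static List<Map.Entry<String, Integer>> mapper(String line) {
        return Arrays.stream(line.split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1))
                     .collect(Collectors.toList());
    }

    // reduce : (key_map, list(value_map)) -> (key_reduce, value_reduce)
    static Map.Entry<String, Integer> reducer(String word, List<Integer> counts) {
        return Map.entry(word, counts.stream().mapToInt(Integer::intValue).sum());
    }

    static Map<String, Integer> run(List<String> input) {
        // map phase: apply the mapper to every input record
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : input) intermediate.addAll(mapper(line));

        // shuffle: collect every intermediate value under its key
        Map<String, List<Integer>> grouped = intermediate.stream()
            .collect(Collectors.groupingBy(Map.Entry::getKey,
                     Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // reduce phase: apply the reducer to every (key, values) group
        Map<String, Integer> out = new HashMap<>();
        grouped.forEach((k, vs) -> {
            Map.Entry<String, Integer> e = reducer(k, vs);
            out.put(e.getKey(), e.getValue());
        });
        return out;
    }
}
```

Running `run(List.of("a b a"))` groups the pairs (a,1), (b,1), (a,1) by key and reduces each group to its sum, exactly the word-count pattern used in the rest of these slides.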
Hadoop Programming Model
      The Hadoop MapReduce programming model lets the programmer
      control every component of the job workflow. The components are
      divided in two phases:

          The Map Phase:
          [Diagram: Data Source → reader → Mapper → Combiner →
           Partitioner → Sorter; e.g. combine(Vm1, Vm5) = Vm6]

          The Reduce Phase:
          [Diagram: Shuffle → Sorter → Grouper → Reducer → writer →
           Data Dest]
Example: Word Count 1/2
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class TokenizerMapper extends Mapper<Object,Text,Text,IntWritable> {

    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, new IntWritable(1));
      }
    }
}

class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context)
          throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values)
        sum += val.get();
      context.write(key, new IntWritable(sum));
    }
}
Example: Word Count 2/2
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = new Job(conf, "word count");
      job.setMapperClass(TokenizerMapper.class);

      job.setCombinerClass(IntSumReducer.class);

      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}


                 Sending the integer 1 for each instance of a word is very
                 inefficient (1 TB of input yields more than 1 TB of
                 intermediate data)
                 Hadoop cannot know whether the reducer is safe to use as a
                 combiner; it must be set manually
Hadoop weaknesses
      The reducer cannot always be used as a combiner; Hadoop
      relies on an explicit combiner specification or on manual partial
      aggregation inside the mapper instance life cycle (in-mapper
      combining)
      Combiners are limited to associative and commutative
      functions (like sum); partial aggregation is more general and
      powerful
      The programming model is limited to the map/reduce phases:
      multi-job programs are often difficult and
      counter-intuitive (think of iterative algorithms like
      PageRank)
      Joins can be difficult, and many techniques must be
      implemented from scratch
      More generally, MapReduce is indeed simple, but many
      optimizations feel more like hacks than natural extensions
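The in-mapper combining mentioned above can be sketched in plain Java (`InMapperCombiner` and its method are hypothetical names, not Hadoop API): the mapper aggregates counts locally and emits one pair per distinct word, instead of one pair per token that a combiner would have to merge afterwards.

```java
import java.util.*;

class InMapperCombiner {

    // Partial aggregation inside the mapper's life cycle: accumulate
    // counts across all records seen by this mapper instance and emit
    // the aggregate once, instead of emitting (word, 1) per token.
    static Map<String, Integer> mapWithLocalAggregation(List<String> lines) {
        Map<String, Integer> acc = new HashMap<>();
        for (String line : lines)
            for (String w : line.split("\\s+"))
                if (!w.isEmpty())
                    acc.merge(w, 1, Integer::sum);  // combine eagerly
        return acc;  // far fewer pairs reach the shuffle
    }
}
```

Unlike a combiner, which Hadoop may run zero or more times, this aggregation is under the programmer's control and is not restricted to associative and commutative reducers.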
Cascading


      Open source project developed @Concurrent
      A Java application framework on top of Hadoop, designed
      to be extensible by providing:
            Processing API: to develop complex data flows
            Integration API: integration testing supported by the framework,
            to avoid putting unstable software in production
            Scheduling API: used to schedule units of work from any
            third-party application
      It replaces the MapReduce programming model with a more
      generic, data-flow-oriented programming model
      Cascading has a data flow optimizer that converts user-defined
      data flows into optimized ones
Cascading Programming Model




      A Cascading program is composed of flows
      A flow is composed of a source tap, a sink tap and the pipes
      that connect them
      A pipe holds a particular transformation over its input data
      flow
      Pipes can be combined to create more complex programs
Example: Word Count

      MapReduce word count concept:
      [Diagram: Data Source → TextLine → Map (tokenize text and emit 1
       for each token) → Shuffle → Reduce (count values and emit the
       result) → TextLine → Data Dest]

      Cascading word count concept:
      [Diagram: TextLine → tokenize each line → group by tokens →
       count values in every group → TextLine]
Example: Word Count
public class WordCount {
  public static void main( String[] args ) {
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), args[0] );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), args[1] );

    RegexSplitGenerator s = new RegexSplitGenerator(
                                          new Fields( "token" ),
                                          "[ \\[\\](),.]" );
    Pipe docPipe = new Each( "token", new Fields( "text" ), s,
                             Fields.RESULTS ); // text -> token

    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, new Fields( "token" ) );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps and pipes to create a flow definition
    FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
                                       .addSource( docPipe, docTap )
                                       .addTailSink( wcPipe, wcTap );

    getFlowConnector().connect( flowDef ).complete();
  }
}
Scalding


      Open source project developed @Twitter
      Two APIs:
           Field Based
               Primary API: stable
               Uses Cascading Fields: dynamic, with errors at runtime
           Type Safe
               Secondary API: experimental
               Uses Scala types: static, with errors at compile time
      The two APIs can work together using pipe.typed and
      TypedPipe.from
      This presentation is about the Type Safe API
Why Scalding

        The high-level idea of MapReduce comes from LISP and works on
        functions (Map/Reduce) and function composition
        Cascading works on objects representing functions and uses
        constructors to compose pipes:
   Pipe wcPipe = new Pipe( "wc", docPipe );
   wcPipe = new GroupBy( wcPipe, new Fields( "token" ) );
   wcPipe = new Every( wcPipe, Fields.ALL, new Count(),
                       Fields.ALL );


        Functional programming can naturally describe data flows:
        every pipe can be seen as a function, and pipes can be
        combined using functional composition. The code above can
        be written as:
   docPipe.groupBy( new Fields( "token" ) )
          .every( Fields.ALL, new Count(), Fields.ALL )
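The same chained style can be tried locally with Java streams, which compose steps the same way a chain of pipes does (a sketch; `ChainedWordCount` is a hypothetical name, unrelated to Cascading or Scalding):

```java
import java.util.*;
import java.util.function.Function;
import java.util.stream.*;

class ChainedWordCount {

    // Each method consumes the previous step's output, mirroring
    // docPipe.groupBy(...).every(...): tokenize, group, count.
    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                    .flatMap(l -> Arrays.stream(l.split("\\s+"))) // tokenize each line
                    .collect(Collectors.groupingBy(               // group by token
                        Function.identity(),
                        Collectors.counting()));                  // count each group
    }
}
```

The data flow reads top to bottom with no intermediate pipe variables, which is exactly the ergonomic gain Scalding brings to Cascading.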
Example: Word Count
class WordCount(args : Args) extends Job(args) {

  /* TextLine reads each line of the given file */
  val input = TypedPipe.from( TextLine( args( "input" ) ) )

  /* tokenize every line and flatten the result into a list of words */
  val words = input.flatMap{ tokenize(_) }

  /* group by word and add a new field, size, that is the group size */
  val wordGroups = words.groupBy{ identity(_) }.size

  /* write each pair (word,count) as a line using TextLine */
  wordGroups.write((0,1), TextLine( args( "output" ) ) )

  /* Split a piece of text into individual words */
  def tokenize(text : String) : Array[String] = {
    // Lowercase each word and remove punctuation.
    text.trim.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "")
                         .split("\\s+")
  }
}
Scalding TypeSafe API


   Two main concepts:
       TypedPipe[T]: class whose instances are distributed
       objects wrapping a Cascading Pipe object and holding the
       transformations applied up to that point. Its interface is similar
       to Scala's Iterator[T] (map, flatMap, groupBy,
       filter, ...)
       KeyedList[K,V]: trait that represents a sharded list of
       items. Two implementations:
           Grouped[K,V]: represents a grouping on keys of type K
           CoGrouped2[K,V,W,Result]: represents a cogroup over
           two grouped pipes; used for joins
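The division of labor between TypedPipe[T] and the keyed abstraction can be mimicked with toy local classes (illustrative stand-ins only, not the Scalding implementations; `LocalPipe` and `LocalGrouped` are invented names):

```java
import java.util.*;
import java.util.function.Function;
import java.util.stream.*;

// Toy stand-in for TypedPipe[T]: wraps a sequence and exposes an
// Iterator-like interface (map, groupBy).
class LocalPipe<T> {
    final List<T> items;
    LocalPipe(List<T> items) { this.items = items; }

    <U> LocalPipe<U> map(Function<T, U> f) {
        return new LocalPipe<>(items.stream().map(f).collect(Collectors.toList()));
    }

    // groupBy hands control to the keyed abstraction, as TypedPipe does
    <K> LocalGrouped<K, T> groupBy(Function<T, K> f) {
        return new LocalGrouped<>(items.stream().collect(Collectors.groupingBy(f)));
    }
}

// Toy stand-in for Grouped[K,V]: values sharded by key.
class LocalGrouped<K, V> {
    final Map<K, List<V>> groups;
    LocalGrouped(Map<K, List<V>> groups) { this.groups = groups; }

    // size: one (key, groupSize) pair per group, back in pipe form
    LocalPipe<Map.Entry<K, Long>> size() {
        return new LocalPipe<>(groups.entrySet().stream()
            .map(e -> Map.entry(e.getKey(), (long) e.getValue().size()))
            .collect(Collectors.toList()));
    }
}
```

With these stand-ins, `new LocalPipe<>(words).groupBy(w -> w).size()` follows the same shape as the `words.groupBy{ identity(_) }.size` step in the Scalding word count.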
Conclusions


      The MapReduce API is powerful but limited
      The Cascading API is as simple as the MapReduce API but more
      generic and powerful
      Scalding combines Cascading and Scala to easily describe
      distributed programs. Its major strengths are:
          Functional programming naturally describes data flows
          Scalding is similar to the Scala standard library: if you know
          Scala, you already know how to use Scalding
          Statically typed (TypeSafe API): no type errors at runtime
          Scala is standard and runs on top of the JVM
          Scala libraries and tools can be used in production: IDEs,
          debuggers, test frameworks, build systems and everything else
Thank you for listening





Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 

Scalding: A Scala Library for Defining MapReduce Programs

  • 2. What is Scalding Scalding is a Scala library written on top of Cascading that makes it easy to define MapReduce programs 2/21
  • 3. Summary Hadoop MapReduce Programming Model Cascading Scalding 3/21
  • 4. Summary Hadoop MapReduce Programming Model Cascading Scalding 4/21
  • 5. Map and Reduce At high level, a MapReduce Job is described with two functions operating over lists of key/value pairs. 5/21
  • 6. Map and Reduce At high level, a MapReduce Job is described with two functions operating over lists of key/value pairs. Map: a function from an input key/value pair to a list of intermediate key/value pairs map : (keyinput , valueinput ) → list(keymap , valuemap ) 5/21
  • 7. Map and Reduce At a high level, a MapReduce job is described with two functions operating over lists of key/value pairs. Map: a function from an input key/value pair to a list of intermediate key/value pairs map : (keyinput , valueinput ) → list(keymap , valuemap ) Reduce: a function from an intermediate key and its list of values to a list of output key/value pairs reduce : (keymap , list(valuemap )) → list(keyreduce , valuereduce ) 5/21
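The two signatures above can be sketched as plain Scala functions, with a local simulation of the shuffle between the phases. This is an illustration of the model only, not the Hadoop API; the function names mirror the slide's notation.

```scala
// Word count expressed with the two abstract signatures above.
// Plain Scala functions, not the Hadoop API.
def map(key: Long, value: String): List[(String, Int)] =
  value.split("\\s+").filter(_.nonEmpty).toList.map(w => (w, 1))

def reduce(key: String, values: List[Int]): List[(String, Int)] =
  List((key, values.sum))

// Local stand-in for the framework: run map over all records,
// group the intermediate pairs by key (the "shuffle"), then reduce.
def run(lines: List[(Long, String)]): Map[String, Int] =
  lines.flatMap { case (k, v) => map(k, v) }
    .groupBy(_._1)
    .map { case (k, kvs) => reduce(k, kvs.map(_._2)).head }
```

Running `run(List((0L, "a b a")))` groups the three `(word, 1)` pairs by word and sums each group.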
  • 8. Hadoop Programming Model The Hadoop MapReduce programming model allows control over all the job workflow components. Job components are divided into two phases: 6/21
  • 9. Hadoop Programming Model The Hadoop MapReduce programming model allows control over all the job workflow components. Job components are divided into two phases:
    The Map Phase: [diagram] input data source → record reader → mapper → combiner → partitioner → sorter, producing partitioned, sorted intermediate key/value pairs; e.g. combine(Vm1, Vm5) = Vm6.
    6/21
  • 10. Hadoop Programming Model The Hadoop MapReduce programming model allows control over all the job workflow components. Job components are divided into two phases:
    The Map Phase: [diagram] input data source → record reader → mapper → combiner → partitioner → sorter, producing partitioned, sorted intermediate key/value pairs; e.g. combine(Vm1, Vm5) = Vm6.
    The Reduce Phase: [diagram] shuffle → sorter → grouper → reducer → output writer → data destination.
    6/21
  • 11. Example: Word Count 1/2

    class TokenizerMapper extends Mapper<Object,Text,Text,IntWritable> {

      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          context.write(new Text(itr.nextToken()), new IntWritable(1));
        }
      }
    }

    class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {

      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values)
          sum += val.get();
        context.write(key, new IntWritable(sum));
      }
    }
    7/21
  • 12. Example: Word Count 2/2

    public class WordCount {

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }
    8/21
  • 13. Example: Word Count 2/2

    public class WordCount {

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

    Sending the integer 1 for each instance of a word is very inefficient (1 TB of input yields more than 1 TB of intermediate data). Hadoop doesn't know whether the reducer can be used as a combiner: it must be set manually. 8/21
  • 14. Hadoop weaknesses
    The reducer cannot always be used as a combiner; Hadoop relies on an explicit combiner specification or on manual partial aggregation inside the mapper instance life cycle (in-mapper combining).
    Combiners are limited to associative and commutative functions (like sum); partial aggregation is more general and powerful.
    The programming model is limited to the map/reduce phases: multi-job programs are often difficult and counter-intuitive (think of iterative algorithms like PageRank).
    Joins can be difficult; many techniques must be implemented from scratch.
    More generally, MapReduce is indeed simple, but many optimizations feel more like hacks than natural solutions.
    9/21
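The in-mapper combining mentioned above can be sketched in plain Scala: instead of emitting a `(word, 1)` pair per token, the mapper keeps a partial aggregate across all its `map` calls and emits it once at the end. This is a simplified local illustration, not the Hadoop `Mapper` API; the class and method names are hypothetical.

```scala
import scala.collection.mutable

// Simplified in-mapper combining: accumulate partial counts across all
// map() calls and emit them once when the mapper finishes, instead of
// one (word, 1) pair per token.
class InMapperCombiningMapper {
  private val partial = mutable.Map.empty[String, Int].withDefaultValue(0)

  // Called once per input record: update the partial aggregate only.
  def map(line: String): Unit =
    line.split("\\s+").filter(_.nonEmpty).foreach { w => partial(w) += 1 }

  // Analogous to Hadoop's cleanup(): emit the aggregated pairs once.
  def flush(): List[(String, Int)] = partial.toList
}
```

For a mapper that sees the lines "a b a" and "b c", `flush()` emits three pairs instead of five, cutting the data sent through the shuffle.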
  • 15. Summary Hadoop MapReduce Programming Model Cascading Scalding 10/21
  • 16. Cascading
    Open source project developed @Concurrent.
    It is a Java application framework on top of Hadoop, designed to be extensible by providing:
    Processing API: to develop complex data flows.
    Integration API: integration testing supported by the framework, to avoid putting unstable software into production.
    Scheduling API: used to schedule units of work from any third-party application.
    It changes the MapReduce programming model into a more generic, data-flow-oriented programming model.
    Cascading has a data flow optimizer that converts user data flows into optimized data flows.
    11/21
  • 17. Cascading Programming Model A Cascading program is composed of flows. A flow is composed of a source tap, a sink tap and pipes that connect them. A pipe holds a particular transformation over its input data flow. Pipes can be combined to create more complex programs. 12/21
  • 18. Example: Word Count
    MapReduce word count concept: [diagram] TextLine source → Map (tokenize text and emit 1 for each token) → shuffle → Reduce (count values and emit the result) → TextLine sink.
    Cascading word count concept: [diagram] TextLine → tokenize each line → group by tokens → count values in every group → TextLine.
    13/21
  • 19. Example: Word Count

    public class WordCount {
      public static void main( String[] args ) {
        Tap docTap = new Hfs( new TextDelimited( true, "\t" ), args[0] );
        Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), args[1] );

        RegexSplitGenerator s = new RegexSplitGenerator(
            new Fields( "token" ),
            "[ \\[\\]\\(\\),.]" );
        Pipe docPipe = new Each( "token", new Fields( "text" ), s,
            Fields.RESULTS ); // text -> token

        Pipe wcPipe = new Pipe( "wc", docPipe );
        wcPipe = new GroupBy( wcPipe, new Fields( "token" ) );
        wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

        // connect the taps and pipes to create a flow definition
        FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
            .addSource( docPipe, docTap )
            .addTailSink( wcPipe, wcTap );

        getFlowConnector().connect( flowDef ).complete();
      }
    }
    14/21
  • 20. Summary Hadoop MapReduce Programming Model Cascading Scalding 15/21
  • 21. Scalding Open source project developed @Twitter 16/21
  • 22. Scalding Open source project developed @Twitter. Two APIs: Field Based Primary API: stable; uses Cascading Fields: dynamic, with errors at runtime. Type Safe Secondary API: experimental; uses Scala types: static, with errors at compile time. 16/21
  • 23. Scalding Open source project developed @Twitter. Two APIs: Field Based Primary API: stable; uses Cascading Fields: dynamic, with errors at runtime. Type Safe Secondary API: experimental; uses Scala types: static, with errors at compile time. The two APIs can work together using pipe.typed and TypedPipe.from. 16/21
  • 24. Scalding Open source project developed @Twitter. Two APIs: Field Based Primary API: stable; uses Cascading Fields: dynamic, with errors at runtime. Type Safe Secondary API: experimental; uses Scala types: static, with errors at compile time. The two APIs can work together using pipe.typed and TypedPipe.from. This presentation is about the TypeSafe API. 16/21
  • 25. Why Scalding 17/21
  • 26. Why Scalding MapReduce's high-level idea comes from LISP and works on functions (Map/Reduce) and function composition. 17/21
  • 27. Why Scalding MapReduce's high-level idea comes from LISP and works on functions (Map/Reduce) and function composition. Cascading works on objects representing functions and uses constructors to compose pipes:

    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, new Fields( "token" ) );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(),
        Fields.ALL );
    17/21
  • 28. Why Scalding MapReduce's high-level idea comes from LISP and works on functions (Map/Reduce) and function composition. Cascading works on objects representing functions and uses constructors to compose pipes:

    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, new Fields( "token" ) );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(),
        Fields.ALL );

    Functional programming can naturally describe data flows: every pipe can be seen as a function over its input, and pipes can be combined using function composition. The code above can be written as:

    docPipe.groupBy( new Fields( "token" ) )
        .every(Fields.ALL, new Count(), Fields.ALL)
    17/21
  • 29. Example: Word Count

    class WordCount(args : Args) extends Job(args) {

      /* TextLine reads each line of the given file */
      val input = TypedPipe.from( TextLine( args( "input" ) ) )

      /* tokenize every line and flatten the result into a list of words */
      val words = input.flatMap{ tokenize(_) }

      /* group by word and add a new field "size" holding the group size */
      val wordGroups = words.groupBy{ identity(_) }.size

      /* write each (word, count) pair as a line using TextLine */
      wordGroups.write((0,1), TextLine( args( "output" ) ) )

      /* Split a piece of text into individual words */
      def tokenize(text : String) : Array[String] = {
        // Lowercase each word and remove punctuation.
        text.trim.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "")
            .split("\\s+")
      }
    }
    18/21
  • 30. Scalding TypeSafe API Two main concepts: 19/21
  • 31. Scalding TypeSafe API Two main concepts: TypedPipe[T]: class whose instances are distributed objects that wrap a Cascading Pipe object and hold the transformations done up to that point. Its interface is similar to Scala's Iterator[T] (map, flatMap, groupBy, filter, ...). 19/21
  • 32. Scalding TypeSafe API Two main concepts:
    TypedPipe[T]: class whose instances are distributed objects that wrap a Cascading Pipe object and hold the transformations done up to that point. Its interface is similar to Scala's Iterator[T] (map, flatMap, groupBy, filter, ...).
    KeyedList[K,V]: trait that represents a sharded list of items. Two implementations: Grouped[K,V] represents a grouping on keys of type K; CoGrouped2[K,V,W,Result] represents a cogroup over two grouped pipes, used for joins.
    19/21
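Because TypedPipe's interface mirrors the Scala collections, the logic of a Scalding flow can be prototyped on a plain List and the same chain of calls later applied to a distributed pipe. The sketch below exercises the word-count transformations on a List; it is a local illustration, not Scalding itself.

```scala
// The TypedPipe interface mirrors Scala collections, so the word-count
// transformations can be exercised on a plain List first. In Scalding,
// TypedPipe.from(...) would accept an equivalent chain of calls.
def wordCount(lines: List[String]): Map[String, Int] =
  lines
    .flatMap(_.trim.toLowerCase.split("\\s+"))  // tokenize every line
    .filter(_.nonEmpty)                          // drop empty tokens
    .groupBy(identity)                           // group equal words
    .map { case (w, ws) => (w, ws.size) }        // group size = count
```

This locality is one of the practical benefits of the TypeSafe API: the transformation chain type-checks the same way whether it runs on a collection in a test or on a cluster.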
  • 33. Conclusions MapReduce API is powerful but limited 20/21
  • 34. Conclusions MapReduce API is powerful but limited Cascading API is as simple as the MapReduce API but more generic and powerful 20/21
  • 35. Conclusions The MapReduce API is powerful but limited. The Cascading API is as simple as the MapReduce API but more generic and powerful. Scalding combines Cascading and Scala to easily describe distributed programs. Its major strengths are:
    Functional programming naturally describes data flows.
    Scalding's API is similar to the Scala collections library: if you know Scala, you already know how to use Scalding.
    Statically typed (TypeSafe API): no type errors at runtime.
    Scala is standard and runs on top of the JVM.
    Scala libraries and tools can be used in production: IDEs, debuggers, test frameworks, build systems and everything else.
    20/21
  • 36. Thank you for listening 21/21