
Scalding: A Scala Library for Defining MapReduce Programs

Mario Pastorelli (


            September 27, 2012

What is Scalding

   Scalding is a Scala library written on top of Cascading that makes
   it easy to define MapReduce programs


  Hadoop MapReduce Programming Model




Map and Reduce

  At a high level, a MapReduce job is described by two functions
  operating over lists of key/value pairs.
      Map: a function from an input key/value pair to a list of
      intermediate key/value pairs

            map : (key_input, value_input) → list(key_map, value_map)
      Reduce: a function from an intermediate key and the list of
      its values to a list of output key/value pairs

       reduce : (key_map, list(value_map)) → list(key_reduce, value_reduce)
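
The two signatures above can be sketched in plain Scala. This is a minimal single-machine illustration, not Hadoop code: the names MapReduceSketch, mapFn, reduceFn and run are hypothetical, and the input key is assumed to be a line offset.

```scala
object MapReduceSketch {
  // map : (key_input, value_input) → list(key_map, value_map)
  // Assumption: the input key is a line offset, the input value is the line text.
  def mapFn(offset: Long, line: String): List[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toList

  // reduce : (key_map, list(value_map)) → list(key_reduce, value_reduce)
  def reduceFn(word: String, counts: List[Int]): List[(String, Int)] =
    List((word, counts.sum))

  // Hypothetical local driver: apply map to every record, group the
  // intermediate pairs by key, then apply reduce to each group.
  def run(input: List[(Long, String)]): Map[String, Int] = {
    val intermediate = input.flatMap { case (k, v) => mapFn(k, v) }
    val grouped = intermediate.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    grouped.toList.flatMap { case (k, vs) => reduceFn(k, vs) }.toMap
  }
}
```

Calling MapReduceSketch.run on a list of (offset, line) pairs returns the word counts as a Map; the grouping step in the middle plays the role that the framework performs between the two user functions.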

Hadoop Programming Model
     The Hadoop MapReduce programming model gives control over every
     component of the job workflow. Job components are divided into
     two phases.

          The Map Phase:
          Data Source → Input reader → Mapper → Combiner → Partitioner → Sorter
          (diagram: input pairs (Ki, Vi) become intermediate pairs (Km, Vm),
          which the partitioner assigns to partitions P1 and P2)

          The Reduce Phase:
          Shuffle → Sorter → Grouper → Reducer → Output writer → Data Dest
          (diagram: intermediate pairs are sorted and grouped by key into
          groups G1 and G2, and the reducer emits output pairs (Kr, Vr))
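
To make the order of the components concrete, here is a rough in-memory sketch of the two phases for word count. It is an assumption-laden model, not how Hadoop is implemented: PhasesSketch and its methods are hypothetical names, each input split is a list of lines, and partitioning is done by key hash.

```scala
object PhasesSketch {
  type KV = (String, Int)

  // Map phase for one input split: mapper, combiner, partitioner, sorter.
  def mapPhase(split: List[String], numPartitions: Int): Map[Int, List[KV]] = {
    val mapped: List[KV] =
      split.flatMap(line => line.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toList)
    // Combiner: pre-aggregate values that share a key within this mapper's output.
    val combined: List[KV] =
      mapped.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).sum) }.toList
    // Partitioner: choose a partition by key hash. Sorter: order each partition by key.
    combined
      .groupBy { case (k, _) => math.abs(k.hashCode % numPartitions) }
      .map { case (p, kvs) => (p, kvs.sortBy(_._1)) }
  }

  // Reduce phase for one partition: sort the fetched pairs, group by key, reduce.
  def reducePhase(fetched: List[KV]): List[KV] =
    fetched.sortBy(_._1)
      .groupBy(_._1)
      .toList
      .map { case (k, kvs) => (k, kvs.map(_._2).sum) }
      .sortBy(_._1)

  // Shuffle: every reducer fetches its partition from every mapper's output.
  def run(splits: List[List[String]], numPartitions: Int): List[KV] = {
    val mapOutputs = splits.map(mapPhase(_, numPartitions))
    (0 until numPartitions).toList.flatMap { p =>
      reducePhase(mapOutputs.flatMap(_.getOrElse(p, Nil)))
    }.sortBy(_._1)
  }
}
```

The final counts do not depend on the number of partitions; only the assignment of keys to reducers changes.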

Example: Word Count 1/2
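    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Emits (token, 1) for every whitespace-separated token of a line.
    class TokenizerMapper extends Mapper<Object,Text,Text,IntWritable>{
        // Reusable buffer for the current token (not shown on the slide).
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
              throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(new Text(word), new IntWritable(1));
          }
        }
    }

    // Sums the counts received for each word.
    class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable>{
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
                           ) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values)
            sum += val.get();
          context.write(key, new IntWritable(sum));
        }
    }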

Example: Word Count 2/2
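    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Constructor as on the slide; newer Hadoop releases prefer Job.getInstance.
          Job job = new Job(conf, "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenizerMapper.class);
          // The reducer doubles as combiner; this has to be set explicitly.
          job.setCombinerClass(IntSumReducer.class);
          job.setReducerClass(IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

                 Sending the integer 1 for each instance of a word is very
                 inefficient (1 TB of input yields more than 1 TB of
                 intermediate data).
                 Hadoop cannot know whether the reducer can be used as a
                 combiner, so the combiner must be set manually.
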
Hadoop weaknesses
      The reducer cannot always be used as a combiner. Hadoop
      relies on an explicit combiner specification or on manual
      partial aggregation inside the mapper instance life cycle
      (in-mapper combining).
      Combiners are limited to associative and commutative
      functions (like sum). Partial aggregation is more general.
      The programming model is limited to the map and reduce
      phases. Multi-job programs are often difficult and
      counter-intuitive (think about iterative algorithms).
      Joins can be difficult: many techniques must be
      implemented from scratch.
      More generally, MapReduce is simple, but many
      optimizations feel like hacks rather than natural parts of
      the model.
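
The combiner restriction is easy to see with a small example. The sketch below is plain Scala with hypothetical names (CombinerSketch, partial, merge, finish); it compares a sum, which can be combined per mapper, with a mean, which cannot unless partial (sum, count) pairs are carried instead.

```scala
object CombinerSketch {
  // Sum is associative and commutative, so the reducer can double as a combiner:
  // summing per-mapper partial sums equals summing all values at once.
  def sum(values: List[Int]): Int = values.sum

  // Mean is not: the mean of per-mapper means is generally not the global mean.
  def mean(values: List[Int]): Double = values.sum.toDouble / values.size

  // Partial aggregation that does work for the mean: carry (sum, count) pairs,
  // which merge associatively, and divide only at the very end.
  def partial(values: List[Int]): (Int, Int) = (values.sum, values.size)
  def merge(a: (Int, Int), b: (Int, Int)): (Int, Int) = (a._1 + b._1, a._2 + b._2)
  def finish(p: (Int, Int)): Double = p._1.toDouble / p._2
}
```

With mapper outputs List(1, 2, 3) and List(10), averaging the two per-mapper means gives 6.0, while the true mean of the four values is 4.0; merging the partial pairs (6, 3) and (10, 1) and dividing at the end gives 4.0.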


Cascading




      Open source project developed at Concurrent.
      It is a Java application framework on top of Hadoop, designed
      to be extensible by providing:
            Processing API: to develop complex data flows.
            Integration API: integration tests supported by the
            framework, to avoid putting unstable software into production.
            Scheduling API: used to schedule units of work from any
            third-party application.
      It replaces the MapReduce programming model with a more
      generic, data-flow-oriented programming model.
      Cascading has a data flow optimizer that converts user data
      flows into optimized data flows.

Cascading Programming Model

      A Cascading program is composed of flows.
      A flow is composed of a source tap, a sink tap, and the pipes
      that connect them.
      A pipe holds a particular transformation over its input data.
      Pipes can be combined to create more complex programs.
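
The vocabulary above (taps, pipes, flows) can be modeled in a few lines of Scala. This is only a conceptual sketch under simplifying assumptions: the types SourceTap, SinkTap, Pipe and Flow below are hypothetical stand-ins that work on in-memory lists, not the Cascading classes of the same names.

```scala
object FlowSketch {
  // Hypothetical stand-ins for the concepts above; NOT the Cascading classes.
  type Pipe[A, B] = List[A] => List[B] // a transformation over its input data
  final case class SourceTap[A](read: () => List[A])
  final case class SinkTap[B](write: List[B] => Unit)

  // A flow connects a source tap to a sink tap through a pipe.
  final case class Flow[A, B](source: SourceTap[A], pipe: Pipe[A, B], sink: SinkTap[B]) {
    def complete(): Unit = sink.write(pipe(source.read()))
  }

  // Two pipes combine into a more complex pipe.
  def combine[A, B, C](first: Pipe[A, B], second: Pipe[B, C]): Pipe[A, C] =
    in => second(first(in))

  val tokenize: Pipe[String, String] =
    lines => lines.flatMap(line => line.split("\\s+").filter(_.nonEmpty).toList)
  val count: Pipe[String, (String, Int)] =
    tokens => tokens.groupBy(identity).map { case (t, ts) => (t, ts.size) }.toList.sortBy(_._1)
  val wordCount: Pipe[String, (String, Int)] = combine(tokenize, count)
}
```

A flow is then built from a source, the combined pipe and a sink, and nothing runs until complete() is called, which mirrors the way a flow definition is declared first and executed afterwards.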

Example: Word Count

           MapReduce word count concept:

         Data Source → TextLine (Ki, Vi) → Map (tokenize text and emit 1 for each token) → Reduce (count values and emit the result) → (Kr, Vr) TextLine → Data Dest

           Cascading word count concept:

         tokenize each line → group by tokens → count values in every group



Example: Word Count
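    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class WordCount {
      public static void main( String[] args ) {
        Tap docTap = new Hfs( new TextDelimited( true, "\t" ), args[0] );
        Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), args[1] );
        RegexSplitGenerator s = new RegexSplitGenerator(
                                              new Fields("token"),
                                              "[ \\[\\]\\(\\),.]" );
        Pipe docPipe = new Each( "token", new Fields( "text" ), s,
                                  Fields.RESULTS ); // text -> token
        Pipe wcPipe = new Pipe( "wc", docPipe );
        wcPipe = new GroupBy( wcPipe, new Fields( "token" ) );
        wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
        // connect the taps and pipes to create a flow definition
        FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
                                           .addSource( docPipe, docTap )
                                           .addTailSink( wcPipe, wcTap );
        // the slide calls a getFlowConnector() helper that is not shown;
        // HadoopFlowConnector is assumed here
        new HadoopFlowConnector().connect( flowDef ).complete();
      }
    }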


Scalding




      Open source project developed at Twitter
      Two APIs:
           Field Based
               Primary API: stable
               Uses Cascading Fields: dynamic, with errors at runtime
           Type Safe
               Secondary API: experimental
               Uses Scala types: static, with errors at compile time
      The two APIs can work together (for example, via pipe.typed)
      This presentation is about the TypeSafe API

Why Scalding

        The high-level idea of MapReduce comes from LISP: it works
        with functions (map and reduce) and function composition.
        Cascading works with objects that represent functions and
        uses constructors to compose pipes:
        Pipe wcPipe = new Pipe( "wc", docPipe );
        wcPipe = new GroupBy( wcPipe, new Fields( "token" ) );
        wcPipe = new Every( wcPipe, Fields.ALL, new Count(),
                            Fields.ALL );

        Functional programming can describe data flows naturally:
        every pipe can be seen as a function, and pipes can be
        combined using function composition. In that style, the code
        above could be written as:
        docPipe.groupBy( new Fields( "token" ) )
               .every(Fields.ALL, new Count(), Fields.ALL)
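
The same contrast can be shown with ordinary Scala functions on local collections. This is a sketch of the idea only; CompositionSketch and its members are hypothetical names and nothing here touches Cascading or Hadoop.

```scala
object CompositionSketch {
  type Tokens = List[String]
  type Groups = Map[String, List[String]]

  // Each step is an ordinary function...
  val groupByToken: Tokens => Groups = tokens => tokens.groupBy(identity)
  val countEach: Groups => Map[String, Int] =
    groups => groups.map { case (t, ts) => (t, ts.size) }

  // ...so steps are combined with function composition rather than nested constructors.
  val wordCount: Tokens => Map[String, Int] = groupByToken andThen countEach

  // The same pipeline written as a method chain.
  def wordCountChained(tokens: Tokens): Map[String, Int] =
    tokens.groupBy(identity).map { case (t, ts) => (t, ts.size) }
}
```

Both forms describe the data flow from left to right, in the order the data moves, which is the readability argument made above.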

Example: Word Count
    import com.twitter.scalding._

    class WordCount(args : Args) extends Job(args) {
      /* TextLine reads each line of the given file */
      val input = TypedPipe.from( TextLine( args( "input" ) ) )

      /* tokenize every line and flatten the result into a list of words */
      val words = input.flatMap{ tokenize(_) }

      /* group by word and compute the size of each group */
      val wordGroups = words.groupBy{ identity(_) }.size

      /* write each pair (word,count) as a line using TextLine */
      wordGroups.write((0,1), TextLine( args( "output" ) ) )

      /* Split a piece of text into individual words */
      def tokenize(text : String) : Array[String] = {
        // Lowercase each word and remove punctuation.
        text.trim.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "")
                             .split("\\s+")
      }
    }

Scalding TypeSafe API

   Two main concepts:
       TypedPipe[T]: a class whose instances are distributed
       objects that wrap a Cascading Pipe object and hold the
       transformations applied up to that point. Its interface is
       similar to Scala's Iterator[T] (map, flatMap, groupBy,
       filter, ...)
       KeyedList[K,V]: a trait that represents a sharded list of
       items. Two implementations:
           Grouped[K,V]: represents a grouping on keys of type K
           CoGrouped2[K,V,W,Result]: represents a cogroup over
           two grouped pipes. Used for joins
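
Since the interface is described as similar to Scala's own collections, a local analogue may help build intuition. The sketch below uses only the Scala standard library; TypedSketch, longWords, grouped and join are hypothetical names, and the code is single-machine, so it illustrates the shape of the operations rather than Scalding's distributed behavior.

```scala
object TypedSketch {
  // Collection-style operations (flatMap, filter, map) on a plain Scala Iterator.
  def longWords(lines: Iterator[String]): List[String] =
    lines.flatMap(line => line.split("\\s+").toList)
      .filter(_.length > 3)
      .map(_.toLowerCase)
      .toList

  // Local analogue of a grouping on keys of type K.
  def grouped[K, V](items: List[(K, V)]): Map[K, List[V]] =
    items.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

  // Local analogue of a cogroup over two groupings: an inner join on the key.
  def join[K, V, W](left: Map[K, List[V]], right: Map[K, List[W]]): Map[K, List[(V, W)]] =
    left.collect {
      case (k, vs) if right.contains(k) =>
        k -> (for (v <- vs; w <- right(k)) yield (v, w))
    }
}
```

Keys present on only one side are dropped by join, which is the inner-join behavior; other join types would keep them with an empty or optional counterpart.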


      The MapReduce API is powerful but limited.
      The Cascading API is as simple as the MapReduce API but more
      generic and powerful.
      Scalding combines Cascading and Scala to describe distributed
      programs easily. Its main strengths are:
          Functional programming describes data flows naturally.
          Scalding is similar to the Scala standard library: if you
          know Scala, you already know how to use Scalding.
          Statically typed (TypeSafe API): type errors are caught at
          compile time rather than at runtime.
          Scala is a standard language that runs on top of the JVM.
          Scala libraries and tools can be used in production: IDEs,
          debuggers, test systems, build systems and everything else.

Thank you for listening


Streamlining Python Development: A Guide to a Modern Project Setup
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn

Scalding: A Scala Library for Defining MapReduce Programs

  • 2. What is Scalding
       Scalding is a Scala library written on top of Cascading that makes it easy to define MapReduce programs.
  • 3. Summary
       - Hadoop MapReduce Programming Model
       - Cascading
       - Scalding
  • 7. Map and Reduce
       At a high level, a MapReduce job is described by two functions operating over lists of key/value pairs.
       - Map: a function from an input key/value pair to a list of intermediate key/value pairs
             $map : (key_{input}, value_{input}) \rightarrow list(key_{map}, value_{map})$
       - Reduce: a function from an intermediate key and the list of its values to a list of output key/value pairs
             $reduce : (key_{map}, list(value_{map})) \rightarrow list(key_{reduce}, value_{reduce})$
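The two signatures above can be sketched in plain Java, without Hadoop, as a pair of functional interfaces plus a tiny in-memory driver that plays the role of the shuffle. This is only an illustration under stated assumptions: `MapReduceSketch`, `MapFn`, `ReduceFn`, `run` and `wordCount` are hypothetical names, not part of any Hadoop API, and the input key is assumed to be a line offset.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;

/** In-memory illustration of the map/reduce signatures; hypothetical names, not a Hadoop API. */
final class MapReduceSketch {

    /** map : (keyInput, valueInput) -> list(keyMap, valueMap) */
    interface MapFn<KI, VI, KM, VM> {
        List<Entry<KM, VM>> map(KI key, VI value);
    }

    /** reduce : (keyMap, list(valueMap)) -> list(keyReduce, valueReduce) */
    interface ReduceFn<KM, VM, KR, VR> {
        List<Entry<KR, VR>> reduce(KM key, List<VM> values);
    }

    /** Applies map to every input pair, groups by intermediate key, then reduces each group. */
    static <KI, VI, KM extends Comparable<KM>, VM, KR, VR> List<Entry<KR, VR>> run(
            List<Entry<KI, VI>> input,
            MapFn<KI, VI, KM, VM> mapFn,
            ReduceFn<KM, VM, KR, VR> reduceFn) {
        // Stand-in for the shuffle: collect the values of each intermediate key, sorted by key.
        Map<KM, List<VM>> groups = new TreeMap<>();
        for (Entry<KI, VI> in : input) {
            for (Entry<KM, VM> out : mapFn.map(in.getKey(), in.getValue())) {
                groups.computeIfAbsent(out.getKey(), k -> new ArrayList<>()).add(out.getValue());
            }
        }
        List<Entry<KR, VR>> result = new ArrayList<>();
        for (Entry<KM, List<VM>> group : groups.entrySet()) {
            result.addAll(reduceFn.reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    /** Word count written with the two functions: input key = line offset, input value = line. */
    static List<Entry<String, Integer>> wordCount(List<Entry<Long, String>> lines) {
        MapFn<Long, String, String, Integer> tokenize = (offset, line) -> {
            List<Entry<String, Integer>> out = new ArrayList<>();
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) out.add(Map.entry(token, 1)); // emit (token, 1)
            }
            return out;
        };
        ReduceFn<String, Integer, String, Integer> sum = (word, ones) -> {
            int total = 0;
            for (int one : ones) total += one;
            List<Entry<String, Integer>> out = new ArrayList<>();
            out.add(Map.entry(word, total)); // emit (word, count)
            return out;
        };
        return run(lines, tokenize, sum);
    }
}
```

Calling `MapReduceSketch.wordCount` on a list of (offset, line) pairs returns one (word, count) pair per distinct word, ordered by word because the stand-in shuffle uses a sorted map.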
  • 10. Hadoop Programming Model
       The Hadoop MapReduce programming model gives control over every component of the job workflow. Job components are divided into two phases:
       - The Map Phase: Data Source → Input Reader (Ki, Vi) → Mapper (Km, Vm) → Combiner → Partitioner (P1, P2) → Sorter. The combiner merges values that share a key, e.g. combine(Vm1, Vm5) = Vm6.
       - The Reduce Phase: Shuffle → Sorter → Grouper (G1, G2) → Reducer (Kr, Vr) → Output Writer → Data Dest.
       [Figure: key/value pairs flowing through the components of the two phases]
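The combiner, partitioner and sorter steps of the map phase can be sketched in memory as follows. This is a simplified illustration, not Hadoop code: `MapPhaseSketch` and its methods are hypothetical names, keys are assumed to be strings and values integers, and the partition formula mirrors the usual hash-partitioning rule (non-negative hash code modulo the number of partitions).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

/** In-memory illustration of the combiner, partitioner and sorter steps; hypothetical names. */
final class MapPhaseSketch {

    /** Combiner: merges values that share a key on the map side, e.g. combine(Vm1, Vm5) = Vm6. */
    static Map<String, Integer> combine(
            List<Map.Entry<String, Integer>> mapOutput, BinaryOperator<Integer> combineFn) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            combined.merge(pair.getKey(), pair.getValue(), combineFn);
        }
        return combined;
    }

    /** Partitioner: chooses the partition (and so the reducer) that receives a key. */
    static int partition(String key, int numPartitions) {
        // Mask the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Combines, then partitions; each partition is a TreeMap, so it is sorted by key. */
    static List<Map<String, Integer>> mapSide(
            List<Map.Entry<String, Integer>> mapOutput, int numPartitions) {
        List<Map<String, Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new TreeMap<>());
        }
        combine(mapOutput, Integer::sum).forEach(
            (key, value) -> partitions.get(partition(key, numPartitions)).put(key, value));
        return partitions;
    }
}
```

Because the combiner runs before the data leaves the mapper, every key crosses the network at most once per mapper, which is the saving the diagram illustrates with Vm1 and Vm5 collapsing into Vm6.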
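  • 11. Example: Word Count 1/2

        import java.io.IOException;
        import java.util.StringTokenizer;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

          // Reusable output key; the slide uses `word` without declaring it.
          private final Text word = new Text();

          @Override
          public void map(Object key, Text value, Context context)
              throws IOException, InterruptedException {
            // Emit (token, 1) for every whitespace-separated token of the line.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
              word.set(itr.nextToken());
              context.write(word, new IntWritable(1));
            }
          }
        }

        class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

          @Override
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
            // Sum all the counts received for this word.
            int sum = 0;
            for (IntWritable val : values)
              sum += val.get();
            context.write(key, new IntWritable(sum));
          }
        }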
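  • 13. Example: Word Count 2/2

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCount {

          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // not declared on the slide
            Job job = new Job(conf, "word count");
            job.setMapperClass(TokenizerMapper.class);
            // The combiner has to be set by hand; Hadoop does not infer it.
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
        }

       - Sending the integer 1 for every occurrence of a word is very inefficient (1 TB of input yields more than 1 TB of intermediate data).
       - Hadoop cannot tell whether the reducer can be used as a combiner, so the combiner has to be set manually.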
  • 14. Hadoop weaknesses
       - The reducer cannot always be used as a combiner. Hadoop relies on an explicit combiner specification, or on manual partial aggregation inside the life cycle of the mapper instance (in-mapper combiner).
       - Combiners are limited to associative and commutative functions (like sum). Partial aggregation is more general and powerful.
       - The programming model is limited to the map and reduce phases, so multi-job programs are often difficult and counter-intuitive (think of iterative algorithms like PageRank).
       - Joins can be difficult: many techniques must be implemented from scratch.
       - More generally, MapReduce is indeed simple, but many of its optimizations feel like hacks rather than natural parts of the model.
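The first two points can be made concrete with a mean. A reducer that computes a mean cannot be reused as a combiner, because a mean of per-mapper means is not the overall mean. Partial aggregation gets around this by carrying (sum, count) pairs to the reduce side. The sketch below runs in memory and assumes string keys and long values; `PartialAggregationSketch`, `SumCount`, `mapSplit` and `reduce` are hypothetical names, not Hadoop classes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrates in-mapper partial aggregation for a per-key mean; hypothetical names. */
final class PartialAggregationSketch {

    /** Partial aggregate carried from the map side to the reduce side. */
    static final class SumCount {
        long sum;
        long count;

        void add(long value) { sum += value; count++; }

        void merge(SumCount other) { sum += other.sum; count += other.count; }

        double mean() { return (double) sum / count; }
    }

    /** In-mapper combining: one mapper aggregates its whole split before emitting anything. */
    static Map<String, SumCount> mapSplit(List<Map.Entry<String, Long>> split) {
        Map<String, SumCount> partials = new HashMap<>();
        for (Map.Entry<String, Long> rec : split) {
            partials.computeIfAbsent(rec.getKey(), k -> new SumCount()).add(rec.getValue());
        }
        return partials; // emitted once, when the mapper has consumed its split
    }

    /** Reduce side: merges the partial aggregates of all mappers, then divides once per key. */
    static Map<String, Double> reduce(List<Map<String, SumCount>> mapperOutputs) {
        Map<String, SumCount> merged = new HashMap<>();
        for (Map<String, SumCount> output : mapperOutputs) {
            output.forEach((key, partial) ->
                merged.computeIfAbsent(key, k -> new SumCount()).merge(partial));
        }
        Map<String, Double> means = new HashMap<>();
        merged.forEach((key, total) -> means.put(key, total.mean()));
        return means;
    }
}
```

With one split holding the values 1, 2, 3 and another holding only 10 for the same key, the merged aggregate is sum 16 over count 4, so the mean is 4.0. Averaging the two per-split means (2.0 and 10.0) would wrongly give 6.0.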
  • 16. Cascading
       - Open source project developed @Concurrent
       - It is a Java application framework on top of Hadoop, designed to be extensible by providing:
           - Processing API: to develop complex data flows
           - Integration API: integration testing supported by the framework, to avoid putting unstable software into production
           - Scheduling API: to schedule units of work from any third-party application
       - It changes the MapReduce programming model into a more generic, data-flow-oriented programming model
       - Cascading has a data flow optimizer that converts user data flows into optimized data flows
  • 17. Cascading Programming Model
       - A Cascading program is composed of flows
       - A flow is composed of a source tap, a sink tap and the pipes that connect them
       - A pipe holds a particular transformation over its input data flow
       - Pipes can be combined to create more complex programs
  • 18. Example: Word Count
       - MapReduce word count concept: TextLine (Data Source) → Map (tokenize text and emit 1 for each token) → Shuffle → Reduce (count values and emit the result) → TextLine (Data Dest)
       - Cascading word count concept: TextLine → tokenize each line → group by tokens → count values in every group → TextLine
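The data-flow view of word count (tokenize, group by token, count each group) can be imitated in plain Java by treating each step as a function and composing the steps. This is an analogy only and does not use Cascading: `DataFlowSketch`, `TOKENIZE` and `GROUP_AND_COUNT` are hypothetical names, the source is assumed to be an in-memory list of lines, and the sink is the returned map.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Plain-Java analogy of a source -> pipes -> sink data flow; hypothetical names. */
final class DataFlowSketch {

    /** First pipe: split every line on spaces, brackets, parentheses, commas and periods. */
    static final Function<Stream<String>, Stream<String>> TOKENIZE =
        lines -> lines
            .flatMap(line -> Arrays.stream(line.split("[ \\[\\](),.]")))
            .filter(token -> !token.isEmpty());

    /** Second and third pipes: group by token, then count the values in every group. */
    static final Function<Stream<String>, Map<String, Long>> GROUP_AND_COUNT =
        tokens -> tokens.collect(
            Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));

    /** The flow: the pipes composed between the source list and the resulting map. */
    static Map<String, Long> wordCount(List<String> source) {
        return TOKENIZE.andThen(GROUP_AND_COUNT).apply(source.stream());
    }
}
```

The point of the composition is that no step mentions map or reduce phases: the flow states what happens to the data, and an engine is free to decide how to plan it.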
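  • 19. Example: Word Count

        import cascading.flow.FlowDef;
        import cascading.operation.aggregator.Count;
        import cascading.operation.regex.RegexSplitGenerator;
        import cascading.pipe.Each;
        import cascading.pipe.Every;
        import cascading.pipe.GroupBy;
        import cascading.pipe.Pipe;
        import cascading.scheme.hadoop.TextDelimited;
        import cascading.tap.Tap;
        import cascading.tap.hadoop.Hfs;
        import cascading.tuple.Fields;

        public class WordCount {
          public static void main( String[] args ) {
            // source and sink taps: tab-delimited text files with a header line
            Tap docTap = new Hfs( new TextDelimited( true, "\t" ), args[0] );
            Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), args[1] );

            // split the "text" field into tokens on spaces, brackets and punctuation
            RegexSplitGenerator s = new RegexSplitGenerator(
                new Fields( "token" ),
                "[ \\[\\]\\(\\),.]" );
            Pipe docPipe = new Each( "token", new Fields( "text" ), s,
                Fields.RESULTS ); // text -> token

            Pipe wcPipe = new Pipe( "wc", docPipe );
            wcPipe = new GroupBy( wcPipe, new Fields( "token" ) );
            wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

            // connect the taps and pipes to create a flow definition
            FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
                .addSource( docPipe, docTap )
                .addTailSink( wcPipe, wcTap );

            // getFlowConnector() is a helper that is not shown on the slide
            getFlowConnector().connect( flowDef ).complete();
          }
        }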
  • 24. Scalding
       Open source project developed @Twitter. It offers two APIs:
       - Field Based (primary API, stable): uses Cascading Fields, which are dynamic, so errors appear at runtime
       - Type Safe (secondary API, experimental): uses Scala types, which are static, so errors appear at compile time
       The two APIs can work together using pipe.typed and TypedPipe.from.
       This presentation is about the Type Safe API.
  • 28. Why Scalding
       - The high-level idea of MapReduce comes from LISP and works on functions (Map/Reduce) and function composition
       - Cascading works on objects that represent functions and uses constructors to compose pipes:

            Pipe wcPipe = new Pipe( "wc", docPipe );
            wcPipe = new GroupBy( wcPipe, new Fields( "token" ) );
            wcPipe = new Every( wcPipe, Fields.ALL, new Count(),
                                Fields.ALL );

       - Functional programming can naturally describe data flows: every pipe can be seen as a function over its input, and pipes can be combined using function composition. The code above can be written as:

            docPipe.groupBy( new Fields( "token" ) )
                   .every( Fields.ALL, new Count(), Fields.ALL )
  • 29. Example: Word Count

        import com.twitter.scalding._

        class WordCount(args : Args) extends Job(args) {

          /* TextLine reads each line of the given file */
          val input = TypedPipe.from( TextLine( args( "input" ) ) )

          /* tokenize every line and flatten the result into a list of words */
          val words = input.flatMap{ tokenize(_) }

          /* group by word and add a new field, size, that is the group size */
          val wordGroups = words.groupBy{ identity(_) }.size

          /* write each pair (word, count) as a line using TextLine */
          wordGroups.write((0,1), TextLine( args( "output" ) ) )

          /* Split a piece of text into individual words */
          def tokenize(text : String) : Array[String] = {
            // Lowercase each word and remove punctuation.
            text.trim.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "")
              .split("\\s+")
          }
        }
  • 32. Scalding TypeSafe API
       Two main concepts:
       - TypedPipe[T]: a class whose instances are distributed objects that wrap a Cascading Pipe object and hold the transformations applied up to that point. Its interface is similar to Scala's Iterator[T] (map, flatMap, groupBy, filter, ...)
       - KeyedList[K,V]: a trait that represents a sharded list of items. Two implementations:
           - Grouped[K,V]: represents a grouping on keys of type K
           - CoGrouped2[K,V,W,Result]: represents a cogroup over two grouped pipes. Used for joins
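The idea behind a grouping and a cogroup can be shown with ordinary in-memory collections. The sketch below is an analogy in plain Java, not the Scalding API: `CoGroupSketch`, `group` and `join` are hypothetical names, and the join is assumed to be an inner join that pairs every left value with every right value sharing the key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory analogy of a grouping and a cogroup-based join; hypothetical names. */
final class CoGroupSketch {

    /** Grouping analogue: collects, for every key, the list of values that share it. */
    static <K, V> Map<K, List<V>> group(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Cogroup analogue: inner join of two groupings on their common keys. */
    static <K, V, W> Map<K, List<Map.Entry<V, W>>> join(
            Map<K, List<V>> left, Map<K, List<W>> right) {
        Map<K, List<Map.Entry<V, W>>> joined = new LinkedHashMap<>();
        for (Map.Entry<K, List<V>> l : left.entrySet()) {
            List<W> matches = right.get(l.getKey());
            if (matches == null) continue; // inner join: keys missing on the right are dropped
            List<Map.Entry<V, W>> rows = new ArrayList<>();
            for (V v : l.getValue()) {
                for (W w : matches) {
                    rows.add(Map.entry(v, w)); // one row per (left value, right value) pair
                }
            }
            joined.put(l.getKey(), rows);
        }
        return joined;
    }
}
```

In a distributed setting the same logic applies per key after both inputs have been shuffled to the same reducer, which is why a cogroup is defined over two already grouped pipes.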
  • 35. Conclusions
       - The MapReduce API is powerful but limited
       - The Cascading API is as simple as the MapReduce API but more generic and powerful
       - Scalding combines Cascading and Scala to easily describe distributed programs. Its main strengths are:
           - Functional programming, which naturally describes data flows
           - Scalding resembles the Scala library: if you know Scala, you already know how to use Scalding
           - Static typing (TypeSafe API): no type errors at runtime
           - Scala is standard and runs on top of the JVM
           - Scala libraries and tools can be used in production: IDEs, debuggers, test systems, build systems and everything else
  • 36. Thank you for listening