Data-Intensive Computing for Text Analysis
                 CS395T / INF385T / LIN386M
            University of Texas at Austin, Fall 2011

                      Lecture 2
                  September 1, 2011

        Jason Baldridge                      Matt Lease
   Department of Linguistics           School of Information
  University of Texas at Austin    University of Texas at Austin
Jasonbaldridge at gmail dot com   ml at ischool dot utexas dot edu
      Course design and slides derived from
     Jimmy Lin’s cloud computing courses at
    the University of Maryland, College Park

Some figures courtesy of
• Chuck Lam’s Hadoop In Action (2011)
• Tom White’s Hadoop: The Definitive Guide,
  2nd Edition (2010)
Roots in Functional Programming

   Map       f   f   f   f   f

   Fold      g   g   g   g   g
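
As a rough illustration of the two primitives above, the following Python sketch applies a function f to every element independently (map) and then combines the results with a binary function g (fold). The particular f and g, squaring and addition, are arbitrary choices for illustration and do not come from the slides.

```python
from functools import reduce

def f(x):
    # Per-element transformation; each call is independent of the others,
    # which is what makes the map step easy to parallelize.
    return x * x

def g(acc, y):
    # Binary combining function used by the fold.
    return acc + y

data = [1, 2, 3, 4, 5]
mapped = list(map(f, data))     # apply f to each element
folded = reduce(g, mapped, 0)   # fold the mapped values with g, starting at 0
print(mapped, folded)
```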
Divide and Conquer


        w1         w2         w3

      “worker”   “worker”   “worker”

         r1         r2         r3

                 “Result”              Combine
“Big Ideas”
    Scale “out”, not “up”
        Limits of SMP and large shared-memory machines
    Move processing to the data
        Clusters have limited bandwidth
    Process data sequentially, avoid random access
        Seeks are expensive, disk throughput is reasonable
    Seamless scalability
        From the mythical man-month to the tradable machine-hour
Typical Large-Data Problem
          Iterate over a large number of records
          Compute something of interest from each
          Shuffle and sort intermediate results
          Aggregate intermediate results
          Generate final output

             Key idea: provide a functional abstraction for two of
             these operations: the per-record computation (map)
             and the aggregation (reduce)

(Dean and Ghemawat, OSDI 2004)
MapReduce Data Flow

Courtesy of Chuck Lam’s Hadoop In Action
(2011), pp. 45, 52
MapReduce “Runtime”
   Handles scheduling
       Assigns workers to map and reduce tasks
   Handles “data distribution”
       Moves processes to data
   Handles synchronization
       Gathers, sorts, and shuffles intermediate data
   Handles errors and faults
       Detects worker failures and restarts
   Built on a distributed file system
Programmers specify two functions
   map ( K1, V1 ) → list ( K2, V2 )
   reduce ( K2, list(V2) ) → list ( K3, V3)
Note correspondence of types map output → reduce input

Data Flow
      Input → “input splits”: each a sequence of logical (K1,V1) “records”
      Map
        • All records in a split are processed by the same map node
        • map invoked iteratively: once per record in the split
        • For each record processed, map may emit 0-N (K2,V2) pairs

      Reduce
        • reduce invoked iteratively: once per ( K2, list(V2) ) intermediate group
        • For each group processed, reduce may emit 0-N (K3,V3) pairs
      Each reducer’s output written to a persistent file in HDFS
[Figure: each Input File is divided into InputSplits; each InputSplit feeds a RecordReader, then a Mapper, which produces Intermediates]

Source: redrawn from a slide by Cloudera, cc-licensed
Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 30

Data Flow

     Input → “input splits”: each a sequence of logical (K1,V1) “records”
     For each split, for each record, do map(K1,V1)       (multiple calls)
     Each map call may emit any number of (K2,V2) pairs             (0-N)
     Framework groups all values with the same key into ( K2, list(V2) )
     Partitioner determines which reducer will process each key
     Framework copies data across the network as needed for each reducer
     Framework ensures intra-node sort of keys processed by each reducer
       • No guarantee by default of inter-node total sort across reducers
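
The grouping step in this data flow can be sketched in a few lines of Python. This is a single-process stand-in for the shuffle, not Hadoop's implementation: the function name `shuffle` is hypothetical, and the sample pairs are the map outputs shown in the diagrams of these slides. The order of values within a key is left as it arrives, since the slides promise only a sort on keys.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group (K2, V2) pairs into key-sorted (K2, list(V2)) tuples."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Keys are unique here, so sorting the items sorts by key.
    return sorted(groups.items())

# Output of four map tasks, concatenated in arrival order.
map_output = [("a", 1), ("b", 2), ("c", 3), ("c", 6),
              ("a", 5), ("c", 2), ("b", 7), ("c", 8)]
grouped = shuffle(map_output)
print(grouped)
```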
“Hello World”: Word Count
map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )

reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer)

    Map(String docid, String text):
      for each word w in text:
          Emit(w, 1);

    Reduce(String term, Iterator<Int> values):
      int sum = 0;
      for each v in values:
          sum += v;
      Emit(term, sum);
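
A runnable approximation of the pseudocode above, assuming whitespace tokenization and a single-process driver. `run_job` is a hypothetical helper that stands in for the MapReduce runtime (grouping by key, then calling the reducer once per key in sorted order); it is not part of Hadoop.

```python
from collections import defaultdict

def map_fn(docid, text):
    # Emit (word, 1) for every whitespace-separated token.
    for word in text.split():
        yield word, 1

def reduce_fn(term, values):
    # Sum all counts observed for this term.
    yield term, sum(values)

def run_job(records, mapper, reducer):
    # Hypothetical in-memory driver: map every record, group by key,
    # then reduce each key's values in sorted key order.
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

docs = [("d1", "a b c c"), ("d2", "a c b c")]
counts = run_job(docs, map_fn, reduce_fn)
print(counts)  # [('a', 2), ('b', 2), ('c', 4)]
```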
k1 v1   k2 v2   k3 v3    k4 v4   k5 v5    k6 v6

 map                map                    map                map

a 1    b 2        c 3     c 6           a 5   c 2           b 7   c 8
      Shuffle and Sort: aggregate values by keys
             a    1 5             b     2 7           c     2 3 6 8

         reduce             reduce                 reduce

          r1 s1                 r2 s2               r3 s3

Courtesy of Chuck Lam’s Hadoop In Action (2011), pp. 45, 52
   Given:     map ( K1, V1 ) → list ( K2, V2 )
             reduce ( K2, list(V2) ) → list ( K3, V3)

partition (K2, N) → Rj maps K2 to some reducer Rj in [1..N]

       Each distinct key (with associated values) sent to a single reducer
         • Same reduce node may process multiple keys in separate reduce() calls

       Balances workload across reducers: roughly equal number of keys to each
         • Default: simple hash of the key, e.g., hash(k’) mod N (# reducers)

       Customizable
         • Some keys require more computation than others
             • e.g. value skew, or key-specific computation performed
             • For skew, sampling can dynamically estimate distribution & set partition
         • Secondary/Tertiary sorting (e.g. bigrams or arbitrary n-grams)?
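
A minimal sketch of the default "hash of the key mod N" idea, under two assumptions that are not from the slides: `zlib.crc32` is used as the hash because it is stable across Python processes, and reducers are numbered from 0 rather than from 1.

```python
import zlib

def partition(key, num_reducers):
    # Stable hash of the key, reduced modulo the number of reducers.
    # Returns a 0-based reducer index in [0, num_reducers).
    return zlib.crc32(key.encode("utf-8")) % num_reducers

keys = ["a", "b", "c", "a", "c"]
assignment = {k: partition(k, 3) for k in keys}
print(assignment)
```

Because the index depends only on the key, every occurrence of a key reaches the same reducer, regardless of which map task emitted it.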
Secondary Sorting (Lin 57, White 241)
    How to output sorted bigrams (1st word, then list of 2nds)?
        What if we use word1 as the key, word2 as the value?
        What if we use <first>--<second> as the key?
    Pattern
        Create a composite key of (first, second)
        Define a Key Comparator based on both words
          • This will produce the sort order we want (aa ab ac ba bb bc ca cb…)
        Define a partition function based only on first word
          • All bigrams with the same first word go to same reducer
          • How do you know when the first word changes across invocations?
        Preserve state in the reducer across invocations
          • Will be called separately for each bigram, but we want to remember
            the current first word across bigrams seen
        Hadoop also provides Group Comparator
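
The pattern above can be simulated in plain Python. This is a sketch under stated assumptions, not Hadoop code: `partition` uses a toy hash (the sum of the first word's UTF-8 bytes), Python's tuple ordering plays the role of the key comparator, and `BigramReducer` is a hypothetical class that keeps the current first word as state across calls.

```python
def partition(composite_key, num_reducers):
    # Partition on the first word only, so all bigrams sharing it
    # land on the same reducer. Toy stable hash: sum of UTF-8 bytes.
    first, _second = composite_key
    return sum(first.encode("utf-8")) % num_reducers

class BigramReducer:
    """Called once per composite key; remembers the current first word."""
    def __init__(self):
        self.current_first = None
        self.output = []  # list of (first, [second, ...])

    def reduce(self, composite_key):
        first, second = composite_key
        if first != self.current_first:
            # First word changed: start a new output group.
            self.current_first = first
            self.output.append((first, []))
        self.output[-1][1].append(second)

bigrams = [("b", "c"), ("a", "c"), ("b", "a"),
           ("a", "b"), ("a", "a"), ("c", "a")]
num_reducers = 2

# Route each composite key to a reducer.
buckets = [[] for _ in range(num_reducers)]
for key in bigrams:
    buckets[partition(key, num_reducers)].append(key)

# Each reducer sees its keys sorted on (first, second).
results = []
for keys in buckets:
    reducer = BigramReducer()
    for key in sorted(keys):
        reducer.reduce(key)
    results.append(reducer.output)
print(results)
```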
   Given:      map ( K1, V1 ) → list ( K2, V2 )
             reduce ( K2, list(V2) ) → list ( K3, V3)

combine ( K2, list(V2) ) → list ( K2, V2 )

   Optional optimization
       Local aggregation to reduce network traffic
       No guarantee that it will be used, or of how many times it will be called
       Semantics of program cannot depend on its use
   Signature: same input as reduce, same output as map
       Combine may be run repeatedly on its own output
       Lin: Associative & Commutative ⇒ combiner = reducer
         • See next slide
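
To make the "semantics cannot depend on its use" point concrete, the sketch below runs a summing job with the combiner applied zero, one, and two times per mapper and checks that the final result is unchanged. `sum_by_key` is a hypothetical helper that serves as both combiner and reducer; the sample data are the map outputs from the diagrams in these slides.

```python
from collections import defaultdict

def sum_by_key(pairs):
    # Used as both combiner and reducer: addition is associative
    # and commutative, so partial sums can be summed again safely.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

def flatten(list_of_lists):
    return [pair for sub in list_of_lists for pair in sub]

mapper_outputs = [
    [("a", 1), ("b", 2)],
    [("c", 3), ("c", 6)],
    [("a", 5), ("c", 2)],
    [("b", 7), ("c", 8)],
]

no_combine = sum_by_key(flatten(mapper_outputs))
combine_once = sum_by_key(flatten(sum_by_key(m) for m in mapper_outputs))
combine_twice = sum_by_key(
    flatten(sum_by_key(sum_by_key(m)) for m in mapper_outputs))
print(no_combine)
```

An operation such as the arithmetic mean would not pass this check if used directly as a combiner, because a mean of partial means is not in general the overall mean.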
Functional Properties
    Associative: f( a, f(b,c) ) = f( f(a,b), c )
        Grouping of operations doesn’t matter
        YES: Addition, multiplication, concatenation
        NO: division, subtraction, NAND
        NAND(1, NAND(1,0)) = 0 != 1 = NAND( NAND(1,1), 0 )
    Commutative: f(a,b) = f(b,a)
        Ordering of arguments doesn’t matter
        YES: addition, multiplication, NAND
        NO: division, subtraction, concatenation
        concatenate(“a”,“b”) != concatenate(“b”,“a”)
    Distributive
        White (p. 32) and Lam (p. 84) mention with regard to combiners
        But really, go with associative + commutative in Lin (pp. 20, 27)
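
The YES/NO entries above can be checked by brute force over small sample domains. This is only a sketch: a finite check can refute a property with a counterexample, but a pass is merely evidence unless the domain is exhaustive (as it is for NAND over the bits 0 and 1). The sample domains are arbitrary choices.

```python
from itertools import product

def is_associative(f, domain):
    # f(a, f(b, c)) == f(f(a, b), c) for every triple in the domain.
    return all(f(a, f(b, c)) == f(f(a, b), c)
               for a, b, c in product(domain, repeat=3))

def is_commutative(f, domain):
    # f(a, b) == f(b, a) for every pair in the domain.
    return all(f(a, b) == f(b, a)
               for a, b in product(domain, repeat=2))

def add(a, b): return a + b
def sub(a, b): return a - b
def concat(a, b): return a + b      # string concatenation
def nand(a, b): return 1 - (a & b)  # on bits 0/1

ints, bits, strs = [1, 2, 3], [0, 1], ["a", "b"]
report = {
    "addition": (is_associative(add, ints), is_commutative(add, ints)),
    "subtraction": (is_associative(sub, ints), is_commutative(sub, ints)),
    "concatenation": (is_associative(concat, strs), is_commutative(concat, strs)),
    "NAND": (is_associative(nand, bits), is_commutative(nand, bits)),
}
for name, (assoc, comm) in report.items():
    print(f"{name}: associative={assoc}, commutative={comm}")
```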
k1 v1   k2 v2   k3 v3    k4 v4     k5 v5      k6 v6

  map                   map                    map                   map

a 1    b 2           c 3     c 6           a 5     c 2             b 7   c 8

 combine              combine               combine                 combine

a 1    b 2                 c 9             a 5     c 2             b 7   c 8

 partition            partition               partition             partition

      Shuffle and Sort: aggregate values by keys
               a     1 5             b     2 7               c     2 9 8

         reduce                   reduce                  reduce

             r1 s1                 r2 s2                   r3 s3

[Figure: MapReduce execution overview. Steps: (1) submit; (2) schedule map / schedule reduce; (3) read input splits 0-4; (4) local write of intermediate files; (5) remote read by reduce workers; (6) write output files 0 and 1. Stages: Input files → Map phase → Intermediate files (on local disk) → Reduce phase → Output files]

Adapted from (Dean and Ghemawat, OSDI 2004)
Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 178

    Shuffle and 2 Sorts

   As map emits values, local sorting runs in tandem (1st sort)
   Combine is optionally called 0..N times for local aggregation on sorted (K2, list(V2)) tuples (more sorting of output)
   Partition determines which (logical) reducer Rj each key will go to
   Node’s TaskTracker tells JobTracker it has keys for Rj
   JobTracker determines node to run Rj based on data locality
   When local map/combine/sort finishes, sends data to Rj’s node
   Rj’s node iteratively merges inputs from map nodes as it arrives (2nd sort)
   For each (K, list(V)) tuple in merged output, call reduce(…)
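
The reduce-side merge (2nd sort) can be pictured as a k-way merge of runs that each map node has already sorted by key. The sketch below uses Python's standard `heapq.merge` and `itertools.groupby`; the two sample runs and the variable names are hypothetical, and a real reducer would merge from disk rather than from in-memory lists.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Each map node delivers its output already sorted by key (1st sort).
run_from_node1 = [("a", 1), ("b", 2), ("c", 9)]
run_from_node2 = [("a", 5), ("b", 7), ("c", 2), ("c", 8)]

# Merge the sorted runs lazily, comparing on the key only (2nd sort).
merged = heapq.merge(run_from_node1, run_from_node2, key=itemgetter(0))

# Consecutive pairs with equal keys form one (K, list(V)) reduce input.
reduce_input = [(key, [value for _, value in group])
                for key, group in groupby(merged, key=itemgetter(0))]

# Call a summing reduce once per (K, list(V)) tuple.
totals = [(key, sum(values)) for key, values in reduce_input]
print(totals)
```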
Distributed File System
    Don’t move data… move computation to the data!
        Store data on the local disks of nodes in the cluster
        Start up the workers on the node that has the data local
    Why?
        Not enough RAM to hold all the data in memory
        Disk access is slow, but disk throughput is reasonable
    A distributed file system is the answer
        GFS (Google File System) for Google’s MapReduce
        HDFS (Hadoop Distributed File System) for Hadoop
GFS: Assumptions
           Commodity hardware over “exotic” hardware
                 Scale “out”, not “up”
           High component failure rates
                 Inexpensive commodity components fail all the time
           “Modest” number of huge files
                 Multi-gigabyte files are common, if not encouraged
           Files are write-once, mostly appended to
                 Perhaps concurrently
           Large streaming reads over random access
                 High sustained throughput over low latency

GFS slides adapted from material by (Ghemawat et al., SOSP 2003)
GFS: Design Decisions
   Files stored as chunks
        Fixed size (64MB)
   Reliability through replication
        Each chunk replicated across 3+ chunkservers
   Single master to coordinate access, keep metadata
        Simple centralized management
   No data caching
        Little benefit due to large datasets, streaming reads
   Simplify the API
        Push some of the issues onto the client (e.g., data layout)

    HDFS = GFS clone (same basic ideas)
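
A back-of-the-envelope sketch of the first two design decisions: splitting a file into fixed 64MB chunks and placing 3 replicas of each. The round-robin placement and the server names are purely hypothetical simplifications for illustration; the slides do not describe how the master actually chooses chunkservers.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # fixed chunk size from the slide (64MB)
REPLICATION = 3                 # minimum replica count from the slide

def chunk_layout(file_size_bytes, servers):
    # Number of chunks, rounding up so a partial last chunk is counted.
    num_chunks = (file_size_bytes + CHUNK_SIZE - 1) // CHUNK_SIZE
    layout = []
    for i in range(num_chunks):
        # Hypothetical round-robin placement on distinct servers.
        replicas = [servers[(i + r) % len(servers)]
                    for r in range(REPLICATION)]
        layout.append(replicas)
    return layout

servers = ["cs0", "cs1", "cs2", "cs3", "cs4"]
layout = chunk_layout(1024 ** 3, servers)   # a 1 GiB file
print(len(layout), layout[0])
```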
Basic Cluster Components
   1 “Manager” node (can be split onto 2 nodes)
       Namenode (NN)
       Jobtracker (JT)
   1-N “Worker” nodes
       Tasktracker (TT)
       Datanode (DN)
   Optional Secondary Namenode
       Periodic backups of Namenode in case of failure
Hadoop Architecture

   Courtesy of Chuck Lam’s Hadoop In Action (2011), pp. 24-25
Namenode Responsibilities
   Managing the file system namespace:
       Holds file/directory structure, metadata, file-to-block mapping,
        access permissions, etc.
   Coordinating file operations:
       Directs clients to datanodes for reads and writes
       No data is moved through the namenode
   Maintaining overall health:
       Periodic communication with the datanodes
       Block re-replication and rebalancing
       Garbage collection
Putting everything together…

                            namenode                 job submission node

                    namenode daemon                         jobtracker

          tasktracker                     tasktracker                      tasktracker

       datanode daemon                 datanode daemon               datanode daemon

        Linux file system               Linux file system                Linux file system

                        …                               …                                …
          slave node                      slave node                       slave node
Anatomy of a Job
   MapReduce program in Hadoop = Hadoop job
       Jobs are divided into map and reduce tasks (+ more!)
       An instance of running a task is called a task attempt
       Multiple jobs can be composed into a workflow
   Job submission process
       Client (i.e., driver program) creates a job, configures it, and submits it to JobTracker
       JobClient computes input splits (on client end)
       Job data (jar, configuration XML) are sent to JobTracker
       JobTracker puts job data in shared location, enqueues tasks
       TaskTrackers poll for tasks
       Off to the races…
Why have 1 API when you can have 2?
White pp. 25-27, Lam pp. 77-80
   Hadoop 0.19 and earlier had “old API”
   Hadoop 0.21 and forward has “new API”
   Hadoop 0.20 has both!
       Old API most stable, but deprecated
       Current books use old API predominantly, but discuss changes
         • Example code using new API available online from publisher
       Some old API classes/methods not yet ported to new API
       Cloud9 uses both, and you can too
   Mapper (interface)
       void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
       void configure(JobConf job)
       void close() throws IOException
   Reducer/Combiner
       void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter)
       void configure(JobConf job)
       void close() throws IOException
   Partitioner
       int getPartition(K2 key, V2 value, int numPartitions)
   org.apache.hadoop.mapred now deprecated; instead use
    org.apache.hadoop.mapreduce &
   Mapper, Reducer now abstract classes, not interfaces
   Use Context instead of OutputCollector and Reporter
       Context.write(), not OutputCollector.collect()
   Reduce takes value list as Iterable, not Iterator
       Can use Java’s for-each syntax for iterating
   Can throw InterruptedException as well as IOException
   JobConf & JobClient replaced by Configuration & Job

Fact Checking & Information Retrieval
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Systematic Review is e-Discovery in Doctor’s Clothing
Systematic Review is e-Discovery in Doctor’s ClothingSystematic Review is e-Discovery in Doctor’s Clothing
Systematic Review is e-Discovery in Doctor’s Clothing
The Rise of Crowd Computing (July 7, 2016)
The Rise of Crowd Computing (July 7, 2016)The Rise of Crowd Computing (July 7, 2016)
The Rise of Crowd Computing (July 7, 2016)
The Rise of Crowd Computing - 2016
The Rise of Crowd Computing - 2016The Rise of Crowd Computing - 2016
The Rise of Crowd Computing - 2016
The Rise of Crowd Computing (December 2015)
The Rise of Crowd Computing (December 2015)The Rise of Crowd Computing (December 2015)
The Rise of Crowd Computing (December 2015)
Toward Better Crowdsourcing Science
 Toward Better Crowdsourcing Science Toward Better Crowdsourcing Science
Toward Better Crowdsourcing Science
Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Beyond Mechanical Turk: An Analysis of Paid Crowd Work PlatformsBeyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms


WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson

Dernier (20)

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?

Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

  • 1. Data-Intensive Computing for Text Analysis
      CS395T / INF385T / LIN386M, University of Texas at Austin, Fall 2011
      Lecture 2, September 1, 2011
      Jason Baldridge, Department of Linguistics, University of Texas at Austin (jasonbaldridge at gmail dot com)
      Matt Lease, School of Information, University of Texas at Austin (ml at ischool dot utexas dot edu)
  • 2. Acknowledgments
      Course design and slides derived from Jimmy Lin’s cloud computing courses at the University of Maryland, College Park
      Some figures courtesy of
      - Chuck Lam’s Hadoop In Action (2011)
      - Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010)
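  • 3. Roots in Functional Programming
      [Figure: Map (f applied to each of five elements) and Fold (g applied across the five elements)]
  • 4. Divide and Conquer
      [Figure: “Work” is partitioned into w1, w2, w3; each part goes to a “worker”, producing r1, r2, r3, which are combined into the “Result”]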
  • 6. “Big Ideas”
      - Scale “out”, not “up”
          - Limits of SMP and large shared-memory machines
      - Move processing to the data
          - Clusters have limited bandwidth
      - Process data sequentially, avoid random access
          - Seeks are expensive, disk throughput is reasonable
      - Seamless scalability
          - From the mythical man-month to the tradable machine-hour
  • 7. Typical Large-Data Problem
      - Iterate over a large number of records
      - Compute something of interest from each (map)
      - Shuffle and sort intermediate results
      - Aggregate intermediate results (reduce)
      - Generate final output
      Key idea: provide a functional abstraction for these two operations, the per-record computation (map) and the aggregation (reduce)
      (Dean and Ghemawat, OSDI 2004)
  • 8. MapReduce Data Flow Courtesy of Chuck Lam’s Hadoop In Action (2011), pp. 45, 52
  • 9. MapReduce “Runtime”
      - Handles scheduling
          - Assigns workers to map and reduce tasks
      - Handles “data distribution”
          - Moves processes to data
      - Handles synchronization
          - Gathers, sorts, and shuffles intermediate data
      - Handles errors and faults
          - Detects worker failures and restarts
      - Built on a distributed file system
  • 10. MapReduce
      Programmers specify two functions
          map ( K1, V1 ) → list ( K2, V2 )
          reduce ( K2, list(V2) ) → list ( K3, V3 )
      Note correspondence of types: map output → reduce input
      Data Flow
      - Input → “input splits”: each a sequence of logical (K1,V1) “records”
      - Map
          - Each split processed by same map node
          - map invoked iteratively: once per record in the split
          - For each record processed, map may emit 0-N (K2,V2) pairs
      - Reduce
          - reduce invoked iteratively for each ( K2, list(V2) ) intermediate value
          - For each one processed, reduce may emit 0-N (K3,V3) pairs
      - Each reducer’s output written to a persistent file in HDFS
  • 11. [Figure: each Input File is divided by the InputFormat into InputSplits; each InputSplit is read by a RecordReader that feeds one Mapper, which produces Intermediates]
      Source: redrawn from a slide by Cloudera, cc-licensed
  • 12. [Figure courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 30]
      Data Flow
      - Input → “input splits”: each a sequence of logical (K1,V1) “records”
      - For each split, for each record, do map(K1,V1) (multiple calls)
      - Each map call may emit any number of (K2,V2) pairs (0-N)
      Run-time
      - Groups all values with the same key into ( K2, list(V2) )
      - Determines which reducer will process this
      - Copies data across network as needed for reducer
      - Ensures intra-node sort of keys processed by each reducer
          - No guarantee by default of inter-node total sort across reducers
  • 13. “Hello World”: Word Count
      map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )
      reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer )

      Map(String docid, String text):
          for each word w in text:
              Emit(w, 1);

      Reduce(String term, Iterator<Int> values):
          int sum = 0;
          for each v in values:
              sum += v;
          Emit(term, sum);
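The word-count pseudocode can be tried out without a cluster. The following is a minimal single-process sketch, not Hadoop code: the class `WordCountSketch` and its `run` method are hypothetical stand-ins, with `run` playing the role of the runtime (call map once per record, group values by key in sorted key order, call reduce once per key). Whitespace tokenization is an assumption.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of the word-count job; no Hadoop classes are used.
class WordCountSketch {

    // map(docid, text) -> list of (word, 1) pairs; words are split on whitespace (assumption).
    static List<Map.Entry<String, Integer>> map(String docid, String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : text.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(w, 1));
            }
        }
        return out;
    }

    // reduce(term, values) -> (term, sum of values)
    static Map.Entry<String, Integer> reduce(String term, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return new AbstractMap.SimpleEntry<>(term, sum);
    }

    // Stand-in for the runtime: map every record, group by key (sorted), reduce every group.
    static Map<String, Integer> run(Map<String, String> docs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (Map.Entry<String, Integer> kv : map(doc.getKey(), doc.getValue())) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : grouped.entrySet()) {
            Map.Entry<String, Integer> r = reduce(g.getKey(), g.getValue());
            result.put(r.getKey(), r.getValue());
        }
        return result;
    }
}
```

For two documents, d1 = "a b a" and d2 = "b c", `WordCountSketch.run` returns the counts a=2, b=2, c=1.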
  • 14. [Figure: MapReduce data flow]
      Input records (k1,v1) … (k6,v6) are processed by four map tasks
      Map outputs: (a,1) (b,2) | (c,3) (c,6) | (a,5) (c,2) | (b,7) (c,8)
      Shuffle and Sort: aggregate values by keys
      Reduce inputs: a → [1, 5]; b → [2, 7]; c → [2, 3, 6, 8]
      Reduce outputs: (r1,s1) (r2,s2) (r3,s3)
      Courtesy of Chuck Lam’s Hadoop In Action (2011), pp. 45, 52
  • 15. Partition
      - Given:
          map ( K1, V1 ) → list ( K2, V2 )
          reduce ( K2, list(V2) ) → list ( K3, V3 )
          partition ( K2, N ) → Rj, mapping K2 to some reducer Rj in [1..N]
      - Each distinct key (with associated values) sent to a single reducer
          - Same reduce node may process multiple keys in separate reduce() calls
      - Balances workload across reducers: roughly equal number of keys to each
          - Default: simple hash of the key, e.g., hash(k’) mod N (# reducers)
      - Customizable
          - Some keys require more computation than others
          - e.g. value skew, or key-specific computation performed
          - For skew, sampling can dynamically estimate distribution & set partition
          - Secondary/Tertiary sorting (e.g. bigrams or arbitrary n-grams)?
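The default "hash of the key mod N" rule, and the reason value skew matters, can be sketched in a few lines. This is an illustration under stated assumptions, not Hadoop's own class: `PartitionSketch`, `partition`, and `recordsPerReducer` are hypothetical names, keys are plain strings, and reducer indices are zero-based (0..N-1) rather than the 1..N used on the slide. Masking the sign bit is one common way to keep the index non-negative when `hashCode()` is negative.

```java
import java.util.List;

// Hypothetical illustration of a hash partitioner and of skew in records per reducer.
class PartitionSketch {

    // hash(k) mod N, with the sign bit masked off so the result is always in 0..N-1.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Count how many records (not distinct keys) each reducer would receive.
    static int[] recordsPerReducer(List<String> recordKeys, int numReducers) {
        int[] load = new int[numReducers];
        for (String key : recordKeys) {
            load[partition(key, numReducers)]++;
        }
        return load;
    }
}
```

With the keys from the data-flow figure (c four times, a twice, b twice) and three reducers, each reducer receives exactly one distinct key, yet the reducer that owns c receives twice as many records as the others. This is the value skew that a custom partitioner tries to address.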
  • 16. Secondary Sorting (Lin 57, White 241)
      - How to output sorted bigrams (1st word, then list of 2nds)?
      - What if we use word1 as the key, word2 as the value?
      - What if we use <first>--<second> as the key?
      - Pattern
      - Create a composite key of (first, second)
      - Define a Key Comparator based on both words
          - This will produce the sort order we want (aa ab ac ba bb bc ca cb…)
      - Define a partition function based only on first word
          - All bigrams with the same first word go to same reducer
          - How do you know when the first word changes across invocations?
      - Preserve state in the reducer across invocations
          - Will be called separately for each bigram, but we want to remember the current first word across bigrams seen
      - Hadoop also provides Group Comparator
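The secondary-sorting pattern (composite key, comparator on both words, partition on the first word only, state kept across reduce calls) can be sketched in plain Java. Everything here is a hypothetical stand-in rather than Hadoop API: `Bigram` models the composite key, `KEY_ORDER` the key comparator, and each loop iteration in `reduceAll` models one reduce() invocation on a single reducer. The sketch assumes each distinct bigram appears once in the reducer's input, as it would after grouping.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical single-reducer illustration of secondary sorting on bigrams.
class SecondarySortSketch {

    // Composite key: (first word, second word).
    static final class Bigram {
        final String first;
        final String second;

        Bigram(String first, String second) {
            this.first = first;
            this.second = second;
        }
    }

    // Key comparator based on both words: order by first, then by second.
    static final Comparator<Bigram> KEY_ORDER =
        Comparator.comparing((Bigram b) -> b.first).thenComparing(b -> b.second);

    // Partition on the first word only, so bigrams sharing it reach the same reducer.
    static int partition(Bigram key, int numReducers) {
        return (key.first.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Models one reducer: keys arrive in KEY_ORDER, one reduce() call per key.
    static List<String> reduceAll(List<Bigram> keysForThisReducer) {
        List<Bigram> sorted = new ArrayList<>(keysForThisReducer);
        sorted.sort(KEY_ORDER);

        List<String> lines = new ArrayList<>();
        String currentFirst = null;               // state remembered across invocations
        StringBuilder seconds = new StringBuilder();
        for (Bigram key : sorted) {
            if (!key.first.equals(currentFirst)) { // first word changed: emit finished group
                if (currentFirst != null) {
                    lines.add(currentFirst + ": " + seconds);
                }
                currentFirst = key.first;
                seconds.setLength(0);
            }
            if (seconds.length() > 0) {
                seconds.append(' ');
            }
            seconds.append(key.second);
        }
        if (currentFirst != null) {                // final flush, as a reducer would do on close
            lines.add(currentFirst + ": " + seconds);
        }
        return lines;
    }
}
```

Given the bigrams (b,a), (a,c), (a,b), (b,c), `reduceAll` yields the lines "a: b c" and "b: a c". The final flush is the step that is easy to forget: the last group is only complete once no more keys arrive.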
  • 17. Combine
      - Given:
          map ( K1, V1 ) → list ( K2, V2 )
          reduce ( K2, list(V2) ) → list ( K3, V3 )
          combine ( K2, list(V2) ) → list ( K2, V2 )
      - Optional optimization
      - Local aggregation to reduce network traffic
      - No guarantee it will be used, or how many times it will be called
      - Semantics of program cannot depend on its use
      - Signature: same input as reduce, same output as map
      - Combine may be run repeatedly on its own output
      - Lin: Associative & Commutative ⇒ combiner = reducer
          - See next slide
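A small simulation makes the two combiner claims concrete: local aggregation shrinks what crosses the network, and the final answer does not depend on whether the combiner ran. This is a sketch with hypothetical names (`CombinerSketch`, `kv`, `sumByKey`, `shuffleInput`), not Hadoop code; it uses summation, which is associative and commutative, so the same function serves as both combiner and reducer.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: a sum combiner applied to each map task's output.
class CombinerSketch {

    // Convenience constructor for a (key, value) pair.
    static Map.Entry<String, Integer> kv(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Sum the values per key; used both as the combiner and as the reducer.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }

    // The pairs that would be shuffled to reducers, with or without local combining.
    static List<Map.Entry<String, Integer>> shuffleInput(
            List<List<Map.Entry<String, Integer>>> mapTaskOutputs, boolean useCombiner) {
        List<Map.Entry<String, Integer>> sent = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> taskOutput : mapTaskOutputs) {
            if (useCombiner) {
                sent.addAll(sumByKey(taskOutput).entrySet()); // local aggregation per map task
            } else {
                sent.addAll(taskOutput);
            }
        }
        return sent;
    }
}
```

With four map tasks emitting (a,1) (b,2), (c,3) (c,6), (a,5) (c,2), and (b,7) (c,8), the combiner reduces the shuffled pairs from 8 to 7 by merging (c,3) and (c,6) into (c,9), and the reducer output is a=6, b=9, c=19 either way.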
  • 18. Functional Properties
      - Associative: f( a, f(b,c) ) = f( f(a,b), c )
      - Grouping of operations doesn’t matter
      - YES: addition, multiplication, concatenation
      - NO: division, subtraction, NAND
      - NAND(1, NAND(1,0)) = 0 != 1 = NAND( NAND(1,1), 0 )
      - Commutative: f(a,b) = f(b,a)
      - Ordering of arguments doesn’t matter
      - YES: addition, multiplication, NAND
      - NO: division, subtraction, concatenation
      - Concatenate(“a”,“b”) != Concatenate(“b”,“a”)
      - Distributive
      - White (p. 32) and Lam (p. 84) mention with regard to combiners
      - But really, go with associative + commutative in Lin (pp. 20, 27)
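The YES/NO examples can be checked mechanically. The sketch below uses hypothetical helper names (`FunctionalPropertiesSketch`, `nand`, `mean`); the pairwise mean is an extra example that is not on the slide, included because it is commutative but not associative, which is why an average cannot simply reuse its reducer as a combiner.

```java
// Hypothetical helpers for checking associativity and commutativity by example.
class FunctionalPropertiesSketch {

    // NAND on bits: 0 only when both inputs are 1.
    static int nand(int a, int b) {
        return (a == 1 && b == 1) ? 0 : 1;
    }

    // Pairwise mean: commutative, but the grouping of operations changes the result.
    static double mean(double a, double b) {
        return (a + b) / 2.0;
    }

    public static void main(String[] args) {
        // Associativity fails for NAND with a=1, b=1, c=0.
        System.out.println(nand(1, nand(1, 0)) + " vs " + nand(nand(1, 1), 0));
        // Commutativity fails for concatenation.
        System.out.println("a".concat("b") + " vs " + "b".concat("a"));
        // Associativity fails for the pairwise mean.
        System.out.println(mean(1, mean(2, 3)) + " vs " + mean(mean(1, 2), 3));
    }
}
```

A single counterexample is enough to rule a function out, but passing a few examples does not prove the property; for a combiner, the property has to hold for every grouping and ordering the runtime might choose.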
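  • 19. [Figure: MapReduce data flow with combine and partition]
      Input records (k1,v1) … (k6,v6) are processed by four map tasks
      Map outputs: (a,1) (b,2) | (c,3) (c,6) | (a,5) (c,2) | (b,7) (c,8)
      After combine: (a,1) (b,2) | (c,9) | (a,5) (c,2) | (b,7) (c,8)
      Partition, then Shuffle and Sort: aggregate values by keys
      Reduce inputs: a → [1, 5]; b → [2, 7]; c → [2, 9, 8] (instead of [2, 3, 6, 8] without combine)
      Reduce outputs: (r1,s1) (r2,s2) (r3,s3)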
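  • 20. [Figure: MapReduce execution overview]
      (1) User Program submits the job to the Master
      (2) Master schedules map tasks and reduce tasks on workers
      (3) Map workers read their input splits (split 0 … split 4)
      (4) Map workers write intermediate files to local disk
      (5) Reduce workers remote-read the intermediate files
      (6) Reduce workers write the output files (output file 0, output file 1)
      Phases: Input files → Map phase → Intermediate files (on local disk) → Reduce phase → Output files
      Adapted from (Dean and Ghemawat, OSDI 2004)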
  • 21. Shuffle and 2 Sorts
      [Figure courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 178]
      - As map emits values, local sorting runs in tandem (1st sort)
      - Combine is optionally called 0..N times for local aggregation on sorted (K2, list(V2)) tuples (more sorting of output)
      - Partition determines which (logical) reducer Rj each key will go to
      - Node’s TaskTracker tells JobTracker it has keys for Rj
      - JobTracker determines node to run Rj based on data locality
      - When local map/combine/sort finishes, sends data to Rj’s node
      - Rj’s node iteratively merges inputs from map nodes as they arrive (2nd sort)
      - For each (K, list(V)) tuple in merged output, call reduce(…)
  • 22. Distributed File System
      - Don’t move data… move computation to the data!
          - Store data on the local disks of nodes in the cluster
          - Start up the workers on the node that has the data local
      - Why?
          - Not enough RAM to hold all the data in memory
          - Disk access is slow, but disk throughput is reasonable
      - A distributed file system is the answer
          - GFS (Google File System) for Google’s MapReduce
          - HDFS (Hadoop Distributed File System) for Hadoop
  • 23. GFS: Assumptions
      - Commodity hardware over “exotic” hardware
          - Scale “out”, not “up”
      - High component failure rates
          - Inexpensive commodity components fail all the time
      - “Modest” number of huge files
          - Multi-gigabyte files are common, if not encouraged
      - Files are write-once, mostly appended to
          - Perhaps concurrently
      - Large streaming reads over random access
          - High sustained throughput over low latency
      GFS slides adapted from material by (Ghemawat et al., SOSP 2003)
  • 24. GFS: Design Decisions
      - Files stored as chunks
          - Fixed size (64MB)
      - Reliability through replication
          - Each chunk replicated across 3+ chunkservers
      - Single master to coordinate access, keep metadata
          - Simple centralized management
      - No data caching
          - Little benefit due to large datasets, streaming reads
      - Simplify the API
          - Push some of the issues onto the client (e.g., data layout)
      HDFS = GFS clone (same basic ideas)
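The chunking and replication figures translate into simple arithmetic. A minimal sketch, assuming 64MB means 64 × 2^20 bytes and that a partially filled last chunk still counts as one chunk; `ChunkSketch` and its methods are hypothetical names, not part of GFS or HDFS.

```java
// Hypothetical arithmetic for fixed-size chunks with replication.
class ChunkSketch {

    // Fixed chunk size from the slide; 64 * 2^20 bytes is an assumption about the unit.
    static final long CHUNK_BYTES = 64L * 1024 * 1024;

    // Number of chunks needed for a file: ceiling of fileBytes / CHUNK_BYTES.
    static long chunkCount(long fileBytes) {
        return (fileBytes + CHUNK_BYTES - 1) / CHUNK_BYTES;
    }

    // Total chunk replicas stored across chunkservers for a given replication factor.
    static long replicaCount(long fileBytes, int replication) {
        return chunkCount(fileBytes) * replication;
    }
}
```

Under these assumptions a 1 GB file occupies 16 chunks, so a replication factor of 3 means 48 chunk replicas spread over the cluster, and one extra byte pushes the file to 17 chunks.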
  • 25. Basic Cluster Components
      - 1 “Manager” node (can be split onto 2 nodes)
          - Namenode (NN)
          - Jobtracker (JT)
      - 1-N “Worker” nodes
          - Tasktracker (TT)
          - Datanode (DN)
      - Optional Secondary Namenode
          - Periodic backups of Namenode in case of failure
  • 26. Hadoop Architecture
      [Figure courtesy of Chuck Lam’s Hadoop In Action (2011), pp. 24-25]
  • 27. Namenode Responsibilities
      - Managing the file system namespace:
          - Holds file/directory structure, metadata, file-to-block mapping, access permissions, etc.
      - Coordinating file operations:
          - Directs clients to datanodes for reads and writes
          - No data is moved through the namenode
      - Maintaining overall health:
          - Periodic communication with the datanodes
          - Block re-replication and rebalancing
          - Garbage collection
  • 28. Putting everything together…
      [Figure: the namenode (running the namenode daemon) and the job submission node (running the jobtracker) coordinate the slave nodes; each slave node runs a tasktracker and a datanode daemon on top of the Linux file system]
  • 29. Anatomy of a Job
      - MapReduce program in Hadoop = Hadoop job
          - Jobs are divided into map and reduce tasks (+ more!)
          - An instance of running a task is called a task attempt
          - Multiple jobs can be composed into a workflow
      - Job submission process
          - Client (i.e., driver program) creates a job, configures it, and submits it to the JobTracker
          - JobClient computes input splits (on client end)
          - Job data (jar, configuration XML) are sent to JobTracker
          - JobTracker puts job data in shared location, enqueues tasks
          - TaskTrackers poll for tasks
          - Off to the races…
  • 30. Why have 1 API when you can have 2? (White pp. 25-27, Lam pp. 77-80)
      - Hadoop 0.19 and earlier had “old API”
      - Hadoop 0.21 and forward has “new API”
      - Hadoop 0.20 has both!
      - Old API most stable, but deprecated
      - Current books use old API predominantly, but discuss changes
          - Example code using new API available online from publisher
      - Some old API classes/methods not yet ported to new API
      - Cloud9 uses both, and you can too
  • 31. Old API
      - Mapper (interface)
          - void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
          - void configure(JobConf job)
          - void close() throws IOException
      - Reducer/Combiner
          - void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter)
          - void configure(JobConf job)
          - void close() throws IOException
      - Partitioner
          - int getPartition(K2 key, V2 value, int numPartitions)
  • 32. New API
      - org.apache.hadoop.mapred now deprecated; instead use org.apache.hadoop.mapreduce & org.apache.hadoop.mapreduce.lib
      - Mapper, Reducer now abstract classes, not interfaces
      - Use Context instead of OutputCollector and Reporter
          - Context.write(), not OutputCollector.collect()
      - Reduce takes value list as Iterable, not Iterator
          - Can use Java’s for-each syntax for iterating
      - Can throw InterruptedException as well as IOException
      - JobConf & JobClient replaced by Configuration & Job