Counting @ Scale
       Part II

            Sharad Goel
        Columbia University
Computational Social Science: Lecture 4

          February 15, 2013
Descriptive statistics
    (as opposed to inferential statistics)
      is about counting
        contingency tables
    means, variances, quantiles
summaries of conditional distributions
Counting @ scale
  conceptually easy
computationally hard
MapReduce:
Simplified Data Processing on Large Clusters
      Jeffrey Dean and Sanjay Ghemawat
                  OSDI, 2004
Map
assign each input line to one or more groups

                  Shuffle
             aggregate groups

                 Reduce
         operate on grouped data
Map
assign each input line to one or more groups
        v → [(k1, v1), …, (km, vm)]

                  Shuffle
             aggregate groups

                  Reduce
         operate on grouped data
        (k, [v1, …, vn]) → [w1, …, wp]
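
The two signatures above can be sketched as a single-process simulation. This is only an illustration of the data flow, not how a real cluster framework is implemented; the names `map_reduce`, `word_mapper`, and `count_reducer` are hypothetical, and word counting is used as the example task.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run map, shuffle, and reduce in one process (illustration only)."""
    groups = defaultdict(list)
    # Map: each input value yields zero or more (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            # Shuffle: collect all values that share a key.
            groups[key].append(value)
    # Reduce: each (key, [values]) group yields zero or more outputs.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

def word_mapper(line):
    # v -> [(k1, v1), ..., (km, vm)]
    return [(word, 1) for word in line.split()]

def count_reducer(word, ones):
    # (k, [v1, ..., vn]) -> [w1, ..., wp]
    return [(word, sum(ones))]

counts = map_reduce(["a b a", "b c"], word_mapper, count_reducer)
print(counts)
```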
Group Average

              Input
    views (user, movie, rating)

             Output
average (mean & median) by movie
Group Average

          Map
identity (key := movie)

       Reduce
movie group → average
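
A minimal in-memory sketch of the group average, assuming a handful of made-up (user, movie, rating) records. The grouping dictionary stands in for the shuffle.

```python
from collections import defaultdict
from statistics import mean, median

views = [  # hypothetical (user, movie, rating) records
    (1, "A", 4), (2, "A", 2), (3, "A", 3),
    (1, "B", 5), (2, "B", 2),
]

# Map: identity, keyed by movie. Shuffle: group ratings by key.
ratings_by_movie = defaultdict(list)
for user, movie, rating in views:
    ratings_by_movie[movie].append(rating)

# Reduce: each movie group -> (mean, median).
averages = {m: (mean(r), median(r)) for m, r in ratings_by_movie.items()}
print(averages)
```

The mean could also be computed from running sums and counts, but the median needs the whole group, which is why grouping first is convenient.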
The Insight of MapReduce
       One can efficiently group identical items

Many tasks are computationally easier on grouped data
Filter

                 Input
    arbitrary data & filter condition

                 Output
subset of input data satisfying condition
Filter

                   Map
input → input if condition(input) else pass

                 Reduce
                 identity
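
As a sketch, the filter mapper can be written as a generator that emits either the record or nothing. The input lines and the `is_error` condition are made up for illustration.

```python
def filter_mapper(record, condition):
    """Map: emit the record if it satisfies the condition, else emit nothing."""
    if condition(record):
        yield record

lines = ["error: disk full", "ok", "error: timeout", "ok"]  # hypothetical input
is_error = lambda line: line.startswith("error")

# Map-only job: with an identity reduce, the output is just the mapper output.
errors = [out for line in lines for out in filter_mapper(line, is_error)]
print(errors)
```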
Distinct

        Input
     set of items

        Output
subset of distinct items
Distinct

                 Map
               identity

               Reduce
grouped items → single item from group
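
A small sketch of distinct, assuming the item itself serves as the key so that the shuffle brings equal items together. The input list is hypothetical.

```python
from collections import defaultdict

items = ["b", "a", "b", "c", "a"]  # hypothetical input

# Map: identity, with the item itself as the key. Shuffle: group equal items.
groups = defaultdict(list)
for item in items:
    groups[item].append(item)

# Reduce: emit a single representative from each group.
distinct = sorted(group[0] for group in groups.values())
print(distinct)
```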
Sample

               Input
set of items & sample probability p

            Output
     random subset of items
Sample

                  Map
input → input if rand(0,1) < p else pass

                Reduce
                identity
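
The sampling mapper has the same shape as a filter whose condition is a coin flip. In this sketch the generator is seeded only so that runs are reproducible; `sample_mapper` is a hypothetical name.

```python
import random

def sample_mapper(record, p, rng):
    """Map: emit the record with probability p, else emit nothing."""
    if rng.random() < p:
        yield record

rng = random.Random(0)  # seeded only to make the sketch reproducible
items = range(10_000)   # hypothetical input
sample = [out for x in items for out in sample_mapper(x, 0.1, rng)]
print(len(sample))      # roughly p * len(items); the exact size is random
```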
Sort

          Input
set of items (and a key)

       Output
 ordered set of items
Sort

                       Map
identity, with all data assigned to the same key

                    Reduce
                    identity


     *all the work happens in the shuffle
Sort

                 Map
identity, with key := first letter of line

                Reduce
                identity


*all the work happens in the shuffle
Sort

                       Sample
generate a small sample of the data (with MapReduce)

                Determine breakpoints
      sort the sample and compute percentiles
Sort

                     Map
identity, with key determined by breakpoints

                  Reduce
                  identity


 *most of the work happens in the shuffle
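
The sample-then-partition sort can be sketched in one process as follows. The data, the sample size of 200, and `num_reducers = 4` are arbitrary assumptions; each partition stands in for the input of one reducer.

```python
import bisect
import random
from collections import defaultdict

rng = random.Random(0)
data = [rng.randint(0, 10**6) for _ in range(10_000)]  # hypothetical input
num_reducers = 4

# Sample step: sort a small random sample and read off evenly spaced quantiles.
sample = sorted(rng.sample(data, 200))
breakpoints = [sample[len(sample) * i // num_reducers]
               for i in range(1, num_reducers)]

# Map: identity, with key := index of the range the item falls into.
partitions = defaultdict(list)
for x in data:
    partitions[bisect.bisect_right(breakpoints, x)].append(x)

# Shuffle/reduce: each partition is sorted independently; concatenating the
# partitions in key order gives a globally sorted result.
result = [x for key in sorted(partitions) for x in sorted(partitions[key])]
```

Because the breakpoints come from quantiles of a sample, the partitions should be of roughly equal size, which avoids the single-key and first-letter problems of the earlier attempts.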
Combining data

                  Example
    for each user, want to compute the
average popularity of the movies they watch

                   Problem
    one file contains views (user, movie);
another file contains popularity (movie, rank)
Joins
User    Movie
 23      829
 789     24
 234    5678
  7      24

Movie   Rank
5678      4
 24      100
 829     34

User   Movie   Rank
 23     829     34
 789    24     100
 234   5678     4
  7     24     100
Nested-Loop Joins


for each user in users:
      for each movie in movies:
            if user.movie_id == movie.id:
                    output user.id, movie.id, movie.rank
Sort-Merge Joins
User    Movie
 789     24
  7      24
 23      829
 234    5678

Movie   Rank
 24      100
 829     34
5678      4

User   Movie   Rank
 789    24     100
  7     24     100
 23     829     34
 234   5678     4
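
A minimal sketch of a sort-merge join on the tables above, assuming each movie appears at most once in the rank table. The function name `sort_merge_join` is hypothetical.

```python
def sort_merge_join(views, ranks):
    """Inner-join (user, movie) with (movie, rank) on movie.

    Assumes each movie appears at most once in `ranks`.
    """
    views = sorted(views, key=lambda v: v[1])  # sort by movie
    ranks = sorted(ranks, key=lambda r: r[0])  # sort by movie
    out, j = [], 0
    for user, movie in views:
        # Advance the right-hand pointer until it reaches or passes this movie.
        while j < len(ranks) and ranks[j][0] < movie:
            j += 1
        if j < len(ranks) and ranks[j][0] == movie:
            out.append((user, movie, ranks[j][1]))
    return out

views = [(23, 829), (789, 24), (234, 5678), (7, 24)]
ranks = [(5678, 4), (24, 100), (829, 34)]
joined = sort_merge_join(views, ranks)
print(joined)
```

After sorting, both tables are scanned once, so the merge itself is linear in the table sizes, in contrast to the nested loop.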
Hash Joins
User    Movie

 23      829

 789     24

 234    5678

  7      24


Movie   Rank

5678      4

 24      100

 829     34
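
A hash join on the same tables might look like the sketch below, assuming the rank table fits in memory and has one row per movie. `hash_join` is a hypothetical name.

```python
def hash_join(views, ranks):
    # Build phase: hash the (smaller) rank table on the join key.
    rank_by_movie = {movie: rank for movie, rank in ranks}
    # Probe phase: look up each view's movie in the hash table.
    return [(user, movie, rank_by_movie[movie])
            for user, movie in views if movie in rank_by_movie]

views = [(23, 829), (789, 24), (234, 5678), (7, 24)]
ranks = [(5678, 4), (24, 100), (829, 34)]
joined = hash_join(views, ranks)
print(joined)
```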
Distributed Joins

                Map
     reduce key := hash(join key)

                Reduce
        local (sort-merge) join


*also need to keep track of which table
    is the left and which is the right
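
A single-process sketch of the distributed join, assuming two reducers and the tags "L" and "R" to mark the source table. For brevity the local join in each reducer groups records by key rather than sort-merging them as the slide suggests.

```python
from collections import defaultdict

views = [(23, 829), (789, 24), (234, 5678), (7, 24)]
ranks = [(5678, 4), (24, 100), (829, 34)]
num_reducers = 2  # arbitrary assumption

# Map: key each record by hash(join key) and tag it with its source table.
reducers = defaultdict(list)
for user, movie in views:
    reducers[hash(movie) % num_reducers].append((movie, "L", user))
for movie, rank in ranks:
    reducers[hash(movie) % num_reducers].append((movie, "R", rank))

# Reduce: each reducer joins only the records it received.
joined = []
for records in reducers.values():
    left, right = defaultdict(list), defaultdict(list)
    for movie, tag, value in records:
        (left if tag == "L" else right)[movie].append(value)
    for movie in left:
        for user in left[movie]:
            for rank in right.get(movie, []):
                joined.append((user, movie, rank))

print(sorted(joined))
```

Hashing the join key guarantees that all rows for a given movie, from either table, land on the same reducer, so no reducer needs to see the whole data set.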
Joins
{ inner, left, right, outer }
User   Movie
 23     829
789      24
234    5678
  7      24
789      90
 23     758
 23      39
  2     782

User    Sex
 23     male
789     female
234     female
  7     male
 26     male
567     female
  2     female
User    Sex
 23     male
789     female
234     female
  7     male
 26     male
567     female
  2     female

User   Activity
 23       3
789       2
234       1
  7       1
789      90
  2       1
User    Sex
 23     male
789     female
234     female
  7     male
 26     male
567     female
  2     female

User   Activity
 23       3
789       2
234       1
  7       1
789      90
  2       1

Left Join

User    Sex      Activity
 23     male        3
789     female      2
234     female      1
  7     male        1
 26     male
567     female
  2     female      1
User    Sex
 23     male
789     female
234     female
  7     male
 26     male
567     female
  2     female

User   Activity
 23       3
789       2
234       1
  7       1
789      90
  2       1

Inner Join

User    Sex      Activity
 23     male        3
789     female      2
234     female      1
  7     male        1
  2     female      1
Inner join
          returns pairs of rows in tables A & B
            that satisfy the join condition

                      Left (outer) join
         returns all rows from an inner join plus
rows in the left table that have no match in the right table

                      Full (outer) join
         returns all rows from an inner join plus
   rows in either table that have no match in the other
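
The three definitions can be sketched with plain dictionaries. The data is adapted from the tables above, assuming one activity value per user; user 999 is a hypothetical addition so that the full join has an unmatched right-hand row, and `None` marks a missing value.

```python
sex = {23: "male", 789: "female", 234: "female", 7: "male",
       26: "male", 567: "female", 2: "female"}
activity = {23: 3, 789: 2, 234: 1, 7: 1, 2: 1, 999: 5}  # user 999 is hypothetical

# Inner join: only users present in both tables.
inner = [(u, sex[u], activity[u]) for u in sex if u in activity]

# Left join: every row of the left table; None where the right has no match.
left = [(u, s, activity.get(u)) for u, s in sex.items()]

# Full join: the left join plus right-hand rows that have no match on the left.
full = left + [(u, None, a) for u, a in activity.items() if u not in sex]

print(len(inner), len(left), len(full))
```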
Map-side Joins

                     Map
      load (smaller) table into memory
stream through (larger) table and find matches

                   Reduce
                   identity
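
A sketch of a map-side join, assuming the rank table is small enough to hold in memory on every mapper. `make_join_mapper` and `stream_views` are hypothetical names, and the generator stands in for reading a large file one record at a time.

```python
def make_join_mapper(small_table):
    """Build a mapper that holds the smaller table in memory as a dict."""
    lookup = dict(small_table)  # movie -> rank
    def mapper(view):
        user, movie = view
        if movie in lookup:     # emit only matching rows (inner join)
            yield (user, movie, lookup[movie])
    return mapper

def stream_views():
    """Stand-in for streaming a large file one record at a time."""
    yield from [(23, 829), (789, 24), (234, 5678), (7, 24), (2, 782)]

mapper = make_join_mapper([(5678, 4), (24, 100), (829, 34)])
joined = [row for view in stream_views() for row in mapper(view)]
print(joined)
```

No shuffle is needed here because each mapper can resolve matches on its own, which is why this counts as a map-only operation.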
MapReduce Ops

           Map-only
Filter, sample, map-side joins

      Map & Reduce
 groupby, distinct, sort, join
The long tail

             Input
      (user, movie) views

             Output
for each user, average popularity
      of movies they watch
Step 1. compute movie popularity
  group views by movie & count
Step 2. rank movies
 sort by popularity
Step 3. merge view and ranking data
join views and movie popularity tables
Step 4. compute eccentricity
group views/ranking by user and
     compute eccentricity
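
The four steps can be sketched end to end in memory. The views are made up, and eccentricity is assumed here to mean the mean popularity rank of the movies a user watched (rank 1 being the most viewed); the slides do not spell out the exact definition.

```python
from collections import Counter, defaultdict
from statistics import mean

views = [("u1", "A"), ("u2", "A"), ("u3", "A"),  # hypothetical (user, movie) views
         ("u1", "B"), ("u2", "B"), ("u3", "C")]

# Step 1: group views by movie and count.
popularity = Counter(movie for _, movie in views)

# Step 2: sort by popularity; rank 1 is the most viewed movie.
ordered = sorted(popularity, key=lambda m: (-popularity[m], m))
rank = {movie: i + 1 for i, movie in enumerate(ordered)}

# Step 3: join views with ranks on movie.
ranked_views = [(user, rank[movie]) for user, movie in views]

# Step 4: group by user and average (assumed definition of eccentricity).
ranks_by_user = defaultdict(list)
for user, r in ranked_views:
    ranks_by_user[user].append(r)
eccentricity = {user: mean(rs) for user, rs in ranks_by_user.items()}
print(eccentricity)
```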
Pig Latin:
A Not-So-Foreign Language for Data Processing
   Olston, Reed, Srivastava, Kumar, and Tomkins
                  SIGMOD, 2008
