SlideShare une entreprise Scribd logo
1  sur  13
Télécharger pour lire hors ligne
Weekly meeting
         Week 5


    Le Duc Tung

 PRL group, NII, Tokyo


    Nov 21, 2012




 Le Duc Tung   Weekly meeting
Last week




     Described Quicksort using MapReduce, however,
         Required many MapReduce tasks: (log n) to n steps.




                           Le Duc Tung   Weekly meeting
Terasort using Hadoop



     Written by Owen O’Malley and Arun Murthy at Yahoo Inc.
     Won the annual general purpose terabyte sort benchmark in 2008
     and 2009.
         In 2008, 910 nodes x (4 dual-core processors, 4 disks, 8 GB
         memory). Sorting 1TB data in 3.48 minutes.
         In 2009, 3452 nodes x (2 Quadcore Xeons, 8 GB memory, 4 SATA).
         Sorting 100TB data in 173 minutes.
     Terasort includes 3 MapReduce applications:
         Teragen: generates the data.
         Terasort: samples the input data and uses them with MapReduce to
         sort the data.
         Teravalidate: validates the output is sorted.




                           Le Duc Tung   Weekly meeting
A closer look at MapReduce’s implementation model




  ”source: http://developer.yahoo.com/hadoop/tutorial/module4.html”
                           Le Duc Tung   Weekly meeting
Teragen




     Input: The number of rows and the output directory.
     Output format:




                           Le Duc Tung   Weekly meeting
MapReduce for Teragen


     Preparation of data for Map task.
                                    InputSplit
         The number of rows            →         splits of (startid, num of rows).
                                      RecordReader
         (startid, num of rows)            →         key-value pairs of (rowid, null)
     Example:
         We need generate data of 100 rows
         We use 2 mappers
         We generate 2 splits for 2 mappers.
                Split 0: consists of two values: startid = 0, number of rows = 50.
                Split 1: consists of two values: startid = 50, number of rows = 50.
         Key-value pairs generated from split 0:
                (0, null), (1, null), ..., (49, null)
         Key-value pairs generated from split 1:
                (50, null), (51, null), ..., (99, null)




                                  Le Duc Tung      Weekly meeting
Map and Reduce task


     Map
           Input is a pair of (rowid, null)
           Output is a pair of (key, value), in which
                Step 1: Create a random generator based on ”rowid”.
                Step 2: Create a key that is a text of 10 random bytes.
                Step 3: The value is a text of: ”rowid + 7 blocks of 10 characters +
                1 block of 8 characters”
                Characters are got from [’A’ .. ’Z’]
     Reduce task
           Does nothing.




                               Le Duc Tung   Weekly meeting
Terasort



     Step 1: (Before submitting job) Generate the sample keys by
     sampling the input. It is a sorted list of sampled keys.
     Step 2: (Using Partitioner) Distribute keys to the correspondent
     reducers using sample keys.
     Example: If we have N reducers.
           Generate (N-1) sample keys.
           All keys such that sample[i − 1] ≤ key ≤ sample[i] are sent to
           reduce i
           The above guarantees that the output of reduce i are all less than
           the output of reduce (i + 1)




                              Le Duc Tung   Weekly meeting
Generate sample keys

     From the input, create an array R containing all keys or the
     maximum of 100.000 keys (1MB). However, we can specify how
     many elements the array has.
     Using Quicksort to sort keys in the array.
     Create an array of (N-1) sample keys from the array R, in which N is
     the number of reducers, such that,
         (i − 1)-th element in the new array is (i ∗ stepSize)-th element in the
         original array.
         stepSize is of (No. of elements of R)/(No. of reducers)
     List of sampled keys is written into HDFS.




                             Le Duc Tung   Weekly meeting
Partitioner

  public interface Partitioner<K2, V2> extends JobConfigurable {
    /**
     * Get the partition number for a given key (hence record)
     * given the total number of partitions
     * i.e. number of reduce-tasks for the job.
     *
     * <p>Typically a hash function on a all or a subset of
     * the key.</p>
     *
     * @param key the key to be partitioned.
     * @param value the entry value.
     * @param numPartitions the total number of partitions.
     * @return the partition number for the <code>key</code>.
     */
    int getPartition(K2 key, V2 value, int numPartitions);
  }


                        Le Duc Tung   Weekly meeting
Partitioner and Trie tree




    getPartition() function will
    return the position of k in the
    trie tree.                              ”source:
                                            http://en.wikipedia.org/wiki/Trie”


                              Le Duc Tung   Weekly meeting
Trie for sample keys
      Assume that we have three sample keys:
           :x=P[ Q%D¡, in which, ASCII code of : and x is 58 and 120
           respectively.
           LsS8)—.ZLD, in which, ASCII code of L and s is 76 and 115
           respectively.
           le5awB.$sm, in which, ASCII code of l and e is 108 and 101
           respectively.
      Using the 2 first bytes of keys for building a 2-level trie tree.




                              Le Duc Tung   Weekly meeting
Default sort




    By default, data
    will be sorted
    before passing to
    reducers.
    Reduce task
    does nothing, it
    just outputs the
    (K, V) pairs to
    output files.




                        Le Duc Tung   Weekly meeting

Contenu connexe

Tendances

Introduction to Database Management System
Introduction to Database Management SystemIntroduction to Database Management System
Introduction to Database Management SystemHitesh Mohapatra
 
History of database processing module 1 (2)
History of database processing module 1 (2)History of database processing module 1 (2)
History of database processing module 1 (2)chottu89
 
Copy of MongoDB .pptx
Copy of MongoDB .pptxCopy of MongoDB .pptx
Copy of MongoDB .pptxnehabsairam
 
Solution manual for database systems a practical approach to design implement...
Solution manual for database systems a practical approach to design implement...Solution manual for database systems a practical approach to design implement...
Solution manual for database systems a practical approach to design implement...zammok
 
Dbms 10: Conversion of ER model to Relational Model
Dbms 10: Conversion of ER model to Relational ModelDbms 10: Conversion of ER model to Relational Model
Dbms 10: Conversion of ER model to Relational ModelAmiya9439793168
 
Regular language and Regular expression
Regular language and Regular expressionRegular language and Regular expression
Regular language and Regular expressionAnimesh Chaturvedi
 
Object Oriented Dbms
Object Oriented DbmsObject Oriented Dbms
Object Oriented Dbmsmaryeem
 
Capacity Planning of Data Warehousing
Capacity Planning of Data WarehousingCapacity Planning of Data Warehousing
Capacity Planning of Data WarehousingKamal Acharya
 
Triplestore and SPARQL
Triplestore and SPARQLTriplestore and SPARQL
Triplestore and SPARQLLino Valdivia
 
Relational Algebra & Calculus
Relational Algebra & CalculusRelational Algebra & Calculus
Relational Algebra & CalculusAbdullah Khosa
 
Tutorial On Database Management System
Tutorial On Database Management SystemTutorial On Database Management System
Tutorial On Database Management Systempsathishcs
 

Tendances (20)

Database Basics
Database BasicsDatabase Basics
Database Basics
 
Database Basics Theory
Database Basics TheoryDatabase Basics Theory
Database Basics Theory
 
Client server
Client serverClient server
Client server
 
Database System Architectures
Database System ArchitecturesDatabase System Architectures
Database System Architectures
 
Introduction to Database Management System
Introduction to Database Management SystemIntroduction to Database Management System
Introduction to Database Management System
 
History of database processing module 1 (2)
History of database processing module 1 (2)History of database processing module 1 (2)
History of database processing module 1 (2)
 
Copy of MongoDB .pptx
Copy of MongoDB .pptxCopy of MongoDB .pptx
Copy of MongoDB .pptx
 
Nosql data models
Nosql data modelsNosql data models
Nosql data models
 
Solution manual for database systems a practical approach to design implement...
Solution manual for database systems a practical approach to design implement...Solution manual for database systems a practical approach to design implement...
Solution manual for database systems a practical approach to design implement...
 
Dbms 10: Conversion of ER model to Relational Model
Dbms 10: Conversion of ER model to Relational ModelDbms 10: Conversion of ER model to Relational Model
Dbms 10: Conversion of ER model to Relational Model
 
Regular language and Regular expression
Regular language and Regular expressionRegular language and Regular expression
Regular language and Regular expression
 
Object Oriented Dbms
Object Oriented DbmsObject Oriented Dbms
Object Oriented Dbms
 
Capacity Planning of Data Warehousing
Capacity Planning of Data WarehousingCapacity Planning of Data Warehousing
Capacity Planning of Data Warehousing
 
Type constructor
Type constructorType constructor
Type constructor
 
Triplestore and SPARQL
Triplestore and SPARQLTriplestore and SPARQL
Triplestore and SPARQL
 
Relational Algebra & Calculus
Relational Algebra & CalculusRelational Algebra & Calculus
Relational Algebra & Calculus
 
Soa unit-1-well formed and valid document08.07.2019
Soa unit-1-well formed and valid document08.07.2019Soa unit-1-well formed and valid document08.07.2019
Soa unit-1-well formed and valid document08.07.2019
 
Visual basic databases
Visual basic databasesVisual basic databases
Visual basic databases
 
Hadoop
HadoopHadoop
Hadoop
 
Tutorial On Database Management System
Tutorial On Database Management SystemTutorial On Database Management System
Tutorial On Database Management System
 

Similaire à TeraSort

Advanced data structure
Advanced data structureAdvanced data structure
Advanced data structureShakil Ahmed
 
Python Interview Questions | Python Interview Questions And Answers | Python ...
Python Interview Questions | Python Interview Questions And Answers | Python ...Python Interview Questions | Python Interview Questions And Answers | Python ...
Python Interview Questions | Python Interview Questions And Answers | Python ...Simplilearn
 
Algorithms with-java-advanced-1.0
Algorithms with-java-advanced-1.0Algorithms with-java-advanced-1.0
Algorithms with-java-advanced-1.0BG Java EE Course
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data ManagementAlbert Bifet
 
Itroroduction to R language
Itroroduction to R languageItroroduction to R language
Itroroduction to R languagechhabria-nitesh
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docxLiterary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docxjeremylockett77
 
data structures using C 2 sem BCA univeristy of mysore
data structures using C 2 sem BCA univeristy of mysoredata structures using C 2 sem BCA univeristy of mysore
data structures using C 2 sem BCA univeristy of mysoreambikavenkatesh2
 
Basic concept of MATLAB.ppt
Basic concept of MATLAB.pptBasic concept of MATLAB.ppt
Basic concept of MATLAB.pptaliraza2732
 
Ida python intro
Ida python introIda python intro
Ida python intro小静 安
 
Profiling and optimization
Profiling and optimizationProfiling and optimization
Profiling and optimizationg3_nittala
 
sonam Kumari python.ppt
sonam Kumari python.pptsonam Kumari python.ppt
sonam Kumari python.pptssuserd64918
 
Time and Space Complexity Analysis.pptx
Time and Space Complexity Analysis.pptxTime and Space Complexity Analysis.pptx
Time and Space Complexity Analysis.pptxdudelover
 
Julia - Easier, Better, Faster, Stronger
Julia - Easier, Better, Faster, StrongerJulia - Easier, Better, Faster, Stronger
Julia - Easier, Better, Faster, StrongerKenta Sato
 

Similaire à TeraSort (20)

Advanced data structure
Advanced data structureAdvanced data structure
Advanced data structure
 
Python Interview Questions | Python Interview Questions And Answers | Python ...
Python Interview Questions | Python Interview Questions And Answers | Python ...Python Interview Questions | Python Interview Questions And Answers | Python ...
Python Interview Questions | Python Interview Questions And Answers | Python ...
 
Using Parallel Computing Platform - NHDNUG
Using Parallel Computing Platform - NHDNUGUsing Parallel Computing Platform - NHDNUG
Using Parallel Computing Platform - NHDNUG
 
Algorithms with-java-advanced-1.0
Algorithms with-java-advanced-1.0Algorithms with-java-advanced-1.0
Algorithms with-java-advanced-1.0
 
Clojure intro
Clojure introClojure intro
Clojure intro
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data Management
 
Itroroduction to R language
Itroroduction to R languageItroroduction to R language
Itroroduction to R language
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Python lecture 05
Python lecture 05Python lecture 05
Python lecture 05
 
Introduction2R
Introduction2RIntroduction2R
Introduction2R
 
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docxLiterary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
 
data structures using C 2 sem BCA univeristy of mysore
data structures using C 2 sem BCA univeristy of mysoredata structures using C 2 sem BCA univeristy of mysore
data structures using C 2 sem BCA univeristy of mysore
 
Basic concept of MATLAB.ppt
Basic concept of MATLAB.pptBasic concept of MATLAB.ppt
Basic concept of MATLAB.ppt
 
Ida python intro
Ida python introIda python intro
Ida python intro
 
Profiling and optimization
Profiling and optimizationProfiling and optimization
Profiling and optimization
 
sonam Kumari python.ppt
sonam Kumari python.pptsonam Kumari python.ppt
sonam Kumari python.ppt
 
Time and Space Complexity Analysis.pptx
Time and Space Complexity Analysis.pptxTime and Space Complexity Analysis.pptx
Time and Space Complexity Analysis.pptx
 
Julia - Easier, Better, Faster, Stronger
Julia - Easier, Better, Faster, StrongerJulia - Easier, Better, Faster, Stronger
Julia - Easier, Better, Faster, Stronger
 
Python Day1
Python Day1Python Day1
Python Day1
 
1.2 matlab numerical data
1.2  matlab numerical data1.2  matlab numerical data
1.2 matlab numerical data
 

Dernier

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 

Dernier (20)

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 

TeraSort

  • 1. Weekly meeting Week 5 Le Duc Tung PRL group, NII, Tokyo Nov 21, 2012 Le Duc Tung Weekly meeting
  • 2. Last week Described Quicksort using MapReduce, however, Required many MapReduce tasks: (log n) to n steps. Le Duc Tung Weekly meeting
  • 3. Terasort using Hadoop Written by Owen O’Malley and Arun Murthy at Yahoo Inc. Won the annual general purpose terabyte sort benchmark in 2008 and 2009. In 2008, 910 nodes x (4 dual-core processors, 4 disks, 8 GB memory). Sorting 1TB data in 3.48 minutes. In 2009, 3452 nodes x (2 Quadcore Xeons, 8 GB memory, 4 SATA). Sorting 100TB data in 173 minutes. Terasort includes 3 MapReduce applications: Teragen: generates the data. Terasort: samples the input data and uses them with MapReduce to sort the data. Teravalidate: validates the output is sorted. Le Duc Tung Weekly meeting
  • 4. A closer look at MapReduce’s implementation model ”source: http://developer.yahoo.com/hadoop/tutorial/module4.html” Le Duc Tung Weekly meeting
  • 5. Teragen Input: The number of rows and the output directory. Output format: Le Duc Tung Weekly meeting
  • 6. MapReduce for Teragen Preparation of data for Map task. InputSplit The number of rows → splits of (startid, num of rows). RecordReader (startid, num of rows) → key-value pairs of (rowid, null) Example: We need generate data of 100 rows We use 2 mappers We generate 2 splits for 2 mappers. Split 0: consists of two values: startid = 0, number of rows = 50. Split 1: consists of two values: startid = 50, number of rows = 50. Key-value pairs generated from split 0: (0, null), (1, null), ..., (49, null) Key-value pairs generated from split 1: (50, null), (51, null), ..., (99, null) Le Duc Tung Weekly meeting
  • 7. Map and Reduce task Map Input is a pair of (rowid, null) Output is a pair of (key, value), in which Step 1: Create a random generator based on ”rowid”. Step 2: Create a key that is a text of 10 random bytes. Step 3: The value is a text of: ”rowid + 7 blocks of 10 characters + 1 block of 8 characters” Characters are got from [’A’ .. ’Z’] Reduce task Does nothing. Le Duc Tung Weekly meeting
  • 8. Terasort Step 1: (Before submitting job) Generate the sample keys by sampling the input. It is a sorted list of sampled keys. Step 2: (Using Partitioner) Distribute keys to the correspondent reducers using sample keys. Example: If we have N reducers. Generate (N-1) sample keys. All keys such that sample[i − 1] ≤ key ≤ sample[i] are sent to reduce i The above guarantees that the output of reduce i are all less than the output of reduce (i + 1) Le Duc Tung Weekly meeting
  • 9. Generate sample keys From the input, create an array R containing all keys or the maximum of 100.000 keys (1MB). However, we can specify how many elements the array has. Using Quicksort to sort keys in the array. Create an array of (N-1) sample keys from the array R, in which N is the number of reducers, such that, (i − 1)-th element in the new array is (i ∗ stepSize)-th element in the original array. stepSize is of (No. of elements of R)/(No. of reducers) List of sampled keys is written into HDFS. Le Duc Tung Weekly meeting
  • 10. Partitioner public interface Partitioner<K2, V2> extends JobConfigurable { /** * Get the partition number for a given key (hence record) * given the total number of partitions * i.e. number of reduce-tasks for the job. * * <p>Typically a hash function on a all or a subset of * the key.</p> * * @param key the key to be partitioned. * @param value the entry value. * @param numPartitions the total number of partitions. * @return the partition number for the <code>key</code>. */ int getPartition(K2 key, V2 value, int numPartitions); } Le Duc Tung Weekly meeting
  • 11. Partitioner and Trie tree getPartition() function will return the position of k in the trie tree. ”source: http://en.wikipedia.org/wiki/Trie” Le Duc Tung Weekly meeting
  • 12. Trie for sample keys Assume that we have three sample keys: :x=P[ Q%D¡, in which, ASCII code of : and x is 58 and 120 respectively. LsS8)—.ZLD, in which, ASCII code of L and s is 76 and 115 respectively. le5awB.$sm, in which, ASCII code of l and e is 108 and 101 respectively. Using the 2 first bytes of keys for building a 2-level trie tree. Le Duc Tung Weekly meeting
  • 13. Default sort By default, data will be sorted before passing to reducers. Reduce task does nothing, it just outputs the (K, V) pairs to output files. Le Duc Tung Weekly meeting