SlideShare une entreprise Scribd logo
1  sur  13
Télécharger pour lire hors ligne
Weekly meeting
         Week 5


    Le Duc Tung

 PRL group, NII, Tokyo


    Nov 21, 2012




 Le Duc Tung   Weekly meeting
Last week




     Described Quicksort using MapReduce, however,
         Required many MapReduce tasks: (log n) to n steps.




                           Le Duc Tung   Weekly meeting
Terasort using Hadoop



     Written by Owen O’Malley and Arun Murthy at Yahoo Inc.
     Won the annual general purpose terabyte sort benchmark in 2008
     and 2009.
         In 2008, 910 nodes x (4 dual-core processors, 4 disks, 8 GB
         memory). Sorting 1TB data in 3.48 minutes.
         In 2009, 3452 nodes x (2 Quadcore Xeons, 8 GB memory, 4 SATA).
         Sorting 100TB data in 173 minutes.
     Terasort includes 3 MapReduce applications:
         Teragen: generates the data.
         Terasort: samples the input data and uses them with MapReduce to
         sort the data.
         Teravalidate: validates the output is sorted.




                           Le Duc Tung   Weekly meeting
A closer look at MapReduce’s implementation model




  ”source: http://developer.yahoo.com/hadoop/tutorial/module4.html”
                           Le Duc Tung   Weekly meeting
Teragen




     Input: The number of rows and the output directory.
     Output format:




                           Le Duc Tung   Weekly meeting
MapReduce for Teragen


     Preparation of data for Map task.
                                    InputSplit
         The number of rows            →         splits of (startid, num of rows).
                                      RecordReader
         (startid, num of rows)            →         key-value pairs of (rowid, null)
     Example:
         We need generate data of 100 rows
         We use 2 mappers
         We generate 2 splits for 2 mappers.
                Split 0: consists of two values: startid = 0, number of rows = 50.
                Split 1: consists of two values: startid = 50, number of rows = 50.
         Key-value pairs generated from split 0:
                (0, null), (1, null), ..., (49, null)
         Key-value pairs generated from split 1:
                (50, null), (51, null), ..., (99, null)




                                  Le Duc Tung      Weekly meeting
Map and Reduce task


     Map
           Input is a pair of (rowid, null)
           Output is a pair of (key, value), in which
                Step 1: Create a random generator based on ”rowid”.
                Step 2: Create a key that is a text of 10 random bytes.
                Step 3: The value is a text of: ”rowid + 7 blocks of 10 characters +
                1 block of 8 characters”
                Characters are got from [’A’ .. ’Z’]
     Reduce task
           Does nothing.




                               Le Duc Tung   Weekly meeting
Terasort



     Step 1: (Before submitting job) Generate the sample keys by
     sampling the input. It is a sorted list of sampled keys.
     Step 2: (Using Partitioner) Distribute keys to the correspondent
     reducers using sample keys.
     Example: If we have N reducers.
           Generate (N-1) sample keys.
           All keys such that sample[i − 1] ≤ key ≤ sample[i] are sent to
           reduce i
           The above guarantees that the output of reduce i are all less than
           the output of reduce (i + 1)




                              Le Duc Tung   Weekly meeting
Generate sample keys

     From the input, create an array R containing all keys or the
     maximum of 100.000 keys (1MB). However, we can specify how
     many elements the array has.
     Using Quicksort to sort keys in the array.
     Create an array of (N-1) sample keys from the array R, in which N is
     the number of reducers, such that,
         (i − 1)-th element in the new array is (i ∗ stepSize)-th element in the
         original array.
         stepSize is of (No. of elements of R)/(No. of reducers)
     List of sampled keys is written into HDFS.




                             Le Duc Tung   Weekly meeting
Partitioner

  public interface Partitioner<K2, V2> extends JobConfigurable {
    /**
     * Get the partition number for a given key (hence record)
     * given the total number of partitions
     * i.e. number of reduce-tasks for the job.
     *
     * <p>Typically a hash function on a all or a subset of
     * the key.</p>
     *
     * @param key the key to be partitioned.
     * @param value the entry value.
     * @param numPartitions the total number of partitions.
     * @return the partition number for the <code>key</code>.
     */
    int getPartition(K2 key, V2 value, int numPartitions);
  }


                        Le Duc Tung   Weekly meeting
Partitioner and Trie tree




    getPartition() function will
    return the position of k in the
    trie tree.                              ”source:
                                            http://en.wikipedia.org/wiki/Trie”


                              Le Duc Tung   Weekly meeting
Trie for sample keys
      Assume that we have three sample keys:
           :x=P[ Q%D¡, in which, ASCII code of : and x is 58 and 120
           respectively.
           LsS8)—.ZLD, in which, ASCII code of L and s is 76 and 115
           respectively.
           le5awB.$sm, in which, ASCII code of l and e is 108 and 101
           respectively.
      Using the 2 first bytes of keys for building a 2-level trie tree.




                              Le Duc Tung   Weekly meeting
Default sort




    By default, data
    will be sorted
    before passing to
    reducers.
    Reduce task
    does nothing, it
    just outputs the
    (K, V) pairs to
    output files.




                        Le Duc Tung   Weekly meeting

Contenu connexe

Tendances

Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Simplilearn
 
Microsoft SQL Server Database Administration.pptx
Microsoft SQL Server Database Administration.pptxMicrosoft SQL Server Database Administration.pptx
Microsoft SQL Server Database Administration.pptxsamtakke1
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High PerformanceInderaj (Raj) Bains
 
Digital Object Identifier (DOI): Introduction and Applications
Digital Object Identifier (DOI): Introduction and Applications Digital Object Identifier (DOI): Introduction and Applications
Digital Object Identifier (DOI): Introduction and Applications Nader Ale Ebrahim
 
MongoDB Tick Data Presentation
MongoDB Tick Data PresentationMongoDB Tick Data Presentation
MongoDB Tick Data PresentationMongoDB
 
Chapter 8(designing of documnt databases)no sql for mere mortals
Chapter 8(designing of documnt databases)no sql for mere mortalsChapter 8(designing of documnt databases)no sql for mere mortals
Chapter 8(designing of documnt databases)no sql for mere mortalsnehabsairam
 
Exadata Performance Optimization
Exadata Performance OptimizationExadata Performance Optimization
Exadata Performance OptimizationEnkitec
 
Conference slides: MySQL Cluster Performance Tuning
Conference slides: MySQL Cluster Performance TuningConference slides: MySQL Cluster Performance Tuning
Conference slides: MySQL Cluster Performance TuningSeveralnines
 
Diagramas De Secuencia
Diagramas De SecuenciaDiagramas De Secuencia
Diagramas De SecuenciaFabian Garcia
 
OO Development 3 - Models And UML
OO Development 3 - Models And UMLOO Development 3 - Models And UML
OO Development 3 - Models And UMLRandy Connolly
 
Colour Modelling - domain modelling with the 3rd dimension
Colour Modelling - domain modelling with the 3rd dimensionColour Modelling - domain modelling with the 3rd dimension
Colour Modelling - domain modelling with the 3rd dimensionDouglas English
 
Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQLkristinferrier
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveJulian Hyde
 
Oracle advanced queuing
Oracle advanced queuingOracle advanced queuing
Oracle advanced queuingGurpreet singh
 
32- Validation Activity in Azure Data Factory.pptx
32- Validation Activity in Azure Data Factory.pptx32- Validation Activity in Azure Data Factory.pptx
32- Validation Activity in Azure Data Factory.pptxBRIJESH KUMAR
 
9. Document Oriented Databases
9. Document Oriented Databases9. Document Oriented Databases
9. Document Oriented DatabasesFabio Fumarola
 

Tendances (20)

Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
Microsoft SQL Server Database Administration.pptx
Microsoft SQL Server Database Administration.pptxMicrosoft SQL Server Database Administration.pptx
Microsoft SQL Server Database Administration.pptx
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
 
Digital Object Identifier (DOI): Introduction and Applications
Digital Object Identifier (DOI): Introduction and Applications Digital Object Identifier (DOI): Introduction and Applications
Digital Object Identifier (DOI): Introduction and Applications
 
MongoDB Tick Data Presentation
MongoDB Tick Data PresentationMongoDB Tick Data Presentation
MongoDB Tick Data Presentation
 
Chapter 8(designing of documnt databases)no sql for mere mortals
Chapter 8(designing of documnt databases)no sql for mere mortalsChapter 8(designing of documnt databases)no sql for mere mortals
Chapter 8(designing of documnt databases)no sql for mere mortals
 
Exadata Performance Optimization
Exadata Performance OptimizationExadata Performance Optimization
Exadata Performance Optimization
 
Intro To Hadoop
Intro To HadoopIntro To Hadoop
Intro To Hadoop
 
Conference slides: MySQL Cluster Performance Tuning
Conference slides: MySQL Cluster Performance TuningConference slides: MySQL Cluster Performance Tuning
Conference slides: MySQL Cluster Performance Tuning
 
Diagramas De Secuencia
Diagramas De SecuenciaDiagramas De Secuencia
Diagramas De Secuencia
 
Sql
SqlSql
Sql
 
OO Development 3 - Models And UML
OO Development 3 - Models And UMLOO Development 3 - Models And UML
OO Development 3 - Models And UML
 
Colour Modelling - domain modelling with the 3rd dimension
Colour Modelling - domain modelling with the 3rd dimensionColour Modelling - domain modelling with the 3rd dimension
Colour Modelling - domain modelling with the 3rd dimension
 
Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQL
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
 
Oracle advanced queuing
Oracle advanced queuingOracle advanced queuing
Oracle advanced queuing
 
Data structures
Data structuresData structures
Data structures
 
32- Validation Activity in Azure Data Factory.pptx
32- Validation Activity in Azure Data Factory.pptx32- Validation Activity in Azure Data Factory.pptx
32- Validation Activity in Azure Data Factory.pptx
 
9. Document Oriented Databases
9. Document Oriented Databases9. Document Oriented Databases
9. Document Oriented Databases
 
MS Sql Server: Creating Views
MS Sql Server: Creating ViewsMS Sql Server: Creating Views
MS Sql Server: Creating Views
 

Similaire à TeraSort

Advanced data structure
Advanced data structureAdvanced data structure
Advanced data structureShakil Ahmed
 
Python Interview Questions | Python Interview Questions And Answers | Python ...
Python Interview Questions | Python Interview Questions And Answers | Python ...Python Interview Questions | Python Interview Questions And Answers | Python ...
Python Interview Questions | Python Interview Questions And Answers | Python ...Simplilearn
 
Algorithms with-java-advanced-1.0
Algorithms with-java-advanced-1.0Algorithms with-java-advanced-1.0
Algorithms with-java-advanced-1.0BG Java EE Course
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data ManagementAlbert Bifet
 
Itroroduction to R language
Itroroduction to R languageItroroduction to R language
Itroroduction to R languagechhabria-nitesh
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docxLiterary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docxjeremylockett77
 
data structures using C 2 sem BCA univeristy of mysore
data structures using C 2 sem BCA univeristy of mysoredata structures using C 2 sem BCA univeristy of mysore
data structures using C 2 sem BCA univeristy of mysoreambikavenkatesh2
 
Basic concept of MATLAB.ppt
Basic concept of MATLAB.pptBasic concept of MATLAB.ppt
Basic concept of MATLAB.pptaliraza2732
 
Ida python intro
Ida python introIda python intro
Ida python intro小静 安
 
Profiling and optimization
Profiling and optimizationProfiling and optimization
Profiling and optimizationg3_nittala
 
sonam Kumari python.ppt
sonam Kumari python.pptsonam Kumari python.ppt
sonam Kumari python.pptssuserd64918
 
Time and Space Complexity Analysis.pptx
Time and Space Complexity Analysis.pptxTime and Space Complexity Analysis.pptx
Time and Space Complexity Analysis.pptxdudelover
 
Julia - Easier, Better, Faster, Stronger
Julia - Easier, Better, Faster, StrongerJulia - Easier, Better, Faster, Stronger
Julia - Easier, Better, Faster, StrongerKenta Sato
 

Similaire à TeraSort (20)

Advanced data structure
Advanced data structureAdvanced data structure
Advanced data structure
 
Python Interview Questions | Python Interview Questions And Answers | Python ...
Python Interview Questions | Python Interview Questions And Answers | Python ...Python Interview Questions | Python Interview Questions And Answers | Python ...
Python Interview Questions | Python Interview Questions And Answers | Python ...
 
Using Parallel Computing Platform - NHDNUG
Using Parallel Computing Platform - NHDNUGUsing Parallel Computing Platform - NHDNUG
Using Parallel Computing Platform - NHDNUG
 
Algorithms with-java-advanced-1.0
Algorithms with-java-advanced-1.0Algorithms with-java-advanced-1.0
Algorithms with-java-advanced-1.0
 
Clojure intro
Clojure introClojure intro
Clojure intro
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data Management
 
Itroroduction to R language
Itroroduction to R languageItroroduction to R language
Itroroduction to R language
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Python lecture 05
Python lecture 05Python lecture 05
Python lecture 05
 
Introduction2R
Introduction2RIntroduction2R
Introduction2R
 
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docxLiterary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
 
data structures using C 2 sem BCA univeristy of mysore
data structures using C 2 sem BCA univeristy of mysoredata structures using C 2 sem BCA univeristy of mysore
data structures using C 2 sem BCA univeristy of mysore
 
Basic concept of MATLAB.ppt
Basic concept of MATLAB.pptBasic concept of MATLAB.ppt
Basic concept of MATLAB.ppt
 
Ida python intro
Ida python introIda python intro
Ida python intro
 
Profiling and optimization
Profiling and optimizationProfiling and optimization
Profiling and optimization
 
sonam Kumari python.ppt
sonam Kumari python.pptsonam Kumari python.ppt
sonam Kumari python.ppt
 
Time and Space Complexity Analysis.pptx
Time and Space Complexity Analysis.pptxTime and Space Complexity Analysis.pptx
Time and Space Complexity Analysis.pptx
 
Julia - Easier, Better, Faster, Stronger
Julia - Easier, Better, Faster, StrongerJulia - Easier, Better, Faster, Stronger
Julia - Easier, Better, Faster, Stronger
 
Python Day1
Python Day1Python Day1
Python Day1
 
1.2 matlab numerical data
1.2  matlab numerical data1.2  matlab numerical data
1.2 matlab numerical data
 

Dernier

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 

Dernier (20)

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 

TeraSort

  • 1. Weekly meeting Week 5 Le Duc Tung PRL group, NII, Tokyo Nov 21, 2012 Le Duc Tung Weekly meeting
  • 2. Last week Described Quicksort using MapReduce, however, Required many MapReduce tasks: (log n) to n steps. Le Duc Tung Weekly meeting
  • 3. Terasort using Hadoop Written by Owen O’Malley and Arun Murthy at Yahoo Inc. Won the annual general purpose terabyte sort benchmark in 2008 and 2009. In 2008, 910 nodes x (4 dual-core processors, 4 disks, 8 GB memory). Sorting 1TB data in 3.48 minutes. In 2009, 3452 nodes x (2 Quadcore Xeons, 8 GB memory, 4 SATA). Sorting 100TB data in 173 minutes. Terasort includes 3 MapReduce applications: Teragen: generates the data. Terasort: samples the input data and uses them with MapReduce to sort the data. Teravalidate: validates the output is sorted. Le Duc Tung Weekly meeting
  • 4. A closer look at MapReduce’s implementation model ”source: http://developer.yahoo.com/hadoop/tutorial/module4.html” Le Duc Tung Weekly meeting
  • 5. Teragen Input: The number of rows and the output directory. Output format: Le Duc Tung Weekly meeting
  • 6. MapReduce for Teragen Preparation of data for Map task. InputSplit The number of rows → splits of (startid, num of rows). RecordReader (startid, num of rows) → key-value pairs of (rowid, null) Example: We need generate data of 100 rows We use 2 mappers We generate 2 splits for 2 mappers. Split 0: consists of two values: startid = 0, number of rows = 50. Split 1: consists of two values: startid = 50, number of rows = 50. Key-value pairs generated from split 0: (0, null), (1, null), ..., (49, null) Key-value pairs generated from split 1: (50, null), (51, null), ..., (99, null) Le Duc Tung Weekly meeting
  • 7. Map and Reduce task Map Input is a pair of (rowid, null) Output is a pair of (key, value), in which Step 1: Create a random generator based on ”rowid”. Step 2: Create a key that is a text of 10 random bytes. Step 3: The value is a text of: ”rowid + 7 blocks of 10 characters + 1 block of 8 characters” Characters are got from [’A’ .. ’Z’] Reduce task Does nothing. Le Duc Tung Weekly meeting
  • 8. Terasort Step 1: (Before submitting job) Generate the sample keys by sampling the input. It is a sorted list of sampled keys. Step 2: (Using Partitioner) Distribute keys to the correspondent reducers using sample keys. Example: If we have N reducers. Generate (N-1) sample keys. All keys such that sample[i − 1] ≤ key ≤ sample[i] are sent to reduce i The above guarantees that the output of reduce i are all less than the output of reduce (i + 1) Le Duc Tung Weekly meeting
  • 9. Generate sample keys From the input, create an array R containing all keys or the maximum of 100.000 keys (1MB). However, we can specify how many elements the array has. Using Quicksort to sort keys in the array. Create an array of (N-1) sample keys from the array R, in which N is the number of reducers, such that, (i − 1)-th element in the new array is (i ∗ stepSize)-th element in the original array. stepSize is of (No. of elements of R)/(No. of reducers) List of sampled keys is written into HDFS. Le Duc Tung Weekly meeting
  • 10. Partitioner public interface Partitioner<K2, V2> extends JobConfigurable { /** * Get the partition number for a given key (hence record) * given the total number of partitions * i.e. number of reduce-tasks for the job. * * <p>Typically a hash function on a all or a subset of * the key.</p> * * @param key the key to be partitioned. * @param value the entry value. * @param numPartitions the total number of partitions. * @return the partition number for the <code>key</code>. */ int getPartition(K2 key, V2 value, int numPartitions); } Le Duc Tung Weekly meeting
  • 11. Partitioner and Trie tree getPartition() function will return the position of k in the trie tree. ”source: http://en.wikipedia.org/wiki/Trie” Le Duc Tung Weekly meeting
  • 12. Trie for sample keys Assume that we have three sample keys: :x=P[ Q%D¡, in which, ASCII code of : and x is 58 and 120 respectively. LsS8)—.ZLD, in which, ASCII code of L and s is 76 and 115 respectively. le5awB.$sm, in which, ASCII code of l and e is 108 and 101 respectively. Using the 2 first bytes of keys for building a 2-level trie tree. Le Duc Tung Weekly meeting
  • 13. Default sort By default, data will be sorted before passing to reducers. Reduce task does nothing, it just outputs the (K, V) pairs to output files. Le Duc Tung Weekly meeting