6. Batch Layer – Linear Scalability
“Scalability is the ability of a system to maintain performance under increased load by adding more resources”
7. Linear vs. Non-Linear Scalability
“A linearly scalable system can maintain performance under increased load by adding resources in proportion to the increased load”
8. MapReduce
A distributed computing paradigm originally pioneered by Google
Inspired by the “Map” and “Reduce” functions commonly used in functional programming (LISP)
Operating on data stored in a distributed filesystem (HDFS…)
A popular free implementation is Apache Hadoop.
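The functional-programming roots of the name can be seen with Python's built-in `map` and `functools.reduce`, which play the same roles as their LISP ancestors; a minimal illustration:

```python
from functools import reduce

# "Map" transforms every element independently; "reduce" folds the
# results into a single value -- the two primitives MapReduce scales out.
words = ["lambda", "batch", "speed", "serving"]

lengths = list(map(len, words))                    # [6, 5, 5, 7]
total = reduce(lambda acc, n: acc + n, lengths, 0)

print(lengths, total)  # [6, 5, 5, 7] 23
```

Because each `map` call is independent and `reduce` only combines results, both steps parallelize naturally across machines.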
In the Lambda Architecture, you precompute your queries so that at read time you can get your results quickly.
In the batch layer, you run functions over your entire dataset to precompute your queries. Because it runs functions over the entire dataset, the batch layer is slow to update. However, the fact that it looks at all the data at once means the batch layer can precompute arbitrary queries.
Balancing precomputation and on-the-fly computation: You can't always precompute everything. Instead, what you can do is precompute an intermediate result and then do some computation on the fly to complete queries based on those intermediate results.
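A sketch of this split, using a hypothetical pageview example (the data and names are illustrative, not from the source): the batch layer precomputes counts per (url, hour), and a query for an arbitrary time range is completed on the fly by summing the precomputed buckets.

```python
from collections import defaultdict

# Raw master dataset: (url, hour) pageview events -- illustrative data.
raw_events = [
    ("example.com/a", 0), ("example.com/a", 0),
    ("example.com/a", 1), ("example.com/b", 2),
]

# Batch layer: a function of the entire dataset producing an
# intermediate result (hourly counts), recomputed slowly in bulk.
hourly = defaultdict(int)
for url, hour in raw_events:
    hourly[(url, hour)] += 1

# Query time: cheap on-the-fly work over the intermediate result,
# answering ranges that were never precomputed individually.
def pageviews(url, start_hour, end_hour):
    return sum(hourly[(url, h)] for h in range(start_hour, end_hour + 1))

print(pageviews("example.com/a", 0, 1))  # 3
```

Precomputing every possible (url, range) pair would be intractable; precomputing per-hour buckets keeps the batch output small while leaving only a trivial sum for read time.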
Load in a Big Data context is a combination of the total amount of data you have, how much new data you receive every day, how many requests per second your application serves, and so on.
"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node."Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.The MapReduce library in the user program first shards the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines. One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task. A worker who is assigned a map task reads the contents of the corresponding input shard. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. 
If the amount of intermediate data is too large to fit in memory, an external sort is used. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code. After successful completion, the output of the MapReduce execution is available in the R output files. [1]
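The execution flow above can be condensed into a single-process sketch: shard the input, run the user's Map function, partition intermediate pairs into R regions, group each region by key, and run the user's Reduce function per key. The function names are illustrative (this is not the Hadoop API), and networking, disks, and the master are all elided.

```python
from collections import defaultdict

def map_fn(_, line):               # user-defined Map: line -> (word, 1) pairs
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):       # user-defined Reduce: sum counts per word
    yield word, sum(counts)

def mapreduce(lines, R=2):
    # "Map" step: emit intermediate pairs, partitioned into R regions
    # by a partitioning function (here, hash(key) mod R).
    regions = [defaultdict(list) for _ in range(R)]
    for shard_id, line in enumerate(lines):
        for k, v in map_fn(shard_id, line):
            regions[hash(k) % R][k].append(v)
    # "Reduce" step: each region sorts its keys so occurrences of the
    # same key are grouped, then reduces each group.
    out = {}
    for region in regions:
        for k in sorted(region):
            for key, value in reduce_fn(k, region[k]):
                out[key] = value
    return out

print(sorted(mapreduce(["a b a", "b c"]).items()))  # [('a', 2), ('b', 2), ('c', 1)]
```

In the real system each region would be a file on a map worker's local disk, fetched over RPC and merged by the reduce worker that owns that partition.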
To detect failure, the master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed when failure occurs because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system.
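The rescheduling rule can be captured in a few lines. This is a hypothetical sketch (names and task representation are invented): on failure, in-progress tasks and completed map tasks revert to idle, while completed reduce tasks stay done because their output lives in the global file system, not on the failed worker's local disk.

```python
def handle_worker_failure(tasks, failed_worker):
    """Apply the master's rescheduling rule after a worker failure."""
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        if task["state"] == "in-progress":
            task["state"], task["worker"] = "idle", None
        elif task["state"] == "completed" and task["kind"] == "map":
            # Map output sat on the failed worker's local disk -> redo it.
            task["state"], task["worker"] = "idle", None
        # Completed reduce tasks are untouched: output is in the
        # global file system and remains readable.

tasks = [
    {"kind": "map",    "state": "completed",   "worker": "w1"},
    {"kind": "reduce", "state": "completed",   "worker": "w1"},
    {"kind": "map",    "state": "in-progress", "worker": "w2"},
]
handle_worker_failure(tasks, "w1")
print([t["state"] for t in tasks])  # ['idle', 'completed', 'in-progress']
```

The asymmetry between map and reduce tasks follows directly from where each stage writes its output.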