Nguyen Thanh Hai
      Portal team
     August 2012
Agenda

1 − Meet Hadoop
      − History
      − Data!
      − Data Storage and Analysis
      − What Hadoop is Not
2 − The Hadoop Distributed File System
      − HDFS Concept
      − Architecture
      − Goals
      − Command Line User Interface
3 − MapReduce
      − Overview
      − How MapReduce Works
4 − Practice
      − Demo
      − Discussion


                    www.exoplatform.com - Copyright 2012 eXo Platform   2
Meet Hadoop

 - History

 - Data!

 - Data Storage and Analysis

 - What Hadoop is Not




History
- Hadoop got its start in Nutch. A few developers were attempting to
build an open source web search engine and were having trouble
managing computations running on even a handful of computers.

- Once Google published its GFS and MapReduce papers, the
route became clear. Google had devised systems to solve precisely the
problems the Nutch developers were having. So two of them started,
working half-time, to re-create these systems as a part of Nutch.

- Around that time, Yahoo! got interested and quickly put
together a team. They split off the distributed computing part of
Nutch, naming it Hadoop. With the help of Yahoo!, Hadoop soon
grew into a technology that could truly scale to the Web.

Data! We live in the data age




Data Storage and Analysis

- While the storage capacities of hard drives have increased massively over
the years, access speeds (the rate at which data can be read from drives)
have not kept up. One typical drive from 1990 could store 1,370 MB of
data and had a transfer speed of 4.4 MB/s, so the whole drive could be read
in around five minutes. Over 20 years later, one-terabyte drives are the
norm, but the transfer speed is around 100 MB/s, so reading a full drive
takes more than two and a half hours.

- This is a long time to read all the data on a single drive, and writing is even
slower.
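
The read times implied by these figures can be checked with a few lines of arithmetic. This is only a sketch: the function name is made up for illustration, and one terabyte is assumed to mean 1,000,000 MB (decimal units).

```python
def full_read_seconds(capacity_mb: float, transfer_mb_per_s: float) -> float:
    """Time to read an entire drive sequentially at its transfer speed."""
    return capacity_mb / transfer_mb_per_s

# Figures from the slide; 1 TB is assumed to be 1,000,000 MB.
drive_1990 = full_read_seconds(1_370, 4.4)
drive_2012 = full_read_seconds(1_000_000, 100)

print(f"1990 drive: {drive_1990 / 60:.1f} minutes")   # 5.2 minutes
print(f"1 TB drive: {drive_2012 / 3600:.1f} hours")   # 2.8 hours
```

Capacity grew by a factor of roughly 730 while transfer speed grew by a factor of roughly 23, which is why the full-read time went from minutes to hours.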




Data Storage and Analysis

The obvious way:

- Imagine we had 100 drives, each holding one hundredth of the data.
Working in parallel, we could read the data in under two minutes.

- Using only one hundredth of each disk may seem wasteful. But we can store one
hundred datasets, each of which is one terabyte, and provide shared access to
them.
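
The "under two minutes" figure follows from dividing the work evenly. A rough sketch, assuming a one-terabyte dataset of 1,000,000 MB, drives that each sustain 100 MB/s, and no coordination overhead (the function name is illustrative only):

```python
def parallel_read_seconds(dataset_mb: float, transfer_mb_per_s: float, drives: int) -> float:
    """Each drive holds dataset_mb / drives, and all drives are read at the same time."""
    per_drive_mb = dataset_mb / drives
    return per_drive_mb / transfer_mb_per_s

one_drive = parallel_read_seconds(1_000_000, 100, drives=1)
hundred_drives = parallel_read_seconds(1_000_000, 100, drives=100)

print(f"1 drive:    {one_drive:.0f} s")       # 10000 s
print(f"100 drives: {hundred_drives:.0f} s")  # 100 s
```

A real cluster will not reach this ideal, since the drives share network and CPU resources, but the order of magnitude holds.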




Data Storage and Analysis

The problems to solve:

- The first: As soon as you start using many pieces of hardware, the chance that
one will fail is fairly high. A common way of avoiding data loss is
replication: redundant copies of the data are kept by the system so that in the
event of failure, there is another copy available.

- The second: Most analysis tasks need to be able to combine the data in
some way; data read from one disk may need to be combined with the data from
any of the other 99 disks. Various distributed systems allow data to be combined
from multiple sources, but doing this correctly is notoriously challenging.

With Hadoop:
Hadoop provides a reliable shared storage and analysis system. The storage is
provided by HDFS and the analysis by MapReduce.
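
To see why failure becomes likely with many drives and why replication helps, here is a small probability sketch. The per-drive failure probability of 1% and the replica count of 3 are hypothetical values chosen for illustration, and failures are assumed to be independent, which real hardware does not guarantee.

```python
def p_any_failure(drives: int, p_fail: float) -> float:
    """Probability that at least one of `drives` independent drives fails."""
    return 1 - (1 - p_fail) ** drives

def p_all_replicas_lost(p_fail: float, replicas: int) -> float:
    """Probability that every drive holding a copy of the same data fails."""
    return p_fail ** replicas

P_FAIL = 0.01  # hypothetical per-drive failure probability over some period

print(f"1 drive fails:               {p_any_failure(1, P_FAIL):.3f}")
print(f"at least 1 of 100 fails:     {p_any_failure(100, P_FAIL):.3f}")
print(f"all 3 copies of a block lost: {p_all_replicas_lost(P_FAIL, 3):.6f}")
```

Under these assumptions a failure somewhere in the cluster is more likely than not, while losing every copy of a given piece of data is rare.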



What Hadoop is Not
- It is not a substitute for a database. Hadoop stores data in files and does not
index them. If you want to find something, you have to run a MapReduce job
going through all the data. This takes time, and means that you cannot directly use
Hadoop as a substitute for a database. Where Hadoop works is where the data is
too big for a database. With very large datasets, the cost of regenerating indexes
is so high that you can't easily index changing data.

- MapReduce is not always the best algorithm. MapReduce is a profound idea:
taking a simple functional programming operation and applying it, in parallel, to
gigabytes or terabytes of data. But there is a price. For that parallelism, you need
to have each MR operation independent from all the others. If you need to know
everything that has gone before, you have a problem.

- Hadoop and MapReduce are not a place to learn Java programming

- Hadoop is not an ideal place to learn networking error messages

- Hadoop clusters are not a place to learn Unix/Linux system administration
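
The independence requirement can be illustrated in plain Python, without Hadoop. In this sketch (all names and data are made up), a sum of squares can be computed chunk by chunk in any order, whereas a running total depends on everything that came before and gives wrong answers when chunks are processed in isolation.

```python
def square(x):
    # Depends only on its own input, so it can run anywhere, in any order.
    return x * x

records = [3, 1, 4, 1, 5, 9, 2, 6]
chunks = [records[0:3], records[3:6], records[6:8]]

# Independent operation: process the chunks in reverse order, then combine.
partial_sums = [sum(square(x) for x in chunk) for chunk in reversed(chunks)]
total = sum(partial_sums)

def running_totals(values):
    """Each output depends on all earlier inputs."""
    out, acc = [], 0
    for v in values:
        acc += v
        out.append(acc)
    return out

correct = running_totals(records)
# Dependent operation: handling each chunk in isolation loses the earlier state.
per_chunk = [t for chunk in chunks for t in running_totals(chunk)]

print(total == sum(square(x) for x in records))  # True
print(per_chunk == correct)                      # False
```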


The Hadoop Distributed File System

  - HDFS Concept

  - Architecture

  - Goals

  - Command Line User Interface




HDFS concept

Block:

- A disk has a block size, which is the minimum amount of data that it can read or
write. Filesystems for a single disk build on this by dealing with data in blocks.
Disk blocks are normally 512 bytes.

- HDFS, too, has the concept of a block, but it is a much larger unit: 64 MB by
default. As in a filesystem for a single disk, files in HDFS are broken into
block-sized chunks, which are stored as independent units. Unlike a filesystem for a
single disk, a file in HDFS that is smaller than a single block does not occupy a
full block's worth of underlying storage.
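
The way a file maps onto blocks can be sketched as follows. This is only a model of the arithmetic, not HDFS code: the function name is hypothetical, and the 64 MB block size is the default quoted on the slide.

```python
MB = 1024 * 1024
BLOCK_SIZE = 64 * MB  # default block size quoted on the slide

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the size in bytes of each block a file of `file_size` bytes occupies."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only takes as much storage as it actually needs.
        sizes.append(remainder)
    return sizes

print([size // MB for size in split_into_blocks(200 * MB)])  # [64, 64, 64, 8]
print([size // MB for size in split_into_blocks(1 * MB)])    # [1]
```

The second call shows the point made above: a 1 MB file occupies 1 MB of underlying storage, not a full 64 MB block.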




HDFS Concept

NameNode and DataNodes:

- A Hadoop cluster has two types of nodes operating in a master-worker pattern:
a NameNode (the master) and a number of DataNodes (workers).

- The NameNode manages the filesystem namespace. It maintains the filesystem
tree and the metadata for all the files and directories in the tree. It executes
filesystem namespace operations like opening, closing, and renaming files and
directories. It also determines the mapping of blocks to DataNodes.

- DataNodes are the workhorses of the filesystem. They store and retrieve blocks
when they are told to (by clients or the NameNode), and they report back to the
NameNode periodically with lists of the blocks that they are storing.
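
The division of labour between the two node types can be modelled with a toy example. The classes below are an in-memory illustration only; the class and method names are invented for this sketch and do not correspond to Hadoop's real classes or RPC protocol.

```python
from collections import defaultdict

class DataNode:
    """Toy worker: stores block contents and reports which blocks it holds."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def retrieve(self, block_id):
        return self.blocks[block_id]

    def block_report(self):
        return sorted(self.blocks)

class NameNode:
    """Toy master: holds only metadata, never the file contents."""
    def __init__(self):
        self.namespace = {}                # path -> ordered list of block ids
        self.locations = defaultdict(set)  # block id -> names of DataNodes holding it

    def create(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def rename(self, old_path, new_path):
        self.namespace[new_path] = self.namespace.pop(old_path)

    def receive_report(self, datanode):
        for block_id in datanode.block_report():
            self.locations[block_id].add(datanode.name)

    def locate(self, path):
        return [(b, sorted(self.locations[b])) for b in self.namespace[path]]

dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.store("blk_1", b"first part ")
dn2.store("blk_1", b"first part ")
dn2.store("blk_2", b"second part")

nn = NameNode()
nn.create("/logs/a.txt", ["blk_1", "blk_2"])
nn.receive_report(dn1)
nn.receive_report(dn2)
nn.rename("/logs/a.txt", "/logs/b.txt")

print(nn.locate("/logs/b.txt"))
```

A client in this model asks the NameNode where the blocks of a file live, then reads the block contents directly from the DataNodes.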




Architecture




HDFS Goals
- Hardware Failure: An HDFS instance may consist of hundreds or thousands of
server machines, each storing part of the file system's data. The fact that there are
a huge number of components and that each component has a non-trivial probability
of failure means that some component of HDFS is always non-functional.
Therefore, detection of faults and quick, automatic recovery from them is a core
architectural goal of HDFS.

- Large Data Sets: Applications that run on HDFS have large data sets. A typical
file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large
files. It should provide high aggregate data bandwidth and scale to hundreds of
nodes in a single cluster. It should support tens of millions of files in a single instance.

- “Moving Computation is Cheaper than Moving Data”: A computation
requested by an application is much more efficient if it is executed near the data it
operates on. This is especially true when the size of the data is huge. This minimizes
network congestion and increases the overall throughput of the system. The
assumption is that it is often better to migrate the computation closer to where the
data is located rather than moving the data to where the application is running. HDFS
provides interfaces for applications to move themselves closer to where the data is
located.
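
The idea of moving computation to the data can be sketched as a placement rule: given the nodes that hold a block, prefer running the task on one of them. The function below is a hypothetical illustration, not an HDFS or MapReduce API, and it assumes at least one free node is available.

```python
def pick_node(block_id, block_locations, free_nodes):
    """Prefer a free node that already stores the block; otherwise use any free node.

    Returns (node name, whether the read will be local).
    """
    holders = block_locations.get(block_id, set())
    local = [node for node in free_nodes if node in holders]
    if local:
        return local[0], True
    # No free node holds the block, so its data must travel over the network.
    return free_nodes[0], False

block_locations = {"blk_1": {"dn2", "dn3"}, "blk_2": {"dn3"}}

print(pick_node("blk_1", block_locations, ["dn1", "dn2"]))  # ('dn2', True)
print(pick_node("blk_2", block_locations, ["dn1", "dn2"]))  # ('dn1', False)
```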

Command Line User Interface




MapReduce

 - Overview

 - How MapReduce Works




Overview
- Hadoop MapReduce is a software framework for easily writing applications which
process vast amounts of data (multi-terabyte data sets) in parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

- A MapReduce job usually splits the input data set into independent chunks which
are processed by the map tasks in a completely parallel manner. The framework
sorts the outputs of the maps, which are then input to the reduce tasks. Typically both
the input and the output of the job are stored in a filesystem. The framework takes care of
scheduling tasks, monitoring them and re-executing the failed tasks.

- The MapReduce framework consists of a single master JobTracker and one
worker TaskTracker per cluster node. The master is responsible for scheduling
the jobs' component tasks on the workers, monitoring them and re-executing the
failed tasks. The workers execute the tasks as directed by the master.
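
The split, map, sort, reduce flow described above can be imitated in a single Python process with the classic word count. This is a sketch of the programming model only: the function names are invented, everything runs sequentially, and none of it uses Hadoop's actual Java API.

```python
from itertools import groupby
from operator import itemgetter

def map_words(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_counts(word, counts):
    """Reduce: combine all values emitted for one key."""
    return word, sum(counts)

def run_job(chunks):
    # Map phase: each chunk is independent, so a cluster could process them in parallel.
    mapped = [pair for chunk in chunks for line in chunk for pair in map_words(line)]
    # The framework sorts the map output by key before the reduce phase.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: one call per distinct key.
    return [
        reduce_counts(word, (count for _, count in group))
        for word, group in groupby(mapped, key=itemgetter(0))
    ]

chunks = [["the quick fox"], ["the lazy dog", "the end"]]
print(run_job(chunks))
```

In a real job, the JobTracker would hand each chunk to a TaskTracker and rerun any task whose worker failed; here the list comprehension stands in for all of that.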




How MapReduce Works




Practice

  - Demo

  - Discussion





Hadoop

  • 1. Nguyen Thanh Hai Portal team August 2012
  • 2. Agenda − Meet Hadoop 1 − − − History Data! Data Storage and Analysis − What Hadoop is Not 2 − The Hadoop Distributed File System − HDFS concept − Architecture − Goals − Command User Interface 3 − MapReduce − Overview − How MapReduce works 4 − Practice − Demo − Discussion www.exoplatform.com - Copyright 2012 eXo Platform 2
  • 3. Meet Hadoop - History - Data! - Data Storage and Analysis - What Hadoop is Not www.exoplatform.com - Copyright 2012 eXo Platform 3
  • 4. History www.exoplatform.com - Copyright 2012 eXo Platform 4
  • 5. History - Hadoop got its start in Nutch. A few of them were attempting to build an open source web search engine and having trouble managing computations running on even a handful of computers - Once Google published its GFS and MapReduce papers, the route became clear. It'd devised systems to solve precisely the problems they were having with Nutch. So they started, two of them, half-time, to try to re-create these systems as a part of Nutch - Around that time. Yahoo! got interested, and quickly put together a team. They split off the distributed computing part of Nutch, naming it Hadoop. With the help of Yahoo!, Hadoop soon grew into a technology that could truly scale to the Web. www.exoplatform.com - Copyright 2012 eXo Platform 5
  • 6. Data! We live in the data age www.exoplatform.com - Copyright 2012 eXo Platform 6
  • 7. Data! We live in the data age www.exoplatform.com - Copyright 2012 eXo Platform 7
  • 8. Data Storage and Analysis - While the storage capacities of hard drives have increased massively over the years, access speeds the rate at which data can be read from drivers have not kept up. Once typical drive from 1990 cloud store 1,370 MB of data and had a transfer speed of 4.4 MB/s. Over 20 years later, one terabyte drives are the norm, but the transfer speed is around 100MB/s - This is a long time to read all data on a single drive and writing is even slower. www.exoplatform.com - Copyright 2012 eXo Platform 8
  • 9. Data Storage and Analysis The obvious way: - Imagine if we have 100 drivers, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes. - Only using one hundredth of a disk may seem wasteful. But we can store one hundred datasets, each of which is one terabyte, and provide shared access to them. www.exoplatform.com - Copyright 2012 eXo Platform 9
  • 10. Data Storage and Analysis The problems to solve: - The first: As soon as you start using many pieces of hardware, the chance that first one will fail is fairly high. A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure, there is another copy available. - The second: That most analysis tasks need to be able to combine the data in second some way; data read from one disk may need to be combine with the data from any of the other 99 disks. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging With Hadoop: Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS and analysis by MapReduce. www.exoplatform.com - Copyright 2012 eXo Platform 10
  • 11. What Hadoop is Not - It is not a substitute for a database. Hadoop stores data in files, and dose not database index them. If you want to find something, you have to run a MapReduce job going through all the data. This take time, and mean that you cannot directly use Hadoop as a substitute for a database. Where Hadoop works is where the data is too big for a database. With very large datasets, the cost of regenerating indexes is so high you can't easily index changing data. - MapReduce is not always the best algorithm. MapReduce is profound idea: algorithm talking a simple functional programming operation and applying it, in parallel, to gigabytes or terabytes of data. But there is a price. For that parallelism, you need to have each MR operation independent from all the others. If you need to know everything that has gone before, you have a problem. - Hadoop and MapReduce is not a place to learn Java programming - Hadoop is not an ideal place to learn networking error messages - Hadoop clusters are not a place to learn Unix/Linux system administration www.exoplatform.com - Copyright 2012 eXo Platform 11
  • 12. The Hadoop Distributed File System - HDFS Concept - Architecture - Goals - Command Line User Interface
  • 13. HDFS Concept Block: - A disk has a block size, which is the minimum amount of data that it can read or write. Filesystems for a single disk build on this by dealing with data in blocks. Disk blocks are normally 512 bytes. - HDFS, too, has the concept of a block, but it is a much larger unit: 64 MB by default. As in a filesystem for a single disk, files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage.
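The chunking rule above can be sketched in a few lines. This is a toy calculation of chunk sizes only, using the 64 MB default quoted on the slide; the function name `split_into_blocks` is hypothetical and is not part of any Hadoop API.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default block size quoted above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes in bytes of the chunks a file of `file_size` bytes is broken into."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last chunk only takes up as much storage as it actually needs.
        sizes.append(remainder)
    return sizes


MB = 1024 * 1024
print([size // MB for size in split_into_blocks(200 * MB)])  # three full blocks plus 8 MB
print([size // MB for size in split_into_blocks(1 * MB)])    # a single 1 MB chunk
```

A 200 MB file becomes three full 64 MB blocks and one 8 MB block, and a 1 MB file is stored as a single 1 MB chunk rather than a padded 64 MB one.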
  • 14. HDFS Concept NameNode and DataNodes: - A Hadoop cluster has two types of node operating in a master-worker pattern: a NameNode (the master) and a number of DataNodes (the workers). - The NameNode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. It executes filesystem namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. - DataNodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the NameNode), and they report back to the NameNode periodically with lists of the blocks that they are storing.
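The bookkeeping described above can be pictured as two lookup tables: files to blocks, and blocks to the DataNodes holding them. The following is a minimal in-memory sketch of that idea; the class `ToyNameNode`, its method names, and the block and node identifiers are all hypothetical and do not reflect the real HDFS implementation or its protocol.

```python
from collections import defaultdict


class ToyNameNode:
    """Toy model of the metadata a NameNode keeps (illustrative only)."""

    def __init__(self):
        self.file_blocks = {}                    # path -> ordered list of block ids
        self.block_locations = defaultdict(set)  # block id -> names of DataNodes holding it

    def add_file(self, path, block_ids):
        """Record which blocks make up a file, in order."""
        self.file_blocks[path] = list(block_ids)

    def block_report(self, datanode, block_ids):
        """Handle a periodic report from a DataNode listing the blocks it stores."""
        # Forget what this DataNode reported earlier, then record the new list.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path):
        """Return (block id, sorted DataNode names) for each block of a file."""
        return [
            (block_id, sorted(self.block_locations.get(block_id, ())))
            for block_id in self.file_blocks[path]
        ]


namenode = ToyNameNode()
namenode.add_file("/logs/a.txt", ["blk_1", "blk_2"])
namenode.block_report("dn1", ["blk_1"])
namenode.block_report("dn2", ["blk_1", "blk_2"])
print(namenode.locate("/logs/a.txt"))
```

In this sketch a client asks `locate` where a file's blocks live and then talks to the DataNodes directly, which mirrors the division of labour described on the slide.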
  • 15. Architecture
  • 16. Architecture
  • 17. HDFS Goals - Hardware Failure: An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS. - Large Data Sets: Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance. - “Moving Computation is Cheaper than Moving Data”: A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
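The hardware-failure argument can be made concrete with a rough probability estimate. The sketch below assumes that components fail independently and uses a hypothetical per-component failure probability of 0.1% over some fixed period; neither assumption comes from the slides.

```python
def prob_any_failed(n_components, p_fail):
    """Probability that at least one of n independent components has failed."""
    return 1.0 - (1.0 - p_fail) ** n_components


# Hypothetical per-component failure probability over some fixed period.
P_FAIL = 0.001

for n in (1, 100, 1000, 10000):
    print(f"{n:>6} components: {prob_any_failed(n, P_FAIL):.3f}")
```

Even with a small per-component probability, the chance that something in the cluster is broken climbs quickly with cluster size, which is why failure detection and automatic recovery are treated as design goals rather than exceptional cases.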
  • 18. Command Line User Interface
  • 19. MapReduce - Overview - How MapReduce Works
  • 20. Overview - Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. - A MapReduce job usually splits the input data set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a filesystem. The framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks. - The MapReduce framework consists of a single master JobTracker and one worker TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the workers, monitoring them, and re-executing the failed tasks. The workers execute the tasks as directed by the master.
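The map, sort, and reduce flow described above can be imitated in a single Python process. The sketch below counts words and is only an illustration of the programming model; it does not use the Hadoop API, and the function names `map_words`, `reduce_counts`, and `run_job` are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield (word, 1)


def reduce_counts(word, counts):
    """Reduce step: combine all the counts emitted for one word."""
    return (word, sum(counts))


def run_job(lines):
    """Run map, sort, and reduce in one process (a real cluster distributes these)."""
    # Map phase: every line is handled independently of the others.
    mapped = [pair for line in lines for pair in map_words(line)]
    # Sort phase: bring identical keys next to each other, as the framework does.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: one call per distinct key, fed with all of its values.
    return [
        reduce_counts(word, (count for _, count in group))
        for word, group in groupby(mapped, key=itemgetter(0))
    ]


result = run_job(["the quick fox", "the lazy dog", "the fox"])
print(result)
```

Because each call to `map_words` depends only on its own line, the map phase could be spread across many machines without coordination, which is the independence requirement the framework relies on.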
  • 21. How MapReduce Works
  • 22. How MapReduce Works
  • 23. Practice - Demo - Discussion