HADOOP TECHNOLOGY




ABSTRACT

Hadoop is the popular open source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets. Hadoop enables you to explore complex data, using custom analyses tailored to your information and questions. Hadoop is the system that allows unstructured data to be distributed across hundreds or thousands of machines forming shared-nothing clusters, and allows Map/Reduce routines to execute on the data in that cluster. Hadoop has its own filesystem, which replicates data to multiple nodes to ensure that if one node holding data goes down, there are at least two other nodes from which to retrieve that piece of information. This protects data availability from node failure, which is critical when there are many nodes in a cluster (akin to RAID at the server level).

What is Hadoop?

Suppose your data are stored in a relational database on your desktop computer, and this desktop computer has no problem handling the load. Then your company starts growing very quickly, and the data grows to 10 GB, and then 100 GB, and you start to reach the limits of your desktop computer. So you scale up by investing in a larger computer, and you are then OK for a few more months. When your data grows to 10 TB, and then 100 TB, you are fast approaching the limits of that computer as well. Moreover, you are now asked to feed your application with unstructured data coming from sources like Facebook, Twitter, RFID readers, sensors, and so on. Your management wants to derive information from both the relational data and the unstructured data, and wants this information as soon as possible. What should you do? Hadoop may be the answer!

Hadoop is an open source project of the Apache Foundation. It is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant. Hadoop uses Google's MapReduce and Google File System technologies as its foundation. It is optimized to handle massive quantities of data, whether structured, unstructured or semi-structured, using commodity hardware, that is, relatively inexpensive computers. This massively parallel processing is done with great performance. However, it is a batch operation handling massive quantities of data, so the response time is not immediate. As of Hadoop version 0.20.2, updates to stored files are not possible, but appends will be possible starting in version 0.21.

Hadoop replicates its data across different computers, so that if one goes down, the data are processed on one of the replicated computers. Hadoop is not suitable for OnLine Transaction Processing workloads, where data are randomly accessed on structured data like a relational database. Nor is it suitable for OnLine Analytical Processing or Decision Support System workloads, where data are sequentially accessed on structured data like a relational database to generate reports that provide business intelligence. Hadoop is used for Big Data. It complements OnLine Transaction Processing and OnLine Analytical Processing.




It is NOT a replacement for a relational database system.

So, what is Big Data? With all the devices available today to collect data, such as RFID readers, microphones, cameras, sensors, and so on, we are seeing an explosion in data being collected worldwide. Big Data is a term used to describe large collections of data (also known as datasets) that may be unstructured and grow so large and so quickly that they are difficult to manage with regular database or statistics tools. Some interesting statistics illustrating this data explosion: there are more than 2 billion internet users in the world today; there were 4.6 billion mobile phones in 2011; 7 TB of data are processed by Twitter every day; and 10 TB of data are processed by Facebook every day. Interestingly, approximately 80% of these data are unstructured. With this massive quantity of data, businesses need fast, reliable, deeper data insight. Therefore, Big Data solutions based on Hadoop and other analytics software are becoming more and more relevant.

Other open source projects related to Hadoop include:

Eclipse, a popular IDE donated by IBM to the open source community.
Lucene, a text search engine library written in Java.
HBase, the Hadoop database.
Hive, which provides data warehousing tools to extract, transform and load data, and to query data stored in Hadoop files.
Pig, a platform for analyzing large data sets, with a high-level language for expressing data analysis.
Jaql (pronounced "jackal"), a query language for JavaScript Object Notation (JSON).
ZooKeeper, a centralized configuration service and naming registry for large distributed systems.
Avro, a data serialization system.
UIMA, an architecture for the development, discovery, composition and deployment of components for the analysis of unstructured data.

Let's now talk about examples of Hadoop in action. Early in 2011, Watson, a supercomputer developed by IBM, competed on the popular question-and-answer show "Jeopardy!" and succeeded in beating the two most popular players of the game. Watson was fed approximately 200 million pages of text, using Hadoop to distribute the workload of loading this information into memory. Once the information was loaded, Watson used other technologies for advanced search and analysis.

In the telecommunications industry we have China Mobile, a company that built a Hadoop cluster to perform data mining on Call Data Records. China Mobile was producing 5-8 TB of these records daily. By using a Hadoop-based system, they were able to process 10 times as much data as with their old system, and at one fifth of the cost.

In the media we have the New York Times, which wanted to host on its website all public domain articles from 1851 to 1922. It converted articles from 11 million image files into 1.5 TB of PDF documents. This was implemented by one employee, who ran a job in 24 hours on a 100-instance Amazon EC2 Hadoop cluster, at a very low cost.

In the technology field we again have IBM, with IBM ES2, an enterprise search technology based on Hadoop, Lucene and Jaql. ES2 is designed to address unique challenges of enterprise search, such as the use of an enterprise-specific vocabulary, abbreviations and acronyms. ES2 can perform mining tasks to build acronym libraries, regular expression patterns, and geo-classification rules.




There are also many internet and social network companies using Hadoop, such as Yahoo, Facebook, Amazon, eBay, Twitter, StumbleUpon, Rackspace, Ning, AOL, and so on. Yahoo is, of course, the largest production user, with an application running on a Hadoop cluster of approximately 10,000 Linux machines. Yahoo is also the largest contributor to the Hadoop open source project.

Now, Hadoop is not a magic bullet that solves all kinds of problems. Hadoop is not good for processing transactions, because they require random access. It is not good when the work cannot be parallelized, not good for low-latency data access, not good for processing lots of small files, and not good for intensive calculations with little data.

Big Data solutions are also more than just Hadoop. They can add analytic solutions to the mix to derive valuable information that combines structured legacy data with new unstructured data. Big Data solutions may also be used to derive information from data in motion. For example, IBM has a product called InfoSphere Streams that can be used to quickly determine customer sentiment for a new product based on Facebook or Twitter comments.

Finally, one closing thought: cloud computing has gained tremendous traction in the past few years, and it is a perfect fit for Big Data solutions. Using the cloud, a Hadoop cluster can be set up in minutes, on demand, and it can run for as long as is needed without having to pay for more than what is used.
AWARENESS OF THE TOPOLOGY OF THE NETWORK

Hadoop is aware of the topology of the network. This allows it to optimize where it sends the computations to be applied to the data: placing the work as close as possible to the data it operates on maximizes the bandwidth available for reading that data. As an example, suppose the data we wish to process is block B1, stored on node n1 in rack 1. When deciding which TaskTracker should receive a MapTask that reads data from B1, the best option is to choose the TaskTracker that runs on the same node as the data. If we can't place the computation on the same node, our next best option is to place it on a node in the same rack as the data. The worst case that Hadoop currently supports is when the computation must be done from a node in a different rack than the data. When rack awareness is configured for your cluster, Hadoop will always try to run the task on the TaskTracker node with the highest-bandwidth access to the data.
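This preference order can be pictured as a simple cost function. The sketch below is purely illustrative, not Hadoop's actual scheduler code; the Node type and its host and rack fields are hypothetical stand-ins:

// Illustrative sketch of rack-aware placement preference; not Hadoop source code.
record Node(String host, String rack) {}

public class PlacementCost {
    // Lower cost means a better TaskTracker for reading the block.
    static int cost(Node taskTracker, Node blockLocation) {
        if (taskTracker.host().equals(blockLocation.host())) {
            return 0; // node-local: read straight from the local disk
        } else if (taskTracker.rack().equals(blockLocation.rack())) {
            return 1; // rack-local: one hop across the rack switch
        } else {
            return 2; // off-rack: traffic crosses the core switch
        }
    }

    public static void main(String[] args) {
        Node data = new Node("n1", "rack1");
        System.out.println(cost(new Node("n1", "rack1"), data)); // 0
        System.out.println(cost(new Node("n2", "rack1"), data)); // 1
        System.out.println(cost(new Node("n3", "rack2"), data)); // 2
    }
}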
Let us walk through an example of how a file gets written to HDFS. First, the client submits a "create" request to the NameNode. The NameNode checks that the file does not already exist and that the client has permission to write it. If that succeeds, the NameNode determines the DataNode to write the first block to. If the client is running on a DataNode, it will try to place the block there; otherwise, it chooses a DataNode at random. By default, data is replicated to two other places in the cluster, and a pipeline is built between the three DataNodes holding the replicas. The second DataNode is a randomly chosen node on a rack other than that of the first replica; this is to increase redundancy. The final replica is placed on a random node within the same rack as the second replica, and the data is piped from the second DataNode to the third. To ensure the write was successful before continuing, acknowledgment packets are sent back from the third DataNode to the second, from the second DataNode to the first, and from the first DataNode to the client.





This process occurs for each of the blocks that make up the file, in this case the second and the third block. Notice that, for every block, there is a replica on at least two racks. When the client is done writing to the DataNode pipeline and has received acknowledgements, it tells the NameNode that it is complete. The NameNode will check that the blocks are at least minimally replicated before responding.
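From the client's point of view, all of this pipelining is hidden behind the filesystem API. Here is a minimal sketch using Hadoop's Java FileSystem API; the path and file contents are invented for illustration, and a configured cluster (core-site.xml on the classpath) is assumed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Picks up cluster settings, such as the NameNode address, from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // fs.create() issues the "create" request to the NameNode described above;
        // the returned stream writes through the DataNode replication pipeline.
        Path file = new Path("/user/demo/example.txt"); // hypothetical path
        FSDataOutputStream out = fs.create(file);
        out.writeBytes("hello, HDFS\n");
        out.close(); // flushes the last block and completes the file at the NameNode
    }
}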
                                                             this way? Let's say that instead of looping over an
MAP REDUCE                                                   array of three elements, we want
                                                             to process a dataset with billions of elements and
We will look at "the shuffle" that connects the              take advantage of a thousand
output of each mapper to the input of a reducer.             computers running in parallel to quickly process
This will take us into the fundamental datatypes             those billions of elements. If we
used by Hadoop and see an example                            decided to add this parallelism to the original
data flow. Finally, we will examine Hadoop                   program, we would need to rewrite the
MapReduce fault tolerance, scheduling,                       whole program. But if we wanted to parallelize the
and task execution optimizations.                            program written as a call to map,
To understand MapReduce, we need to break it                 we wouldn't need to change our program at all.
into its component operations map                            We would just use a parallel
and reduce. Both of these operations come from               implementation of map.
functional programming languages.                            Reduce is similar. Say you want to sum all the
These are languages that let you pass functions as           elements of an array. You could write
arguments to other functions.                                a for loop that iterates over the array and adds
We'll start with an example using a traditional for          each element to a single variable
loop. Say we want to double every                            named sum. But we can we generalize this.
element in an array. We would write code like that           The body of the for loop takes the current sum and
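The original transcript refers to code shown on a slide that is not reproduced here. The following Java sketch reconstructs that progression; names such as doubleInPlace and map are ours, and the lambda stands in for the one-statement function definition the text mentions:

import java.util.Arrays;
import java.util.function.IntUnaryOperator;

public class MapExample {
    // Step 1: the original for loop, with the doubling hard-coded in its body.
    static void doubleInPlace(int[] a) {
        for (int i = 0; i < a.length; i++) {
            a[i] = a[i] * 2;
        }
    }

    // Step 2: the generalized map operation; the work to do is now the argument fn.
    static int[] map(int[] a, IntUnaryOperator fn) {
        int[] result = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            result[i] = fn.applyAsInt(a[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        // Pass "multiply by 2" as a one-statement function definition.
        int[] doubled = map(a, x -> x * 2);
        System.out.println(Arrays.toString(doubled)); // prints [2, 4, 6]
    }
}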
In summary, we can rewrite a for loop as a map operation taking a function as an argument. Other than saving two lines of code, why is it useful to rewrite our code this way? Let's say that instead of looping over an array of three elements, we want to process a dataset with billions of elements, and to take advantage of a thousand computers running in parallel to process those billions of elements quickly. If we decided to add this parallelism to the original program, we would need to rewrite the whole program. But if we wanted to parallelize the program written as a call to map, we wouldn't need to change our program at all. We would just use a parallel implementation of map.

Reduce is similar. Say you want to sum all the elements of an array. You could write a for loop that iterates over the array and adds each element to a single variable named sum. But we can generalize this. The body of the for loop takes the current sum and the current element of the array and adds them to produce a new sum. Let's replace this with a function that does the same thing. We can replace the body of the for loop with an assignment to s of the output of a function fn. The fn function takes the sum s and the current array element a[i] as its arguments, and its implementation returns the sum of its two arguments. We can now rewrite the sum function so that the function fn is passed in as an argument.





This generalizes our sum function into a reduce function. We will also let the initial value for the sum variable be passed in as an argument. We can now call the function reduce whenever we need to combine the values of an array in some way, whether it is a sum, a concatenation, or some other type of operation we wish to apply.
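A matching Java sketch for reduce, in the same illustrative style as the map example above:

import java.util.function.IntBinaryOperator;

public class ReduceExample {
    // Generalized reduce: combines the elements of a with fn,
    // starting from the caller-supplied initial value.
    static int reduce(int[] a, int initial, IntBinaryOperator fn) {
        int s = initial;
        for (int i = 0; i < a.length; i++) {
            s = fn.applyAsInt(s, a[i]); // replaces "s = s + a[i]" from the original loop
        }
        return s;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        int sum = reduce(a, 0, (s, x) -> s + x);
        System.out.println(sum); // prints 6
    }
}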
Again, the advantage is that, should we wish to handle large amounts of data and parallelize this code, we do not need to change our program; we simply replace the implementation of the reduce function with a more sophisticated one. This is what Hadoop MapReduce is: an implementation of map and reduce that is parallel, distributed, fault-tolerant, and able to efficiently run map and reduce operations over large amounts of data.

MAPREDUCE – SUBMITTING A JOB

The process of running a MapReduce job on Hadoop consists of eight major steps. The first step is that the MapReduce program you have written tells the JobClient to run a MapReduce job. This sends a message to the JobTracker, which produces a unique ID for the job. The JobClient copies job resources, such as a jar file containing the Java code you have written to implement the map or the reduce task, to the shared file system, usually HDFS. Once the resources are in HDFS, the JobClient can tell the JobTracker to start the job. The JobTracker does its own initialization for the job: it calculates how to split the data so that it can send each "split" to a different mapper process to maximize throughput, and it retrieves these "input splits" from the distributed file system. The TaskTrackers are continually sending heartbeat messages to the JobTracker; now that the JobTracker has work for them, it will return a map task or a reduce task as a response to the heartbeat. The TaskTrackers need to obtain the code to execute, so they get it from the shared file system. Then they can launch a Java Virtual Machine with a child process running in it, and this child process runs your map code or your reduce code.
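On the client side, these steps are kicked off by a small driver program. The sketch below uses the classic org.apache.hadoop.mapred API from the JobTracker era described here; WordCountMapper and WordCountReducer are the illustrative classes sketched later in the fundamental data types section:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class); // locates the job jar
        conf.setJobName("wordcount");

        // Output key/value types produced by the mapper and the reducer.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // The map and reduce implementations (sketched below).
        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Submits the job to the JobTracker and waits for it to complete.
        JobClient.runJob(conf);
    }
}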
MAPREDUCE – MERGESORT/SHUFFLE

Suppose we have a job with a single map step and a single reduce step.




The first step is the map step. It takes a subset of the full data set, called an input split, and applies to each row in the input split an operation you have written, such as the "multiply the value by two" operation we used in our earlier map example. There may be multiple map operations running in parallel with each other, each one processing a different input split. The output data is buffered in memory and spills to disk. It is sorted and partitioned by key using the default partitioner, and a merge sort sorts each partition. The partitions are then shuffled amongst the reducers; for example, partition 1 goes to reducer 1. The second map task also sends its partition 1 to reducer 1, while partition 2 goes to the other reducer. Each reducer does its own merge steps and executes the code of your reduce task. For example, it could do a sum like the one we used in the earlier reduce example. This produces sorted output at each reducer.
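The default partitioner mentioned above assigns each key to a reducer by hashing it. The one-method sketch below mirrors the behavior of Hadoop's default HashPartitioner, with the surrounding class plumbing omitted:

// Sketch of the default hash partitioning rule: a record's key alone determines
// its partition, so all records with the same key reach the same reducer.
public class HashPartitionRule {
    static int getPartition(Object key, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then take
        // the remainder modulo the number of reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(getPartition("B", 2)); // always the same reducer for "B"
    }
}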
MAPREDUCE – FUNDAMENTAL DATA TYPES

The data that flows into and out of the mappers and reducers takes a specific form. Data enters Hadoop in unstructured form, but before it gets to the first mapper, Hadoop has changed it into key-value pairs, with Hadoop supplying its own key. The mapper produces a list of key-value pairs. Both the key and the value may change from the k1 and v1 that came in to a k2 and v2, and there can now be duplicate keys coming out of the mappers; the shuffle step will take care of grouping them together. The output of the shuffle is the input to the reducer step. We still have a list of the v2's that came out of the mapper step, but they are now grouped by their keys, and there is no longer more than one record with the same key. Finally, coming out of the reducer is, potentially, an entirely new key and value, k3 and v3. For example, if your reducer summed the values associated with each k2, your k3 would be equal to k2 and your v3 would be the sum of the list of v2's.

Let us look at an example of a simple data flow. Say we want to transform an input consisting of letters into an output giving the number of occurrences of each letter. Hadoop does the first step for us: it turns the input data into key-value pairs and supplies its own key, an increasing sequence number. The function we write for the mapper needs to take these key-value pairs and produce something that the reduce step can use to count occurrences. The simplest solution is to make each letter a key and make every value a 1. The shuffle groups records having the same key together, so a letter such as B that occurs twice in the input now has two values, both 1, associated with it. The reduce is simple: it just sums the values it is given to produce a sum for each key.
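In Hadoop's Java API these data types appear directly in the class signatures. Below is a minimal word-count pair in the classic org.apache.hadoop.mapred API, matching the driver sketched earlier; the two classes are shown together here but would each live in their own source file. Here (k1, v1) is (LongWritable, Text), (k2, v2) is (Text, IntWritable), and (k3, v3) is again (Text, IntWritable):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// (k1, v1) = (byte offset supplied by Hadoop, line of text);
// (k2, v2) = (word, 1).
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            output.collect(word, ONE); // emit (word, 1) for the shuffle to group
        }
    }
}

// (k3, v3) = (word, total count); here k3 equals k2 and v3 is the sum of the v2's.
public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}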
MAPREDUCE – FAULT TOLERANCE

The first kind of failure is a failure of the task, which could be due to a bug in the code of your map task or reduce task. The JVM tells the TaskTracker, Hadoop counts this as a failed attempt, and it can start up a new task. What if the task hangs rather than fails? That is detected too, and the JobTracker can run your task again on a different machine in case it was a hardware problem. If the task continues to fail on each new attempt, Hadoop will fail the job altogether. The next kind of failure is a failure of the TaskTracker itself.





The JobTracker will know, because it is expecting a heartbeat. If it doesn't get a heartbeat, it removes that TaskTracker from the TaskTracker pool. Finally, what if the JobTracker fails? There is only one JobTracker; if it fails, your job fails.

MAPREDUCE – SCHEDULING & TASK EXECUTION

So far we have looked at how Hadoop executes a single job, as if it were the only job on the system. But it would be unfortunate if all of your valuable data could only be queried by one user at a time. Hadoop schedules jobs using one of three schedulers. The simplest is the default FIFO scheduler. It lets users submit jobs while other jobs are running, but queues these jobs so that only one of them runs at a time. The fair scheduler is more sophisticated: it lets multiple users compete over cluster resources and tries to give every user an equal share, and it also supports guaranteed minimum capacities. The capacity scheduler takes a different approach: from each user's perspective, it appears that they have the cluster to themselves with FIFO scheduling, but users are actually sharing the resources.

Hadoop offers some configuration options for speeding up the execution of your map and reduce tasks under certain conditions. One such option is speculative execution: when a task takes a long time to run, Hadoop detects this and launches a second copy of your task on a different node. Because the tasks are designed to be self-contained and independent, starting a second copy does not affect the final answer. Whichever copy of the task finishes first has its output go to the next phase; the other task's redundant output is discarded. Another option for improving performance is to reuse the Java Virtual Machine. The default is to put each task in its own JVM for isolation purposes, but starting up a JVM can be relatively expensive when jobs are short, so you have the option to reuse the same JVM from one task to the next.
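Both options are set per job through the configuration. The sketch below uses 0.20-era property names; these key names changed across Hadoop versions, so treat them as assumptions to verify against your release:

import org.apache.hadoop.mapred.JobConf;

public class TuningExample {
    static void applyTuning(JobConf conf) {
        // Speculative execution: allow backup copies of slow map and reduce tasks.
        conf.setBoolean("mapred.map.tasks.speculative.execution", true);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);

        // JVM reuse: run up to 10 tasks per JVM instead of one JVM per task;
        // a value of -1 would mean no limit on tasks per JVM.
        conf.setInt("mapred.job.reuse.jvm.num.tasks", 10);
    }
}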
SUMMARY

One thing is certain: by the time the sixth annual Hadoop Summit comes around next year, Big Data will be bigger. Business applications that are emerging now will be furthered as more enterprises incorporate big data analytics and HDP solutions into their architecture. New solutions in fields like healthcare, with disease detection and coordination of patient care, will become more mainstream. Crime detection and prevention will benefit as the industry further harnesses the new technology. Hadoop and Big Data promise not only greatly enhanced marketing and product development; they also hold the power to drive positive global social impact around improved wellness outcomes, security, and many other areas. This, when you think about it, fits perfectly with the spirit of the Summit, which calls for continued stewardship of the Hadoop platform and promotion of the associated technology by open-source and commercial entities.

REFERENCES

Google MapReduce: http://labs.google.com/papers/mapreduce.html

Hadoop Distributed File System: http://hadoop.apache.org/hdfs




Contenu connexe

Tendances

Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoopManoj Jangalva
 
Presentation on Hadoop Technology
Presentation on Hadoop TechnologyPresentation on Hadoop Technology
Presentation on Hadoop TechnologyOpenDev
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation Shivanee garg
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce FrameworkEdureka!
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringBADR
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingCloudera, Inc.
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache HadoopAjit Koti
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
 

Tendances (20)

Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Hadoop
HadoopHadoop
Hadoop
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoop
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
 
Hadoop
HadoopHadoop
Hadoop
 
Presentation on Hadoop Technology
Presentation on Hadoop TechnologyPresentation on Hadoop Technology
Presentation on Hadoop Technology
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data Engineering
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big data
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data Processing
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
An Introduction to the World of Hadoop
An Introduction to the World of HadoopAn Introduction to the World of Hadoop
An Introduction to the World of Hadoop
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 

En vedette

Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 
Hadoop 20111117
Hadoop 20111117Hadoop 20111117
Hadoop 20111117exsuns
 
Seminar Report On Amazon Web Service
Seminar Report On Amazon Web ServiceSeminar Report On Amazon Web Service
Seminar Report On Amazon Web Serviceshishupal choudhary
 
BigInsights BigData Study 2013 - Exec Summary
BigInsights BigData Study 2013  - Exec SummaryBigInsights BigData Study 2013  - Exec Summary
BigInsights BigData Study 2013 - Exec SummaryBigInsights
 
Visual Cryptography Industrial Training Report
Visual Cryptography Industrial Training ReportVisual Cryptography Industrial Training Report
Visual Cryptography Industrial Training ReportMohit Kumar
 
Seminar report on Symbian OS
Seminar report on Symbian OSSeminar report on Symbian OS
Seminar report on Symbian OSDarsh Kotecha
 
Swot Analysis Seminar
Swot Analysis SeminarSwot Analysis Seminar
Swot Analysis Seminarcarolpage
 
Benford's Law and Fraud Detection
Benford's Law and Fraud DetectionBenford's Law and Fraud Detection
Benford's Law and Fraud DetectionOzan Gurel
 
A minor project report HOME AUTOMATION USING MOBILE PHONES
A minor project report HOME AUTOMATION  USING  MOBILE PHONESA minor project report HOME AUTOMATION  USING  MOBILE PHONES
A minor project report HOME AUTOMATION USING MOBILE PHONESashokkok
 
Seminar report on paper battery
Seminar report on paper batterySeminar report on paper battery
Seminar report on paper batterymanish katara
 
Information technology ppt
Information technology ppt Information technology ppt
Information technology ppt Babasab Patil
 

En vedette (20)

Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Hadoop 20111117
Hadoop 20111117Hadoop 20111117
Hadoop 20111117
 
Seminar Report On Amazon Web Service
Seminar Report On Amazon Web ServiceSeminar Report On Amazon Web Service
Seminar Report On Amazon Web Service
 
BigInsights BigData Study 2013 - Exec Summary
BigInsights BigData Study 2013  - Exec SummaryBigInsights BigData Study 2013  - Exec Summary
BigInsights BigData Study 2013 - Exec Summary
 
Dna cryptography
Dna cryptographyDna cryptography
Dna cryptography
 
5g
5g5g
5g
 
Visual Cryptography Industrial Training Report
Visual Cryptography Industrial Training ReportVisual Cryptography Industrial Training Report
Visual Cryptography Industrial Training Report
 
Seminar report on Symbian OS
Seminar report on Symbian OSSeminar report on Symbian OS
Seminar report on Symbian OS
 
Swot Analysis Seminar
Swot Analysis SeminarSwot Analysis Seminar
Swot Analysis Seminar
 
Project ara report 2
Project ara report 2Project ara report 2
Project ara report 2
 
5G report
5G report5G report
5G report
 
5G Technology
5G Technology5G Technology
5G Technology
 
Benford's Law and Fraud Detection
Benford's Law and Fraud DetectionBenford's Law and Fraud Detection
Benford's Law and Fraud Detection
 
Introduction to Amazon EC2
Introduction to Amazon EC2Introduction to Amazon EC2
Introduction to Amazon EC2
 
A data analyst view of Bigdata
A data analyst view of Bigdata A data analyst view of Bigdata
A data analyst view of Bigdata
 
What is AWS?
What is AWS?What is AWS?
What is AWS?
 
A minor project report HOME AUTOMATION USING MOBILE PHONES
A minor project report HOME AUTOMATION  USING  MOBILE PHONESA minor project report HOME AUTOMATION  USING  MOBILE PHONES
A minor project report HOME AUTOMATION USING MOBILE PHONES
 
Seminar report on paper battery
Seminar report on paper batterySeminar report on paper battery
Seminar report on paper battery
 
Introducing DevOps
Introducing DevOpsIntroducing DevOps
Introducing DevOps
 
Information technology ppt
Information technology ppt Information technology ppt
Information technology ppt
 

Similaire à Hadoop technology doc

Similaire à Hadoop technology doc (20)

00 hadoop welcome_transcript
00 hadoop welcome_transcript00 hadoop welcome_transcript
00 hadoop welcome_transcript
 
1. what is hadoop part 1
1. what is hadoop   part 11. what is hadoop   part 1
1. what is hadoop part 1
 
Big data
Big dataBig data
Big data
 
Hadoop
HadoopHadoop
Hadoop
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introduction
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony Nguyen
 
Big Data Hadoop Technology
Big Data Hadoop TechnologyBig Data Hadoop Technology
Big Data Hadoop Technology
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop An Introduction
Hadoop An IntroductionHadoop An Introduction
Hadoop An Introduction
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 
HDFS
HDFSHDFS
HDFS
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop Bootcamp
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Data infrastructure at Facebook
Data infrastructure at Facebook Data infrastructure at Facebook
Data infrastructure at Facebook
 
Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics Hadoop
 
OPERATING SYSTEM .pptx
OPERATING SYSTEM .pptxOPERATING SYSTEM .pptx
OPERATING SYSTEM .pptx
 
2.1-HADOOP.pdf
2.1-HADOOP.pdf2.1-HADOOP.pdf
2.1-HADOOP.pdf
 
Hadoop
HadoopHadoop
Hadoop
 

Dernier

CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 

Dernier (20)

CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 

Hadoop technology doc

  • 1. HADOOP TECHNOLOGY ABSTRACT Hadoop is the popular open source like Facebook, Twitter, RFID readers, sensors, and implementation of MapReduce, a powerful tool so on.Your management wants to derive designed for deep analysis and transformation of information from both the relational data and the very large data sets. Hadoop enables you to unstructured explore complex data, using custom analyses data, and wants this information as soon as tailored to your information and questions. possible. Hadoop is the system that allows unstructured What should you do? Hadoop may be the answer! data to be distributed across hundreds or Hadoop is an open source project of the Apache thousands of machines forming shared nothing Foundation. clusters, and the execution of Map/Reduce It is a framework written in Java originally routines to run on the data in that cluster. Hadoop developed by Doug Cutting who named it after his has its own filesystem which replicates data to son's toy elephant. multiple nodes to ensure if one node holding data Hadoop uses Google’s MapReduce and Google File goes down, there are at least 2 other nodes from System technologies as its foundation. which to retrieve that piece of information. This It is optimized to handle massive quantities of data protects the data availability from node failure, which could be structured, unstructured or something which is critical when there are many semi-structured, using commodity hardware, that nodes in a cluster (aka RAID at a server level). is, relatively inexpensive computers. This massive parallel processing is done with great What is Hadoop? performance. However, it is a batch operation handling massive quantities of data, so the The data are stored in a relational database in your response time is not immediate. desktop computer and this desktop computer As of Hadoop version 0.20.2, updates are not has no problem handling this load. possible, but appends will be possible starting in Then your company starts growing very quickly, version 0.21. and that data grows to 10GB. Hadoop replicates its data across different And then 100GB. computers, so that if one goes down, the data are And you start to reach the limits of your current processed on one of the replicated computers. desktop computer. Hadoop is not suitable for OnLine Transaction So you scale-up by investing in a larger computer, Processing workloads where data are randomly and you are then OK for a few more months. accessed on structured data like a relational When your data grows to 10TB, and then 100TB. database.Hadoop is not suitable for OnLine And you are fast approaching the limits of that Analytical Processing or Decision Support System computer. workloads where data are sequentially accessed on Moreover, you are now asked to feed your structured data like a relational database, to application with unstructured data coming from generate reports that provide business sources intelligence. Hadoop is used for Big Data. It complements OnLine Transaction Processing and OnLine Analytical Processing. 1
  • 2. HADOOP TECHNOLOGY It is NOT a replacement for a relational database Avro is a data serialization system. system. UIMA is the architecture for the development, So, what is Big Data? discovery, composition and deployment for the With all the devices available today to collect data, analysis of unstructured data. such as RFID readers, microphones, cameras, Let’s now talk about examples of Hadoop in action. sensors, and so on, we are seeing an explosion in Early in 2011, Watson, a super computer data being collected worldwide. developed by IBM competed in the popular Big Data is a term used to describe large collections Question and of data (also known as datasets) that may be Answer show “Jeopardy!”. unstructured, and grow so large and quickly that it Watson was successful in beating the two most is difficult to manage with regular database or popular players in that game. statistics tools. It was input approximately 200 million pages of Other interesting statistics providing examples of text using Hadoop to distribute the workload for this data explosion are: loading this information into memory. There are more than 2 billion internet users in the Once the information was loaded, Watson used world today, other technologies for advanced search and and 4.6 billion mobile phones in 2011, analysis. and 7TB of data are processed by Twitter every In the telecommunications industry we have China day, Mobile, a company that built a Hadoop cluster and 10TB of data are processed by Facebook every to perform data mining on Call Data Records. day. China Mobile was producing 5-8TB of these Interestingly, approximately 80% of these data are records daily. By using a Hadoop-based system unstructured. they With this massive quantity of data, businesses were able to process 10 times as much data as need fast, reliable, deeper data insight. when using their old system, Therefore, Big Data solutions based on Hadoop and and at one fifth of the cost. other analytics software are becoming more In the media we have the New York Times which and more relevant. wanted to host on their website all public This is a list of other open source projects related domain articles from 1851 to 1922. to Hadoop: They converted articles from 11 million image files Eclipse is a popular IDE donated by IBM to the to 1.5TB of PDF documents. This was open source community. implemented by one employee who ran a job in 24 Lucene is a text search engine library written in hours on a 100-instance Amazon EC2 Hadoop Java. cluster Hbase is the Hadoop database. at a very low cost. Hive provides data warehousing tools to extract, In the technology field we again have IBM with transform and load data, and query this data IBM ES2, an enterprise search technology based stored in Hadoop files. on Hadoop, Lucene and Jaql. Pig is a platform for analyzing large data sets. It is a ES2 is designed to address unique challenges of high level language for expressing data enterprise search such as the use of an analysis. enterprisespecific Jaql, or jackal, is a query language for JavaScript vocabulary, abbreviations and acronyms. open notation. ES2 can perform mining tasks to build acronym Zoo Keeper is a centralized configuration service libraries, regular expression patterns, and and naming registry for large distributed geoclassification systems. rules. 2
There are also many internet and social network companies using Hadoop, such as Yahoo, Facebook, Amazon, eBay, Twitter, StumbleUpon, Rackspace, Ning, AOL, and so on. Yahoo is, of course, the largest production user, with an application running on a Hadoop cluster consisting of approximately 10,000 Linux machines. Yahoo is also the largest contributor to the Hadoop open source project.

Now, Hadoop is not a magic bullet that solves all kinds of problems. Hadoop is not good for processing transactions, because they require random access. It is not good when the work cannot be parallelized, and it is not good for low-latency data access, for processing lots of small files, or for intensive calculations with little data.

Big Data solutions are also more than just Hadoop. They can add analytic solutions to the mix to derive valuable information that combines structured legacy data with new unstructured data. Big Data solutions may also be used to derive information from data in motion; for example, IBM has a product called InfoSphere Streams that can be used to quickly determine customer sentiment for a new product based on Facebook or Twitter comments.

Finally, one last thought before moving on: cloud computing has gained tremendous traction in the past few years, and it is a perfect fit for Big Data solutions. Using the cloud, a Hadoop cluster can be set up in minutes, on demand, and it can run for as long as is needed without having to pay for more than what is used.

AWARENESS OF THE TOPOLOGY OF THE NETWORK

Hadoop has awareness of the topology of the network. This allows it to optimize where it sends the computations to be applied to the data, because placing the work as close as possible to the data it operates on maximizes the bandwidth available for reading that data. Suppose the data we wish to process is block B1, stored on node n1 in rack 1. When deciding which TaskTracker should receive a MapTask that reads data from B1, the best option is to choose the TaskTracker that runs on the same node as the data. If we can't place the computation on the same node, our next best option is to place it on a node in the same rack as the data. The worst case that Hadoop currently supports is when the computation must be done from a node in a different rack than the data. When rack-awareness is configured for your cluster, Hadoop will always try to run the task on the TaskTracker node with the highest-bandwidth access to the data.

Let us walk through an example of how a file gets written to HDFS. First, the client submits a "create" request to the NameNode. The NameNode checks that the file does not already exist and that the client has permission to write the file. If that succeeds, the NameNode determines the DataNode to write the first block to. If the client is running on a DataNode, it will try to place the block there; otherwise, it chooses a DataNode at random. By default, data is replicated to two other places in the cluster, and a pipeline is built between the three DataNodes that hold the replicas. The second DataNode is a randomly chosen node on a rack other than that of the first replica; this is to increase redundancy. The final replica is placed on a random node within the same rack as the second replica. The data is piped from DataNode to DataNode down this pipeline, so the second DataNode forwards it to the third. To ensure the write was successful before continuing, acknowledgment packets are sent back from the third DataNode to the second, from the second DataNode to the first, and from the first DataNode to the client.
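From the application's point of view, all of this pipelining and acknowledgement happens behind a single output stream. As a rough sketch, assuming an already configured cluster and a hypothetical file path, the client side of this write path looks something like the following with Hadoop's Java FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // The configuration's fs.default.name property normally points at the NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The "create" request goes to the NameNode, which checks that the file
        // does not already exist and that the client may write it.
        Path file = new Path("/user/demo/example.txt"); // hypothetical path
        FSDataOutputStream out = fs.create(file);

        // Bytes written here are streamed into the DataNode pipeline;
        // replication and acknowledgements happen behind this call.
        out.writeUTF("hello hadoop");

        // Closing the stream completes the blocks; the client then tells
        // the NameNode that the file is complete.
        out.close();
        fs.close();
    }
}

Note that nothing in this code mentions blocks, racks or replicas; the placement policy described above is applied entirely by HDFS itself.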
This process occurs for each of the blocks that make up the file, in this case the second and the third block as well. Notice that, for every block, there is a replica on at least two racks. When the client is done writing to the DataNode pipeline and has received the acknowledgements, it tells the NameNode that the write is complete. The NameNode then checks that the blocks are at least minimally replicated before responding.

MAP REDUCE

We will look at "the shuffle" that connects the output of each mapper to the input of a reducer. This will take us into the fundamental data types used by Hadoop, and we will see an example data flow. Finally, we will examine Hadoop MapReduce fault tolerance, scheduling, and task execution optimizations.

To understand MapReduce, we need to break it into its component operations, map and reduce. Both of these operations come from functional programming languages, which are languages that let you pass functions as arguments to other functions.

We'll start with an example using a traditional for loop. Say we want to double every element in an array. We would write a simple loop over the array. The variable "a" enters the for loop as [1,2,3] and comes out as [2,4,6]: each array element is mapped to a new value that is double the old value. The body of the for loop, which does the doubling, can be written as a function, so that a[i] is the result of applying the function fn to a[i]. We define fn as a function that returns its argument multiplied by 2. This allows us to generalize the code: instead of only being able to use it to double numbers, we can use it for any kind of map operation. We will call this generalized function "map" and pass the function fn as an argument to map. We now have a general function named map and can pass our "multiply by 2" function as an argument. Writing the function definition in one statement is a common idiom in functional programming languages.

In summary, we can rewrite a for loop as a map operation taking a function as an argument. Other than saving two lines of code, why is it useful to rewrite our code this way? Let's say that instead of looping over an array of three elements, we want to process a dataset with billions of elements and take advantage of a thousand computers running in parallel to quickly process those billions of elements. If we decided to add this parallelism to the original program, we would need to rewrite the whole program. But if we wanted to parallelize the program written as a call to map, we wouldn't need to change our program at all; we would just use a parallel implementation of map.

Reduce is similar. Say you want to sum all the elements of an array. You could write a for loop that iterates over the array and adds each element to a single variable named sum. But we can generalize this too. The body of the for loop takes the current sum and the current element of the array and adds them to produce a new sum. Let's replace this with a function that does the same thing: we replace the body of the for loop with an assignment to s of the output of a function fn, where fn takes the sum s and the current array element a[i] as its arguments and returns the sum of its two arguments. We can now rewrite the sum function so that the function fn is passed in as an argument. The sketch below shows both rewrites.
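Concretely, the two rewrites might look like the following in Java. The helper names map, reduce and fn are our own, not part of any library, and we also let reduce take the initial value as an argument, anticipating the generalization described next.

import java.util.function.IntBinaryOperator;
import java.util.function.IntUnaryOperator;

public class MapReduceIdea {
    // The doubling loop, generalized: apply fn to every element.
    static int[] map(int[] a, IntUnaryOperator fn) {
        int[] out = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = fn.applyAsInt(a[i]); // a[i] is the result of applying fn to a[i]
        }
        return out;
    }

    // The summing loop, generalized: combine elements with fn,
    // starting from an initial value passed in by the caller.
    static int reduce(int[] a, int initial, IntBinaryOperator fn) {
        int s = initial;
        for (int i = 0; i < a.length; i++) {
            s = fn.applyAsInt(s, a[i]); // new sum = fn(old sum, element)
        }
        return s;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        int[] doubled = map(a, x -> x * 2);      // [2, 4, 6]
        int sum = reduce(a, 0, (s, x) -> s + x); // 6
        System.out.println(java.util.Arrays.toString(doubled) + " " + sum);
    }
}

The point is that the callers only say what to do with each element; how the elements are traversed, and potentially distributed across machines, is entirely the business of map and reduce.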
This generalizes our sum function into a reduce function. We will also let the initial value for the sum variable be passed in as an argument. We can now call the function reduce whenever we need to combine the values of an array in some way, whether it is a sum, a concatenation, or some other type of operation we wish to apply. Again, the advantage is that, should we wish to handle large amounts of data and parallelize this code, we do not need to change our program; we simply replace the implementation of the reduce function with a more sophisticated one. This is what Hadoop MapReduce is: an implementation of map and reduce that is parallel, distributed and fault-tolerant, and that can efficiently run map and reduce operations over large amounts of data.

MAPREDUCE – SUBMITTING A JOB

The process of running a MapReduce job on Hadoop consists of 8 major steps. The first step is that the MapReduce program you've written tells the JobClient to run a MapReduce job. This sends a message to the JobTracker, which produces a unique ID for the job. The JobClient copies job resources, such as a jar file containing the Java code you have written to implement the map or the reduce task, to the shared file system, usually HDFS. Once the resources are in HDFS, the JobClient can tell the JobTracker to start the job. The JobTracker does its own initialization for the job: it calculates how to split the data so that it can send each "split" to a different mapper process to maximize throughput, and it retrieves these "input splits" from the distributed file system. The TaskTrackers are continually sending heartbeat messages to the JobTracker; now that the JobTracker has work for them, it will return a map task or a reduce task as a response to the heartbeat. The TaskTrackers need to obtain the code to execute, so they get it from the shared file system. Then they can launch a Java Virtual Machine with a child process running in it, and this child process runs your map code or your reduce code.
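As a sketch of that first step, a minimal driver program using the old org.apache.hadoop.mapred API of this era might look like the following. The job name and the LetterMapper and LetterReducer classes are hypothetical; they correspond to the letter-counting example sketched at the end of the next section.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitJobExample {
    public static void main(String[] args) throws Exception {
        // Configure the job: which jar to ship, which classes implement
        // the map and reduce tasks, and the output key/value types.
        JobConf conf = new JobConf(SubmitJobExample.class);
        conf.setJobName("letter-count"); // hypothetical job name
        conf.setMapperClass(LetterMapper.class);
        conf.setReducerClass(LetterReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // runJob hands the configuration to the JobClient, which contacts
        // the JobTracker, copies the job resources to HDFS, starts the job
        // and blocks until it completes.
        JobClient.runJob(conf);
    }
}

Everything after this call, the input splits, the heartbeats and the child JVMs, is handled by the framework as just described.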
MAPREDUCE – MERGESORT/SHUFFLE

Suppose we have a job with a single map step and a single reduce step. The first step is the map step. It takes a subset of the full data set, called an input split, and applies to each row in the input split an operation you have written, such as the "multiply the value by two" operation we used in our earlier map example. There may be multiple map operations running in parallel with each other, each one processing a different input split. The output data is buffered in memory and spills to disk. It is sorted and partitioned by key using the default partitioner, and a merge sort sorts each partition. The partitions are then shuffled amongst the reducers; for example, partition 1 goes to reducer 1, the second map task also sends its partition 1 to reducer 1, and partition 2 goes to the other reducer. Each reducer does its own merge steps and executes the code of your reduce task; for example, it could compute a sum like the one we used in the earlier reduce example. This produces sorted output at each reducer.

MAPREDUCE – FUNDAMENTAL DATA TYPES

The data that flows into and out of the mappers and reducers takes a specific form. Data enters Hadoop in unstructured form, but before it gets to the first mapper, Hadoop has changed it into key-value pairs, with Hadoop supplying its own key. The mapper produces a list of key-value pairs, and both the key and the value may change from the k1 and v1 that came in to a k2 and v2. There can now be duplicate keys coming out of the mappers; the shuffle step will take care of grouping them together. The output of the shuffle is the input to the reducer step. We still have a list of the v2's that came out of the mapper step, but they are now grouped by their keys, and there is no longer more than one record with the same key. Finally, coming out of the reducer is, potentially, an entirely new key and value, k3 and v3. For example, if your reducer summed the values associated with each k2, your k3 would be equal to k2 and your v3 would be the sum of the list of v2's.

Let us look at an example of a simple data flow. Say the input is a list of letters, and we want to produce as output a count of the number of occurrences of each letter. Hadoop does the first step for us: it turns the input data into key-value pairs, supplying its own key, an increasing sequence number. The function we write for the mapper needs to take these key-value pairs and produce something that the reduce step can use to count occurrences. The simplest solution is to make each letter a key and make every value a 1. The shuffle then groups records having the same key together, so a letter such as B that occurs twice now has two values, both 1, associated with it. The reduce is simple: it just sums the values it is given to produce a sum for each key. A sketch of such a mapper and reducer follows.
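Here is one way the letter-count mapper and reducer might be written against the old org.apache.hadoop.mapred API, matching the hypothetical driver shown earlier. With the default TextInputFormat and one letter per line, the supplied k1 is the byte offset of each line, which plays the role of the increasing sequence number described above.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// (k1, v1) is (offset, letter); the mapper emits (letter, 1) as its (k2, v2).
public class LetterMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out,
                    Reporter reporter) throws IOException {
        out.collect(value, ONE); // each letter becomes a key with value 1
    }
}

// After the shuffle, each letter arrives with the list of its 1s;
// the reducer sums them, so (k3, v3) is (letter, count).
class LetterReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> out,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        out.collect(key, new IntWritable(sum));
    }
}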
MAPREDUCE – FAULT TOLERANCE

The first kind of failure is a failure of the task, which could be due to a bug in the code of your map task or reduce task. The JVM reports the failure to the TaskTracker; Hadoop counts this as a failed attempt and can start up a new task. What if the task hangs rather than fails? That is detected too, and the JobTracker can run your task again on a different machine, in case it was a hardware problem. If the task continues to fail on each new attempt, Hadoop will fail the job altogether.

The next kind of failure is a failure of the TaskTracker itself. The JobTracker will know, because it is expecting a heartbeat; if it doesn't get one, it removes that TaskTracker from the TaskTracker pool. Finally, what if the JobTracker fails? There is only one JobTracker, so if it fails, your job fails.

MAPREDUCE – SCHEDULING & TASK EXECUTION

So far we have looked at how Hadoop executes a single job, as if it were the only job on the system. But it would be unfortunate if all of your valuable data could only be queried by one user at a time. Hadoop schedules jobs using one of three schedulers. The simplest is the default FIFO scheduler: it lets users submit jobs while other jobs are running, but queues these jobs so that only one of them runs at a time. The fair scheduler is more sophisticated: it lets multiple users compete over cluster resources and tries to give every user an equal share, and it also supports guaranteed minimum capacities. The capacity scheduler takes a different approach: from each user's perspective, it appears that they have the cluster to themselves with FIFO scheduling, but the users are actually sharing the resources.

Hadoop also offers some configuration options for speeding up the execution of your map and reduce tasks under certain conditions. One such option is speculative execution: when a task takes a long time to run, Hadoop detects this and launches a second copy of your task on a different node. Because the tasks are designed to be self-contained and independent, starting a second copy does not affect the final answer. Whichever copy of the task finishes first has its output go to the next phase, and the other task's redundant output is discarded. Another option for improving performance is to reuse the Java Virtual Machine. The default is to put each task in its own JVM for isolation purposes, but starting up a JVM can be relatively expensive when jobs are short, so you have the option to reuse the same JVM from one task to the next.

SUMMARY

One thing is certain: by the time the sixth annual Hadoop Summit comes around next year, Big Data will be bigger. Business applications that are emerging now will be furthered as more enterprises incorporate big data analytics and HDP solutions into their architectures. New solutions in fields like healthcare, with disease detection and coordination of patient care, will become more mainstream, and crime detection and prevention will benefit as the industry further harnesses the new technology. Hadoop and Big Data promise not only greatly enhanced marketing and product development; they also hold the power to drive positive global social impact around improved wellness outcomes, security, and many other areas. This, when you think about it, fits perfectly with the spirit of the Summit, which calls for continued stewardship of the Hadoop platform and promotion of the associated technology by open-source and commercial entities.

REFERENCES

Google MapReduce: http://labs.google.com/papers/mapreduce.html
Hadoop Distributed File System: http://hadoop.apache.org/hdfs