1. HADOOP TECHNOLOGY
ABSTRACT
Hadoop is the popular open source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets. Hadoop enables you to explore complex data, using custom analyses tailored to your information and questions. Hadoop is the system that allows unstructured data to be distributed across hundreds or thousands of machines forming shared-nothing clusters, and the execution of Map/Reduce routines to run on the data in those clusters. Hadoop has its own filesystem, which replicates data to multiple nodes to ensure that if one node holding data goes down, there are at least two other nodes from which to retrieve that piece of information. This protects data availability from node failure, something which is critical when there are many nodes in a cluster (like RAID, but at the server level).

What is Hadoop?

The data are stored in a relational database on your desktop computer, and this desktop computer has no problem handling the load. Then your company starts growing very quickly, and that data grows to 10GB. And then 100GB. And you start to reach the limits of your current desktop computer. So you scale up by investing in a larger computer, and you are then OK for a few more months. Then your data grows to 10TB, and then 100TB, and you are fast approaching the limits of that computer. Moreover, you are now asked to feed your application with unstructured data coming from sources like Facebook, Twitter, RFID readers, sensors, and so on. Your management wants to derive information from both the relational data and the unstructured data, and wants this information as soon as possible.

What should you do? Hadoop may be the answer!

Hadoop is an open source project of the Apache Foundation. It is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant. Hadoop uses Google's MapReduce and Google File System technologies as its foundation. It is optimized to handle massive quantities of data, which could be structured, unstructured or semi-structured, using commodity hardware, that is, relatively inexpensive computers. This massively parallel processing is done with great performance. However, because it is a batch operation handling massive quantities of data, the response time is not immediate. As of Hadoop version 0.20.2, updates are not possible, but appends will be possible starting in version 0.21.

Hadoop replicates its data across different computers, so that if one goes down, the data are processed on one of the replicated computers. Hadoop is not suitable for OnLine Transaction Processing workloads, where data are randomly accessed on structured data like a relational database. Nor is it suitable for OnLine Analytical Processing or Decision Support System workloads, where data are sequentially accessed on structured data like a relational database to generate reports that provide business intelligence. Hadoop is used for Big Data. It complements OnLine Transaction Processing and OnLine Analytical Processing.
It is NOT a replacement for a relational database system.

So, what is Big Data? With all the devices available today to collect data, such as RFID readers, microphones, cameras, sensors, and so on, we are seeing an explosion in data being collected worldwide. Big Data is a term used to describe large collections of data (also known as datasets) that may be unstructured, and grow so large and quickly that they are difficult to manage with regular database or statistics tools.

Other interesting statistics providing examples of this data explosion are: there are more than 2 billion internet users in the world today, and 4.6 billion mobile phones in 2011; 7TB of data are processed by Twitter every day, and 10TB of data are processed by Facebook every day. Interestingly, approximately 80% of these data are unstructured. With this massive quantity of data, businesses need fast, reliable, deeper data insight. Therefore, Big Data solutions based on Hadoop and other analytics software are becoming more and more relevant.

This is a list of other open source projects related to Hadoop:
- Eclipse is a popular IDE donated by IBM to the open source community.
- Lucene is a text search engine library written in Java.
- HBase is the Hadoop database.
- Hive provides data warehousing tools to extract, transform and load data, and to query this data stored in Hadoop files.
- Pig is a platform for analyzing large data sets. It is a high-level language for expressing data analysis.
- Jaql, or "jackal", is a query language for JavaScript Object Notation (JSON).
- ZooKeeper is a centralized configuration service and naming registry for large distributed systems.
- Avro is a data serialization system.
- UIMA is the architecture for the development, discovery, composition and deployment of the analysis of unstructured data.

Let's now talk about examples of Hadoop in action. Early in 2011, Watson, a supercomputer developed by IBM, competed on the popular question-and-answer show "Jeopardy!". Watson was successful in beating the two most popular players in that game. It was fed approximately 200 million pages of text, using Hadoop to distribute the workload for loading this information into memory. Once the information was loaded, Watson used other technologies for advanced search and analysis.

In the telecommunications industry we have China Mobile, a company that built a Hadoop cluster to perform data mining on Call Data Records. China Mobile was producing 5-8TB of these records daily. By using a Hadoop-based system they were able to process 10 times as much data as with their old system, and at one fifth of the cost.

In the media we have the New York Times, which wanted to host on its website all public domain articles from 1851 to 1922. They converted articles from 11 million image files to 1.5TB of PDF documents. This was implemented by one employee who ran a job in 24 hours on a 100-instance Amazon EC2 Hadoop cluster, at a very low cost.

In the technology field we again have IBM, with IBM ES2, an enterprise search technology based on Hadoop, Lucene and Jaql. ES2 is designed to address unique challenges of enterprise search, such as the use of an enterprise-specific vocabulary, abbreviations and acronyms. ES2 can perform mining tasks to build acronym libraries, regular expression patterns, and geo-classification rules.
There are also many internet and social network companies using Hadoop, such as Yahoo, Facebook, Amazon, eBay, Twitter, StumbleUpon, Rackspace, Ning, AOL, and so on. Yahoo is, of course, the largest production user, with an application running on a Hadoop cluster consisting of approximately 10,000 Linux machines. Yahoo is also the largest contributor to the Hadoop open source project.

Now, Hadoop is not a magic bullet that solves all kinds of problems. Hadoop is not good for processing transactions, because they require random access. It is not good when the work cannot be parallelized. It is not good for low-latency data access, for processing lots of small files, or for intensive calculations with little data.

Big Data solutions are more than just Hadoop. They can integrate analytic solutions into the mix to derive valuable information that can combine structured legacy data with new unstructured data. Big Data solutions may also be used to derive information from data in motion. For example, IBM has a product called InfoSphere Streams that can be used to quickly determine customer sentiment for a new product based on Facebook or Twitter comments.

Finally, let's end this overview with one final thought: cloud computing has gained tremendous traction in the past few years, and it is a perfect fit for Big Data solutions. Using the cloud, a Hadoop cluster can be set up in minutes, on demand, and it can run for as long as is needed without having to pay for more than what is used.

AWARENESS OF THE TOPOLOGY OF THE NETWORK

Hadoop has awareness of the topology of the network. This allows it to optimize where it sends the computations to be applied to the data. Placing the work as close as possible to the data it operates on maximizes the bandwidth available for reading the data. In the diagram, the data we wish to apply processing to is block B1, the light blue rectangle on node n1 on rack 1. When deciding which TaskTracker should receive a MapTask that reads data from B1, the best option is to choose the TaskTracker that runs on the same node as the data. If we can't place the computation on the same node, our next best option is to place it on a node in the same rack as the data. The worst case that Hadoop currently supports is when the computation must be done from a node in a different rack than the data. When rack-awareness is configured for your cluster, Hadoop will always try to run the task on the TaskTracker node with the highest-bandwidth access to the data.

Let us walk through an example of how a file gets written to HDFS. First, the client submits a "create" request to the NameNode. The NameNode checks that the file does not already exist and that the client has permission to write the file. If that succeeds, the NameNode determines the DataNode to write the first block to. If the client is running on a DataNode, it will try to place the block there; otherwise, it chooses a DataNode at random. By default, data is replicated to two other places in the cluster, and a pipeline is built between the three DataNodes. The second DataNode is a randomly chosen node on a rack other than that of the first replica of the block; this is to increase redundancy. The final replica is placed on a random node within the same rack as the second replica, and the data is piped from the second DataNode to the third. To ensure the write was successful before continuing, acknowledgment packets are sent back from the third DataNode to the second, from the second DataNode to the first, and from the first DataNode to the client.
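As a concrete illustration, the replica-placement rules just described can be sketched in Java. This is a simplified model written for this paper, not code from HDFS itself: the Node type, the method names, and the random selection are assumptions, and edge cases (for example a cluster with only one rack) are ignored.

```java
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

// Simplified model of HDFS's default three-replica placement (not actual HDFS code).
public class ReplicaPlacementSketch {

    // A DataNode is identified by its name and the rack it sits on.
    public record Node(String name, String rack) {}

    private static final Random RANDOM = new Random();

    public static Node[] chooseTargets(Node client, List<Node> cluster) {
        // First replica: the client's own node if it is a DataNode,
        // otherwise a randomly chosen node.
        Node first = cluster.contains(client) ? client : pick(cluster, n -> true);

        // Second replica: a random node on a *different* rack, so the block
        // survives the loss of an entire rack.
        Node second = pick(cluster, n -> !n.rack().equals(first.rack()));

        // Third replica: a random node on the *same* rack as the second,
        // which avoids a third cross-rack transfer while still keeping
        // the block on two racks.
        Node third = pick(cluster,
                n -> n.rack().equals(second.rack()) && !n.equals(second));

        return new Node[] {first, second, third};
    }

    private static Node pick(List<Node> cluster, Predicate<Node> acceptable) {
        List<Node> candidates = cluster.stream().filter(acceptable).toList();
        return candidates.get(RANDOM.nextInt(candidates.size()));
    }
}
```

However the random choices fall, the result always has replicas on exactly two racks, which is the property the write pipeline relies on.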
This process occurs for each of the blocks that make up the file, in this case the second and the third block. Notice that, for every block, there is a replica on at least two racks. When the client is done writing to the DataNode pipeline and has received acknowledgements, it tells the NameNode that it is complete. The NameNode will check that the blocks are at least minimally replicated before responding.

MAP REDUCE

We will look at "the shuffle" that connects the output of each mapper to the input of a reducer. This will take us into the fundamental data types used by Hadoop, and we will see an example data flow. Finally, we will examine Hadoop MapReduce fault tolerance, scheduling, and task execution optimizations.

To understand MapReduce, we need to break it into its component operations, map and reduce. Both of these operations come from functional programming languages; these are languages that let you pass functions as arguments to other functions.

We'll start with an example using a traditional for loop. Say we want to double every element in an array. The variable "a" enters the for loop as [1,2,3] and comes out as [2,4,6]. Each array element is mapped to a new value that is double the old value. The body of the for loop, which does the doubling, can be written as a function: we now say a[i] is the result of applying the function fn to a[i], and we define fn as a function that returns its argument multiplied by 2. This allows us to generalize the code. Instead of only being able to use it to double numbers, we could use it for any kind of map operation. We will call this function "map" and pass the function fn as an argument to map. We now have a general function named map and can pass our "multiply by 2" function as an argument. Writing the function definition in one statement is a common idiom in functional programming languages.

In summary, we can rewrite a for loop as a map operation taking a function as an argument. Other than saving two lines of code, why is it useful to rewrite our code this way? Let's say that instead of looping over an array of three elements, we want to process a dataset with billions of elements and take advantage of a thousand computers running in parallel to quickly process those billions of elements. If we decided to add this parallelism to the original program, we would need to rewrite the whole program. But if we wanted to parallelize the program written as a call to map, we wouldn't need to change our program at all. We would just use a parallel implementation of map.

Reduce is similar. Say you want to sum all the elements of an array. You could write a for loop that iterates over the array and adds each element to a single variable named sum. But we can generalize this. The body of the for loop takes the current sum and the current element of the array and adds them to produce a new sum. Let's replace this with a function that does the same thing. We can replace the body of the for loop with an assignment of the output of a function fn to s. The fn function takes the sum s and the current array element a[i] as its arguments, and its implementation returns the sum of its two arguments. We can now rewrite the sum function so that the function fn is passed in as an argument.
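The original slide code is not reproduced in this text, so the progression just described can be sketched in Java as follows. The names doubleAll, map, sum and fn are the hypothetical names used in the discussion; the functional interfaces come from java.util.function.

```java
import java.util.function.IntBinaryOperator;
import java.util.function.IntUnaryOperator;

public class FunctionalSketch {

    // Step 1: a traditional for loop that doubles every element in place.
    static void doubleAll(int[] a) {
        for (int i = 0; i < a.length; i++) {
            a[i] = a[i] * 2;
        }
    }

    // Step 2: the loop body generalized into map, which applies any fn
    // to every element.
    static void map(int[] a, IntUnaryOperator fn) {
        for (int i = 0; i < a.length; i++) {
            a[i] = fn.applyAsInt(a[i]);
        }
    }

    // Step 3: sum rewritten so the combining function fn is passed in
    // as an argument.
    static int sum(int[] a, IntBinaryOperator fn) {
        int s = 0;
        for (int i = 0; i < a.length; i++) {
            s = fn.applyAsInt(s, a[i]);
        }
        return s;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        map(a, x -> x * 2);                   // a is now [2, 4, 6]
        int total = sum(a, (s, x) -> s + x);  // 2 + 4 + 6 = 12
        System.out.println(total);            // prints 12
    }
}
```

Parallelizing this code would then mean swapping in parallel implementations of map and sum without touching the calling program, which is exactly the property Hadoop exploits.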
This generalizes our sum function into a reduce function. We will also let the initial value for the sum variable be passed in as an argument. We can now call the function reduce whenever we need to combine the values of an array in some way, whether it is a sum, a concatenation, or some other type of operation we wish to apply. Again, the advantage is that, should we wish to handle large amounts of data and parallelize this code, we do not need to change our program; we simply replace the implementation of the reduce function with a more sophisticated implementation. This is what Hadoop MapReduce is: an implementation of map and reduce that is parallel, distributed, fault-tolerant, and able to efficiently run map and reduce operations over large amounts of data.

MAPREDUCE -- SUBMITTING A JOB

The process of running a MapReduce job on Hadoop consists of 8 major steps. The first step is that the MapReduce program you've written tells the JobClient to run a MapReduce job. This sends a message to the JobTracker, which produces a unique ID for the job. The JobClient copies job resources, such as a jar file containing the Java code you have written to implement the map or the reduce task, to the shared file system, usually HDFS. Once the resources are in HDFS, the JobClient can tell the JobTracker to start the job. The JobTracker does its own initialization for the job. It calculates how to split the data so that it can send each "split" to a different mapper process to maximize throughput. It retrieves these "input splits" from the distributed file system. The TaskTrackers are continually sending heartbeat messages to the JobTracker. Now that the JobTracker has work for them, it will return a map task or a reduce task as a response to the heartbeat. The TaskTrackers need to obtain the code to execute, so they get it from the shared file system. Then they can launch a Java Virtual Machine with a child process running in it, and this child process runs your map code or your reduce code.

MAPREDUCE -- MERGESORT/SHUFFLE

Say we have a job with a single map step and a single reduce step. The first step is the map step. It takes a subset of the full data set
called an input split, and applies to each row in the input split an operation you have written, such as the "multiply the value by two" operation we used in our earlier map example. There may be multiple map operations running in parallel with each other, each one processing a different input split.

The output data is buffered in memory and spills to disk. It is sorted and partitioned by key using the default partitioner, and a merge sort sorts each partition. The partitions are then shuffled amongst the reducers. For example, partition 1 goes to reducer 1, and the second map task also sends its partition 1 to reducer 1, while partition 2 goes to the other reducer. Each reducer does its own merge steps and executes the code of your reduce task. For example, it could do a sum like we used in the earlier reduce example. This produces sorted output at each reducer.

MAPREDUCE -- FUNDAMENTAL DATA TYPES

The data that flows into and out of the mappers and reducers takes a specific form. Data enters Hadoop in unstructured form, but before it gets to the first mapper, Hadoop has changed it into key-value pairs, with Hadoop supplying its own key. The mapper produces a list of key-value pairs. Both the key and the value may change from the k1 and v1 that came in to a k2 and v2, and there can now be duplicate keys coming out of the mappers; the shuffle step will take care of grouping them together. The output of the shuffle is the input to the reducer step. Here we still have a list of the v2's that came out of the mapper step, but they are grouped by their keys, and there is no longer more than one record with the same key. Finally, coming out of the reducer is, potentially, an entirely new key and value, k3 and v3. For example, if your reducer summed the values associated with each k2, your k3 would be equal to k2 and your v3 would be the sum of the list of v2s.

Let us look at an example of a simple data flow. Say we want to transform the input on the left to the output on the right. On the left, we just have letters. On the right, we have counts of the number of occurrences of each letter in the input. Hadoop does the first step for us: it turns the input data into key-value pairs and supplies its own key, an increasing sequence number. The function we write for the mapper needs to take these key-value pairs and produce something that the reduce step can use to count occurrences. The simplest solution is to make each letter a key and make every value a 1. The shuffle groups records having the same key together, so we see B now has two values, both 1, associated with it. The reduce is simple: it just sums the values it is given to produce a sum for each key.

MAPREDUCE -- FAULT TOLERANCE

The first kind of failure is a failure of the task, which could be due to a bug in the code of your map task or reduce task. The JVM tells the TaskTracker, and Hadoop counts this as a failed attempt and can start up a new task. What if the task hangs rather than fails? That is detected too, and the JobTracker can run your task again on a different machine in case it was a hardware problem. If it continues to fail on each new attempt, Hadoop will fail the job altogether. The next kind of failure is a failure of the TaskTracker itself.
The JobTracker will know because it is expecting a heartbeat. If it doesn't get a heartbeat, it removes that TaskTracker from the TaskTracker pool. Finally, what if the JobTracker fails? There is only one JobTracker; if it fails, your job fails.

MAPREDUCE -- SCHEDULING & TASK EXECUTION

So far we have looked at how Hadoop executes a single job as if it were the only job on the system. But it would be unfortunate if all of your valuable data could only be queried by one user at a time. Hadoop schedules jobs using one of three schedulers. The simplest is the default FIFO scheduler. It lets users submit jobs while other jobs are running, but queues these jobs so that only one of them is running at a time. The fair scheduler is more sophisticated. It lets multiple users compete over cluster resources and tries to give every user an equal share. It also supports guaranteed minimum capacities. The capacity scheduler takes a different approach. From each user's perspective, it appears that they have the cluster to themselves with FIFO scheduling, but users are actually sharing the resources.

Hadoop offers some configuration options for speeding up the execution of your map and reduce tasks under certain conditions. One such option is speculative execution. When a task takes a long time to run, Hadoop detects this and launches a second copy of your task on a different node. Because the tasks are designed to be self-contained and independent, starting a second copy does not affect the final answer. Whichever copy of the task finishes first has its output go to the next phase; the other task's redundant output is discarded. Another option for improving performance is to reuse the Java Virtual Machine. The default is to put each task in its own JVM for isolation purposes, but starting up a JVM can be relatively expensive when jobs are short, so you have the option to reuse the same JVM from one task to the next.

SUMMARY

One thing is certain: by the time the sixth annual Hadoop Summit comes around next year, Big Data will be bigger. Business applications that are emerging now will be furthered as more enterprises incorporate big data analytics and HDP solutions into their architecture. New solutions in fields like healthcare, with disease detection and coordination of patient care, will become more mainstream. Crime detection and prevention will benefit as the industry further harnesses the new technology. Hadoop and Big Data promise not only greatly enhanced marketing and product development; they also hold the power to drive positive global social impact around improved wellness outcomes, security, and many other areas. This, when you think about it, fits perfectly with the spirit of the Summit, which calls for continued stewardship of the Hadoop platform and promotion of associated technology by open-source and commercial entities.

REFERENCES

Google MapReduce: http://labs.google.com/papers/mapreduce.html
Hadoop Distributed File System: http://hadoop.apache.org/hdfs