Big Data Fundamentals

DECODING THE
BIG DATA HYPE
From A Layman’s Perspective
ABSTRACT
In 2001, Meta Group (Now Gartner) published a
report by Doug Laney, wherein the first analysis of
Big Data challenges was documented. This
document is an attempt to understand those
challenges and the solutions offered. This document
has been prepared using multiple sources and has
been quoted likewise.
By: Smarak Das [EMP Id: 391485]
Teradata Certified Database Administrator &
Technical Specialist

1 | P a g e
DECODING THE BIG DATA HYPE
INDEX(a) Fundamentals Of Big Data Page 2
- Characteristics Of Big Data Page 3
- Warehouse Vs Hadoop Page 7
- Use Cases Of Big Data Page 11
- Big Data Demographics Page 15
(b) All About Hadoop Page 21
- HDFS Page 23
- Assumptions & Goals Page 25
- DataNodes & NameNodes Page 26
- Data Replication & Integrity Page 28
- Data Blocks Organization & Pipelining Page 30
- File Permission Guide Page 34
(c) Basics Of MapReduce Page 35
(d) Hadoop Common Components Page 38
- Pig & PigLatin Page 39
- HIVE Page 40
- JAQL Page 41
- FLUME Page 42
- ZooKeeper Page 43
- Oozie Page 44
- Lucene Page 44
- Avro Page 44

2 | P a g e
Fundamentals Of BIG DATA
You Are Part Of It Every Day
Wikipedia quotes “Big Data is the collection of data sets so large and complex that
it becomes difficult to process using on-hand database management tools or
traditional data processing applications.”
The term “Big Data” is a misnomer since it implies that pre-existing data is somehow
small (it isn’t) or the only challenge is its sheer size (size is one of them, but there
are often more). In short, Big Data applies to information that can’t be processed or
analyzed using traditional tools. The effect of e-commerce, social media, rise in
merger/acquisition, increasing collaboration/partnership etc. is driving enterprises to
higher levels of consciousness about how the data is being managed at its basic level.
Today, every organization has access to a wealth of information, yet they don’t know
how to get value out of it because it is sitting in its most raw format or in semi-
structured, unstructured format, and as a result, they don’t even know whether it’s
worth keeping. Also, organization are getting overwhelmed by the volume of the
data generated, variety of data being available, and velocity of data availability.
Companies have the ability to store anything and they are generating data like never
before, yet as this potential gold mine of data piles up, the percentage of data the
business can process is going down.
Today’s world is changing. Through instrumentation and sensors, we are able to
track and sense more things, and if we can track and sense it, we tend to store it. Big
Data is the game changer for the overall effectiveness of your data centers, because
of its potential as a powerful tool in your information management repertoire.
The practice and tools of Big Data and Data Science doesn’t stand alone in the data
ecosystem. They rely on the usability of data, and a platform for future discovery
and innovation. As Big Data grows over 2014, we will see more of Big Data
acceptability, maturity as an industry, and adoption across industries.

3 | P a g e
Characteristics Of Big Data
Three characteristics define Big Data: VOLUME, VARIETY, and VELOCITY.
Source: Google Images
Fig: 3V Of Big Data
Other “V”s have been included in the Big Data characteristics, with “Variety” and
“Variability” being two other characteristics. However, we shall concentrate on
“Volume”, “Variety” and “Velocity”. Each of these characteristics introduces its
own set of complexities concerning data processing. All these 03 features have
created the need for a new class of capabilities to augment the way things are done
today to provide a better line of sight and controls over our existing knowledge
domains and the ability to act on them. Together, handling these 03 challenges
effectively will decide the efficiency and successfulness of any Big Data initiative.

4 | P a g e
VOLUME: The Volume of data is exploding. Year 2000 had 800,000 Petabytes of
data and by 2020, we are expecting to reach 35 Zettabytes. Twitter generates 7 TB
of data, while Facebook generates 10 TB every day. If the figures of Twitter and
Facebook didn’t amaze you, below lists numbers much beyond the realm of any
volume accumulated:
(a) Large Hadron Collider: Generated 15PB of data.
(b)YouTube: 72 hours of Video per hour.
(c) Human Genomics: 7000 PB
(d)Large Synoptic Survey Telescope: 30 TB of Images per day.
(e) Annual Email Traffic (No SPAM): 300 PB +
These numbers will be out of date by the time this document is prepared, and further
outdated by the time you have read it.
Today, we are storing everything: environmental data, financial data, medical data,
surveillance data, click stream data, and the list goes on and on. The term “Big Data”
means organizations are facing massive volumes of data, and this volume is
increasing with each day. As the amount of data available to business is on the rise,
the percent of data it can process, understand and analyze is on the decline, thus
creating a Blind Zone. This Blind Zone creates an uncertainty concerning the value
of all the captured and yet-unexplored data.
Fig: Discrepancy between Data Storage & Data Analysis

5 | P a g e
VARIETY: Of all “Vs”, Variety holds the potential for most exploitation. While
everybody doesn’t have the huge volume of data like Twitter, Facebook, eBay etc.
even the medium to small scale industries have multiple data sources which can be
integrated for organizational benefit. With the explosion of sensors, smart phones,
social collaboration technologies, data in an organization is becoming complex,
because it includes not only relational-database-suitable structured data, but also raw,
unstructured, semi-structured data. Traditional systems are struggling to store and
perform the required analytics to gain understanding from these varieties of data.
Only 20 percent of today’s data is traditional. Rest 80 percent of world’s data is
moving towards unstructured or semi-structured at most. Videos & Pictures aren’t
easily stored in relational databases. Variety is harder to grasp and analyze than
“bigger” (Volume) or “faster” (Variety). To capitalize on Big Data opportunity,
enterprises must be able to analyze all types of data, both relational & non-relational:
text, sensor data, audio, video, transactional, and more.
Figure: Share Of Structured & Un-Structured Data Source: Google Images
Figure: Difference Between Variety of Data Source: Relational Source

6 | P a g e
VELOCITY: Velocity refers to the speed with which data is stored or retrieved.
Earlier, the typical process was to fire a batch job against data and wait for results to
arrive. It used to work because the incoming data rate is slower than the batch
processing rate. Today, data in streaming into the server in real-time and have a very
short shelf-life. As such, the need to analyze the data-in-motion, rather than the data-
in-rest is critical. Sometimes, the competitive advantage for an organization is
decided by identifying a trend, problem, opportunity few minutes or few seconds
before someone else. There is a difference between “How many people live in
London” and “How many people are currently in London”. Dealing effectively with
Big Data requires performing analytics against the volume and variety of data while
it is still in motion, not just after it is at rest.
Source: Scale DB
Fig: Velocity & Big Data

7 | P a g e
Warehouse Vs. Hadoop (The Versus Thing)
Gartner in its 2014 Data Warehouse Database Management Systems Magic
Quadrant said “Entering 2014, the hype around replacing the data warehouse
gives way to the more sensible strategy of augmenting it.” Data in a warehouse
goes through multiple quality rigors: Cleaning, enrichment, matching, glossary,
metadata, master data management, modeling, and other services before it’s ready
for analysis. This is a very expensive and time consuming process. Having said that,
Business realizes that the data in the Data Warehouse is required for reporting and
BI purposes, which is essential for its functioning. Hence, the “high compute per
byte” (high computation cost) is associated with the “high value per byte” warehouse’
data.
The difference between traditional BI analytics and Big Data analytics is shown in
the below figure:
Source: Storage Networking Industry Association (SNIA)
Fig: Big Data Is Different From Business Intelligence

8 | P a g e
In contrast, Big Data repositories rarely undergoes the full quality rigors similar to
Data Warehouse’ data. Hadoop data might seems to be of “low value per byte”, it
also have “low compute by byte” factor. With the volume and velocity of today’s
data, we cannot afford to cleanse and document every piece of data properly, because
it’s not going to be economical. The data in Hadoop might sit for a while for analysis,
and when its value is discovered, it might migrate its way to way to the data
warehouse post the quality rigors associated with Data Warehouse’ data.
03 major considerations for Big Data technologies:
(a) Big Data solutions are ideal for analyzing not only raw data, but structured,
semi-structured, unstructured data as well.
(b)Big Data solutions are ideal when all the data needs to be analyzed in
comparison to a sample of data.
(c) Big Data solutions are ideal for iterative and exploratory analysis when
business measures of data is not predetermined.
Hadoop is not meant for high performance interactive use, not it supports database
features like schemas, indexes, optimizer, data structure, data models etc.

9 | P a g e
Fig: Business Requirement Case Study
Big Data solutions aren’t a replacement of traditional and existing warehouse
solutions. Data bound for the analytic warehouse has to be cleaned, documented, and
trusted before it’s neatly placed into the warehouse. A Big Data solution is going to
give up some of the formalities and strictness of data.
The next figure, as provided by Data Warehouse and Big Data Market Leader
Teradata (Gartner 2014) explains the best approach by workload and data type
concerning when to use what:
Legend:
(a) STABLE SCHEMA: Financial Analysis, OLAP, Enterprise-wide BI,
Reporting, Active Intelligence etc.
(b)EVOLVING SCHEMA: Interactive data discovery, web clickstream, social
feeds, set-top box analysis, sensor logs, JSON etc.
(c) FORMAT, NO SCHEMA: Image processing, audio/video storage and
refining, storage and batch transformation.

10 | P a g e
Source: Teradata
Fig: Teradata Aster [When to Use What] Mapping As Per Requirement]
Your information platform shouldn’t go into the future without these two important
entities working together, because the outcomes of a cohesive analytic solutions
deliver premium results.
Source: SAS Best Practices 2013
Fig: Big Data & Data Warehouse Together = Premium Results

11 | P a g e
Use Cases Of Big Data
The early companies to embrace Big Data were Google, LinkedIn, Facebook, eBay
etc. These companies didn’t have to reconcile or integrate big data with the
traditional sources of data and perform analytics on them as they were built around
Big Data from the beginning. For these companies, Big Data could stand alone, Big
Data analytics could be the only focus of analytics and Big Data technology
architecture could be the only architecture.
However, Large and well-established business should integrate their Big Data
technologies with everything else going on with their company i.e. analytics on Big
Data should co-exist with analytics of other types of data.
Below, we list 05 instances of Big Data use by popular companies:
Source: International Institute for Analytics Study Sponsored by SAS May 2013

12 | P a g e

13 | P a g e

14 | P a g e

15 | P a g e
Big Data Demographics
NVP [NewVantage Partners] conducted a survey in 2013 for Big Data statistics with
the following participating companies {Below Figure}, with the survey participants
including Chief Information Officers, Chief Analytics and Risk Officers, Chief
Technology Officers, Chief Marketing Officers, Senior Line-of-Business Executives
(EVP/SVP), Chief Architects, and Heads of Big Data and Analytics.
Source: NewVantage Partners (NVP) Big Data Executive Survey 2013

16 | P a g e
(a) BIG Data Acceptance In Terms Of Industry
[Financial Services usage of Big Data stems from development of highly sophisticated
customer analytics and predictive behavior models, fraud detection and risk analytics etc.
Health Care & Life Sciences firms are in nascent stage of Big Data adoption, with only
17% LS & HC executives reporting Big Data systems operational in production as
compared with 33% for financial services]
(b)BIG Data Initiative Status
[In 2012, 85% of executives indicated embarking on initial forays of Big Data initiatives.
In 2013, 91% executives were planning or have embarked on a Big Data initiatives. Also,
68% executives reported investment of more than $1MM in Big Data initiatives]

17 | P a g e
(c) BIG Data Initiative Being Planned By Organization
[The most significant factor for Big Data initiatives were to enhance the analytics power
and capabilities as a mean to compete more successfully and operate more efficiently, with
70% of executives’ demand. Effective integration of existing data, be it structured, un-
structured, semi-structured represented 69% of executives’ driving factor for Big Data
initiative]

18 | P a g e
(d)Primary Focus Of Big Data Analysis Initiative
Most Big Data initiatives do not currently require any ROI playback analysis for
justifying their investment in Big Data, with 50% indicating long term strategic
investment. The below figure shows whether an ROI has been conducted before
approving the Big Data investment:

19 | P a g e
(e) Factors Critical To Business Adoption Of Big Data Initiatives
[Executive Sponsorship is the most critical factor driving Big Data initiative]
(f) Primary Business Benefit Expected By Big Data Analysis

20 | P a g e
(g)Big Data Solutions Used
(h)Analytics & Visualization Solutions Used

21 | P a g e
All About Hadoop
The problem is obvious: There is a staggering amount of data lying around in every
enterprise of various formats and types, with more and more data being added to the
repository each and every moment. But, the enterprise aren’t sure whether to
continue storing the data, or analyze it, or whether there is any value in it.
It will not be wrong to say the Big Data is the culmination of technological
advancement which can spiraled the volume, variety and velocity aspect of data
beyond the organizational grasp. People and organization have attempted to tackle
this problem from many different angles. The angle which is currently leading the
pack in terms of popularity is an open source project called Hadoop.
Fig: Hadoop Adoption
Hadoop is a top-level Apache project in the Apache Software Foundation written in
Java. For the definition, we can define Hadoop as a computing environment built on
top of a distributed clustered file system that was designed specifically for very
large-scale operations.

22 | P a g e
Hadoop is based on Google’s work on MapReduce programming paradigm. Unlike
traditional systems, Hadoop is designed to scan through large data sets to produce
its results through a highly scalable, distributed batch processing system. Hadoop is
not about speed-of-thought response times, real-time warehousing, or blazing
transactional speeds as mentioned in the Big Data vs. Data Warehouse section. It is
about discovery and making the once near-impossible possible from a scalable and
analytic perspective.
The Hadoop project has 03 major components:
(a) Hadoop Distributed File System (HDFS)
(b)Hadoop MapReduce.
(c) Hadoop Common.
Fig: Hadoop Base Components
One of the key component of Hadoop is the redundancy built into Hadoop. Hadoop
recognizes that failure is a norm rather than an exception. As mentioned earlier,
Hadoop achieves the performance and scalability via commodity based hardware
(ready-made and easily available inexpensive hardware). It is well known that
commodity based hardware will fail, especially when you have large numbers of
them. But the redundancy built into Hadoop provides the fault tolerance and the
capability of Hadoop to heal itself. This allows Hadoop to scale out workloads across
large clusters of inexpensive machines to work on Big Data problems.

23 | P a g e
Hadoop Distributed File System [HDFS]
The Hadoop Distributed File System [HDFS] is a distributed file system designed to
run on commodity hardware. HDFS is highly fault tolerant and is designed to be
deployed on low cost hardware.
Data in Hadoop cluster is broken down into smaller pieces called Blocks and
distributed across the cluster. In this way, the map & reduce functions can be
executed on smaller subsets of your larger data sets and this provides the scalability
needed by Big Data processing.
Source: Apache Hadoop Org
Fig: Hadoop HDFS Architecture
The goal of Hadoop is to use commonly available servers in very large cluster, where
each server has a set of inexpensive internal disk drives. As commodity hardware
failure’s probability is high, Hadoop built-in tolerance and fault compensation
facilities. Any data is divided into blocks and copies of these blocks are stored on
other servers in the Hadoop Cluster. That is, an individual file is actually stored as
smaller blocks on several servers in the entire cluster.

24 | P a g e
In Hadoop, a file is broken down into multiple “n” blocks. Each of these “n” block
is replicated across 03 Servers by default [The number of Blocks per file and the
replication factor can be customized on a per-file basis e.g. the Development Hadoop
needn’t have any replication]. Coordination amongst all the servers has a significant
overhead, so the ability to process large chunks of data locally helps improve both
performance and communication overhead.
For example: Imagine a file having all the Employee Ids of an organization. This file
is divided into say, 03 parts (Block 1, Block 2, and Block 3) and is stored across
multiple servers in a Hadoop Cluster.
Source: VMware’s Networking and Security Business Unit [Brad Hedlund]
Fig: Block Replication across HDFS
Here, Block 1 is replicated by Block 1` and Block 1``. Same applies for Block 2 and
Block 3. With the default replication factor of 3, each block is replicated thrice.
This redundancy has 02 major advantages:
(a) High availability
(b) It allows Hadoop to break a work into smaller chunks and run those jobs on
all servers in the cluster for better scalability.

25 | P a g e
Assumptions & Goals
(a)Hardware Failure: Hardware Failure is a norm rather than an exception. As
HDFS uses thousands of commodity hardware component, and each
component has a non-trivial probability of failure, some component of HDFS
is always at fault. However, this failure is expected and part of the design
perspective of Hadoop.
(b)Streaming Data Access: HDFS is designed for batch processing, rather than
interactive uses by users. It is not a general purpose distributed file systems
dealing with stand-alone data. It requires continuous streaming access to the
data sets.
(c) Large Data Sets: A typical file in Hadoop is terabytes in size. HDFS is built
to support huge data sets.
(d)Simple Coherency Model: HDFS applications have a write-once-read-many
access model for files. A file once created, written, and saved need not be
changed. This assumption greatly simplifies the data coherency issues. There
is a plan to introduce file’s data-append in the future release of Hadoop.
(e) “Moving Computation To Data, Rather Than Data To Computation”:
Hadoop believes moving computation to data is much faster than moving data
to computation. This is very true for large data sets. Moving huge data sets to
the application greatly increases the network bandwidth usages.
(f) Portability: HDFS is designed to work on any platform supporting Java. With
highly portable Java at helm, Hadoop advocates widespread adoption as a
platform of choice for large data sets applications.

26 | P a g e
DataNodes & NameNodes
Hadoop has a Master/Slave architecture. All of Hadoop data placement is managed
by special server called NAMENODE. Each cluster of Hadoop has 1 NameNode
Server assigned to it. This server keeps track of all the data in the HDFS. All the
NameNode’s information is stored in memory, which allows quick response time to
storage manipulation and read requests. In other words, NameNode deals with
CLUSTER METADATA. The obvious scenario coming to our mind is the Single
Point of Failure (SPOF) with all these details stored in a server. Hence, it is advisable
to choose a robust server component for NameNode as compared to other servers.
Initial version of Hadoop had only 1 NameNode Server. Hadoop version 0.21
included the capability of a Backup Node, which acts as a clod standby for
NameNode.
Fig: Hadoop Master Slave Architecture

27 | P a g e
DataNodes manages the storage attached to each node in the cluster. Usually, 1
DataNode is assigned to 1 Node. Internally, when a file is divided into multiple
blocks, each block is assigned to multiple DataNode [03 By Default]. Overall, the
NameNode handles the namespaces operation like opening, closing, renaming files
in addition to determining the mapping of blocks to DataNodes. The DataNodes is
responsible for servicing read and write request.
Fig: NameNode & DataNode Functionality
When you fire a job for inserting data into HDFS or for retrieving data from HDFS,
Hadoop has the responsibility of communicating with NameNode for necessary
information, be it storage location and replication details for INSERT operation or
the server from which the data for SELECT operation is to be fetched. In other words,
any Hadoop operation needn’t reference NameNode directly or indirectly.
Hadoop isn’t UNIX (POSIX) portable. It means that all the familiar commands for
copying, deleting, inserting, opening, moving etc. are slightly available in different
forms with HDFS. To work around this, either we can develop our own Java
applications to perform some of the functions, or we can use Hadoop components
available readily in the Apache Software Foundation.

28 | P a g e
Data Replication & Integrity
HDFS is designed to store huge files, and each file is divided into many blocks, with
all the blocks occupying same size except the last block. All these blocks are
replicated for fault tolerance. The block size and replication factor is configurable
for each file. The replication factor for a file can be changed later also.
The NameNode makes all the decisions regarding the block’s replication. It also
receives a HeartBeat and BlockReport from each DataNodes. HeartBeat report
signifies all the DataNodes are functioning properly. The BlockReport contains the
list of all blocks on a DataNode.
Fig: Heartbeat & Block Report.
The placement of the first replica is crucial. Optimizing the block’s replica
placement requires lots of tuning and optimization. Hadoop uses a Rack Awareness
policy for replication. A group of nodes forms a Rack. By default, the Replication
Factor is 03. A simple policy will be to place one replica per rack. This policy will
ensure even distribution of blocks, but increases the cost of writes as a write needs
to transfer blocks to multiple racks.

29 | P a g e
Hadoop Rack Awareness policy is to put first replica on one node in the local rack,
another on a different node in the same local rack, and the last on a different node in
a different rack. One third of replicas is on one node, two third of replicas is in one
rack, and the other third are evenly distributed across the remaining racks.
The necessity of block re-replication may arise due to many reasons: DataNode may
become unavailable, a replica may become corrupted, a hard disk on a DataNode
may fail, or the replication factor of a file may be increased. The Blocks’ information
across DataNode is sent to NameNode via HeartBeat message periodically.
It is very possible for a block of data to be fetched arrives corrupted. Whenever a
block of a file is stored in a DataNode, checksum is implemented for data integrity.
This checksum is used for validation of data blocks whenever data is received from
the DataNode by the NameNode.

30 | P a g e
Data Blocks Organization & Pipelining
When a client request a file write, it doesn’t reach the NameNode immediately. In
fact, the HDFS Client caches the data into a temporary file until the accumulated
data is worth over one HDFS block size. At this point of time, the Client contacts
the NameNode. The NameNode inserts the file name into the file system hierarchy
and allocates a data block for it. The NameNode responds to the client request with
the identity of the DataNode and the destination data block. Then the client flushes
the block of data from the local temporary file to the specified DataNode. When a
file is closed, the remaining un-flushed data in the temporary local file is transferred
to the DataNode. The client then tells the NameNode that the file is closed. At this
point, the NameNode commits the file creation operation into a persistent store. If
the NameNode dies before the file is closed, the file is lost.
Figure: Pipelined Flow of HDFS Read Operation

31 | P a g e
Explanation of HDFS File Read Operation:
(a) The HDFS Client requests a file to be read for its operation.
(b)The NameNode is contacted for the file information. The information sought
is the block’s address across DataNodes for the file.
(c) The NameNode fetch the block information based on the latest Block Report
sent by each DataNodes.
(d)The NameNode send these information ordering the Blocks by their ascending
order based on the distance from the first block containing the data.
Example: Assume a file has 2 blocks (B1 and B2) with a replication factor of
02. Node 1 contains B1, B2`; Node 2 contains B2, B1`. When the NameNode
delivers the block ordering, it will place the node containing the first replica,
followed by other nodes based on the distance of other same-block-other-
replica carrying node. For the above example:
B1 (Node1, Node2)
B2 (Node2, Node1)
(e) The InputStream will fetch the data from the ordering specified, choosing the
first Node containing the block, and then checking the checksum to verify data
correctness. If correct, then the block from the first node is used, else the
second node’ block will be used.

32 | P a g e
Figure: Pipelined Flow of HDFS Write Operation
The Normal Line represent
communication between
Client &HDFS. Solid Line
represent data transfer and
dotted line represent
acknowledgement.
Time t0: Client request
WRITE Operation.
Time t1-t2: Packets of data
sent from Client to HDFS in
BLOCK Size.
Time t2: CLOSE Signal
Time t3: Hadoop Saves the
file.
FIG: WRITE PIPELINE

33 | P a g e
The Pipelined approach explained above for WRITE operation has been explained
using the below figure:
Fig: HDFS Write Operation
The policy of putting the first 2 blocks in the same rack, and then the 3rd block in
another rack is based on Hadoop Rack Awareness Policy, explained in the “Data
Replication & Integrity” section.

34 | P a g e
HDFS File Permission Guide
The Hadoop Distributed File System (HDFS) implements a permissions model for
files and directories that shares much of the POSIX model. Each file and directory
is associated with an owner and a group. The file or directory has separate
permissions for the user that is the owner, for other users that are members of the
group, and for all other users. For files, the r permission is required to read the file,
and the w permission is required to write or append to the file. For directories, the r
permission is required to list the contents of the directory, the w permission is
required to create or delete files or directories, and the x permission is required to
access a child of the directory.
Each client process that accesses HDFS has a two-part identity composed of the user
name, and groups list. Whenever HDFS must do a permissions check for a file or
directory foo accessed by a client process,
(a) If the user name matches the owner of foo, then the owner permissions are
tested;
(b)Else if the group of foo matches any of member of the groups list, then the
group permissions are tested;
(c) Otherwise the other permissions of foo are tested.
If a permissions check fails, the client operation fails.

35 | P a g e
Basics Of MapReduce
MapReduce is the heart of Big Data. It is the programming paradigm that allows for
massive scalability across hundreds or thousands of servers in a Hadoop cluster.
The term MapReduce refers to two separate and distinct tasks: Map & Reduce. The
“Map” takes data as input and converts it into another set of data, where every
individual elements is broken down into tuples (key/value pairs). The “Reduce”
takes the output of “Map” as input and combines those tuples into a smaller set of
tuples. As the sequence of name MapReduce suggests, the “Reduce” job is always
performed after “Map” job.
For Example: Consider 3 Files containing Animal’s name and we wish to count the
number of occurrence of each animal. First, the Input File is split into blocks (3 by
Default). For each block, a Map task calculate the number of occurrence of each
animal. The Shuffle unit of MapReduce takes the output of Map and directs them to
appropriate Reducer tasks for consolidation and delivery of the final result.
Fig: MapReduce Operation

36 | P a g e
In Hadoop, every MapReduce Program is called “JOB”. A job is executed by subsequently
breaking it down into smaller pieces called “TASKS”. Every Hadoop cluster has a program
running called “JOBTRACKER”.
The JobTracker communicates with the NameNode to find out where all the data required
by the submitted job exists across the cluster and also breaks the job into map and reduce
tasks. These tasks are then scheduled on all the servers where the data exists. It is very
possible for the tasks to be scheduled on a server where the required data isn’t available
[As every data is replicated across 3 servers by default]. In this case, the server will ask for
the necessary data to be transferred across the network interconnect to perform its task. As
this is not very efficient, the JobTracker tries to avoid this and attempts to schedule the job
on the servers where the data actually resides.
Another set of program called “TASKTRACKER” is responsible for monitoring the status
of every tasks running. If any tasks fails, the status of the failure is reported back to
JobTracker, which will then reschedule the job. We can decide how many times a failed
task will be attempted before the entire job is cancelled.
Fig: MapReduce Basic Concepts

37 | P a g e
SHUFFLE & COMBINER are 02 features of Hadoop used in MapReduce. Shuffle takes
the input from the map tasks and directs the output to reduce tasks. If we wish to perform
some aggregation or other transformation on the output of map tasks before sending to
reduce tasks, then we can use Combiner. The greater the number of reducer tasks, more
will be the overhead but the overall performance is improved.
Fig: MapReduce & HDFS Together

38 | P a g e
Hadoop Common Components
Hadoop Common Components are a set of libraries that support the various Hadoop
subprojects. As mentioned before, Hadoop isn’t UNIX (POSIX) complaint. To
interact with Hadoop, we need to use the /bin/hdfs dfs <args> file system shell
command interface, where args represents the command argument.
Fig: HDFS Shell Commands

39 | P a g e
Application Development in Hadoop
For using Hadoop, we need to use Java for developing the MapReduce programs for
interacting with HDFS. Also, the programmers need to develop and maintain
MapReduce applications for business applications that require long and pipelined
processing.
To abstract some of the complexity of Hadoop programming model, several
application development languages have emerged that run on top of Hadoop. The
popular ones are Pig, Zookeeper, Hive, JAQL etc.
Pig & PigLatin
Pig was developed by Yahoo! To allow people using Hadoop to focus more on
analyzing data rather than writing the mapper and reducer programs. As the animal
Pig eats anything, the Pig programming language is designed to handle any kind of
data. Pig is made up of two components: PigLatin & Pig Runtime Environment
where the PigLatin programs are executed. The relationship between PigLatin & Pig
Runtime Environment is similar to Java Application and JVM.
Fig: Pig & PigLatin
The commonly used commands in Pig are:
(a) LOAD: Before writing programs via PigLatin to access data in HDFS, we
need to specify the data in HDFS which will be used. For this purpose, we use
LOAD ‘File’ Command (Where ‘File’ refers to either a HDFS file or
directory). If a directory is specified, then all the files are loaded into the
PigLatin program.

40 | P a g e
(b)TRANSFORM: The transformation logic is where all the data manipulation
occurs. You can use FILTER to remove rows as required, JOIN to join two
set of data files, GROUP to aggregate data, ORDER to order results, etc.
For Example: To calculate the count of employees per manager belonging to
the HealthCare ISU with the input file being located in HDFS @ Employee
directory:
L = LOAD ‘hdfs//node/Employee’;
FL = FILTER L BY Vertical EQ ‘HC’;
G = GROUP FL BY ManagerID;
RT = FOREACH G GENERATE group, COUNT (EmployeeID);
(c) DUMP & STORE: If DUMP & STORE aren’t specified, then the output of
Pig program isn’t displayed. To display the output to the screen, use DUMP
command. For redirecting the output to a file, use STORE command.
Post writing the PigLatin program, the Pig Runtime Environment translate the
program into a set of map & reduce tasks and runs them under the cover on your
behalf.
HIVE
Although Pig was a very powerful and simple language to understand and use, the
downside was that it was something new to learn and master. HIVE was developed
by Facebook with the intention of developing a runtime Hadoop support structure
which allows anyone fluent in SQL to leverage the power of Hadoop. Their creation
was called HQL (HIVE Structured Language). The HQL statements were broken
down into MapReduce jobs by the HIVE service to be executed across the Hadoop
cluster.
For Example: Using the same scenario as above of calculating the count of
employees per manager id, the following HIVE Code performs the task of creating
a table, populating it, and then querying that table via HIVE:

41 | P a g e
CREATE TABLE Employee (EMPID BIGINT, MGRID BIGINT, VERTICAL
STRING)
COMMENT ‘The Above Is The Employee Table’
STORED AS SEQUENCEFILE;
LOAD DATA INPATH ‘//hdfs://node/Employee’ INTO TABLE Employee;
SELECT MGRID, COUNT (EMPID) FROM EMPLOYEE WHERE VERTICAL
EQUAL ‘HC’ GROUP BY MGRID;
JAQL
JAQL was developed by IBM and allows processing both structured and non-
traditional data. JAQL was inspired by many programming language like LISP, SQL,
XQuery, Pig etc.
Fig: Comparison of Pig, Hive & JAQL

42 | P a g e
GETTING YOUR DATA INTO HADOOP
One of the biggest challenges for Hadoop is that it is not UNIX (POSIX) complaint.
To get your data into Hadoop (HDFS), we need to use the traditional Hadoop
commands:
(a) CopyFromLocal: Move file from local file system into HDFS.
(b)CopyToLocal: Move file from HDFS into local file system.
hdfs dfs –copyFromLocal /user/dir/file hdfs://s1.n1.com/dir/hdfsfile
hdfs dfs –copyToLocal hdfs://s1.n1.com/dir/hdfsfile /user/dir/file
These commands are executed through the HDFS Shell Program, which is similar
to a Java Application. The shell uses the Java APIs for getting the date into and out
of HDFS.
FLUME
Flume is an Apache project for flowing data from source into your Hadoop
environment. In Flume, there are 03 main entities:
(a)SOURCE: A Source can be any data source. Flume has many predefined
source adapters. Some adapters allows the flow of anything coming off a TCP
port to enter the flow. A number of text source file adapters allows you the
granular control to grab a specific file and feed it into whatever new data is
written into the file. Look to Flume when you want data to flow from many
sources.
(b)SINK: A Sink is the target of a specific operation. There are 03 type of Sinks
in Flume. One Sink is basically the final flow destination known as
COLLECTOR TIER EVENT SINK. This is where you land a flow into the
HDFS File System. Another Sink type is AGENT TIER EVENT SINK, which
is used when you want the sink to be the input source of another operation.
When using such Sinks, Flume will ensure communication or
acknowledgement is sent for arriving data. The final sink type is BASIC SINK,
which can be text file, a console display, simple HDFS path etc.

43 | P a g e
(c) DECORATOR: A Decorator is an operation on the stream that can transform
the stream in some manner, be it compressing or adding or removing
information. Very complex transformation or enterprise class transformation
like IBM Information Server isn’t achieved by Flume’s decorator task.
Hadoop is more than just a single project, but rather an ecosystem of projects
at simplifying, managing, coordinating, and analyzing large sets of data. Such
projects are listed in the following sections.
ZOOKEEPER
ZooKeeper is an open source Apache project that provides a centralized
infrastructure and services enabling synchronization across a cluster.
Imagine a Hadoop cluster spanning 500 or more servers. There is a need for
centralized management of the entire cluster in terms of name services, group
services, synchronization services, configuration management, and more. Having
ZooKeeper allows the cross-node synchronization and ensures the tasks across
cluster are serialized or synchronized. A very Hadoop cluster can be supported by
multiple ZooKeeper servers.
Fig: Apache Zookeeper

44 | P a g e
OOZIE
Sometimes, many jobs needs to be chained together to create a complex application.
Oozie is an open source project that simplifies the workflow and coordination
between jobs. It provides users with the ability to define actions and dependencies
between actions. Oozie will then schedule the actions to execute when the required
dependencies have been met.
A workflow in Oozie is defined via DAG (Directed Acyclic Graph), where all the
tasks and dependencies point are specified without any loop. The below figure shows
an example of Oozie workflow, where the node represents the actions and control
flow operations.
LUCENE
Lucene is an extremely popular open source Apache project for text search. Lucene
predates Hadoop and has been a top level Apache project since 2005. If you have
searched on the Internet, it’s very likely that you have used Lucene and yet not
known it.
In a nutshell, if you wish to search a text in a large file, or a set of documents, then
Lucene breaks the documents into text fields and builds an index on these fields. The
index is the key component of Lucene, as it is the basis of its rapid search capabilities.
AVRO
AVRO is an Apache project offering data serialization.

Big Data Fundamentals

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

En vedette

En vedette (20)

Similaire à Big Data Fundamentals

Similaire à Big Data Fundamentals (20)

Dernier

Dernier (20)

Big Data Fundamentals