3. What is Hadoop?
The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets
across clusters of computers using simple programming
models.
Hadoop is best known for MapReduce, its distributed file system (HDFS), and large-scale data processing.
4. What is the Hadoop Ecosystem?
An introduction to the world of Hadoop and the core related software projects. There are countless commercial Hadoop-integrated products focused on making Hadoop more usable and accessible, but the projects covered here were chosen because they provide core functionality and speed to Hadoop; together they form the so-called Hadoop Ecosystem.
5. Hadoop Ecosystem
Figure: Hadoop Ecosystem Architecture
6. HDFS
Hadoop Distributed File System.
Files stored in HDFS are divided into blocks, which are then replicated to multiple DataNodes.
A Hadoop cluster contains a single NameNode and many DataNodes.
Data blocks are replicated for high availability and fast access.
Figure: HDFS Architecture
7. HDFS
NameNode
Runs on a separate machine.
Manages the file system namespace and controls access by external clients.
Stores file system metadata in memory: file information, the blocks that make up each file, and the DataNode on which each block resides.
DataNode
Runs on a separate machine and is the basic unit of file storage.
Periodically reports all of its stored blocks to the NameNode.
Serves read and write requests from clients, and carries out block create, delete, and copy commands from the NameNode.
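As a quick illustration, here is a minimal sketch of writing and reading a file through the standard Hadoop FileSystem Java API; the NameNode address and file paths are hypothetical placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address

        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits files into blocks and the
        // NameNode decides which DataNodes hold each replica.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("Hello, HDFS");
        }

        // Read it back through the same FileSystem abstraction.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```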
8. MapReduce
Programming model for data processing.
Hadoop can run MapReduce programs written in various languages, such as Java and Python.
Parallel processing makes MapReduce well suited to very large-scale data analysis.
Mappers produce intermediate results.
Reducers aggregate those results.
9. MapReduce
Files are split into fixed-size blocks (64 MB by default) and stored on DataNodes.
Programs written against this model can run on distributed clusters in parallel.
Both the input data and the output are sets of key/value pairs.
There are two main phases: Map and Reduce.
11. MapReduce (continued...)
Map
Map processes each block separately and in parallel.
It generates a set of intermediate key/value pairs.
The results of these logical blocks are then reassembled.
Reduce
Accepts an intermediate key and its set of related values.
Processes the intermediate key and values.
Produces a usually smaller set of output values.
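The canonical WordCount program shows both phases: the Mapper emits a (word, 1) pair for every word in its input split, and the Reducer sums the counts per word. This sketch uses the standard org.apache.hadoop.mapreduce API; input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```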
12. YARN
YARN (Yet Another Resource Negotiator).
MapReduce 1.0 had issues with scalability, memory usage, and synchronization.
YARN addresses problems with MapReduce 1.0’s
architecture, specifically with the JobTracker service.
YARN splits up the two major functionalities of the
JobTracker, resource management and job
scheduling/monitoring, into separate daemons.
Rather than burdening a single node with handling
scheduling and resource management for the entire
cluster, YARN now distributes this responsibility across
the cluster.
13. YARN (continued...)
Figure: YARN Architecture (via Apache)
14. Avro
Avro is a framework for performing remote procedure
calls and data serialization.
It can be used to pass data from one program or language
to another, e.g. from C to Pig.
It is well suited for use with scripting languages such as Pig, because Avro always stores data together with its schema, so the data is self-describing.
Avro can also handle schema changes while still preserving access to the data.
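A minimal sketch of Avro's generic, schema-driven Java API: the record layout and file name are hypothetical, but the pattern shows how the schema travels with the data, making the file self-describing.

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // An Avro schema, defined in JSON; this record layout is hypothetical.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        // Write a record; the schema is embedded in the file,
        // which is what makes Avro data self-describing.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        File file = new File("users.avro");
        try (DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Read it back; the reader recovers the schema from the file itself.
        try (DataFileReader<GenericRecord> reader =
                new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord rec : reader) {
                System.out.println(rec.get("name") + " " + rec.get("age"));
            }
        }
    }
}
```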
15. Pig
Pig is a framework consisting of a high-level scripting language (Pig Latin) and a run-time environment that allows users to execute MapReduce on a Hadoop cluster.
Like HiveQL in Hive, Pig Latin is a higher-level language
that compiles to MapReduce.
Pig is more flexible than Hive with respect to possible data formats.
Pig’s data model is similar to the relational data model,
except that tuples (a.k.a. records or rows) can be nested.
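A small sketch of driving Pig Latin from Java through the PigServer API, assuming a hypothetical space-delimited log file; each registered statement compiles down to MapReduce stages.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; ExecType.MAPREDUCE would run
        // the same script as MapReduce jobs on a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Each Pig Latin statement below compiles to MapReduce stages.
        // The input file and its layout are hypothetical.
        pig.registerQuery("logs = LOAD 'access.log' USING PigStorage(' ') "
                        + "AS (ip:chararray, url:chararray);");
        pig.registerQuery("by_ip = GROUP logs BY ip;");
        pig.registerQuery("hits = FOREACH by_ip GENERATE group, COUNT(logs);");

        // STORE triggers execution of the compiled plan.
        pig.store("hits", "hits_per_ip");
    }
}
```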
16. Hive
Apache Hive is a data warehouse infrastructure built on
top of Hadoop for providing data summarization, query
and analysis.
Using Hadoop was not easy for end users who were not familiar with the MapReduce framework.
A Hive query is converted to MapReduce tasks.
Figure: Hive Architecture
17. Hive (continued...)
Building blocks of Hive:
Metastore stores the system catalog and metadata about tables, columns, partitions, etc.
Driver manages the lifecycle of a HiveQL statement as it moves through Hive.
Query Compiler compiles HiveQL into a directed acyclic graph of MapReduce tasks.
Execution Engine executes the tasks produced by the compiler in proper dependency order.
Hive Server provides a Thrift interface and a JDBC/ODBC server.
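For example, a client can submit HiveQL through the JDBC server; this sketch assumes a hypothetical HiveServer2 endpoint and table, and the query it runs is compiled by Hive into MapReduce tasks.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, port, and table are hypothetical.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "user", "");
             Statement stmt = conn.createStatement()) {

            // Hive converts this HiveQL query into MapReduce tasks.
            ResultSet rs = stmt.executeQuery(
                "SELECT ip, COUNT(*) FROM access_logs GROUP BY ip");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```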
18. HBase
HBase is a distributed, column-oriented database built on top of HDFS.
HBase is not relational and does not support SQL, but given the proper problem space it is able to do what an RDBMS cannot.
HBase is modeled with an HBase master node orchestrating a cluster of one or more RegionServer slaves.
The HBase master is responsible for bootstrapping a fresh install, for assigning regions to registered RegionServers, and for recovering from RegionServer failures.
HBase manages a ZooKeeper instance as the authority on
cluster state.
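A minimal sketch of the classic (pre-1.0, HTable-based) HBase Java client API in use around the time of this deck; the table, column family, and cell names are hypothetical, and the table is assumed to already exist with a column family "cf".

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads the ZooKeeper quorum etc. from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();

        // Hypothetical table, assumed to exist with column family "cf".
        HTable table = new HTable(conf, "users");

        // Put: write one cell, addressed by (row key, family, qualifier).
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
        table.put(put);

        // Get: read the cell back by row key.
        Result result = table.get(new Get(Bytes.toBytes("row1")));
        byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"));
        System.out.println(Bytes.toString(value));

        table.close();
    }
}
```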
20. Mahout
Mahout is a scalable machine-learning and data mining
library.
There are currently four main groups of algorithms in
Mahout.
Recommendations, a.k.a. collaborative filtering.
Classification, a.k.a. categorization.
Clustering.
Frequent itemset mining, a.k.a. parallel frequent pattern mining.
Mahout is not simply a collection of pre-existing algorithms: the algorithms in the Mahout library belong to the subset that can be executed in a distributed fashion and have been written to run on MapReduce.
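As an illustration of the recommendations group, here is a sketch using Mahout's non-distributed Taste API (the MapReduce-based implementations follow the same concepts); the ratings file is hypothetical and holds "userID,itemID,preference" lines.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv is a hypothetical "userID,itemID,preference" file.
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // User-based collaborative filtering: similar users, then a
        // neighborhood of the 10 nearest, then a recommender over both.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood =
            new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
            new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 item recommendations for user 1.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " " + item.getValue());
        }
    }
}
```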
22. Sqoop
Sqoop allows easy import and export of data between structured data stores and Hadoop.
A command-line tool that can import any JDBC-supported database into Hadoop.
Generates Writables for use in MapReduce jobs.
High-performance connectors for some RDBMSs.
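Sqoop is normally driven directly from the command line; purely to keep this sketch in Java, the example below shells out to the sqoop binary (assumed to be on the PATH) with a typical import invocation. The database URL, credentials, table, and target directory are hypothetical placeholders.

```java
import java.util.Arrays;

public class SqoopImportExample {
    public static void main(String[] args) throws Exception {
        // Equivalent to running this on the command line:
        //   sqoop import --connect jdbc:mysql://dbhost/shop \
        //     --username reporter --table orders --target-dir /user/demo/orders
        ProcessBuilder pb = new ProcessBuilder(Arrays.asList(
            "sqoop", "import",
            "--connect", "jdbc:mysql://dbhost/shop",  // hypothetical database
            "--username", "reporter",                 // hypothetical credentials
            "--table", "orders",                      // table to import
            "--target-dir", "/user/demo/orders"));    // HDFS destination
        pb.inheritIO();
        int exit = pb.start().waitFor();
        System.exit(exit);
    }
}
```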
23. Flume
Flume is a distributed, reliable, and available service for efficiently moving large amounts of data as it is produced.
Well suited for gathering logs from multiple systems and inserting them into HDFS as they are generated.
Design goals: reliability, scalability, manageability, and extensibility.
24. ZooKeeper
ZooKeeper is a distributed, open-source coordination
service for distributed applications.
Distributed applications are especially prone to errors such as race conditions and deadlock.
ZooKeeper relieves distributed applications of the responsibility of implementing coordination services from scratch.
ZooKeeper allows distributed processes to coordinate
with each other through a shared hierarchical namespace.
The namespace consists of data registers called znodes, which are similar to files and directories.
ZooKeeper data is kept in memory, which allows it to achieve high throughput and low latency.
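A minimal sketch using the ZooKeeper Java client: it connects to a hypothetical ensemble, creates a znode in the hierarchical namespace, and reads its data back.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (address is hypothetical).
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000, new Watcher() {
            public void process(WatchedEvent event) {
                System.out.println("event: " + event.getState());
            }
        });

        // Create a znode in the hierarchical namespace; znodes behave like
        // a mix of small files (they hold data) and directories (they can
        // have children).
        if (zk.exists("/app", false) == null) {
            zk.create("/app", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        zk.create("/app/config", "v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Read the data back from ZooKeeper's in-memory state.
        byte[] data = zk.getData("/app/config", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}
```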
26. Chukwa
Chukwa is a Hadoop subproject devoted to large-scale log
collection and analysis.
Chukwa is built on top of the HDFS and MapReduce framework and inherits Hadoop's scalability and robustness.
Four components of Chukwa:
Agents that run on each machine and emit data.
Collectors that receive data from the agents and write it to stable storage.
MapReduce jobs for parsing and archiving the data.
HICC, the Hadoop Infrastructure Care Center, a web-portal-style interface for displaying data.
28. HCatalog
An incubator-level project at Apache.
HCatalog is a metadata and table storage management
service for HDFS.
HCatalog depends on the Hive metastore and exposes it
to other services such as MapReduce and Pig.
HCatalog's goal is to simplify the user's interaction with HDFS data and to enable data sharing between tools and execution platforms.