2. What is Hadoop?
§ Built and distributed as a project of the Apache Software Foundation;
§ Hadoop ecosystem:
§ Common – a set of components and interfaces for a DFS and general I/O;
§ Avro – a serialization system for efficient, cross-language RPC and persistent data storage;
§ MapReduce – a distributed data processing model and execution environment that runs on large clusters of commodity machines;
§ HDFS – a distributed file system that runs on large clusters of commodity hardware.
File & Content Solutions!
3. Common Terms in Hadoop HDFS
§ Name node – manages the file system namespace. It maintains the file system tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log.
§ Data node – the workhorses of the file system. Data nodes store and retrieve blocks when they are told to (by clients or the name node), and they report back to the name node periodically with lists of the blocks they are storing.
4. Common Terms in Hadoop HDFS
§ Secondary name node – its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. The secondary name node usually runs on a separate physical machine.
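The checkpoint idea can be illustrated with a toy model, assuming a simple dict-based namespace image and a list of (operation, path) edit-log entries. All names and operations here are illustrative, not Hadoop internals:

```python
# Toy sketch of a checkpoint: replay the edit log against the namespace
# image so the log can be truncated afterwards.

def checkpoint(namespace_image, edit_log):
    image = dict(namespace_image)
    for op, path in edit_log:
        if op in ("mkdir", "create"):
            image[path] = op
        elif op == "delete":
            image.pop(path, None)
    return image, []  # merged image, emptied edit log

image = {"/user": "mkdir"}
log = [("create", "/user/a.txt"), ("delete", "/user/a.txt"), ("mkdir", "/data")]
new_image, new_log = checkpoint(image, log)
print(sorted(new_image))  # ['/data', '/user']
```

This is why the merge is done on a separate machine: replaying a long log is CPU- and memory-intensive, and offloading it keeps the primary name node responsive.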
5. Hadoop Distributed File System - HDFS
§ HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
§ HDFS has a permissions model for files and directories that is much like POSIX (Portable Operating System Interface).
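As a small illustration of the POSIX-style model, the sketch below decodes an octal mode such as 644 (the same form accepted by the real `hdfs dfs -chmod` command) into owner/group/other rwx triples; the helper function is our own, not part of any Hadoop API:

```python
# Decode a POSIX-style octal permission mode into an "rwxrwxrwx" string.

def mode_to_string(mode):
    flags = "rwx"
    out = []
    for shift in (6, 3, 0):  # owner, group, others
        bits = (mode >> shift) & 0o7
        out.append("".join(f if bits & (4 >> i) else "-"
                           for i, f in enumerate(flags)))
    return "".join(out)

print(mode_to_string(0o644))  # rw-r--r--
print(mode_to_string(0o755))  # rwxr-xr-x
```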
8. MapReduce
§ "Map" step: The master node takes the input, divides it
into smaller sub-problems, and distributes them to
worker nodes. A worker node may do this again in turn,
leading to a multi-level tree structure. The worker node
processes the smaller problem, and passes the answer
back to its master node.
§ "Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
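The two steps above can be sketched as a single-process word count, the classic MapReduce example. This is plain Python, not a real Hadoop job, and the function names are illustrative:

```python
from itertools import groupby

def map_step(line):
    # Map: break the input into sub-problems, emitting (word, 1) pairs.
    return [(word, 1) for word in line.split()]

def reduce_step(word, counts):
    # Reduce: combine all the answers for one key into the final result.
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog"]
# Sorting plays the role of the shuffle phase: it brings equal keys together.
pairs = sorted(kv for line in lines for kv in map_step(line))
result = dict(reduce_step(w, [c for _, c in grp])
              for w, grp in groupby(pairs, key=lambda kv: kv[0]))
print(result["the"])  # 2
```

In a real cluster the map calls run in parallel on many worker nodes and the framework handles the shuffle; the per-key logic is the same.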
10. HDFS Storage Solution
§ The DataLogix Hadoop Storage Solution contains:
§ An enterprise scale-out storage solution for Hadoop workflows;
§ Native connectivity for Hadoop and ecosystem components:
§ Hive
§ HBase
§ Pig
§ Mahout
§ No single point of failure at the Name node;
§ No 3x mirroring; native N+M protection is used instead;
§ SnapShot, Sync, and NDMP backup are supported.
11. Writing into Hadoop with the DataLogix solution
§ The storage system acts as both the Name node and the Data node;
§ Provides scalability and protection of the data;
§ The Hadoop cluster no longer has a single point of failure and no longer writes multiple 64 MB-128 MB chunks of data to data nodes.
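As a back-of-the-envelope illustration of the chunk sizes mentioned above, the helper below (our own assumption, not part of any Hadoop API) counts how many blocks a file occupies at a given block size:

```python
import math

def num_blocks(file_size_mb, block_size_mb):
    # A file occupies ceil(size / block size) blocks; the last block
    # may be partially filled.
    return math.ceil(file_size_mb / block_size_mb)

print(num_blocks(1000, 64))   # a 1000 MB file -> 16 blocks at 64 MB
print(num_blocks(1000, 128))  # the same file -> 8 blocks at 128 MB
```

Fewer, larger blocks mean less name-node metadata and fewer seeks per file, which is one reason HDFS defaults favor large block sizes.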
12. Reading Hadoop Data
§ Data is read off the cluster back to the compute nodes;
§ The Data nodes are now compute nodes and are independent of the data in the Hadoop cluster:
§ The benefit is that Hadoop hardware can be upgraded without the need to migrate data.
13. More information?
§ More information about the Hadoop storage solutions?
Please contact us:
DataLogix
Phone: +31(0)30-7440710
e-mail: info@datalogix.nl
www.datalogix.nl