A Seminar Report
On

HADOOP

By Varun Narang

MA 399 Seminar
IIT Guwahati

Roll Number: 09012332
Index of Topics:

1. Abstract

2. Introduction

3. What is MapReduce?

4. HDFS
      Assumptions
      Design
      Concepts
      The Communication Protocols
      Robustness
      Cluster Rebalancing
      Data Integrity
      Metadata Disk Failure
      Snapshots

5. Data Organization
      Data Blocks
      Staging
      Replication Pipelining

6. Accessibility

7. Space Reclamation
    • File Deletes and Undeletes
    • Decrease Replication Factor
    • Hadoop Filesystems

8. Hadoop Archives
Bibliography
1) Hadoop: The Definitive Guide, O’Reilly, 2009, Yahoo! Press.

2) MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay
   Ghemawat.

3) Ranking and Semi-supervised Classification on Large Scale Graphs Using Map-Reduce,
   Delip Rao, David Yarowsky, Dept. of Computer Science, Johns Hopkins University.

4) Improving MapReduce Performance in Heterogeneous Environments, Matei Zaharia,
   Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica, University of
   California, Berkeley.

5) MapReduce in a Week, Hannah Tang, Albert Wong, Aaron Kimball, Winter 2007.
Abstract

Problem Statement:
The total amount of digital data in the world has exploded in recent years. This has
happened primarily due to information (or data) generated by various enterprises all over
the globe. In 2006, the total was estimated to be 0.18 zettabytes, and a tenfold growth was
forecast by 2011, to 1.8 zettabytes.
(1 zettabyte = 10^21 bytes)

   The problem is that while the storage capacities of hard drives have increased
massively over the years, access speeds (the rate at which data can be read from drives)
have not kept up. A typical drive from 1990 could store 1370 MB of data and had a
transfer speed of 4.4 MB/s, so all the data on a full drive could be read in around 300
seconds. In 2010, 1 TB drives are the standard hard disk size, but the transfer speed is
around 100 MB/s, so it takes more than two and a half hours to read all the data off the
disk.

Solution Proposed:
Parallelisation:
An obvious solution to this problem is parallelisation. The input data is usually large,
and the computations have to be distributed across hundreds or thousands of machines
in order to finish in a reasonable amount of time.
Reading 1 TB from a single hard drive takes a long time, but reading it in parallel from
100 machines brings the time down to about two minutes.
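As a rough worked example using the transfer rate quoted above: at 100 MB/s, a single
drive needs about 1,000,000 MB / 100 MB/s = 10,000 seconds (roughly 2.8 hours) to read
1 TB. If the same data is spread over 100 drives read in parallel, each drive reads only
about 10 GB, which takes around 100 seconds, i.e. under two minutes.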

The key issues involved in this solution are:
   • Hardware failure
   • Combining the data after analysis (i.e. reading)

Apache Hadoop is a framework for running applications on large clusters built of
commodity hardware. The Hadoop framework transparently provides applications with both
reliability and data motion.
It solves the problem of hardware failure through replication: redundant copies of the
data are kept by the system so that, in the event of a failure, another copy is available
(the Hadoop Distributed File System).
The second problem is solved by a simple programming model, MapReduce. This
programming paradigm abstracts the problem from raw data reads and writes to a
computation over a series of keys.

Even though HDFS and MapReduce are the most significant components of Hadoop, other
subprojects provide complementary services. The various subprojects of Hadoop include:
   • Core
   • Avro
   • Pig
   • HBase
   • ZooKeeper
   • Hive
   • Chukwa
Introduction
  Hadoop is designed to efficiently process large volumes of information by connecting many
commodity computers together to work in parallel. A single 1000-CPU machine (i.e. a
supercomputer with vast memory and storage) would cost a great deal. Hadoop instead ties
smaller, more reasonably priced machines together into a single cost-effective compute cluster
and parallelizes the computation across them.
The features of Hadoop that stand out are its simplified programming model and its efficient,
automatic distribution of data and work across machines.

We now take a deeper look into these two main features of Hadoop and list their important
characteristics.


1. Data Distribution:

 In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in. The
Hadoop Distributed File System (HDFS) splits large data files into chunks which are managed by
different nodes in the cluster. In addition, each chunk is replicated across several machines, so
that a single machine failure does not result in any data being unavailable; if a machine does fail,
its data is re-replicated from the remaining copies. Even though the file chunks are replicated and
distributed across several machines, they form a single namespace, so their contents are
universally accessible.




   Data is conceptually record-oriented in the Hadoop programming framework. Individual input
files are broken into segments, and each segment is processed by a node. The Hadoop framework
schedules processes to run in proximity to the location of the data/records, using knowledge from
the distributed file system. Each computation process running on a node operates on a subset of
the data, and which data is operated on by which node is decided based on its proximity to that
node: most data is read from the local disk straight into the CPU, alleviating strain on network
bandwidth and preventing unnecessary network transfers. This strategy of moving the
computation to the data, instead of moving the data to the computation, allows Hadoop to achieve
high data locality, which in turn results in high performance.
2. MapReduce: Isolated Processes

  Hadoop limits the amount of communication that can be performed by the processes, as each
individual record is processed by a task in isolation from the others. This makes the whole
framework much more reliable. Programs must be written to conform to a particular
programming model, named "MapReduce."


MapReduce is composed of two chief elements: Mappers and Reducers.
    1. Data segments or records are processed in isolation by tasks called Mappers.
    2. The output from the Mappers is then brought together by Reducers, where results from
       different Mappers are merged together.
 Separate nodes in a Hadoop cluster still communicate, but only implicitly: pieces of data can be
tagged with key names which inform Hadoop how to send related bits of information to a common
destination node. Hadoop internally manages all of the data transfer and cluster topology issues.
  By restricting the communication between nodes, Hadoop makes the distributed system much
more reliable. Individual node failures can be worked around by restarting tasks on other
machines. The other workers continue to operate as though nothing went wrong, leaving the
challenging aspects of partially restarting the program to the underlying Hadoop layer.




                               What is MapReduce?
MapReduce is a programming model for processing and generating large data sets. Users
specify a map function that processes a key/value pair to generate a set of intermediate key/value
pairs, and a reduce function that merges all intermediate values associated with the same
intermediate key.
Programs written in this functional style are automatically parallelized and executed on a large
cluster of commodity machines. The run-time system takes care of the details of partitioning the
input data, scheduling the program's execution across a set of machines, handling machine
failures, and managing the required inter-machine communication (i.e. this procedure is abstracted
or hidden from the user, who can focus on the computational problem).

Note: This abstraction was inspired by the map and reduce primitives present in Lisp
and many other functional languages.




The Programming Model:
The computation takes a set of input key/value pairs, and produces a set of output
key/value pairs. The user of the MapReduce library expresses the computation as two
functions: Map and Reduce.

Map, written by the user, takes an input pair and produces a set of intermediate
key/value pairs. The MapReduce library groups together all intermediate values
associated with the same intermediate key I and passes them to the Reduce function.




The Reduce function, also written by the user, accepts an intermediate key I and a
set of values for that key. It merges these values together to form a possibly smaller
set of values. Typically just zero or one output value is produced per Reduce
invocation. The intermediate values are supplied to the user's reduce function via an
iterator.
Map and Reduce (Associated Types):
The input keys and values are drawn from a different domain than the output keys
and values. Also, the intermediate keys and values are from the same domain as the
output keys and values.
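To make the model concrete, here is a minimal word-count sketch written against Hadoop's Java
MapReduce API (the org.apache.hadoop.mapreduce classes). It is an illustration rather than code
from this report, and the class names are made up:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for each input record (a line of text), emit an intermediate (word, 1) pair per word.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);              // intermediate key/value pair
      }
    }
  }
}

// Reducer: receives a word and an iterator over all its counts, and sums them.
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : values) {
      sum += count.get();
    }
    context.write(key, new IntWritable(sum));  // final (word, total) pair
  }
}
```

The mapper emits one intermediate pair per token; the framework groups the pairs by key, and the
reducer merges the values for each key, exactly the Map and Reduce roles described above.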

  Hadoop Map-Reduce is a software framework for easily writing applications which
process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

Analyzing the Data with Hadoop MapReduce:
MapReduce works by breaking the processing into two phases: the map phase and the
reduce phase. It splits the input data-set into independent chunks which are processed by the map
tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are
then input to the reduce tasks.
Both the input and the output of the job are stored in a file-system. The framework takes care of
scheduling tasks, monitoring them, and re-executing the failed tasks.
Data Locality Optimisation:
  Typically the compute nodes and the storage nodes are the same. The Map-Reduce framework
and the Distributed File System run on the same set of nodes. This configuration allows the
framework to effectively schedule tasks on the nodes where data is already present, resulting in
very high aggregate bandwidth across the cluster.

There are two types of nodes that control the job execution process:
   1. a jobtracker
   2. a number of tasktrackers

The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on
tasktrackers.

Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the
overall progress of each job.
If a task fails, the jobtracker can reschedule it on a different tasktracker.
Input splits: Hadoop divides the input to a MapReduce job into fixed-size pieces called input
splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map
function for each record in the split.
The quality of the load balancing increases as the splits become more fine-grained. On the other
hand, if splits are too small, then the overhead of managing the splits and of map task creation
begins to dominate the total job execution time. For most jobs, a good split size tends to be the
size of an HDFS block, 64 MB by default, because that is the largest amount of input that is
guaranteed to be stored on a single node.

Map tasks write their output to local disk, not to HDFS. Why? Map output is intermediate output: it's
processed by reduce tasks to produce the final output, and once the job is complete the map
output can be thrown away. So storing it in HDFS, with replication, would be a waste of time. If
the node running the map task fails before the map output has been consumed by the reduce task,
Hadoop simply reruns the map task on another node to re-create the map output.

Reduce tasks don’t have the advantage of data locality—the input to a single reduce task is
normally the output from all mappers.

 In case we have a single reduce task that is fed by all of the map tasks: The sorted map outputs
have to be transferred across the network to the node where the reduce task is running, where they
are merged and then passed to the user-defined reduce function.

The output of the reducer is normally stored in HDFS for reliability. For each HDFS block of the
reduce output, the first replica is stored on the local node, with other replicas being stored on off-
rack nodes.
MapReduce data flow with a single reduce task

When there are multiple reducers:
 The map tasks partition their output, each creating one partition for each reduce task. There can
be many keys (and their associated values) in each partition, but the records for every key are all
in a single partition.




    MapReduce data flow with multiple reduce tasks.

    It is also possible to have zero reduce tasks as illustrated in the figure below.
MapReduce data flow with no reduce tasks




Combiner Functions

Many MapReduce jobs are limited by the bandwidth available on the cluster. In order to
minimize the data transferred between the map and reduce tasks, combiner functions can be
used. Hadoop allows the user to specify a combiner function to be run on the map output; the
combiner function's output forms the input to the reduce function. Combiner functions can
help cut down the amount of data shuffled between the maps and the reduces.
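As a hedged sketch of how a combiner is wired into a job, the driver below continues the
hypothetical word-count classes from earlier and reuses the reducer as the combiner (valid here
because summing counts is associative and commutative). The Job constructor shown reflects the
0.20-era API; newer releases use Job.getInstance(conf):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordCountMapper.class);
    // The combiner runs on each map task's output before the shuffle,
    // reducing the volume of data sent over the network to the reducers.
    job.setCombinerClass(WordCountReducer.class);
    job.setReducerClass(WordCountReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note that not every reduce function can double as a combiner: the operation must tolerate being
applied repeatedly to partial results without changing the final answer.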



Hadoop Streaming:

  Hadoop provides an API to MapReduce that allows you to write your map and reduce functions
in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface
between Hadoop and your program, so you can use any language that can read standard input and
write to standard output to write your MapReduce program.


Hadoop Pipes:

Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming, which
uses standard input and output to communicate with the map and reduce code, Pipes uses sockets
as the channel over which the tasktracker communicates with the process running the C++ map or
reduce function. JNI is not used.
HADOOP DISTRIBUTED FILESYSTEM (HDFS)

   Filesystems that manage the storage across a network of machines are called distributed
filesystems. They are network-based, and thus all the complications of network programming are
also present in distributed filesystems. Hadoop comes with a distributed filesystem called HDFS,
which stands for Hadoop Distributed Filesystem.

  HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very
large amounts of data (terabytes or even petabytes), and provide high-throughput access to this
information.

      ASSUMPTIONS AND GOALS:

         1. Hardware Failure
  An HDFS instance may consist of hundreds or thousands of server machines, each storing part
of the file system’s data. In case of such a large number of nodes, the probability of one of them
failing becomes substantial.

        2. Streaming Data Access

 Applications that run on HDFS need streaming access to their data sets. HDFS is designed more
for batch processing rather than interactive use by users. The emphasis is on high throughput of
data access rather than low latency of data access.

        3. Large Data Sets

  Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to
terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate
data bandwidth and scale to hundreds of nodes in a single cluster.

        4. Simple Coherency Model

  HDFS applications need a write-once-read-many access model for files. A file once created,
written, and closed need not be changed. This assumption simplifies data coherency issues and
enables high throughput data access. A Map/Reduce application or a web crawler application fits
perfectly with this model. There is a plan to support appending-writes to files in the future.

        5. “Moving Computation is Cheaper than Moving Data”

  A computation requested by an application is much more efficient if it is executed near the data
it operates on. This is especially true when the size of the data set is huge. This minimizes
network congestion and increases the overall throughput of the system. HDFS provides interfaces
for applications to move themselves closer to where the data is located.

        6. Portability Across Heterogeneous Hardware and Software Platforms

 HDFS has been designed to be easily portable from one platform to another. This facilitates
widespread adoption of HDFS as a platform of choice for a large set of applications.
The Design of HDFS:

   •   Very large files:

   “Very large” in this context means files that are hundreds of megabytes, gigabytes, or
   terabytes in size. There are Hadoop clusters running today that store petabytes of data.

   •   Streaming data access:

   HDFS is built around the idea that the most efficient data processing pattern is a write-once,
   read-many-times pattern. A dataset is typically generated or copied from source, then various
   analyses are performed on that dataset over time. Each analysis will involve a large
   proportion, if not all, of the dataset, so the time to read the whole dataset is more important
   than the latency in reading the first record.

   •   Commodity hardware:

   Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s designed to run on
   clusters of commodity hardware, for which the chance of a node failing somewhere in the
   cluster is high, at least for large clusters. HDFS is designed to carry on working without a
   noticeable interruption to the user in the face of such failure.

   While this design makes HDFS suitable for a broad range of applications, it is not currently a
   good fit for applications with the following characteristics:

   •   Low-latency data access

   Applications that require low-latency access to data, in the tens of milliseconds range, will
   not work well with HDFS.

   •   Lots of small files:

   Since the master node (or namenode) holds file-system metadata in memory, the limit to the
   number of files in a filesystem is governed by the amount of memory on the namenode. As a
   rule of thumb, each file, directory, and block takes about 150 bytes.

   •   Multiple writers, arbitrary file modifications:

   Files in HDFS may be written to by a single writer. Writes are always made at the end of the
   file. There is no support for multiple writers, or for modifications at arbitrary offsets in the
   file.
A few Important Concepts of Hadoop Distributed File System:
1. Blocks:
   A disk has a block size, which is the minimum amount of data that it can read or write.
Filesystems for a single disk build on this by dealing with data in blocks, which are an integral
multiple of the disk block size. Filesystem blocks are typically a few kilobytes in size, while disk
blocks are normally 512 bytes.

HDFS too has the concept of a block, but it is a much larger unit—64 MB by default. Like in a
filesystem for a single disk, files in HDFS are broken into block-sized chunks, which are stored
as independent units. Unlike a filesystem for a single disk, a file in HDFS that is smaller than a
single block does not occupy a full block’s worth of underlying storage.
HDFS blocks are large compared to disk blocks; the reason is to minimize the cost of seeks
relative to the time spent transferring the data.




    Having a block abstraction for a distributed filesystem brings several benefits:

     A file can be larger than any single disk in the network. There’s nothing that requires the
      blocks from a file to be stored on the same disk, so they can take advantage of any of the
      disks in the cluster.

     Making the unit of abstraction a block rather than a file simplifies the storage subsystem.

     Blocks fit well with replication for providing fault tolerance and availability. To insure
      against corrupted blocks and disk and machine failure, each block is replicated to a small
      number of physically separate machines (typically three).
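To make the block abstraction concrete, the sketch below uses Hadoop's Java filesystem API to
list which datanodes hold each block of a file. The cluster address and path are hypothetical and
error handling is omitted; this is an illustration of the API, not code from the report:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
  public static void main(String[] args) throws Exception {
    // Hypothetical HDFS URI; in practice it comes from the cluster configuration.
    String uri = "hdfs://namenode:8020/user/varun/data/input.txt";

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    FileStatus status = fs.getFileStatus(new Path(uri));
    // One BlockLocation per block of the file, covering its full length.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

    for (int i = 0; i < blocks.length; i++) {
      System.out.printf("block %d: offset=%d length=%d hosts=%s%n",
          i, blocks[i].getOffset(), blocks[i].getLength(),
          String.join(",", blocks[i].getHosts()));
    }
  }
}
```

For a replicated file, each block typically reports several hosts, one per replica.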

        2. Namenodes and Datanodes:

 A HDFS cluster has two types of node operating in a master-worker pattern: a namenode (the
master) and a number of datanodes (workers).

The namenode has two chief functions:
   • To manage the filesystem namespace.
   • To maintain the filesystem tree and the metadata for all the files and directories in the
      tree.
This information is stored persistently on the local disk in the form of two files: the namespace
image and the edit log. The namenode also knows the datanodes on which all the blocks for a
given file are located; however, it does not store block locations persistently, since this
information is reconstructed from the datanodes when the system starts.
A client accesses the filesystem on behalf of the user by communicating with the namenode and
datanodes.




  Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are
told to (by clients or the namenode), and they report back to the namenode periodically with lists
of blocks that they are storing. Without the namenode, the filesystem cannot be used. In fact, if
the machine running the namenode were obliterated, all the files on the filesystem would be lost
since there would be no way of knowing how to reconstruct the files from the blocks on the
datanodes. For this reason, it is important to make the namenode resilient to failure, and Hadoop
provides two mechanisms for this.
The first way is to back up the files that make up the persistent state of the filesystem metadata.
Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems.
These writes are synchronous and atomic. The usual configuration choice is to write to local disk
as well as a remote NFS mount.

 Another approach is to run a secondary namenode. It does not act as a namenode. Its main role
is to periodically merge the namespace image with the edit log to prevent the edit log from
becoming too large. The secondary namenode usually runs on a separate physical machine, since
it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps a
copy of the merged namespace image, which can be used in the event of the namenode failing.
However, the state of the secondary namenode lags that of the primary, so in the event of total
failure of the primary, data loss is almost guaranteed.


3. The File System Namespace:

  HDFS supports a traditional hierarchical file organization. A user or an application can create
directories and store files inside these directories. The file system namespace hierarchy is similar
to most other existing file systems; one can create and remove files, move a file from one
directory to another, or rename a file. HDFS does not yet implement user quotas or access
permissions.
  The Namenode maintains the file system namespace. Any change to the file system namespace
or its properties is recorded by the Namenode.

4. Data Replication:

  HDFS is designed to reliably store very large files across machines in a large cluster. It stores
each file as a sequence of blocks; all blocks in a file except the last block are the same size. The
blocks of a file are replicated for fault tolerance. The block size and replication factor are
configurable per file. An application can specify the number of replicas of a file.
The NameNode makes all decisions regarding replication of blocks. It periodically receives a
Heartbeat and a Blockreport from each of the Datanodes in the cluster. Receipt of a Heartbeat
implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a
Datanode.
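As a hedged illustration of the per-file settings mentioned above, the Java filesystem API lets a
client pass an explicit replication factor and block size when creating a file. The path, replication
value, and cluster address below are made up for the example:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

    Path file = new Path("/user/varun/output/notes.txt"); // hypothetical path
    short replication = 2;                                // replicas for this file only
    long blockSize = 64 * 1024 * 1024;                    // 64 MB, the HDFS default
    int bufferSize = 4096;

    // Create the file with an explicit replication factor and block size.
    FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize);
    out.writeUTF("hello hdfs");
    out.close();
  }
}
```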




5. Replica Placement:

 Optimizing replica placement distinguishes HDFS from most other distributed file systems. The
purpose of a rack-aware replica placement policy is to improve data reliability, availability, and
network bandwidth utilization.
 Large HDFS instances run on a cluster of computers that commonly spread across many racks.
Communication between two nodes in different racks has to go through switches. In most cases,
network bandwidth between machines in the same rack is greater than network bandwidth
between machines in different racks.
  For the common case, when the replication factor is three, HDFS’s placement policy is to put
one replica on one node in the local rack, another on a node in a different (remote) rack, and the
last on a different node in that same remote rack. This policy cuts the inter-rack write traffic,
which generally improves write performance.

6. Replica Selection:
  To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read
request from a replica that is closest to the reader.

7. Safemode:

  On startup, the NameNode enters a special state called Safemode. Replication of data blocks
does not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat
and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that
a DataNode is hosting. Each block has a specified minimum number of replicas. A block is
considered safely replicated when the minimum number of replicas of that data block has checked
in with the NameNode. After a configurable percentage of safely replicated data blocks checks in
with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It
then determines the list of data blocks (if any) that still have fewer than the specified number of
replicas. The NameNode then replicates these blocks to other DataNodes.

8. The Persistence of File System Metadata:

  The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called
the EditLog to persistently record every change that occurs to file system metadata. The
NameNode uses a file in its local host OS file system to store the EditLog. The entire file system
namespace, including the mapping of blocks to files and file system properties, is stored in a file
called the FsImage. The FsImage is stored as a file in the NameNode’s local file system too.
 The NameNode keeps an image of the entire file system namespace and file Blockmap in
memory. This key metadata item is designed to be compact, such that a NameNode with 4 GB of
RAM is plenty to support a huge number of files and directories.
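To put the earlier rule of thumb (roughly 150 bytes of namenode memory per file, directory, or
block) into perspective: ten million files, each occupying a single block, correspond to about
twenty million in-memory objects, i.e. on the order of 3 GB of namenode RAM, which is
consistent with the 4 GB figure above.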
   The DataNode stores HDFS data in files in its local file system. The DataNode has no
knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local file
system. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to
determine the optimal number of files per directory and creates subdirectories appropriately. It is
not optimal to create all local files in the same directory because the local file system might not
be able to efficiently support a huge number of files in a single directory. When a DataNode starts
up, it scans through its local file system, generates a list of all HDFS data blocks that correspond
to each of these local files, and sends this report to the NameNode: this is the Blockreport.
The Communication Protocols:
   All HDFS communication protocols are layered on top of the TCP/IP protocol. A client
establishes a connection to a configurable TCP port on the NameNode machine. It talks the
ClientProtocol with the NameNode. The DataNodes talk to the NameNode using the DataNode
Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the
DataNode Protocol. By design, the NameNode never initiates any RPCs. Instead, it only responds
to RPC requests issued by DataNodes or clients.
Robustness:
  The primary objective of HDFS is to store data reliably even in the presence of failures. The
three common types of failures are NameNode failures, DataNode failures and network
partitions.

        Data Disk Failure, Heartbeats and Re-Replication

  Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition
can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode
detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes
without recent Heartbeats as dead and does not forward any new IO requests to them. Any data
that was registered to a dead DataNode is not available to HDFS any more. DataNode death may
cause the replication factor of some blocks to fall below their specified value. The NameNode
constantly tracks which blocks need to be replicated and initiates replication whenever necessary.
The necessity for re-replication may arise due to many reasons: a DataNode may become
unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the
replication factor of a file may be increased.
Cluster Rebalancing

   The HDFS architecture is compatible with data rebalancing schemes. A scheme might
automatically move data from one DataNode to another if the free space on a DataNode falls
below a certain threshold. In the event of a sudden high demand for a particular file, a scheme
might dynamically create additional replicas and rebalance other data in the cluster. These types
of data rebalancing schemes are not yet implemented.

      Data Integrity

 It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can
occur because of faults in a storage device, network faults, or buggy software. The HDFS client
software implements checksum checking on the contents of HDFS files. When a client creates an
HDFS file, it computes a checksum of each block of the file and stores these checksums in a
separate hidden file in the same HDFS namespace. When a client retrieves file contents it verifies
that the data it received from each DataNode matches the checksum stored in the associated
checksum file. If not, then the client can opt to retrieve that block from another DataNode that has
a replica of that block.

      Metadata Disk Failure

The FsImage and the EditLog are central data structures of HDFS. A corruption of these files can
cause the HDFS instance to be non-functional. For this reason, the NameNode can be configured
to support maintaining multiple copies of the FsImage and EditLog. Any update to either the
FsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously.
This synchronous updating of multiple copies of the FsImage and EditLog may degrade the rate
of namespace transactions per second that a NameNode can support. However, this degradation is
acceptable because even though HDFS applications are very data intensive in nature, they are not
metadata intensive. When a NameNode restarts, it selects the latest consistent FsImage and
EditLog to use.
The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode
machine fails, manual intervention is necessary. Currently, automatic restart and failover of the
NameNode software to another machine is not supported.

      Snapshots

Snapshots support storing a copy of data at a particular instant of time. One usage of the snapshot
feature may be to roll back a corrupted HDFS instance to a previously known good point in time.
HDFS does not currently support snapshots but will in a future release.
Data Organization
        Data Blocks

  HDFS is designed to support very large files. Applications that are compatible with HDFS are
those that deal with large data sets. These applications write their data only once but they read it
one or more times and require these reads to be satisfied at streaming speeds. HDFS supports
write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an
HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a
different DataNode.

        Staging

  A client request to create a file does not reach the NameNode immediately. In fact, initially the
HDFS client caches the file data into a temporary local file. Application writes are transparently
redirected to this temporary local file. When the local file accumulates data worth over one HDFS
block size, the client contacts the NameNode. The NameNode inserts the file name into the file
system hierarchy and allocates a data block for it. The NameNode responds to the client request
with the identity of the DataNode and the destination data block. Then the client flushes the block
of data from the local temporary file to the specified DataNode. When a file is closed, the
remaining un-flushed data in the temporary local file is transferred to the DataNode. The client
then tells the NameNode that the file is closed. At this point, the NameNode commits the file
creation operation into a persistent store. If the NameNode dies before the file is closed, the file is
lost.
 The above approach has been adopted after careful consideration of target applications that run
on HDFS. These applications need streaming writes to files. If a client writes to a remote file
directly without any client side buffering, the network speed and the congestion in the network
impacts throughput considerably. This approach is not without precedent. Earlier distributed file
systems, e.g. AFS, have used client side caching to improve performance. A POSIX requirement
has been relaxed to achieve higher performance of data uploads.

        Replication Pipelining

  When a client is writing data to an HDFS file, its data is first written to a local file as explained
in the previous section. Suppose the HDFS file has a replication factor of three. When the local
file accumulates a full block of user data, the client retrieves a list of DataNodes from the
NameNode. This list contains the DataNodes that will host a replica of that block. The client then
flushes the data block to the first DataNode. The first DataNode starts receiving the data in small
portions (4 KB), writes each portion to its local repository and transfers that portion to the second
DataNode in the list. The second DataNode, in turn starts receiving each portion of the data
block, writes that portion to its repository and then flushes that portion to the third DataNode.
Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can be
receiving data from the previous one in the pipeline and at the same time forwarding data to the
next one in the pipeline. Thus, the data is pipelined from one DataNode to the next.


Accessibility:

  HDFS can be accessed from applications in many different ways. Natively, HDFS provides a
Java API for applications to use. A C language wrapper for this Java API is also available. In
addition, an HTTP browser can be used to browse the files of an HDFS instance. Work is in
progress to expose HDFS through the WebDAV protocol.
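As a minimal, hedged sketch of the Java API route, a client can open a file in HDFS and copy it
to standard output as shown below. The cluster address and file path are made up for the example:

```java
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatHdfsFile {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://namenode:8020/user/varun/data/input.txt"; // hypothetical path

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    InputStream in = null;
    try {
      in = fs.open(new Path(uri));                    // returns an FSDataInputStream
      IOUtils.copyBytes(in, System.out, 4096, false); // stream the file to stdout
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
```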

      Space Reclamation:
        1. File Deletes and Undeletes

  When a file is deleted by a user or an application, it is not immediately removed from HDFS.
Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as
long as it remains in /trash. A file remains in /trash for a configurable amount of time. After the
expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace. The deletion
of a file causes the blocks associated with the file to be freed. Note that there could be an
appreciable time delay between the time a file is deleted by a user and the time of the
corresponding increase in free space in HDFS.
  A user can Undelete a file after deleting it as long as it remains in the /trash directory. If a user
wants to undelete a file that he/she has deleted, he/she can navigate the /trash directory and
retrieve the file. The /trash directory contains only the latest copy of the file that was deleted. The
/trash directory is just like any other directory with one special feature: HDFS applies specified
policies to automatically delete files from this directory. The current default policy is to delete
files from /trash that are more than 6 hours old. In the future, this policy will be configurable
through a well defined interface.
2. Decrease Replication Factor
  When the replication factor of a file is reduced, the NameNode selects excess replicas that can
be deleted. The next Heartbeat transfers this information to the DataNode. The DataNode then
removes the corresponding blocks and the corresponding free space appears in the cluster. Once
again, there might be a time delay between the completion of the setReplication API call and the
appearance of free space in the cluster.
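As a small, hedged sketch of the setReplication API call mentioned above, a client can lower the
replication factor of an existing file as follows; the path and cluster address are hypothetical:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LowerReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

    // Ask the namenode to reduce this file's replication factor to 1.
    // Excess replicas are removed lazily, as described above.
    boolean scheduled = fs.setReplication(new Path("/user/varun/logs/old.log"), (short) 1);
    System.out.println("replication change accepted: " + scheduled);
  }
}
```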


3. Hadoop Filesystems

Hadoop has an abstract notion of filesystem, of which HDFS is just one implementation. The
Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop, and there
are several concrete implementations, some of which are described in the following table.




Filesystem         URI scheme   Java implementation               Description
                                (all under org.apache.hadoop)

Local              file         fs.LocalFileSystem                A filesystem for a locally connected disk with
                                                                  client-side checksums. Use RawLocalFileSystem
                                                                  for a local filesystem with no checksums.

HDFS               hdfs         hdfs.DistributedFileSystem        Hadoop's distributed filesystem. HDFS is
                                                                  designed to work efficiently in conjunction
                                                                  with MapReduce.

HFTP               hftp         hdfs.HftpFileSystem               A filesystem providing read-only access to
                                                                  HDFS over HTTP. (Despite its name, HFTP has
                                                                  no connection with FTP.) Often used with
                                                                  distcp for parallel copying.

HSFTP              hsftp        hdfs.HsftpFileSystem              A filesystem providing read-only access to
                                                                  HDFS over HTTPS. (Again, this has no
                                                                  connection with FTP.)

HAR                har          fs.HarFileSystem                  A filesystem layered on another filesystem
                                                                  for archiving files. Hadoop Archives are
                                                                  typically used for archiving files in HDFS to
                                                                  reduce the namenode's memory usage.

KFS (CloudStore)   kfs          fs.kfs.KosmosFileSystem           CloudStore (formerly Kosmos filesystem) is a
                                                                  distributed filesystem like HDFS or Google's
                                                                  GFS, written in C++.

FTP                ftp          fs.ftp.FTPFileSystem              A filesystem backed by an FTP server.

S3 (native)        s3n          fs.s3native.NativeS3FileSystem    A filesystem backed by Amazon S3.

S3 (block-based)   s3           fs.s3.S3FileSystem                A filesystem backed by Amazon S3, which
                                                                  stores files in blocks (much like HDFS) to
                                                                  overcome S3's 5 GB file size limit.
Hadoop Archives:


 HDFS stores small files inefficiently, since each file is stored in a block, and block metadata is
held in memory by the namenode. Thus, a large number of small files can eat up a lot of memory
on the namenode. (Note, however, that small files do not take up any more disk space than is
required to store the raw contents of the file. For example, a 1 MB file stored with a block size of
128 MB uses 1 MB of disk space, not 128 MB.) Hadoop Archives, or HAR files, are a file
archiving facility that packs files into HDFS blocks more efficiently, thereby reducing namenode
memory usage while still allowing transparent access to files. In particular, Hadoop Archives can
be used as input to MapReduce.



Using Hadoop Archives

  A Hadoop Archive is created from a collection of files using the archive tool. The tool runs a
MapReduce job to process the input files in parallel, so you need a running MapReduce cluster
in order to use it.

Limitations

  There are a few limitations to be aware of with HAR files. Creating an archive creates a copy of
the original files, so you need as much disk space as the files you are archiving to create the
archive (although you can delete the originals once you have created the archive). There is
currently no support for archive compression, although the files that go into the archive can be
compressed (HAR files are like tar files in this respect). Archives are immutable once they have
been created. To add or remove files, you must recreate the archive. In practice, this is not a
problem for files that don’t change after being written, since they can be archived in batches on a
regular basis, such as daily or weekly. As noted earlier, HAR files can be used as input to
MapReduce. However, there is no archive-aware InputFormat that can pack multiple files into a
single MapReduce split, so processing lots of small files, even in a HAR file, can still be
inefficient.

Contenu connexe

Tendances

Presentation on Hadoop Technology
Presentation on Hadoop TechnologyPresentation on Hadoop Technology
Presentation on Hadoop TechnologyOpenDev
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map ReduceUrvashi Kataria
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceeakasit_dpu
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringBADR
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache HadoopAjit Koti
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation Shivanee garg
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingCloudera, Inc.
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce FrameworkEdureka!
 
Overview of Hadoop and HDFS
Overview of Hadoop and HDFSOverview of Hadoop and HDFS
Overview of Hadoop and HDFSBrendan Tierney
 
Final Year Project Guidance
Final Year Project GuidanceFinal Year Project Guidance
Final Year Project GuidanceVarad Meru
 

Tendances (20)

Presentation on Hadoop Technology
Presentation on Hadoop TechnologyPresentation on Hadoop Technology
Presentation on Hadoop Technology
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big data
 
Hadoop
HadoopHadoop
Hadoop
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data Engineering
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data Processing
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
 
Overview of Hadoop and HDFS
Overview of Hadoop and HDFSOverview of Hadoop and HDFS
Overview of Hadoop and HDFS
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
An Introduction to the World of Hadoop
An Introduction to the World of HadoopAn Introduction to the World of Hadoop
An Introduction to the World of Hadoop
 
Hadoop Report
Hadoop ReportHadoop Report
Hadoop Report
 
Final Year Project Guidance
Final Year Project GuidanceFinal Year Project Guidance
Final Year Project Guidance
 

Similaire à Seminar_Report_hadoop

Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopGERARDO BARBERENA
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabadsreehari orienit
 
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node CombinersHadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node Combinersijcsit
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System cscpconf
 
Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce cscpconf
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415SANTOSH WAYAL
 
Distributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptxDistributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptxUttara University
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methodspaperpublications3
 
Learn what is Hadoop-and-BigData
Learn  what is Hadoop-and-BigDataLearn  what is Hadoop-and-BigData
Learn what is Hadoop-and-BigDataThanusha154
 
Hadoop live online training
Hadoop live online trainingHadoop live online training
Hadoop live online trainingHarika583
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoopManoj Jangalva
 

Similaire à Seminar_Report_hadoop (20)

Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 
Cppt
CpptCppt
Cppt
 
Cppt
CpptCppt
Cppt
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop ppt2
Hadoop ppt2Hadoop ppt2
Hadoop ppt2
 
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node CombinersHadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System
 
Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce
 
hadoop
hadoophadoop
hadoop
 
hadoop
hadoophadoop
hadoop
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415
 
Distributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptxDistributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptx
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methods
 
Learn what is Hadoop-and-BigData
Learn  what is Hadoop-and-BigDataLearn  what is Hadoop-and-BigData
Learn what is Hadoop-and-BigData
 
Hadoop live online training
Hadoop live online trainingHadoop live online training
Hadoop live online training
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoop
 
2.1-HADOOP.pdf
2.1-HADOOP.pdf2.1-HADOOP.pdf
2.1-HADOOP.pdf
 

Dernier

Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Introduction

Hadoop is designed to efficiently process large volumes of information by connecting many commodity computers together to work in parallel.
A single machine with 1,000 CPUs (i.e. a supercomputer with vast memory and storage) would be very expensive. Hadoop instead parallelizes the computation by tying smaller, more reasonably priced machines together into a single cost-effective compute cluster.

The features of Hadoop that stand out are its simplified programming model and its efficient, automatic distribution of data and work across machines. We now take a deeper look at these two main features of Hadoop and describe their important characteristics.

1. Data Distribution:
In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in. The Hadoop Distributed File System (HDFS) splits large data files into chunks which are managed by different nodes in the cluster. In addition, each chunk is replicated across several machines, so that a single machine failure does not make any data unavailable. If a system failure leaves some chunks with fewer copies than intended, the data is re-replicated in response. Even though the file chunks are replicated and distributed across several machines, they form a single namespace, so their contents are universally accessible.

Data is conceptually record-oriented in the Hadoop programming framework. Individual input files are broken into segments, and each segment is processed by a node. The Hadoop framework schedules processes to run in proximity to the location of the data/records, using knowledge from the distributed file system. Each computation process running on a node operates on a subset of the data, and which data a node operates on is decided by its proximity to that node: most data is read from the local disk straight into the CPU, alleviating strain on network bandwidth and preventing unnecessary network transfers. This strategy of moving the computation to the data, instead of moving the data to the computation, allows Hadoop to achieve high data locality, which in turn results in high performance.
2. MapReduce: Isolated Processes
Hadoop limits the amount of communication that can be performed by the processes, as each individual record is processed by a task in isolation from the others. This makes the whole framework much more reliable. Programs must be written to conform to a particular programming model, named "MapReduce." MapReduce is composed of two chief elements: Mappers and Reducers.
1. Data segments or records are processed in isolation by tasks called Mappers.
2. The output from the Mappers is then brought together by Reducers, where results from different Mappers are merged.

Separate nodes in a Hadoop cluster communicate only implicitly. Pieces of data can be tagged with key names which inform Hadoop how to send related bits of information to a common destination node. Hadoop internally manages all of the data transfer and cluster topology issues. By restricting the communication between nodes, Hadoop makes the distributed system much more reliable: individual node failures can be worked around by restarting tasks on other machines, and the other workers continue to operate as though nothing went wrong, leaving the challenging aspects of partially restarting the program to the underlying Hadoop layer.

What is MapReduce?
MapReduce is a programming model for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value
pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication (i.e. this procedure is abstracted, or hidden, from the user, who can focus on the computational problem).

Note: This abstraction was inspired by the map and reduce primitives present in Lisp and many other functional languages.

The Programming Model:
The computation takes a set of input key/value pairs and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce.

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.

The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges these values together to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator.
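To make the model concrete, a minimal word-count job can be expressed with Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce). The sketch below is only illustrative; the class and variable names are arbitrary. The map function emits an intermediate (word, 1) pair for every word it sees, and the reduce function sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for each line of text, emit an intermediate (word, 1) pair per word.
// With the default text input format, the key is the byte offset of the line
// and the value is the line itself.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);   // intermediate key/value pair
    }
  }
}

// Reducer: receives each word together with an iterator over all of its
// counts, and emits a single (word, total) pair.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : values) {
      sum += count.get();
    }
    context.write(key, new IntWritable(sum));   // typically one output value per key
  }
}

Notice that the user code never deals with parallelism, failures, or data movement: it only describes what happens to one record (in the Mapper) or to one key and its values (in the Reducer).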
Map and Reduce (Associated Types):
The input keys and values are drawn from a different domain than the output keys and values. The intermediate keys and values, however, are from the same domain as the output keys and values.

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

Analyzing the Data with Hadoop MapReduce:
MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. It splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Both the input and the output of the job are stored in a filesystem. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.

Data Locality Optimisation:
Typically the compute nodes and the storage nodes are the same: the MapReduce framework and the distributed file system run on the same set of nodes. This configuration allows the framework to schedule tasks effectively on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

There are two types of nodes that control the job execution process:
1. jobtrackers
2. tasktrackers
The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.
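For illustration, a driver program for the hypothetical word-count classes sketched earlier could configure and submit such a job as follows. The paths come from the command line; in older Hadoop releases the job object is constructed with new Job(conf, name) rather than Job.getInstance.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: configures the job and submits it; the framework then handles
// splitting the input, scheduling tasks near their data, re-running failed
// tasks and shuffling intermediate data.
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory; must not exist yet

    System.exit(job.waitForCompletion(true) ? 0 : 1);        // block until the job finishes
  }
}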
Input splits:
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. The quality of the load balancing increases as the splits become more fine-grained. On the other hand, if splits are too small, the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default, since a block is the largest amount of input that is guaranteed to be stored on a single node and can therefore be processed locally. (A sketch of how split sizes can be bounded in code is given at the end of this section.)

Map tasks write their output to local disk, not to HDFS. Why? Map output is intermediate output: it is processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away, so storing it in HDFS, with replication, would be a waste of time. It is also possible that the node running the map task fails before the map output has been consumed by the reduce task.

Reduce tasks do not have the advantage of data locality: the input to a single reduce task is normally the output from all mappers. In the case of a single reduce task fed by all of the map tasks, the sorted map outputs have to be transferred across the network to the node where the reduce task is running, where they are merged and then passed to the user-defined reduce function. The output of the reducer is normally stored in HDFS for reliability. For each HDFS block of the reduce output, the first replica is stored on the local node, with the other replicas being stored on off-rack nodes.

[Figure: MapReduce data flow with a single reduce task]
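As a rough sketch of how split sizes can be influenced in code, the FileInputFormat helper methods below set lower and upper bounds on the split size for a job; the class name and the particular values are only illustrative.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Illustrative only: bounding split sizes for a job. By default the split
// size works out to the HDFS block size, so each map task reads one block
// from a local disk where possible.
public class SplitSizing {
  static void boundSplits(Job job) {
    // Avoid very small splits, whose per-task overhead would dominate.
    FileInputFormat.setMinInputSplitSize(job, 16 * 1024 * 1024L);  // 16 MB
    // Avoid splits larger than a block, part of which would then have to be
    // read over the network.
    FileInputFormat.setMaxInputSplitSize(job, 64 * 1024 * 1024L);  // 64 MB
  }
}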
When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for every key are all in a single partition. (A sketch of a partitioner that enforces this is given below.)

[Figure: MapReduce data flow with multiple reduce tasks]

It is also possible to have zero reduce tasks, as illustrated in the figure below.

[Figure: MapReduce data flow with no reduce tasks]
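A partitioner decides which reduce task receives a given intermediate key. The sketch below is illustrative (the class name is arbitrary) and simply hashes the key, which is essentially what Hadoop's default hash partitioner does: because the partition number depends only on the key, every record for a given key ends up in the same reduce task.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner for (Text, IntWritable) intermediate pairs.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Same key, same partition, same reduce task.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

// Wiring it into the hypothetical driver sketched earlier:
//   job.setNumReduceTasks(4);
//   job.setPartitionerClass(WordPartitioner.class);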
Combiner Functions
Many MapReduce jobs are limited by the bandwidth available on the cluster. In order to minimize the data transferred between the map and reduce tasks, combiner functions are introduced. Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function. Combiner functions can help cut down the amount of data shuffled between the maps and the reduces.
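In code, a combiner is specified on the job in the driver. The sketch below is illustrative and reuses the hypothetical WordCountReducer from the earlier word-count sketch, which is valid because summing counts is commutative and associative; the combiner then runs on each map task's output before the shuffle.

import org.apache.hadoop.mapreduce.Job;

// Illustrative: reusing the reducer class as the combiner for word count.
public class CombinerSetup {
  static void addCombiner(Job job) {
    job.setCombinerClass(WordCountReducer.class);
  }
}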
Hadoop Streaming:
Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.

Hadoop Pipes:
Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Pipes uses sockets as the channel over which the tasktracker communicates with the process running the C++ map or reduce function. JNI is not used.
HADOOP DISTRIBUTED FILESYSTEM (HDFS)

Filesystems that manage the storage across a network of machines are called distributed filesystems. They are network-based, and thus all the complications of network programming are also present in a distributed filesystem. Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem. HDFS is designed to hold very large amounts of data (terabytes or even petabytes) and to provide high-throughput access to this information.

ASSUMPTIONS AND GOALS:

1. Hardware Failure
An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. With such a large number of nodes, the probability of one of them failing at any time becomes substantial.

2. Streaming Data Access
Applications that run on HDFS need streaming access to their data sets. HDFS is designed more for batch processing than for interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access.

3. Large Data Sets
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster.

4. Simple Coherency Model
HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high-throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending writes to files in the future.

5. "Moving Computation is Cheaper than Moving Data"
A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge, as it minimizes network congestion and increases the overall throughput of the system. HDFS provides interfaces for applications to move themselves closer to where the data is located.

6. Portability Across Heterogeneous Hardware and Software Platforms
HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as the platform of choice for a large set of applications.
The Design of HDFS:
• Very large files: "Very large" in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.
• Streaming data access: HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, and various analyses are then performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
• Commodity hardware: Hadoop does not require expensive, highly reliable hardware to run on. It is designed to run on clusters of commodity hardware, for which the chance of node failure somewhere in the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.

There are, however, applications for which HDFS does not work well. These applications have the following common characteristics:
• Low-latency data access: Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS.
• Lots of small files: Since the master node (or namenode) holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes; one million files, each occupying a single block, would therefore need at least 300 MB of namenode memory.
• Multiple writers, arbitrary file modifications: Files in HDFS may be written to by a single writer only. Writes are always made at the end of the file. There is no support for multiple writers, or for modifications at arbitrary offsets in the file.
A Few Important Concepts of the Hadoop Distributed File System:

1. Blocks:
A disk has a block size, which is the minimum amount of data that it can read or write. Filesystems for a single disk build on this by dealing with data in blocks, which are an integral multiple of the disk block size. Filesystem blocks are typically a few kilobytes in size, while disk blocks are normally 512 bytes. HDFS too has the concept of a block, but it is a much larger unit: 64 MB by default. As in a filesystem for a single disk, files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage.

HDFS blocks are large compared to disk blocks. Having a block abstraction for a distributed filesystem brings several benefits:
• A file can be larger than any single disk in the network. Nothing requires the blocks of a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster.
• Making the unit of abstraction a block rather than a file simplifies the storage subsystem.
• Blocks fit well with replication for providing fault tolerance and availability. To guard against corrupted blocks and against disk and machine failure, each block is replicated to a small number of physically separate machines (typically three).

2. Namenodes and Datanodes:
An HDFS cluster has two types of node operating in a master-worker pattern: a namenode (the master) and a number of datanodes (the workers). The namenode has two chief functions:
• To manage the filesystem namespace.
• To maintain the filesystem tree and the metadata for all the files and directories in the tree.
The filesystem tree and metadata are stored persistently on the namenode's local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently, since this information is reconstructed from the datanodes when the system starts.

A client accesses the filesystem on behalf of the user by communicating with the namenode and the datanodes. Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of the blocks that they are storing.

Without the namenode, the filesystem cannot be used. In fact, if the machine running the namenode were destroyed, all the files on the filesystem would be lost, since there would be no way of knowing how to reconstruct the files from the blocks on the datanodes. For this reason, it is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this.
The first way is to back up the files that make up the persistent state of the filesystem metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems. These writes are synchronous and atomic. The usual configuration choice is to write to the local disk as well as a remote NFS mount.

Another approach is to run a secondary namenode, which, despite its name, does not act as a namenode. Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. The secondary namenode usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing. However, the state of the secondary namenode lags that of the primary, so in the event of total failure of the primary, data loss is almost certain.

3. The File System Namespace:
HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems: one can create and remove files, move a file from one directory to another, or rename a file. HDFS does not yet implement user quotas or access permissions. The namenode maintains the file system namespace, and any change to the namespace or its properties is recorded by the namenode.

4. Data Replication:
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file: an application can specify the number of replicas of a file.
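As an illustrative sketch of this per-file configuration, the FileSystem API allows the replication factor and block size to be passed when a file is created; the namenode URI, path, and values below are placeholders.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative: choosing the replication factor and block size for an
// individual file when it is created.
public class PerFileReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

    FSDataOutputStream out = fs.create(
        new Path("/user/example/data.seq"),
        true,                 // overwrite if the file already exists
        4096,                 // I/O buffer size in bytes
        (short) 2,            // number of replicas for this file
        128 * 1024 * 1024L);  // block size for this file: 128 MB
    out.writeBytes("example payload");
    out.close();
    fs.close();
  }
}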
The NameNode makes all decisions regarding the replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.

5. Replica Placement:
Optimized replica placement distinguishes HDFS from most other distributed file systems. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. Large HDFS instances run on a cluster of computers that commonly spreads across many racks. Communication between two nodes in different racks has to go through switches; in most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks. For the common case, when the replication factor is three, HDFS's placement policy is to put one replica on one node in the local rack, another on a different node in the local rack, and the last on a different node in a different rack. This policy cuts the inter-rack write traffic, which generally improves write performance.

6. Replica Selection:
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from the replica that is closest to the reader.

7. Safemode:
On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur while the NameNode is in the Safemode state. The NameNode receives Heartbeat and Blockreport messages from the DataNodes; a Blockreport contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is
considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks has checked in with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas and replicates these blocks to other DataNodes.

8. The Persistence of File System Metadata:
The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode's local file system too. The NameNode keeps an image of the entire file system namespace and file Blockmap in memory. This key metadata item is designed to be compact, such that a NameNode with 4 GB of RAM is plenty to support a huge number of files and directories.

The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge about HDFS files; it stores each block of HDFS data in a separate file in its local file system. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories appropriately. It is not optimal to create all local files in the same directory because the local file system might not be able to efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans through its local file system, generates a list of all HDFS data blocks that correspond to these local files, and sends this report to the NameNode: this is the Blockreport.
The Communication Protocols:
All HDFS communication protocols are layered on top of the TCP/IP protocol. A client establishes a connection to a configurable TCP port on the NameNode machine and speaks the ClientProtocol with the NameNode. The DataNodes talk to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol. By design, the NameNode never initiates any RPCs; instead, it only responds to RPC requests issued by DataNodes or clients.
Robustness:
The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failure are NameNode failures, DataNode failures, and network partitions.

Data Disk Failure, Heartbeats and Re-Replication
Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message, marks DataNodes without recent Heartbeats as dead, and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is no longer available to HDFS. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The need for re-replication may arise for many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
Cluster Rebalancing
The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.

Data Integrity
It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents, it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If not, the client can opt to retrieve that block from another DataNode that has a replica of the block.

Metadata Disk Failure
The FsImage and the EditLog are central data structures of HDFS. A corruption of these files can cause the HDFS instance to be non-functional. For this reason, the NameNode can be configured to support maintaining multiple copies of the FsImage and EditLog. Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to be updated synchronously. This synchronous updating of multiple copies may degrade the rate of namespace transactions per second that a NameNode can support. However, this degradation is acceptable because, even though HDFS applications are very data-intensive in nature, they are not metadata-intensive. When a NameNode restarts, it selects the latest consistent FsImage and EditLog to use.

The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the NameNode software to another machine is not supported.

Snapshots
Snapshots support storing a copy of data at a particular instant of time. One usage of the snapshot feature may be to roll back a corrupted HDFS instance to a previously known good point in time. HDFS does not currently support snapshots, but will in a future release.
Data Organization

Data Blocks
HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once, but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk resides on a different DataNode.

Staging
A client request to create a file does not reach the NameNode immediately. Initially, the HDFS client caches the file data in a temporary local file, and application writes are transparently redirected to this temporary local file. When the local file accumulates data worth more than one HDFS block size, the client contacts the NameNode. The NameNode inserts the file name into the file system hierarchy and allocates a data block for it, then responds to the client request with the identity of the DataNode and the destination data block. The client flushes the block of data from the local temporary file to the specified DataNode. When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode. The client then tells the NameNode that the file is closed, at which point the NameNode commits the file creation operation into a persistent store. If the NameNode dies before the file is closed, the file is lost.

The above approach has been adopted after careful consideration of the target applications that run on HDFS. These applications need streaming writes to files. If a client wrote to a remote file directly without any client-side buffering, the network speed and the congestion in the network would impact throughput considerably. This approach is not without precedent: earlier distributed file systems, e.g. AFS, have used client-side caching to improve performance. A POSIX requirement has been relaxed to achieve higher performance of data uploads.

Replication Pipelining
When a client is writing data to an HDFS file, its data is first written to a local file as explained in the previous section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode. This list contains the DataNodes that will host a replica of that block. The client then flushes the data block to the first DataNode. The first DataNode starts receiving the data in small portions (4 KB), writes each portion to its local repository, and transfers that portion to the second DataNode in the list. The second DataNode, in turn, starts receiving each portion of the data block, writes that portion to its repository, and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. A DataNode can thus be receiving data from the previous node in the pipeline and at the same time forwarding data to the next node in the pipeline: the data is pipelined from one DataNode to the next.
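From the client's point of view, all of this staging and pipelining happens behind an ordinary output stream. The following illustrative sketch, with placeholder URI and paths, simply copies a local file into HDFS.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Illustrative client-side view of a write: the program just copies bytes
// into an output stream; the staging of data on the client and the
// pipelining of each block to its replicas happen behind these calls.
public class CopyToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

    InputStream in = new BufferedInputStream(new FileInputStream("/tmp/local-input.txt"));
    FSDataOutputStream out = fs.create(new Path("/user/example/input.txt"));

    // Copy in 4 KB buffers and close both streams when done.
    IOUtils.copyBytes(in, out, 4096, true);
  }
}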
Accessibility:
HDFS can be accessed from applications in many different ways. Natively, HDFS provides a Java API for applications to use. A C language wrapper for this Java API is also available. In addition, an HTTP browser can be used to browse the files of an HDFS instance. Work is in progress to expose HDFS through the WebDAV protocol.

Space Reclamation:

1. File Deletes and Undeletes
When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time; after the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace, and the deletion causes the blocks associated with the file to be freed. Note that there could be an appreciable delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.

A user can undelete a file after deleting it as long as it remains in the /trash directory: he or she can navigate the /trash directory and retrieve the file. The /trash directory contains only the latest copy of each deleted file. The /trash directory is just like any other directory, with one special feature: HDFS applies specified policies to automatically delete files from this directory. The current default policy is to delete files from /trash that are more than 6 hours old. In the future, this policy will be configurable through a well-defined interface.

2. Decrease Replication Factor
When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted. The next Heartbeat transfers this information to the DataNode, which then removes the corresponding blocks, and the corresponding free space appears in the cluster. Once again, there might be a time delay between the completion of the setReplication API call and the appearance of free space in the cluster.
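The setReplication call referred to above belongs to the FileSystem API. A small illustrative sketch, with placeholder URI and path, lowers a file's replication factor to 2.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative: request a lower replication factor for one file. The call
// returns quickly; excess replicas are removed later, as DataNodes act on
// instructions piggybacked on Heartbeats.
public class DecreaseReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

    boolean accepted = fs.setReplication(new Path("/user/example/input.txt"), (short) 2);
    System.out.println("Replication change accepted: " + accepted);
    fs.close();
  }
}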
3. Hadoop Filesystems
Hadoop has an abstract notion of filesystem, of which HDFS is just one implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop, and there are several concrete implementations, listed below (the Java classes are given relative to the org.apache.hadoop package):

• Local (URI scheme: file; implementation: fs.LocalFileSystem): A filesystem for a locally connected disk with client-side checksums. Use RawLocalFileSystem for a local filesystem with no checksums.
• HDFS (URI scheme: hdfs; implementation: hdfs.DistributedFileSystem): Hadoop's distributed filesystem. HDFS is designed to work efficiently in conjunction with MapReduce.
• HFTP (URI scheme: hftp; implementation: hdfs.HftpFileSystem): A filesystem providing read-only access to HDFS over HTTP. (Despite its name, HFTP has no connection with FTP.) Often used with distcp.
• HSFTP (URI scheme: hsftp; implementation: hdfs.HsftpFileSystem): A filesystem providing read-only access to HDFS over HTTPS. (Again, this has no connection with FTP.)
• HAR (URI scheme: har; implementation: fs.HarFileSystem): A filesystem layered on another filesystem for archiving files. Hadoop Archives are typically used for archiving files in HDFS to reduce the namenode's memory usage.
• KFS (CloudStore) (URI scheme: kfs; implementation: fs.kfs.KosmosFileSystem): CloudStore (formerly the Kosmos filesystem) is a distributed filesystem like HDFS or Google's GFS, written in C++.
• FTP (URI scheme: ftp; implementation: fs.ftp.FtpFileSystem): A filesystem backed by an FTP server.
• S3 (native) (URI scheme: s3n; implementation: fs.s3native.NativeS3FileSystem): A filesystem backed by Amazon S3.
• S3 (block-based) (URI scheme: s3; implementation: fs.s3.S3FileSystem): A filesystem backed by Amazon S3, which stores files in blocks (much like HDFS) to overcome S3's 5 GB file size limit.
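Because all of these implementations share the FileSystem abstraction, the same client code can work against any of them. The following illustrative sketch lists a directory for whichever filesystem the supplied URI names; the URIs shown are placeholders.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative use of the FileSystem abstraction: the URI scheme selects the
// concrete implementation (local, HDFS, S3, ...), and the listing code is
// identical in every case.
public class ListAnyFileSystem {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Try e.g. "file:///tmp" or "hdfs://namenode:8020/user/example".
    String uri = args.length > 0 ? args[0] : "file:///tmp";

    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    for (FileStatus status : fs.listStatus(new Path(uri))) {
      System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
    }
    fs.close();
  }
}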
Hadoop Archives:
HDFS stores small files inefficiently, since each file is stored in a block and block metadata is held in memory by the namenode. Thus, a large number of small files can eat up a lot of memory on the namenode. (Note, however, that small files do not take up any more disk space than is required to store the raw contents of the file. For example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.)

Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, thereby reducing namenode memory usage while still allowing transparent access to files. In particular, Hadoop Archives can be used as input to MapReduce.

Using Hadoop Archives
A Hadoop Archive is created from a collection of files using the archive tool. The tool runs a MapReduce job to process the input files in parallel, so you need a running MapReduce cluster to use it.

Limitations
There are a few limitations to be aware of with HAR files. Creating an archive creates a copy of the original files, so you need as much disk space as the files you are archiving in order to create the archive (although you can delete the originals once the archive has been created). There is currently no support for archive compression, although the files that go into the archive can be compressed (HAR files are like tar files in this respect). Archives are immutable once they have been created: to add or remove files, you must recreate the archive. In practice, this is not a problem for files that don't change after being written, since they can be archived in batches on a regular basis, such as daily or weekly.

As noted earlier, HAR files can be used as input to MapReduce. However, there is no archive-aware InputFormat that can pack multiple files into a single MapReduce split, so processing lots of small files, even in a HAR file, can still be inefficient.
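Since a HAR is exposed as a read-only filesystem under the har URI scheme, a path inside an archive can be supplied to MapReduce like any other input path. The sketch below is hypothetical: the archive path is assumed to exist, and the exact URI form depends on where the archive is stored.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Hypothetical sketch: pointing a job at files inside an assumed Hadoop
// Archive. Each small file inside the archive still becomes its own split,
// so the inefficiency described above remains.
public class HarAsInput {
  static void useArchive(Job job) throws IOException {
    FileInputFormat.addInputPath(job, new Path("har:///user/example/files.har/logs"));
  }
}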