Cloud File System with GFS and HDFS
Dr. Neelesh Jain
Instructor and Trainer
Follow me: Youtube/FB : DrNeeleshjain
Cloud File System
GFS & HDFS
GoogleFS
Dr. Neelesh Jain 8770193851 Follow me: Youtube/FB : DrNeeleshjain
Introduction
A file system in the cloud is a hierarchical storage
system that provides shared access to file data. Users
can create, delete, modify, read, and write files and can
organize them logically in directory trees for intuitive
access.
GFS
Google File System (GFS or GoogleFS) is a
proprietary distributed file system developed by
Google to provide efficient, reliable access to data
using large clusters of commodity hardware. Its
successor, codenamed Colossus, was rolled out in
2010.
GFS is not implemented in the kernel of an operating
system; it is instead provided as a userspace library.
GFS
Files are divided into fixed-size chunks of 64 megabytes, similar to
clusters or sectors in regular file systems. Chunks are only very
rarely overwritten or shrunk; files are usually appended to or read.
GFS is also designed and optimized to run on Google's computing
clusters, dense nodes built from cheap "commodity" computers,
which means precautions must be taken against the high failure
rate of individual nodes and the resulting data loss. Other design
decisions favor high data throughput, even when it comes at the
cost of latency.
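The fixed-size chunking described above can be sketched in a few lines. This is an illustrative example, not Google's implementation; the function name and return shape are assumptions.

```python
# Hypothetical sketch: splitting a file into fixed-size 64 MB chunks, GFS-style.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size stated above

def chunk_ranges(file_size: int):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    ranges = []
    offset = 0
    while offset < file_size:
        length = min(CHUNK_SIZE, file_size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges

# A 200 MB file needs four chunks: three full 64 MB chunks plus an 8 MB tail.
print(len(chunk_ranges(200 * 1024 * 1024)))  # 4
```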
GFS
A GFS cluster consists of multiple nodes of two types: one Master
node and multiple chunkservers. Each file is divided into fixed-size
chunks, which the chunkservers store. Each chunk is assigned a
globally unique 64-bit label by the Master node at creation time, and
logical mappings from files to their constituent chunks are
maintained. Each chunk is replicated several times throughout the
network: three times by default, though this is configurable. Files in
high demand may have a higher replication factor, while files for
which the application client uses strict storage optimizations may be
replicated fewer than three times, in order to cope with quick
garbage-cleaning policies.
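A minimal sketch of the master's bookkeeping, assuming invented names throughout: each chunk gets a unique ID (standing in for the 64-bit label) and is mapped to a configurable number of replica locations, three by default.

```python
import itertools

class ChunkTable:
    """Toy model of the master's chunk-to-replica mapping (names are hypothetical)."""
    _ids = itertools.count(1)  # stand-in for the globally unique 64-bit labels

    def __init__(self, replication=3):
        self.replication = replication   # default replication factor is 3
        self.replicas = {}               # chunk_id -> list of chunkserver names

    def create_chunk(self, servers):
        chunk_id = next(self._ids)
        # Place the chunk on `replication` distinct chunkservers.
        self.replicas[chunk_id] = list(servers[: self.replication])
        return chunk_id

table = ChunkTable()
cid = table.create_chunk(["cs1", "cs2", "cs3", "cs4"])
print(table.replicas[cid])  # ['cs1', 'cs2', 'cs3']
```

A high-demand file would simply use a `ChunkTable` with a larger `replication` value.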
GFS
The Master server does not usually store the actual chunks. Instead,
it stores all the metadata associated with them: the tables mapping
the 64-bit labels to chunk locations and to the files they make up
(the file-to-chunk mapping), the locations of the chunk replicas,
which processes are reading from or writing to a particular chunk,
and whether a chunk is being "snapshotted" in order to replicate it
(usually at the Master's instigation when, due to node failures, the
number of copies of a chunk has fallen below the set number). All
this metadata is kept current by the Master server periodically
receiving updates from each chunkserver ("heartbeat messages").
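The re-replication trigger mentioned above can be sketched as follows. The function and variable names are hypothetical; the idea is that the master compares each chunk's live replica count, derived from which servers are still heartbeating, against the target.

```python
def under_replicated(replica_map, live_servers, target=3):
    """Return chunk IDs whose count of live replicas is below target."""
    flagged = []
    for chunk_id, servers in replica_map.items():
        live = [s for s in servers if s in live_servers]
        if len(live) < target:
            flagged.append(chunk_id)
    return flagged

replicas = {1: ["cs1", "cs2", "cs3"], 2: ["cs1", "cs4", "cs5"]}
# cs4 missed its heartbeats, so chunk 2 has only two live copies left.
print(under_replicated(replicas, {"cs1", "cs2", "cs3", "cs5"}))  # [2]
```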
GFS
Permissions for modifications are handled by a system of time-
limited, expiring "leases": the Master server grants a process
permission for a finite period of time, during which no other
process will be granted permission by the Master server to modify
the chunk.
Programs access the chunks by first querying the Master server
for the locations of the desired chunks. If the chunks are not
being operated on (i.e. no outstanding leases exist), the Master
replies with the locations, and the program then contacts the
chunkserver and receives the data directly.
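A toy sketch of time-limited leases, under assumed names and a made-up 60-second duration: the master grants one writer per chunk and refuses new grants until the lease expires.

```python
import time

class LeaseManager:
    """Hypothetical model of the master's expiring write leases."""
    def __init__(self, duration=60.0):
        self.duration = duration
        self.leases = {}  # chunk_id -> (holder, expiry_time)

    def grant(self, chunk_id, holder, now=None):
        now = time.monotonic() if now is None else now
        current = self.leases.get(chunk_id)
        if current and current[1] > now and current[0] != holder:
            return False  # another writer holds an unexpired lease
        self.leases[chunk_id] = (holder, now + self.duration)
        return True

lm = LeaseManager(duration=60.0)
print(lm.grant(42, "writerA", now=0.0))   # True: no lease outstanding
print(lm.grant(42, "writerB", now=10.0))  # False: writerA's lease still valid
print(lm.grant(42, "writerB", now=70.0))  # True: the lease has expired
```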
GFS
Google File System is designed for system-to-system interaction, not for
user-to-system interaction. The chunkservers replicate the data automatically.
GFS Features
GFS features include:
● Fault tolerance
● Critical data replication
● Automatic and efficient data recovery
● High aggregate throughput
● Reduced interaction between client and master because of the
large chunk size
● Namespace management and locking
● High availability
The largest GFS clusters have more than 1,000 nodes with over
300 TB of disk storage capacity, accessed by hundreds of clients
on a continuous basis.
HDFS (Hadoop Distributed File System)
HDFS
The Hadoop Distributed File System was developed using a
distributed file system design and runs on commodity hardware.
Unlike some other distributed systems, HDFS is highly fault-tolerant
and designed for low-cost hardware.
HDFS holds very large amounts of data and provides easy access. To
store such huge data, files are spread across multiple machines.
These files are stored in a redundant fashion to rescue the system
from possible data loss in case of failure. HDFS also makes
applications available for parallel processing.
Features of HDFS
● It is suitable for distributed storage and processing.
● Hadoop provides a command interface to interact with
HDFS.
● The built-in servers of the namenode and datanodes help
users easily check the status of the cluster.
● It offers streaming access to file system data.
● HDFS provides file permissions and authentication.
HDFS Architecture
HDFS follows the master-slave architecture and has the following elements.
Namenode
The namenode is commodity hardware that contains the GNU/Linux
operating system and the namenode software, i.e. software that can run on
commodity hardware. The system hosting the namenode acts as the master
server and does the following tasks −
● Manages the file system namespace.
● Regulates clients' access to files.
● Executes file system operations such as renaming, closing, and
opening files and directories.
HDFS Architecture
Datanode
The datanode is commodity hardware with the GNU/Linux
operating system and the datanode software. Every node
(commodity hardware/system) in a cluster runs a datanode. These
nodes manage the data storage of their system.
● Datanodes perform read-write operations on the file system,
as per client requests.
● They also perform operations such as block creation, deletion,
and replication according to the instructions of the namenode.
HDFS Architecture
● Block
Generally, user data is stored in the files of HDFS. A file is
divided into one or more segments, which are stored on
individual datanodes. These file segments are called blocks: a
block is the minimum amount of data that HDFS can read or
write. The default block size is 64 MB, but it can be increased
as needed via the HDFS configuration.
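The arithmetic implied above is simple ceiling division; this small illustration (names assumed, using the 64 MB default from the text) shows how many blocks a file of a given size occupies.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # the 64 MB default mentioned above

def block_count(file_size: int) -> int:
    """Number of HDFS blocks a file of file_size bytes occupies."""
    return -(-file_size // BLOCK_SIZE)  # ceiling division

print(block_count(1))                  # 1: even a 1-byte file uses one block
print(block_count(64 * 1024 * 1024))   # 1: exactly one full block
print(block_count(130 * 1024 * 1024))  # 3: two full blocks plus a 2 MB tail
```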
Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of
commodity hardware components, component failure is frequent. HDFS
should therefore have mechanisms for quick and automatic fault detection
and recovery.
Huge datasets − HDFS should scale to hundreds of nodes per cluster to
manage applications with huge datasets.
Hardware at data − A requested task can be done efficiently when the
computation takes place near the data. Especially where huge datasets are
involved, this reduces network traffic and increases throughput.
Thank you!
Request you to kindly subscribe to my
channel and follow me on
DrNeeleshjain
Stay tuned and keep learning!