2. Introduction
Google is a multi-billion dollar company and one of the major power players on the World Wide Web and beyond.
The company relies on a distributed computing system to provide users with the infrastructure
they need to access, create and alter data.
DISTRIBUTED FILE SYSTEM:
A distributed file system (DFS) is a file system whose data is stored on servers.
The servers allow client users to share files and store data just as if they were storing the
information locally.
However, the servers have full control over the data and grant access control to the clients.
3. Introduction (continued)
The machines that power Google's operations aren't cutting-edge powerful
computers.
They're relatively inexpensive machines running on Linux operating systems.
Google uses the GFS to organize and manipulate huge files.
The GFS is unique to Google and isn't for sale.
But it could serve as a model for other file systems with similar needs.
4. How GFS Works
GFS gives users access to the basic file commands.
These include commands like open, create, read, write and close, along with special
commands like append and snapshot.
Append allows clients to add information to an existing file without overwriting previously
written data.
Snapshot is a command that creates a quick copy of a file or directory tree.
Files in GFS tend to be very large, usually in the multi-gigabyte (GB) range.
Accessing and manipulating files that large would take up a lot of the network's bandwidth.
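The command set above can be sketched as a minimal in-memory mock. This is illustrative only: the class and method names are invented for the sketch and are not the real GFS client API.

```python
# Minimal in-memory sketch of the GFS-style command set described above.
# MockFS and its methods are illustrative names, not the real GFS API.

class MockFS:
    def __init__(self):
        self.files = {}          # filename -> bytes

    def create(self, name):
        self.files[name] = b""

    def write(self, name, data):
        self.files[name] = data  # plain write: overwrites the file

    def append(self, name, data):
        # append adds data without overwriting previously written data
        self.files[name] += data

    def read(self, name):
        return self.files[name]

    def snapshot(self):
        # snapshot: a quick copy of the current contents
        return dict(self.files)

fs = MockFS()
fs.create("log")
fs.write("log", b"hello ")
fs.append("log", b"world")
snap = fs.snapshot()        # frozen copy
fs.append("log", b"!")      # later appends don't affect the snapshot
```

The key contrast is that write replaces the file's contents while append only extends them, and a snapshot taken before a mutation is unaffected by it.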
5. Solution..
The GFS addresses this problem by breaking files up into chunks of 64 megabytes
(MB) each.
Every chunk receives a unique 64-bit identification number called a chunk handle.
By making all file chunks the same size, the GFS simplifies resource management.
Using chunk handles, it is easy to check the storage capacity of each computer in
the network.
The GFS can easily identify which computers' storage is full and which is unused.
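The chunking scheme above can be sketched in a few lines. The counter standing in for the 64-bit chunk handle is an assumption for illustration; real GFS handles are globally unique 64-bit IDs assigned by the master.

```python
import itertools
import math

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, the fixed GFS chunk size

# simple counter standing in for GFS's unique 64-bit chunk handles
_next_handle = itertools.count(1)

def split_into_chunks(file_size):
    """Return (chunk_handle, byte_offset) pairs covering a file of the
    given size, one entry per 64 MB chunk."""
    n_chunks = math.ceil(file_size / CHUNK_SIZE) or 1
    return [(next(_next_handle), i * CHUNK_SIZE) for i in range(n_chunks)]

# A 200 MB file needs ceil(200 / 64) = 4 chunks.
chunks = split_into_chunks(200 * 1024 * 1024)
```

Because every chunk has the same fixed size, the offset of chunk i is simply i × 64 MB, which is part of why uniform chunks simplify bookkeeping.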
7. Google organized the GFS into clusters of computers.
Within GFS clusters there are three kinds of entities :
clients, master servers and chunkservers.
“Client" refers to any entity that makes a file request.
The “master server” acts as the coordinator & maintains an operation log.
The master server also keeps track of metadata, which is the information that describes
chunks.
There's only one active master server per cluster at any one time.
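The master's role described above can be sketched as two metadata maps plus an operation log. All names here are illustrative assumptions, not the real master implementation.

```python
# Sketch of the master server's in-memory metadata, as described above:
# filenames map to chunk handles, and chunk handles map to the
# chunkservers holding them. Class and field names are illustrative.

class Master:
    def __init__(self):
        self.namespace = {}   # filename -> list of chunk handles
        self.locations = {}   # chunk handle -> list of chunkserver ids
        self.op_log = []      # operation log of metadata changes

    def create_file(self, name, handles, servers):
        self.namespace[name] = list(handles)
        for h in handles:
            self.locations[h] = list(servers)
        self.op_log.append(("create", name))

    def lookup(self, name, chunk_index):
        # the master returns metadata only, never the file data itself
        h = self.namespace[name][chunk_index]
        return h, self.locations[h]

m = Master()
m.create_file("big.dat", handles=[101, 102], servers=["cs1", "cs2", "cs3"])
handle, where = m.lookup("big.dat", 0)
```

Note that lookup returns only a handle and server locations; as the next slide explains, the actual chunk data flows between client and chunkservers, never through the master.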
8. Chunk Servers working
The master server doesn't actually handle file data; it leaves that up to the chunkservers.
The chunkservers don't send chunks to the master server.
Instead, they send requested chunks directly to the client.
The GFS copies every chunk multiple times and stores it on different chunkservers.
Each copy is called a replica.
The GFS makes three replicas: one primary replica and two secondary replicas.
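A minimal sketch of the three-replica placement described above, assuming a simple pick-the-first-three policy; the real placement policy (spreading replicas across racks, balancing load) is more involved.

```python
# Sketch: place three replicas of a chunk on distinct chunkservers,
# the first acting as primary and the other two as secondaries.
# The trivial selection policy here is an illustrative assumption.

def place_replicas(servers, n=3):
    """Pick n distinct chunkservers for a chunk's replicas."""
    assert len(servers) >= n, "need at least n distinct chunkservers"
    chosen = servers[:n]
    return {"primary": chosen[0], "secondaries": chosen[1:]}

placement = place_replicas(["cs1", "cs2", "cs3", "cs4"])
```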
9. Working
When the client makes a file request,
the master server responds with the location of the primary replica of the respective chunk.
By comparing the client's IP address with those of the chunkservers, the master server chooses the chunkserver closest to
the client.
For a write, the client then sends the data to all the replicas, starting with the closest replica and
ending with the furthest one.
Once the replicas receive the data, the primary replica assigns consecutive serial
numbers to each change to the file. Changes are called mutations.
If a replica fails to apply a mutation, the master server identifies the affected replica as garbage.
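The mutation-ordering step above can be sketched as follows. The primary hands out consecutive serial numbers and every replica applies mutations strictly in that order, so all replicas end up identical. Class names are illustrative assumptions.

```python
# Sketch: the primary replica assigns consecutive serial numbers to
# mutations, and each replica applies them strictly in that order,
# keeping all copies of the chunk identical. Names are illustrative.

class Replica:
    def __init__(self):
        self.data = b""
        self.expected = 0   # next serial number this replica will accept

    def apply(self, serial, payload):
        # refuse out-of-order mutations
        assert serial == self.expected
        self.expected += 1
        self.data += payload

class Primary:
    def __init__(self, replicas):
        self.replicas = replicas
        self.next_serial = 0

    def mutate(self, payload):
        # assign the next serial number, then apply everywhere in order
        s = self.next_serial
        self.next_serial += 1
        for r in self.replicas:
            r.apply(s, payload)

reps = [Replica(), Replica(), Replica()]
p = Primary(reps)
p.mutate(b"a")
p.mutate(b"b")
```

Because every replica applies the same mutations in the same serial order, all three copies stay byte-for-byte identical.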
10. Other functions
To prevent data corruption, the GFS uses a system called checksumming.
The master server monitors chunks by looking at the checksums.
If the checksum of a replica doesn't match the checksum in the master server's memory, the
master server deletes the replica and creates a new one to replace it.
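The checksum comparison above can be sketched with a standard CRC. Using zlib.crc32 is an assumption for illustration; GFS uses its own checksum scheme.

```python
import zlib

# Sketch of checksum-based corruption detection: a replica whose stored
# bytes no longer match the expected checksum is discarded and replaced
# by a copy of a healthy replica. zlib.crc32 stands in for the real scheme.

def verify_and_repair(expected_crc, replica_data, healthy_data):
    if zlib.crc32(replica_data) != expected_crc:
        # replica is corrupt: delete it and create a new one from a good copy
        return bytes(healthy_data)
    return replica_data

good = b"chunk-bytes"
crc = zlib.crc32(good)          # checksum recorded when the chunk was written
corrupt = b"chunk-byteX"        # a single flipped byte
repaired = verify_and_repair(crc, corrupt, good)
```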
13. Introduction
Apache Hadoop is an open source software framework for storage and large
scale processing of data-sets on clusters of commodity hardware.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
It was originally developed to support distribution for the Nutch search
engine project.
Doug named the project after his son's toy elephant.
14. Why HDFS?
HDFS has many similarities with other distributed file systems, but is different in several respects :
HDFS follows Write-once-read-many model that simplifies data coherency since it relies mostly on
“batch-processing” rather than “interactive-access” by users.
Another unique attribute of HDFS is that the processing logic is placed close to the data rather than moving the
data to the application space.
Fault tolerance.
Data access via MapReduce.
Portability across heterogeneous commodity hardware and operating systems.
Scalability to reliably store and process large amounts of data.
Reduce cost by distributing data and processing across clusters of commodity personal computers.
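The "processing logic close to the data" idea above can be sketched as a tiny locality-aware scheduler. The node and block names here are invented for illustration; a real scheduler also handles replicas, load and rack topology.

```python
# Sketch of data locality: given which node stores each block, run each
# block's task on the node that already holds it, instead of shipping
# the block across the network. All names are illustrative.

block_locations = {"blk_1": "node-a", "blk_2": "node-b", "blk_3": "node-a"}

def schedule(blocks):
    """Assign each block's processing task to the node storing that block."""
    return {b: block_locations[b] for b in blocks}

plan = schedule(["blk_1", "blk_2", "blk_3"])
```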
16. Hadoop Distributed File System (HDFS) vs. Google File System (GFS)
Platform: HDFS is cross-platform; GFS runs on Linux.
Implementation: HDFS is developed in a Java environment; GFS in a C/C++ environment.
Ownership: HDFS was initially developed by Yahoo and is now an open-source framework; GFS was developed by, and is still owned by, Google.
Nodes: HDFS has a Name Node and Data Nodes; GFS has a Master node and Chunkservers.
Default block size: 128 MB in HDFS; 64 MB in GFS.
Heartbeats: the Name Node receives heartbeats from the Data Nodes; the Master node receives heartbeats from the Chunkservers.
Hardware: both use commodity hardware.
Access model: HDFS follows a "Write Once, Read Many" model; GFS allows multiple writers and multiple readers.
Deletion: in HDFS, deleted files are renamed into a particular folder and then removed via garbage collection; in GFS, deleted files are not reclaimed immediately but are renamed into a hidden namespace and deleted after three days if not in use.
Logging: HDFS maintains an Edit Log; GFS maintains an Operation Log.
Writes: in HDFS only append is possible; GFS allows random file writes.