Network and Complex Systems www.iiste.org
ISSN 2224-610X (Paper) ISSN 2225-0603 (Online)
Vol.4, No.2, 2014
A Mathematical Appraisal: Evolution of Distributed File System and Hadoop
Atul Saurabh
Assistant Professor
Babaria Institute of Technology, BITs edu Campus Varnama, Vadodara, Gujrat
Mob: +91-9574465442 E-mail: atul.saurabh@gmail.com
Swapnil M Parikh
Assistant Professor
Babaria Institute of Technology, BITs edu Campus Varnama, Vadodara, Gujrat
Mob: +91-8238046537 E-mail: swapnil.parikh@gmail.com
The research is financed by Asian Development Bank. No. 2006-A171(Sponsoring information)
Abstract
Fast-growing technology has had a great impact on human life. Many traditional systems
have either been replaced by, or now run in parallel with, their electronic counterparts. For example, the
traditional postal system has been nearly replaced by mobile phones and email. Electronic systems
provide more functionality than their traditional counterparts. Through social media, people can
communicate with each other and share their thoughts and moments of life in the form of text, images or videos. At
the same time, many research activities are under way to advance technology and knowledge, and data
from different sources are gathered in large volumes for further analysis. In short, today's world is
surrounded by large volumes of data in different forms. This creates a requirement for effective management
of these billions of terabytes of electronic data, generally called BIG DATA. Effective management
must be based on proven mathematical concepts so that the chance of failure is reduced.
This paper presents a mathematical appraisal of the evolution of distributed file storage and explains
basic solutions to some primitive problems based on probability theory.
Keywords: BIG DATA, Distributed System, DFS, Commodity Hardware, Hadoop, Hadoop File System, HDFS.
1. INTRODUCTION
We live in an electronic world with a single motto: every task, if
possible, will be carried out with an automated tool. As a result, many useful tools such as Facebook,
stock exchange software and scientific research software have been developed. These tools in turn generate large
volumes of data. For example, the New York Stock Exchange generates about one terabyte of new trade
data per day. The generated data may be structured (easy to process) or unstructured. Some
systems also produce large amounts of binary data; Facebook, for instance, hosts around one petabyte of images.
This large volume of data is addressed by a concept called Big Data.
1.1 Problems with Big Data
In the computer world it is said that every problem revolves around either time or space. Big Data is no
exception. As technology evolves, the storage capacity and transfer rate of hard disks have increased considerably.
But the rate at which data is growing creates two problems:
• How to store this growing Big Data (space problem)?
• How to process and analyze Big Data in a reasonably short amount of time (time problem)?
1.2 Studied Solutions
1.2.1 Adding hard disk in serial
The obvious solution to the storage problem is to introduce more hard disks with bigger capacity. This solves the
storage problem but introduces two more problems.
I. The more hard disks we introduce, the higher the rate of failure or data loss.
Mathematical Analysis
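Let us suppose a system of n hard disks with probabilities of failure $p_1, p_2, p_3, \ldots, p_n$. The
failure of one hard disk is independent of the others. So the probability that all n hard disks fail, $P_f$, is
$$P_f = p_1 \, p_2 \, p_3 \cdots p_n = \prod_{i=1}^{n} p_i \qquad (1)$$
Since
$$0 \le p_i \le 1, \quad i = 1, 2, \ldots, n \qquad (2)$$
So,
$$P_f \le p_i, \quad i = 1, 2, \ldots, n \qquad (3)$$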
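Here the probability of all hard disks failing together is decreased. But a system of hard disks arranged in
series fails if any one of the hard disks fails. So the probability that the system fails is
$$P_{fss} = 1 - P(\text{no hard disk fails}) \qquad (4)$$
where $P_{fss}$ denotes the probability of failure of the system when the disks are arranged in series. Hence
the probability of a system failure is
$$P_{fss} = 1 - \prod_{i=1}^{n} (1 - p_i) \qquad (5)$$
Each factor $(1 - p_i)$ is at most 1, so every additional hard disk can only increase $P_{fss}$: if we increase
the number of hard disks in a system, the overall probability of system failure also increases.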
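The growth of the failure probability in the series arrangement can be checked numerically. The following is a minimal sketch, assuming independent failures and the standard formula for "at least one of n independent events occurs". The function name and the per-disk failure probability of 0.02 are illustrative assumptions, not values from the analysis.

```python
from math import prod


def series_failure_probability(failure_probs):
    """Probability that at least one disk fails, assuming independent failures.

    failure_probs: iterable of per-disk failure probabilities p_i in [0, 1].
    """
    probs = list(failure_probs)
    if any(p < 0.0 or p > 1.0 for p in probs):
        raise ValueError("each probability must lie in [0, 1]")
    # The series system survives only if every single disk survives.
    return 1.0 - prod(1.0 - p for p in probs)


if __name__ == "__main__":
    p = 0.02  # illustrative per-disk failure probability (assumption)
    for n in (1, 2, 5, 10, 50):
        value = series_failure_probability([p] * n)
        print(f"n={n:2d}  P(system fails)={value:.4f}")
```

Running the script prints one line per disk count, which makes the upward trend visible without any algebra.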
II. The bigger the capacity of the hard disk, the longer it takes to retrieve data from the
disk.
Mathematical Analysis
Suppose the maximum seek time of the i-th hard disk is $S_i$ and the probability that the data is available on
the i-th disk is $p_{ai}$.
So, the total time $T_s$ to seek data in a serial system of n hard disks is given by equation (6).
Thus the overall seek time increases considerably with every added disk, and adding more hard disks in series appears impractical.
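One way to make this growth concrete is a small model. The sketch below assumes that the disks are searched one after another in index order, that an unsuccessful search of a disk costs its full maximum seek time, and that the availability probabilities sum to 1. These modelling choices, the function name and the sample numbers are assumptions made for illustration; they are not necessarily the expression the authors use for the serial seek time.

```python
from itertools import accumulate


def expected_serial_seek_time(seek_times, availability):
    """Expected seek time when disks are searched sequentially in index order.

    seek_times:   maximum seek time S_i of each disk.
    availability: probability p_ai that the data lives on disk i.
    """
    if len(seek_times) != len(availability):
        raise ValueError("seek_times and availability must have equal length")
    # Reaching disk i costs S_1 + ... + S_i under the sequential-search assumption.
    cumulative_cost = accumulate(seek_times)
    return sum(pa * cost for pa, cost in zip(availability, cumulative_cost))


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        # Identical disks (10 ms each) and uniformly placed data: illustrative only.
        t = expected_serial_seek_time([10.0] * n, [1.0 / n] * n)
        print(f"n={n}  expected seek time={t:.1f} ms")
```

Under these assumptions the expected time grows roughly linearly with the number of disks, which matches the qualitative claim in the text.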
Even though the mathematical analysis says that adding more hard disks in series degrades the overall performance
of the system, it offers one ray of hope.
The probability from equation (1) that all hard disks fail together is no greater than any individual failure
probability [see equation (3)]. So adding more hard disks also brings good news: the probability of losing
every disk at once decreases.
1.2.2 Adding hard disk in parallel
We can take advantage of the above result [equation (3)]. If the hard disks are added in parallel,
the failure of one hard disk does not bring down the overall system. In that case the system does
not fail because of any single hard disk failure. The system fails if and only if all
the hard disks fail, and the probability of failure of the system becomes
$$P_{fsp} = \prod_{i=1}^{n} p_i \qquad (7)$$
where $P_{fsp}$ is the probability of system failure when the hard disks are added in parallel.
Comparing equation (4) and equation (7), we can conclude that the overall chance of system failure
decreases when the disks are added in parallel.
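The parallel case can be checked numerically in the same spirit. This sketch assumes independent failures and treats the system as lost only when every disk is lost, as described above. The function name and the value 0.02 are illustrative assumptions.

```python
from math import prod


def parallel_failure_probability(failure_probs):
    """Probability that every disk fails, assuming independent failures."""
    probs = list(failure_probs)
    if not probs:
        raise ValueError("at least one disk is required")
    if any(p < 0.0 or p > 1.0 for p in probs):
        raise ValueError("each probability must lie in [0, 1]")
    # The parallel system is lost only if all disks are lost together.
    return prod(probs)


if __name__ == "__main__":
    p = 0.02  # illustrative per-disk failure probability (assumption)
    for n in (1, 2, 3, 4):
        value = parallel_failure_probability([p] * n)
        print(f"n={n}  P(system fails)={value:.2e}")
```

Each added disk multiplies the failure probability by another factor no larger than 1, so the printed values shrink as n grows.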
Adding hard disks in parallel certainly improves reliability in terms of disk failure, but it does not help
with the time factor. To address this problem we need to analyze how a file is searched for on a single hard disk.
The file is stored as follows:
The actual content of the file is stored in data blocks, and some basic information called metadata is stored
in the inode and directory. So, whenever a search is made, the metadata is searched first instead of the
data blocks.
We can remove the metadata part from each hard disk and store it on one hard disk dedicated to
metadata storage only. In this situation we need to search only one hard disk to locate the data.
The seek time then becomes the expression given as equation (8).
Hence the seek time is effectively reduced [compare equation (6) with equation (8)].
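As a rough illustration, suppose a lookup costs one seek on the dedicated metadata disk plus one seek on the single disk that actually holds the data. The sketch below computes the expected seek time under that assumption. The metadata seek time parameter, the function name and the sample numbers are hypothetical and are not taken from the paper.

```python
def expected_seek_time_with_metadata(metadata_seek, seek_times, availability):
    """Expected seek time with a dedicated metadata disk.

    metadata_seek: seek time on the metadata disk (hypothetical parameter).
    seek_times:    maximum seek time S_i of each data disk.
    availability:  probability p_ai that the data lives on disk i.
    """
    if len(seek_times) != len(availability):
        raise ValueError("seek_times and availability must have equal length")
    # After the metadata lookup, only the disk holding the data is visited.
    data_seek = sum(pa * s for pa, s in zip(availability, seek_times))
    return metadata_seek + data_seek


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        # Identical disks (10 ms each), 2 ms metadata lookup: illustrative only.
        t = expected_seek_time_with_metadata(2.0, [10.0] * n, [1.0 / n] * n)
        print(f"n={n}  expected seek time={t:.1f} ms")
```

With identical disks the result stays close to one metadata seek plus one data seek regardless of how many data disks are present, which is the benefit the text describes.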
2. General Architecture
From the above discussion a tree-based architecture can be proposed [see Figure 1]. The root of the tree
keeps the metadata and the other nodes keep the actual data. This architecture reduces the seek time and
also has a major impact on system failure: as far as the data nodes are concerned, the system now fails only if all the children fail.
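The root-and-children idea can be sketched as a small in-memory model. The class names, method names and the "least-loaded child" placement policy below are hypothetical choices made for illustration; the text does not prescribe a placement policy.

```python
class ChildNode:
    """Leaf of the tree: stores file contents only."""

    def __init__(self, name):
        self.name = name
        self._files = {}

    def file_count(self):
        return len(self._files)

    def write(self, filename, data):
        self._files[filename] = data

    def read(self, filename):
        return self._files[filename]


class RootNode:
    """Root of the tree: stores metadata only (file name -> child node)."""

    def __init__(self, children):
        self._children = list(children)
        if not self._children:
            raise ValueError("at least one child node is required")
        self._location = {}

    def put(self, filename, data):
        # Placeholder placement policy (assumption): reuse the existing
        # location on overwrite, otherwise pick the least-loaded child.
        child = self._location.get(filename)
        if child is None:
            child = min(self._children, key=ChildNode.file_count)
            self._location[filename] = child
        child.write(filename, data)

    def locate(self, filename):
        """Metadata lookup: name of the child that holds the file."""
        return self._location[filename].name

    def get(self, filename):
        # One metadata lookup at the root, then one read on a single child.
        return self._location[filename].read(filename)
```

A read touches the root once and exactly one child, so the number of children does not lengthen the search path.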
2.1 Pros and Cons
• The root keeps all the metadata.
• Since the metadata is small, large storage is not required at the root level.
• Data searching should be faster. A fast algorithm is required to store and look up metadata at the root
node.
• The processing power of the root must be high, as data can be requested by many users
simultaneously.
• If the root node fails, all the data is lost.
• Since the child nodes only need to store data, their storage capacity must be high.
• The child is only involved in storing and transferring the data.
• The block size of the child hard disk must be large enough that storing and retrieving
data does not require traversing a large number of blocks. This makes the traversal
time smaller.
Figure 1: Hierarchical Structure of Hard disk
3. Concept of Distributed File System
The mathematical analysis suggests adopting a clear architecture in which data is not
stored on a single hard disk. The data is stored or spread over a range of hard disks, possibly
connected through a network. This architecture solves most of the problems related to data
storage, and it also creates the need for a system that manages the storage.
The DFS addresses the following tasks:
• Deciding on which node data will be stored.
• What to do when the data storage fails?
• How to recover data if the concerned node crashes?
• How to store metadata for fast retrieval?
4. Concept of Hadoop
This section covers the architectural components of Hadoop. Hadoop uses a master/slave architecture
for both distributed storage and distributed computation. The distributed storage system is called the Hadoop
Distributed File System (HDFS), and distributed computation is done with MapReduce. So, Hadoop has
two major components.
• Hadoop Distributed File System (HDFS) (Storage).
• MapReduce (Processing)
Figure 2: Hadoop File System Architecture
4.1 Hadoop File System
HDFS is a file system designed for storing very large files with a streaming data access pattern,
running on clusters of commodity hardware. HDFS has the following components:
1. NameNode
It is the master of the system, which maintains and manages the blocks present on the
DataNodes. It keeps track of how files are broken down into file blocks, which
nodes store those blocks, and the overall health of the distributed file system.
It is a single point of failure for a Hadoop cluster.
2. DataNodes
They are slave nodes which are deployed on each machine and provide the actual storage.
Definition 1: The file system which manages the storage of files across a network of
machines is called a Distributed File System (DFS).
3. Secondary NameNode
It is an assistant process which monitors the state of the HDFS cluster and periodically checkpoints the
NameNode's metadata. As discussed earlier, the NameNode is a single point of failure for a Hadoop cluster;
the Secondary NameNode helps to minimize the downtime and loss of data.
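The division of labour between the NameNode and the DataNodes can be sketched with a toy model. This is not the HDFS API: the class and function names, the tiny block size of 4 bytes and the round-robin placement are all assumptions made for illustration. The model only mirrors the idea that the master tracks which node stores which block while the slaves hold the bytes.

```python
class DataNode:
    """Toy slave node: holds block contents keyed by block id."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}


class NameNode:
    """Toy master node: holds only the file -> [(block id, node id)] map."""

    def __init__(self, datanodes):
        self.datanodes = {node.node_id: node for node in datanodes}
        if not self.datanodes:
            raise ValueError("at least one DataNode is required")
        self.block_map = {}

    def allocate(self, filename, num_blocks):
        # Round-robin placement is a placeholder policy (assumption).
        node_ids = list(self.datanodes)
        plan = [(f"{filename}#{i}", node_ids[i % len(node_ids)])
                for i in range(num_blocks)]
        self.block_map[filename] = plan
        return plan


def write_file(namenode, filename, data, block_size=4):
    """Client side: ask the NameNode for a plan, then send blocks to DataNodes."""
    num_blocks = (len(data) + block_size - 1) // block_size
    plan = namenode.allocate(filename, num_blocks)
    for i, (block_id, node_id) in enumerate(plan):
        chunk = data[i * block_size:(i + 1) * block_size]
        namenode.datanodes[node_id].blocks[block_id] = chunk


def read_file(namenode, filename):
    """Client side: look up block locations, then fetch blocks in order."""
    return b"".join(namenode.datanodes[node_id].blocks[block_id]
                    for block_id, node_id in namenode.block_map[filename])
```

The block map is the only state the toy NameNode owns, which is why losing it makes every block unreachable even though the bytes still sit on the DataNodes.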
4.2 MapReduce
The beauty of the Hadoop system lies in MapReduce. It is a programming model for processing large data
sets with a parallel, distributed algorithm on a cluster.
The term MapReduce refers to two distinct tasks: Map and Reduce. The first is the map
job, which takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and
combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce
implies, the reduce job is always performed after the map job.
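The map and reduce steps can be illustrated with the classic word-count example, written here in plain Python on a single machine rather than with the Hadoop API. The function names are hypothetical, and the intermediate grouping step (often called shuffle) is an assumption added to connect the two phases.

```python
from collections import defaultdict


def map_phase(document):
    """Map: break a document into (word, 1) key/value pairs."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Group all values that share the same key (assumed intermediate step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single tuple."""
    return (key, sum(values))


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

Each call to `map_phase` is independent of the others, and each key is reduced independently, which is what allows a cluster to run both phases in parallel.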
The master node has a Job Tracker and each slave has a Task Tracker; together they are responsible for running
MapReduce. The Job Tracker daemon keeps track of a job whose tasks may be assigned to multiple nodes. If the task at
any node fails, the Job Tracker reschedules that task on another node. The Task Tracker, in turn, is responsible
for keeping track of the tasks assigned to its individual node.
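The rescheduling behaviour can be sketched as a small simulation. This is a toy model, not Hadoop's scheduler: the function name, the round-robin first choice and the way failures are represented (a fixed set of failing node names) are assumptions made for illustration.

```python
def run_job(tasks, nodes, failing_nodes=frozenset()):
    """Assign each task to a node and reschedule it when that node fails.

    Returns a dict mapping each task to the node that completed it.
    """
    nodes = list(nodes)
    if not nodes:
        raise ValueError("at least one node is required")
    completed = {}
    for index, task in enumerate(tasks):
        start = index % len(nodes)
        # First choice is round-robin; on failure the remaining nodes are tried in order.
        candidates = nodes[start:] + nodes[:start]
        for node in candidates:
            if node not in failing_nodes:
                completed[task] = node
                break
        else:
            raise RuntimeError(f"no healthy node left for task {task!r}")
    return completed


if __name__ == "__main__":
    result = run_job(["t0", "t1", "t2"], ["n0", "n1", "n2"], failing_nodes={"n1"})
    for task, node in result.items():
        print(f"{task} completed on {node}")
```

In the demo, the task first sent to the failing node is moved to the next healthy node, while the other tasks stay where they were first placed.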
Figure 3: The Hadoop System
4.3 Features
• Hadoop is designed to operate on very large files, typically in the range of
megabytes, gigabytes, terabytes or petabytes.
• Data is accessed in the form of a byte stream.
• Hadoop does not require expensive hardware. Commonly available hardware is
sufficient.
4.4 Drawbacks
• Hadoop provides high data throughput at the cost of latency, and hence applications that need
low-latency data access will not work well with Hadoop.
• Hadoop is not designed to work with small files. In the Hadoop architecture the metadata is
separated and stored on the NameNode, which keeps it in memory. So the number of files that can be
stored in the Hadoop file system does not depend on the total capacity of the available hard disks. It
depends only on the capacity of the node that holds the metadata.
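The limit on the number of files can be estimated with simple arithmetic. The sketch below assumes the commonly quoted rule of thumb that each file, directory or block costs roughly 150 bytes of metadata, and that a small file occupies one file entry plus one block entry. Both figures and the function name are assumptions for illustration, not measurements.

```python
def max_small_files(metadata_capacity_bytes, bytes_per_object=150, objects_per_file=2):
    """Rough upper bound on the number of small files the metadata store can track.

    bytes_per_object and objects_per_file are rule-of-thumb assumptions.
    """
    if metadata_capacity_bytes < 0 or bytes_per_object <= 0 or objects_per_file <= 0:
        raise ValueError("capacity must be non-negative and sizes must be positive")
    return metadata_capacity_bytes // (bytes_per_object * objects_per_file)


if __name__ == "__main__":
    for gigabytes in (1, 16, 64):
        capacity = gigabytes * 10**9
        print(f"{gigabytes:2d} GB of metadata capacity -> about "
              f"{max_small_files(capacity):,} small files")
```

The estimate depends only on the metadata capacity and not on how much disk space the DataNodes provide, which is the point made in the drawback above.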
5. Conclusion
The classical centralized system was a great solution to the problem of uniform access to
information that was otherwise scattered over different network nodes. The information is stored on a set of
hard disks connected to a single large mainframe machine. But it failed to solve many complex problems, such as
Big Data and frequent failure of the central system. Nowadays the paradigm is shifting from centralized
systems to distributed systems. The distributed system provides a better solution in those areas
where the centralized system failed. But this shift is not sudden. There is a complete mathematical
result which paves the way for it. Probability theory helps us to formalize the basic tree-like
architecture of a distributed file system when we try to solve a problem like Big Data. The Hadoop
system, currently the most widely accepted system in the distributed area, follows the
same basic architecture with a slight modification. The modification addresses failure of the
root of the file system. The Hadoop system not only solves the Big Data problem but also provides an
API for solving problems in parallel and monitoring their progress.

Contenu connexe

Tendances

Art Of Distributed P0
Art Of Distributed P0Art Of Distributed P0
Art Of Distributed P0
George Ang
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Simplilearn
 
Cloud Data De Duplication in Multiuser Environment DeposM2
Cloud Data De Duplication in Multiuser Environment DeposM2Cloud Data De Duplication in Multiuser Environment DeposM2
Cloud Data De Duplication in Multiuser Environment DeposM2
ijtsrd
 
NoSql And The Semantic Web
NoSql And The Semantic WebNoSql And The Semantic Web
NoSql And The Semantic Web
Irina Hutanu
 
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and HadoopIOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
Leons Petražickis
 
Grid computing dis
Grid computing disGrid computing dis
Grid computing dis
gopishna09
 

Tendances (17)

Art Of Distributed P0
Art Of Distributed P0Art Of Distributed P0
Art Of Distributed P0
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
 
hadoop
hadoophadoop
hadoop
 
Cloud Data De Duplication in Multiuser Environment DeposM2
Cloud Data De Duplication in Multiuser Environment DeposM2Cloud Data De Duplication in Multiuser Environment DeposM2
Cloud Data De Duplication in Multiuser Environment DeposM2
 
NoSql And The Semantic Web
NoSql And The Semantic WebNoSql And The Semantic Web
NoSql And The Semantic Web
 
Deep semantic understanding
Deep semantic understandingDeep semantic understanding
Deep semantic understanding
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
 
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and HadoopIOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
 
Deduplication - Remove Duplicate
Deduplication - Remove DuplicateDeduplication - Remove Duplicate
Deduplication - Remove Duplicate
 
Bigdata and Hadoop Introduction
Bigdata and Hadoop IntroductionBigdata and Hadoop Introduction
Bigdata and Hadoop Introduction
 
Study on potential capabilities of a nodb system
Study on potential capabilities of a nodb systemStudy on potential capabilities of a nodb system
Study on potential capabilities of a nodb system
 
Grid computing dis
Grid computing disGrid computing dis
Grid computing dis
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
Hadoop
HadoopHadoop
Hadoop
 
A basic course on Research data management, part 3: sharing your data
A basic course on Research data management, part 3: sharing your dataA basic course on Research data management, part 3: sharing your data
A basic course on Research data management, part 3: sharing your data
 
IRJET- An Integrity Auditing &Data Dedupe withEffective Bandwidth in Cloud St...
IRJET- An Integrity Auditing &Data Dedupe withEffective Bandwidth in Cloud St...IRJET- An Integrity Auditing &Data Dedupe withEffective Bandwidth in Cloud St...
IRJET- An Integrity Auditing &Data Dedupe withEffective Bandwidth in Cloud St...
 
Basics of IR: Web Information Systems class
Basics of IR: Web Information Systems class Basics of IR: Web Information Systems class
Basics of IR: Web Information Systems class
 

En vedette

A moving average analysis of the age distribution and
A moving average analysis of the age distribution andA moving average analysis of the age distribution and
A moving average analysis of the age distribution and
Alexander Decker
 
September 2010-President's List
September 2010-President's ListSeptember 2010-President's List
September 2010-President's List
Lesley Johnson
 
The Digital Garage Certification
The Digital Garage CertificationThe Digital Garage Certification
The Digital Garage Certification
Stuart Seager
 

En vedette (11)

Inside mml2wav.rb
Inside mml2wav.rbInside mml2wav.rb
Inside mml2wav.rb
 
A moving average analysis of the age distribution and
A moving average analysis of the age distribution andA moving average analysis of the age distribution and
A moving average analysis of the age distribution and
 
Gresume0716
Gresume0716Gresume0716
Gresume0716
 
A440
A440A440
A440
 
Elixir紹介
Elixir紹介Elixir紹介
Elixir紹介
 
Who’d be a parent - your role in drug and alcohol prevention
Who’d be a parent - your role in drug and alcohol prevention Who’d be a parent - your role in drug and alcohol prevention
Who’d be a parent - your role in drug and alcohol prevention
 
أخطاء و ضلالات سيد قطب بالكتاب والصفحة و رد العلماء عليه
 أخطاء و ضلالات سيد قطب بالكتاب والصفحة و رد العلماء عليه  أخطاء و ضلالات سيد قطب بالكتاب والصفحة و رد العلماء عليه
أخطاء و ضلالات سيد قطب بالكتاب والصفحة و رد العلماء عليه
 
September 2010-President's List
September 2010-President's ListSeptember 2010-President's List
September 2010-President's List
 
POSIX Threads
POSIX ThreadsPOSIX Threads
POSIX Threads
 
The Digital Garage Certification
The Digital Garage CertificationThe Digital Garage Certification
The Digital Garage Certification
 
学生、教職員が一線に並んで交流できる 「英語でビブリオバトル」
学生、教職員が一線に並んで交流できる 「英語でビブリオバトル」学生、教職員が一線に並んで交流できる 「英語でビブリオバトル」
学生、教職員が一線に並んで交流できる 「英語でビブリオバトル」
 

Similaire à A mathematical appraisal

Fota Delta Size Reduction Using FIle Similarity Algorithms
Fota Delta Size Reduction Using FIle Similarity AlgorithmsFota Delta Size Reduction Using FIle Similarity Algorithms
Fota Delta Size Reduction Using FIle Similarity Algorithms
Shivansh Gaur
 

Similaire à A mathematical appraisal (20)

Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencing
 
Data Deduplication: Venti and its improvements
Data Deduplication: Venti and its improvementsData Deduplication: Venti and its improvements
Data Deduplication: Venti and its improvements
 
Next-generation sequencing: Data mangement
Next-generation sequencing: Data mangementNext-generation sequencing: Data mangement
Next-generation sequencing: Data mangement
 
Seminar presentation
Seminar presentationSeminar presentation
Seminar presentation
 
Srinivasan2-10-12
Srinivasan2-10-12Srinivasan2-10-12
Srinivasan2-10-12
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational Science
 
4 026
4 0264 026
4 026
 
data analytics lecture 3.2.ppt
data analytics lecture 3.2.pptdata analytics lecture 3.2.ppt
data analytics lecture 3.2.ppt
 
NoSQL Basics - A Quick Tour
NoSQL Basics - A Quick TourNoSQL Basics - A Quick Tour
NoSQL Basics - A Quick Tour
 
Big Data Unit 4 - Hadoop
Big Data Unit 4 - HadoopBig Data Unit 4 - Hadoop
Big Data Unit 4 - Hadoop
 
Nov 2010 HUG: Fuzzy Table - B.A.H
Nov 2010 HUG: Fuzzy Table - B.A.HNov 2010 HUG: Fuzzy Table - B.A.H
Nov 2010 HUG: Fuzzy Table - B.A.H
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Hadoop
HadoopHadoop
Hadoop
 
17-NoSQL.pptx
17-NoSQL.pptx17-NoSQL.pptx
17-NoSQL.pptx
 
Performance Improvement of Heterogeneous Hadoop Cluster using Ranking Algorithm
Performance Improvement of Heterogeneous Hadoop Cluster using Ranking AlgorithmPerformance Improvement of Heterogeneous Hadoop Cluster using Ranking Algorithm
Performance Improvement of Heterogeneous Hadoop Cluster using Ranking Algorithm
 
Bodleian Library's DAMS system
Bodleian Library's DAMS systemBodleian Library's DAMS system
Bodleian Library's DAMS system
 
Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with Spark
 
Fota Delta Size Reduction Using FIle Similarity Algorithms
Fota Delta Size Reduction Using FIle Similarity AlgorithmsFota Delta Size Reduction Using FIle Similarity Algorithms
Fota Delta Size Reduction Using FIle Similarity Algorithms
 
Waters Grid & HPC Course
Waters Grid & HPC CourseWaters Grid & HPC Course
Waters Grid & HPC Course
 

Plus de Alexander Decker

Abnormalities of hormones and inflammatory cytokines in women affected with p...
Abnormalities of hormones and inflammatory cytokines in women affected with p...Abnormalities of hormones and inflammatory cytokines in women affected with p...
Abnormalities of hormones and inflammatory cytokines in women affected with p...
Alexander Decker
 
A usability evaluation framework for b2 c e commerce websites
A usability evaluation framework for b2 c e commerce websitesA usability evaluation framework for b2 c e commerce websites
A usability evaluation framework for b2 c e commerce websites
Alexander Decker
 
A universal model for managing the marketing executives in nigerian banks
A universal model for managing the marketing executives in nigerian banksA universal model for managing the marketing executives in nigerian banks
A universal model for managing the marketing executives in nigerian banks
Alexander Decker
 
A unique common fixed point theorems in generalized d
A unique common fixed point theorems in generalized dA unique common fixed point theorems in generalized d
A unique common fixed point theorems in generalized d
Alexander Decker
 
A trends of salmonella and antibiotic resistance
A trends of salmonella and antibiotic resistanceA trends of salmonella and antibiotic resistance
A trends of salmonella and antibiotic resistance
Alexander Decker
 
A transformational generative approach towards understanding al-istifham
A transformational  generative approach towards understanding al-istifhamA transformational  generative approach towards understanding al-istifham
A transformational generative approach towards understanding al-istifham
Alexander Decker
 
A time series analysis of the determinants of savings in namibia
A time series analysis of the determinants of savings in namibiaA time series analysis of the determinants of savings in namibia
A time series analysis of the determinants of savings in namibia
Alexander Decker
 
A therapy for physical and mental fitness of school children
A therapy for physical and mental fitness of school childrenA therapy for physical and mental fitness of school children
A therapy for physical and mental fitness of school children
Alexander Decker
 
A theory of efficiency for managing the marketing executives in nigerian banks
A theory of efficiency for managing the marketing executives in nigerian banksA theory of efficiency for managing the marketing executives in nigerian banks
A theory of efficiency for managing the marketing executives in nigerian banks
Alexander Decker
 
A systematic evaluation of link budget for
A systematic evaluation of link budget forA systematic evaluation of link budget for
A systematic evaluation of link budget for
Alexander Decker
 
A synthetic review of contraceptive supplies in punjab
A synthetic review of contraceptive supplies in punjabA synthetic review of contraceptive supplies in punjab
A synthetic review of contraceptive supplies in punjab
Alexander Decker
 
A synthesis of taylor’s and fayol’s management approaches for managing market...
A synthesis of taylor’s and fayol’s management approaches for managing market...A synthesis of taylor’s and fayol’s management approaches for managing market...
A synthesis of taylor’s and fayol’s management approaches for managing market...
Alexander Decker
 
A survey paper on sequence pattern mining with incremental
A survey paper on sequence pattern mining with incrementalA survey paper on sequence pattern mining with incremental
A survey paper on sequence pattern mining with incremental
Alexander Decker
 
A survey on live virtual machine migrations and its techniques
A survey on live virtual machine migrations and its techniquesA survey on live virtual machine migrations and its techniques
A survey on live virtual machine migrations and its techniques
Alexander Decker
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo db
Alexander Decker
 
A survey on challenges to the media cloud
A survey on challenges to the media cloudA survey on challenges to the media cloud
A survey on challenges to the media cloud
Alexander Decker
 
A survey of provenance leveraged
A survey of provenance leveragedA survey of provenance leveraged
A survey of provenance leveraged
Alexander Decker
 
A survey of private equity investments in kenya
A survey of private equity investments in kenyaA survey of private equity investments in kenya
A survey of private equity investments in kenya
Alexander Decker
 
A study to measures the financial health of
A study to measures the financial health ofA study to measures the financial health of
A study to measures the financial health of
Alexander Decker
 

Plus de Alexander Decker (20)

Abnormalities of hormones and inflammatory cytokines in women affected with p...
Abnormalities of hormones and inflammatory cytokines in women affected with p...Abnormalities of hormones and inflammatory cytokines in women affected with p...
Abnormalities of hormones and inflammatory cytokines in women affected with p...
 
A validation of the adverse childhood experiences scale in
A validation of the adverse childhood experiences scale inA validation of the adverse childhood experiences scale in
A validation of the adverse childhood experiences scale in
 
A usability evaluation framework for b2 c e commerce websites
A usability evaluation framework for b2 c e commerce websitesA usability evaluation framework for b2 c e commerce websites
A usability evaluation framework for b2 c e commerce websites
 
A universal model for managing the marketing executives in nigerian banks
A universal model for managing the marketing executives in nigerian banksA universal model for managing the marketing executives in nigerian banks
A universal model for managing the marketing executives in nigerian banks
 
A unique common fixed point theorems in generalized d
A unique common fixed point theorems in generalized dA unique common fixed point theorems in generalized d
A unique common fixed point theorems in generalized d
 
A trends of salmonella and antibiotic resistance
A trends of salmonella and antibiotic resistanceA trends of salmonella and antibiotic resistance
A trends of salmonella and antibiotic resistance
 
A transformational generative approach towards understanding al-istifham
A transformational  generative approach towards understanding al-istifhamA transformational  generative approach towards understanding al-istifham
A transformational generative approach towards understanding al-istifham
 

A mathematical appraisal

Network and Complex Systems, www.iiste.org
ISSN 2224-610X (Paper), ISSN 2225-0603 (Online)
Vol.4, No.2, 2014

A Mathematical Appraisal: Evolution of Distributed File System and Hadoop

Atul Saurabh
Assistant Professor
Babaria Institute of Technology, BITs edu Campus Varnama, Vadodara, Gujarat
Mob: +91-9574465442 E-mail: atul.saurabh@gmail.com

Swapnil M Parikh
Assistant Professor
Babaria Institute of Technology, BITs edu Campus Varnama, Vadodara, Gujarat
Mob: +91-8238046537 E-mail: swapnil.parikh@gmail.com

The research is financed by Asian Development Bank. No. 2006-A171 (Sponsoring information)

Abstract
Fast-growing technology has had a great impact on human life. Many traditional systems have either been replaced or are running in parallel with their electronic counterparts. For example, the traditional postal system is now nearly replaced by mobile phones and emails. Electronic systems provide more functionality than their traditional counterparts. Thanks to social media, people can communicate with each other and share their thoughts and moments of life in the form of texts, images or videos. On the other hand, to advance technology and knowledge, many research activities are under way, and data from different sources is gathered in large volumes for further analysis. In short, today's world is surrounded by a large volume of data in different forms. This creates a requirement for effective management of these billions of terabytes of electronic data, generally called BIG DATA. The effective management must be based on proven mathematical concepts so that the chance of casualties may be reduced. This paper presents a mathematical appraisal of the evolution of the distribution of file data and explains some basic solutions to primitive problems based on probability theory.

Keywords: BIG DATA, Distributed System, DFS, Commodity Hardware, Hadoop, Hadoop File System, HDFS.

1. INTRODUCTION
We are living in the electronic world.
The electronic world lives by a single motto: every task, if possible, will be achieved using some automated tool. In this course many useful tools, such as Facebook, stock exchange software and scientific research software, have been developed. These tools in turn generate a large volume of data. For example, the New York Stock Exchange generates about one terabyte of new trade data per day. The generated data may be either structured (easy to process) or unstructured. Some systems also produce a large amount of binary data; Facebook, for instance, hosts around one petabyte of images. This large volume of data is addressed by a concept called Big Data.

1.1 Problems with Big Data
In the computer world it is said that problems revolve around either time or space. Big Data is no exception. As technologies evolve, the storage capacity and transfer rate of hard disks have increased considerably. But the rate at which the data is growing creates two problems:
• How to store this growing Big Data (space problem)?
• How to process and analyze Big Data in a significantly small amount of time (time problem)?

1.2 Studied Solutions
1.2.1 Adding hard disk in serial
The obvious solution to the storage problem is to introduce more hard disks with bigger capacity. This will solve the
storage problem but introduces two more problems.

I. The more hard disks we introduce, the more we increase the rate of failure or data loss.

Mathematical Analysis
Let us suppose a system of n hard disks with probabilities of failure $p_1, p_2, p_3, \dots, p_n$. The failure of a hard disk is independent of the others. So the chance that all the hard disks fail, $P_f$, is:

$$P_f = p_1 p_2 \cdots p_n = \prod_{i=1}^{n} p_i \qquad (1)$$

Since

$$0 \le p_i \le 1, \quad i = 1, \dots, n \qquad (2)$$

so

$$P_f \le p_i, \quad i = 1, \dots, n \qquad (3)$$

Here the probability of the hard disks failing is decreased. But the system (of hard disks) will fail if any one of the hard disks fails. So the probability that the system fails is

$$P_{fss} = P(F_1 \cup F_2 \cup \dots \cup F_n) \qquad (4)$$

where $P_{fss}$ indicates the probability of failure of the system when the disks are arranged in series and $F_i$ denotes the event that the ith disk fails. Hence the probability of a system failure is

$$P_{fss} = 1 - \prod_{i=1}^{n} (1 - p_i) \qquad (5)$$

If we increase the number of hard disks in a system, the overall probability of system failure also increases.

II. The bigger the capacity of the hard disk, the longer it takes to retrieve data from the disk.

Mathematical Analysis
Suppose the maximum seek time of the ith hard disk is $S_i$ and the probability that the data is available on the ith disk is $p_{a_i}$. So the total time $T_s$ to seek data in a serial system of n hard disks is

$$T_s = \sum_{i=1}^{n} p_{a_i} S_i \qquad (6)$$

Thus the overall seek time increases considerably. So it seems that adding more hard disks is impractical.

Even though the mathematical analysis says that adding more hard disks degrades the overall performance of the system, it indicates one ray of hope. The total probability of hard disk failure from equation (1) is less than the individual one [see equation (3)]. So adding more hard disks brings good news: the probability that all the hard disks fail together decreases.

1.2.2 Adding hard disk in parallel
We can take advantage of the above result [equation (3)]. If the hard disks are added in parallel, then the failure of one hard disk will not affect the overall system.
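The failure probabilities for the two arrangements can be checked numerically. The following is a minimal sketch, assuming independent disk failures as stated in the analysis; the function names and the per-disk failure probability of 0.05 are illustrative assumptions, not values from the paper.

```python
from math import prod


def all_disks_fail(p):
    """Probability that every disk fails (system failure in the parallel case)."""
    return prod(p)


def any_disk_fails(p):
    """Probability that at least one disk fails (system failure in the series case)."""
    return 1.0 - prod(1.0 - p_i for p_i in p)


# Hypothetical per-disk failure probability, chosen only for illustration.
for n in (1, 2, 4, 8):
    p = [0.05] * n
    print(n, all_disks_fail(p), any_disk_fails(p))
```

As n grows, the series value moves toward 1 while the value for all disks failing together moves toward 0, which is the contrast the two arrangements rely on.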
In that case the probability of system failure will not depend on the failure of any individual hard disk. The system will fail if and only if all the hard disks fail, and the probability of failure of the system becomes:

$$P_{fsp} = p_1 p_2 \cdots p_n = \prod_{i=1}^{n} p_i \qquad (7)$$

where $P_{fsp}$ is the probability of system failure when the hard disks are added in parallel. Comparing equation (4) and equation (7), we can conclude that the overall chance of system failure decreases when the disks are added in parallel.

Adding hard disks in parallel certainly improves the performance in terms of disk failure but does not help with the time factor. To address this problem we need to analyze how a file is searched for on a single hard disk. A file is stored as follows: the actual content of the file is stored in data blocks, and some basic information called metadata is stored in the inode and the directory. So, whenever a search is made, the metadata is searched first instead of the data blocks. We can remove the metadata from each hard disk and store it on one hard disk that is dedicated to metadata storage only. In this situation we need to search only one hard disk to locate the data. So the seek time becomes

$$T_s = S_m \qquad (8)$$

where m is the index of the hard disk that holds the metadata. Hence the seek time is effectively reduced [compare equation (6) to equation (8)].

2. General Architecture
From the above discussion a tree-based architecture can be proposed [see Figure 1]. The root of the tree keeps the metadata and the other nodes keep the actual data. This architecture reduces the seek time and also has a major impact on system failure: the system now fails only if all the children fail.

2.2 Pros and Cons
• The root keeps all the metadata.
• Since the size of the metadata is small, big storage is not required at the root level.
• Data searching should be fast, so an efficient algorithm is required for storing data at the root node.
• The processing power of the root must be high, as data can be requested by many users simultaneously.
• If the root node fails, all the data is lost.
• Since the child nodes are only required to store data, their storage capacity must be high.
• The child nodes are involved only in storing and transferring the data.
• The block size of the child hard disks must be large enough that storing and retrieving data does not require traversing a large number of blocks. This keeps the traversal time small.

Figure 1: Hierarchical Structure of Hard disk

3. Concept of Distributed File System
The mathematical analysis suggests adopting a very clear architecture in which data is not stored on a single hard disk. The data is stored or spread over a range of hard disks, possibly connected through the network. This architecture solves most of the problems related to data storage and also creates the requirement for a system that manages the storage.
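The tree-based arrangement, with metadata at the root and data spread over the children, can be sketched in a few lines. This is a toy, in-memory illustration under stated assumptions: the class names `DataNode` and `MetadataRoot`, the round-robin placement and the 4-byte block size are all hypothetical choices made for the example, and nothing here models replication, networking or failure recovery.

```python
class DataNode:
    """A child node: stores raw blocks only and keeps no metadata."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def put(self, block_id, data):
        self.blocks[block_id] = data

    def get(self, block_id):
        return self.blocks[block_id]


class MetadataRoot:
    """The root node: maps each file name to the locations of its blocks."""

    def __init__(self, nodes, block_size=4):
        self.nodes = nodes
        self.block_size = block_size
        self.index = {}  # file name -> list of (node, block_id)

    def write(self, filename, data):
        locations = []
        for offset in range(0, len(data), self.block_size):
            chunk = data[offset:offset + self.block_size]
            # Round-robin placement: an assumption made for this sketch.
            node = self.nodes[len(locations) % len(self.nodes)]
            block_id = f"{filename}#{len(locations)}"
            node.put(block_id, chunk)
            locations.append((node, block_id))
        self.index[filename] = locations

    def read(self, filename):
        # One metadata lookup at the root, then direct block fetches.
        return b"".join(node.get(bid) for node, bid in self.index[filename])


root = MetadataRoot([DataNode("d1"), DataNode("d2"), DataNode("d3")])
root.write("report.txt", b"hello world!")
print(root.read("report.txt"))
```

Reading a file touches the root once to find the block locations and then goes straight to the children that hold the blocks, which is the reason for separating metadata from data.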
Definition 1: A file system that manages the storage of files across a network of machines is called a Distributed File System (DFS).

The DFS must address the following tasks:
• Deciding on which node the data will be stored.
• Deciding what to do when data storage fails.
• Recovering data if the concerned node crashes.
• Storing metadata for fast retrieval.

4. Concept of Hadoop
This section covers the architectural components of Hadoop. Hadoop makes use of a master/slave architecture for both distributed storage and distributed computation. The distributed storage system is called the Hadoop Distributed File System (HDFS), and distributed computation is done with MapReduce. So Hadoop has two major components:
• Hadoop Distributed File System (HDFS) (Storage)
• MapReduce (Processing)

Figure 2: Hadoop File System Architecture

4.1 Hadoop File System
HDFS is a file system designed for storing very large files with a streaming data access pattern, running on clusters of commodity hardware. HDFS has the following components:

1. NameNode
It is the master of the system, which maintains and manages the blocks that are present on the DataNodes. It keeps track of how files are broken down into file blocks, which nodes store those blocks, and the overall health of the distributed file system. It is a single point of failure for a Hadoop cluster.

2. DataNodes
They are slave nodes which are deployed on each machine and provide the actual storage.
3. Secondary NameNode
It is an assistant process which monitors the state of the Hadoop cluster. As discussed earlier, the NameNode is a single point of failure for a Hadoop cluster; the Secondary NameNode periodically takes checkpoints of the NameNode's metadata, which helps to minimize downtime and loss of data.

4.2 MapReduce
The beauty of the Hadoop system lies in MapReduce. It is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. The term MapReduce actually refers to two distinct jobs: Map and Reduce. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.

The master node has a Job Tracker and each slave has a Task Tracker; together they are mainly responsible for MapReduce. The Job Tracker daemon keeps track of a job that may be assigned to multiple nodes. If the task at any node fails, the Job Tracker reschedules that task on another node. The Task Tracker, in turn, is responsible for keeping track of the tasks assigned to its individual node.

Figure 3: The Hadoop System

4.3 Features
• Hadoop is designed to operate on very large files, typically in the range of megabytes, gigabytes, terabytes or petabytes.
• Data is accessed in the form of a byte stream.
• Hadoop does not require expensive hardware. Commonly available hardware is sufficient.

4.4 Drawback
• Hadoop provides high data throughput at the cost of latency, and hence applications that need low-latency data access will not work well with Hadoop.
• Hadoop is not designed to work with small files.
In the Hadoop architecture, metadata is separated from the data and stored on a dedicated node. So the number of files that can be stored in the Hadoop file system does not depend upon the total capacity of the available hard disks. It depends only upon the capacity of the node that holds the metadata (in HDFS, the NameNode keeps this metadata in memory).

5. Conclusion
The classical centralized system was a great solution to the problem of uniform accessibility of information that was otherwise scattered over different network nodes. The information is stored on a set of hard disks connected to a single large mainframe machine. But it failed to solve many complex problems like
Big Data and the frequent failure of the central system. Nowadays the paradigm is shifting from centralized systems to distributed systems. The distributed system provides a better solution in exactly those areas where the centralized system failed. But this shift is not sudden. There is a complete mathematical result which paves the way for it. Probability theory helps us to formalize the basic tree-like architecture of a distributed file system when we try to solve a problem like Big Data. The Hadoop system, which is currently the most widely accepted system in the distributed area, also follows the same basic architecture with a slight modification. The modification addresses the failure of the root of the file system. The Hadoop system not only solves the Big Data problem but also provides an API for solving problems in parallel and monitoring the progress.
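As an illustration of the map and reduce jobs described in Section 4.2, the following is a minimal single-process sketch of a word count. It does not use the Hadoop API; the function names `map_job`, `shuffle`, `reduce_job` and `word_count` are hypothetical, and the grouping step stands in for the work that the framework performs between the two jobs on a real cluster.

```python
from collections import defaultdict


def map_job(line):
    """Map: break one input line into (key, value) tuples, one per word."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values that share a key (done by the framework on a cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_job(key, values):
    """Reduce: combine the tuples for one key into a single smaller tuple."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_job(line)]
    return dict(reduce_job(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big Hadoop data"]))
```

Each call to `map_job` depends only on its own line, and each call to `reduce_job` depends only on its own key, which is what allows both phases to be spread across many nodes.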