Data is the New Oil
By : Abhilash Pande
Aarati Chavan
Himanshu Arora
Just how much data is
generated?
 The world produces 2.5 quintillion bytes a day, and 90%
of all data has been produced in just the last two years.
 Twitter processes 7 TB of data every day.
 Facebook processes 600 TB of data every day.
 Facebook hosts around 10 billion photos, taking up 1
petabyte of storage.
 The Internet Archive stores around 2 petabytes of data
and is growing at a rate of 20 terabytes per month.
 Interestingly, about 80% of this data is unstructured.
How do we manage this data?
 This is where Big Data comes into play.
 The basic idea behind the phrase Big Data is that
everything we do is increasingly leaving a digital trace
which we can use and analyze to become smarter.
 Big data refers to data sets so voluminous and complex
that traditional data-processing application software is
inadequate to deal with them.
 Big data challenges include capturing data, data
storage, data analysis, search, sharing, transfer,
visualization, querying, updating and information
privacy.
Who benefits from this data?
 Corporations have been the greatest beneficiaries of
this data revolution. In 2006, oil and energy companies
dominated the list of the six most valuable firms in the
world, but by 2016 the list was dominated by data firms
like Alphabet, Apple, Facebook, Amazon and Microsoft.
Impact of Big Data on Business
Here are a few examples of the business impact on
industries:
Banking
Big Data has proven to be very effective and beneficial for
financial institutions. It helps them predict consumer
behavior and supports predictive analysis aimed at
improving the customer experience.
Manufacturing
Manufacturing is a process made up of many sub-processes.
Because Big Data draws on the complete database of an
organization, it has proved to be an effective tool for
refining product quality and systematizing defect detection.
Impact of Big Data on Business
 Oil and Gas:
A few big players in the industry have already applied
this technology to their business. From evaluating a
prospective oil field to selling to buyers, Big Data plays
a significant role in every step of the oil and gas
industry.
 Retail
Retail is another sector that has grown with time. Many
retailers credit their success to this new technology.
It helps maintain and replenish inventory and supports
the analysis of sales and profit.
Where is all data coming from?
There are three major sources of big data:
 Social Data: Social media provides companies with
remarkable insights into consumer behavior that can be
used for analysis, with 230 million tweets posted on
Twitter per day, 2.7 billion Likes and comments added
to Facebook every day, and 60 hours of video uploaded
to YouTube every minute.
 Machine-Generated Data: Machine data consists of
information generated by industrial equipment, such as
real-time data from sensors that track parts and monitor
machinery.
 Business-Generated Data: Data produced as a result of
business activities, which can be recorded in structured
or unstructured databases.
4 V’s of Big Data
 Velocity: the speed at which data accumulates.
 Volume: the scale of the data, or the increase in the
amount of data stored.
 Variety: the diversity of the data, from structured data
that fits into rows and columns to unstructured data
such as tweets, videos, and pictures.
 Veracity: the conformity to facts and accuracy of large
amounts of data, that is, the quality and origin of the data.
 Together, these add up to “Value”.
 This refers to our ability to turn data into value,
such as medical or social benefits.
Types of Data
 Structured Data: data made up of clearly defined data
types whose pattern makes them easily searchable.
 Unstructured Data: data that doesn’t have a
predefined form. It is usually not as easily searchable
and includes formats like audio, video, and social
media postings.
 Semi-Structured Data: a combination of structured
and unstructured data that has some organizing
properties but lacks a strictly defined model.
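The three types can be contrasted with a small Python sketch. The sample records, names, and field values below are made up for illustration and do not come from the slides.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields.
structured = io.StringIO("id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n")
rows = list(csv.DictReader(structured))
cities = [row["city"] for row in rows]

# Semi-structured: self-describing keys, but fields may differ per record.
semi_structured = json.loads(
    '[{"id": 1, "tags": ["big data"]}, {"id": 2, "email": "ravi@example.com"}]'
)
emails = [record.get("email") for record in semi_structured]

# Unstructured: free text with no fields, so only crude text search works.
unstructured = "Loved the new phone, battery life is great!"
mentions_battery = "battery" in unstructured.lower()

print(cities, emails, mentions_battery)
```

The structured rows can be queried by column name, the semi-structured records need a fallback for missing keys, and the unstructured text can only be searched as a string.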
Technologies Available for
Managing Big Data
 1. Apache Hadoop
 Apache Hadoop is a Java-based free software framework that can
effectively store large amounts of data in a cluster.
 2. Microsoft HDInsight
 It is a Big Data solution from Microsoft, powered by Apache Hadoop,
which is available as a service in the cloud.
 3. NoSQL
 While traditional SQL databases can effectively handle large
amounts of structured data, we need NoSQL (Not Only SQL) to
handle unstructured data. NoSQL databases store unstructured
data with no fixed schema.
 4. Spark
 Apache Spark is an open-source processing engine built around
speed, ease of use and sophisticated analytics.
What Is Apache Hadoop?
 Apache Hadoop is an open-source software framework
used for distributed storage and processing of big data
sets using the MapReduce programming model.
 It consists of computer clusters built from commodity
hardware.
 The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets
across clusters of computers using simple programming
models.
 It is designed to scale up from single servers to thousands
of machines, each offering local computation and storage.
 All the modules in Hadoop are designed with a fundamental
assumption that hardware failures are common occurrences
and should be automatically handled by the framework
Hadoop
 The core of Apache Hadoop consists of a storage part,
known as the Hadoop Distributed File System (HDFS), and a
processing part based on the MapReduce programming
model.
 Hadoop splits files into large blocks and distributes
them across nodes in a cluster.
 It then transfers packaged code to the nodes to process
the data in parallel.
 This approach takes advantage of data locality, where
nodes manipulate the data they have access to.
 This allows the dataset to be processed faster and more
efficiently than it would be in a more conventional
supercomputer architecture that relies on a parallel file
system where computation and data are distributed via
high-speed networking.
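The data locality idea can be illustrated with a toy Python sketch. This is not Hadoop's actual scheduler: the function `assign_tasks`, the block IDs, and the node names are hypothetical, and node capacity is ignored.

```python
def assign_tasks(block_locations, free_nodes):
    """Assign each block's task to a node, preferring one that stores the block."""
    assignments = {}
    for block, holders in block_locations.items():
        local = [node for node in holders if node in free_nodes]
        # Prefer a data-local node; otherwise fall back to any free node,
        # which then has to read the block over the network.
        node = local[0] if local else sorted(free_nodes)[0]
        assignments[block] = (node, "local" if local else "remote")
    return assignments


# Hypothetical cluster state: which nodes hold a copy of each block.
block_locations = {
    "blk_1": ["node-a", "node-b"],
    "blk_2": ["node-b", "node-c"],
    "blk_3": ["node-d"],
}
free_nodes = {"node-a", "node-b", "node-c"}

plan = assign_tasks(block_locations, free_nodes)
for block, (node, kind) in plan.items():
    print(block, "->", node, f"({kind})")
```

Only `blk_3` needs a remote read here, because its single holder is not free; the other tasks run where the data already sits.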
Hadoop
 The base Apache Hadoop framework is composed of the
following modules:
 Hadoop Common – contains libraries and utilities
needed by other Hadoop modules;
 Hadoop Distributed File System (HDFS) – a distributed
file-system that stores data on commodity machines,
providing very high aggregate bandwidth across the
cluster;
 Hadoop YARN – a platform responsible for managing
computing resources in clusters and using them for
scheduling users' applications;
 Hadoop MapReduce – an implementation of the
MapReduce programming model for large-scale data
processing.
Framework
Hadoop Distributed File System
(HDFS)
 The HDFS is a distributed, scalable, and portable file
system written in Java for the Hadoop framework.
 It provides shell commands and Java application
programming interface (API) methods that are similar to
those of other file systems.
 HDFS is highly fault-tolerant and is designed to be
deployed on low-cost hardware.
 HDFS provides high throughput access to application
data and is suitable for applications that have large
data sets.
 HDFS is a filesystem designed for storing very large files
with streaming data access patterns
Key Concepts
Here are some of the key concepts related to HDFS.
1. NameNode: HDFS works in a master-slave fashion. All the
metadata related to HDFS, including information about
DataNodes, files stored on HDFS, and replication, is
stored and maintained on the NameNode. The NameNode
serves as the master, and there is only one NameNode per
cluster.
2. DataNode: A DataNode is a slave node and holds the user
data in the form of Data Blocks. There can be any number
of DataNodes in a Hadoop cluster.
3. Data Block: A Data Block is the standard unit in which
files are stored on HDFS. Each incoming file is broken
into blocks of 64 MB by default (the default depends on
the version; it is 128 MB from Hadoop 2.x onward).
4. Replication: Data blocks are replicated across different
nodes in the cluster to ensure a high degree of fault
tolerance. Replication enables the use of low-cost
commodity hardware for the storage of data.
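The Data Block and Replication concepts can be worked through with a short Python sketch. It assumes the 64 MB block size mentioned above and a replication factor of 3, which is an assumption not stated in the slides. The round-robin placement is a simplification for illustration (it needs at least as many DataNodes as replicas), and the helper and node names are hypothetical.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=64 * MB):
    """Return the sizes (in bytes) of the blocks a file is split into."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=3):
    """Toy round-robin placement: block i goes to `replication` consecutive nodes."""
    return {
        i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
        for i in range(num_blocks)
    }


# A 200 MB file becomes three full 64 MB blocks plus one 8 MB block.
blocks = split_into_blocks(200 * MB)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])

print([size // MB for size in blocks])
print(placement)
```

Because every block lives on three different DataNodes in this sketch, losing any single node still leaves two copies of each block.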
YARN
 YARN is Hadoop’s cluster resource management system.
MapReduce
 MapReduce is a framework using which we can write
applications to process huge amounts of data, in
parallel, on large clusters of commodity hardware in a
reliable manner.
 Hadoop can run MapReduce programs written in various
languages.
 MapReduce programs are inherently parallel.
 A MapReduce program executes in three stages, namely the
map stage, the shuffle stage, and the reduce stage.
MapReduce
 Map stage: The mapper’s job is to process the input data.
Generally the input data is in the form of a file or directory and is
stored in the Hadoop file system (HDFS). The input file is passed to
the mapper function line by line. The mapper processes the data
and creates several small chunks of intermediate data.
 Reduce stage: This stage is the combination of the Shuffle stage
and the Reduce stage. The reducer’s job is to process the data
that comes from the mapper. After processing, it produces a new
set of output, which is stored in HDFS.
 After completion of the given tasks, the cluster collects and
reduces the data to form an appropriate result, and sends it
back to the Hadoop server.
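The three stages can be sketched in plain Python with a word count, the usual introductory MapReduce example. This runs in a single process and only mimics the data flow; it is not the Hadoop API, and the function names and input lines are chosen for illustration.

```python
from collections import defaultdict


def map_stage(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_stage(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_stage(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


# The input is passed to the mapper line by line, as described above.
lines = ["big data is big", "data is the new oil"]
mapped = [pair for line in lines for pair in map_stage(line)]
grouped = shuffle_stage(mapped)
counts = dict(reduce_stage(key, values) for key, values in grouped.items())

print(counts)
```

On a real cluster the map calls would run in parallel on the nodes holding the input blocks, and the shuffle would move each key's values to the node running its reducer.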
Advantages of Hadoop
 Scalable: Hadoop is a highly scalable storage platform,
because it can store and distribute very large data sets
across hundreds of inexpensive servers that operate in
parallel.
 Cost effective: Hadoop also offers a cost-effective
storage solution for businesses' exploding data sets.
 Flexible: Hadoop enables businesses to easily access
new data sources and tap into different types of data to
generate value from that data.
 Fast: Hadoop's storage method is based on a
distributed file system that 'maps' data wherever it is
located on a cluster, so processing can run on the
servers where the data resides.
 Resilient to failure: A key advantage of using Hadoop is
its fault tolerance.
Thank You

Big data Presentation
