A short presentation on big data and the technologies available for managing it, including a brief description of the Apache Hadoop framework.
Big Data Presentation
1. Data is the New Oil
By : Abhilash Pande
Aarati Chavan
Himanshu Arora
2. Just how much data is
generated?
The world produces 2.5 quintillion bytes a day, and 90%
of all data has been produced in just the last two years.
Twitter processes 7 TB of data every day, and 600 TB of data is processed by Facebook every day.
Facebook hosts around 10 billion photos, taking up 1 petabyte of storage.
The Internet Archive stores around 2 petabytes of data and is growing at a rate of 20 terabytes per month.
Interestingly, about 80% of this data is unstructured.
3. How do we manage this data?
This is where Big Data comes into play.
The basic idea behind the phrase Big Data is that
everything we do is increasingly leaving a digital trace
which we can use and analyze to become smarter.
Big data refers to data sets that are so voluminous and complex
that traditional data-processing application software is
inadequate to deal with them.
Big data challenges include capturing data, data
storage, data analysis, search, sharing, transfer,
visualization, querying, updating and information
privacy.
4. Who benefits from this data?
Corporations have been the greatest beneficiaries of
this data revolution. In 2006, oil and energy companies
dominated the list of the six most valuable firms in the
world, but by 2016 the list was dominated by data firms
such as Alphabet, Apple, Facebook, Amazon and Microsoft.
5. Impact of Big Data on Business
Here are a few examples of the Business Impact on
Industries:
Banking
Big Data has proven to be very effective and beneficial for
financial institutions. It helps them predict consumer
behavior and offers effective predictive analytics for
building an excellent customer experience.
Manufacturing
Manufacturing is a process made up of many sub-processes.
Because Big Data can draw on an organization's complete database, it
has proved to be an effective tool for refining product
quality and systematizing defect detection.
6. Impact of Big Data on Business
Oil and Gas:
A few big players in the industry have
already applied this technology to their business. From
evaluating a prospective oil field to selling the product to
buyers, Big Data plays a significant role at every step of the
oil and gas business.
Retail
Retail is another sector that has grown with time. Many
retailers give credit for their success to this new technology.
It aids in maintaining and replenishing inventory
and helps in the analysis of sales and profit as well.
7. Where is all data coming from?
There are three major sources of Big Data:
Social Data: Social media data provides companies with remarkable
insights into consumer behavior that can be used for analysis,
with 230 million tweets posted on Twitter per day, 2.7 billion
likes and comments added to Facebook every day, and 60 hours of
video uploaded to YouTube every minute.
Machine-Generated Data: Machine data consists of information
generated by industrial equipment and real-time data from sensors
that track parts and monitor machinery.
Business-Generated Data: Data produced as a result of business
activities, which can be recorded in structured or unstructured
databases.
8. 4 V’s of Big Data
Velocity: the speed at which data accumulates.
Volume: the scale of the data, or the increase in the amount of data stored.
Variety: the diversity of the data, from structured data that fits into rows and columns to unstructured data such as tweets, videos and pictures.
Veracity: the conformity to facts and the accuracy, quality and origin of the data, even at large volumes.
These all add up to a fifth V, "Value".
This refers to our ability to turn our data into value, whether that value is medical or social benefits.
9. Type of Data
Structured Data: structured data consists of clearly defined data types whose pattern makes them easily searchable.
Unstructured Data: data that doesn't have a predefined form. It comprises data that is usually not as easily searchable, including formats like audio, video, and social media postings.
Semi-Structured Data: a combination of structured and unstructured data that lacks a strictly defined data model.
10. Technologies Available for
Managing Big Data
1. Apache Hadoop
Apache Hadoop is a Java-based free software framework that can
effectively store large amounts of data in a cluster.
2. Microsoft HDInsight
It is a Big Data solution from Microsoft powered by Apache Hadoop
which is available as a service in the cloud.
3. NoSQL
While traditional SQL can be used effectively to handle large
amounts of structured data, we need NoSQL (Not Only SQL) to
handle unstructured data. NoSQL databases store unstructured
data with no particular schema.
4. Spark
Apache Spark is an open-source processing engine built around
speed, ease of use and sophisticated analytics.
11. What Is Apache Hadoop?
Apache Hadoop is an open-source software framework
used for distributed storage and processing of datasets of
big data using the MapReduce programming model.
It consists of computer clusters built from commodity
hardware.
The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets
across clusters of computers using simple programming
models.
It is designed to scale up from single servers to thousands
of machines, each offering local computation and storage.
All the modules in Hadoop are designed with a fundamental
assumption that hardware failures are common occurrences
and should be automatically handled by the framework.
12. Hadoop
The Core of Apache Hadoop consists of a storage part,
known as Hadoop Distributed File System (HDFS), and a
processing part which is a MapReduce programming
model.
Hadoop splits files into large blocks and distributes
them across nodes in a cluster.
It then transfers packaged code into nodes to process
the data in parallel.
This approach takes advantage of data locality, where
nodes manipulate the data they have access to.
This allows the dataset to be processed faster and more
efficiently than it would be in a more conventional
supercomputer architecture that relies on a parallel file
system where computation and data are distributed via
high-speed networking.
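As a hedged illustration of data locality, the sketch below uses the HDFS Java API to list which DataNodes hold each block of a file; the file path and cluster settings are assumptions made only for this example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at an HDFS cluster; the path below is illustrative.
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/big-input.txt");

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        // Each block reports the DataNodes holding a replica; the scheduler tries to
        // run map tasks on (or near) one of these hosts -- that is data locality.
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```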
13. Hadoop
The base Apache Hadoop framework is composed of the
following modules:
Hadoop Common – contains libraries and utilities
needed by other Hadoop modules;
Hadoop Distributed File System (HDFS) – a distributed
file-system that stores data on commodity machines,
providing very high aggregate bandwidth across the
cluster;
Hadoop YARN – a platform responsible for managing
computing resources in clusters and using them for
scheduling users' applications;
Hadoop MapReduce – an implementation of the
MapReduce programming model for large-scale data
processing.
15. Hadoop Distributed File System
(HDFS)
The HDFS is a distributed, scalable, and portable file
system written in Java for the Hadoop framework.
It provides shell commands and Java application
programming interface (API) methods that are similar to
those of other file systems (a small sketch of the Java API appears below).
HDFS is highly fault-tolerant and is designed to be
deployed on low-cost hardware.
HDFS provides high throughput access to application
data and is suitable for applications that have large
data sets.
HDFS is a filesystem designed for storing very large files
with streaming data access patterns.
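As a brief, hedged sketch of the Java API mentioned above, the following program writes a small file to HDFS and reads it back; the NameNode URI and file path are placeholders chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; a real cluster would set fs.defaultFS in core-site.xml.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");

        // Write a small file to HDFS.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back and copy the contents to stdout.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}
```

The roughly equivalent shell commands are `hadoop fs -put` and `hadoop fs -cat`.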
16. Key Concepts
Here are some of the key concepts related to HDFS.
1. NameNode: HDFS works in a master-slave fashion. All the
metadata related to HDFS, including information about the
DataNodes, the files stored on HDFS, and replication, is
stored and maintained on the NameNode. The NameNode
serves as the master, and there is only one NameNode per
cluster.
2. DataNode: DataNode is the slave node and holds the user
data in the form of Data Blocks. There can be any number
of DataNodes in a Hadoop Cluster.
3. Data Block: A Data Block can be considered the standard
unit of data/files stored on HDFS. Each incoming file is
broken into blocks of 64 MB by default (128 MB in newer versions).
4. Replication: Data blocks are replicated across different
nodes in the cluster to ensure a high degree of fault
tolerance. Replication enables the use of low-cost
commodity hardware for the storage of data. Both the block
size and the replication factor are configurable, as sketched after this list.
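As a hedged illustration of the configuration mentioned above, the snippet below sets the block size and replication factor through a client-side Configuration. The property names are the standard HDFS ones, but the values are examples only; production clusters normally set these cluster-wide in hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;

public class HdfsTuning {
    // A minimal sketch: real HDFS property names, illustrative values.
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks (example value)
        conf.setInt("dfs.replication", 3);                 // keep 3 copies of each block (example value)
        return conf;
    }
}
```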
17. Yarn
YARN is Hadoop’s cluster resource management system.
18. MapReduce
MapReduce is a framework with which we can write
applications that process huge amounts of data, in
parallel, on large clusters of commodity hardware in a
reliable manner.
Hadoop can run MapReduce programs written in various
languages.
MapReduce programs are inherently parallel.
A MapReduce program executes in three stages, namely the
map stage, the shuffle stage, and the reduce stage.
19. MapReduce
Map stage: The map or mapper's job is to process the input data.
Generally the input data is in the form of a file or directory and is
stored in the Hadoop file system (HDFS). The input file is passed to
the mapper function line by line. The mapper processes the data
and creates several small chunks of data.
Reduce stage: This stage is the combination of the shuffle stage
and the reduce stage. The reducer's job is to process the data
that comes from the mapper. After processing, it produces a new
set of output, which is stored in HDFS. The word-count sketch
below illustrates both stages.
After completion of the given tasks, the cluster collects and
reduces the data to form an appropriate result, and sends it
back to the Hadoop server.
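To make the map and reduce stages concrete, here is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API; the input and output paths are taken from the command line, and the job name is illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map stage: each input line is split into words, emitting (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce stage: the shuffle groups all counts for a word, which are summed here.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```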
20. Advantages of Hadoop
Scalable: Hadoop is a highly scalable storage platform,
because it can store and distribute very large data sets
across hundreds of inexpensive servers that operate in
parallel.
Cost effective: Hadoop also offers a cost-effective
storage solution for businesses' exploding data sets.
Flexible: Hadoop enables businesses to easily access
new data sources and tap into different types of data to
generate value from that data.
Fast: Hadoop's unique storage method is based on a
distributed file system that basically 'maps' data
wherever it is located on a cluster.
Resilient to failure: A key advantage of using Hadoop is
its fault tolerance.