BigData Analytics

AMSTECH Incorporation Pvt. Ltd.
Presented by: Mayank Kumar Sharma

Internet = Ocean of information

What is BigData?
What makes data, “Big” Data?
Why BigData?

“Extremely large data sets that may be analyzed
computationally to reveal patterns, trends, and
associations, especially relating to human behavior
and interactions are known as BigData.”
OR
BigData is the term for a collection of data sets
so large and complex that it becomes difficult to
process using on-hand database management tools or
traditional data processing applications.
What is BigData?

Gartner definition (2012): “BigData is high
volume, high velocity, and/or high variety information
assets that require new forms of processing to enable
enhanced decision making, insight discovery and
process optimization.”
“No exact Definition, Only Experience.”
What is BigData?

Every day, we create 2.5 quintillion bytes of data, so
much that 90% of the data in the world today has been
created in the last two years alone.
An example of big data might be petabytes (1,024
terabytes) or exabytes (1,024 petabytes) of data consisting
of billions to trillions of records of millions of people.
Storage capacity increases 23% on average annually.
Data growth is exponential over the decade starting in 2010.
What makes data, “Big” Data?

• Creates over 30 billion pieces of content per day.
• Stores 30 petabytes of data.
• 90 million tweets per day.

Why BigData?
To Manage Data Better.
[Abstraction has enabled numerous use cases involving
data in a wide variety of formats.]
Benefit From Speed, Capacity and Scalability of Cloud
Storage.
[For those who utilize substantially large data sets, the cloud provides both
the storage and the computing power necessary to crunch data for a specific period.]
End Users Can Visualize Data
[Data in easy-to-read charts, graphs and slideshows]

Why BigData?
Find New Business Opportunities.
[Social media, Business Intelligence]
Data Analysis Methods, Capabilities Will Evolve
[Utilizing substantially large data sets requires both the storage and
the computing power necessary to crunch data for a specific period.]

Who uses BigData?
1. Banking
2. Education
3. Government
4. Health Care
5. Manufacturing
6. Retail
“It’s important to remember that the primary value
from big data comes not from the data in its raw form, but
from the processing and analysis of it and the insights,
products, and services that emerge from analysis.”

BigData Challenges

Big data can be characterized by the 3Vs:
Volume, Velocity and Variety.
Characteristics of Big Data:

Data Volume
• 44x increase from 2009 to 2020
• From 0.8 zettabytes to 35 zettabytes
Data volume is increasing exponentially
Exponential increase in collected/generated data
Volume : BigData 3Vs
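
As a quick check of the arithmetic on this slide, the short Python sketch below recomputes the overall growth multiple from the two quoted volumes and derives the yearly growth rate they imply. The constant-rate assumption and the function name implied_annual_growth are illustrative choices, not part of the original slides.

```python
def implied_annual_growth(start: float, end: float, years: int) -> float:
    """Constant yearly growth rate that turns `start` into `end` after `years` years."""
    return (end / start) ** (1 / years) - 1


# Figures quoted on the slide: 0.8 ZB in 2009 and 35 ZB in 2020.
start_zb, end_zb = 0.8, 35.0
years = 2020 - 2009

# Overall multiple between the two volumes; the slide rounds this to 44x.
growth_factor = end_zb / start_zb

# Assumption: growth is spread evenly (compounded) over the 11 years.
annual_rate = implied_annual_growth(start_zb, end_zb, years)

print(f"Overall growth: {growth_factor:.2f}x")
print(f"Implied annual growth: {annual_rate:.1%}")
```

Under that assumption the data volume grows by roughly 41% a year, which is what "exponential" means in practice here.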

Various formats, types, and structures.
Text, numerical, images, audio, video, sequences, time
series, social media data, multi-dim arrays, etc.
Static data vs. streaming data
A single application can be generating/collecting many types
of data
To extract knowledge, all these types of data need
to be linked together
Variety : BigData 3Vs

Data is being generated fast and needs to be processed fast
Online Data Analytics
Late decisions → missing opportunities
Examples
• E-Promotions: based on your current location, your purchase history, and
what you like → send promotions right now for the store next to you
• Healthcare monitoring: sensors monitor your activities and body
→ any abnormal measurements require immediate reaction
Velocity : BigData 3Vs

Shim, K. S., Lee, S. K. and Kim, M. S., “Application Traffic Classification
in Hadoop Distributed Computing Environment”, Asia-Pacific
Network Operations and Management Symposium (APNOMS), 2014.
1. This research work proposes application traffic
classification in a Hadoop distributed computing environment.
2. The traffic phenomena of current networks have changed, and
conventional traffic analysis methods are no longer adequate.
3. In the proposed solution, the authors consider packet units of
traffic from a campus network. Collected packets are converted
into flow format by the flow generator. A flow is
defined by its 5-tuple.
Research Study

Conclusion
4. The proposed method performs well in terms of processing
speed, as shown by a comparison between the Hadoop-based
system and a single-server system.
5. On the other hand, it has certain drawbacks:
1. Adoption of a classification technique rather than
clustering.
2. Low analysis rate.
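
The flow-generation step from the research study (packets grouped into flows by their 5-tuple) can be sketched in a few lines of Python. This is only an illustration: the slides do not give the paper's exact flow definition, so the sketch assumes the common unidirectional 5-tuple of source IP, destination IP, source port, destination port and protocol. The names Packet and generate_flows are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical packet record; the field names are assumptions for this sketch."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    size: int  # payload size in bytes


def generate_flows(packets):
    """Group packets into flows keyed by the (assumed) unidirectional 5-tuple."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for p in packets:
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.protocol)
        flows[key]["packets"] += 1
        flows[key]["bytes"] += p.size
    return dict(flows)


# Made-up sample traffic: two packets of one TCP flow and one UDP packet.
packets = [
    Packet("10.0.0.1", "10.0.0.2", 5000, 80, "TCP", 100),
    Packet("10.0.0.1", "10.0.0.2", 5000, 80, "TCP", 200),
    Packet("10.0.0.3", "10.0.0.2", 5001, 53, "UDP", 60),
]
flows = generate_flows(packets)

for key, stats in flows.items():
    print(key, stats)
```

Each flow record is independent of the others, which is what makes this kind of aggregation a natural fit for a distributed framework such as Hadoop.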

Existing Solution for Traffic Classification

BigData Analytics

BigData Analytics Use Cases
[Figure: use cases (Real Time Intelligence, Data Discovery, Business Intelligence) and users (Data Scientist, Business User, Consumer)]

1. Hadoop is a free, Java-based programming framework
that supports the processing of large data sets in a
distributed computing environment.
2. The Hadoop Distributed File System (HDFS) is designed
to store very large data sets reliably, and to stream those
data sets at high bandwidth to user applications.
3. By distributing storage and computation across many
servers, the resource can grow with demand while
remaining economical at every size.
BigData: Hadoop

4. An important characteristic of Hadoop is the partitioning
of data and computation across many (thousands of)
hosts, and the execution of application computations in parallel
close to their data.
5. A Hadoop cluster scales computation capacity, storage
capacity and IO bandwidth by simply adding commodity
servers.
6. In simple words, it is a scalable, fault-tolerant grid
operating system for data storage and processing, with
high bandwidth and clustered storage.
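
The idea of partitioning data and running the computation in parallel next to each partition can be illustrated with a small word-count sketch in plain Python. It does not use any Hadoop API: threads stand in for hosts, and the names partition, count_words and run_job are hypothetical helpers chosen for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_hosts):
    """Spread records round-robin over `num_hosts` partitions (one per simulated host)."""
    parts = [[] for _ in range(num_hosts)]
    for i, record in enumerate(records):
        parts[i % num_hosts].append(record)
    return parts


def count_words(part):
    """Map step: each simulated host counts words only in its own partition."""
    counts = Counter()
    for line in part:
        counts.update(line.split())
    return counts


def run_job(records, num_hosts):
    """Run the map step on every partition in parallel, then merge the partial counts."""
    parts = partition(records, num_hosts)
    with ThreadPoolExecutor(max_workers=num_hosts) as pool:
        partials = list(pool.map(count_words, parts))
    total = Counter()
    for partial in partials:  # reduce step: combine per-host results
        total.update(partial)
    return total


records = ["big data", "big cluster", "data node", "name node"]
result = run_job(records, num_hosts=2)
print(result)
```

Changing num_hosts changes how the work is split but not the final counts, which mirrors the claim that capacity grows by adding servers rather than by rewriting the application.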

Figure 2: HADOOP Components

Figure 3: HDFS Processing

1. NameNode is the centerpiece of HDFS.
2. NameNode is also known as the Master.
3. NameNode only stores the metadata of HDFS: the
directory tree of all files in the file system. It also tracks the
files across the cluster.
4. NameNode does not store the actual data or the dataset.
The data itself is stored in the DataNodes.
5. NameNode knows the list of blocks and their locations
for any given file in HDFS. With this information
NameNode knows how to construct the file from blocks.
Name Node

6. NameNode is so critical to HDFS that when the
NameNode is down, the HDFS/Hadoop cluster is inaccessible
and considered down.
7. NameNode is a single point of failure in a Hadoop cluster.
8. NameNode is usually configured with a lot of memory
(RAM), because the block locations are held in main
memory.
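
The NameNode's role as a pure metadata store can be pictured with a toy in-memory model. The sketch below is not the real HDFS implementation or API; the class NameNodeSketch, its methods, and the block and node names are all hypothetical. It only shows the two mappings the points above describe: file to ordered blocks, and block to DataNode locations.

```python
class NameNodeSketch:
    """Toy in-memory metadata store; it never holds file contents."""

    def __init__(self):
        self.file_blocks = {}      # file path -> ordered list of block ids
        self.block_locations = {}  # block id -> set of DataNode names

    def add_file(self, path, block_ids):
        """Record which blocks, in order, make up a file."""
        self.file_blocks[path] = list(block_ids)

    def report_block(self, block_id, datanode):
        """Record that a DataNode holds a copy of a block."""
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def read_plan(self, path):
        """Ordered (block id, DataNode names) pairs needed to rebuild the file."""
        return [
            (block_id, sorted(self.block_locations.get(block_id, ())))
            for block_id in self.file_blocks[path]
        ]


# Hypothetical file split into two blocks, stored on made-up DataNodes.
namenode = NameNodeSketch()
namenode.add_file("/logs/day1.txt", ["blk_1", "blk_2"])
namenode.report_block("blk_1", "dn1")
namenode.report_block("blk_1", "dn2")
namenode.report_block("blk_2", "dn3")

plan = namenode.read_plan("/logs/day1.txt")
print(plan)
```

Because both dictionaries live in memory, the amount of RAM limits how many files and blocks such a node can track, which is consistent with the sizing advice in point 8.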

1. DataNode is responsible for storing the actual data in HDFS.
2. DataNode is also known as the Slave.
3. NameNode and DataNode are in constant communication.
4. When a DataNode starts up, it announces itself to the
NameNode along with the list of blocks it is responsible for.
5. When a DataNode is down, it does not affect the availability
of data or the cluster. NameNode will arrange for replication
of the blocks managed by the DataNode that is not
available.
6. DataNode is usually configured with a lot of hard disk space,
because the actual data is stored in the DataNode.
DataNode
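
The start-up announcement and the re-replication after a DataNode failure can be sketched as a small simulation. This is a simplified model under stated assumptions, not HDFS code: the class ClusterSketch and its methods are hypothetical, the target of two copies per block is chosen only to keep the example small, and the "least loaded node first" placement rule is an illustrative choice.

```python
REPLICATION = 2  # assumed target number of copies per block for this toy example


class ClusterSketch:
    """Toy view of which live DataNode holds which blocks."""

    def __init__(self, replication=REPLICATION):
        self.replication = replication
        self.blocks_by_node = {}  # live DataNode name -> set of block ids

    def register(self, node, blocks=()):
        """A DataNode starts up and announces the blocks it is responsible for."""
        self.blocks_by_node[node] = set(blocks)

    def holders(self, block):
        """Names of the live DataNodes that currently hold a copy of `block`."""
        return sorted(n for n, blocks in self.blocks_by_node.items() if block in blocks)

    def handle_failure(self, node):
        """Drop a dead DataNode and re-replicate its blocks from surviving copies."""
        lost = self.blocks_by_node.pop(node, set())
        for block in sorted(lost):
            # Only blocks with at least one surviving copy can be re-replicated.
            while 0 < len(self.holders(block)) < self.replication:
                candidates = sorted(
                    (n for n in self.blocks_by_node if block not in self.blocks_by_node[n]),
                    key=lambda n: (len(self.blocks_by_node[n]), n),  # least loaded first
                )
                if not candidates:
                    break  # no live node left that lacks this block
                self.blocks_by_node[candidates[0]].add(block)


cluster = ClusterSketch()
cluster.register("dn1", ["b1", "b2"])
cluster.register("dn2", ["b1"])
cluster.register("dn3", ["b2"])
cluster.register("dn4")

cluster.handle_failure("dn1")
print(cluster.holders("b1"), cluster.holders("b2"))
```

After the simulated failure every block is back at the target number of copies, which is why losing a single DataNode does not affect data availability.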

Operation series when writing a file

Operation series when reading a file

Hadoop Configuration

Thanks A Lot
AMSTECH Incorporation Pvt. Ltd.
By: Mayank Kumar Sharma
