5. What is BigData?
What makes data, “Big” Data?
Why BigData?
AMSTECH Incorporation Pvt. Ltd.
6. “Extremely large data sets that may be analyzed
computationally to reveal patterns, trends, and
associations, especially relating to human behavior
and interactions are known as BigData.”
OR
BigData is the term for a collection of data sets
so large and complex that it becomes difficult to
process using on-hand database management tools or
traditional data processing applications.
What is BigData?
7. Gartner Definition (2012): “BigData is high
volume, high velocity, and/or high variety information
assets that require new forms of processing to enable
enhanced decision making, insight discovery and
process optimization.”
“No exact Definition, Only Experience.”
What is BigData?
8. Every day, we create 2.5 quintillion bytes of data, so
much that 90% of the data in the world today has been
created in the last two years alone.
An example of big data might be petabytes (1,024
terabytes) or exabytes (1,024 petabytes) of data consisting
of billions to trillions of records of millions of people.
Storage capacity increases by 23% per year on average.
Exponential growth over the decade starting in 2010.
What makes data, “Big” Data?
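To get a feel for what a 23% average annual increase means, the growth can be compounded over several years. This is a minimal sketch: the function name `capacity_multiplier` and the 10-year horizon are assumptions made for illustration, not figures from the slides.

```python
def capacity_multiplier(annual_growth: float, years: int) -> float:
    """Return how many times capacity grows after compounding for `years`."""
    return (1.0 + annual_growth) ** years


# 23% is the average annual increase quoted on the slide; the 10-year
# horizon is an assumption chosen only for illustration.
multiplier = capacity_multiplier(0.23, 10)
print(f"Capacity after 10 years: {multiplier:.1f}x the starting capacity")
```

Under these assumptions capacity grows roughly eightfold in a decade, which is why steady percentage growth is described as exponential.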
9. • Creates over 30 billion pieces of content per day.
• Stores 30 petabytes of data.
• 90 million tweets per day.
10. Why BigData?
To Manage Data Better.
[Abstraction has enabled numerous use cases involving
data in a wide variety of formats.]
Benefit From Speed, Capacity and Scalability of Cloud
Storage.
[Utilizing substantially large data sets requires both the storage and
the computing power necessary to crunch data for a specific period.]
End Users Can Visualize Data
[Data in easy-to-read charts, graphs and slideshows]
11. Why BigData?
Find New Business Opportunities.
[Social media, Business Intelligence]
Data Analysis Methods, Capabilities Will Evolve
[Utilizing substantially large data sets requires both the storage and
the computing power necessary to crunch data for a specific period.]
13. Who uses BigData?
1. Banking
2. Education
3. Government
4. Health Care
5. Manufacturing
6. Retail
“It’s important to remember that the primary value
from big data comes not from the data in its raw form, but
from the processing and analysis of it and the insights,
products, and services that emerge from analysis.”
15. Big data can be characterized by the 3Vs:
Volume, Velocity and Variety.
Characteristics of Big Data:
16. Data Volume
44x increase from 2009 to 2020
From 0.8 zettabytes to 35 zettabytes
Data volume is increasing exponentially
Exponential increase in
collected/generated data
Volume : BigData 3Vs
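The 44x figure can be checked with simple arithmetic on the two volumes quoted for 2009 and 2020. The sketch below also derives an implied yearly growth rate, which assumes smooth compounding between the two dates; that assumption is ours and is not stated on the slide.

```python
start_zb, end_zb = 0.8, 35.0   # zettabytes in 2009 and 2020, from the slide
years = 2020 - 2009            # 11-year span

growth_factor = end_zb / start_zb
# Implied constant annual growth rate, assuming smooth compounding.
annual_rate = growth_factor ** (1 / years) - 1

print(f"Growth factor: {growth_factor:.2f}x")
print(f"Implied annual growth: {annual_rate:.0%}")
```

The ratio comes out just under 44, and the implied rate is roughly 41% per year under the compounding assumption.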
17. Various formats, types, and structures.
Text, numerical, images, audio, video, sequences, time
series, social media data, multi-dim arrays, etc…
Static data vs. streaming data
A single application can be generating/collecting many types
of data
To extract knowledge, all these types of data need
to be linked together
Variety : BigData 3Vs
19. Data is being generated fast and needs to be processed fast
Online Data Analytics
Late decisions mean missed opportunities
Examples
E-Promotions: based on your current location, your purchase history, and
what you like, send promotions right now for the store next to you
Healthcare monitoring: sensors monitor your activities and body;
any abnormal measurements require immediate reaction.
Velocity : BigData 3Vs
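The healthcare monitoring example can be sketched as a check applied to each reading as it arrives, rather than to a stored batch. This is only an illustration: the heart-rate limits `LOW_BPM` and `HIGH_BPM`, the function name, and the sample readings are all hypothetical.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical normal range for heart rate in beats per minute;
# real limits would come from clinical guidance, not from this sketch.
LOW_BPM, HIGH_BPM = 50, 120


def abnormal_readings(
    stream: Iterable[Tuple[int, int]]
) -> Iterator[Tuple[int, int]]:
    """Yield (timestamp, bpm) pairs outside the normal range as they arrive."""
    for timestamp, bpm in stream:
        if bpm < LOW_BPM or bpm > HIGH_BPM:
            yield timestamp, bpm


# Made-up sample readings: (seconds since start, beats per minute).
readings = [(0, 72), (1, 75), (2, 131), (3, 70), (4, 44)]
alerts = list(abnormal_readings(readings))
print(alerts)
```

Because the function is a generator, each abnormal reading is reported as soon as it is seen, which is the point of processing fast-moving data online.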
21. Shim, K. S., Lee, S. K. and Kim, M. S., “Application Traffic Classification
in Hadoop Distributed Computing Environment,” Asia-Pacific Network
Operations and Management Symposium (APNOMS), 2014.
1. This research work proposed an application traffic
classification method for the Hadoop Distributed Computing Environment.
2. Traffic patterns in current networks have changed, and
conventional traffic analysis methods are no longer adequate.
3. In the proposed solution, the authors consider packet units of
traffic from a campus network. Collected packets are converted
into flow format by the flow generator. Each flow is
defined by its 5-tuple.
Research Study
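The packet-to-flow conversion can be sketched as grouping packets by a shared 5-tuple. The slides do not spell out the tuple fields or how the paper's flow generator works, so this sketch assumes the common definition (source IP, destination IP, source port, destination port, protocol); the `Packet` type, `generate_flows` function, and sample packets are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Packet(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    size: int  # payload size in bytes


FlowKey = Tuple[str, str, int, int, str]


def generate_flows(packets: List[Packet]) -> Dict[FlowKey, Dict[str, int]]:
    """Group packets that share a 5-tuple and total their counts and bytes."""
    flows: Dict[FlowKey, Dict[str, int]] = defaultdict(
        lambda: {"packets": 0, "bytes": 0}
    )
    for p in packets:
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.protocol)
        flows[key]["packets"] += 1
        flows[key]["bytes"] += p.size
    return dict(flows)


# Made-up packets: two belong to one TCP flow, one to a separate UDP flow.
packets = [
    Packet("10.0.0.1", "10.0.0.9", 5000, 80, "TCP", 400),
    Packet("10.0.0.1", "10.0.0.9", 5000, 80, "TCP", 600),
    Packet("10.0.0.2", "10.0.0.9", 6000, 53, "UDP", 80),
]
flows = generate_flows(packets)
for key, stats in flows.items():
    print(key, stats)
```

Grouping by key is what makes this step a natural fit for a distributed framework: packets with the same 5-tuple can be routed to the same worker and summarized independently.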
22. Conclusion
4. The proposed method performs well in terms of processing
speed, as shown by a comparison between the Hadoop-based
system and a single-server system.
5. On the other hand, it has certain drawbacks:
1. Adoption of a classification technique rather than
clustering.
2. Low analysis rate.
25. BigData Analytics Use Cases
[Diagram labels: Real Time Intelligence, Data Discovery, Business
Intelligence; Data Scientist, Business User, Consumer]
26. 1. Hadoop is a free, Java-based programming framework
that supports the processing of large data sets in a
distributed computing environment.
2. The Hadoop Distributed File System (HDFS) is designed
to store very large data sets reliably, and to stream those
data sets at high bandwidth to user applications.
3. By distributing storage and computation across many
servers, the resource can grow with demand while
remaining economical at every size.
BigData: Hadoop
27. 4. An important characteristic of Hadoop is the partitioning
of data and computation across many (thousands of)
hosts, and the execution of application computations in parallel
close to their data.
5. A Hadoop cluster scales computation capacity, storage
capacity and IO bandwidth by simply adding commodity
servers.
6. In simple words, it is a scalable, fault-tolerant grid
operating system for data storage and processing with
high bandwidth and clustered storage.
31. 1. NameNode is the centerpiece of HDFS.
2. NameNode is also known as the Master
3. NameNode only stores the metadata of HDFS – the
directory tree of all files in the file system, and tracks the
files across the cluster.
4. NameNode does not store the actual data or the dataset.
The data itself is actually stored in the DataNodes.
5. NameNode knows the list of blocks and their locations
for any given file in HDFS. With this information
NameNode knows how to construct the file from blocks.
NameNode
32. 6. NameNode is so critical to HDFS that when the
NameNode is down, the HDFS/Hadoop cluster is inaccessible
and considered down.
7. NameNode is a single point of failure in a Hadoop cluster.
8. NameNode is usually configured with a lot of memory
(RAM), because the block locations are held in main
memory.
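The NameNode's role can be pictured as two in-memory lookup tables: one from a file to its ordered blocks, and one from a block to the DataNodes holding it. The class below is a toy model for illustration only; `ToyNameNode`, its methods, and the block and node names are hypothetical and do not reflect the real HDFS implementation.

```python
from typing import Dict, List, Tuple


class ToyNameNode:
    """Holds metadata only: which blocks form a file and where they live."""

    def __init__(self) -> None:
        self.file_blocks: Dict[str, List[str]] = {}      # path -> ordered block IDs
        self.block_locations: Dict[str, List[str]] = {}  # block ID -> DataNodes

    def add_file(self, path: str, blocks: Dict[str, List[str]]) -> None:
        """Record a file's blocks (in order) and the nodes storing each one."""
        self.file_blocks[path] = list(blocks)
        for block_id, nodes in blocks.items():
            self.block_locations[block_id] = list(nodes)

    def locate(self, path: str) -> List[Tuple[str, List[str]]]:
        """Return, in order, each block of the file and where to read it."""
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]


namenode = ToyNameNode()
namenode.add_file(
    "/logs/day1.txt",
    {"blk_1": ["dn1", "dn2"], "blk_2": ["dn2", "dn3"]},
)
plan = namenode.locate("/logs/day1.txt")
print(plan)
```

No file contents appear anywhere in this structure, and everything lives in memory, which is consistent with the NameNode needing RAM rather than disk.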
33. 1. DataNode is responsible for storing the actual data in HDFS.
2. DataNode is also known as the Slave
3. NameNode and DataNode are in constant communication.
4. When a DataNode starts up, it announces itself to the
NameNode along with the list of blocks it is responsible for.
5. When a DataNode is down, it does not affect the availability
of data or the cluster. NameNode will arrange for replication
for the blocks managed by the DataNode that is not
available.
6. DataNode is usually configured with a lot of hard disk space,
because the actual data is stored in the DataNode.
DataNode
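Re-replication after a DataNode failure can be sketched as: drop the failed node from every block's location list, then copy under-replicated blocks to other live nodes. This is a simplified illustration, not HDFS's actual placement policy; the `rereplicate` function, the node names, and the target of 2 replicas are assumptions chosen to keep the example small.

```python
from typing import Dict, Set


def rereplicate(
    block_locations: Dict[str, Set[str]],
    live_nodes: Set[str],
    failed: str,
    target: int,
) -> Dict[str, Set[str]]:
    """Return the block placement after `failed` leaves the cluster."""
    updated: Dict[str, Set[str]] = {}
    for block, nodes in block_locations.items():
        remaining = nodes - {failed}
        # Live nodes that do not yet hold this block, in a fixed order.
        candidates = sorted(live_nodes - remaining - {failed})
        while len(remaining) < target and candidates:
            remaining = remaining | {candidates.pop(0)}
        updated[block] = remaining
    return updated


before = {"blk_1": {"dn1", "dn2"}, "blk_2": {"dn2", "dn3"}}
after = rereplicate(before, live_nodes={"dn1", "dn3", "dn4"}, failed="dn2", target=2)
print(after)
```

Every block still has a surviving copy when the node fails, so the data stays available while the missing replicas are rebuilt.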