BIG DATA
Prepared by
Bhuvaneshwari.P
Research Scholar, VIT University,
Vellore
Introduction
DEFINITION
Big data is defined as the collection of large
and complex datasets that are difficult to
process using traditional database tools or
data processing application software.
Evolution of data scale:
Mainframe (Kilobytes) -> Client/Server (Megabytes) -> The Internet (Gigabytes) -> Mobile, Social media… [Big data] (Zettabytes)
Characteristics of Big data
The characteristics of big data are specified with 5 V's:
1. Volume – The vast amount of data generated
every second. [Kilobytes -> Mega -> Giga -> Tera -> Peta
-> Exa -> Zetta -> Yotta]
2. Variety – The different kinds of data
generated from different sources.
3. Velocity – The speed at which data is generated,
processed, and moved around.
4. Value – Bringing out the correct meaning
from the available data.
5. Veracity – The uncertainty and
inconsistencies in the data.
Categories of Big data
Big data is categorized into three forms.
1. Structured – Data that can be stored and
processed in a predefined format. Ex: Tables,
RDBMS data.
2. Unstructured – Any data without a known structure
or form. Ex: Output returned by a Google
search, audio, video, images.
3. Semi-structured – Data that contains both
of the above forms. Ex: JSON, CSV, XML, email.
Data types -> emails, text messages, photos,
videos, logs, documents, transactions, click trails,
public records, etc.
Examples of big data
Some examples of big data
1. Social media: 500+ terabytes of data are
generated on Facebook every day, 100,000+
tweets are posted every 60 seconds, and 300
hours of video are uploaded to YouTube per
minute.
2. Airlines: A single jet engine produces
10+ terabytes of data in 30 minutes of flight
time.
Cont..,
3. Stock Exchange – The New York Stock
Exchange generates about one terabyte of
new trade data every day.
4. Mobile Phones – Every 60 seconds, users
generate 698,445+ Google searches,
11,000,000+ instant messages, and
168,000,000 emails.
5. Walmart handles more than one million
customer transactions every hour.
Sources of big data
1. Activity data – Basic activities such as searches
are stored by web browsers, phone usage is
stored by mobile phones, credit card
companies store where customers buy, and
shops store what they buy.
2. Conversational data – Conversations in emails
and on social media sites such as Facebook,
Twitter, and so on.
Cont.,
3. Photo and video data – Pictures and
videos taken with mobile phones, digital
cameras, and CCTV are uploaded heavily to
YouTube and social media sites every second.
4. Sensor data – The sensors embedded in
devices produce huge amounts of data. Ex: GPS
provides the direction and speed of a vehicle.
5. IoT data – Smart TVs, smart watches, smart fridges,
etc. Ex: Traffic sensors send data to the alarm
clock on a smart watch.
Typical Classification
I. Internal data – Supports daily business
operations, such as organizational or
enterprise data (structured). Ex: Customer
data, sales data, ERP, CRM, etc.
II. External data – Analyzed for
competitors, the market environment, and
technology, such as social data
(unstructured). Ex: Internet, government,
business partners, syndicate data suppliers,
etc.
Big data storage
Big data storage is concerned with storing and
managing data in a scalable way, satisfying
the needs of applications that require access
to the data.
Some of the big data storage technologies are:
1. Distributed file system – Stores large amounts
of unstructured data reliably on
commodity hardware.
Cont.,
The Hadoop Distributed File System (HDFS) is an
integral part of the Hadoop framework; it is designed
for large data files and is well suited for quickly
ingesting data and bulk processing.
2. NoSQL database – A database that stores and
retrieves data modeled in means other than
tabular relations, and which lacks ACID
transactions.
Supports both structured and unstructured data.
The data structures used are key-value, wide
column, graph, or document.
Less functionality, more performance.
It focuses on scalability, performance, and high
availability.
Evolution of storage: Flat files (no standard
implementation) -> RDBMS (could not handle big
data) -> NoSQL.
3. NewSQL database - Provide the same
scalable performance of NoSQL systems
for Online Transaction Processing (OLTP) read-
write workloads while still maintaining
the ACID guarantees of a traditional database
system
4. Cloud storage – Service model in which data
is maintained, managed, backed up remotely
and made available to users over the Internet
Cont.,
Eliminates the acquisition and management
costs of buying and maintaining your own
storage infrastructure, increases agility,
provides global scale, and delivers "anywhere,
anytime" access to data
Users generally pay for their cloud data
storage on a per-consumption, "pay as per
use" basis.
Data intelligence
Data intelligence - Analysis of various forms of
data in such a way that it can be used by
companies to expand their services or
investments
Transforming data into information,
information into knowledge, and knowledge
into value
Data integration and serialization
Data integration- Combining data residing in
different sources and providing users with a
unified view of them
Data serialization – The concept of
converting structured data into a format that
allows it to be shared or stored in such a way
that its original structure can be recovered.
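The round trip described above can be sketched with Python's standard `json` module (the record contents are hypothetical):

```python
import json

# A structured record (hypothetical example data)
record = {"user": "alice", "purchases": [3, 5], "active": True}

# Serialize: convert the structure to a string that can be shared or stored
encoded = json.dumps(record)

# Deserialize: the original structure is recovered
decoded = json.loads(encoded)
assert decoded == record
```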
Data monitoring
Data monitoring- It allows an organization to
proactively maintain a high, consistent
standard of data quality
• By checking data routinely as it is stored
within applications, organizations can avoid
the resource-intensive pre-processing of data
before it is moved
• With data monitoring, data quality is checked
at creation time rather than before a move.
Data indexing
Data indexing – An index is a data structure that is
added to a file to provide faster access to the
data.
• It reduces the number of blocks that the
DBMS has to check.
• It contains a search key and a pointer. Search
key – an attribute or set of attributes that is
used to look up the records in a file.
• Pointer – contains the address of where the
data is stored in memory.
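A minimal sketch of the search-key-to-pointer idea above, using an in-memory list of "blocks" as a stand-in for a file (all names and data are hypothetical):

```python
# Simulated file: records stored in blocks
blocks = [
    [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
    [{"id": 3, "name": "c"}, {"id": 4, "name": "d"}],
]

# Build an index: search key (id) -> pointer (block number, offset)
index = {}
for b, block in enumerate(blocks):
    for off, rec in enumerate(block):
        index[rec["id"]] = (b, off)

def lookup(key):
    # Follow the pointer instead of scanning every block
    b, off = index[key]
    return blocks[b][off]

print(lookup(3)["name"])  # -> c
```

Only the block named by the pointer is touched, which is the access-cost saving the slide describes.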
Why Big data?
These are the factors that led to the emergence of
big data:
1. Increase of storage capacity
2. Increase of processing power
3. Availability of data
4. Derive insights and drive growth
5. To be competitive
Benefits of Big Data Processing
1. Businesses gain intelligence for decision
making.
2. Better customer service.
3. Early identification of risks in products/
services.
4. Improved operational efficiency – e.g., product
recommendation.
5. Detecting fraudulent behavior.
Applications of Bigdata
 Smarter health care – Leverage the health
care system with easy access and efficient
outcomes.
Multi-channel sales and web display
advertisement
Finance
Intelligent traffic management
Manufacturing
Fraud and risk detection
Telecom
Analysis Vs Analytics
Analysis – The process of breaking a complex
topic or substance into smaller parts in order
to gain a better understanding of it.
What happened in the past? It is the process
of examining, transforming, and arranging raw
data in a specific way to generate useful
information from it.
Analytics – A subcomponent of analysis that
involves the use of tools and techniques to
find novel, valuable, and exploitable patterns.
(What will happen in the future?)
Big data analytics
It is the process of
collecting, storing, organizing, and
analyzing large sets of heterogeneous data
to gain insights and discover patterns,
correlations, and other useful information
Faster and better decision making
Enhanced performance, service, or product
Cost-effective and next-generation products
Challenges/Opportunity
Roughly 90% of data is unstructured and only
10% is structured; the challenge (and
opportunity) is to analyze it and extract
meaningful information.
Stages in Big data analytics
I. Identifying problem
II. Designing data requirements
III. Preprocessing data
IV. Visualizing data and
V. Performing analytics over data
Traditional vs Big data analytics
Traditional analytics | Big data analytics
Analytics on well-known, smaller data | Data in a format that is not well understood, largely semi-structured or unstructured
Built on relational data models | Retrieved from various sources, mostly flat and with no relationships
Four types of analytics
1. Descriptive Analytics : What happened?
 It is backward-looking and reveals what has
occurred in the past using present data
(hindsight)
 Two types: 1) Measures of central tendency
(mean, mode, and median)
2) Measures of dispersion (range,
variance, and standard deviation)
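The two groups of descriptive measures above can be computed directly with Python's standard `statistics` module (the data values are a hypothetical sample):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

# Measures of central tendency
mean = statistics.mean(data)      # 5.0
median = statistics.median(data)  # 4.5
mode = statistics.mode(data)      # 4

# Measures of dispersion
rng = max(data) - min(data)       # 7
var = statistics.pvariance(data)  # population variance: 4.0
sd = statistics.pstdev(data)      # population std. deviation: 2.0
```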
2. Diagnostic Analytics : Why did this happen?
What went wrong?
3. Predictive Analytics : What is likely to
happen?
 It predicts what could happen in the future
(insight)
 Several models are used: i) Forecasting, ii)
Simulation, iii) Regression, iv) Classification,
and v) Clustering
4. Prescriptive Analytics – What should we do to
make it happen?
It suggests conclusions or actions that can be
taken based on the analysis
Techniques used are i) Linear programming,
ii) Integer programming, iii) Mixed integer
programming, and iv) Nonlinear programming
Approach in analytics development
 Identify the data sources
Select the right tools and technology to
collect, store, and organize the data
 Understand the domain and process the data
Build a mathematical model for your analytics
Visualize and validate the results
Learn, adapt, and rebuild your analytical
model.
Big data analytics domain
 Web and E-Tailing
 Government
Retail
Telecommunications
Health care
Finance and banking
Big data techniques
There are seven widely used big data analysis
techniques. They are
1. Association rule learning
2. Classification tree analysis
3. Genetic algorithms
4. Machine learning
5. Regression analysis
6. Sentiment analysis
7. Social network analysis
Association rule learning
 Rule based machine learning method for
discovering the interesting relations between
variables in large database.
In order to select interesting rules from the
set of all possible rules, constraints on various
measures of significance and interest are
used. The best known constraints are
minimum threshold on support and
confidence.
Cont.,
Support – An indication of how frequently the
itemset appears in the dataset.
Confidence – An indication of how often the rule
has been found to be true.
Example rule for a supermarket:
{bread, butter} => {milk} means that if butter
and bread are bought, customers also buy
milk.
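The two measures above can be sketched in a few lines of Python over a toy (hypothetical) set of market-basket transactions:

```python
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]

def support(itemset):
    # Fraction of transactions that contain the whole itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # How often the rule lhs => rhs holds when lhs occurs
    return support(lhs | rhs) / support(lhs)

print(support({"bread", "butter"}))               # -> 0.6
print(confidence({"bread", "butter"}, {"milk"}))  # -> 0.666...
```

A rule such as {bread, butter} => {milk} is kept only if both values clear the chosen minimum thresholds.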
Algorithms for association rule
learning
Some of the familiar algorithms used for mining
frequent itemsets are:
1. Apriori algorithm – It uses
a) a breadth-first search strategy to count the
support of itemsets, and
b) a candidate generation function that exploits
the downward closure property of support.
Equivalence class transformation
(ECLAT) algorithm
 A depth-first search algorithm using set
intersection
Suitable for serial and parallel execution, with
locality-enhancing properties
Frequent Pattern (FP) Growth
algorithm
1st phase – The algorithm counts the number of
occurrences of each item in the dataset and
stores them in a header table.
2nd phase – The FP tree structure is built by
inserting instances. Items in each instance
have to be sorted by descending order of their
frequency in the dataset, so that the tree can
be processed quickly.
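The first phase and the frequency-descending sort can be sketched with the standard library (transactions are hypothetical):

```python
from collections import Counter

transactions = [["bread", "milk"], ["bread", "butter"],
                ["milk", "butter", "bread"], ["milk"]]

# 1st phase: count occurrences of each item (the header table)
header = Counter(item for t in transactions for item in t)

# Before building the FP tree, sort each transaction by descending
# item frequency (ties broken alphabetically) so common prefixes
# can be shared along tree paths
ordered = [sorted(t, key=lambda i: (-header[i], i)) for t in transactions]

print(ordered[2])  # -> ['bread', 'milk', 'butter']
```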
Classification Tree Analysis
It is a type of machine learning algorithm used
to classify the class of an object
Identifies a set of characteristics that best
differentiates individuals based on a
categorical outcome variable
Genetic Algorithms
A search-based optimization technique based
on the concepts of natural selection and
genetics
In GAs, we have a pool or a population of
possible solutions to the given problem.
These solutions then undergo recombination
and mutation (like in natural genetics),
producing new children, and the process is
repeated over various generations.
Cont.,
 Each individual is assigned a fitness value
(based on its objective function value), and
fitter individuals are given a higher chance to
mate and yield fitter offspring
Part of the family of evolutionary algorithms
 Three basic operators of a GA: (i)
Reproduction, (ii) Mutation, and (iii)
Crossover
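As a toy illustration of the three operators, the sketch below (problem, encoding, and parameters are all hypothetical) evolves 5-bit integers toward the maximum of a simple objective:

```python
import random

random.seed(0)  # reproducible run

# Toy problem: maximize f(x) = -(x - 7)^2 for integer x in [0, 31]
def fitness(x):
    return -(x - 7) ** 2

def crossover(a, b):
    # Combine low bits of b with high bits of a at a random point
    point = random.randint(1, 4)
    mask = (1 << point) - 1
    return (a & ~mask) | (b & mask)

def mutate(x):
    # Occasionally flip one random bit
    if random.random() < 0.1:
        x ^= 1 << random.randrange(5)
    return x

pop = [random.randrange(32) for _ in range(10)]
for _ in range(50):
    # Reproduction: the fitter half survives and produces children
    pop.sort(key=fitness, reverse=True)
    parents = pop[:5]
    pop = parents + [mutate(crossover(random.choice(parents),
                                      random.choice(parents)))
                     for _ in range(5)]

best = max(pop, key=fitness)
```

Because the fittest individuals are carried over each generation, the best fitness never decreases as the generations proceed.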
Machine Learning
 It is a method of data analysis that automates
analytical model building
It is an application of Artificial Intelligence
based on the idea that machines should be
able to learn and adapt through experience
Within the field of data analytics, machine
learning is a method used to devise complex
models and algorithms that lend themselves
to prediction
Cont.,
• Machine learning is a branch of science that
deals with programming the systems in such a
way that they automatically learn and
improve with experience.
• Learning means recognizing and
understanding the input data and making wise
decisions based on the supplied data.
Cont.,
• It is very difficult to cater to all the decisions
based on all possible inputs. To tackle this
problem, algorithms are developed. These
algorithms build knowledge from specific data
and past experience with the principles of
statistics, probability theory, logic,
combinatorial optimization, search,
reinforcement learning, and control theory.
Learning types
There are several ways to implement machine
learning techniques; the most commonly used
ones are:
Supervised learning
Unsupervised learning
Semi-supervised learning
Supervised learning
• Deals with learning a function from available training
data, with known input and output variables. An
algorithm is used to learn the mapping function from
input to output [Y = f(X)]
• Analyzes the training data and produces an inferred
function, which can be used for mapping new
examples
• Some supervised learning algorithms are neural
networks, Support Vector Machines (SVMs), Naive
Bayes classifiers, random forests, decision trees, and
regression.
• Ex: classifying spam, voice recognition, regression
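A minimal sketch of learning the mapping Y = f(X) from labeled examples, using a 1-nearest-neighbour rule (one of the simplest supervised learners; the points and labels are hypothetical):

```python
# Labeled training data: known inputs X and outputs Y
train = [((1.0, 1.0), "spam"), ((1.2, 0.8), "spam"),
         ((4.0, 4.2), "ham"), ((3.8, 4.0), "ham")]

def predict(x):
    # Map a new input to the label of its nearest labeled example
    def dist(p):
        return sum((a - b) ** 2 for a, b in zip(x, p))
    nearest = min(train, key=lambda pair: dist(pair[0]))
    return nearest[1]

print(predict((1.1, 0.9)))  # -> spam
print(predict((4.1, 4.1)))  # -> ham
```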
Unsupervised Learning
 Makes sense of unlabeled data without any
predefined dataset for its training. There is only
input (X) and no corresponding output variable
 Models the underlying structure or distribution in the
data in order to learn more about the data
 It is most commonly used for clustering similar inputs
into logical groups
 Common approaches: k-means, self-organizing maps,
and hierarchical clustering
 Techniques: Recommendation, Association,
Clustering
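A minimal sketch of the k-means approach named above, on hypothetical one-dimensional, unlabeled inputs:

```python
# Unlabeled inputs: only X, no output variable
data = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]

def kmeans(points, centers, iters=10):
    for _ in range(iters):
        # Step 1: assign each point to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # Step 2: move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

print(kmeans(data, [0.0, 10.0]))  # centers converge near 1.0 and 8.0
```

The algorithm discovers the two logical groups in the data without ever being told any labels.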
Semi Supervised Learning
Problems where you have a large amount of
input data (X) and only some of the data is
labeled
 Example: a photo archive where only some
of the images are labeled and the majority are
unlabeled
Regression Analysis
• It is a set of statistical processes for estimating
the relationships among variables
• Regression analysis helps one understand how
the typical value of the dependent variable (or
'criterion variable') changes when any one of
the independent variables is varied, while the
other independent variables are held fixed.
• Widely used for prediction and forecasting,
where its use has substantial overlap with the
field of machine learning.
Cont.,
• This technique is used for forecasting, time
series modeling, and finding the causal
relationship between variables. For
example, the relationship between rash driving
and the number of road accidents by a driver is
best studied through regression.
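A simple linear regression can be fit by ordinary least squares in a few lines of plain Python (the x/y values are hypothetical):

```python
# Hypothetical data: an independent variable x and a dependent variable y
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares: slope = cov(x, y) / var(x)
slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) \
        / sum((a - mean_x) ** 2 for a in x)
intercept = mean_y - slope * mean_x

def predict(v):
    return intercept + slope * v

print(slope, intercept)  # -> 2.0 0.0
print(predict(6))        # -> 12.0
```

The fitted line shows how the typical value of y changes as x is varied, which is exactly the relationship regression analysis estimates.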
Sentiment Analysis/ Opinion
Mining
Using NLP, statistics, or machine learning
methods to extract, identify, or otherwise
characterize the sentiment content of a text
unit
Sentiment = feelings
Attitudes – Emotions – Opinions
Subjective impressions, not facts
A common use case for this technology is to
discover how people feel about a particular
topic
Automated extraction of subjective content
from digital text and prediction of its
subjectivity as positive, negative, or
neutral
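The simplest sentiment approach is lexicon-based scoring, sketched below with a tiny hypothetical word-score lexicon (real systems use NLP, statistics, or machine learning as noted above):

```python
# Tiny hypothetical sentiment lexicon: word -> polarity score
lexicon = {"good": 1, "great": 2, "bad": -1, "terrible": -2}

def sentiment(text):
    # Sum the polarity of every known word in the text
    score = sum(lexicon.get(w, 0) for w in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("The service was great and the food was good"))  # -> positive
print(sentiment("Terrible delay and bad support"))               # -> negative
```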
Social Network Analysis
• Process of investigating social structures
through the use of networks and graph theory
• It is the mapping and measuring of
relationships and flows between people,
groups, organizations, computers, URLs, and
other connected information/knowledge
entities.
• The nodes in the network are the people and
groups while the links show relationships or
flows between the nodes.
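The node-and-link view above can be quantified; for instance, degree centrality counts each node's direct relationships (the network below is hypothetical):

```python
# Hypothetical friendship network: nodes are people, edges are ties
edges = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "dan")]

# Degree centrality: number of direct relationships per node
degree = {}
for a, b in edges:
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1

# The node with the highest degree is the most connected
print(max(degree, key=degree.get))  # -> cat
```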
Two types of SNA
• Egocentric Analysis
– Focuses on the individual and studies an
individual's personal network and its effects
on that individual
• Sociocentric Analysis
– Focuses on large groups of people – Quantifies
relationships between people in a group
– Studies patterns of interactions and how these
patterns affect the group as a whole
Egocentric Analysis
• Examines local network structure
• Describes the network around a
single node (the ego)
– Number of other nodes (alters)
– Types of connections
• Extracts network features
• Uses these factors to predict health and
longevity, economic success, levels of
depression, access to new opportunities
Sociocentric Analysis
• Quantifies relationships and interactions
between a group of people
• Studies how interactions, patterns of
interactions, and network structure affect
– Concentration of power and resources
– Spread of disease
– Access to new ideas
– Group dynamics
Big data analytics tools and
technologies
Hadoop = HDFS + MapReduce
Hive, HBase, Flume, Oozie, Pig, Sqoop,
Kafka, Storm, RHadoop, Chukwa
Future role of data
Now: Data -> Decision Support System.
Future: Data -> Digital Nervous System (DNS),
a loop of Sense -> Interpret -> Decide -> Act.
History of Hadoop
1996–2000: The big data problem is faced by all search engines (Yahoo).
2003–04: Google publishes the Google File System and MapReduce papers.
2005–06: Hadoop spawns at Apache (Doug & Mike).
2010: Cloudera.
2013: Next-generation Hadoop / YARN & MapReduce 2.
Hadoop
It is an open source framework used for
distributed storage and processing of big data
datasets using the MapReduce programming
model
• The core components are:
i) Hadoop Common – contains libraries and
utilities needed by other Hadoop modules;
ii) Hadoop Distributed File System (HDFS) –
stores data on commodity machines,
providing very high aggregate bandwidth
across the cluster;
iii) Hadoop YARN – a platform responsible for
managing computing resources in clusters and
using them for scheduling users' applications;
iv) Hadoop MapReduce – an implementation of
the MapReduce programming model for
large-scale data processing.
Distributed Computing
The use of commodity hardware and open source
software (scaling out the number of
processors), as opposed to expensive proprietary
software on expensive server hardware.
Major Components of Hadoop
Framework
1. HDFS (Hadoop Distributed File System):
Inspired by the Google File System
2. MapReduce: Inspired by Google
MapReduce
* Both work on a cluster of systems with a
hierarchical architecture
HDFS
A file is split into blocks (A, B, C), and each
block is replicated across several Data Nodes.
Master Node (Name Node): monitors how the data
is distributed among the data nodes.
Data Nodes: store the data blocks.
* Both are Hadoop daemons – in practice, Java
programs running on specific machines.
Map Reduce
It is divided into 2 phases:
1. Map – The mapper code is distributed among
machines, and each works on the data its
system holds (data locality). The locally
computed results are aggregated and sent to
the reducer.
2. Reduce – The reducer algorithm is applied to
the aggregated data to produce the final result.
Programmers need to write only the map
logic and the reduce logic; the distribution
of map code to the map machines is handled by
Hadoop.
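The two phases can be simulated in plain Python with the classic word-count example (the input lines are hypothetical; Hadoop itself handles the distribution that this single-process sketch only imitates):

```python
from collections import defaultdict
from itertools import chain

lines = ["big data big", "data hadoop"]

# Map phase: each mapper emits (word, 1) pairs for its local split
def mapper(line):
    return [(w, 1) for w in line.split()]

mapped = list(chain.from_iterable(mapper(l) for l in lines))

# Shuffle: group values by key, as the framework does between phases
groups = defaultdict(list)
for k, v in mapped:
    groups[k].append(v)

# Reduce phase: aggregate the grouped values into the final result
counts = {k: sum(vs) for k, vs in groups.items()}
print(counts)  # -> {'big': 2, 'data': 2, 'hadoop': 1}
```

The programmer supplies only `mapper` and the reduce aggregation; everything between them is the framework's job.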
Hadoop Ecosystem
Core: HDFS and MapReduce
On top: HBase, Pig (from Yahoo), Hive (from
Facebook), Sqoop/Flume
Pig
 A tool that uses scripting statements to
process data
A simple data-flow language that saves
development time and effort
 It was designed for data scientists with
limited programming skills
It was developed by Yahoo
Hive
 Provides an SQL-like language that
runs on top of MapReduce
Hive was developed by Facebook for data scientists
with limited programming skills.
The code written in Pig/Hive gets converted
into MapReduce jobs that run on HDFS
Sqoop/ Flume
In order to facilitate the movement of data into
Hadoop, Sqoop and Flume are used.
Sqoop is used to move data from
relational databases, and Flume is used to
ingest data as it is created by external
sources.
HBase is a tool that provides features like a
real-time database on top of data in HDFS.

CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptx
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
BSNL Internship Training presentation.pptx
BSNL Internship Training presentation.pptxBSNL Internship Training presentation.pptx
BSNL Internship Training presentation.pptx
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptx
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substation
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - Guide
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Course
 

Big data

  • 1. BIG DATA Prepared by Bhuvaneshwari.P, Research Scholar, VIT University, Vellore
  • 2. Introduction DEFINITION Big data is the collection of large and complex datasets that are difficult to process using traditional database tools or data processing application software. Evolution of data scale: Mainframe (Kilobytes) -> Client/Server (Megabytes) -> The Internet (Gigabytes) -> [Big data] Mobile, social media… (Zettabytes)
  • 3. Characteristics of Big data The characteristics of big data are specified with the 5 V's: 1. Volume – The vast amount of data generated every second. [Kilobytes -> Mega -> Giga -> Tera -> Peta -> Exa -> Zetta -> Yotta] 2. Variety – The different kinds of data generated from different sources. 3. Velocity – The speed at which data is generated, processed, and moved around. 4. Value – Deriving the correct meaning from the available data. 5. Veracity – The uncertainty and inconsistencies in the data.
  • 4. Categories of Big data Big data is categorized into three forms. 1. Structured – Data which can be stored and processed in a predefined format. Ex: tables, RDBMS data. 2. Unstructured – Any data without a known structure or form. Ex: output returned by a Google search, audio, video, images. 3. Semi-structured – Data that contains both forms. Ex: JSON, CSV, XML, email. Data types -> emails, text messages, photos, videos, logs, documents, transactions, click trails, public records, etc.
  • 5. Examples of big data Some examples of big data: 1. Social media: 500+ terabytes of data are generated on Facebook every day, 100,000+ tweets are created every 60 seconds, and 300 hours of video are uploaded to YouTube per minute. 2. Airlines: A single jet engine produces 10+ terabytes of data in 30 minutes of flight time.
  • 6. Cont., 3. Stock Exchange – The New York Stock Exchange generates about one terabyte of new trade data every day. 4. Mobile Phones – Every 60 seconds, 698,445+ Google searches, 11,000,000+ instant messages, and 168,000,000+ emails are generated by users. 5. Walmart handles more than 1 million customer transactions every hour.
  • 7. Sources of big data 1. Activity data – Basic activities such as searches are stored by web browsers, phone usage is stored by mobile phones, credit card companies store where customers buy, and shops store what they buy. 2. Conversational data – Conversations in emails and on social media sites like Facebook, Twitter, and so on.
  • 8. Cont., 3. Photo and video data – Pictures and videos taken with mobile phones, digital cameras, and CCTV are uploaded heavily to YouTube and social media sites every second. 4. Sensor data – The sensors embedded in devices produce huge amounts of data. Ex: GPS provides the direction and speed of a vehicle. 5. IoT data – Smart TVs, smart watches, smart fridges, etc. Ex: traffic sensors send data to the alarm clock on a smart watch.
  • 9. Typical Classification I. Internal data – Supports daily business operations, such as organizational or enterprise data (structured). Ex: customer data, sales data, ERP, CRM, etc. II. External data – Analyzed for competitors, the market environment, and technology, such as social data (unstructured). Ex: the Internet, government, business partners, syndicated data suppliers, etc.
  • 10. Big data storage Big data storage is concerned with storing and managing data in a scalable way, satisfying the needs of applications that require access to the data. Some of the big data storage technologies are: 1. Distributed file system – Stores large amounts of unstructured data reliably on commodity hardware.
  • 11. Cont., The Hadoop Distributed File System (HDFS) is an integral part of the Hadoop framework; it is designed for large data files and is well suited to quickly ingesting data and bulk processing. 2. NoSQL database – A database that stores and retrieves data modeled in means other than tabular relations; it lacks ACID transactions. Supports both structured and unstructured data.
  • 12. The data structures used are key-value, wide column, graph, or document. Less functionality, more performance. It focuses on scalability, performance, and high availability. Evolution: flat files (no standard implementation) -> RDBMS (could not handle big data) -> NoSQL.
  • 13. 3. NewSQL database – Provides the same scalable performance as NoSQL systems for Online Transaction Processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system. 4. Cloud storage – A service model in which data is maintained, managed, and backed up remotely and made available to users over the Internet.
  • 14. Cont., Eliminates the acquisition and management costs of buying and maintaining your own storage infrastructure, increases agility, provides global scale, and delivers "anywhere, anytime" access to data. Users generally pay for cloud data storage on a per-consumption basis ("pay as you use").
  • 15. Data intelligence Data intelligence - Analysis of various forms of data in such a way that it can be used by companies to expand their services or investments Transforming data into information, information into knowledge, and knowledge into value
  • 16. Data integration and serialization Data integration – Combining data residing in different sources and providing users with a unified view of it. Data serialization – Converting structured data into a format that allows it to be shared or stored in such a way that its original structure can be recovered.
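The serialization round trip described above can be sketched with Python's standard json module (a minimal illustration; the record contents are invented):

```python
import json

record = {"id": 42, "name": "Ada", "tags": ["analytics", "storage"]}

# Serialize: structured data -> a portable text format
payload = json.dumps(record)

# Deserialize: the original structure is recovered intact
restored = json.loads(payload)
```

Any format with the same property (XML, Avro, Protocol Buffers) would serve; JSON is used here only because it ships with the standard library.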
  • 17. Data monitoring Data monitoring allows an organization to proactively maintain a high, consistent standard of data quality. • By checking data routinely as it is stored within applications, organizations can avoid the resource-intensive pre-processing of data before it is moved. • With data monitoring, data quality is checked at creation time rather than before a move.
  • 18. Data indexing An index is a data structure that is added to a file to provide faster access to the data. • It reduces the number of blocks that the DBMS has to check. • It contains a search key and a pointer. Search key – an attribute or set of attributes used to look up the records in a file. • Pointer – contains the address of where the data is stored in memory.
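The search-key/pointer idea can be sketched in a few lines of Python (the records and field names are hypothetical; a list position stands in for a block address):

```python
# Hypothetical "file" of records; the index maps a search key
# (the id field) to the record's position, so a lookup touches
# one entry instead of scanning every record.
records = [
    {"id": 7, "name": "Ria"},
    {"id": 3, "name": "Sam"},
    {"id": 9, "name": "Tom"},
]

# Build the index: search key -> pointer (here, a list position)
index = {rec["id"]: pos for pos, rec in enumerate(records)}

def lookup(key):
    """Indexed access: one dictionary probe instead of a full scan."""
    pos = index.get(key)
    return records[pos] if pos is not None else None
```

A real DBMS index (e.g. a B-tree) holds disk block addresses rather than list positions, but the lookup pattern is the same.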
  • 19. Why Big data? These are the factors that lead to the emergence of big data: 1. Increase of storage capacity 2. Increase of processing power 3. Availability of data 4. Derive insights and drive growth 5. To be competitive
  • 20. Benefits of Big Data Processing 1. Businesses gain intelligence for decision making. 2. Better customer service. 3. Early identification of risks in products/services. 4. Improved operational efficiency – product recommendation. 5. Detecting fraudulent behavior.
  • 21. Applications of Big data Smarter health care – leverage the health care system with easy access and efficient outcomes. Multi-channel sales and web display advertising. Finance intelligence. Traffic management. Manufacturing. Fraud and risk detection. Telecom.
  • 22. Analysis Vs Analytics Analysis – The process of breaking a complex topic or substance into smaller parts in order to gain a better understanding of it (What happened in the past?). It is the process of examining, transforming and arranging raw data in a specific way to generate useful information from it. Analytics – A sub-component of analysis that involves the use of tools and techniques to find novel, valuable and exploitable patterns (What will happen in the future?).
  • 23. Big data analytics It is the process of collecting, storing, organizing and analyzing large sets of heterogeneous data to gain insights and discover patterns, correlations and other useful information. Faster and better decision making. Enhanced performance, services or products. Cost-effective and next-generation products.
  • 24. Challenges/Opportunity Unstructured Data (90%) Structured Data (10%) To analyze & extract meaningful information
  • 25. Stages in Big data analytics I. Identifying the problem II. Designing data requirements III. Preprocessing data IV. Visualizing data and V. Performing analytics over data
  • 26. Traditional Vs Big data analytics Traditional analytics: analytics on well-known and smaller data, built on relational data models. Big data analytics: data in not-well-understood formats, largely semi-structured or unstructured, retrieved from various sources, almost flat with no relationships in nature.
  • 27. Four types of analytics 1. Descriptive Analytics: What happened? It is backward looking and reveals what has occurred in the past from the present data (hindsight). Two types: 1) Measures of central tendency (mean, mode, and median) 2) Measures of dispersion (range, variance, and standard deviation)
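The two groups of descriptive measures listed above can be computed directly with Python's standard statistics module (the sales figures are hypothetical):

```python
import statistics

sales = [4, 8, 6, 5, 3, 8]  # hypothetical daily sales figures

# Measures of central tendency
mean = statistics.mean(sales)
median = statistics.median(sales)   # 5.5 for this data
mode = statistics.mode(sales)       # 8 appears most often

# Measures of dispersion
spread = max(sales) - min(sales)    # the range
variance = statistics.pvariance(sales)
std_dev = statistics.pstdev(sales)
```

`pvariance`/`pstdev` treat the list as the whole population; use `variance`/`stdev` for a sample.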
  • 28. 2. Diagnostic Analytics: Why did this happen? What went wrong? 3. Predictive Analytics: What is likely to happen? It predicts what could happen in the future (insight). Several models used are i) forecasting, ii) simulation, iii) regression, iv) classification, and v) clustering.
  • 29. 4. Prescriptive Analytics – What should we do to make it happen? It suggests conclusions or actions that can be taken based on the analysis. Techniques used are i) linear programming, ii) integer programming, iii) mixed integer programming, and iv) nonlinear programming.
  • 30. Approach in analytics development Identify the data source. Select the right tools and technology to collect, store and organize the data. Understand the domain and process the data. Build a mathematical model for your analytics. Visualize and validate the result. Learn, adapt and rebuild your analytical model.
  • 31. Big data analytics domain Web and e-tailing. Government. Retail. Telecommunication. Health care. Finance and banking.
  • 32. Big data techniques There are seven widely used big data analysis techniques. They are 1. Association rule learning 2. Classification tree analysis 3. Genetic algorithms 4. Machine learning 5. Regression analysis 6. Sentiment analysis 7. Social network analysis
  • 33. Association rule learning A rule-based machine learning method for discovering interesting relations between variables in large databases. In order to select interesting rules from the set of all possible rules, constraints on various measures of significance and interest are used. The best-known constraints are minimum thresholds on support and confidence.
  • 34. Cont., Support – an indication of how frequently the item set appears in the data set. Confidence – an indication of how often the rule has been found to be true. Example rule for a supermarket: {bread, butter} => {milk} means that if bread and butter are bought, customers also buy milk.
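Support and confidence for the supermarket rule can be sketched over a handful of toy transactions (the shopping data is invented for illustration):

```python
# Each transaction is the set of items bought together (toy data).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the set."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """How often antecedent => consequent holds when the antecedent appears."""
    return support(antecedent | consequent) / support(antecedent)
```

Here support({bread, butter}) is 3/5 and the rule {bread, butter} => {milk} has confidence 2/3: of the three baskets with bread and butter, two also contain milk.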
  • 35. Algorithms for association rule learning Some of the familiar algorithms used for mining frequent item sets are: 1. Apriori algorithm – It uses a) a breadth-first search strategy to count the support of item sets and b) a candidate generation function which exploits the downward closure property of support.
  • 36. Equivalence class transformation (ECLAT) algorithm A depth-first search algorithm using set intersection. Suitable for serial and parallel execution, with locality-enhancing properties.
  • 37. Frequent Pattern (FP) Growth algorithm 1st phase – the algorithm counts the number of occurrences of items in the dataset and stores them in a header table. 2nd phase – the FP-tree structure is built by inserting instances. Items in each instance have to be sorted in descending order of their frequency in the dataset, so that the tree can be processed quickly.
  • 38. Classification Tree Analysis A type of machine learning algorithm used to predict the class of an object. It identifies the set of characteristics that best differentiates individuals based on a categorical outcome variable.
  • 39. Genetic Algorithms Search based optimization technique based on the concepts of natural selection and genetics In GAs, we have a pool or a population of possible solutions to the given problem. These solutions then undergo recombination and mutation (like in natural genetics), producing new children, and the process is repeated over various generations.
  • 40. Cont.,  Each individual is assigned a fitness value (based on its objective function value) and the fitter individuals are given a higher chance to mate and yield more “fitter” individuals Part of evolutionary algorithms  Three basic operators of GA: (i) Reproduction, (ii) Mutation, and (iii) Crossover
  • 41. Machine Learning  It is a method of data analysis that automates analytical model building It is an application of Artificial Intelligence based on the idea that machines should be able to learn and adapt through experience Within the field of data analytics, machine learning is a method used to devise complex models and algorithms that lend themselves to prediction
  • 42. Cont., • Machine learning is a branch of science that deals with programming the systems in such a way that they automatically learn and improve with experience. • Learning means recognizing and understanding the input data and making wise decisions based on the supplied data.
  • 43. Cont., • It is very difficult to cater to all the decisions based on all possible inputs. To tackle this problem, algorithms are developed. These algorithms build knowledge from specific data and past experience with the principles of statistics, probability theory, logic, combinatorial optimization, search, reinforcement learning, and control theory.
  • 44. Learning types There are several ways to implement machine learning techniques; the most commonly used are: Supervised learning, Unsupervised learning, Semi-supervised learning.
  • 45. Supervised learning • Deals with learning a function from available training data. Known input and output variables; an algorithm learns the mapping function from input to output [Y=f(X)]. • Analyzes the training data and produces an inferred function, which can be used for mapping new examples. • Some supervised learning algorithms are neural networks, Support Vector Machines (SVMs), Naive Bayes classifiers, random forests, decision trees, and regression. • Ex: classifying spam, voice recognition, regression.
  • 46. Unsupervised Learning Makes sense of unlabeled data without any predefined dataset for its training. Only input (X) and no corresponding output variable. Models the underlying structure or distribution in the data in order to learn more about the data. It is most commonly used for clustering similar inputs into logical groups. Common approaches: k-means, self-organizing maps and hierarchical clustering. Techniques: recommendation, association, clustering.
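Of the approaches named above, k-means is the simplest to sketch. A minimal one-dimensional version in plain Python (the data points and iteration count are invented for illustration):

```python
import random

random.seed(1)  # fixed seed so centroid initialization is reproducible

def kmeans(points, k, iters=20):
    """Minimal 1-D k-means: assign each point to the nearest centroid,
    then move each centroid to the mean of its cluster."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groups, around 2 and around 20
data = [1.0, 2.0, 3.0, 19.0, 20.0, 21.0]
centers = kmeans(data, k=2)
```

No labels are supplied anywhere: the two groups emerge purely from the structure of the data, which is the defining trait of unsupervised learning.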
  • 47. Semi-Supervised Learning Problems where you have a large amount of input data (X) and only some of the data is labeled. Example: a photo archive where only some of the images are labeled and the majority are unlabeled.
  • 48. Regression Analysis • It is a set of statistical processes for estimating the relationships among variables • Regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed. • Widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning.
  • 49. Cont., • This technique is used for forecasting, time series modeling and finding the causal effect relationship between the variables. For example, relationship between rash driving and number of road accidents by a driver is best studied through regression.
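The one-variable case described above (rash driving vs. road accidents) can be fit with ordinary least squares in a few lines (the incident/accident numbers are made up for illustration):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y over variance of x
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx  # intercept passes through the means
    return a, b

# Hypothetical data: rash-driving incidents vs. road accidents
incidents = [1, 2, 3, 4, 5]
accidents = [2, 4, 6, 8, 10]
a, b = fit_line(incidents, accidents)
predicted = a + b * 6   # forecast for 6 incidents
```

For this toy data the fit is exact (slope 2, intercept 0), which illustrates the forecasting use: the fitted line extrapolates to unseen values of the independent variable.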
  • 50. Sentiment Analysis/ Opinion Mining Using NLP, statistics, or machine learning methods to extract, identify, or otherwise characterize the sentiment content of a text unit Sentiment = feelings Attitudes – Emotions – Opinions Subjective impressions, not facts
  • 51. A common use case for this technology is to discover how people feel about a particular topic. Automated extraction of subjective content from digital text and prediction of its subjectivity as positive, negative or neutral.
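The positive/negative/neutral prediction can be sketched with a deliberately tiny lexicon-based scorer (the word list is an invented stand-in; real systems use large lexicons or trained NLP models):

```python
# Tiny hypothetical sentiment lexicon: word -> polarity
LEXICON = {"good": 1, "great": 1, "love": 1,
           "bad": -1, "awful": -1, "hate": -1}

def sentiment(text):
    """Sum the polarities of known words and map the total to a label."""
    score = sum(LEXICON.get(w, 0) for w in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A scorer this naive misses negation ("not good") and sarcasm, which is exactly why production sentiment analysis leans on machine learning rather than word counting.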
  • 52. Social Network Analysis • Process of investigating social structures through the use of networks and graph theory • It is the mapping and measuring of relationships and flows between people, groups, organizations, computers, URLs, and other connected information/knowledge entities. • The nodes in the network are the people and groups while the links show relationships or flows between the nodes.
  • 53. Two types of SNA • Egocentric Analysis – Focuses on the individual and studies an individual's personal network and its effects on that individual • Sociocentric Analysis – Focuses on large groups of people – Quantifies relationships between people in a group – Studies patterns of interactions and how these patterns affect the group as a whole
  • 54. Egocentric Analysis • Examines local network structure • Describes the network around a single node (the ego) – Number of other nodes (alters) – Types of connections • Extracts network features • Uses these factors to predict health and longevity, economic success, levels of depression, access to new opportunities
  • 55. Sociocentric Analysis • Quantifies relationships and interactions between a group of people • Studies how interactions, patterns of interactions, and network structure affect – Concentration of power and resources – Spread of disease – Access to new ideas – Group dynamics
  • 56. Big data analytics tools and technologies Hadoop = HDFS + MapReduce. Ecosystem tools: Hive, HBase, Flume, Oozie, Pig, Sqoop, Kafka, Storm, RHadoop, Chukwa.
  • 57. Future role of data Now: Decision Support System. Future: Digital Nervous System (DNS) – a continuous loop over data: Sense -> Interpret -> Decide -> Act.
  • 58. History of Hadoop 1996-2000: the big data problem faced by all search engines (Yahoo). 2003-04: Google publishes the Google File System and MapReduce papers. 2005-06: Hadoop spawns at Apache (Doug & Mike). 2010: Cloudera. 2013: next-generation Hadoop / YARN & MapReduce 2.
  • 59. Hadoop It is an open source framework used for distributed storage and processing of big data datasets using the MapReduce programming model. • The core components are: i) Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
  • 60. • Hadoop Distributed File System (HDFS) – Stores data on commodity machines, providing very high aggregate bandwidth across the cluster • Hadoop YARN – a platform responsible for managing computing resources in clusters and using them for scheduling users' applications • Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.
  • 61. Distributed Computing Use of commodity hardware and open source software (increasing the number of processors) against expensive proprietary software on expensive hardware (servers).
  • 62. Major Components of Hadoop Framework 1. HDFS (Hadoop Distributed File System): inspired by the Google File System. 2. MapReduce: inspired by Google MapReduce. * Both work on clusters of systems in a hierarchical architecture.
  • 64. Master Node: It monitors the data distributed among the data nodes. Data Node: Stores the data blocks. * Both are Hadoop daemons – actually Java programs that run on specific machines.
  • 65. Map Reduce It is divided into 2 phases: 1. Map – Mapper code is distributed among machines and works on the data which each machine holds (data locality). The locally computed results are aggregated and sent to the reducer.
  • 66. 2. Reduce – Reducer algorithms are applied to the global data to produce the final result. Programmers need to write only the map logic and the reduce logic; the correct distribution of map code to the map machines is handled by Hadoop.
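The two phases can be sketched as a single-process word count (a toy stand-in for a cluster; in real Hadoop each mapper runs on the machine that holds its data chunk, and the framework handles the shuffle between phases):

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    """Mapper: runs where the data lives, emits (word, 1) pairs."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Reducer: aggregates the locally emitted pairs into a final result."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Each chunk would sit on a different machine in a real cluster
chunks = ["big data big", "data big"]
mapped = chain.from_iterable(map_phase(c) for c in chunks)
result = reduce_phase(mapped)
```

As the slide says, the programmer supplies only the map and reduce logic; everything else here (splitting into chunks, collecting the pairs) is what Hadoop does for you at scale.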
  • 67. Hadoop Ecosystem HDFS, MapReduce, HBase, Pig (Yahoo), Hive (Facebook), Sqoop/Flume.
  • 68. Pig It is a tool that uses scripting statements to process data. A simple data flow language which saves development time and effort. It was designed for data scientists who have fewer programming skills. It was developed by Yahoo.
  • 69. Hive It provides an SQL-like language that runs on top of MapReduce. Hive was developed by Facebook for data scientists who have fewer programming skills. Code written in Pig/Hive gets converted into MapReduce jobs and runs on HDFS.
  • 70. Sqoop/Flume In order to facilitate the movement of data into Hadoop, Sqoop and Flume are used. Sqoop is used to move data from relational databases, and Flume is used to ingest data as it is created by external sources. HBase is a tool which provides features like a real-time database, receiving data from HDFS.
