BIG DATA
Prepared by
Bhuvaneshwari.P
Research Scholar, VIT University,
Vellore
Introduction
DEFINITION
Big data is defined as the collection of large
and complex datasets that are difficult to
process using traditional database tools or
data processing application software.
Evolution of data scale:
Mainframe (Kilobytes) -> Client/Server (Megabytes) -> The Internet (Gigabytes) -> Mobile, Social media… [Big data] (Zettabytes)
Characteristics of Big data
The characteristics of big data are specified with 5 V's:
1. Volume – The vast amount of data generated
every second. [Kilobytes -> Mega -> Giga -> Tera -> Peta
-> Exa -> Zetta -> Yotta]
2. Variety – The different kinds of data
generated from different sources.
3. Velocity – The speed at which data is generated,
processed, and moved around.
4. Value – Bringing out the correct meaning
from the available data.
5. Veracity – The uncertainty and
inconsistencies in the data.
Categories of Big data
Big data is categorized into three forms.
1. Structured – Data that can be stored and
processed in a predefined format. Ex: Tables,
RDBMS data.
2. Unstructured – Any data without a known structure
or form. Ex: Output returned by a Google
search, audio, video, images.
3. Semi-structured – Data that contains both
of the above forms. Ex: JSON, CSV, XML, email.
Data types -> emails, text messages, photos,
videos, logs, documents, transactions, click trails,
public records, etc.
Examples of big data
Some examples of big data
1. Social media: 500+ terabytes of data are
generated on Facebook every day, 100,000+
tweets are posted every 60 seconds, and 300
hours of video are uploaded to YouTube per
minute.
2. Airlines: A single jet engine produces
10+ terabytes of data in 30 minutes of flight
time.
Cont..,
3. Stock Exchange – The New York Stock
Exchange generates about one terabyte of
new trade data every day.
4. Mobile Phones – Every 60 seconds, users
generate 698,445+ Google searches,
11,000,000+ instant messages, and
168,000,000 emails.
5. Walmart handles more than one million
customer transactions every hour.
Sources of big data
1. Activity data – Basic activities such as searches
are stored by web browsers, phone usage is
stored by mobile phones, credit card
companies store where customers buy, and
shops store what they buy.
2. Conversational data – Conversations in emails
and on social media sites such as Facebook,
Twitter, and so on.
Cont.,
3. Photo and video data – Pictures and
videos taken with mobile phones, digital
cameras, and CCTV are uploaded heavily to
YouTube and social media sites every second.
4. Sensor data – The sensors embedded in
devices produce huge amounts of data. Ex: GPS
provides the direction and speed of a vehicle.
5. IoT data – Smart TVs, smart watches, smart fridges,
etc. Ex: Traffic sensors send data to the alarm
clock on a smart watch.
Typical Classification
I. Internal data – Supports daily business
operations, such as organizational or
enterprise data (structured). Ex: Customer
data, sales data, ERP, CRM, etc.
II. External data – Analyzed for
competitors, the market environment, and
technology, such as social data
(unstructured). Ex: Internet, government,
business partners, syndicate data suppliers,
etc.
Big data storage
Big data storage is concerned with storing and
managing data in a scalable way, satisfying
the needs of applications that require access
to the data.
Some of the big data storage technologies are:
1. Distributed file system – Stores large amounts
of unstructured data reliably on
commodity hardware.
Cont.,
The Hadoop Distributed File System (HDFS) is an
integral part of the Hadoop framework; it is designed
for large data files and is well suited for quickly
ingesting data and bulk processing.
2. NoSQL database – A database that stores and
retrieves data modeled in means other than
tabular relations, and which lacks ACID
transactions.
Supports both structured and unstructured data.
The data structures used are key-value, wide
column, graph, or document.
Less functionality, more performance.
It focuses on scalability, performance, and high
availability.
Evolution of storage: Flat files (no standard
implementation) -> RDBMS (could not handle big
data) -> NoSQL.
3. NewSQL database - Provide the same
scalable performance of NoSQL systems
for Online Transaction Processing (OLTP) read-
write workloads while still maintaining
the ACID guarantees of a traditional database
system
4. Cloud storage – Service model in which data
is maintained, managed, backed up remotely
and made available to users over the Internet
Cont.,
Eliminates the acquisition and management
costs of buying and maintaining your own
storage infrastructure, increases agility,
provides global scale, and delivers "anywhere,
anytime" access to data
Users generally pay for their cloud data
storage on a per-consumption, "pay as per
use" basis.
Data intelligence
Data intelligence - Analysis of various forms of
data in such a way that it can be used by
companies to expand their services or
investments
Transforming data into information,
information into knowledge, and knowledge
into value
Data integration and serialization
Data integration- Combining data residing in
different sources and providing users with a
unified view of them
Data serialization – The concept of
converting structured data into a format that
allows it to be shared or stored in such a way
that its original structure can be recovered.
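The round trip described above can be sketched with Python's standard `json` module (the record contents are hypothetical):

```python
import json

# A structured record (hypothetical example data)
record = {"user": "alice", "purchases": [3, 5], "active": True}

# Serialize: convert the structure to a string that can be shared or stored
encoded = json.dumps(record)

# Deserialize: the original structure is recovered
decoded = json.loads(encoded)
assert decoded == record
```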
Data monitoring
Data monitoring- It allows an organization to
proactively maintain a high, consistent
standard of data quality
• By checking data routinely as it is stored
within applications, organizations can avoid
the resource-intensive pre-processing of data
before it is moved
• With data monitoring, data quality is checked
at creation time rather than before a move.
Data indexing
Data indexing – An index is a data structure that is
added to a file to provide faster access to the
data.
• It reduces the number of blocks that the
DBMS has to check.
• It contains a search key and a pointer. Search
key – an attribute or set of attributes that is
used to look up the records in a file.
• Pointer – contains the address of where the
data is stored in memory.
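A minimal sketch of the search-key-to-pointer idea above, using an in-memory list of "blocks" as a stand-in for a file (all names and data are hypothetical):

```python
# Simulated file: records stored in blocks
blocks = [
    [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
    [{"id": 3, "name": "c"}, {"id": 4, "name": "d"}],
]

# Build an index: search key (id) -> pointer (block number, offset)
index = {}
for b, block in enumerate(blocks):
    for off, rec in enumerate(block):
        index[rec["id"]] = (b, off)

def lookup(key):
    # Follow the pointer instead of scanning every block
    b, off = index[key]
    return blocks[b][off]

print(lookup(3)["name"])  # -> c
```

Only the block named by the pointer is touched, which is the access-cost saving the slide describes.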
Why Big data?
These are the factors that led to the emergence of
big data:
1. Increase of storage capacity
2. Increase of processing power
3. Availability of data
4. Derive insights and drive growth
5. To be competitive
Benefits of Big Data Processing
1. Businesses gain intelligence for decision
making.
2. Better customer service.
3. Early identification of risks in products/
services.
4. Improved operational efficiency – e.g., product
recommendation.
5. Detecting fraudulent behavior.
Applications of Bigdata
 Smarter health care – Leverage the health
care system with easy access and efficient
outcomes.
Multi-channel sales and web display
advertisement
Finance
Intelligent traffic management
Manufacturing
Fraud and risk detection
Telecom
Analysis Vs Analytics
Analysis – The process of breaking a complex
topic or substance into smaller parts in order
to gain a better understanding of it.
What happened in the past? It is the process
of examining, transforming, and arranging raw
data in a specific way to generate useful
information from it.
Analytics – A subcomponent of analysis that
involves the use of tools and techniques to
find novel, valuable, and exploitable patterns.
(What will happen in the future?)
Big data analytics
It is the process of
collecting, storing, organizing, and
analyzing large sets of heterogeneous data
to gain insights and discover patterns,
correlations, and other useful information
Faster and better decision making
Enhanced performance, service, or product
Cost-effective and next-generation products
Challenges/Opportunity
Roughly 90% of data is unstructured and only
10% is structured; the challenge (and
opportunity) is to analyze it and extract
meaningful information.
Stages in Big data analytics
I. Identifying problem
II. Designing data requirements
III. Preprocessing data
IV. Visualizing data and
V. Performing analytics over data
Traditional vs Big data analytics
Traditional analytics | Big data analytics
Analytics on well-known, smaller data | Data in a format that is not well understood, largely semi-structured or unstructured
Built on relational data models | Retrieved from various sources, mostly flat and with no relationships
Four types of analytics
1. Descriptive Analytics : What happened?
 It is backward-looking and reveals what has
occurred in the past using present data
(hindsight)
 Two types: 1) Measures of central tendency
(mean, mode, and median)
2) Measures of dispersion (range,
variance, and standard deviation)
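The two groups of descriptive measures above can be computed directly with Python's standard `statistics` module (the data values are a hypothetical sample):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

# Measures of central tendency
mean = statistics.mean(data)      # 5.0
median = statistics.median(data)  # 4.5
mode = statistics.mode(data)      # 4

# Measures of dispersion
rng = max(data) - min(data)       # 7
var = statistics.pvariance(data)  # population variance: 4.0
sd = statistics.pstdev(data)      # population std. deviation: 2.0
```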
2. Diagnostic Analytics : Why did this happen?
What went wrong?
3. Predictive Analytics : What is likely to
happen?
 It predicts what could happen in the future
(insight)
 Several models are used: i) Forecasting, ii)
Simulation, iii) Regression, iv) Classification,
and v) Clustering
4. Prescriptive Analytics – What should we do to
make it happen?
It suggests conclusions or actions that can be
taken based on the analysis
Techniques used are i) Linear programming,
ii) Integer programming, iii) Mixed integer
programming, and iv) Nonlinear programming
Approach in analytics development
 Identify the data sources
Select the right tools and technology to
collect, store, and organize the data
 Understand the domain and process the data
Build a mathematical model for your analytics
Visualize and validate the results
Learn, adapt, and rebuild your analytical
model.
Big data analytics domain
 Web and E-Tailing
 Government
Retail
Telecommunications
Health care
Finance and banking
Big data techniques
There are seven widely used big data analysis
techniques. They are
1. Association rule learning
2. Classification tree analysis
3. Genetic algorithms
4. Machine learning
5. Regression analysis
6. Sentiment analysis
7. Social network analysis
Association rule learning
 Rule based machine learning method for
discovering the interesting relations between
variables in large database.
In order to select interesting rules from the
set of all possible rules, constraints on various
measures of significance and interest are
used. The best known constraints are
minimum threshold on support and
confidence.
Cont.,
Support – An indication of how frequently the
itemset appears in the dataset.
Confidence – An indication of how often the rule
has been found to be true.
Example rule for a supermarket:
{bread, butter} => {milk} means that if butter
and bread are bought, customers also buy
milk.
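The two measures above can be sketched in a few lines of Python over a toy (hypothetical) set of market-basket transactions:

```python
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]

def support(itemset):
    # Fraction of transactions that contain the whole itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # How often the rule lhs => rhs holds when lhs occurs
    return support(lhs | rhs) / support(lhs)

print(support({"bread", "butter"}))               # -> 0.6
print(confidence({"bread", "butter"}, {"milk"}))  # -> 0.666...
```

A rule such as {bread, butter} => {milk} is kept only if both values clear the chosen minimum thresholds.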
Algorithms for association rule
learning
Some of the familiar algorithms used for mining
frequent itemsets are:
1. Apriori algorithm – It uses
a) a breadth-first search strategy to count the
support of itemsets, and
b) a candidate generation function that exploits
the downward closure property of support.
Equivalence class transformation
(ECLAT) algorithm
 A depth-first search algorithm using set
intersection
Suitable for serial and parallel execution, with
locality-enhancing properties
Frequent Pattern (FP) Growth
algorithm
1st phase – The algorithm counts the number of
occurrences of each item in the dataset and
stores them in a header table.
2nd phase – The FP tree structure is built by
inserting instances. Items in each instance
have to be sorted by descending order of their
frequency in the dataset, so that the tree can
be processed quickly.
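The first phase and the frequency-descending sort can be sketched with the standard library (transactions are hypothetical):

```python
from collections import Counter

transactions = [["bread", "milk"], ["bread", "butter"],
                ["milk", "butter", "bread"], ["milk"]]

# 1st phase: count occurrences of each item (the header table)
header = Counter(item for t in transactions for item in t)

# Before building the FP tree, sort each transaction by descending
# item frequency (ties broken alphabetically) so common prefixes
# can be shared along tree paths
ordered = [sorted(t, key=lambda i: (-header[i], i)) for t in transactions]

print(ordered[2])  # -> ['bread', 'milk', 'butter']
```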
Classification Tree Analysis
It is a type of machine learning algorithm used
to classify the class of an object
Identifies a set of characteristics that best
differentiates individuals based on a
categorical outcome variable
Genetic Algorithms
A search-based optimization technique based
on the concepts of natural selection and
genetics
In GAs, we have a pool or a population of
possible solutions to the given problem.
These solutions then undergo recombination
and mutation (like in natural genetics),
producing new children, and the process is
repeated over various generations.
Cont.,
 Each individual is assigned a fitness value
(based on its objective function value), and
fitter individuals are given a higher chance to
mate and yield fitter offspring
Part of the family of evolutionary algorithms
 Three basic operators of a GA: (i)
Reproduction, (ii) Mutation, and (iii)
Crossover
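As a toy illustration of the three operators, the sketch below (problem, encoding, and parameters are all hypothetical) evolves 5-bit integers toward the maximum of a simple objective:

```python
import random

random.seed(0)  # reproducible run

# Toy problem: maximize f(x) = -(x - 7)^2 for integer x in [0, 31]
def fitness(x):
    return -(x - 7) ** 2

def crossover(a, b):
    # Combine low bits of b with high bits of a at a random point
    point = random.randint(1, 4)
    mask = (1 << point) - 1
    return (a & ~mask) | (b & mask)

def mutate(x):
    # Occasionally flip one random bit
    if random.random() < 0.1:
        x ^= 1 << random.randrange(5)
    return x

pop = [random.randrange(32) for _ in range(10)]
for _ in range(50):
    # Reproduction: the fitter half survives and produces children
    pop.sort(key=fitness, reverse=True)
    parents = pop[:5]
    pop = parents + [mutate(crossover(random.choice(parents),
                                      random.choice(parents)))
                     for _ in range(5)]

best = max(pop, key=fitness)
```

Because the fittest individuals are carried over each generation, the best fitness never decreases as the generations proceed.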
Machine Learning
 It is a method of data analysis that automates
analytical model building
It is an application of Artificial Intelligence
based on the idea that machines should be
able to learn and adapt through experience
Within the field of data analytics, machine
learning is a method used to devise complex
models and algorithms that lend themselves
to prediction
Cont.,
• Machine learning is a branch of science that
deals with programming the systems in such a
way that they automatically learn and
improve with experience.
• Learning means recognizing and
understanding the input data and making wise
decisions based on the supplied data.
Cont.,
• It is very difficult to cater to all the decisions
based on all possible inputs. To tackle this
problem, algorithms are developed. These
algorithms build knowledge from specific data
and past experience with the principles of
statistics, probability theory, logic,
combinatorial optimization, search,
reinforcement learning, and control theory.
Learning types
There are several ways to implement machine
learning techniques; the most commonly used
ones are:
Supervised learning
Unsupervised learning
Semi-supervised learning
Supervised learning
• Deals with learning a function from available training
data, with known input and output variables. An
algorithm is used to learn the mapping function from
input to output [Y = f(X)]
• Analyzes the training data and produces an inferred
function, which can be used for mapping new
examples
• Some supervised learning algorithms are neural
networks, Support Vector Machines (SVMs), Naive
Bayes classifiers, random forests, decision trees, and
regression.
• Ex: classifying spam, voice recognition, regression
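A minimal sketch of learning the mapping Y = f(X) from labeled examples, using a 1-nearest-neighbour rule (one of the simplest supervised learners; the points and labels are hypothetical):

```python
# Labeled training data: known inputs X and outputs Y
train = [((1.0, 1.0), "spam"), ((1.2, 0.8), "spam"),
         ((4.0, 4.2), "ham"), ((3.8, 4.0), "ham")]

def predict(x):
    # Map a new input to the label of its nearest labeled example
    def dist(p):
        return sum((a - b) ** 2 for a, b in zip(x, p))
    nearest = min(train, key=lambda pair: dist(pair[0]))
    return nearest[1]

print(predict((1.1, 0.9)))  # -> spam
print(predict((4.1, 4.1)))  # -> ham
```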
Unsupervised Learning
 Makes sense of unlabeled data without any
predefined dataset for its training. There is only
input (X) and no corresponding output variable
 Models the underlying structure or distribution in the
data in order to learn more about the data
 It is most commonly used for clustering similar inputs
into logical groups
 Common approaches: k-means, self-organizing maps,
and hierarchical clustering
 Techniques: Recommendation, Association,
Clustering
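A minimal sketch of the k-means approach named above, on hypothetical one-dimensional, unlabeled inputs:

```python
# Unlabeled inputs: only X, no output variable
data = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]

def kmeans(points, centers, iters=10):
    for _ in range(iters):
        # Step 1: assign each point to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # Step 2: move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

print(kmeans(data, [0.0, 10.0]))  # centers converge near 1.0 and 8.0
```

The algorithm discovers the two logical groups in the data without ever being told any labels.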
Semi Supervised Learning
Problems where you have a large amount of
input data (X) and only some of the data is
labeled
 Example: a photo archive where only some
of the images are labeled and the majority are
unlabeled
Regression Analysis
• It is a set of statistical processes for estimating
the relationships among variables
• Regression analysis helps one understand how
the typical value of the dependent variable (or
'criterion variable') changes when any one of
the independent variables is varied, while the
other independent variables are held fixed.
• Widely used for prediction and forecasting,
where its use has substantial overlap with the
field of machine learning.
Cont.,
• This technique is used for forecasting, time
series modeling, and finding the causal
relationship between variables. For
example, the relationship between rash driving
and the number of road accidents by a driver is
best studied through regression.
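A simple linear regression can be fit by ordinary least squares in a few lines of plain Python (the x/y values are hypothetical):

```python
# Hypothetical data: an independent variable x and a dependent variable y
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares: slope = cov(x, y) / var(x)
slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) \
        / sum((a - mean_x) ** 2 for a in x)
intercept = mean_y - slope * mean_x

def predict(v):
    return intercept + slope * v

print(slope, intercept)  # -> 2.0 0.0
print(predict(6))        # -> 12.0
```

The fitted line shows how the typical value of y changes as x is varied, which is exactly the relationship regression analysis estimates.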
Sentiment Analysis/ Opinion
Mining
Using NLP, statistics, or machine learning
methods to extract, identify, or otherwise
characterize the sentiment content of a text
unit
Sentiment = feelings
Attitudes – Emotions – Opinions
Subjective impressions, not facts
A common use case for this technology is to
discover how people feel about a particular
topic
Automated extraction of subjective content
from digital text and prediction of its
subjectivity as positive, negative, or
neutral
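The simplest sentiment approach is lexicon-based scoring, sketched below with a tiny hypothetical word-score lexicon (real systems use NLP, statistics, or machine learning as noted above):

```python
# Tiny hypothetical sentiment lexicon: word -> polarity score
lexicon = {"good": 1, "great": 2, "bad": -1, "terrible": -2}

def sentiment(text):
    # Sum the polarity of every known word in the text
    score = sum(lexicon.get(w, 0) for w in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("The service was great and the food was good"))  # -> positive
print(sentiment("Terrible delay and bad support"))               # -> negative
```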
Social Network Analysis
• Process of investigating social structures
through the use of networks and graph theory
• It is the mapping and measuring of
relationships and flows between people,
groups, organizations, computers, URLs, and
other connected information/knowledge
entities.
• The nodes in the network are the people and
groups while the links show relationships or
flows between the nodes.
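The node-and-link view above can be quantified; for instance, degree centrality counts each node's direct relationships (the network below is hypothetical):

```python
# Hypothetical friendship network: nodes are people, edges are ties
edges = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "dan")]

# Degree centrality: number of direct relationships per node
degree = {}
for a, b in edges:
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1

# The node with the highest degree is the most connected
print(max(degree, key=degree.get))  # -> cat
```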
Two types of SNA
• Egocentric Analysis
– Focuses on the individual and studies an
individual's personal network and its effects
on that individual
• Sociocentric Analysis
– Focuses on large groups of people – Quantifies
relationships between people in a group
– Studies patterns of interactions and how these
patterns affect the group as a whole
Egocentric Analysis
• Examines local network structure
• Describes the network around a
single node (the ego)
– Number of other nodes (alters)
– Types of connections
• Extracts network features
• Uses these factors to predict health and
longevity, economic success, levels of
depression, access to new opportunities
Sociocentric Analysis
• Quantifies relationships and interactions
between a group of people
• Studies how interactions, patterns of
interactions, and network structure affect
– Concentration of power and resources
– Spread of disease
– Access to new ideas
– Group dynamics
Big data analytics tools and
technologies
Hadoop = HDFS + MapReduce
Hive, HBase, Flume, Oozie, Pig, Sqoop,
Kafka, Storm, RHadoop, Chukwa
Future role of data
Now: Data -> Decision Support System.
Future: Data -> Digital Nervous System (DNS),
a loop of Sense -> Interpret -> Decide -> Act.
History of Hadoop
1996–2000: The big data problem is faced by all search engines (Yahoo).
2003–04: Google publishes the Google File System and MapReduce papers.
2005–06: Hadoop spawns at Apache (Doug & Mike).
2010: Cloudera.
2013: Next-generation Hadoop / YARN & MapReduce 2.
Hadoop
It is an open source framework used for
distributed storage and processing of big data
datasets using the MapReduce programming
model
• The core components are:
i) Hadoop Common – contains libraries and
utilities needed by other Hadoop modules;
ii) Hadoop Distributed File System (HDFS) –
stores data on commodity machines,
providing very high aggregate bandwidth
across the cluster;
iii) Hadoop YARN – a platform responsible for
managing computing resources in clusters and
using them for scheduling users' applications;
iv) Hadoop MapReduce – an implementation of
the MapReduce programming model for
large-scale data processing.
Distributed Computing
The use of commodity hardware and open source
software (scaling out the number of
processors), as opposed to expensive proprietary
software on expensive server hardware.
Major Components of Hadoop
Framework
1. HDFS (Hadoop Distributed File System):
Inspired by the Google File System
2. MapReduce: Inspired by Google
MapReduce
* Both work on a cluster of systems with a
hierarchical architecture
HDFS
A file is split into blocks (A, B, C), and each
block is replicated across several Data Nodes.
Master Node (Name Node): monitors how the data
is distributed among the data nodes.
Data Nodes: store the data blocks.
* Both are Hadoop daemons – in practice, Java
programs running on specific machines.
Map Reduce
It is divided into 2 phases:
1. Map – The mapper code is distributed among
machines, and each works on the data its
system holds (data locality). The locally
computed results are aggregated and sent to
the reducer.
2. Reduce – The reducer algorithm is applied to
the aggregated data to produce the final result.
Programmers need to write only the map
logic and the reduce logic; the distribution
of map code to the map machines is handled by
Hadoop.
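The two phases can be simulated in plain Python with the classic word-count example (the input lines are hypothetical; Hadoop itself handles the distribution that this single-process sketch only imitates):

```python
from collections import defaultdict
from itertools import chain

lines = ["big data big", "data hadoop"]

# Map phase: each mapper emits (word, 1) pairs for its local split
def mapper(line):
    return [(w, 1) for w in line.split()]

mapped = list(chain.from_iterable(mapper(l) for l in lines))

# Shuffle: group values by key, as the framework does between phases
groups = defaultdict(list)
for k, v in mapped:
    groups[k].append(v)

# Reduce phase: aggregate the grouped values into the final result
counts = {k: sum(vs) for k, vs in groups.items()}
print(counts)  # -> {'big': 2, 'data': 2, 'hadoop': 1}
```

The programmer supplies only `mapper` and the reduce aggregation; everything between them is the framework's job.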
Hadoop Ecosystem
Core: HDFS and MapReduce
On top: HBase, Pig (from Yahoo), Hive (from
Facebook), Sqoop/Flume
Pig
 A tool that uses scripting statements to
process data
A simple data-flow language that saves
development time and effort
 It was designed for data scientists with
limited programming skills
It was developed by Yahoo
Hive
 Provides an SQL-like language that
runs on top of MapReduce
Hive was developed by Facebook for data scientists
with limited programming skills.
The code written in Pig/Hive gets converted
into MapReduce jobs that run on HDFS
Sqoop/ Flume
In order to facilitate the movement of data into
Hadoop, Sqoop and Flume are used.
Sqoop is used to move data from
relational databases, and Flume is used to
ingest data as it is created by external
sources.
HBase is a tool that provides features like a
real-time database on top of data in HDFS.

CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptx
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
BSNL Internship Training presentation.pptx
BSNL Internship Training presentation.pptxBSNL Internship Training presentation.pptx
BSNL Internship Training presentation.pptx
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptx
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substation
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - Guide
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Course
 

Big data

  • 1. BIG DATA Prepared by Bhuvaneshwari.P, Research Scholar, VIT University, Vellore
  • 2. Introduction DEFINITION Big data is the collection of large and complex datasets that are difficult to process using traditional database tools or data processing application software. Evolution of data scale: Mainframe (Kilobytes) -> Client/Server (Megabytes) -> The Internet (Gigabytes) -> [Big data] Mobile, social media… (Zettabytes)
  • 3. Characteristics of Big data The characteristics of big data are specified with the 5 V's: 1. Volume – The vast amount of data generated every second. [Kilobytes -> Mega -> Giga -> Tera -> Peta -> Exa -> Zetta -> Yotta] 2. Variety – The different kinds of data generated from different sources. 3. Velocity – The speed at which data is generated, processed, and moved around. 4. Value – Deriving the correct meaning from the available data. 5. Veracity – The uncertainty and inconsistencies in the data.
  • 4. Categories of Big data Big data is categorized into three forms. 1. Structured – Data which can be stored and processed in a predefined format. Ex: tables, RDBMS data. 2. Unstructured – Any data without a known structure or form. Ex: output returned by a Google search, audio, video, images. 3. Semi-structured – Data that contains both forms. Ex: JSON, CSV, XML, email. Data types -> emails, text messages, photos, videos, logs, documents, transactions, click trails, public records, etc.
  • 5. Examples of big data Some examples of big data: 1. Social media: 500+ terabytes of data are generated on Facebook every day, 100,000+ tweets are created every 60 seconds, and 300 hours of video are uploaded to YouTube per minute. 2. Airlines: A single jet engine produces 10+ terabytes of data in 30 minutes of flight time.
  • 6. Cont., 3. Stock Exchange – The New York Stock Exchange generates about one terabyte of new trade data every day. 4. Mobile Phones – Every 60 seconds, 698,445+ Google searches, 11,000,000+ instant messages, and 168,000,000+ emails are generated by users. 5. Walmart handles more than 1 million customer transactions every hour.
  • 7. Sources of big data 1. Activity data – Basic activities such as searches are stored by web browsers, phone usage is stored by mobile phones, credit card companies store where customers buy, and shops store what they buy. 2. Conversational data – Conversations in emails and on social media sites like Facebook, Twitter, and so on.
  • 8. Cont., 3. Photo and video data – Pictures and videos taken with mobile phones, digital cameras, and CCTV are uploaded heavily to YouTube and social media sites every second. 4. Sensor data – The sensors embedded in devices produce huge amounts of data. Ex: GPS provides the direction and speed of a vehicle. 5. IoT data – Smart TVs, smart watches, smart fridges, etc. Ex: traffic sensors send data to the alarm clock on a smart watch.
  • 9. Typical Classification I. Internal data – Supports daily business operations, such as organizational or enterprise data (structured). Ex: customer data, sales data, ERP, CRM, etc. II. External data – Analyzed for competitors, the market environment, and technology, such as social data (unstructured). Ex: the Internet, government, business partners, syndicated data suppliers, etc.
  • 10. Big data storage Big data storage is concerned with storing and managing data in a scalable way, satisfying the needs of applications that require access to the data. Some of the big data storage technologies are: 1. Distributed file system – Stores large amounts of unstructured data reliably on commodity hardware.
  • 11. Cont., The Hadoop Distributed File System (HDFS) is an integral part of the Hadoop framework; it is designed for large data files and is well suited to quickly ingesting data and bulk processing. 2. NoSQL database – A database that stores and retrieves data modeled in means other than tabular relations; it lacks ACID transactions. Supports both structured and unstructured data.
  • 12. The data structures used are key-value, wide column, graph, or document. Less functionality, more performance. It focuses on scalability, performance, and high availability. Evolution: flat files (no standard implementation) -> RDBMS (could not handle big data) -> NoSQL.
  • 13. 3. NewSQL database – Provides the same scalable performance as NoSQL systems for Online Transaction Processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system. 4. Cloud storage – A service model in which data is maintained, managed, and backed up remotely and made available to users over the Internet.
  • 14. Cont., Eliminates the acquisition and management costs of buying and maintaining your own storage infrastructure, increases agility, provides global scale, and delivers "anywhere, anytime" access to data. Users generally pay for cloud data storage on a per-consumption basis ("pay as you use").
  • 15. Data intelligence Data intelligence - Analysis of various forms of data in such a way that it can be used by companies to expand their services or investments Transforming data into information, information into knowledge, and knowledge into value
  • 16. Data integration and serialization Data integration – Combining data residing in different sources and providing users with a unified view of it. Data serialization – Converting structured data into a format that allows it to be shared or stored in such a way that its original structure can be recovered.
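The serialization round trip described above can be sketched with Python's standard json module (a minimal illustration; the record contents are invented):

```python
import json

record = {"id": 42, "name": "Ada", "tags": ["analytics", "storage"]}

# Serialize: structured data -> a portable text format
payload = json.dumps(record)

# Deserialize: the original structure is recovered intact
restored = json.loads(payload)
```

Any format with the same property (XML, Avro, Protocol Buffers) would serve; JSON is used here only because it ships with the standard library.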
  • 17. Data monitoring Data monitoring allows an organization to proactively maintain a high, consistent standard of data quality. • By checking data routinely as it is stored within applications, organizations can avoid the resource-intensive pre-processing of data before it is moved. • With data monitoring, data quality is checked at creation time rather than before a move.
  • 18. Data indexing An index is a data structure that is added to a file to provide faster access to the data. • It reduces the number of blocks that the DBMS has to check. • It contains a search key and a pointer. Search key – an attribute or set of attributes used to look up the records in a file. • Pointer – contains the address of where the data is stored in memory.
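The search-key/pointer idea can be sketched in a few lines of Python (the records and field names are hypothetical; a list position stands in for a block address):

```python
# Hypothetical "file" of records; the index maps a search key
# (the id field) to the record's position, so a lookup touches
# one entry instead of scanning every record.
records = [
    {"id": 7, "name": "Ria"},
    {"id": 3, "name": "Sam"},
    {"id": 9, "name": "Tom"},
]

# Build the index: search key -> pointer (here, a list position)
index = {rec["id"]: pos for pos, rec in enumerate(records)}

def lookup(key):
    """Indexed access: one dictionary probe instead of a full scan."""
    pos = index.get(key)
    return records[pos] if pos is not None else None
```

A real DBMS index (e.g. a B-tree) holds disk block addresses rather than list positions, but the lookup pattern is the same.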
  • 19. Why Big data? These are the factors that lead to the emergence of big data: 1. Increase of storage capacity 2. Increase of processing power 3. Availability of data 4. Derive insights and drive growth 5. To be competitive
  • 20. Benefits of Big Data Processing 1. Businesses gain intelligence for decision making. 2. Better customer service. 3. Early identification of risks in products/services. 4. Improved operational efficiency – product recommendation. 5. Detecting fraudulent behavior.
  • 21. Applications of Big data Smarter health care – leverage the health care system with easy access and efficient outcomes. Multi-channel sales and web display advertising. Finance intelligence. Traffic management. Manufacturing. Fraud and risk detection. Telecom.
  • 22. Analysis Vs Analytics Analysis – The process of breaking a complex topic or substance into smaller parts in order to gain a better understanding of it (What happened in the past?). It is the process of examining, transforming and arranging raw data in a specific way to generate useful information from it. Analytics – A sub-component of analysis that involves the use of tools and techniques to find novel, valuable and exploitable patterns (What will happen in the future?).
  • 23. Big data analytics It is the process of collecting, storing, organizing and analyzing large sets of heterogeneous data to gain insights and discover patterns, correlations and other useful information. Faster and better decision making. Enhanced performance, services or products. Cost-effective and next-generation products.
  • 24. Challenges/Opportunity Unstructured Data (90%) Structured Data (10%) To analyze & extract meaningful information
  • 25. Stages in Big data analytics I. Identifying the problem II. Designing data requirements III. Preprocessing data IV. Visualizing data and V. Performing analytics over data
  • 26. Traditional Vs Big data analytics Traditional analytics: analytics on well-known and smaller data, built on relational data models. Big data analytics: data in not-well-understood formats, largely semi-structured or unstructured, retrieved from various sources, almost flat with no relationships in nature.
  • 27. Four types of analytics 1. Descriptive Analytics: What happened? It is backward looking and reveals what has occurred in the past from the present data (hindsight). Two types: 1) Measures of central tendency (mean, mode, and median) 2) Measures of dispersion (range, variance, and standard deviation)
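The two groups of descriptive measures listed above can be computed directly with Python's standard statistics module (the sales figures are hypothetical):

```python
import statistics

sales = [4, 8, 6, 5, 3, 8]  # hypothetical daily sales figures

# Measures of central tendency
mean = statistics.mean(sales)
median = statistics.median(sales)   # 5.5 for this data
mode = statistics.mode(sales)       # 8 appears most often

# Measures of dispersion
spread = max(sales) - min(sales)    # the range
variance = statistics.pvariance(sales)
std_dev = statistics.pstdev(sales)
```

`pvariance`/`pstdev` treat the list as the whole population; use `variance`/`stdev` for a sample.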
  • 28. 2. Diagnostic Analytics: Why did this happen? What went wrong? 3. Predictive Analytics: What is likely to happen? It predicts what could happen in the future (insight). Several models used are i) forecasting, ii) simulation, iii) regression, iv) classification, and v) clustering.
  • 29. 4. Prescriptive Analytics – What should we do to make it happen? It suggests conclusions or actions that can be taken based on the analysis. Techniques used are i) linear programming, ii) integer programming, iii) mixed integer programming, and iv) nonlinear programming.
  • 30. Approach in analytics development Identify the data source. Select the right tools and technology to collect, store and organize the data. Understand the domain and process the data. Build a mathematical model for your analytics. Visualize and validate the result. Learn, adapt and rebuild your analytical model.
  • 31. Big data analytics domain Web and e-tailing. Government. Retail. Telecommunication. Health care. Finance and banking.
  • 32. Big data techniques There are seven widely used big data analysis techniques. They are 1. Association rule learning 2. Classification tree analysis 3. Genetic algorithms 4. Machine learning 5. Regression analysis 6. Sentiment analysis 7. Social network analysis
  • 33. Association rule learning A rule-based machine learning method for discovering interesting relations between variables in large databases. In order to select interesting rules from the set of all possible rules, constraints on various measures of significance and interest are used. The best-known constraints are minimum thresholds on support and confidence.
  • 34. Cont., Support – an indication of how frequently the item set appears in the data set. Confidence – an indication of how often the rule has been found to be true. Example rule for a supermarket: {bread, butter} => {milk} means that if bread and butter are bought, customers also buy milk.
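Support and confidence for the supermarket rule can be sketched over a handful of toy transactions (the shopping data is invented for illustration):

```python
# Each transaction is the set of items bought together (toy data).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the set."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """How often antecedent => consequent holds when the antecedent appears."""
    return support(antecedent | consequent) / support(antecedent)
```

Here support({bread, butter}) is 3/5 and the rule {bread, butter} => {milk} has confidence 2/3: of the three baskets with bread and butter, two also contain milk.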
  • 35. Algorithms for association rule learning Some of the familiar algorithms used for mining frequent item sets are: 1. Apriori algorithm – It uses a) a breadth-first search strategy to count the support of item sets and b) a candidate generation function which exploits the downward closure property of support.
  • 36. Equivalence class transformation (ECLAT) algorithm A depth-first search algorithm using set intersection. Suitable for serial and parallel execution, with locality-enhancing properties.
  • 37. Frequent Pattern (FP) Growth algorithm 1st phase – the algorithm counts the number of occurrences of items in the dataset and stores them in a header table. 2nd phase – the FP-tree structure is built by inserting instances. Items in each instance have to be sorted in descending order of their frequency in the dataset, so that the tree can be processed quickly.
  • 38. Classification Tree Analysis A type of machine learning algorithm used to predict the class of an object. It identifies the set of characteristics that best differentiates individuals based on a categorical outcome variable.
  • 39. Genetic Algorithms Search based optimization technique based on the concepts of natural selection and genetics In GAs, we have a pool or a population of possible solutions to the given problem. These solutions then undergo recombination and mutation (like in natural genetics), producing new children, and the process is repeated over various generations.
  • 40. Cont.,  Each individual is assigned a fitness value (based on its objective function value) and the fitter individuals are given a higher chance to mate and yield more “fitter” individuals Part of evolutionary algorithms  Three basic operators of GA: (i) Reproduction, (ii) Mutation, and (iii) Crossover
  • 41. Machine Learning  It is a method of data analysis that automates analytical model building It is an application of Artificial Intelligence based on the idea that machines should be able to learn and adapt through experience Within the field of data analytics, machine learning is a method used to devise complex models and algorithms that lend themselves to prediction
  • 42. Cont., • Machine learning is a branch of science that deals with programming the systems in such a way that they automatically learn and improve with experience. • Learning means recognizing and understanding the input data and making wise decisions based on the supplied data.
  • 43. Cont., • It is very difficult to cater to all the decisions based on all possible inputs. To tackle this problem, algorithms are developed. These algorithms build knowledge from specific data and past experience with the principles of statistics, probability theory, logic, combinatorial optimization, search, reinforcement learning, and control theory.
  • 44. Learning types There are several ways to implement machine learning techniques; the most commonly used are: Supervised learning, Unsupervised learning, Semi-supervised learning.
  • 45. Supervised learning • Deals with learning a function from available training data. Known input and output variables; an algorithm learns the mapping function from input to output [Y=f(X)]. • Analyzes the training data and produces an inferred function, which can be used for mapping new examples. • Some supervised learning algorithms are neural networks, Support Vector Machines (SVMs), Naive Bayes classifiers, random forests, decision trees, and regression. • Ex: classifying spam, voice recognition, regression.
  • 46. Unsupervised Learning Makes sense of unlabeled data without any predefined dataset for its training. Only input (X) and no corresponding output variable. Models the underlying structure or distribution in the data in order to learn more about the data. It is most commonly used for clustering similar inputs into logical groups. Common approaches: k-means, self-organizing maps and hierarchical clustering. Techniques: recommendation, association, clustering.
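Of the approaches named above, k-means is the simplest to sketch. A minimal one-dimensional version in plain Python (the data points and iteration count are invented for illustration):

```python
import random

random.seed(1)  # fixed seed so centroid initialization is reproducible

def kmeans(points, k, iters=20):
    """Minimal 1-D k-means: assign each point to the nearest centroid,
    then move each centroid to the mean of its cluster."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groups, around 2 and around 20
data = [1.0, 2.0, 3.0, 19.0, 20.0, 21.0]
centers = kmeans(data, k=2)
```

No labels are supplied anywhere: the two groups emerge purely from the structure of the data, which is the defining trait of unsupervised learning.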
  • 47. Semi-Supervised Learning Problems where you have a large amount of input data (X) and only some of the data is labeled. Example: a photo archive where only some of the images are labeled and the majority are unlabeled.
  • 48. Regression Analysis • It is a set of statistical processes for estimating the relationships among variables • Regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed. • Widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning.
  • 49. Cont., • This technique is used for forecasting, time series modeling and finding the causal effect relationship between the variables. For example, relationship between rash driving and number of road accidents by a driver is best studied through regression.
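The one-variable case described above (rash driving vs. road accidents) can be fit with ordinary least squares in a few lines (the incident/accident numbers are made up for illustration):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y over variance of x
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx  # intercept passes through the means
    return a, b

# Hypothetical data: rash-driving incidents vs. road accidents
incidents = [1, 2, 3, 4, 5]
accidents = [2, 4, 6, 8, 10]
a, b = fit_line(incidents, accidents)
predicted = a + b * 6   # forecast for 6 incidents
```

For this toy data the fit is exact (slope 2, intercept 0), which illustrates the forecasting use: the fitted line extrapolates to unseen values of the independent variable.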
  • 50. Sentiment Analysis/ Opinion Mining Using NLP, statistics, or machine learning methods to extract, identify, or otherwise characterize the sentiment content of a text unit Sentiment = feelings Attitudes – Emotions – Opinions Subjective impressions, not facts
  • 51. A common use case for this technology is to discover how people feel about a particular topic. Automated extraction of subjective content from digital text and prediction of its subjectivity as positive, negative or neutral.
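The positive/negative/neutral prediction can be sketched with a deliberately tiny lexicon-based scorer (the word list is an invented stand-in; real systems use large lexicons or trained NLP models):

```python
# Tiny hypothetical sentiment lexicon: word -> polarity
LEXICON = {"good": 1, "great": 1, "love": 1,
           "bad": -1, "awful": -1, "hate": -1}

def sentiment(text):
    """Sum the polarities of known words and map the total to a label."""
    score = sum(LEXICON.get(w, 0) for w in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A scorer this naive misses negation ("not good") and sarcasm, which is exactly why production sentiment analysis leans on machine learning rather than word counting.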
  • 52. Social Network Analysis • Process of investigating social structures through the use of networks and graph theory • It is the mapping and measuring of relationships and flows between people, groups, organizations, computers, URLs, and other connected information/knowledge entities. • The nodes in the network are the people and groups while the links show relationships or flows between the nodes.
  • 53. Two types of SNA • Egocentric Analysis – Focuses on the individual and studies an individual's personal network and its effects on that individual • Sociocentric Analysis – Focuses on large groups of people – Quantifies relationships between people in a group – Studies patterns of interactions and how these patterns affect the group as a whole
  • 54. Egocentric Analysis • Examines local network structure • Describes the network around a single node (the ego) – Number of other nodes (alters) – Types of connections • Extracts network features • Uses these factors to predict health and longevity, economic success, levels of depression, access to new opportunities
  • 55. Sociocentric Analysis • Quantifies relationships and interactions between a group of people • Studies how interactions, patterns of interactions, and network structure affect – Concentration of power and resources – Spread of disease – Access to new ideas – Group dynamics
  • 56. Big data analytics tools and technologies Hadoop = HDFS + MapReduce. Ecosystem tools: Hive, HBase, Flume, Oozie, Pig, Sqoop, Kafka, Storm, RHadoop, Chukwa.
  • 57. Future role of data Now: Decision Support System. Future: Digital Nervous System (DNS) – a continuous loop over data: Sense -> Interpret -> Decide -> Act.
  • 58. History of Hadoop 1996-2000: the big data problem faced by all search engines (Yahoo). 2003-04: Google publishes the Google File System and MapReduce papers. 2005-06: Hadoop spawns at Apache (Doug & Mike). 2010: Cloudera. 2013: next-generation Hadoop / YARN & MapReduce 2.
  • 59. Hadoop It is an open source framework used for distributed storage and processing of big data datasets using the MapReduce programming model. • The core components are: i) Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
  • 60. • Hadoop Distributed File System (HDFS) – Stores data on commodity machines, providing very high aggregate bandwidth across the cluster • Hadoop YARN – a platform responsible for managing computing resources in clusters and using them for scheduling users' applications • Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.
  • 61. Distributed Computing Use of commodity hardware and open source software (increasing the number of processors) against expensive proprietary software on expensive hardware (servers).
  • 62. Major Components of Hadoop Framework 1. HDFS (Hadoop Distributed File System): inspired by the Google File System. 2. MapReduce: inspired by Google MapReduce. * Both work on clusters of systems in a hierarchical architecture.
  • 64. Master Node: It monitors the data distributed among the data nodes. Data Node: Stores the data blocks. * Both are Hadoop daemons – actually Java programs that run on specific machines.
  • 65. Map Reduce It is divided into 2 phases: 1. Map – Mapper code is distributed among machines and works on the data which each machine holds (data locality). The locally computed results are aggregated and sent to the reducer.
  • 66. 2. Reduce – Reducer algorithms are applied to the global data to produce the final result. Programmers need to write only the map logic and the reduce logic; the correct distribution of map code to the map machines is handled by Hadoop.
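The two phases can be sketched as a single-process word count (a toy stand-in for a cluster; in real Hadoop each mapper runs on the machine that holds its data chunk, and the framework handles the shuffle between phases):

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    """Mapper: runs where the data lives, emits (word, 1) pairs."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Reducer: aggregates the locally emitted pairs into a final result."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Each chunk would sit on a different machine in a real cluster
chunks = ["big data big", "data big"]
mapped = chain.from_iterable(map_phase(c) for c in chunks)
result = reduce_phase(mapped)
```

As the slide says, the programmer supplies only the map and reduce logic; everything else here (splitting into chunks, collecting the pairs) is what Hadoop does for you at scale.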
  • 67. Hadoop Ecosystem HDFS, MapReduce, HBase, Pig (Yahoo), Hive (Facebook), Sqoop/Flume.
  • 68. Pig It is a tool that uses scripting statements to process data. A simple data flow language which saves development time and effort. It was designed for data scientists who have fewer programming skills. It was developed by Yahoo.
  • 69. Hive It provides an SQL-like language that runs on top of MapReduce. Hive was developed by Facebook for data scientists who have fewer programming skills. Code written in Pig/Hive gets converted into MapReduce jobs and runs on HDFS.
  • 70. Sqoop/Flume In order to facilitate the movement of data into Hadoop, Sqoop and Flume are used. Sqoop is used to move data from relational databases, and Flume is used to ingest data as it is created by external sources. HBase is a tool which provides features like a real-time database, receiving data from HDFS.
