CONTENTS
1. Big Data
2. Data vs Big Data
3. Examples
4. Challenges
5. Big Data Analytics
6. Traditional vs Big Data analytics
7. Hadoop
8. Application
WHAT IS BIG DATA
Big data is a collection of data sets that are
large and complex in nature.
It includes both structured and unstructured
data that grow so large, so fast, that they are not
manageable by traditional relational database
systems or conventional statistical tools.
DATA VS BIG DATA
Big data is just data with:
• More volume
• Faster data generation (velocity)
• Multiple data formats (variety)
The world's data volume is projected to grow 40%
per year, and 50-fold by 2020 [1]
Data comes from various human
and machine activities
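As a rough check on how the two growth figures relate, the sketch below computes how many years of 40% compound growth it takes to reach a 50-fold increase. The baseline year is not stated here, so this is only an illustration of the arithmetic, not a statement about the cited source.

```python
import math

ANNUAL_GROWTH = 0.40   # 40% per year, from the figure above
TARGET_MULTIPLE = 50   # "50 times", from the figure above

def years_to_multiple(growth_rate, multiple):
    """Years of compound growth needed to reach the given multiple."""
    return math.log(multiple) / math.log(1 + growth_rate)

years = years_to_multiple(ANNUAL_GROWTH, TARGET_MULTIPLE)
print(f"{years:.1f} years of 40% growth gives a 50x increase")
```

The result falls between 11 and 12 years, so a 50-fold increase implies a baseline roughly a decade before 2020.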
BIG DATA ANALYTICS
IN PRACTICE
1. The New York Stock Exchange generates about
one terabyte of new trade data per day.
2. A single jet engine can generate 10+ terabytes of
data in 30 minutes of flight time. With many
thousands of flights per day, data generation
reaches many petabytes.
3. Statistics show that 500+ terabytes of new data
are ingested into the databases of the social media
site Facebook every day. This data is mainly
generated by photo and video uploads,
message exchanges, comments, etc.
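The jet engine example can be turned into a quick back-of-the-envelope calculation. This is a minimal sketch: the flight count is a hypothetical round number, and it assumes each flight contributes only a single 30-minute segment, so it is a lower bound rather than a measured figure.

```python
TB_PER_SEGMENT = 10   # "10+ terabytes in 30 minutes", from the example above
TB_PER_PB = 1000      # decimal units: 1 PB = 1000 TB

def daily_petabytes(flights_per_day, tb_per_flight=TB_PER_SEGMENT):
    """Lower-bound daily volume if each flight logs one 30-minute segment."""
    return flights_per_day * tb_per_flight / TB_PER_PB

# flights_per_day is an assumed value for illustration only
print(daily_petabytes(flights_per_day=5000), "PB per day")
```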
CHALLENGES
More data = more storage space
• More storage = more money to spend (an RDBMS server needs
very costly storage)
Data coming faster
• Speed up data processing or we'll have a backlog
Need to handle various data structures
• How do we put JSON data in a standard RDBMS?
• Hey, we also have XML from other sources
• Other systems give us compressed data in gzip format
Agile business requirements
• In the initial discussion, they only needed 10 pieces of information; now they
ask for 25? Can we do that? We only put those 10 in our
database
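The format challenge above (JSON, XML, gzip) can be sketched with the Python standard library alone. The sample records and field names are made up for illustration, and the XML helper assumes a flat, one-level element.

```python
import gzip
import json
import xml.etree.ElementTree as ET

def from_json(text):
    """Parse a JSON object into a dict."""
    return json.loads(text)

def from_xml(text):
    """Flatten a one-level XML element into a dict of tag -> text."""
    root = ET.fromstring(text)
    return {child.tag: child.text for child in root}

def from_gzip_json(blob):
    """Decompress gzip bytes, then parse the payload as JSON."""
    return from_json(gzip.decompress(blob).decode("utf-8"))

# Hypothetical sample inputs from three different source systems
json_src = '{"id": "1", "name": "alice"}'
xml_src = "<user><id>2</id><name>bob</name></user>"
gzip_src = gzip.compress(b'{"id": "3", "name": "carol"}')

records = [from_json(json_src), from_xml(xml_src), from_gzip_json(gzip_src)]
for record in records:
    print(record)
```

Keeping each record as a whole dictionary, rather than copying a fixed set of columns, is one way to avoid discarding fields that the business may ask for later.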
TYPES OF BIG DATA
• Structured data: any data that can be stored,
accessed and processed in a fixed format is
termed 'structured' data.
• Unstructured data: any data with an unknown form or
structure is classified as unstructured data.
• Semi-structured data: data that contains both
forms. It carries some organizing markers (for example,
tags in JSON or XML) but does not follow a fixed format.
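The three types can be illustrated with small samples. The values below are invented for illustration; the point is only how much structure each sample carries.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields.
structured = list(csv.DictReader(io.StringIO("id,amount\n1,9.50\n2,12.00\n")))

# Semi-structured: self-describing keys, but fields vary per record.
semi_structured = [
    json.loads('{"id": 1, "tags": ["sale"]}'),
    json.loads('{"id": 2, "note": "gift", "address": {"city": "Pune"}}'),
]

# Unstructured: free text with no predefined fields.
unstructured = "Loved the product, but delivery took two weeks."

print([sorted(row.keys()) for row in structured])
print([sorted(rec.keys()) for rec in semi_structured])
print(len(unstructured.split()), "words, no schema")
```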
BENEFITS OF BIG
DATA PROCESSING
• Businesses can utilize outside intelligence when
making decisions: access to social data from search
engines and sites like Facebook and Twitter enables
organizations to fine-tune their business strategies.
• Improved customer service: traditional customer
feedback systems are being replaced by new
systems designed with 'Big Data' technologies. In
these new systems, Big Data and natural language
processing technologies are used to read and
evaluate consumer responses.
• Early identification of risk to the product or services, if
any.
• Better operational efficiency: 'Big Data' technologies
can be used to create a staging area or landing zone
for new data before identifying what data should be
moved to the data warehouse. In addition, such
integration of 'Big Data' technologies and the data
warehouse helps organizations offload infrequently
accessed data.
BIG DATA ANALYTICS
Big data analytics is the process of examining large
and varied data sets -- i.e., big data -- to uncover
hidden patterns, unknown correlations, market trends,
customer preferences and other useful information
that can help organizations make more-informed
business decisions.
TRADITIONAL VS
BIG DATA ANALYTICS
Traditional analytics:
• Uses known data that is well understood.
• Built on the relational database model.
Big data analytics:
• Data format is not well understood, because the data is
largely unstructured and semi-structured.
• Data comes in various forms and formats from
multiple disconnected systems. It is almost flat,
with no relationships.
4 TYPES OF
ANALYTICS
1. Descriptive: what happened?
2. Diagnostic: why did it happen?
3. Predictive: what is likely to happen?
4. Prescriptive: what should I do about it?
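The four questions can be sketched on a toy data set. Everything here is an assumption for illustration: the ad spend and sales figures are made up, a least-squares line stands in for real diagnostic and predictive modeling, and `recommend` is a hypothetical helper.

```python
from statistics import mean

# Hypothetical monthly figures, made up for illustration.
ad_spend = [10, 20, 30, 40, 50]
sales = [25, 45, 65, 85, 105]

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

# 1. Descriptive: what happened?
average_sales = mean(sales)

# 2. Diagnostic: why did it happen? Here, how sales move with ad spend.
slope, intercept = fit_line(ad_spend, sales)

# 3. Predictive: what is likely to happen at a new spend level?
def predict(spend):
    return slope * spend + intercept

# 4. Prescriptive: what should I do? Pick the smallest spend that meets a target.
def recommend(target_sales, candidates):
    viable = [c for c in candidates if predict(c) >= target_sales]
    return min(viable) if viable else None

print("average sales:", average_sales)
print("sales per unit of spend:", slope)
print("predicted sales at spend 60:", predict(60))
print("recommended spend for 120 sales:", recommend(120, [50, 60, 70]))
```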
APPROACH TO ANALYTICS
1. Identify the data sources.
2. Select the right tools and technology to collect,
store, and aggregate the data.
3. Understand the business domain.
4. Identify tools and technology to process the data.
5. Build mathematical models for the analytics.
6. Visualize.
7. Validate your results.
8. Learn, adapt, and rebuild your analytical model.
ANALYTICS TOOLS
Most used statistical programming tools are:
• IBM SPSS
• SAS
• R
• MATLAB
Of these, R and MATLAB have the most comprehensive
support for statistical functions.
HADOOP
Hadoop is a framework that allows for distributed
processing of large data sets across clusters of
commodity computers using a simple programming model.
• Software framework that supports distributed
applications, licensed under the Apache v2 license.
• Hadoop was derived from Google's MapReduce and
Google File System papers.
• Yahoo has been the largest contributor to the project.
• Written in the Java programming language.
WHY USE HADOOP ?
• Need to compress data
• Nodes fail every day
• Common infrastructure: efficient, easy to use,
open source
COMMON USES
• Searches
• Log processing
• Recommendation systems
• Analytics (Facebook, LinkedIn)
• Image and video processing (NASA)
• Data retention
TECHNOLOGIES AND
TOOLS
Unstructured and semi-structured data types typically
don't fit well in traditional data warehouses that are
based on relational databases oriented to structured
data sets.
As a result, many organizations that collect, process
and analyze big data turn to NoSQL databases as well
as Hadoop and its companion tools, including:
• MapReduce: a software framework that allows
developers to write programs that process massive
amounts of unstructured data in parallel across a
distributed cluster of processors or stand-alone
computers.
• YARN: a cluster management technology and one
of the key features in second-generation Hadoop.
• Spark: an open-source parallel processing
framework that enables users to run large-scale
data analytics applications across clustered
systems.
• HBase: a column-oriented key/value data store
built to run on top of the Hadoop Distributed File
System (HDFS).
• Hive: an open-source data warehouse system for
querying and analyzing large datasets stored in
Hadoop files.
• Kafka: a distributed publish-subscribe messaging
system designed to replace traditional message
brokers.
• Pig: an open-source technology that offers a
high-level mechanism for the parallel
programming of MapReduce jobs to be executed
on Hadoop clusters.
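To make the MapReduce programming model concrete, here is a minimal single-process word count in plain Python. It is not the Hadoop API: the function names are illustrative, and a real cluster would run the map and reduce phases in parallel on different nodes, with the framework handling the shuffle.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(documents):
    """Run map, shuffle, and reduce over a list of input splits."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(mapped))

# Two hypothetical input splits
splits = ["big data is big", "data grows fast"]
print(word_count(splits))
```

Because each call to `map_phase` depends only on its own split, and each key is reduced independently, both phases can be spread across machines without changing the result.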
BIG DATA ANALYTICS
BENEFITS
• Driven by specialized analytics systems and
software, big data analytics can point the way to
various business benefits, including new revenue
opportunities, more effective marketing, better
customer service, improved operational efficiency
and competitive advantages over rivals.
• Big data analytics applications enable data
scientists, predictive modelers, statisticians and
other analytics professionals to analyze growing
volumes of structured transaction data, plus
other forms of data that are often left untapped by
conventional business intelligence (BI) and
analytics programs.
• On a broad scale, data analytics technologies and
techniques provide a means of analyzing data
sets and drawing conclusions about them to help
organizations make informed business decisions.
BIG DATA ANALYTICS
APPLICATION
• Government: the use and adoption of big data
within governmental processes allows efficiencies
in terms of cost, productivity, and innovation, but
does not come without its flaws.
• Manufacturing: according to the TCS 2013 Global Trend
Study, improvements in supply planning and
product quality provide the greatest benefit of big
data for manufacturing.
• Information Technology: especially since 2015, big
data has come to prominence within business
operations as a tool to help employees work more
efficiently and streamline the collection and
distribution of Information Technology (IT).
• Education: a McKinsey Global Institute study found a
shortage of 1.5 million highly trained data
professionals and managers. A number of
universities, including the University of Tennessee and UC
Berkeley, have created master's programs to meet this
demand.