When writing this new paper, my main objective was to provide a clear understanding of where the term "Big Data" comes from, why the term is so popular now, what it really means, and what its implications for businesses can be. Because the full power of Big Data can be revealed only by Analytics, I describe several widely recognized and widely used analytical techniques to help you see how analytics, used in conjunction with Big Data, can boost business performance.
I expect that by the end of this paper:
- you will smile the next time you read or hear the terms big data, Hadoop, or analytics :)
- you will understand the technologies behind the scenes when people talk about "Big Data"
- you will know how to "make sense" of Big Data using Analytics
- you will have a basic idea of the data mining techniques used in business in general and in Big Data in particular
- you will be able to follow any news about Big Data
Big Data 360 Overview: Analytics Techniques
1. BIG DATA: A 360° Overview
Juvénal CHOKOGOUE M
Consultant Business Analytics – Big Data
BD-DE-0005
11/23/2014
2. Module Overview
• The Business Challenge
• What does this module stand for?
• Who is this module for?
• Before the battle begins
• Anyway! What is Big Data?
• Big Data and Analytics: How these two married together?
• Analytical Techniques for Mining Big Data
• The New Infrastructure for Data Management: Hadoop
• Big Data Adoption: Now or Later?
• The Next Steps
• What Should I Remember?
• Some Big Data Providers
• Bibliography & Resources
• About me
3. The Business Challenge
• The ability to scale operations up and
down as conditions change, and to
decrease the “time to market” for
decision-making, has become a critical
competitive differentiator in today’s
economy.
• Companies are gathering more and
more data to stay competitive.
• If they want to decrease their “time
to market”, they must make sense of the
intersection of all the different kinds of
data they have gathered.
• Technically, when you are dealing
with so much data in so many different
forms, it is impossible to think about
data management in traditional ways.
• The challenges and opportunities
associated with this new kind of data
management problem are known today
as "Big Data".
5. What does this module stand for?
As with any new technological concept, software companies fight over definitions
in order to sell their products, leaving businesses with a confused idea of the
concept and of where it fits among the issues they have to face. Big Data, like Cloud Computing,
Virtualization, Data Mining and so on, is just one of these concepts.
When writing this paper, my main objective was to provide a true 360° overview of Big Data,
that is, a clear understanding of where the term "Big Data" comes from, why the term is so popular now,
what it really means, and what its implications for businesses can be. Because Analytics is another term
that is associated with Big Data, I describe several widely recognized and widely used analytical
techniques to help you see how analytics, used in conjunction with Big Data, can boost business
performance.
So please don't put words in my mouth: this paper is not intended as a “how-to” for big
data project management, for big data application development, or for statistical model building.
Those will be the subjects of other papers. Rather, I expect that by the end of this paper:
• you will smile the next time you read or hear the terms big data, Hadoop, or analytics :)
• you will understand what is behind the scenes when people talk about "Big Data"
• you will know how one can "make sense" of Big Data using Analytics
• you will have a basic idea of the data mining techniques used in business and in Big Data
• you will be able to follow any update about Big Data
So, keep reading…
6. Before the battle begins
The information provided here is for informational purposes only and represents my point of view as of
the date of this presentation. Because market conditions change, the information provided here may be
modified or become obsolete. It should not be interpreted as a commitment, and I cannot guarantee its accuracy
after the date of this presentation.
The contents of the websites cited here may change, and the websites themselves may become unavailable after
the publication of this presentation. I therefore make no warranties, express, implied or statutory, as to the
information in this presentation.
In this presentation, I use the term "Analyst" for the person who is responsible for data management,
analytics, and programming. It is a simplification I adopted so that you are not distracted by the
new job titles and terms created by Big Data and can focus on the content of the paper.
Microsoft, SQL Server, Teradata, Oracle, Google, Hadoop, Cloudera, HortonWorks, SAS, EMC and other
names and products cited here are or may be registered trademarks in the U.S. and/or in other countries.
Feel free to share this module with anyone you know, from your colleagues to your friends, but please
don’t forget to mention the name of the author.
You may use and change the content of this module on your own, but in that case I will not be responsible
for its content.
This module is not for sale. If you intend to use it for your own purposes, please don’t commercialize it!
8. • According to Gartner: "Big data is
high-volume, high-velocity and high-variety
information assets that demand
cost-effective, innovative forms of
information processing for enhanced
insight and decision making."
(http://www.gartner.com/it-glossary/big-data/)
Of all the definitions proposed for Big Data, Gartner's
is the most widely adopted. From that definition,
one thing is clear: the term Big Data designates data
that is large in volume, high in velocity and wide in variety. These
are often referred to as the “3 Vs”, or the three dimensions, of Big Data.
9. Big Data and Analytics:
How these two married together?
10. Taken alone, Big Data is technology-driven. If businesses want to capitalize on their Big Data,
they have to find a way to combine it with the traditional business analysis techniques they have used
in the past to query and dig through data.
But with an extremely wide variety of data come new challenges. Most traditional business analysis
techniques are not suitable for the new kinds of data sources we have today, and that is where
Analytics comes into play!
Analytics designates the means by which businesses gain insight from data, whatever its source, its size
or its format.
11. All this said, you can now understand
that Big Data Analytics is the concept
that designates the new means by which we
extract insights from data that is
extremely large, extremely varied and
extremely fast-moving.
• However, be aware that the
effectiveness of Analytics depends
fundamentally on the question you want
to answer and on the quality of the data.
Data quality issues must be addressed
before any analytics work. As they say in
the field: "Garbage in, garbage out".
• Analytics techniques must be
handled with caution and require
formal training in the field. You may
want to consider hiring an
analytics professional.
12. Thirdly, analytics is not a "silver bullet"
that will always give you insights.
Fourthly, just because you have insights
does not guarantee you have the
power to act on them. Analytics
can provide insights, but turning
insights from numbers into competitive
advantage may require changes that
your business can’t afford, or simply
doesn’t want to make. The Harvard
Business Review describes a case
where, through big data, a retailer learned
“that he could increase profits
substantially by extending the time that
items were on the floor before and after
discounting. Implementing that
change, however, would have required a
complete redesign of the supply chain,
which the retailer was reluctant to
undertake.” (source:
https://hbr.org/2013/12/you-may-not-need-big-data-after-all/ar/1)
Analytics does not replace your business intuition. It
just makes you feel more confident about your choice.
In the end, you may still rely on your experience and your
intuition as a manager to make the decision.
14. In this part, I am going to talk only about
some techniques I am certified in. These
techniques are used in most business
scenarios and proved their worth long
ago.
These techniques are: Regression (Linear and
Logistic), Decision Trees, K-Means, Time
Series, Neural Networks, Association Rules,
Naive Bayes and Survival Analysis. In addition,
I am going to present Text Analytics
fundamentals, since in the Big Data age we are
generating more and more text data (tweets,
Facebook comments...).
- Regression
Regression focuses on the relationship
between an outcome and its input variables.
Here, we are predicting how changes in
individual drivers affect the outcome. The
outcome can be continuous or discrete. When
it is discrete, we are predicting the probability
that the outcome will occur. When it is
continuous, we are predicting the value of the
dependent variable given the independent
variables.
(Figure: a survey from TDWI)
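To make the two regression cases concrete, here is a minimal sketch in plain Python. It is only an illustration under assumptions: the data (`ad_spend`, `revenue`, `visits`, `bought`) is invented, the helper names are mine, and the logistic model is fitted with simple gradient descent rather than the routines a statistical package would use.

```python
import math

def fit_linear(xs, ys):
    """Least-squares line for one input variable: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict_probability(w, b, x):
    """Logistic model: probability that the outcome occurs for input x."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def fit_logistic(xs, labels, learning_rate=0.1, epochs=5000):
    """Fit a one-variable logistic regression by plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, labels):
            error = predict_probability(w, b, x) - y
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b

# Continuous outcome (hypothetical data): revenue is exactly 2 * ad_spend + 1
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
revenue = [3.0, 5.0, 7.0, 9.0, 11.0]
slope, intercept = fit_linear(ad_spend, revenue)

# Discrete outcome (hypothetical data): did the visitor buy (1) or not (0)?
visits = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
bought = [0, 0, 1, 0, 1, 1]
w, b = fit_logistic(visits, bought)
print(slope, intercept, predict_probability(w, b, 6.0))
```

In the continuous case the model returns a value (the slope tells how much the outcome moves per unit of the driver); in the discrete case it returns a probability between 0 and 1.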
15. - Decision Trees
Decision Trees are a flexible method very
commonly deployed in classification and
regression problems. Decision trees partition
large amounts of data into smaller segments
by applying a series of rules of the form "IF
condition THEN expression" (e.g. if age is less
than 30 and revenue is greater than 36000 then
class = 'Rich'). Decision trees are visually
represented as upside-down trees, with the
root at the top and branches emanating from
the root. There are two types of trees:
Classification Trees and Regression Trees.
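The rule quoted in the Decision Trees description can be written as a tiny hand-built classification tree. This is a sketch only: the 'Other' label, the `classify` function and the customer list are assumptions added for illustration, and a real tree would be learned from data rather than written by hand.

```python
from collections import defaultdict

def classify(age, revenue):
    """Walk a two-level tree: first split on age, then on revenue."""
    if age < 30:                 # root node
        if revenue > 36000:      # second-level node
            return "Rich"        # the rule from the text
        return "Other"           # assumed label for the remaining young customers
    return "Other"               # assumed label for customers aged 30 or more

# Hypothetical customers as (age, revenue) pairs
customers = [(25, 40000), (28, 20000), (45, 90000), (22, 36001)]

# The rules partition the data into smaller segments, one per leaf label
segments = defaultdict(list)
for age, revenue in customers:
    segments[classify(age, revenue)].append((age, revenue))

print(dict(segments))
```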
- K-Means
K-means is a clustering method. It belongs to
the category of Exploratory Data Analysis
methods called "Unsupervised Classification".
The goal is to group data based on similarities
in the input variables, with no target or specific
outcome. It is the preferred method for
segmentation and profiling.
(Figure: a survey from TDWI)
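A minimal sketch of how K-means groups data without any target variable. The points and the starting centroids are invented for illustration, and the `kmeans` function is a plain-Python version of the usual assign-then-update loop, not a production implementation.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate two steps: assign each point to its nearest centroid,
    then move each centroid to the mean of the points assigned to it."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(point)
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers described by two input variables
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

# Assumption: we start from two of the points as initial centroids (k = 2)
centroids, clusters = kmeans(points, [points[0], points[3]])
print(centroids)
```

Each resulting cluster is a segment; profiling then consists of describing each segment by its centroid.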
16. - Time Series
Time Series Analysis provides a scientific methodology for
forecasting. It is the analysis of a
phenomenon that evolves over time. The main
objectives in Time Series Analysis are:
• To understand the underlying structure of the time series
by breaking it into trend, seasonality, and noise.
• To fit a mathematical model to forecast the future.
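The first objective, breaking a series into trend, seasonality and noise, can be sketched as follows. This assumes an additive model (series = trend + seasonality + noise), an odd seasonal period, and a synthetic series built for the example; the function names are mine.

```python
def moving_average(series, window):
    """Centered moving average (window must be odd); the ends stay None."""
    half = window // 2
    smoothed = [None] * len(series)
    for i in range(half, len(series) - half):
        smoothed[i] = sum(series[i - half:i + half + 1]) / window
    return smoothed

def decompose(series, period):
    """Additive decomposition into trend, seasonal pattern and noise."""
    trend = moving_average(series, period)
    detrended = [None if t is None else y - t for y, t in zip(series, trend)]
    seasonal = []
    for position in range(period):
        values = [d for i, d in enumerate(detrended)
                  if d is not None and i % period == position]
        seasonal.append(sum(values) / len(values))
    noise = [None if d is None else d - seasonal[i % period]
             for i, d in enumerate(detrended)]
    return trend, seasonal, noise

# Synthetic series: a linear trend plus a repeating pattern of period 3
PATTERN = [3.0, -1.0, -2.0]
series = [10.0 + 2.0 * i + PATTERN[i % 3] for i in range(12)]

trend, seasonal, noise = decompose(series, period=3)
print(seasonal)
```

Once trend and seasonality are separated, a forecasting model can be fitted to each component.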
- Neural Networks
Artificial Neural Networks are a class of flexible non-linear
models used for prediction problems. Their power
comes from the fact that they can
approximate virtually any continuous association between
the inputs and the target, whatever the kind of relationship
between them. There are many kinds of neural networks,
but the most widely used is the Multi-Layer Perceptron
(MLP).
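To show why a hidden layer makes the model non-linear, here is a sketch of the forward pass of a very small MLP. The weights are chosen by hand (an assumption for illustration, not the result of training) so that the network computes the "exclusive or" of two inputs, a relationship no single straight line can capture.

```python
def relu(x):
    """Rectified linear activation: the non-linearity of the hidden layer."""
    return max(0.0, x)

def mlp_forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer followed by a linear output unit."""
    hidden = [
        relu(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias

# Hand-picked weights (illustrative only): two inputs, two hidden units, one output
HIDDEN_WEIGHTS = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_BIASES = [0.0, -1.0]
OUTPUT_WEIGHTS = [1.0, -2.0]
OUTPUT_BIAS = 0.0

for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        y = mlp_forward([a, b], HIDDEN_WEIGHTS, HIDDEN_BIASES, OUTPUT_WEIGHTS, OUTPUT_BIAS)
        print(a, b, y)
```

In practice the weights are learned from data; only the structure of the computation is shown here.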
- Association Rules
Also known as association rule discovery, Market
Basket Analysis or affinity analysis, association rule mining is a
popular data mining method for exploring associations
between items (data). It is an unsupervised method for
mining transactions stored in databases.
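Association rules are usually scored with three measures: support, confidence and lift. The sketch below computes them on a handful of invented shopping baskets; the transactions and function names are assumptions for illustration.

```python
# Hypothetical market baskets, one set of items per transaction
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    """Share of transactions that contain every item of the itemset."""
    return sum(1 for basket in transactions if itemset <= basket) / len(transactions)

def confidence(antecedent, consequent):
    """Among baskets containing the antecedent, share that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to how common the consequent is on its own."""
    return confidence(antecedent, consequent) / support(consequent)

# Score the rule "bread => butter"
rule_support = support({"bread", "butter"})
rule_confidence = confidence({"bread"}, {"butter"})
rule_lift = lift({"bread"}, {"butter"})
print(rule_support, rule_confidence, rule_lift)
```

A lift below 1, as in this toy data, suggests that buying bread does not make butter more likely than it already is.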
17. - Naive Bayes
Naive Bayes is a "classifier": it is used to classify, or
assign labels to, objects by applying Bayes' theorem
with strong (naive) independence assumptions. Naive
Bayes is especially suited to problems where you have
categorical inputs with many levels.
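Here is a minimal sketch of a Naive Bayes classifier for categorical inputs. The training rows (city, device) and the labels are invented, and I add Laplace smoothing (an assumption, a common choice) so that a level never seen with a class does not force the probability to zero.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count classes, and the levels of each input variable within each class."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (variable index, label) -> level counts
    levels = defaultdict(set)             # variable index -> set of observed levels
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            levels[i].add(value)
    return class_counts, value_counts, levels

def predict(model, row):
    """Return the posterior probability of each class for one new row."""
    class_counts, value_counts, levels = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total                      # prior P(class)
        for i, value in enumerate(row):
            # Laplace-smoothed P(level | class), multiplied as if inputs were independent
            score *= (value_counts[(i, label)][value] + 1) / (count + len(levels[i]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}

# Hypothetical training data: (city, device) -> did the visitor buy?
rows = [("paris", "mobile"), ("paris", "desktop"), ("lyon", "mobile"),
        ("lyon", "desktop"), ("paris", "mobile"), ("nice", "desktop")]
labels = ["buy", "buy", "no", "no", "buy", "no"]

model = train(rows, labels)
probabilities = predict(model, ("paris", "mobile"))
print(probabilities)
```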
- Survival Analysis
Survival analysis is a class of statistical methods for
studying the occurrence and timing of events. It is suitable
for problems where you want to know WHEN a specific
event will happen. The most common approaches to building a
survival model are the following: life tables, Kaplan-Meier
estimators, exponential regression, proportional hazards
regression, competing risks models and discrete-time
methods.
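Of the approaches listed, the Kaplan-Meier estimator is the easiest to sketch. The durations below (say, months before a customer churns) are invented, and `observed` marks whether the event was actually seen or the customer was still active when the study ended (censored).

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival probability)] at each time an event occurs.

    durations: time until the event, or until observation stopped
    observed:  True if the event happened, False if the record is censored
    """
    event_times = sorted({t for t, seen in zip(durations, observed) if seen})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, seen in zip(durations, observed) if d == t and seen)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical data: six customers, two of them censored
durations = [2, 3, 3, 5, 8, 8]
observed = [True, True, False, True, False, True]

curve = kaplan_meier(durations, observed)
for time, probability in curve:
    print(time, round(probability, 3))
```

Each step of the curve answers the "WHEN" question: the estimated probability that the event has not yet happened by that time.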
- Text Analytics Fundamentals
Text analytics is the process of analyzing unstructured text,
extracting relevant information, and transforming it into
structured information that can then be leveraged in
various ways. The analysis and extraction processes take
advantage of techniques that originated in
computational linguistics (natural language processing),
statistics, and other computer science disciplines.
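The step from unstructured text to structured information can be sketched with a few lines of Python. The tweet, the stop-word list and the `extract_features` function are assumptions for illustration; real text analytics tools go much further (language detection, stemming, sentiment, entities).

```python
import re
from collections import Counter

# A very small, hand-made stop-word list (assumption for this example)
STOP_WORDS = {"the", "a", "is", "and", "to", "of", "my", "i", "it", "this"}

def extract_features(text):
    """Turn one raw message into a structured record: hashtags and term counts."""
    lowered = text.lower()
    hashtags = re.findall(r"#(\w+)", lowered)
    without_tags = re.sub(r"#\w+", " ", lowered)
    tokens = re.findall(r"[a-z']+", without_tags)
    terms = Counter(token for token in tokens if token not in STOP_WORDS)
    return {"hashtags": hashtags, "terms": terms}

# Hypothetical tweet
tweet = "I love my new phone, the battery is great! #happy #BigData"
record = extract_features(tweet)
print(record["hashtags"], dict(record["terms"]))
```

Once every message is a record like this, the techniques presented earlier (clustering, Naive Bayes, association rules) can be applied to text.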
19. 6.1 The New Data Management Strategy
• Centralized data processing is no longer efficient
nowadays!
• To deal with Big Data, the idea is to distribute the storage of
data and parallelize the processing of that data across
clusters of computers: the cluster computing infrastructure.
• In cluster computing:
- data files are stored redundantly;
- computations are divided into tasks and run in parallel.
• The redundancy of the data across multiple hard disks is supported
by a new kind of file system called the "Distributed File System"
(DFS), and the parallelism of the processing is achieved through a
new kind of programming model called "MapReduce".
• The most popular (and most mature) implementation of
MapReduce is called "Hadoop". Hadoop comes with
HDFS (the Hadoop Distributed File System).
• Yes, you got it! You can use an implementation of MapReduce to
manage many large-scale data computations in a way that is
tolerant of hardware faults.
(Figure: a cluster computing environment)
(Figure: MapReduce job description)
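To give a feel for the MapReduce programming model, here is a sketch that simulates the three phases (map, shuffle, reduce) on a single machine with the classic word count. It is not Hadoop code: the function names and the two input splits are assumptions, and on a real cluster each map and reduce call would run on a different node.

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit a (word, 1) pair for every word of one input split."""
    return [(word, 1) for word in split.lower().split()]

def shuffle(pairs):
    """Shuffle: bring together all the values emitted for the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all the values of one key into a single result."""
    return key, sum(values)

def run_job(splits):
    # Every split is mapped independently, which is what makes the work parallelizable
    emitted = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(key, values) for key, values in shuffle(emitted).items())

# Hypothetical input, already cut into two splits
splits = ["big data is big", "data is everywhere"]
counts = run_job(splits)
print(counts)
```

The analyst only writes the map and reduce functions; distributing them and recovering from a failed node is the job of the framework.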
20. • Hadoop is a platform that implements
MapReduce and provides a redundant, reliable
and distributed file system optimized for large
files.
• In practice, Hadoop is a set of Java classes
for HDFS types and MapReduce job
management (jobs can also be written in other
programming languages such as Python, C#,
C++, ...).
• These classes allow the analyst to write
functions that extract insight from data
without having to worry about how the code is
distributed and parallelized in the cluster
environment.
• To get the most out of a Hadoop cluster, a set
of technologies and tools has been
developed. This set of tools forms
what is commonly called the Hadoop
Ecosystem.
• The most foundational tools of the Hadoop
Ecosystem are the following: Pig, Hive, HBase,
Sqoop, Zookeeper and Mahout.
6.2 The Hadoop Ecosystem
21. - Pig
Pig is an interactive, script-based data flow
language and execution environment
for Hadoop. Pig provides a data flow
language called Pig Latin that lets you
express a series of operations to apply to
input data to produce output.
- Hive
Hive provides an interactive and batch query
language based on SQL for building
MapReduce jobs. It offers users who know
SQL a simple SQL-like language
called HiveQL.
- HBase
HBase is a distributed, column-oriented
database that uses HDFS as its persistence
store and supports MapReduce and point
queries. It is capable of hosting very large
tables (billions of rows by millions of columns) because it
is layered on Hadoop clusters of commodity
hardware.
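Example of a Pig script: finding the maximum
temperature by year
-- Load the records with a typed schema
records = LOAD 'data/samples.txt' AS (year:chararray,
    temperature:int, quality:int);
-- Keep records whose temperature is not 9999 and whose quality code is 0 or 4
filtered_records = FILTER records BY
    temperature != 9999 AND (quality == 0 OR
    quality == 4);
-- Group the remaining records by year
grouped_records = GROUP filtered_records BY
    year;
-- Compute the maximum temperature within each group
max_temp = FOREACH grouped_records GENERATE
    group, MAX(filtered_records.temperature);
DUMP max_temp;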
The same example written in HiveQL
-- Define a table over tab-delimited text
CREATE TABLE records (year STRING,
    temperature INT, quality INT) ROW FORMAT
    DELIMITED FIELDS TERMINATED BY '\t';
-- Load the local file into the table, replacing any existing rows
LOAD DATA LOCAL INPATH 'data/sample.txt'
    OVERWRITE INTO TABLE records;
-- Maximum temperature per year after filtering
SELECT year, MAX(temperature) FROM records
    WHERE temperature != 9999 AND (quality = 0
    OR quality = 1) GROUP BY year;
22. - Sqoop
Sqoop (SQL-to-Hadoop) efficiently transfers data
between Hadoop HDFS and structured relational
databases, in both directions. Think of Sqoop as the
ETL (Extract - Transform - Load) tool of a Hadoop
environment.
- Zookeeper
Zookeeper provides a distributed configuration
service, a synchronization service and a naming
registry for distributed applications. Zookeeper is
Hadoop’s way of coordinating all the elements of
these distributed applications.
- Mahout
Mahout is a scalable machine learning and data
mining library for Hadoop. Think of Mahout as the
analytics software of a Hadoop environment.
Mahout provides data mining and machine
learning algorithms packaged in Java libraries to
perform four types of analysis in a Hadoop
environment: recommendation mining,
classification, clustering and association rules.
24. Big Data adoption: now or later?
The answer to this question lies in integrating and operationalizing analytics as an integral part
of the organization's business processes. This supposes that the organization is data-driven. The big data approach is
best suited to addressing or solving business problems that meet one or more of the following
criteria:
1. Data throttling
2. Computation-restricted throttling
3. Large data volumes
4. Significant data variety
5. Benefits from data parallelization
25. What Should I Remember?
• Even if we have always had a lot of data, the difference today is that significantly more of it
exists, and it varies in type and timeliness. To cope with this problem, you have to think
about managing data differently. That is where "Big Data" comes in.
• Big Data is the name given to the data management challenges and opportunities that
emerge when dealing with data that is extremely large in volume, has extremely high
velocity and is extremely wide in variety.
• Big Data without Analytics is just data.
• Just because you have insights doesn’t guarantee you have the power to act on them.
• Not every problem is suitable for Big Data.
• MapReduce is a programming model that allows you to manage large-scale data computations
in a way that is tolerant of hardware faults.
• Hadoop is a platform that implements MapReduce and provides a redundant, reliable and
distributed file system optimized for large files.
26. Some Big Data Providers
Here are some Big Data providers I personally know. There are others.
- Cloudera, with the first commercial distribution of Hadoop
- HortonWorks, with its commercial distribution of Hadoop
- SAS Institute with its SAS on Hadoop platform, SAS High Performance Suite, SAS Grid
Computing and SAS Visual Analytics
- HP with its platform called HP Vertica
- EMC with its platform called GreenPlum Pivotal
27. Bibliography & Resources
http://www.cisjournal.org/archive/vol2no4/vol2no4_1.pdf
Hybrid Recommender System Using Naive Bayes Classifier and Collaborative Filtering
http://eprints.ecs.soton.ac.uk/18483/
Online applications : http://www.convo.co.uk/x02/
http://mahout.apache.org/
EMC Data Science & Big Data Analytics Training Module
https://education.emc.com/guest/campaign/data_science.aspx
SAS Official Predictive Modeling Training Course
https://support.sas.com/edu/schedules.html?id=1366&ctry=us
https://support.sas.com/edu/schedules.html?id=1220&ctry=US
Big Data for Dummies by Judith Hurwitz, Alan NUGENT, Dr. Fern Halper, Marcia Kaufman
ISBN : 978-1-118-50422-2 www.wiley.com
Gartner : http://www.gartner.com/it-glossary/big-data/
The Harvard Business Review :
https://hbr.org/2013/12/you-may-not-need-big-data-after-all/ar/1
MapReduce: Simplified Data Processing on Large Clusters (from Google)
http://static.googleusercontent.com/media/research.google.com/fr//archive/mapreduce-osdi04.pdf
Hadoop Apache Foundation
http://hadoop.apache.org/
TDWI : http://tdwi.org/
28. About Me
• I am a freelance consultant who helps organisations leverage their data to improve their performance
through the right tool, the right methodology and the right technology. I have over 3 years of
experience and 5 certifications. I am a highly certified SAS Professional and also a certified EMC²
Data Scientist.
Contact
Mail : jvc35@yahoo.fr
Twitter : @Juvenal_JVC
Linkedin : http://fr.linkedin.com/pub/juv%C3%A9nal-chokogoue/52/965/a8
(Figure: Data, Information, Knowledge, Actionable plans, Performance)
29. Thank you for attending. I sincerely hope
this module will be helpful to you!
The full version will be available soon!