Evolution of Predictive
Analytics
What is Predictive Analytics?
Predictive analytics is the practice of extracting
insights from an existing data set with the help of
data mining, statistical modeling and machine
learning techniques, and using those insights to predict
unobserved or unknown events.
• Identifying cause-effect relationships across the
variables from the historical data.
• Discovering hidden insights and patterns with
the help of data mining techniques.
• Applying observed patterns to unknowns in the
past, present or future.
What is Predictive Analytics
Predictive Analytics Process Cycle
Analytics & Predictive Analytics
• Analytics is the study of existing
(retrospective) data with the goal of
understanding trends via comparison
• Developing analytics is the first step towards
deriving predictive analytics
• Predictive analytics are more sophisticated analytics
that are “forward thinking” in nature
• They are used for gaining insights from mathematical
and/or financial modeling by enhancing
understanding, interpretation and judgment for
the purpose of good decision making
© 2011 Predictive Dashboards LLC 7
Comparative Study: Analytics and Predictive Analytics
Attribute | Analytics | Predictive Analytics
Purpose | Understand the past; observe trends; catalyst for discussion | Gain insights; make decisions; take action
View | Historical and current | Future oriented
Metrics type | Lagging indicators | Leading indicators
Data used | Raw & compiled | Information
Data type | Structured | Structured and unstructured
Users | Middle & senior mgt; analysts, end users | C-level & senior mgt; strategists, analysts, mgrs
Benefits | Gaining an understanding of data; productivity improvements | Gaining information & insights; process improvements
Benefits
Benefits of analytics:
• Productivity gains through improved data-gathering processes
• Less time required for producing reports and metrics
• Beneficial, but not scalable and not repeatable
Benefits of predictive analytics:
• Process improvement gains through improved revenue generation and cost structures
• Enhanced decision making
• Beneficial, scalable and repeatable
Common Predictive Analytics
• Regression:
Predicting an output variable using its cause-effect
relationship with input variables. OLS regression, GLM,
random forests, ANN, etc.
• Classification:
Predicting the class of an item. Decision tree, logistic
regression, ANN, SVM, Naïve Bayes classifier, etc.
• Time Series Forecasting:
Predicting future values of a series given its past history. AR,
MA, ARIMA, triple exponential smoothing (Holt-Winters), etc.
Contd.,
Common Predictive Analytics
• Association rule mining:
Mining items that occur together. Apriori algorithm.
• Clustering:
Finding natural groups or clusters in the data. K-means,
hierarchical, spectral, density-based and EM-algorithm
clustering, etc.
• Text mining:
Modeling and structuring the information content of
textual sources. Sentiment analysis, NLP.
Regression
• Three common types of regression model:
– linear regression
– polynomial or multiple regression
– logistic regression or log-linear regression
Linear Regression
• The simplest form of regression to visualize is
linear regression with a single predictor. A
linear regression technique can be used if the
relationship between x and y can be
approximated with a straight line
Y = α + βX
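As a rough illustration of fitting that straight line, the sketch below estimates the intercept and slope by ordinary least squares for a single predictor. The function name `fit_line` and the toy data are assumptions made for this example, not part of the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

With noisy data the same function returns the line that minimises the sum of squared vertical errors.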
Nonlinear Regression
• Often the relationship between x and y cannot be
approximated with a straight line. In this case,
a nonlinear regression technique may be used.
Alternatively, the data could be preprocessed
to make the relationship linear.
Multivariate Regression
• Multivariate regression refers to regression
with multiple predictors (x1, x2, ..., xn). For
purposes of illustration, with two predictors:
Y = b0 + b1 X1 + b2 X2
Decision Tree Induction
• A decision tree is a structure that includes a
root node, branches, and leaf nodes.
• Each internal node denotes a test on an
attribute,
• Each branch denotes the outcome of a test,
and
• Each leaf node holds a class label.
• The topmost node in the tree is the root node.
Balanced Decision Tree
Decision Tree Issues
• Choosing Splitting Attributes
• Ordering of Splitting Attributes
• Splits (number of splits to take)
• Tree structure (a balanced tree with few levels is
preferred)
• Stopping criteria
– Growth stops once the training data is perfectly classified
– Growing that far risks overfitting; stopping earlier reduces it
• Training data
– Neither too small nor too big
• Pruning – improve the tree by
removing sub-trees where needed
Entropy
• Entropy H(S) is a measure of the disorder, or amount
of uncertainty, in a data set S:
H(S) = -\sum_{x \in X} p(x) \log_2 p(x)
• S – the current data set for which entropy is
being calculated (it changes at every iteration of
the ID3 algorithm)
• X – the set of classes in S
• p(x) – the ratio of the number of elements in
class x to the number of elements in
set S
Here the partition is on colour, not on shape. After the partition we get
24 stars and 25 diamonds, i.e., 24 + 25 = 49
Entropy
• Entropy is calculated for all stars and all
diamonds
Entropy
Entropy
Information Gain
• Information gain is a measure of the decrease
of disorder achieved by partitioning the
original data set
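A minimal sketch of both quantities, assuming class labels are held in plain Python lists. The function names and the toy star/diamond labels are illustrative assumptions, and the partition is passed in as a list of label groups.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """H(S): minus the sum over classes of p(x) * log2 p(x)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(labels, partition):
    """Entropy of the whole set minus the size-weighted entropy of its parts."""
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in partition)
    return entropy(labels) - remainder


# Hypothetical data set: four stars and four diamonds.
labels = ["star"] * 4 + ["diamond"] * 4
perfect_split = [["star"] * 4, ["diamond"] * 4]
mixed_split = [["star", "star", "star", "diamond"],
               ["star", "diamond", "diamond", "diamond"]]

print(entropy(labels))                          # maximal disorder for two balanced classes
print(information_gain(labels, perfect_split))  # all disorder removed
print(information_gain(labels, mixed_split))    # only part of the disorder removed
```

A split that separates the classes completely recovers the full entropy of the original set as gain, while a split that leaves the classes mixed recovers only part of it.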
CART
• Creates a binary tree
• Uses entropy
• Formula to choose split point, s, for node t:
\Phi(s \mid t) = 2 P_L P_R \sum_{j} \left| P(C_j \mid t_L) - P(C_j \mid t_R) \right|
• P_L, P_R: probability that a tuple in the training set will
be on the left or right side of the tree.
CART Example
• At the start, there are six choices for split
point (right branch on equality); the measure Φ for each is:
– Φ(Gender) = 2(6/15)(9/15)(2/15 + 4/15 + 3/15) = 0.288
– Φ(1.6) = 0
– Φ(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169
– Φ(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385
– Φ(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.288
– Φ(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32
• Split at 1.8, which has the largest value
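The arithmetic in this example can be reproduced with a short sketch. It assumes the split measure is twice P_L times P_R times the sum of the per-class differences, which is the pattern every line of the example follows. The function name `split_goodness` is hypothetical, and the fractions are copied from the example.

```python
from fractions import Fraction as F


def split_goodness(p_left, p_right, class_diffs):
    """2 * P_L * P_R * (sum of per-class differences between the two sides)."""
    return 2 * p_left * p_right * sum(class_diffs)


# Fractions as written in the example; the 1.6 split puts nothing on one side.
candidates = {
    "Gender": split_goodness(F(6, 15), F(9, 15), [F(2, 15), F(4, 15), F(3, 15)]),
    "1.6": F(0),
    "1.7": split_goodness(F(2, 15), F(13, 15), [F(0), F(8, 15), F(3, 15)]),
    "1.8": split_goodness(F(5, 15), F(10, 15), [F(4, 15), F(6, 15), F(3, 15)]),
    "1.9": split_goodness(F(9, 15), F(6, 15), [F(4, 15), F(2, 15), F(3, 15)]),
    "2.0": split_goodness(F(12, 15), F(3, 15), [F(4, 15), F(8, 15), F(3, 15)]),
}

best = max(candidates, key=candidates.get)
print(best)
```

Exact fractions are used so that rounding cannot change which split point wins.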
What is a Neural Network
• A computer system modeled on the human
brain and nervous system
• A NN is usually organized in layers
• Layers are made up of a number of
interconnected nodes
• The connection strengths between neurons are
called weights; they store the
acquired information (from the training examples)
Contd..Neural Networks
• During the learning process the weights are
modified in order to model the particular
learning task correctly on the training
examples.
Propagation
[Figure: a tuple enters at the input layer and is propagated through the network to the output]
CLASSIFICATION
(RULE BASED)
Classification Using Rules
• Perform classification using If-Then rules
• Classification Rule: r = <a,c>
Antecedent, Consequent
• Rules may be generated from other techniques
(DT, NN) or generated directly.
• Algorithms: Gen, RX, 1R, PRISM
Generating Rules Example
CLASSIFICATION TREES
(ENSEMBLED METHODS)
Ensemble Methods
• In statistical mechanics, an ensemble is a collection of a large number
of replicas (mental or virtual
copies) of a system, each in a possible microstate
under the same macroscopic conditions
or
• In machine learning, ensembling is a technique that combines two
or more algorithms of similar or dissimilar
types, called base learners,
• and incorporates the predictions from all the base
learners
Contd.,
• Eg: The decision on an interview candidate depends on
the feedback of all the interviewers across
the various rounds
Ensemble Methods
• Types of Ensembling:
– 3 types
– Averaging
• Taking the average of the predictions from the
models in a regression problem
• Averaging the predicted probabilities in a classification
problem
Eg:
Ensemble Methods
Contd.,
• Types of Ensembling
– Majority vote
• Taking the prediction with the most votes
(recommendations) from multiple models
Eg:
Ensemble Methods
Contd.,
• Types of Ensembling
– Weighted average:
• Different weights are applied to the predictions
from multiple models
• The weighted average is then taken, which gives
higher or lower importance to a specific
model in the output
• Eg:
Ensemble Methods
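The three ways of combining predictions can be sketched in a few lines. The function names and the example predictions are assumptions made for illustration. In `majority_vote`, a tie is resolved in favour of the label seen first.

```python
from collections import Counter


def average(predictions):
    """Averaging: mean of the numeric predictions (or predicted probabilities)."""
    return sum(predictions) / len(predictions)


def majority_vote(predictions):
    """Majority vote: the class label predicted by the most models."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_average(predictions, weights):
    """Weighted average: models with larger weights count for more."""
    return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)


# Hypothetical outputs of three base learners.
print(average([0.25, 0.5, 0.75]))
print(majority_vote(["yes", "no", "yes"]))
print(weighted_average([1.0, 3.0], [3, 1]))
```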
Ensemble Modeling Techniques
• Mostly 3 Techniques are used. They are
–Bagging
–Boosting
–Stacking
Contd.,
Ensemble Modeling Techniques
• Mostly 3 Techniques are used. They are
–Bagging
• Also referred to as bootstrap aggregation
• Bootstrap or bootstrapping
– It is a sampling technique
– We choose ‘n’ observations from the
original dataset, with replacement
– Each row of the dataset has the same
probability of being selected in each
iteration
Contd.,
Ensemble Modeling Techniques
Contd.,
• For the bootstrap
sample we
choose one row at a time
• Here we have
selected Row 2
Ensemble Modeling Techniques
Contd.,
• Row 2 remains in the
data even after
it has been selected into the
bootstrapped
sample
• Row 1 is then
selected from the data
into the bootstrapped
sample
Ensemble Modeling Techniques
Contd.,
• Now the bootstrapped samples are ready for
growing trees.
• The trees grown on these samples are combined by majority vote or
averaging to get the final prediction
• Bagging is mainly used to reduce variance
– Eg: Random Forest
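A minimal sketch of the bootstrap step, assuming the data is a list of rows and that `n` equals the size of the original data. The row names and the seed are illustrative. Because sampling is with replacement, a row can appear more than once in a sample or not at all.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement; every row is equally likely on each draw."""
    n = len(rows)
    return [rows[rng.randrange(n)] for _ in range(n)]


rows = ["Row1", "Row2", "Row3", "Row4", "Row5"]
rng = random.Random(0)  # fixed seed only so the sketch is repeatable

# One bootstrapped sample per tree that will be grown.
samples = [bootstrap_sample(rows, rng) for _ in range(3)]
for sample in samples:
    print(sample)
```

Each sample would then be used to grow one tree, and the trees' predictions combined by majority vote or averaging.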
Ensemble Modeling Techniques
• Mostly 3 Techniques are used. They are
–Boosting
• The first algorithm is trained on the entire
dataset.
• Subsequent algorithms are built by
fitting the residuals of the previous algorithm,
• which gives higher weight to the observations
that the previous model predicted poorly;
this mainly reduces bias (underfitting)
Contd.,
Ensemble Modeling Techniques
– Eg: XGBoost, GBM, AdaBoost, etc.
• F1(x) = F0(x) + λ0 h0(x) = 5008.3 + 0.5 × 4991.6 = 7504.1
• So this person earns $7504.1 per month according to
our model.
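The update F1 = F0 + λ0·h0 can be repeated round after round, each time fitting a new weak learner to the current residuals. The sketch below assumes squared-error loss, a single numeric feature and one-split regression stumps as base learners. All names and the toy data are illustrative, and this is not how XGBoost or GBM are implemented internally.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump: one threshold, one constant per side."""
    best = None
    for t in sorted(set(xs))[1:]:  # thresholds that leave both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm


def boost(xs, ys, rounds=20, rate=0.5):
    """F0 is the mean of y; each round adds rate * (stump fitted to the residuals)."""
    base = sum(ys) / len(ys)
    learners = []
    preds = [base] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        h = fit_stump(xs, residuals)
        learners.append(h)
        preds = [p + rate * h(x) for p, x in zip(preds, xs)]
    return base, lambda x: base + rate * sum(h(x) for h in learners)


def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training data.
xs = [1, 2, 3, 4, 5, 6]
ys = [10.0, 12.0, 11.0, 30.0, 31.0, 29.0]
base, model = boost(xs, ys)
print(mse(lambda x: base, xs, ys), mse(model, xs, ys))

# The single update step from the slide, with its numbers.
f1 = 5008.3 + 0.5 * 4991.6
```

The learning rate (0.5 here, as in the slide's λ0) shrinks each correction so that no single weak learner dominates.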
Ensemble Modeling Techniques
• Third Technique
– Stacking has two layers of machine
learning models
Contd.,
• d1, d2, d3 receive the
original input
features (x)
• The top layer f() takes the
outputs of the bottom-layer
models (d1, d2, d3) and
predicts the output
(y)
Ensemble Modeling Techniques
• Key Principles for selecting models
– The individual models fulfill particular accuracy
criteria.
– The model predictions of various individual
models are not highly correlated with the
predictions of other models
– Note: This top layer model can also be
replaced by many other simpler formulas like:
• Averaging
• Majority vote
• Weighted Average
Contd.,
Ensemble Modeling Techniques
• Advantages
– Proven method of improving accuracy
– Key ingredient for winning almost all
machine learning hackathons (large
gatherings where programmers code intensively
over a short period of time)
– Makes the model more robust and stable, with decent
performance on the test cases
– Ensembling can capture both
simple linear and complex non-linear
relationships in data
Contd.,
Ensemble Modeling Techniques
• Disadvantages
–Reduces model interpretability
–Difficult to draw crucial business insights
at the end
–Time consuming activity, not suitable for
real time scenarios
–Selection of ensemble model is difficult
Contd.,
Association Rules
Association Rules
• Association rule mining:
– “finding
– frequent patterns,
– associations,
– correlations, or causal structures
– among sets of items or objects in
transactional/relational databases or any
other informational repository”
• Applications:
– Cross-marketing
– Catalogue design
Contd.,
What is Association Rule
• Examples
• {bread} ⇒ {milk}
• {soda} ⇒ {chips}
• {bread} ⇒ {jam}
Given a set of transactions, find rules that will
predict the occurrence of an item based on the
occurrences of other items in the transaction
Contd.,
Support and Confidence
What is Association Rule
Contd.,
Goal of Association
Contd.,
• For a given set of transactions T, the goal of
association mining is to find the rules
having
– Support ≥ minsup threshold
– Confidence ≥ minconf threshold
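Support and confidence can be computed directly from the transactions. The sketch below uses the usual definitions: support is the fraction of transactions that contain an itemset, and confidence is the support of the whole rule divided by the support of its antecedent. The transactions are made up for illustration, reusing the item names from the examples above.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "jam"},
    {"bread", "milk", "jam"},
    {"soda", "chips"},
]

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

A rule is kept only if both values reach their thresholds, minsup and minconf.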
Association Rule Mining
Contd.,
Association Rule Techniques
Two step approach
• Find Large Itemsets or Frequent
itemset generation
• Generate all itemsets whose support ≥
minsup
• Rule generation
• Generate rules from frequent itemsets.
• Each rule is a binary partitioning of a
frequent item set
• Note: frequent item set generation is
computationally expensive
Contd.,
Algorithm to Generate ARs
Contd.,
Apriori Algorithm
1. C1 = Itemsets of size one in I;
2. Determine all large itemsets of size 1, L1;
3. i = 1;
4. Repeat
5. i = i + 1;
6. Ci = Apriori-Gen(Li-1);
7. Count Ci to determine Li;
8. until no more large itemsets found;
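The loop above could be sketched in Python roughly as follows. It assumes transactions are sets of items and that `minsup` is an absolute count. The candidate-generation step is one simple way to realise Apriori-Gen (join the large itemsets with each other, then keep only candidates whose subsets are all large) and is not necessarily the exact procedure the slides have in mind.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return {itemset: support count} for every itemset with count >= minsup."""
    items = {item for t in transactions for item in t}
    candidates = [frozenset([i]) for i in items]  # C1
    large = {}
    size = 1
    while candidates:
        # Count Ci to determine Li.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= minsup}
        large.update(current)
        size += 1
        # Candidate generation: join the large itemsets, then prune any
        # candidate that has a subset which is not itself large.
        joined = {a | b for a in current for b in current if len(a | b) == size}
        candidates = [
            c for c in joined
            if all(frozenset(s) in current for s in combinations(c, size - 1))
        ]
    return large


# Hypothetical transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "jam"},
    {"bread", "milk", "jam"},
    {"soda", "chips"},
]
large_itemsets = apriori(transactions, minsup=2)
for itemset, count in large_itemsets.items():
    print(sorted(itemset), count)
```

The pruning step relies on the large itemset property: any subset of a large itemset must itself be large.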
Contd.,
Working of Apriori Principle
Contd.,
Apriori
Advantages/Disadvantages
• Advantages:
– Uses large itemset property.
– Easily parallelized
– Easy to implement.
• Disadvantages:
– Assumes transaction database is memory
resident.
– Requires up to m database scans.
Sequence Rules
Sequence Rules
• For a database D, finding the maximal
sequences among the ‘n’ sequences that
have a certain user-specified minimum
support and confidence
Segmentation
Segmentation
• For a database D, finding the maximal
sequences among the ‘n’ sequences that
have a certain user-specified minimum
support and confidence
Clustering Approaches
Clustering
• Hierarchical: Agglomerative, Divisive
• Partitional
• Categorical
• Large DB: Sampling, Compression
Types of Clustering
• Hierarchical – Nested set of clusters created.
• Partitional – One set of clusters created.
• Incremental – Each element handled one at a
time.
• Simultaneous – All elements handled
together.
• Overlapping/Non-overlapping
Hierarchical Clustering
• Build a tree-based hierarchical taxonomy
(dendrogram) from a set of unlabeled examples
• Recursive application of a standard clustering
algorithm can produce a hierarchical clustering
animal
• vertebrate: fish, reptile, amphib., mammal
• invertebrate: worm, insect, crustacean
Hierarchical Clustering
• Agglomerative
– Bottom up approach
– Start with single-instance clusters
– At each step join the two closest clusters
– Design Decision: distance between clusters
• Eg: two closest instances in clusters vs distance
between means
• Divisive (deglomerative)
– Top Down Approach
– Start with one universal cluster
– Find two clusters
– Proceed recursively on each subset
– Can be very fast
Both produce a Dendrogram
[Figure: dendrogram, read top down for the divisive approach and bottom up for the agglomerative approach]
Hierarchical Agglomerative Clustering
(HAC) or Agglomerative Clustering
• Starts with each doc in a separate cluster
–then repeatedly joins the closest pair of
clusters, until there is only one cluster.
• The history of merging forms a binary tree
or hierarchy.
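A small sketch of that merging loop, assuming one-dimensional numeric points in place of documents and single-link distance (the two closest instances) as the design decision. The function names and data are illustrative.

```python
def single_link(a, b):
    """Distance between clusters = distance between their two closest members."""
    return min(abs(x - y) for x in a for y in b)


def agglomerate(points):
    """Start with one cluster per point; repeatedly join the closest pair."""
    clusters = [[p] for p in points]
    merges = []  # history of merges, which forms the binary tree
    while len(clusters) > 1:
        pairs = [
            (single_link(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        _, i, j = min(pairs)
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical data.
points = [1.0, 1.5, 5.0, 5.2, 9.0]
merges = agglomerate(points)
for left, right in merges:
    print(left, "+", right)
```

With n points there are always n - 1 merges, and reading the history from last to first gives the dendrogram from the root down.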
K-Means Algorithm
K – Means Clustering
Now group this data into two clusters
K – Means Clustering
• Randomly initialize
two points, called
centroids
• Colour each data point
red or blue
according to its
nearest
centroid
• Move the centroids
K – Means Clustering
• Calculate the mean
of the points of
each colour, i.e.,
red and blue, and
move each centroid to that mean
• Reassign each point to
the closest centroid
K – Means Clustering
K-means clustering
• Calculate the
average of the blue
points and of the red
points and move
the centroids again
K-means Clustering
• Recalculate the
averages of the
blue and red
points, then move
the centroids again
• Continue iterating
these steps
• At some point the
cluster centroids
will no longer change
K-Means
• An initial set of clusters is chosen randomly.
• Iteratively, items are moved among the
clusters until the desired set is reached.
• A high degree of similarity among the elements
in a cluster is obtained.
• Given a cluster Ki = {ti1, ti2, …, tim}, the cluster
mean is mi = (1/m)(ti1 + … + tim)
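The iteration can be sketched as below. To keep the example repeatable, the starting centroids are passed in rather than chosen at random. This is an assumption of the sketch, as are the function names and the toy 2-D points.

```python
def mean(points):
    """Cluster mean: coordinate-wise average of the points in the cluster."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its old centroid.
        new_centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data with two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

In practice the result depends on the initial centroids, so the algorithm is often run several times from different random starts.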
Social Network Analysis
(SNA)
Social network analysis is:
• a set of relational methods for systematically
understanding and identifying connections among actors
Introduction
• Actors (nodes, points, vertices):
- Individuals, Organizations, Events …
• Relations (lines, arcs, edges, ties): between pairs of actors.
- Undirected (symmetric) / Directed (asymmetric)
- Binary / Valued
Basic concepts
Network Components
1) Egocentered Networks
• Data on a respondent (ego) and the people they are connected
to.
Measures:
Size
Types of relations
Basic concepts
Types of network data:
Def: An ego-centered network, or egonet,
represents the one-hop neighbourhood of
the node of interest
Or
An egonet consists of a particular node and its
immediate neighbours
2) Complete Networks
• Connections among all members of
a population.
• Data on all actors within a
particular (relevant) boundary.
• Never exactly complete (due to
missing data), but boundaries are set
• Ex: Friendships among workers in a company.
Measures:
Graph properties
Density
Sub-groups
Positions
Background
Types of network data:
The unit of interest in a network is the combined set of
actors and their relations.
We represent actors with points and relations with lines.
Example:
Social Network data
[Figure: example network with actors a, b, c, d and e]
In general, a relation can be:
Undirected / Directed
Binary / Valued
[Figure: four example networks on actors a to e, one per relation type: undirected binary, directed binary, undirected valued, directed valued]
Social Network data
From pictures to matrices
[Figure: adjacency matrices, with rows and columns a to e, for the undirected binary and the directed binary network; a 1 marks a tie between the row actor and the column actor]
Basic Data Structures
Social Network data
Indirect connections are what make networks
systems. One actor can reach another if there is
a path in the graph connecting them.
[Figure: example network with actors a to f]
Connectivity
Measuring Networks
Distance is measured by the (weighted) number of
relations separating a pair, using the shortest path.
Actor “a” is:
1 step from 4 actors
2 steps from 5 actors
3 steps from 4 actors
4 steps from 3 actors
5 steps from 1 actor
Measuring Networks
An information network: email exchanges within the
Reagan White House, early 1980s
(source: Blanton, 1995)
Measuring Networks
Centrality refers to (one dimension of) location, identifying
where an actor resides in a network.
Centrality
Measuring Networks
Centrality is fairly straightforward: we want to
identify which nodes are in the ‘center’ of the
network, in the sense that they have many and
important connections.
Three standard centrality measures capture a wide
range of “importance” in a network:
Degree
Closeness
Betweenness
The most intuitive notion of centrality focuses on
degree. Degree is the number of lines an actor has, and the
actor with the most lines is the most important:
Centrality
Measuring Networks
Centrality
Measuring Networks
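Degree Centrality:
C_D(p_k) = \sum_{i=1}^{n} a(p_i, p_k)
where a(p_i, p_k) = 1 if actors p_i and p_k are connected by a line, and 0 otherwise.
Relative measure of Degree Centrality:
C'_D(p_k) = \frac{\sum_{i=1}^{n} a(p_i, p_k)}{n - 1}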
A second measure is closeness centrality. An actor
is considered important if he/she is relatively close to all
other actors.
Closeness is based on the inverse of the
distance of each actor to every other actor in the
network.
Closeness Centrality:
C_C(p_k) = \left[ \sum_{i=1}^{n} d(p_i, p_k) \right]^{-1}
Relative Closeness Centrality:
C'_C(p_k) = \left[ \frac{\sum_{i=1}^{n} d(p_i, p_k)}{n - 1} \right]^{-1} = \frac{n - 1}{\sum_{i=1}^{n} d(p_i, p_k)}
where d(p_i, p_k) is the length of the shortest path between p_i and p_k.
Closeness Centrality
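Degree and closeness centrality can be computed on a small example as sketched below. The sketch assumes an undirected, binary, connected network stored as a dictionary that maps each actor to the set of its neighbours. The star-shaped example network and all function names are made up for illustration.

```python
from collections import deque


def distances_from(graph, source):
    """Shortest-path length from source to every reachable actor (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def degree_centrality(graph, k, relative=False):
    """Number of lines at actor k; the relative form divides by n - 1."""
    degree = len(graph[k])
    return degree / (len(graph) - 1) if relative else degree


def closeness_centrality(graph, k, relative=False):
    """Inverse of the summed distances from k; the relative form multiplies by n - 1."""
    total = sum(distances_from(graph, k).values())
    return (len(graph) - 1) / total if relative else 1 / total


# Hypothetical star network: "a" is tied to everyone, the others only to "a".
star = {
    "a": {"b", "c", "d", "e"},
    "b": {"a"}, "c": {"a"}, "d": {"a"}, "e": {"a"},
}
print(degree_centrality(star, "a"), degree_centrality(star, "a", relative=True))
print(closeness_centrality(star, "a", relative=True))
print(closeness_centrality(star, "b", relative=True))
```

In the star, the centre scores the maximum on both relative measures, while each outer actor is two steps from the other outer actors and so scores lower on closeness.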
Centrality
Measuring Networks
Betweenness Centrality:
Model based on communication flow: A person who lies
on communication paths can control communication flow, and is
thus important. Betweenness centrality counts the number of
shortest paths between i and k that actor j resides on.
[Figure: example network with actors a to h]
Centrality
Measuring Networks
Centrality
Measuring Networks
Betweenness centrality can be defined in terms of probability (1/g_{ij}):
C_B(p_k) = \sum_{i<j} b_{ij}(p_k), \quad b_{ij}(p_k) = \frac{1}{g_{ij}} \cdot g_{ij}(p_k) = \frac{g_{ij}(p_k)}{g_{ij}}
g_{ij} = number of geodesics that link actors p_i and p_j.
g_{ij}(p_k) = number of geodesics that link p_i and p_j and contain p_k.
b_{ij}(p_k) = probability that actor p_k is on a geodesic chosen at random among the
ones that join p_i and p_j.
Betweenness centrality is the sum of these probabilities (Freeman, 1979).
Normalized: C'_B(p_k) = C_B(p_k) / [(n-1)(n-2)/2]
Betweenness Centrality:
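A direct, unoptimised sketch of this definition follows. For every pair of other actors it adds the share of their geodesics that pass through actor k. It assumes an undirected, binary network stored as a dictionary of neighbour sets; the names and the star-shaped example are illustrative. Faster algorithms exist for large networks, and this version is only meant to mirror the formula.

```python
from collections import deque
from itertools import combinations


def shortest_path_counts(graph, source):
    """BFS from source: distance to, and number of geodesics to, each reachable actor."""
    dist, count = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                count[nb] = 0
                queue.append(nb)
            if dist[nb] == dist[node] + 1:
                count[nb] += count[node]  # every geodesic to node extends to nb
    return dist, count


def betweenness(graph, k, normalized=False):
    """Sum over pairs (i, j) of the fraction of i-j geodesics that contain k."""
    info = {v: shortest_path_counts(graph, v) for v in graph}
    dist_k, count_k = info[k]
    others = [v for v in graph if v != k]
    total = 0.0
    for i, j in combinations(others, 2):
        dist_i, count_i = info[i]
        if j not in dist_i or k not in dist_i:
            continue  # i cannot reach j or k
        if dist_i[k] + dist_k[j] == dist_i[j]:  # k lies on some i-j geodesic
            total += count_i[k] * count_k[j] / count_i[j]
    n = len(graph)
    return total / ((n - 1) * (n - 2) / 2) if normalized else total


# Hypothetical star network: every geodesic between outer actors passes through "a".
star = {
    "a": {"b", "c", "d", "e"},
    "b": {"a"}, "c": {"a"}, "d": {"a"}, "e": {"a"},
}
print(betweenness(star, "a"), betweenness(star, "a", normalized=True))
print(betweenness(star, "b"))
```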
Centrality
Measuring Networks
If we want to measure the degree to which the graph as a whole is centralized, we look
at the dispersion of centrality:
Freeman’s general formula for centralization (which ranges from 0 to 1):
C_D = \frac{\sum_{i=1}^{n} \left[ C_D(p^*) - C_D(p_i) \right]}{(n-1)(n-2)}
where C_D(p^*) is the largest degree centrality in the network.
Centralization
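As a quick check of the two extremes, the sketch below computes degree centralization for a star and for a ring. It assumes an undirected, binary network with at least three actors, stored as a dictionary of neighbour sets. Both example networks and the function name are made up.

```python
def degree_centralization(graph):
    """Sum of (largest degree - each degree), divided by (n - 1)(n - 2)."""
    degrees = [len(neighbours) for neighbours in graph.values()]
    n = len(degrees)
    largest = max(degrees)
    return sum(largest - d for d in degrees) / ((n - 1) * (n - 2))


# Hypothetical star: one actor holds all the ties.
star = {
    "a": {"b", "c", "d", "e"},
    "b": {"a"}, "c": {"a"}, "d": {"a"}, "e": {"a"},
}
# Hypothetical ring: every actor has exactly two ties.
ring = {
    "a": {"b", "e"}, "b": {"a", "c"}, "c": {"b", "d"},
    "d": {"c", "e"}, "e": {"d", "a"},
}
print(degree_centralization(star), degree_centralization(ring))
```

The star is maximally centralized and the ring is not centralized at all, which matches the 0 to 1 range of the formula.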
Measuring Networks
Degree Centralization Scores
Freeman: 1.0 Freeman: .02 Freeman: 0.0
Centralization
Measuring Networks
Density
Measuring Networks
The more actors are connected to one another, the denser the network will
be.
Undirected network: n(n-1)/2 possible pairs of actors.
Δ = \frac{L}{n(n-1)/2}
Directed network: n(n-1) possible lines.
Δ_D = \frac{L}{n(n-1)}
where L is the number of lines present in the network.
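Density is simply the number of lines present divided by the number of lines possible. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def density(n, lines, directed=False):
    """Lines present divided by lines possible among n actors."""
    possible = n * (n - 1) if directed else n * (n - 1) / 2
    return lines / possible


# Hypothetical network of 5 actors with 4 lines.
print(density(5, 4))                 # undirected: 10 possible pairs
print(density(5, 4, directed=True))  # directed: 20 possible lines
```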
Freeman: .25 Freeman: .23 Freeman: 0.25
Density
Measuring Networks
UCINET
• The standard network analysis program; runs in Windows
• Good for computing measures of network topography for single
nets
• Data input/output uses a special 2-file format, but it can now
read PAJEK files directly.
• Not optimal for large networks
• Available from:
Analytic Technologies
Social Network Software
PAJEK
• Program for analyzing and plotting very large networks
• Intuitive Windows interface
• Started mainly as a graphics program, but has expanded to a wide range of
analytic capabilities
• Can link to the R statistical package
• Free
• Available from: http://vlado.fmf.uni-lj.si/pub/networks/pajek/
Social Network Software
NetDraw
• Also very new, but by one of the best-known names in
network analysis software.
• Free
Social Network Software

Contenu connexe

Tendances

Introduction to Statistical Machine Learning
Introduction to Statistical Machine LearningIntroduction to Statistical Machine Learning
Introduction to Statistical Machine Learningmahutte
 
Statistics For Data Science | Statistics Using R Programming Language | Hypot...
Statistics For Data Science | Statistics Using R Programming Language | Hypot...Statistics For Data Science | Statistics Using R Programming Language | Hypot...
Statistics For Data Science | Statistics Using R Programming Language | Hypot...Edureka!
 
Model evaluation - machine learning
Model evaluation - machine learningModel evaluation - machine learning
Model evaluation - machine learningSon Phan
 
Stock Market Prediction
Stock Market Prediction Stock Market Prediction
Stock Market Prediction SalmanShezad
 
3.5 Exploratory Data Analysis
3.5 Exploratory Data Analysis3.5 Exploratory Data Analysis
3.5 Exploratory Data Analysismlong24
 
Lesson 4 ar-ma
Lesson 4 ar-maLesson 4 ar-ma
Lesson 4 ar-maankit_ppt
 
Anomaly Detection for Real-World Systems
Anomaly Detection for Real-World SystemsAnomaly Detection for Real-World Systems
Anomaly Detection for Real-World SystemsManojit Nandi
 
Module 3: Linear Regression
Module 3:  Linear RegressionModule 3:  Linear Regression
Module 3: Linear RegressionSara Hooker
 
IRJET- Stock Price Prediction using Long Short Term Memory
IRJET-  	  Stock Price Prediction using Long Short Term MemoryIRJET-  	  Stock Price Prediction using Long Short Term Memory
IRJET- Stock Price Prediction using Long Short Term MemoryIRJET Journal
 
Predictive Analytics - An Overview
Predictive Analytics - An OverviewPredictive Analytics - An Overview
Predictive Analytics - An OverviewMachinePulse
 
Understanding Bagging and Boosting
Understanding Bagging and BoostingUnderstanding Bagging and Boosting
Understanding Bagging and BoostingMohit Rajput
 
Exploratory data analysis
Exploratory data analysis Exploratory data analysis
Exploratory data analysis Peter Reimann
 
Introduction To Predictive Analytics Part I
Introduction To Predictive Analytics   Part IIntroduction To Predictive Analytics   Part I
Introduction To Predictive Analytics Part Ijayroy
 
Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning pyingkodi maran
 
Introduction to Business Analytics Part 1
Introduction to Business Analytics Part 1Introduction to Business Analytics Part 1
Introduction to Business Analytics Part 1Beamsync
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysisGramener
 
Machine learning algorithms
Machine learning algorithmsMachine learning algorithms
Machine learning algorithmsShalitha Suranga
 

Tendances (20)

Introduction to Statistical Machine Learning
Introduction to Statistical Machine LearningIntroduction to Statistical Machine Learning
Introduction to Statistical Machine Learning
 
Statistics for data science
Statistics for data science Statistics for data science
Statistics for data science
 
Statistics For Data Science | Statistics Using R Programming Language | Hypot...
Statistics For Data Science | Statistics Using R Programming Language | Hypot...Statistics For Data Science | Statistics Using R Programming Language | Hypot...
Statistics For Data Science | Statistics Using R Programming Language | Hypot...
 
Model evaluation - machine learning
Model evaluation - machine learningModel evaluation - machine learning
Model evaluation - machine learning
 
Stock Market Prediction
Stock Market Prediction Stock Market Prediction
Stock Market Prediction
 
3.5 Exploratory Data Analysis
3.5 Exploratory Data Analysis3.5 Exploratory Data Analysis
3.5 Exploratory Data Analysis
 
Lesson 4 ar-ma
Lesson 4 ar-maLesson 4 ar-ma
Lesson 4 ar-ma
 
Anomaly Detection for Real-World Systems
Anomaly Detection for Real-World SystemsAnomaly Detection for Real-World Systems
Anomaly Detection for Real-World Systems
 
Module 3: Linear Regression
Module 3:  Linear RegressionModule 3:  Linear Regression
Module 3: Linear Regression
 
IRJET- Stock Price Prediction using Long Short Term Memory
IRJET-  	  Stock Price Prediction using Long Short Term MemoryIRJET-  	  Stock Price Prediction using Long Short Term Memory
IRJET- Stock Price Prediction using Long Short Term Memory
 
Predictive Analytics - An Overview
Predictive Analytics - An OverviewPredictive Analytics - An Overview
Predictive Analytics - An Overview
 
Understanding Bagging and Boosting
Understanding Bagging and BoostingUnderstanding Bagging and Boosting
Understanding Bagging and Boosting
 
Predictive Modelling
Predictive ModellingPredictive Modelling
Predictive Modelling
 
Exploratory data analysis
Exploratory data analysis Exploratory data analysis
Exploratory data analysis
 
Introduction To Predictive Analytics Part I
Introduction To Predictive Analytics   Part IIntroduction To Predictive Analytics   Part I
Introduction To Predictive Analytics Part I
 
Data analytics vs. Data analysis
Data analytics vs. Data analysisData analytics vs. Data analysis
Data analytics vs. Data analysis
 
Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning
 
Introduction to Business Analytics Part 1
Introduction to Business Analytics Part 1Introduction to Business Analytics Part 1
Introduction to Business Analytics Part 1
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysis
 
Machine learning algorithms
Machine learning algorithmsMachine learning algorithms
Machine learning algorithms
 

Similaire à Predictive analytics

ML SFCSE.pptx
ML SFCSE.pptxML SFCSE.pptx
ML SFCSE.pptxNIKHILGR3
 
04 Classification in Data Mining
04 Classification in Data Mining04 Classification in Data Mining
04 Classification in Data MiningValerii Klymchuk
 
Data mining approaches and methods
Data mining approaches and methodsData mining approaches and methods
Data mining approaches and methodssonangrai
 
UNIT 3: Data Warehousing and Data Mining
UNIT 3: Data Warehousing and Data MiningUNIT 3: Data Warehousing and Data Mining
UNIT 3: Data Warehousing and Data MiningNandakumar P
 
Singular Value Decomposition (SVD).pptx
Singular Value Decomposition (SVD).pptxSingular Value Decomposition (SVD).pptx
Singular Value Decomposition (SVD).pptxrajalakshmi5921
 
EDAB Module 5 Singular Value Decomposition (SVD).pptx
EDAB Module 5 Singular Value Decomposition (SVD).pptxEDAB Module 5 Singular Value Decomposition (SVD).pptx
EDAB Module 5 Singular Value Decomposition (SVD).pptxrajalakshmi5921
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learningAmAn Singh
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learningAkshay Kanchan
 
Unit 3 – AIML.pptx
Unit 3 – AIML.pptxUnit 3 – AIML.pptx
Unit 3 – AIML.pptxhiblooms
 
Data Science and Machine Learning with Tensorflow
 Data Science and Machine Learning with Tensorflow Data Science and Machine Learning with Tensorflow
Data Science and Machine Learning with TensorflowShubham Sharma
 
5. Machine Learning.pptx
5.  Machine Learning.pptx5.  Machine Learning.pptx
5. Machine Learning.pptxssuser6654de1
 
Design principle of pattern recognition system and STATISTICAL PATTERN RECOGN...
Design principle of pattern recognition system and STATISTICAL PATTERN RECOGN...Design principle of pattern recognition system and STATISTICAL PATTERN RECOGN...
Design principle of pattern recognition system and STATISTICAL PATTERN RECOGN...TEJVEER SINGH
 
CSA 3702 machine learning module 1
CSA 3702 machine learning module 1CSA 3702 machine learning module 1
CSA 3702 machine learning module 1Nandhini S
 

Similaire à Predictive analytics (20)

ML SFCSE.pptx
ML SFCSE.pptxML SFCSE.pptx
ML SFCSE.pptx
 
04 Classification in Data Mining
04 Classification in Data Mining04 Classification in Data Mining
04 Classification in Data Mining
 
03 Data Mining Techniques
03 Data Mining Techniques03 Data Mining Techniques
03 Data Mining Techniques
 
Data mining approaches and methods
Data mining approaches and methodsData mining approaches and methods
Data mining approaches and methods
 
UNIT 3: Data Warehousing and Data Mining
UNIT 3: Data Warehousing and Data MiningUNIT 3: Data Warehousing and Data Mining
UNIT 3: Data Warehousing and Data Mining
 
Machine learning
Machine learningMachine learning
Machine learning
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Singular Value Decomposition (SVD).pptx
Singular Value Decomposition (SVD).pptxSingular Value Decomposition (SVD).pptx
Singular Value Decomposition (SVD).pptx
 
EDAB Module 5 Singular Value Decomposition (SVD).pptx
EDAB Module 5 Singular Value Decomposition (SVD).pptxEDAB Module 5 Singular Value Decomposition (SVD).pptx
EDAB Module 5 Singular Value Decomposition (SVD).pptx
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learning
 
Unit 3 – AIML.pptx
Unit 3 – AIML.pptxUnit 3 – AIML.pptx
Unit 3 – AIML.pptx
 
Data Science and Machine Learning with Tensorflow
 Data Science and Machine Learning with Tensorflow Data Science and Machine Learning with Tensorflow
Data Science and Machine Learning with Tensorflow
 
Lec 18-19.pptx
Lec 18-19.pptxLec 18-19.pptx
Lec 18-19.pptx
 
5. Machine Learning.pptx
5.  Machine Learning.pptx5.  Machine Learning.pptx
5. Machine Learning.pptx
 
Machine learning meetup
Machine learning meetupMachine learning meetup
Machine learning meetup
 
Design principle of pattern recognition system and STATISTICAL PATTERN RECOGN...
Design principle of pattern recognition system and STATISTICAL PATTERN RECOGN...Design principle of pattern recognition system and STATISTICAL PATTERN RECOGN...
Design principle of pattern recognition system and STATISTICAL PATTERN RECOGN...
 
Primer on major data mining algorithms
Primer on major data mining algorithmsPrimer on major data mining algorithms
Primer on major data mining algorithms
 
Random Forest Decision Tree.pptx
Random Forest Decision Tree.pptxRandom Forest Decision Tree.pptx
Random Forest Decision Tree.pptx
 
CSA 3702 machine learning module 1
CSA 3702 machine learning module 1CSA 3702 machine learning module 1
CSA 3702 machine learning module 1
 

Predictive analytics

  • 1.
  • 3. What is Predictive Analytics? Predictive analytics is the practice of extracting insights from an existing data set with the help of data mining, statistical modeling and machine learning techniques, and using them to predict unobserved or unknown events. It involves: identifying cause-effect relationships across the variables in the historical data; discovering hidden insights and patterns with the help of data mining techniques; and applying the observed patterns to unknowns in the past, present or future.
  • 4. What is Predictive Analytics
  • 6. Analytics & Predictive Analytics • Analytics is the study of existing (retrospective) data with the goal of understanding trends via comparison. • Developing analytics is the first step towards deriving predictive analytics. • Predictive analytics are more sophisticated analytics that are “forward thinking” in nature. • They are used for gaining insights from mathematical and/or financial modeling by enhancing understanding, interpretation and judgment for the purpose of good decision making.
  • 7. Comparative Study: Analytics and Predictive Analytics (© 2011 Predictive Dashboards LLC)
      Attribute | Analytics | Predictive Analytics
      Purpose | Understand the Past; Observe Trends; Catalyst for Discussion | Gain Insights; Make Decisions; Take Action
      View | Historical and Current | Future Oriented
      Metrics Type | Lagging Indicators | Leading Indicators
      Data Used | Raw & Compiled | Information
      Data Type | Structured | Structured and Unstructured
      Users | Middle & Senior Mgt; Analysts, End Users | C-Level & Senior Mgt; Strategists, Analysts, Mgrs
      Benefits | Gaining an understanding of data; Productivity Improvements | Gaining Information & Insights; Process Improvements
  • 8. Benefits • Benefits of analytics: productivity gains through improved data-gathering processes, resulting in less time required for producing reports and metrics; beneficial, but not scalable and not repeatable. • Benefits of predictive analytics: process improvement gains through improved revenue generation and cost structures; enhanced decision making; beneficial, scalable and repeatable.
  • 9. Common Predictive Analytics • Regression: Predicting an output variable using its cause-effect relationship with input variables. OLS Regression, GLM, Random forests, ANN etc. • Classification: Predicting the item class. Decision Tree, Logistic Regression, ANN, SVM, Naïve Bayes classifier etc. • Time Series Forecasting: Predicting future time events given past history. AR, MA, ARIMA, Triple Exponential Smoothing, Holt-Winters etc.
  • 10. Common Predictive Analytics • Association rule mining: Mining items occurring together. Apriori Algorithm. • Clustering: Finding natural groups or clusters in the data. K-means, Hierarchical, Spectral, Density-based, EM algorithm clustering etc. • Text mining: Model and structure the information content of textual sources. Sentiment Analysis, NLP
  • 11. Regression • Three types of regression models are considered: linear regression; polynomial or multiple regression; and logistic regression or log-linear regression.
  • 12. Linear Regression • The simplest form of regression to visualize is linear regression with a single predictor. A linear regression technique can be used if the relationship between x and y can be approximated with a straight line: $Y = \alpha + \beta X$
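As a minimal sketch of the single-predictor case, the straight line can be fitted by ordinary least squares. The function name `fit_line`, the names `alpha` (intercept) and `beta` (slope), and the data points are illustrative assumptions, not taken from the slides.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y ~ alpha + beta * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance(x, y) / variance(x)
    beta = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
        sum((xi - mean_x) ** 2 for xi in x)
    alpha = mean_y - beta * mean_x
    return alpha, beta

# Hypothetical data lying exactly on the line y = 1 + 2x
alpha, beta = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(alpha, beta)
```

With noisy data the same function returns the line that minimises the sum of squared vertical distances to the points.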
  • 13. Nonlinear Regression • When the relationship between x and y cannot be approximated with a straight line, a nonlinear regression technique may be used. Alternatively, the data could be preprocessed to make the relationship linear.
  • 14. Multivariate Regression • Multivariate regression refers to regression with multiple predictors (x1, x2, ..., xn). For illustration, with two predictors: Y = b0 + b1 X1 + b2 X2.
  • 15. Decision Tree Induction • A decision tree is a structure that includes a root node, branches, and leaf nodes. • Each internal node denotes a test on an attribute, • Each branch denotes the outcome of a test, and • Each leaf node holds a class label. • The topmost node in the tree is the root node.
  • 17. Decision Tree Issues • Choosing splitting attributes • Ordering of splitting attributes • Splits (number of splits to take) • Tree structure (a balanced tree with few levels is preferred) • Stopping criteria – tree growth stops once the training data is perfectly classified – stopping earlier helps prevent overfitting • Training data – neither too small nor too big • Pruning – the tree can be improved by removing sub-trees when required
  • 18. Entropy • Entropy is the measure of disorder in a data set, or the amount of uncertainty in the data set: $H(S) = -\sum_{x \in X} p(x) \log_2 p(x)$ • S – the current data set for which entropy is being calculated (changes at every iteration of the ID3 algorithm) • X – the set of classes in S • p(x) – the ratio of the number of elements in class x to the number of elements in S
  • 19. Here our Partition is on Colour not on Shapes . After partition we get
  • 20. After partition 24 stars, 25 Diamonds i.e., 24+25 = 49
  • 21. Entropy • Entropy is calculated for all stars and all diamonds
  • 24.
  • 25. Information Gain • Information gain is a measure of the decrease of disorder achieved by partitioning the original data set
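The two quantities can be sketched as follows, assuming the usual ID3 definitions: entropy as the negative sum of p(x) log2 p(x) over classes, and information gain as the parent entropy minus the size-weighted entropy of the partitions. The toy labels are hypothetical and are not the star/diamond counts from the slides.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, partitions):
    """Decrease in entropy achieved by splitting `labels` into `partitions`."""
    n = len(labels)
    remainder = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(labels) - remainder

# Hypothetical data: a perfectly mixed set that one split separates completely
labels = ["star"] * 4 + ["diamond"] * 4
split = [["star"] * 4, ["diamond"] * 4]
print(entropy(labels), information_gain(labels, split))
```

A split that leaves each partition as mixed as the parent would give a gain of zero, which is why ID3 prefers the attribute with the largest gain.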
  • 26. CART • Creates a binary tree • Uses entropy • Formula to choose split point, s, for node t: $\Phi(s \mid t) = 2 P_L P_R \sum_{j} |P(C_j \mid t_L) - P(C_j \mid t_R)|$ • P_L, P_R: probability that a tuple in the training set will be on the left or right side of the tree.
  • 27. CART Example • At the start, there are six choices for split point (right branch on equality): – P(Gender) = 2(6/15)(9/15)(2/15 + 4/15 + 3/15) = 0.288 – P(1.6) = 0 – P(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169 – P(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385 – P(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.288 – P(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32 • Split at 1.8, which has the largest value
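The arithmetic pattern in the example above (twice the left probability, times the right probability, times a sum of per-class terms) can be checked with a short sketch. The function name `split_measure` is an assumption; the fractions are copied from the slide for the candidate split points 1.8 and 2.0.

```python
def split_measure(p_left, p_right, class_terms):
    """2 * P_L * P_R * (sum of the per-class terms), as in the example."""
    return 2 * p_left * p_right * sum(class_terms)

# Fractions taken from the slide
phi_18 = split_measure(5 / 15, 10 / 15, [4 / 15, 6 / 15, 3 / 15])
phi_20 = split_measure(12 / 15, 3 / 15, [4 / 15, 8 / 15, 3 / 15])
print(round(phi_18, 3), round(phi_20, 3))
```

The candidate with the largest value is chosen as the split point.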
  • 28. What is a Neural Network • A computer system modeled on the human brain and nervous system • A NN is usually organized in layers • Layers are made up of a number of interconnected nodes • The connection strengths of neurons are called weights, which are used to store the acquired information (training examples)
  • 29. Contd..Neural Networks • During the learning process the weights are modified in order to model the particular learning task correctly on the training examples.
  • 32. Classification Using Rules • Perform classification using If-Then rules • Classification rule: r = <a,c> (antecedent, consequent) • Rules may be generated from other techniques (DT, NN) or generated directly. • Algorithms: Gen, RX, 1R, PRISM
  • 35. Ensemble Methods • In general, an ensemble is a collection of a large number of replicas (mental or virtual copies) of the microstate of a system under given macroscopic conditions. • In machine learning, ensembling is a technique that combines two or more algorithms of similar or dissimilar types, called base learners, and incorporates the predictions from all of the base learners.
  • 36. Ensemble Methods • E.g., the decision of an interviewer depends on the feedback of all the interviewers in the various rounds.
  • 37. Ensemble Methods • Types of ensembling (3 types) – Averaging • Taking the average of the predictions from the models in a regression problem • Averaging the predicted probabilities in a classification problem
  • 38. Ensemble Methods • Types of ensembling – Majority vote • Taking the prediction with the maximum number of votes or recommendations from multiple models
  • 39. Ensemble Methods • Types of ensembling – Weighted average • Different weights are applied to the predictions from multiple models • The weighted predictions are then averaged, which gives high or low importance to a specific model in the output
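The three ways of combining predictions (averaging, majority vote, weighted average) can be sketched in a few lines. The function names and the sample predictions are hypothetical.

```python
from collections import Counter

def average(predictions):
    """Plain average of numeric predictions from several models."""
    return sum(predictions) / len(predictions)

def majority_vote(labels):
    """Class label predicted by the largest number of models."""
    return Counter(labels).most_common(1)[0][0]

def weighted_average(predictions, weights):
    """Average in which each model's prediction counts according to its weight."""
    return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)

# Hypothetical predictions from three regressors and three classifiers
print(average([10.0, 12.0, 14.0]))
print(majority_vote(["yes", "no", "yes"]))
print(weighted_average([10.0, 20.0], [3, 1]))
```

Setting all weights equal makes the weighted average coincide with the plain average.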
  • 40. Ensemble Modeling Techniques • Mostly 3 Techniques are used. They are –Bagging –Boosting –Stacking Contd.,
  • 41. Ensemble Modeling Techniques – Bagging • Also referred to as bootstrap aggregation • Bootstrap or bootstrapping – It is a sampling technique – We choose ‘n’ observations from the original dataset, with replacement – The probability of selecting each row from the dataset is the same in each iteration
  • 42. Ensemble Modeling Techniques • For the bootstrap sample we choose one row at a time • Here we have selected Row 2
  • 43. Ensemble Modeling Techniques • Row 2 remains in the data even after it has been selected into the bootstrapped sample • Row 1 is then selected from the data into the bootstrapped sample
  • 44. Ensemble Modeling Techniques • The bootstrapped samples are now ready for growing trees. • The predictions from the trees grown on these samples are combined by majority vote or averaging to get the final prediction • Bagging is mainly used to reduce variance – E.g.: Random Forest
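A minimal sketch of the bootstrap step described above, assuming sampling with replacement so that a row stays available after it has been picked. The row names, the seed and the helper name `bootstrap_sample` are illustrative.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows uniformly at random, with replacement."""
    return [rng.choice(rows) for _ in rows]

rows = ["row1", "row2", "row3", "row4", "row5"]
rng = random.Random(0)  # fixed seed only so that the sketch is repeatable

# One bootstrapped sample per tree that will later be grown
samples = [bootstrap_sample(rows, rng) for _ in range(3)]
for sample in samples:
    print(sample)
```

Each sample would then be used to train one base model, and the models' predictions combined by majority vote or averaging.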
  • 46. Ensemble Modeling Techniques – Boosting • A sequential technique in which the first algorithm is trained on the entire dataset. • Subsequent algorithms are built by fitting the residuals of the first algorithm. • This gives higher weight to the observations that were poorly predicted by the previous model • Boosting is mainly used to reduce bias
  • 47. Ensemble Modeling Techniques – E.g.: XGBoost, GBM, AdaBoost, etc. • F1(x) = F0(x) + λ0·h0(x) = 5008.3 + 0.5 × 4991.6 = 7504.1 • So this person earns $7504.1 per month according to our model.
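The update in the worked example can be sketched as a single boosting step: the previous prediction plus a learning rate times the prediction of the model fitted to the residuals. The function and argument names are assumptions; the numbers are the ones from the slide.

```python
def boosted_prediction(previous_prediction, residual_prediction, learning_rate):
    """One boosting step: F1(x) = F0(x) + learning_rate * h0(x)."""
    return previous_prediction + learning_rate * residual_prediction

# Values from the slide: F0(x) = 5008.3, h0(x) = 4991.6, learning rate 0.5
f1 = boosted_prediction(5008.3, 4991.6, 0.5)
print(round(f1, 1))
```

Further steps repeat the same update, each time fitting a new model to the residuals left by the current prediction.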
  • 48. Ensemble Modeling Techniques • Third technique – Stacking has two layers of machine learning models • The bottom-layer models d1, d2, d3 receive the original input features (x) • The top-layer model f() takes the outputs of the bottom-layer models (d1, d2, d3) and predicts the output (y)
  • 49. Ensemble Modeling Techniques • Key Principles for selecting models – The individual models fulfill particular accuracy criteria. – The model predictions of various individual models are not highly correlated with the predictions of other models – Note: This top layer model can also be replaced by many other simpler formulas like: • Averaging • Majority vote • Weighted Average Contd.,
  • 50. Ensemble Modeling Techniques • Advantages – Proven method of improving accuracy – Key ingredient for winning almost all machine learning hackathons (big gatherings where programmers code intensively over a short period of time) – Makes the model more robust and stable, giving decent performance on the test cases – Ensembling can capture both simple linear and complex non-linear relationships in data
  • 51. Ensemble Modeling Techniques • Disadvantages – Reduces model interpretability – Difficult to draw crucial business insights at the end – Time-consuming activity, not suitable for real-time scenarios – Selection of the ensemble model is difficult
  • 53. Association Rules • Association rule mining: “finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transactional/relational databases or any other informational repository” • Applications: – Cross-marketing – Catalogue design
  • 54. What is Association Rule • Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction • Examples: {bread} → {milk}, {soda} → {chips}, {bread} → {jam} • Applications: – Cross-marketing – Catalogue design
  • 56. What is Association Rule Contd.,
  • 57. Goal of Association • For a given set of transactions T, the goal of association mining is to find the rules having – Support ≥ minsup threshold – Confidence ≥ minconf threshold
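A minimal sketch of the two thresholds, assuming the standard definitions: support is the fraction of transactions containing an itemset, and confidence of a rule is the support of antecedent and consequent together divided by the support of the antecedent. The transactions are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, relative to the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market-basket data
transactions = [
    {"bread", "milk"},
    {"bread", "jam"},
    {"bread", "milk", "jam"},
    {"soda", "chips"},
]
print(support(transactions, {"bread", "milk"}))
print(confidence(transactions, {"bread"}, {"milk"}))
```

A rule such as {bread} → {milk} is kept only if both values reach the chosen minsup and minconf thresholds.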
  • 59. Association Rule Techniques Two step approach • Find Large Itemsets or Frequent itemset generation • Generate all itemsets whose support ≥ minsup • Rule generation • Generate rules from frequent itemsets. • Each rule is a binary partitioning of a frequent item set • Note: frequent item set generation is computationally expensive Contd. ,
  • 60. Algorithm to Generate ARs Contd. ,
  • 61. Apriori Algorithm 1. C1 = Itemsets of size one in I; 2. Determine all large itemsets of size 1, L1; 3. i = 1; 4. Repeat 5. i = i + 1; 6. Ci = Apriori-Gen(Li-1); 7. Count Ci to determine Li; 8. until no more large itemsets found; Contd. ,
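The pseudocode above can be sketched in Python as follows. This is one possible reading, with `minsup` taken as an absolute count and the candidate-generation step (Apriori-Gen) implemented as a join of the previous large itemsets followed by pruning of candidates that have a non-large subset. The transactions are hypothetical.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: count} for every itemset found in at least `minsup` transactions."""
    def count(candidates):
        return {c: sum(c <= t for t in transactions) for c in candidates}

    # C1 and L1: candidate and large itemsets of size one
    candidates = {frozenset([item]) for t in transactions for item in t}
    large = {c: n for c, n in count(candidates).items() if n >= minsup}
    result = dict(large)
    size = 1
    while large:
        size += 1
        previous = set(large)
        # Apriori-Gen: join L(i-1) with itself ...
        candidates = {a | b for a in previous for b in previous if len(a | b) == size}
        # ... and keep only candidates whose (size-1)-subsets are all large
        candidates = {c for c in candidates
                      if all(frozenset(s) in previous for s in combinations(c, size - 1))}
        large = {c: n for c, n in count(candidates).items() if n >= minsup}
        result.update(large)
    return result

transactions = [{"bread", "milk"}, {"bread", "jam"},
                {"bread", "milk", "jam"}, {"soda", "chips"}]
frequent = apriori(transactions, minsup=2)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

Each pass of the loop corresponds to one scan of the transaction list, which is why the number of scans grows with the size of the largest frequent itemset.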
  • 62. Working of Apriori Principle Contd. ,
  • 63. Apriori Advantages/Disadvantages  Advantages: – Uses large itemset property. – Easily parallelized – Easy to implement.  Disadvantages: – Assumes transaction database is memory resident. – Requires up to m database scans.
  • 65. Sequence Rules • For a database D, finding the maximal sequences among the ‘n’ sequences that have a certain user-specified minimum support and confidence
  • 67. Segmentation • For a database D, finding the maximal sequences among the ‘n’ sequences that have a certain user-specified minimum support and confidence
  • 68. Clustering Approaches • Hierarchical (Agglomerative, Divisive) • Partitional • Categorical • Large DB (Sampling, Compression)
  • 69. Types of Clustering • Hierarchical – Nested set of clusters created. • Partitional – One set of clusters created. • Incremental – Each element handled one at a time. • Simultaneous – All elements handled together. • Overlapping/Non-overlapping
  • 70. Hierarchical Clustering • Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples • Recursive application of a standard clustering algorithm can produce a hierarchical clustering • Example dendrogram: animal splits into vertebrate (fish, reptile, amphib., mammal) and invertebrate (worm, insect, crustacean)
  • 71. Hierarchical Clustering • Agglomerative – Bottom-up approach – Start with single-instance clusters – At each step join the two closest clusters – Design decision: distance between clusters • E.g.: two closest instances in clusters vs distance between means • Divisive – Top-down approach – Start with one universal cluster – Find two clusters – Proceed recursively on each subset – Can be very fast • Both produce a dendrogram
  • 72. Hierarchical Agglomerative Clustering (HAC) or Agglomerative Clustering • Starts with each doc in a separate cluster –then repeatedly joins the closest pair of clusters, until there is only one cluster. • The history of merging forms a binary tree or hierarchy.
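The agglomerative procedure can be sketched on one-dimensional points. This version assumes the "two closest instances" rule (single linkage) as the distance between clusters, which is only one of the design choices mentioned above; the points are hypothetical.

```python
def agglomerate(points):
    """Single-linkage agglomerative clustering; returns the merge history."""
    clusters = [[p] for p in points]  # start with one cluster per instance
    history = []
    while len(clusters) > 1:
        best = None
        # Find the closest pair of clusters (distance between their two closest instances)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

history = agglomerate([1.0, 2.0, 10.0, 12.0])
for left, right, distance in history:
    print(left, "+", right, "at distance", distance)
```

Read from first to last, the merge history is the binary tree (dendrogram) from the leaves up to the single final cluster.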
  • 74. K – Means Clustering Now group this data into two clusters
  • 75. K – Means Clustering • Randomly initialize two points called centroids
  • 76. K – Means Clustering • Colour each data point red or blue according to its nearest centroid • Then move the centroids
  • 77. K – Means Clustering • Each centroid is moved to the mean of the points of its own colour (red or blue) • Each data point is then re-coloured according to the centroid it is now closest to
  • 78. K-means clustering • Calculate the average of blue points and red points and move the centroid again
  • 79. K-means Clustering • Calculate the mean of the blue points and of the red points, then move the centroids again • Continue these iteration steps • At some point the cluster centroids will no longer change
  • 80. K-Means • Initial set of clusters randomly chosen. • Iteratively, items are moved among sets of clusters until the desired set is reached. • High degree of similarity among elements in a cluster is obtained. • Given a cluster Ki={ti1,ti2,…,tim}, the cluster mean is mi = (1/m)(ti1 + … + tim)
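The iteration described on the preceding slides can be sketched for one-dimensional points. The slides initialise the centroids randomly; fixed starting centroids are assumed here only so that the run is repeatable, and the data are hypothetical.

```python
def k_means(points, centroids, max_iter=100):
    """Alternate assignment and mean-update steps until the centroids stop moving."""
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        new_centroids = [sum(c) / len(c) if c else centroids[k]
                         for k, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

centroids, clusters = k_means([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0])
print(centroids, clusters)
```

The loop stops at the first pass in which no centroid changes, which matches the stopping condition described for the coloured-points example.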
  • 82. Introduction • Social network analysis is a set of relational methods for systematically understanding and identifying connections among actors
  • 83. Basic concepts: Network Components • Actors (nodes, points, vertices): individuals, organizations, events … • Relations (lines, arcs, edges, ties) between pairs of actors: undirected (symmetric) / directed (asymmetric); binary / valued
  • 84. Basic concepts: Types of network data: 1) Egocentered Networks • Data on a respondent (ego) and the people they are connected to. • Measures: size, types of relations • Def: An ego-centered network, or egonet, represents the one-hop neighbourhood of the node of interest; that is, an egonet consists of a particular node and its immediate neighbours
  • 85. Background: Types of network data: 2) Complete Networks • Connections among all members of a population. • Data on all actors within a particular (relevant) boundary. • Never exactly complete (due to missing data), but boundaries are set • Ex: Friendships among workers in a company. • Measures: graph properties, density, sub-groups, positions
  • 86. Social Network data • The units of interest in a network are the combined sets of actors and their relations. We represent actors with points and relations with lines. [Figure: example network with actors a, b, c, d, e]
  • 87. Social Network data • In general, a relation can be undirected or directed, and binary or valued. [Figure: four example networks on actors a to e: undirected binary; directed binary; undirected valued; directed valued]
  • 88. Social Network data: Basic Data Structures • From pictures to matrices. [Figure: the undirected binary and directed binary networks on actors a to e, each shown next to its adjacency matrix of 0/1 entries]
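Going from a picture to a matrix can be sketched as building an adjacency matrix from a list of ties. The ties below are hypothetical, since the edges of the original figure are not recoverable from the transcript; only the actor names a to e come from the slides.

```python
def adjacency_matrix(actors, ties, directed=False):
    """0/1 matrix with a 1 in row i, column j when actor i has a tie to actor j."""
    index = {actor: i for i, actor in enumerate(actors)}
    matrix = [[0] * len(actors) for _ in actors]
    for source, target in ties:
        matrix[index[source]][index[target]] = 1
        if not directed:
            # An undirected tie is recorded in both directions (symmetric matrix)
            matrix[index[target]][index[source]] = 1
    return matrix

actors = ["a", "b", "c", "d", "e"]
ties = [("a", "b"), ("b", "c"), ("c", "d"), ("c", "e")]  # hypothetical edges

undirected = adjacency_matrix(actors, ties)
directed = adjacency_matrix(actors, ties, directed=True)
for row in undirected:
    print(row)
```

For a valued relation the 1 would be replaced by the strength of the tie.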
  • 89. Measuring Networks: Connectivity • Indirect connections are what make networks systems. One actor can reach another if there is a path in the graph connecting them. [Figure: example networks with actors a to f]
  • 90. Measuring Networks: Distance & number of paths • Distance is measured by the (weighted) number of relations separating a pair, using the shortest path. Actor “a” is: 1 step from 4 actors; 2 steps from 5; 3 steps from 4; 4 steps from 3; 5 steps from 1.
  • 91. Measuring Networks • An information network: email exchanges within the Reagan White House, early 1980s (source: Blanton, 1995)
  • 92. Measuring Networks: Centrality • Centrality refers to (one dimension of) location, identifying where an actor resides in a network. The idea is fairly straightforward: we want to identify which nodes are in the ‘center’ of the network, in the sense that they have many and important connections. Three standard centrality measures capture a wide range of “importance” in a network: degree, closeness and betweenness.
  • 93. Measuring Networks: Centrality • The most intuitive notion of centrality focuses on degree. Degree is the number of lines, and the actor with the most lines is the most important.
  • 94. Measuring Networks: Centrality • Degree centrality: $C_D(p_k) = \sum_{i=1}^{n} a(p_i, p_k)$ • Relative measure of degree centrality: $C'_D(p_k) = \frac{\sum_{i=1}^{n} a(p_i, p_k)}{n-1}$
  • 95. Measuring Networks: Centrality • A second measure is closeness centrality. An actor is considered important if he/she is relatively close to all other actors. Closeness is based on the inverse of the distance of each actor to every other actor in the network. • Closeness centrality: $C_C(p_k) = \left[\sum_{i=1}^{n} d(p_i, p_k)\right]^{-1}$ • Relative closeness centrality: $C'_C(p_k) = \left[\frac{\sum_{i=1}^{n} d(p_i, p_k)}{n-1}\right]^{-1} = \frac{n-1}{\sum_{i=1}^{n} d(p_i, p_k)}$
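Relative closeness can be sketched with a breadth-first search that measures shortest-path distances in an unweighted network, then divides n - 1 by the sum of distances from the actor. The sketch assumes a connected, undirected graph; the three-actor path is a hypothetical example.

```python
from collections import deque

def distances_from(graph, start):
    """Shortest-path distance (in steps) from `start` to every reachable actor."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def relative_closeness(graph, node):
    """(n - 1) divided by the sum of distances from `node` to all other actors."""
    total = sum(distances_from(graph, node).values())
    return (len(graph) - 1) / total

# Hypothetical path network a - b - c, stored as adjacency lists
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(relative_closeness(graph, "b"), relative_closeness(graph, "a"))
```

The actor in the middle of the path reaches everyone in one step and therefore gets the maximum value of 1.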
  • 97. Measuring Networks: Centrality • Betweenness centrality: a model based on communication flow. A person who lies on communication paths can control communication flow, and is thus important. Betweenness centrality counts the number of shortest paths between i and k that actor j resides on. [Figure: example network with actors a to h]
  • 98. Measuring Networks: Centrality • Betweenness centrality can be defined in terms of probability (1/g_ij): $i_{ij}(p_k) = \frac{1}{g_{ij}} \cdot g_{ij}(p_k) = \frac{g_{ij}(p_k)}{g_{ij}}$, where g_ij = number of geodesics that link actors p_i and p_j; g_ij(p_k) = number of geodesics that link p_i and p_j and contain p_k; i_ij(p_k) = probability that actor p_k is on a geodesic randomly chosen among the ones that join p_i and p_j. • Betweenness centrality is the sum of these probabilities (Freeman, 1979): $C_B(p_k) = \sum_{i<j} i_{ij}(p_k)$ • Normalized: $C'_B(p_k) = C_B(p_k) / [(n-1)(n-2)/2]$
  • 100. Measuring Networks: Centralization • If we want to measure the degree to which the graph as a whole is centralized, we look at the dispersion of centrality. Freeman’s general formula for centralization (which ranges from 0 to 1): $C_D = \frac{\sum_{i=1}^{n} [C_D(p^*) - C_D(p_i)]}{(n-1)(n-2)}$, where $p^*$ is the actor with the highest degree centrality.
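Degree centrality and degree centralization can be sketched from an adjacency matrix, taking degree as the row sum of an undirected 0/1 matrix and centralization as the summed gap to the most central actor divided by (n - 1)(n - 2). The star and ring networks are hypothetical examples chosen to hit the two extremes.

```python
def degree_centrality(matrix):
    """Number of lines attached to each actor (row sums of the adjacency matrix)."""
    return [sum(row) for row in matrix]

def degree_centralization(matrix):
    """Sum of (largest degree - each degree), divided by (n - 1)(n - 2)."""
    n = len(matrix)
    degrees = degree_centrality(matrix)
    top = max(degrees)
    return sum(top - d for d in degrees) / ((n - 1) * (n - 2))

# Star: actor 0 is tied to everyone else, and nobody else is tied to each other
star = [[0, 1, 1, 1, 1],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0]]
# Ring: every actor has exactly two ties
ring = [[0, 1, 0, 0, 1],
        [1, 0, 1, 0, 0],
        [0, 1, 0, 1, 0],
        [0, 0, 1, 0, 1],
        [1, 0, 0, 1, 0]]
print(degree_centralization(star), degree_centralization(ring))
```

The star is the most centralized arrangement possible, while the ring has no dispersion in degree at all.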
  • 101. Degree Centralization Scores Freeman: 1.0 Freeman: .02 Freeman: 0.0 Centralization Measuring Networks
  • 102. Measuring Networks: Density • The more actors are connected to one another, the more dense the network will be. • Undirected network: n(n-1)/2 possible pairs of actors, so $\Delta = \frac{L}{n(n-1)/2}$ • Directed network: n(n-1) possible lines, so $\Delta_D = \frac{L}{n(n-1)}$, where L is the number of lines present.
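Density can be sketched as the number of lines present divided by the number of possible lines: n(n-1)/2 for an undirected network and n(n-1) for a directed one. The function name and the example counts are assumptions.

```python
def density(n_actors, n_lines, directed=False):
    """Lines present divided by the number of lines that could exist."""
    possible = n_actors * (n_actors - 1)
    if not directed:
        possible /= 2  # each undirected pair is counted once
    return n_lines / possible

# Hypothetical networks with five actors
print(density(5, 10))                 # every pair connected
print(density(5, 4))                  # four undirected lines
print(density(5, 4, directed=True))   # four directed lines
```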
  • 103. Freeman: .25 Freeman: .23 Freeman: 0.25 Density Measuring Networks
  • 104. Social Network Software: UCINET • The standard network analysis program, runs in Windows • Good for computing measures of network topography for single nets • Input-output of data uses a special 2-file format, but it is now able to read PAJEK files directly • Not optimal for large networks • Available from: Analytic Technologies
  • 105. Social Network Software: PAJEK • Program for analyzing and plotting very large networks • Intuitive Windows interface • Started mainly as a graphics program, but has expanded to a wide range of analytic capabilities • Can link to the R statistical package • Free • Available from: http://vlado.fmf.uni-lj.si/pub/networks/pajek/
  • 106. Social Network Software: NetDraw • Also very new, but by one of the best known names in network analysis software • Free