Data mining approaches and methods
Data mining techniques
and tasks
Prepared by :
Prabesh Pradhan
The process of collecting, searching through, and analyzing a large
amount of data in a database, so as to discover patterns or relationships.
• Extraction of useful patterns from data sources, e.g.,
databases, data warehouses, web.
• Patterns must be valid, novel, potentially useful, understandable.
DATA MINING
DATA MINING MODELS AND TASKS
Predictive:-
It makes predictions about data values using
known results from other data or from historical
data.
Descriptive:-
It identifies patterns or relationships in data;
it serves as a way to explore the properties of the data.
Data mining tasks
The process of collecting, searching through, and analyzing a large
amount of data in a database, so as to discover patterns or relationships.
 Extraction of useful patterns from data sources, e.g., databases, data
warehouses, web.
 Patterns must be valid, novel, potentially useful, understandable.
CLASSIFICATION
 Classification derives a model to determine the class of an object based on its
attributes.
 Given a collection of records, each record contains a set of attributes,
one of which is the class.
 E.g.: pattern recognition
• The value of an attribute is examined as it varies over time.
• A time series plot is used to visualize a time series.
• Ex:- stock exchange data
TIME SERIES ANALYSIS
• Clustering is the task of segmenting a diverse group into a
number of similar subgroups or clusters.
• Most similar data are grouped in clusters
• Ex:-Bank customer
CLUSTERING
• Abstraction or generalization of the data, resulting in a smaller set
which gives a general overview of the data.
• Alternatively, summary-type information can be derived from the
data.
SUMMARIZATION
Data mining techniques
 Association
 Classification
 Clustering
 Prediction
 Sequential Patterns
 Decision trees
Association
 The pattern is discovered based on a relationship between items in the same
transaction
 The association technique is also known as the relation technique
 The association technique is used in market basket analysis to identify a set of
products that customers frequently purchase together.
Classification
 It classifies each item in a set of data into one of a predefined set of classes or
groups.
 It makes use of mathematical techniques such as decision trees, linear
programming, neural networks, and statistics.
 We develop software that can learn how to classify data items into
groups.
 For example, we can apply classification in the application that “given all records
of employees who left the company, predict who will probably leave the
company in a future period.”
Clustering
 It makes meaningful or useful clusters of objects which have similar
characteristics using an automatic technique.
 The clustering technique defines the classes and puts objects in each class, while
in the classification technique, objects are assigned to predefined classes.
Prediction
 The prediction analysis technique can be used in sales to predict future profit:
if we consider sales as the independent variable, profit could be the
dependent variable.
 Based on the historical sales and profit data, we can draw a fitted regression
curve that is used for profit prediction.
Sequential patterns
 It seeks to discover or identify similar patterns, regular events, or trends in
transaction data over a business period.
 E.g.: In sales, with historical transaction data, businesses can identify sets of items
that customers buy together at different times of the year. Businesses can then use
this information to recommend these items to customers with better deals, based on
their past purchasing frequency.
Classification and
prediction
Prepared by :
Sonahang Rai
Decision Tree
 It is a classification scheme which generates a tree and a set of rules from a given
data set.
 A decision tree consists of a root node, edges, and leaf nodes.
 Each internal node denotes a test on an attribute.
 Each branch denotes an outcome of the test.
 Each leaf node holds a class label (see the sketch below).
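A toy sketch of this structure in Python (an illustration, not from the slides; the attribute names echo the buys_computer example used later): internal nodes test an attribute, branches carry the outcomes, and leaves hold class labels.

```python
# Hypothetical nested-dict representation of a small decision tree.
tree = {"attr": "age",                      # root: test on an attribute
        "branches": {
            "youth": {"attr": "student",    # internal node: another test
                      "branches": {"no": "no", "yes": "yes"}},
            "middle_aged": "yes",           # leaf: class label
            "senior": "yes"}}

def classify(node, record):
    # Descend from the root, following the branch matching each test,
    # until a leaf (class label) is reached.
    while isinstance(node, dict):
        node = node["branches"][record[node["attr"]]]
    return node

print(classify(tree, {"age": "youth", "student": "yes"}))  # -> "yes"
```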
Decision tree example
ID3 algorithm
i. Calculate the entropy of every attribute using the dataset.
ii. Split the set into subsets using the attribute for which entropy is minimum (or,
equivalently, information gain is maximum).
iii. Make a decision tree node containing that attribute.
iv. Recurse on each subset using the remaining attributes.
Entropy
 Info(D) = - Σ_{i=1}^{m} p_i log2(p_i)
where m = number of classes and
p_i = |C_i,D| / |D|, the proportion of tuples in D belonging to class C_i
 Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) * Info(D_j)
Information gain
 Gain(A) = Info(D) - Info_A(D) (sketched in code below)
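The following is a minimal Python sketch of these formulas (an illustration, not part of the original slides); the class counts mirror the 14-row buys_computer example worked through on the next slides.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D): entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    """Gain(attr) = Info(D) - Info_attr(D) for a list of dict rows."""
    n = len(rows)
    info_d = entropy([r[target] for r in rows])
    info_a = 0.0
    for value in {r[attr] for r in rows}:            # each split D_j
        subset = [r[target] for r in rows if r[attr] == value]
        info_a += len(subset) / n * entropy(subset)
    return info_d - info_a

# The 14 rows collapse to the per-age class counts used on the slides:
# youth 2 yes / 3 no, middle_aged 4 yes, senior 3 yes / 2 no.
rows = ([{"age": "youth", "buys": "yes"}] * 2
        + [{"age": "youth", "buys": "no"}] * 3
        + [{"age": "middle_aged", "buys": "yes"}] * 4
        + [{"age": "senior", "buys": "yes"}] * 3
        + [{"age": "senior", "buys": "no"}] * 2)
print(round(entropy([r["buys"] for r in rows]), 3))  # 0.94  (Info(D))
print(round(info_gain(rows, "age", "buys"), 3))      # 0.247 (the slide's 0.246 rounds intermediates)
```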
ID3 example
Finding entropy and gain
 Finding Info(D):
Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits
 Finding Info for age:
Info_age(D) = (5/14) * (-(2/5) log2(2/5) - (3/5) log2(3/5))
+ (4/14) * (-(4/4) log2(4/4) - (0/4) log2(0/4))
+ (5/14) * (-(3/5) log2(3/5) - (2/5) log2(2/5))
= 0.694 bits
Thus, Gain(age) = Info(D) - Info_age(D) = 0.940 - 0.694 = 0.246 bits
Similarly, we can find the gain of all other attributes:
Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
Among all the gains, the gain of age is the maximum; thus age is taken as the root node.
Rule Based Classification
 It uses a set of IF-THEN rules for classification.
 An IF-THEN rule is an expression of the form IF condition THEN conclusion.
 The IF part is called the rule antecedent or precondition; it can consist of one or more
attribute tests.
 The THEN part is called the rule consequent; it consists of a class prediction.
Ex:
R1:IF age=youth and student=yes THEN buys_computer=yes
Or
R1: (age=youth)Λ(student=yes)=>(buys_computer=yes)
 A rule R can be assessed by its coverage and accuracy.
– Given a tuple X from a data set D
– Let n_covers = # of tuples covered by R
– n_correct = # of tuples correctly classified by R
– |D| = # of tuples in D
Coverage(R) = n_covers / |D|
Accuracy(R) = n_correct / n_covers
R: IF age=youth AND student=yes THEN buys_computer=yes
|D| = 14
n_covers = 2
n_correct = 2
Coverage(R) = 2/14 ≈ 14.29%
Accuracy(R) = 2/2 = 100%
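A small sketch of these two measures in code; the two tuples shown are hypothetical stand-ins for the rows of the 14-tuple table that the rule actually covers.

```python
# Tuples covered by the rule (the other 12 rows of the table would follow).
D = [
    {"age": "youth", "student": "yes", "buys_computer": "yes"},
    {"age": "youth", "student": "yes", "buys_computer": "yes"},
]
TOTAL = 14  # |D| from the slide

covered = [t for t in D if t["age"] == "youth" and t["student"] == "yes"]
correct = [t for t in covered if t["buys_computer"] == "yes"]

coverage = len(covered) / TOTAL          # 2/14 ≈ 14.29%
accuracy = len(correct) / len(covered)   # 2/2  = 100%
print(f"coverage={coverage:.2%}, accuracy={accuracy:.2%}")
```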
Rules extraction from the decision tree
 One rule is created for each path from the root to a leaf node.
 Each splitting criterion along a path is logically ANDed to form the rule antecedent (IF part).
 The leaf node holds the class prediction for the rule consequent (THEN part).
 A logical OR is implied between the extracted rules (see the sketch below).
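A brief sketch of this path-to-rule extraction, reusing the toy nested-dict tree from the earlier decision-tree example (an illustration, not the slides' own code):

```python
def extract_rules(node, conditions=()):
    # Leaf reached: one IF-THEN rule per root-to-leaf path, tests ANDed.
    if not isinstance(node, dict):
        return ["IF " + " AND ".join(conditions) + f" THEN class={node}"]
    rules = []
    for value, child in node["branches"].items():
        rules += extract_rules(child, conditions + (f"{node['attr']}={value}",))
    return rules

tree = {"attr": "age", "branches": {
    "youth": {"attr": "student", "branches": {"no": "no", "yes": "yes"}},
    "middle_aged": "yes",
    "senior": "yes"}}

for r in extract_rules(tree):
    print(r)   # e.g. IF age=youth AND student=no THEN class=no
```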
Example of rules extraction from the decision tree
Rule’s
1. (age=young)˄(student=no)=>(buy_computer=no)
2. (age=young)˄(student=yes)=>(buy_computer=yes)
3. (age=middle-aged)=>(buy_computer=yes)
4. (age=senior)˄(credit_rating=fair)=>(buy_computer=no)
5. (age=young)˄(credit_rating=excellent)=>(buy_computer=
yes)
Genetic algorithm
 It is used for finding optimized solutions to search problems.
 It is based on the theory of natural selection and evolution in biology.
Selection: survival of the fittest
Evolution: origin of species from a common descendant
 It is excellent at searching large and complex data sets.
 Gene: a part of a chromosome.
 Chromosome: a set of genes.
 Population: the number of individuals present, each with the same
length of chromosome.
 Fitness: the value assigned to an individual.
 Fitness function: the function which assigns the fitness
value.
 Selection: selecting individuals for the next generation.
 Mutation: changing a random gene.
Algorithm
1. Generate random population of n chromosomes
2. Evaluate the fitness of each chromosome in the population
3. Select two parent chromosomes from a population
according to their fitness
4. With a crossover probability, cross over the parents to form
new offspring.
5. With a mutation probability, mutate the new offspring.
6. If the end condition is satisfied, stop, and return the best
solution in the current population (see the sketch below).
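A minimal sketch of this loop in Python; the bit-string encoding and the OneMax fitness (count of 1-bits) are toy assumptions chosen for illustration.

```python
import random

CHROM_LEN, POP_SIZE, P_CROSS, P_MUT, GENERATIONS = 20, 30, 0.9, 0.02, 50

def fitness(chrom):               # fitness function: assigns a value
    return sum(chrom)

def select(pop):                  # fitness-proportionate (roulette) selection
    return random.choices(pop, weights=[fitness(c) + 1 for c in pop], k=1)[0]

def crossover(a, b):              # single-point crossover with probability P_CROSS
    if random.random() < P_CROSS:
        point = random.randrange(1, CHROM_LEN)
        return a[:point] + b[point:]
    return a[:]

def mutate(chrom):                # flip each gene with probability P_MUT
    return [g ^ 1 if random.random() < P_MUT else g for g in chrom]

# 1. Generate a random population of chromosomes.
pop = [[random.randint(0, 1) for _ in range(CHROM_LEN)] for _ in range(POP_SIZE)]
# 2-5. Evaluate, select, cross over, mutate; 6. stop after fixed generations.
for _ in range(GENERATIONS):
    pop = [mutate(crossover(select(pop), select(pop))) for _ in range(POP_SIZE)]
best = max(pop, key=fitness)
print(fitness(best), best)
```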
Linear Regression
 It is a data mining technique used to predict a range of numeric values (also
called continuous values), given a particular dataset.
 It is used to estimate a relationship between two variables.
 It involves a response (dependent) variable y and a single
predictor (independent) variable x:
y = w0 + w1 x
where w0 (the y-intercept) and w1 (the slope) are the regression coefficients.
X (years experience)    Y (salary, in $1000s)
3                       30
8                       57
9                       64
13                      72
3                       36
6                       43
11                      59
21                      90
1                       20
16                      83
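A sketch fitting y = w0 + w1 x to the table above with the closed-form least-squares estimates (illustrative code, not from the slides; the data is the table itself):

```python
xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
# Slope: sum of cross-deviations over sum of squared x-deviations.
w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
w0 = y_bar - w1 * x_bar
print(f"y = {w0:.2f} + {w1:.2f} x")   # approximately y = 23.21 + 3.54 x

# Predicted salary (in $1000s) for 10 years of experience:
print(round(w0 + w1 * 10, 1))
```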
Non-linear regression,
Association and
frequent pattern
Prepared by :
Biplap Bhattarai
• Often the relationship between x and y cannot be approximated with a
straight line. In this case, a nonlinear regression technique may be used.
• Alternatively, the data could be preprocessed to make the relationship
linear.
• Many non-linear models that can be modeled by a polynomial regression
model can be transformed into a linear regression model.
• For example:
y = w0 + w1 x + w2 x^2 + w3 x^3 + w4 x^4
Non-Linear Regression:
• For example:
y = w0 + w1 x + w2 x^2 + w3 x^3 + w4 x^4
is convertible to linear with new variables:
x1 = x, x2 = x^2, x3 = x^3, x4 = x^4
y = w0 + w1 x1 + w2 x2 + w3 x3 + w4 x4
which is easily solved by the method of least squares using regression
analysis software.
Non-Linear Regression:
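A short sketch of this linearization trick; the sample points and coefficients are made up for illustration, and NumPy's least-squares solver stands in for "regression analysis software".

```python
import numpy as np

x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = 2 + 1.5 * x - 0.8 * x**2 + 0.3 * x**3 - 0.02 * x**4  # synthetic target

# New variables x1=x, x2=x^2, x3=x^3, x4=x^4 make the model linear in w.
X = np.column_stack([np.ones_like(x), x, x**2, x**3, x**4])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # recovers approximately [2, 1.5, -0.8, 0.3, -0.02]
```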
Association Rules:
Detect sets of attributes that frequently co-occur, and rules among them,
e.g., 90% of the people who buy cookies also buy milk (60% of all grocery
shoppers buy both).
Frequent Pattern:
Frequent Pattern is a pattern (a set of items, subsequences) that occurs
frequently in a dataset. For example, a set of items, such as milk and bread,
that appear frequently together in a transaction data set, is a frequent
itemset. A subsequence, such as buying first a PC, then a digital camera, and
then a memory card, if it occurs frequently in a shopping history database, is
a (frequent) sequential pattern.
Data Mining Techniques
Association Rule Mining:
Association rule mining is a procedure which is meant to find frequent
patterns, correlations, associations or causal structures in datasets found
in various kinds of databases (relational, transactional).
Data Mining Techniques
The Apriori Principle:
If an itemset is frequent, then all of its subsets must also be frequent.
Conversely, if an itemset is infrequent, then all of its supersets must be
infrequent, too.
• Apriori uses a "bottom up" approach, where frequent subsets are extended
one item at a time (a step known as candidate generation), and groups of
candidates are tested against the data.
• Apriori is designed to operate on databases containing transactions (for
example, collections of items bought by customers, or details of website
visits).
DEFINITION OF APRIORI ALGORITHM
• The Apriori principle holds due to the following property of the support
measure: ∀X,Y: (X ⊂ Y) → s(X) ≥ s(Y)
• For all X, Y: if X is a subset of Y, then the support of X is greater than or
equal to the support of Y.
DEFINITION OF APRIORI ALGORITHM
• Frequent Itemsets: all the itemsets that meet the minimum
support (denoted by Li for the i-th itemset)
• Apriori Property: any subset of a frequent itemset must be frequent.
• Join Operation: to find Lk, a set of candidate k-itemsets is generated by
joining Lk-1 with itself
KEY CONCEPTS
Given minimum required support s:
1. Search for all individual elements (1-element item sets) that have a minimum
support of s.
2. Repeat
2.1 From the result of the previous search for i-element item sets, search for
all (i+1)-element item sets that have a minimum support of s.
2.2 This becomes the set of all frequent (i+1)-element item sets that are
interesting.
3. Until the item-set size reaches the maximum.
The Apriori Algorithm
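A compact sketch of this level-wise search; the five toy transactions and the support threshold of 3 are assumptions for illustration.

```python
from itertools import combinations

transactions = [{"milk", "bread"}, {"milk", "diaper", "beer"},
                {"milk", "bread", "diaper"}, {"bread", "diaper", "beer"},
                {"milk", "bread", "diaper", "beer"}]
min_support = 3  # minimum required support count s

def support(itemset):
    return sum(itemset <= t for t in transactions)

# L1: frequent 1-itemsets.
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support({i}) >= min_support}]

k = 1
while frequent[-1]:
    # Candidate generation: join L(k) with itself, keep (k+1)-sets whose
    # k-subsets are all frequent (Apriori property), then count support.
    candidates = {a | b for a in frequent[-1] for b in frequent[-1]
                  if len(a | b) == k + 1}
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent[-1] for s in combinations(c, k))}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level in frequent[:-1]:   # print L1, L2, ... (last level is empty)
    print(sorted(map(sorted, level)))
```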
The frequent itemsets are the items in L1, L2, and L3.
Pros of the Apriori algorithm
• It is an easy-to-implement and easy-to-understand algorithm.
• It can be used on large itemsets.
Cons of the Apriori Algorithm
• Sometimes, it may need to find a large number of candidate rules which can
be computationally expensive.
• Calculating support is also expensive because it has to go through the entire
database.
The Apriori Algorithm
The FP-tree algorithm is used to identify frequent patterns in the area of Data
Mining.
The Frequent Pattern Tree (FP-Tree) is a compact structure that stores
quantitative information about frequent patterns in a database.
Frequent Pattern Tree (FP Tree)
1. One root labeled as “null” with a set of item-prefix subtrees as children
2. Each node in the item-prefix subtree consists of three fields:
i. Item-name: registers which item is represented by the node;
ii. Count: the number of transactions represented by the portion of the path
reaching the node;
iii. Node-link: links to the next node in the FP-tree carrying the same item-
name, or null if there is none.
Frequent Pattern Tree (FP Tree)
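A minimal sketch of this node structure, plus the insertion step used when transactions are added in priority order; the header-table threading of node-links follows the usual FP-growth convention and is an assumption here, not taken from the slides.

```python
class FPNode:
    def __init__(self, item, parent=None):
        self.item = item          # item-name
        self.count = 0            # transactions whose path reaches this node
        self.parent = parent
        self.children = {}        # item-name -> child FPNode
        self.node_link = None     # next node carrying the same item-name

def insert(root, ordered_items, header):
    """Insert one transaction (items already sorted by priority)."""
    node = root
    for item in ordered_items:
        if item not in node.children:
            child = FPNode(item, parent=node)
            node.children[item] = child
            # Thread the new node onto the header table's node-link chain.
            child.node_link = header.get(item)
            header[item] = child
        node = node.children[item]
        node.count += 1

root, header = FPNode(None), {}                  # root labeled "null"
insert(root, ["B", "D", "A", "E"], header)       # row 1 of the worked example
insert(root, ["B", "D", "A", "E", "C"], header)  # row 2 reuses the same path
print(root.children["B"].count)                  # 2, as on the slides
```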
1. Question: Find all frequent itemsets or frequent patterns in the following
database using the FP-growth algorithm. Take minimum support as 30%.
Frequent Pattern Tree (FP Tree)
Step 1 - Calculate minimum support
First, calculate the minimum support count. The question says minimum
support should be 30%. It is calculated as follows:
Minimum support count = 30/100 * 8 = 2.4
To ease the calculation, 2.4 is rounded up
to the ceiling value. Now,
Minimum support count = ceiling(30/100 * 8) = 3
Frequent Pattern Tree (FP Tree)
Step 2 - Find frequency of occurrence
Now find the frequency of occurrence of each item in the table.
For example, item A occurs in row 1, row 2, row 3, row 4
and row 7.
In total, it occurs 5 times in the database table.
You can see the counted frequency of occurrence
of each item in the table below.
Frequent Pattern Tree (FP Tree)
Step 3 - Prioritize the items
In Table 2 you can see the numbers written in red. These are the priorities of
the items according to their frequency of occurrence.
Item B got the highest priority (1) due to its highest number of occurrences.
At the same time, you can drop the items which do not fulfill the
minimum support requirement. For instance, if the table contained an item F with
frequency 1, you could drop it.
Frequent Pattern Tree (FP Tree)
Step 4 - Order the items according to priority
As you can see in the table below, a new column has been added to the earlier table.
In the Ordered Items column all the items are queued according to their priority,
as marked in red ink in the table.
For example, when ordering row 1, the highest-priority item is B, followed by
D, A and E respectively.
Frequent Pattern Tree (FP Tree)
Step 5 - Construct the FP-tree
Row 1:
Note that all FP-trees have a 'null' node as the root node. So draw the root node
first and attach the items of row 1 one by one, and write their
occurrence counts next to them.
Frequent Pattern Tree (FP Tree)
Row 2: Then update the above tree by entering the items of row 2.
Without creating another branch, you can go through the previous branch
up to E, and then you have to create a new node after that for C.
(When you go through the branch a second time, erase the 1 and
write 2 to indicate that you have visited that node twice. If you visit
a third time, erase the 2 and write 3.)
Figure 2 shows the FP-tree after adding row 1 and row 2.
Frequent Pattern Tree (FP Tree)
Row 3: In row 3 you have to visit B, A, E and C respectively.
You may think you can follow the same branch again, updating the counts
of B, A, E and C. But you can't: you can come through the existing B,
but you can't connect B to the existing A by overtaking D. As a result, you should draw
another A and connect it to B, then connect a new E to that A and a new C to
the new E.
Frequent Pattern Tree (FP Tree)
Row 4: Row 4 contains B, D, A. Now we can just update the frequency of
occurrences in the existing branch, giving B:4, D:3, A:3.
Row 5: The fifth row has only item D. Now we have the opportunity to draw a
new branch from the 'null' node.
Frequent Pattern Tree (FP Tree)
Row 6: B and D appear in row 6. So just change B:4 to B:5 and D:3 to D:4.
Row 7:
Attach two new nodes A and E to the D node hanging from the null node.
Then mark D, A, E as D:2, A:1 and E:1.
Row 8: Attach a new node C to B and update the traversal counts (B:6, C:1).
Frequent Pattern Tree (FP Tree)
Step 6 - Validation
After these steps, the final FP-tree is as follows:
How do we know it is correct?
Count the frequency of occurrence of each item in the FP-tree and compare
it with the table. If both counts are equal, that is a positive indication that the
tree is correct.
Frequent Pattern Tree (FP Tree)
• FP-Growth stands for frequent pattern growth.
• It is a scalable technique for mining frequent patterns in a database.
• After constructing the FP-tree, it is possible to mine it to find the complete
set of frequent patterns.
• FP-growth improves on Apriori to a large extent.
• Frequent itemset mining is possible without candidate generation.
• Only two scans of the database are needed.
FP-Growth
Simply a two-step procedure:
– Step 1: Build a compact data structure called the FP-tree,
built using 2 passes over the data set.
– Step 2: Extract frequent itemsets directly from the FP-tree.
FP-Growth Algorithm Process
FP-Growth Example:
Example: With Min Support 3
FP-Tree
FP-Growth Example:
For T: We start by drawing the tree whose end nodes are T's, keeping only the
support of T.
FP-Growth Example:
We take out T one by one and, as we do so, push its support to every node
up the chain to the root that was part of the same transaction in which T was:
FP-Growth Example:
As D is infrequent (min support = 3), we remove D.
Itemsets = {C,T},{W,T},{A,T},{C,W,T},
{C,A,T},{W,A,T},{T}
FP-Growth Example:
Then, starting from the main FP-tree, we consider each item that appears in the
tree and create new conditional FP-trees for C, W, A, and D.
Market Basket Analysis
• Market Basket Analysis is a modelling technique based upon the theory that if
you buy a certain group of items, you are more (or less) likely to buy another
group of items.
• For example, if you are in the US and purchase diapers and milk, then you are
likely to buy beer.
• The set of items a customer buys is referred to as an itemset, and market
basket analysis seeks to find relationships between purchases.
Market Basket Analysis
• Typically the relationship will be in the form of a rule:
• IF {milk, diaper} THEN {beer}.
• The fraction of transactions in which a customer buys milk, a diaper and beer
together is referred to as the support for the rule. The conditional probability that a
customer who buys milk and a diaper will also purchase beer is referred to as the
confidence (see the sketch below).
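A sketch computing both measures for IF {milk, diaper} THEN {beer}; the five toy transactions are made up for illustration.

```python
transactions = [{"milk", "diaper", "beer"}, {"milk", "bread"},
                {"milk", "diaper", "beer"}, {"diaper", "beer", "cola"},
                {"milk", "diaper", "bread"}]

antecedent, consequent = {"milk", "diaper"}, {"beer"}

n = len(transactions)
# Baskets containing every item of the rule, and every antecedent item.
both = sum((antecedent | consequent) <= t for t in transactions)
ante = sum(antecedent <= t for t in transactions)

support = both / n          # P(milk, diaper, beer)
confidence = both / ante    # P(beer | milk, diaper)
print(f"support={support:.0%}, confidence={confidence:.0%}")
```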
How is it used?
• In retailing, most purchases are bought on impulse. Market basket analysis
gives clues as to what a customer might have bought if the idea had occurred
to them.
• Market basket analysis can be used in deciding the location and promotion of
goods inside a store.
Types:
• There are two main types of MBA:
• Predictive MBA is used to classify cliques of item purchases, events and
services that largely occur in sequence.
• Differential MBA removes a high volume of insignificant results and can lead
to very in-depth results. It compares information between different stores,
demographics, seasons of the year, days of the week and other factors.
Clustering
Prepared by :
Trilok Pratap Kafle
Clustering
Cluster :-
A cluster is a group of objects that belong to the same class:
similar objects are grouped in one cluster and dissimilar objects are
grouped in another cluster.
Clustering is the process of organizing a group of abstract objects into
classes of similar objects.
Application of cluster analysis
 Clustering can help marketers discover distinct groups in their customer base.
And they can characterize their customer groups based on the purchasing
patterns.
 Clustering analysis is broadly used in many applications such as market research,
pattern recognition, data analysis, and image processing.
 Clustering also helps in classifying documents on the web for information
discovery.
Clustering methods
 Partitioning method
 Hierarchical method
Partitioning method
 Suppose we are given a database of ‘n’ objects and the partitioning method
constructs ‘k’ partition of data. Each partition will represent a cluster and k ≤ n. It
means that it will classify the data into k groups, which satisfy the following
requirements −
 Each group contains at least one object.
 Each object must belong to exactly one group.
Types of partitioning methods
1. K-means
2. K-medoids
K-means
 K-Means is one of the most popular "clustering" algorithms.
 K-means stores K centroids that it uses to define clusters.
 A point is considered to be in a particular cluster if it is closer to that cluster's
centroid than any other centroid.
Steps in K-mean
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition.
Centroid is the centre (mean point) of the cluster.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when there are no new assignments.
=>2, 3, 6, 8, 9, 12, 15, 18, 22 – break into 3 clusters
– Cluster 1 - 2, 12, 18 – mean = 10.6
– Cluster 2 - 6, 9, 22 – mean = 12.3
– Cluster 3 – 3, 8, 15 – mean = 8.6
• Re-assign
– Cluster 1 – (empty) – mean = 0
– Cluster 2 – 12, 15, 18, 22 - mean = 16.75
– Cluster 3 – 2, 3, 6, 8, 9 – mean = 5.6
• Re-assign
– Cluster 1 – 2 – mean = 2
– Cluster 2 – 12, 15, 18, 22 – mean = 16.75
– Cluster 3 – 3, 6, 8, 9 – mean = 6.5
Re-assign
– Cluster 1 – 2, 3 – mean = 2.5
– Cluster 2 – 12, 15, 18, 22 – mean = 16.75
– Cluster 3 – 6, 8, 9 – mean = 7.6
• Re-assign
– Cluster 1 – 2, 3 – mean = 2.5
– Cluster 2 – 12, 15, 18, 22 - mean = 16.75
– Cluster 3 – 6, 8, 9 – mean = 7.6
• No change, so we’re done
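A sketch of the same 1-D K-means loop (seeded with the slide's initial cluster means; an empty cluster's mean is reset to 0 as on the slide). Note that exact arithmetic converges to a slightly different final partition than the rounded hand calculation above.

```python
points = [2, 3, 6, 8, 9, 12, 15, 18, 22]
means = [10.6, 12.3, 8.6]  # means of the initial clusters {2,12,18},{6,9,22},{3,8,15}

while True:
    # Assignment step: each point joins the cluster with the nearest mean.
    clusters = [[] for _ in means]
    for p in points:
        nearest = min(range(len(means)), key=lambda i: abs(p - means[i]))
        clusters[nearest].append(p)
    # Update step: recompute each mean (0 for an empty cluster, as on the slide).
    new_means = [sum(c) / len(c) if c else 0.0 for c in clusters]
    if new_means == means:   # no change in the means -> converged
        break
    means = new_means

print(clusters)  # exact arithmetic settles on [[2, 3], [15, 18, 22], [6, 8, 9, 12]]
```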
K-Mediods
 It is also a algorithm which breaks the data sets into group.
 It also attempt to minimize the distance between points labeled to be the cluster
of that cluster.
 Data point is taken as center and work with a generalization of the Manhattan
Norm.
Hierarchical Method
Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm
that groups similar objects into groups called clusters. The endpoint is a set of
clusters, where each cluster is distinct from each other cluster, and the objects
within each cluster are broadly similar to each other.
For example, all files and folders on the hard disk are organized in a hierarchy.
There are two types of hierarchical clustering:-
1. Divisive
2. Agglomerative
Divisive Method
In divisive or top-down clustering method we assign all of the
observations to a single cluster and then partition the cluster to two
least similar clusters. Finally, we proceed recursively on each cluster
until there is one cluster for each observation. There is evidence that
divisive algorithms produce more accurate hierarchies than
agglomerative algorithms in some circumstances, but they are conceptually
more complex.
Agglomerative Method
In agglomerative or bottom-up clustering method we assign each observation to its
own cluster. Then, compute the similarity (e.g., distance) between each of the
clusters and join the two most similar clusters. Repeat the process until there is
only a single cluster left.
Before any clustering is performed, it is required to determine the proximity matrix
containing the distance between each point using a distance function. Then, the
matrix is updated to display the distance between each cluster. The following three
methods differ in how the distance between each cluster is measured.
Single Linkage
In single linkage hierarchical clustering, the distance between two clusters is defined as the shortest distance between two
points in each cluster. For example, the distance between clusters “r” and “s” to the left is equal to the length of the arrow
between their two closest points.
Complete Linkage
In complete linkage hierarchical clustering, the distance between two clusters is defined as the longest distance between
two points in each cluster. For example, the distance between clusters “r” and “s” to the left is equal to the length of the
arrow between their two furthest points.
Average Linkage
In average linkage hierarchical clustering, the distance between two clusters is defined as the average distance
between each point in one cluster and every point in the other cluster. For example, the distance between clusters “r”
and “s” to the left is equal to the average length of the arrows connecting the points of one cluster to the other.
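A brief sketch of agglomerative clustering under all three linkage criteria, using SciPy (a library choice assumed here; the 1-D points are arbitrary illustration data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Eight 1-D observations; SciPy computes the proximity matrix internally.
X = np.array([[2.0], [3.0], [6.0], [8.0], [9.0], [15.0], [18.0], [22.0]])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # bottom-up merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
    print(method, labels)
```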
Data mining approaches and methods

Contenu connexe

Tendances

05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data MiningValerii Klymchuk
 
Decision Trees
Decision TreesDecision Trees
Decision TreesStudent
 
Feature Engineering in Machine Learning
Feature Engineering in Machine LearningFeature Engineering in Machine Learning
Feature Engineering in Machine LearningKnoldus Inc.
 
2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revised2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revisedKrish_ver2
 
2.4 rule based classification
2.4 rule based classification2.4 rule based classification
2.4 rule based classificationKrish_ver2
 
Frequent itemset mining methods
Frequent itemset mining methodsFrequent itemset mining methods
Frequent itemset mining methodsProf.Nilesh Magar
 
2.8 accuracy and ensemble methods
2.8 accuracy and ensemble methods2.8 accuracy and ensemble methods
2.8 accuracy and ensemble methodsKrish_ver2
 
Random forest
Random forestRandom forest
Random forestUjjawal
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithmhadifar
 
Least Squares Regression Method | Edureka
Least Squares Regression Method | EdurekaLeast Squares Regression Method | Edureka
Least Squares Regression Method | EdurekaEdureka!
 
Cross validation
Cross validationCross validation
Cross validationRidhaAfrawe
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter TuningJon Lederman
 
Decision trees in Machine Learning
Decision trees in Machine Learning Decision trees in Machine Learning
Decision trees in Machine Learning Mohammad Junaid Khan
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machinesnextlib
 

Tendances (20)

05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data Mining
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 
Clusters techniques
Clusters techniquesClusters techniques
Clusters techniques
 
Feature Engineering in Machine Learning
Feature Engineering in Machine LearningFeature Engineering in Machine Learning
Feature Engineering in Machine Learning
 
2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revised2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revised
 
Outlier Detection
Outlier DetectionOutlier Detection
Outlier Detection
 
Association rules
Association rulesAssociation rules
Association rules
 
Clustering
ClusteringClustering
Clustering
 
2.4 rule based classification
2.4 rule based classification2.4 rule based classification
2.4 rule based classification
 
Frequent itemset mining methods
Frequent itemset mining methodsFrequent itemset mining methods
Frequent itemset mining methods
 
2.8 accuracy and ensemble methods
2.8 accuracy and ensemble methods2.8 accuracy and ensemble methods
2.8 accuracy and ensemble methods
 
Random forest
Random forestRandom forest
Random forest
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithm
 
Least Squares Regression Method | Edureka
Least Squares Regression Method | EdurekaLeast Squares Regression Method | Edureka
Least Squares Regression Method | Edureka
 
Decision tree
Decision treeDecision tree
Decision tree
 
Cross validation
Cross validationCross validation
Cross validation
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter Tuning
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster Analysis
 
Decision trees in Machine Learning
Decision trees in Machine Learning Decision trees in Machine Learning
Decision trees in Machine Learning
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
 

Similaire à Data mining approaches and methods

Big Data Analytics - Unit 3.pptx
Big Data Analytics - Unit 3.pptxBig Data Analytics - Unit 3.pptx
Big Data Analytics - Unit 3.pptxPlacementsBCA
 
dataminingclassificationprediction123 .pptx
dataminingclassificationprediction123 .pptxdataminingclassificationprediction123 .pptx
dataminingclassificationprediction123 .pptxAsrithaKorupolu
 
5. Machine Learning.pptx
5.  Machine Learning.pptx5.  Machine Learning.pptx
5. Machine Learning.pptxssuser6654de1
 
Predictive analytics
Predictive analyticsPredictive analytics
Predictive analyticsDinakar nk
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learningAmAn Singh
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data MiningKai Koenig
 
Singular Value Decomposition (SVD).pptx
Singular Value Decomposition (SVD).pptxSingular Value Decomposition (SVD).pptx
Singular Value Decomposition (SVD).pptxrajalakshmi5921
 
EDAB Module 5 Singular Value Decomposition (SVD).pptx
EDAB Module 5 Singular Value Decomposition (SVD).pptxEDAB Module 5 Singular Value Decomposition (SVD).pptx
EDAB Module 5 Singular Value Decomposition (SVD).pptxrajalakshmi5921
 
Introduction to data mining
Introduction to data miningIntroduction to data mining
Introduction to data miningUjjawal
 
Cluster2
Cluster2Cluster2
Cluster2work
 
UNIT 3: Data Warehousing and Data Mining
UNIT 3: Data Warehousing and Data MiningUNIT 3: Data Warehousing and Data Mining
UNIT 3: Data Warehousing and Data MiningNandakumar P
 
A Decision Tree Based Classifier for Classification & Prediction of Diseases
A Decision Tree Based Classifier for Classification & Prediction of DiseasesA Decision Tree Based Classifier for Classification & Prediction of Diseases
A Decision Tree Based Classifier for Classification & Prediction of Diseasesijsrd.com
 
Classifiers
ClassifiersClassifiers
ClassifiersAyurdata
 
Statistics for data scientists
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientistsAjay Ohri
 
Design principle of pattern recognition system and STATISTICAL PATTERN RECOGN...
Design principle of pattern recognition system and STATISTICAL PATTERN RECOGN...Design principle of pattern recognition system and STATISTICAL PATTERN RECOGN...
Design principle of pattern recognition system and STATISTICAL PATTERN RECOGN...TEJVEER SINGH
 

Similaire à Data mining approaches and methods (20)

Big Data Analytics - Unit 3.pptx
Big Data Analytics - Unit 3.pptxBig Data Analytics - Unit 3.pptx
Big Data Analytics - Unit 3.pptx
 
dataminingclassificationprediction123 .pptx
dataminingclassificationprediction123 .pptxdataminingclassificationprediction123 .pptx
dataminingclassificationprediction123 .pptx
 
fINAL ML PPT.pptx
fINAL ML PPT.pptxfINAL ML PPT.pptx
fINAL ML PPT.pptx
 
5. Machine Learning.pptx
5.  Machine Learning.pptx5.  Machine Learning.pptx
5. Machine Learning.pptx
 
Predictive analytics
Predictive analyticsPredictive analytics
Predictive analytics
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
 
Chapter 1.pdf
Chapter 1.pdfChapter 1.pdf
Chapter 1.pdf
 
Data Mining
Data MiningData Mining
Data Mining
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Seminar Presentation
Seminar PresentationSeminar Presentation
Seminar Presentation
 
Singular Value Decomposition (SVD).pptx
Singular Value Decomposition (SVD).pptxSingular Value Decomposition (SVD).pptx
Singular Value Decomposition (SVD).pptx
 
EDAB Module 5 Singular Value Decomposition (SVD).pptx
EDAB Module 5 Singular Value Decomposition (SVD).pptxEDAB Module 5 Singular Value Decomposition (SVD).pptx
EDAB Module 5 Singular Value Decomposition (SVD).pptx
 
Introduction to data mining
Introduction to data miningIntroduction to data mining
Introduction to data mining
 
Cluster2
Cluster2Cluster2
Cluster2
 
UNIT 3: Data Warehousing and Data Mining
UNIT 3: Data Warehousing and Data MiningUNIT 3: Data Warehousing and Data Mining
UNIT 3: Data Warehousing and Data Mining
 
A Decision Tree Based Classifier for Classification & Prediction of Diseases
A Decision Tree Based Classifier for Classification & Prediction of DiseasesA Decision Tree Based Classifier for Classification & Prediction of Diseases
A Decision Tree Based Classifier for Classification & Prediction of Diseases
 
Classifiers
ClassifiersClassifiers
Classifiers
 
Statistics for data scientists
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientists
 
Primer on major data mining algorithms
Primer on major data mining algorithmsPrimer on major data mining algorithms
Primer on major data mining algorithms
 
Design principle of pattern recognition system and STATISTICAL PATTERN RECOGN...
Design principle of pattern recognition system and STATISTICAL PATTERN RECOGN...Design principle of pattern recognition system and STATISTICAL PATTERN RECOGN...
Design principle of pattern recognition system and STATISTICAL PATTERN RECOGN...
 

Dernier

Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWQuiz Club NITW
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSMae Pangan
 
Scientific Writing :Research Discourse
Scientific  Writing :Research  DiscourseScientific  Writing :Research  Discourse
Scientific Writing :Research DiscourseAnita GoswamiGiri
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxVanesaIglesias10
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Projectjordimapav
 
4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptxmary850239
 
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQ-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQuiz Club NITW
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...Nguyen Thanh Tu Collection
 
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfPrerana Jadhav
 
How to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseHow to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseCeline George
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptxmary850239
 
4.11.24 Mass Incarceration and the New Jim Crow.pptx
4.11.24 Mass Incarceration and the New Jim Crow.pptx4.11.24 Mass Incarceration and the New Jim Crow.pptx
4.11.24 Mass Incarceration and the New Jim Crow.pptxmary850239
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4JOYLYNSAMANIEGO
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17Celine George
 

Dernier (20)

Paradigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTAParadigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTA
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITW
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHS
 
Scientific Writing :Research Discourse
Scientific  Writing :Research  DiscourseScientific  Writing :Research  Discourse
Scientific Writing :Research Discourse
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptx
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptxINCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
 
4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx
 
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQ-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
 
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of EngineeringFaculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
 
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdf
 
How to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseHow to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 Database
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx
 
4.11.24 Mass Incarceration and the New Jim Crow.pptx
4.11.24 Mass Incarceration and the New Jim Crow.pptx4.11.24 Mass Incarceration and the New Jim Crow.pptx
4.11.24 Mass Incarceration and the New Jim Crow.pptx
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17
 

Data mining approaches and methods

  • 1. Data mining techniques and tasks Prepared by : Prabesh Pradhan
  • 2. The process of collecting, searching through, and analyzing a large amount of data in a database, as to discover patterns or relationships • Extraction of useful patterns from data sources, e.g., databases, data warehouses, web. • Patterns must be valid, novel, potentially useful, understandable. DATA MINING
  • 3. DATA MINING MODELS AND TASKS
  • 4. Predictive:- It makes prediction about values of data using known results from different data or based on historical data. Descriptive:- It identifies patterns or relationship in data, it serves as a way to explore properties of data.
  • 5. Data mining tasks The process of collecting, searching through, and analyzing a large amount of data in a database, as to discover patterns or relationships  Extraction of useful patterns from data sources,e.g. databases, data warehouses, web.  Patterns must be valid, novel, potentially useful, understandable.
  • 6. CLASSIFICATION  Classification derives a model to determine the class of an object based on its attributes.  Given a collection of records, each record contains a set of attributes, one of the attributes is the class.  For eg: pattern recognition
  • 7. • The value of attribute is examined as it varies over time. • A time series plot is used to visualize time series. • Ex:- stock exchange TIME SERIES ANALAYSIS
  • 8. • Clustering is the task of segmenting a diverse group into a number of similar subgroups or clusters. • Most similar data are grouped in clusters • Ex:-Bank customer CLUSTERING
  • 9. • Abstraction or generalization of data resulting in a smaller set which gives general overview of a data. • alternatively,summary type information can be derived from data. SUMMARIZATION
  • 10. Data mining techniques  Association  Classification  Clustering  Prediction  Sequential Patterns  Decision trees
  • 11. Association  The pattern is discovered based on a relationship between items in the same transaction  The association technique is also known as relation technique  The association technique is used in market basket analysis to identify a set of products that customers frequently purchase together.
  • 12. Classification  It classify each item in a set of data into one of a predefined set of classes or groups.  It makes use of mathematical techniques such as decision trees, linear programming, neural network, and statistics  We develop the software that can learn how to classify the data items into groups.  For example, we can apply classification in the application that “given all records of employees who left the company, predict who will probably leave the company in a future period.”
  • 13. Clustering  It makes a meaningful or useful cluster of objects which have similar characteristics using the automatic technique.  The clustering technique defines the classes and puts objects in each class, while in the classification techniques, objects are assigned into predefined classes.
  • 14. Prediction  The prediction analysis technique can be used in the sale to predict profit for the future if we consider the sale is an independent variable, profit could be a dependent variable.  It is based on the historical sale and profit data, we can draw a fitted regression curve that is used for profit prediction
  • 15. Sequential patterns  It seeks to discover or identify similar patterns, regular events or trends in transaction data over a business period.  Eg: In sales, with historical transaction data, businesses can identify a set of items that customers buy together different times in a year. Then businesses can use this information to recommend customers buy it with better deals based on their purchasing frequency in the past.
  • 17. Decision Tree  It is a classification scheme which generates a tree and a set of rules from given data set.  A decision tree consist of root node, edges and leaf.  Each internal node denotes a test on an attribute.  Each branch node denotes outcome of the test.  Each leaf node holds the class label.
  • 19. ID3 algorithm i. Calculate the entropy of every attribute using the dataset. ii. Split the set into subsets using the attribute for which entropy is minimum(or information gain is maximum) iii. Make a decision tree node containing that attribute iv. Recurse on subset using remaining attributes
  • 20. Entropy  Info(D)=- 𝑖=1 𝑚 𝑝𝑖 𝑙𝑜𝑔2 (𝑝𝑖) where D=number of rows 𝑝𝑖 = class/D  𝐼𝑛𝑓𝑜 𝐴(D)= 𝑗=1 𝑣 |𝐷 𝑗| 𝐷 ∗ 𝐼𝑛𝑓𝑜(𝑑) Information gain  Gain(A)=Info(d)- 𝐼𝑛𝑓𝑜 𝐴(D)
  • 22. Finding entropy and gain  Finding info(d) Info(D)= - 9 14 𝑙𝑜𝑔2( 9 14 )- 5 14 𝑙𝑜𝑔2( 5 14 )= 0.940 bits  Finding for age 𝐼𝑛𝑓𝑜 𝑎𝑔𝑒(D)= 5 14 * (- 2 5 𝑙𝑜𝑔2( 2 5 )- 3 5 𝑙𝑜𝑔2( 3 5 ))+ 4 14 * (- 4 4 𝑙𝑜𝑔2( 4 4 )- 0 4 𝑙𝑜𝑔2( 0 4 ))+ 5 14 * (- 3 5 𝑙𝑜𝑔2( 3 5 )- 2 5 𝑙𝑜𝑔2( 2 5 )) = 0.694 bits Thus, Gain(age)=Info(D)- 𝐼𝑛𝑓𝑜 𝑎𝑔𝑒(D)=0.945-0.694=0.246 bits Similarly we can find gain of all other attributes. Gain(income)=0.029,Gain(student)=0.151,Gain(credit_rating)=0.048 Among all the gains the gain of age is the maximum thus it is taken as the root node.
  • 23. Rule Based Classification  It uses the set of IF-THEN rules for the classification.  IF-THEN rule is an expression of the form IF condition THEN conclusion.  IF part is called rule antecedent or precondition, it can consist of one or more attributes test.  THEN part is called rule consequent, it consist a class prediction. Ex: R1:IF age=youth and student=yes THEN buys_computer=yes Or R1: (age=youth)Λ(student=yes)=>(buys_computer=yes)
  • 24.  A rule R can be assessed by its coverage and accuracy –Given a tuple X from a data D –Let n cover: # of tuples covered by R –n correct: # of tuples correctly classify by R –|D|: # of tuples in D Coverage(R)= 𝑛 𝑐𝑜𝑣𝑒𝑟𝑠 |𝐷| Accuracy(R)= 𝑛 𝑐𝑜𝑟𝑟𝑒𝑐𝑡 𝑛 𝑐𝑜𝑣𝑒𝑟𝑠
  • 25. R: IF age=youth AND student=yes THEN buys_computer=yes |D| = 14 N cover = 2 N correct = 2 Coverage(R)= 2 14 =14.25% Correct(R)= 2 2 =100%
  • 26. Rules extraction from the decision tree  One rule is created for each path from the root to a leaf node  Each splitting criterion is logically AND to form the rule antecedent (IF part)  Leaf node holds the class prediction for rule consequent (THEN part)  Logical OR is implied between each of the extracted rules
  • 27. Example of rules extraction from the decision tree Rule’s 1. (age=young)˄(student=no)=>(buy_computer=no) 2. (age=young)˄(student=yes)=>(buy_computeryes) 3. (age=middle-aged)=>(buy_computer=yes) 4. (age=senior)˄(credit_rating=fair)=>(buy_computer=no) 5. (age=young)˄(credit_rating=excellent)=>(buy_computer= yes)
  • 28. Genetic algorithm  It is used for finding optimized solutions to search problems.  It is based on the theory of natural selection and evolution of biology. Selection: Survival of the fittest Evolution: Origin of species from a common descendent  It is excellent in searching large and complex data sets.
  • 29.  Gene: A part of chromosome.  Chromosome: A set of gene.  Population : No of individuals present with same length of chromosome.  Fitness: It is the value assigned to individual.  Fitness function: Function which assigns fitness value.  Selection: Selecting values for next generation.  Mutation: Changing a random gene.
  • 30. Algorithm 1. Generate random population of n chromosomes 2. Evaluate the fitness of each chromosome in the population 3. Select two parent chromosomes from a population according to their fitness 4. With a crossover probability cross over the parents to form a new offspring. 5. With a mutation probability mutate new offspring . 6. If the end condition is satisfied, stop, and return the best solution in current population
  • 31. Linear Regression  It is a data mining technique used to predict a range of numeric values (also called continuous values), given a particular dataset.  It is used to estimate a relationship between two variables.  Here involves a response/dependent variable y and a single predictor/independent variable x y = 𝑤𝑜+ 𝑤1 x where 𝑤𝑜 (y-intercept) and 𝑤1 (slope) are regression coefficients
  • 32. X years experience Y salary(in $ 1000s) 3 30 8 57 9 64 13 72 3 36 6 43 11 59 21 90 1 20 16 83
  • 33. Non-linear regression, Association and frequent pattern Prepared by : Biplap Bhattarai
  • 34. • Often the relationship between x and y cannot be approximated with a straight line. In this case, a nonlinear regression technique may be used. • Alternatively, the data could be preprocessed to make the relationship linear. • Most Non-linear models can be modeled by polynomial regression model can be transformed into linear regression model. • For Example: Y=w0+w1x+w2x2 +w3x3 +w4x4 Non- Linear Regression:
  • 35. • For Example: Y=w0+w1x+w2x2 +w3x3 +w4x4 Convertible to Linear with new Variables: X2= x2,x3=x3,x4=x4 Y=w0+w1x1+w2x2 +w3x3 +w4x4 Which is easily solved by method of least squares using software for regression analysis. Non- Linear Regression:
  • 36. Association Rules: Detects sets of attributes that frequently co-occur, and rules among them, e.g. 90% of the people who buy cookies, also buy milk (60% of all grocery shoppers buy both) Frequent Pattern: Frequent Pattern is a pattern (a set of items, subsequences) that occurs frequently in a dataset. For example, a set of items, such as milk and bread, that appear frequently together in a transaction data set, is a frequent itemset. A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it occurs frequently in a shopping history database, is a (frequent) sequential pattern. Data Mining Techniques
  • 37. Association Rule Mining: Association rule mining is a procedure which is meant to find frequent patterns, correlations, associations or casual structures from datasets found in various kind of databases (relational, transactional). Data Mining Techniques
  • 38. The Apriori Principle: If an itemset is frequent, then all of its subsets must also be frequent. Conversely, if an subset is infrequent, then all of its supersets must be infrequent, too. • Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation, and groups of candidates are tested against the data. • Apriori is designed to operate on database containing transactions (for example, collections of items bought by customers, or details of a website frequentation). DEFINITION OF APRIORI ALGORITHM
  • 39. • Apriori principle holds due to the following property of the support measure:∀X,Y:(X⊂Y)→s(X)≥s(Y) • For all x,y if x is the subset of y implies support of x is greater or equal to support of y. DEFINITION OF APRIORI ALGORITHM
  • 40. • Frequent Itemsets: All the sets which contain the item with the minimum support (denoted by Li For ith itemset) • Apriori Property: Any subset of frequent itemset must be frequent. • Join Operation: To find lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself KEY CONCEPTS
  • 41. Given minimum required support s : 1. Search for all individual elements (1 element item set) that have a minimum support of s 2. Repeat 2.1From the result of the previous search for i-element item-sets, search for all i+1 element item-sets that have a minimum support of s 2.2 This becomes the sets of all frequent (i+1) element item-sets that are interesting 3. Until item-set size reaches maximum The Apriori Algorithm
  • 42. The Frequent itemsets are items in L1, L2, L3
  • 43. Pros of the Apriori algorithm • It is an easy-to-implement and easy-to-understand algorithm. • It can be used on large itemsets. Cons of the Apriori Algorithm • Sometimes, it may need to find a large number of candidate rules which can be computationally expensive. • Calculating support is also expensive because it has to go through the entire database. The Apriori Algorithm
  • 44. FP tree algorithm, which use to identify frequent patterns in the area of Data Mining. The Frequent Pattern Tree (FP-Tree) is a compact structure that stores quantitative information about frequent patterns in a database. Frequent Pattern Tree (FP Tree)
  • 45. 1. One root labeled as “null” with a set of item-prefix subtrees as children 2. Each node in the item-prefix subtree consists of three fields: i. Item-name: registers which item is represented by the node; ii. Count: the number of transactions represented by the portion of the path reaching the node; iii. Node-link: links to the next node in the FP-tree carrying the same item- name, or null if there is none. Frequent Pattern Tree (FP Tree)
  • 46. 1. Question :Find all frequent itemsets or frequent patterns in the following database using FP-growth algorithm. Take minimum support as 30%. Frequent Pattern Tree (FP Tree)
  • 47. Step 1 - Calculate Minimum support First should calculate the minimum support count. Question says minimum support should be 30%. It calculate as follows: Minimum support count(30/100 * 8) = 2.4 As a result, 2.4 appears but to empower the easy calculation it can be rounded to to the ceiling value. Now, Minimum support count is ceiling(30/100 * 8) = 3 Frequent Pattern Tree (FP Tree)
  • 48. Step 2 - Find frequency of occurrence Now time to find the frequency of occurrence of each item in the Table. For example, item A occurs in row 1,row 2,row 3,row 4 and row 7. Totally 5 times occurs in the Database table. You can see the counted frequency of occurrence of each item in Table below . Frequent Pattern Tree (FP Tree)
  • 49. Step 3 - Prioritize the items In Table 2 you can see the numbers written in Red pen. Those are the priority of each item according to it's frequency of occurrence. Item B got the highest priority (1) due to it's highest number of occurrences. At the same time you have opportunity to drop the items which not fulfill the minimum support requirement .For instance, if Table contain F which has frequency 1, then you can drop it. Frequent Pattern Tree (FP Tree)
  • 50. Step 4 -Order the items according to priority As you see in the Table below new column added to the Table before. In the Ordered Items column all the items are queued according to it's priority, which mentioned in the Red ink in Table. For example, in the case of ordering row 1, the highest priority item is B and after that D, A and E respectively. Frequent Pattern Tree (FP Tree)
  • 51. Row 1: Note that all FP trees have 'null' node as the root node. So draw the root node first and attach the items of the row 1 one by one respectively. And write their occurrences in front of it. Frequent Pattern Tree (FP Tree)
  • 52. Row 2: Then update the above tree by entering the items of row 2. Then without creating another branch you can go through the previous branch up to E and then you have to create new node after that for C. (When you going through the branch second time you should erase one and write two for indicating the two times you visit to that node. If you visit through three times then write three after erase two.) Figure 2 shows the FP tree after adding row 1 and row 2. Frequent Pattern Tree (FP Tree)
  • 53. Row 3: In row 3 you have to visit B,A,E and C respectively. So you may think you can follow the same branch again by replacing the values of B,A,E and C. But you can't do that you have opportunity to come through the B. But can't connect B to existing A overtaking D. As a result you should draw another A and connect it to B and then connect new E to that A and new C to new E. Frequent Pattern Tree (FP Tree)
• 54. Row 4: Row 4 contains B, D and A, so we can simply update the counts of occurrence in the existing branch: B:4, D:3, A:3. Row 5: Row 5 contains only item D, so we now have the opportunity to draw a new branch from the 'null' node. Frequent Pattern Tree (FP Tree)
• 55. Row 6: B and D appear in row 6, so just change B:4 to B:5 and D:3 to D:4. Row 7: Attach two new nodes, A and E, to the D node hanging from the null node, and mark the branch as D:2, A:1, E:1. Row 8: Attach a new node C to B and update the counts (B:6, C:1). Frequent Pattern Tree (FP Tree)
• 56. Step 6 - Validation After these steps, the final FP-tree is complete. How do we know it is correct? Count the frequency of occurrence of each item in the FP-tree and compare it with the frequencies in the table: if both counts are equal, that is a good indication that the tree is correct. A sketch of the construction plus this validation check follows. Frequent Pattern Tree (FP Tree)
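Putting the pieces together, the tree construction and the validation check might look like this (continuing from the earlier FPNode and ordered_rows sketches; node-links are omitted for brevity):

```python
# Insert one ordered row: reuse an existing child (bumping its count) or branch.
def insert_row(root, ordered_items):
    node = root
    for item in ordered_items:
        if item in node.children:
            node.children[item].count += 1   # revisiting: erase 1, write 2, etc.
        else:
            node.children[item] = FPNode(item, count=1, parent=node)
        node = node.children[item]

root = FPNode(item=None, count=0)
for row in ordered_rows:
    insert_row(root, row)

# Validation: per-item counts in the tree must match the table frequencies.
def tree_counts(node, totals):
    for child in node.children.values():
        totals[child.item] = totals.get(child.item, 0) + child.count
        tree_counts(child, totals)
    return totals

assert tree_counts(root, {}) == dict(frequent)
```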
• 57. • FP-Growth stands for Frequent Pattern Growth • It is a scalable technique for mining frequent patterns in a database • After constructing the FP-tree it is possible to mine it to find the complete set of frequent patterns • FP-growth improves on Apriori to a large extent • Frequent itemset mining is possible without candidate generation • Only two scans of the database are needed FP-Growth
• 58. Simply a two-step procedure – Step 1: Build a compact data structure called the FP-tree, using two passes over the data set. – Step 2: Extract frequent itemsets directly from the FP-tree. A library-based sketch follows. FP-Growth Algorithm Process
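In practice, both steps can be delegated to a library. A hedged end-to-end example using the third-party mlxtend package (assuming it and pandas are installed), applied to the transactions list from the earlier sketch:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = te.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=te.columns_)

# min_support is a fraction here, matching the 30% of the worked example.
print(fpgrowth(df, min_support=0.3, use_colnames=True))
```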
• 59. FP-Growth Example: with minimum support 3. FP-Tree
• 60. FP-Growth Example: For T: We start by drawing the tree whose end nodes are the T nodes, keeping only the support of T.
• 61. FP-Growth Example: We take out T one by one, and as we do so, we push its support up to every node on the chain to the root that was part of the same transaction in which T appeared:
• 62. FP-Growth Example: Since D is infrequent (min support = 3), we remove D. Itemsets = {C,T}, {W,T}, {A,T}, {C,W,T}, {C,A,T}, {W,A,T}, {T}
• 63. FP-Growth Example: Then, starting from the main FP-tree, we consider each item that appears in the tree and create new conditional FP-trees for C, W, A, and D, as sketched below.
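The starting point for each of those conditional trees is the item's conditional pattern base: the prefix paths leading to every node that carries the item. A minimal sketch reusing the FPNode class from earlier (a real implementation would follow the node-links rather than traversing the whole tree):

```python
# Collect (prefix-path, count) pairs for every node carrying `target`.
def conditional_pattern_base(root, target):
    paths = []
    def visit(node):
        for child in node.children.values():
            if child.item == target:
                prefix, p = [], child.parent
                while p is not None and p.item is not None:  # stop at the null root
                    prefix.append(p.item)
                    p = p.parent
                if prefix:
                    paths.append((list(reversed(prefix)), child.count))
            visit(child)
    visit(root)
    return paths
```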
• 64. Market Basket Analysis • Market Basket Analysis is a modelling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items. • For example, if you are in the US and purchase diapers and milk, then you are likely to buy beer. • The set of items a customer buys is referred to as an itemset, and market basket analysis seeks to find relationships between purchases.
• 65. Market Basket Analysis • Typically the relationship will be in the form of a rule: • IF {milk, diaper} THEN {beer}. • The fraction of transactions that contain milk, diapers and beer together is referred to as the support for the rule. The conditional probability that a customer who buys milk and diapers will also purchase beer is referred to as the confidence.
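A small sketch of both measures over a hypothetical list of baskets (the transactions and the resulting numbers are purely illustrative):

```python
baskets = [
    {'milk', 'diaper', 'beer'},
    {'milk', 'diaper'},
    {'milk', 'bread'},
    {'diaper', 'beer'},
    {'milk', 'diaper', 'beer', 'bread'},
]

antecedent, consequent = {'milk', 'diaper'}, {'beer'}
n_antecedent = sum(antecedent <= b for b in baskets)           # baskets with milk and diaper
n_both = sum((antecedent | consequent) <= b for b in baskets)  # ...that also contain beer

support = n_both / len(baskets)      # P(milk, diaper, beer) = 2/5 = 0.4
confidence = n_both / n_antecedent   # P(beer | milk, diaper) = 2/3 ≈ 0.67
print(support, confidence)
```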
• 66. How is it used? • In retailing, many purchases are made on impulse. Market basket analysis gives clues as to what a customer might have bought if the idea had occurred to them. • Market basket analysis can be used in deciding the location and promotion of goods inside a store.
  • 68. Types: • There are two main types of MBA: • Predictive MBA is used to classify cliques of item purchases, events and services that largely occur in sequence. • Differential MBA removes a high volume of insignificant results and can lead to very in-depth results. It compares information between different stores, demographics, seasons of the year, days of the week and other factors.
• 70. Clustering Cluster: a cluster is a group of objects that belong to the same class; similar objects are grouped in one cluster and dissimilar objects are grouped in another. Clustering is the process of organizing abstract objects into classes of similar objects.
• 71. Application of cluster analysis  Clustering can help marketers discover distinct groups in their customer base and characterize those groups based on purchasing patterns.  Clustering analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing.  Clustering also helps in classifying documents on the web for information discovery.
  • 72. Clustering methods  Partitioning method  Hierarchical method
• 73. Partitioning method  Suppose we are given a database of 'n' objects, and the partitioning method constructs 'k' partitions of the data. Each partition represents a cluster, and k ≤ n. That is, the method classifies the data into k groups which satisfy the following requirements:  Each group contains at least one object.  Each object must belong to exactly one group. Types of partitioning method: 1. K-means 2. K-medoids
• 74. K-means  K-means is one of the most popular clustering algorithms.  K-means stores k centroids that it uses to define clusters.  A point is considered to be in a particular cluster if it is closer to that cluster's centroid than to any other centroid.
• 75. Steps in K-means 1. Partition the objects into k non-empty subsets. 2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the centre, or mean point, of the cluster). 3. Assign each object to the cluster with the nearest seed point. 4. Go back to step 2; stop when there are no new assignments.
• 76. Worked example: cluster 2, 3, 6, 8, 9, 12, 15, 18, 22 into 3 clusters.
– Initial partition: Cluster 1 = {2, 12, 18}, mean = 10.67; Cluster 2 = {6, 9, 22}, mean = 12.33; Cluster 3 = {3, 8, 15}, mean = 8.67
• Re-assign: Cluster 1 = {}, mean = 0; Cluster 2 = {12, 15, 18, 22}, mean = 16.75; Cluster 3 = {2, 3, 6, 8, 9}, mean = 5.6
• Re-assign: Cluster 1 = {2}, mean = 2; Cluster 2 = {12, 15, 18, 22}, mean = 16.75; Cluster 3 = {3, 6, 8, 9}, mean = 6.5
• Re-assign: Cluster 1 = {2, 3}, mean = 2.5; Cluster 2 = {12, 15, 18, 22}, mean = 16.75; Cluster 3 = {6, 8, 9}, mean = 7.67
• Re-assign: no assignments change, so we're done (see the sketch below).
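A minimal sketch reproducing this 1-D walkthrough, starting from the same initial partition (an empty cluster keeps a mean of 0, exactly as in the second iteration above):

```python
points = [2, 3, 6, 8, 9, 12, 15, 18, 22]
clusters = [[2, 12, 18], [6, 9, 22], [3, 8, 15]]   # initial partition

while True:
    # Step 2: centroid (mean) of each cluster in the current partition.
    means = [sum(c) / len(c) if c else 0 for c in clusters]
    # Step 3: re-assign every point to the cluster with the nearest mean.
    new = [[] for _ in clusters]
    for p in points:
        nearest = min(range(len(means)), key=lambda i: abs(p - means[i]))
        new[nearest].append(p)
    # Step 4: stop when no assignment changes.
    if new == clusters:
        break
    clusters = new

print(clusters)  # [[2, 3], [12, 15, 18, 22], [6, 8, 9]]
```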
• 77. K-Medoids  It is also an algorithm which breaks the data set into groups.  It attempts to minimize the distance between the points assigned to a cluster and the point designated as the center of that cluster.  An actual data point is taken as the center, and the algorithm works with a generalization of the Manhattan norm; a sketch appears below.
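A naive K-medoids sketch for the same 1-D data, using the absolute (Manhattan) distance. The initial medoids are arbitrary, and a real implementation would use a library or the PAM swap heuristic rather than this simple alternating loop:

```python
def k_medoids(points, medoids, max_iter=100):
    for _ in range(max_iter):
        # Assign every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for p in points:
            clusters[min(medoids, key=lambda m: abs(p - m))].append(p)
        # Each cluster's new medoid is the member minimizing total distance.
        new_medoids = [
            min(c, key=lambda cand: sum(abs(cand - q) for q in c))
            for c in clusters.values()
        ]
        if set(new_medoids) == set(medoids):   # converged: medoids stopped moving
            return clusters
        medoids = new_medoids
    return clusters

print(k_medoids([2, 3, 6, 8, 9, 12, 15, 18, 22], medoids=[2, 9, 18]))
```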
• 79. Hierarchical Method Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into clusters. The endpoint is a set of clusters in which each cluster is distinct from the others, and the objects within each cluster are broadly similar to each other. For example, all files and folders on a hard disk are organized in a hierarchy. There are two types of hierarchical clustering: 1. Divisive 2. Agglomerative
• 80. Divisive Method In the divisive, or top-down, clustering method we assign all of the observations to a single cluster and then partition that cluster into the two least similar clusters. We proceed recursively on each cluster until there is one cluster for each observation. There is evidence that divisive algorithms produce more accurate hierarchies than agglomerative algorithms in some circumstances, but they are conceptually more complex.
• 82. Agglomerative Method In the agglomerative, or bottom-up, clustering method we assign each observation to its own cluster. We then compute the similarity (e.g., distance) between each pair of clusters and join the two most similar clusters, repeating the process until only a single cluster is left. Before any clustering is performed, the proximity matrix containing the distance between each pair of points must be determined using a distance function. The matrix is then updated to record the distance between each pair of clusters. The following three methods differ in how the distance between clusters is measured.
• 83. Single Linkage In single-linkage hierarchical clustering, the distance between two clusters is defined as the shortest distance between two points, one in each cluster. For example, the distance between clusters “r” and “s” in the figure is the length of the arrow between their two closest points. Complete Linkage In complete-linkage hierarchical clustering, the distance between two clusters is defined as the longest distance between two points, one in each cluster. For example, the distance between clusters “r” and “s” in the figure is the length of the arrow between their two furthest points.
• 84. Average Linkage In average-linkage hierarchical clustering, the distance between two clusters is defined as the average distance from each point in one cluster to every point in the other cluster. For example, the distance between clusters “r” and “s” in the figure is the average length of the arrows connecting the points of one cluster to the points of the other. A library-based sketch of all three linkage methods follows.
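A hedged example of the three linkage strategies using SciPy (assuming scipy and numpy are installed); the 2-D points are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0], [5.0, 6.0], [9.0, 1.0]])

for method in ('single', 'complete', 'average'):
    Z = linkage(X, method=method)                     # merge history (dendrogram data)
    labels = fcluster(Z, t=3, criterion='maxclust')   # cut into 3 flat clusters
    print(method, labels)  # e.g. [1 1 2 2 3]; label numbering may vary by method
```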

Editor's notes

1. Correlation: a mutual relationship or connection between two or more things; the interdependence of variable quantities. Causal relations between points are interpreted as describing which events in spacetime can influence which other events. A data set (or dataset) is a collection of data. Most commonly a data set corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable and each row corresponds to a given member of the data set in question. A transactional database is a database management system (DBMS) that has the capability to roll back or undo a database transaction or operation if it is not completed appropriately.
2. Scalability is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth.
  3. An impulse purchase or impulse buying is an unplanned decision to buy a product or service, made just before a purchase.