2. Schedule:
1. Example of Datamining
2. What and Where is Datamining in the System
3. Datamining Techniques
Data preprocessing
Data Analysis
Data Visualization
3. What does the data look like?
X Y
3 3
3 1
2 2
4 6
2 3
6 7
7 5
5 6
Can we get something from this?
Each row represents an object, and its columns represent the object's attributes.
Ex: Can we identify the groups these objects belong to? YES
1. Example of Datamining
4. Now forget the table and consider each row as a point. Then we have:
[Figure: scatter plot of the data points, X and Y axes from 0 to 8; points A, B and C are labeled]
From each data point, we find its neighbors by scanning with a radius r.
For example: A has 2 neighbors, B and C, denoted A{B,C}.
[Figure: the scanning radius r is drawn around a point; point D is also labeled]
A and D have the same neighbors, so they are also considered neighbors of each other.
The same goes for B{A,B,C,D}, C{A,B,C,D} and D{B,C}.
Points that share a neighborhood are placed in the same group.
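The neighbor scan above can be sketched in a few lines of Python. This is a simplified version, not the exact procedure on the slide: it links any two points that lie within r of each other and reports the connected groups. The radius r = 2.5 and the function names are assumptions chosen for illustration; the points are the eight rows of the table.

```python
from math import dist

# The eight (X, Y) rows from the table.
points = [(3, 3), (3, 1), (2, 2), (4, 6), (2, 3), (6, 7), (7, 5), (5, 6)]

def neighbors(i, pts, r):
    """Indices of all points within distance r of point i (excluding i)."""
    return {j for j, q in enumerate(pts) if j != i and dist(pts[i], q) <= r}

def group_by_radius(pts, r):
    """Group points that are linked by a chain of neighbors within radius r."""
    unvisited = set(range(len(pts)))
    groups = []
    while unvisited:
        start = unvisited.pop()
        group, frontier = {start}, [start]
        while frontier:
            i = frontier.pop()
            for j in neighbors(i, pts, r):
                if j in unvisited:
                    unvisited.remove(j)
                    group.add(j)
                    frontier.append(j)
        groups.append(sorted(pts[k] for k in group))
    return sorted(groups)

# r = 2.5 is an assumed radius, not a value given in the slides.
groups = group_by_radius(points, r=2.5)
for g in groups:
    print(g)
```

With this radius the scan yields two groups. A much smaller or much larger r would split or merge them, so the choice of r matters.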
5. Finally we have 2 groups after considering all points
[Figure: the same scatter plot, X and Y axes from 0 to 8, with the points divided into two groups]
What do we see here?
The data was not classified into groups beforehand, yet we now have the groups.
This is just one example of a technique called CLUSTERING in DATAMINING.
6. 2. What and Where is Datamining in the System
So, what exactly is Datamining?
Datamining is a set of tools and techniques for retrieving hidden knowledge/rules from data.
The name "datamining" can be misleading: the data is already there, so we do not need to "mine" it.
For ore mining you need hammers and shovels.
For datamining, however, you need mathematics, statistics and probability, machine learning, computer programming, database techniques, ...
7. 2. What and Where is Datamining in the System
Where is Datamining in the system?
Employee/Staff
Day by day, staff use software (web, desktop or mobile applications) to generate data by recording all of their business activities (customers, products, order details, contracts, …).
The data is added to a database: online transaction processing (OLTP).
[Diagram: Employee/Staff → several OLTP databases → Integration Service → Data Warehouse → Data Mining]
Data from several data sources (OLTP) is collected into a common repository, the data warehouse, by an integration service.
The datamining service accesses the data warehouse to process the data.
8. 3. Datamining Techniques
What are the techniques in Datamining?
Many techniques can be applied in datamining.
Basically, we can classify them into 3 groups / phases:
Data-Preprocessing
Data Analysis
Data Presentation
10. 3. Datamining Techniques
We should expect that:
The quality of the collected data is often poor.
It needs to be cleaned / formatted / transformed ... before it is analyzed.
This is a very important process, but it is very hard to describe in an abstract, general way.
Data-Preprocessing
Here we will see a few examples of data pre-processing techniques:
• Similarity Measure
• Down Sampling
• Dimension Reduction
• Vectorization
11. 3. Datamining Techniques
How can we know which objects are similar?
Data-Preprocessing Similarity Measure
[Figure: points A(x1,y1), B(x2,y2) and C(x3,y3); D1 is the distance from A to B and D2 is the distance from A to C]
Measure the distances AB and AC.
We see that D1 < D2 -> A is more similar to B than to C
Every point can also be represented as a vector. Measure the angle between each pair of vectors: A and B, then A and C.
[Figure: 𝜶 is the angle between vectors A and B; 𝜷 is the angle between vectors A and C]
We see that 𝜶 < 𝜷 -> A is more similar to B than to C
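Both similarity measures can be tried out with a short Python sketch. The coordinates of A, B and C below are made-up values for illustration (the slides give none), and `angle_between` is a hypothetical helper name.

```python
from math import acos, degrees, dist, hypot

def angle_between(u, v):
    """Angle in degrees between two non-zero 2-D vectors u and v."""
    cosine = (u[0] * v[0] + u[1] * v[1]) / (hypot(*u) * hypot(*v))
    # Clamp to [-1, 1] to guard against tiny floating-point overshoot.
    return degrees(acos(max(-1.0, min(1.0, cosine))))

# Hypothetical coordinates, chosen only to illustrate the comparison.
A, B, C = (2, 3), (3, 4), (6, 1)

d1, d2 = dist(A, B), dist(A, C)                          # distance measure
alpha, beta = angle_between(A, B), angle_between(A, C)   # angle measure

print(f"D1 = {d1:.3f}, D2 = {d2:.3f}")
print(f"alpha = {alpha:.1f} deg, beta = {beta:.1f} deg")
```

For these points both measures agree that A is closer to B than to C. In general the two measures can disagree, because the angle ignores vector length while the distance does not.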
12. 3. Datamining Techniques
What if you have so much data that analyzing all of it is unnecessary and hurts performance?
Data-Preprocessing Down Sampling
Just pick some of the objects to evaluate.
Example: lay a grid with cell size 𝑔 over the data and keep only one object per cell.
[Figure: original data (left) and down-sampled data (right) on a grid with cell size 𝑔]
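A minimal sketch of grid-based down sampling, assuming that "one object per cell" means keeping the first object that falls into each cell (the slides do not say which one to keep). The cell size g = 2 and the function name are illustrative choices; the points reuse the earlier table.

```python
from math import floor

def grid_downsample(points, g):
    """Keep the first point seen in each g-by-g grid cell."""
    kept = {}
    for x, y in points:
        cell = (floor(x / g), floor(y / g))   # index of the cell containing the point
        kept.setdefault(cell, (x, y))         # later points in the same cell are dropped
    return list(kept.values())

points = [(3, 3), (3, 1), (2, 2), (4, 6), (2, 3), (6, 7), (7, 5), (5, 6)]
sample = grid_downsample(points, g=2)
print(sample)  # 5 of the 8 points remain
```

A larger g keeps fewer points. Other policies, such as keeping the point closest to the cell center, are equally valid.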
13. 3. Datamining Techniques
All the example data presented so far has 2 dimensions, i.e. 2 attributes (X, Y). What if each object had ~10,000 attributes?
Data-Preprocessing Dimension Reduction
This could reduce the performance (and/or accuracy) of data-analysis algorithms, so we need to reduce the number of dimensions.
Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) are two of the most widely used methods for doing this.
14. 3. Datamining Techniques
Data-Preprocessing Dimension Reduction - PCA
[Figure: original data on axes X and Y (left) is transformed by PCA into data projected onto the principal components 𝑃1 and 𝑃2 (right)]
We keep only the 𝑘 principal components with the highest eigenvalues. In the example above, we can set 𝑘 = 1 and keep only 𝑃1 instead of both 𝑃1 and 𝑃2.
In this way the number of dimensions is reduced.
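For 2-D data the idea can be sketched without any library, because the eigenvalues of a 2×2 covariance matrix have a closed form. The function name and the reuse of the earlier eight points are assumptions for illustration. Real data with many attributes would normally go through a numerical library instead.

```python
from math import hypot, sqrt

def pca_2d_first_component(points):
    """Return (P1 direction, (eigenvalue1, eigenvalue2), 1-D projections onto P1)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    mean = (sxx + syy) / 2
    spread = sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam1, lam2 = mean + spread, mean - spread
    # Eigenvector belonging to the larger eigenvalue lam1.
    if sxy != 0:
        vx, vy = sxy, lam1 - sxx
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = hypot(vx, vy)
    p1 = (vx / norm, vy / norm)
    # k = 1: each 2-D point becomes a single coordinate along P1.
    projected = [x * p1[0] + y * p1[1] for x, y in centered]
    return p1, (lam1, lam2), projected

points = [(3, 3), (3, 1), (2, 2), (4, 6), (2, 3), (6, 7), (7, 5), (5, 6)]
p1, eigenvalues, projected = pca_2d_first_component(points)
print("P1 direction:", p1)
print("eigenvalues:", eigenvalues)
print("1-D data:", projected)
```

The variance of the projected values equals the larger eigenvalue, which is why keeping the components with the highest eigenvalues preserves the most variance.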
15. 3. Datamining Techniques
Data-Preprocessing Vectorization
Most data analysis algorithms take a set of vectors as input, so we need to transform the collected data into a set of vectors.
Ex: Given a document: “Mr A has not passed the exam this year. He will do it again next year”
Some of the important words are extracted, such as “Mr A”, “not”, “pass”, “exam”, “again”, “next”, “year”.
Measuring the frequency of each word gives the vector that represents the document:
Mr A   not   pass   exam   again   next   year
1      1     1      1      1       1      2
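The counting step can be sketched as follows. The vocabulary is the list of important words from the slide; how those words are selected is not covered here. The small `stems` table that maps "passed" to "pass" is a hand-written stand-in for a real stemmer and is purely an assumption.

```python
import re
from collections import Counter

document = "Mr A has not passed the exam this year. He will do it again next year"
vocabulary = ["mr a", "not", "pass", "exam", "again", "next", "year"]
stems = {"passed": "pass"}  # hypothetical stand-in for a real stemming step

def vectorize(text, vocabulary, stems):
    """Count how often each vocabulary term occurs in the text."""
    tokens = [stems.get(t, t) for t in re.findall(r"[a-z]+", text.lower())]
    counts = Counter(tokens)
    # Also count two-word terms such as "mr a".
    counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return [counts[term] for term in vocabulary]

vector = vectorize(document, vocabulary, stems)
print(vector)  # [1, 1, 1, 1, 1, 1, 2]
```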
17. 3. Datamining Techniques
There are many techniques in this phase:
• Clustering
• Classification
• Regression
• Rule Base
• ….
This is the most important phase, where we find the hidden knowledge/rules in the data.
Data Analysis
18. 3. Datamining Techniques
Clustering is the process of grouping objects into groups (clusters).
Data Analysis Clustering
Objects in the same cluster are similar to each other; objects in different clusters are not.
There are 2 types of clustering: Partitional & Hierarchical
In this presentation, we look at an example of the most famous clustering method: K-Means
19. 3. Datamining Techniques
Data Analysis Clustering – K-Means Algorithm
1. Randomly select K centers (centroids), one for each of the K clusters.
2. Calculate the distance from each object to each of the K centers.
3. Assign each object to the group of its nearest center.
4. Compute the new center of each group (the mean of its objects).
5. Repeat from step 2 until the group assignments no longer change.
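The five steps map almost line by line onto a small Python function. This is a minimal sketch, not a production implementation. The optional `init` parameter is an addition that makes the example reproducible, and the points are the eight rows from the earlier table.

```python
import random
from math import dist

def k_means(points, k, init=None, seed=0, max_iter=100):
    """Plain K-Means on a list of coordinate tuples."""
    rng = random.Random(seed)
    # Step 1: pick K starting centroids (randomly unless `init` is given).
    centroids = list(init) if init is not None else rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Steps 2-3: assign every object to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Step 5: stop once no object changes group.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: move each centroid to the mean of its group.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty group keeps its old centroid
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment

points = [(3, 3), (3, 1), (2, 2), (4, 6), (2, 3), (6, 7), (7, 5), (5, 6)]
centroids, assignment = k_means(points, k=2, init=[(3, 3), (4, 6)])
print(centroids)   # [(2.5, 2.25), (5.5, 6.0)]
print(assignment)  # [0, 0, 0, 1, 0, 1, 1, 1]
```

The result depends on the starting centroids. On this same data, starting from (3, 1) and (2, 2) converges to a different, poorer grouping, which is why implementations usually try several random starts.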
21. 3. Datamining Techniques
Data Analysis Clustering – K-Means Algorithm
[Figure panels: (1) Select K=2 centroids. (2) Each object belongs to the group of its closest centroid. (3) Compute the new position of the centroids. (4) Finally, the centroids stop changing.]
The key point of the algorithm is to select a good K.
22. 3. Datamining Techniques
Data Analysis Classification
How can we identify the group of an unclassified object?
Sure! We can perform clustering to do this.
However, what if we already know some objects that were classified in the past? Can we do better than clustering? YES.
We can construct a prediction model that predicts the group of unclassified objects based on the classified objects.
This process is called CLASSIFICATION.
23. 3. Datamining Techniques
Data Analysis Classification
The process of Classification can be described as below:
[Diagram: a Learning Algorithm produces a Model]
24. 3. Datamining Techniques
Data Analysis Classification - SVM
Support Vector Machine (SVM) is one of the best-known classification methods. In its basic form, it belongs to the group of linear classifiers.
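For example: training data classified as red and blue.
[Figure: red and blue training points separated by a line; a new point marked "?" is to be classified]
Classification model:
(𝑥, 𝑦) · 𝑤 + 𝑏 > 0 → blue
(𝑥, 𝑦) · 𝑤 + 𝑏 < 0 → red
𝑤 : normal vector of the separating line
𝑏 : bias (offset of the line from the origin; the distance to the origin is |𝑏| / ‖𝑤‖)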
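Once 𝑤 and 𝑏 are known, applying the model is a single dot product. The sketch below shows only this decision rule. It does not train an SVM, and the values of `w` and `b` are made up for illustration (they describe the line x + y = 8).

```python
def classify(point, w, b):
    """Apply the linear decision rule (x, y) . w + b to one point."""
    score = point[0] * w[0] + point[1] * w[1] + b
    if score > 0:
        return "blue"
    if score < 0:
        return "red"
    return "on the line"

# Hypothetical model parameters, not learned from data.
w, b = (1.0, 1.0), -8.0

print(classify((6, 7), w, b))  # blue
print(classify((2, 2), w, b))  # red
```

Training an SVM means choosing 𝑤 and 𝑏 so that the line separates the classes with the widest possible margin; that optimization is left to a library in practice.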
25. 3. Datamining Techniques
Data Analysis Regression
Also used for prediction, but here the goal is to predict the missing value of an attribute rather than a group.
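For example:
[Figure: data points on X and Y axes with a line through them; 𝑥𝑖 is marked on the X axis and 𝑦𝑖 on the Y axis]
• How do we find 𝑦𝑖 if 𝑥𝑖 is known?
• We can estimate the line that describes the data
• Plug 𝑥𝑖 into the line equation to find 𝑦𝑖
• This is just an example of Linear Regression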
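Estimating the line and plugging in 𝑥𝑖 can be sketched with ordinary least squares. The slides do not say how the line is fitted, so least squares is an assumption here, and the data values are made up.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations that lie exactly on y = 2x + 1.
xs, ys = [1, 2, 3, 4], [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)

x_i = 5                          # known attribute value
y_i = slope * x_i + intercept    # predicted missing value
print(slope, intercept, y_i)     # 2.0 1.0 11.0
```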
26. 3. Datamining Techniques
Data Analysis Rule Base
Rule Base techniques: used to find hidden patterns in the data
Examples of rule base techniques:
• Frequent Pattern: customers who normally buy rice always buy vegetables
• Gradual Pattern: young people want more expensive phones than others
• Sequential Pattern: people always buy a laptop before buying a cell phone
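One common way to check a frequent-pattern rule such as "rice → vegetable" is to compute its support and confidence. These two terms do not appear in the slides, and the transactions below are invented purely for illustration.

```python
# Hypothetical shopping baskets.
transactions = [
    {"rice", "vegetable"},
    {"rice", "vegetable", "egg"},
    {"rice"},
    {"bread", "vegetable"},
    {"rice", "vegetable"},
]

def rule_stats(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    with_both = [t for t in with_antecedent if consequent <= t]
    support = len(with_both) / len(transactions)        # share of all baskets
    confidence = len(with_both) / len(with_antecedent)  # share of antecedent baskets
    return support, confidence

support, confidence = rule_stats(transactions, {"rice"}, {"vegetable"})
print(support, confidence)  # 0.6 0.75
```

Here 60% of all baskets contain both items, and 75% of the baskets with rice also contain vegetables, so the rule holds often but not "always".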
28. 3. Datamining Techniques
Data Visualization
Techniques for presenting the knowledge you have retrieved to the user
[Figure: sample chart of Series 1, Series 2 and Series 3 by category; value axis from 0 to 14]
             Series 1   Series 2   Series 3
Category 1   4.3        2.4        2
Category 2   2.5        4.4        2
Category 3   3.5        1.8        3
Category 4   4.5        2.8        5