7. Machine Learning
◦ Grew out of work in AI
◦ New Capability for computers
Machine Learning is the science of getting computers to learn without being
explicitly programmed.
Learning = improving with experience at some task:
◦ Improve over task T
◦ With respect to performance measure P
◦ Based on experience E
8. Database Mining
◦ Large datasets from growth of automation/web
◦ Ex: web click data, medical records, biology, engineering
Applications we can't program by hand
◦ Ex: Autonomous helicopter, handwriting recognition, most of NLP,
Computer vision
Self-customizing programs
◦ Amazon, Netflix product recommendation
Understanding human learning (brain, real AI)
9. Types of Learning
◦ Supervised learning: Learn to predict
◦ the correct answer is given for each example; the answer can be a numeric variable, a categorical variable, etc.
◦ Unsupervised learning: learn to understand and describe the data
◦ correct answers are not given, just examples (e.g. the same figures as above, without the labels)
◦ Reinforcement learning: learn to act
◦ occasional rewards
[Figure: example data points labeled M/F]
11. Algorithms
The success of a machine learning system also depends on the algorithms.
The algorithms control the search to find and build the knowledge structures.
The learning algorithms should extract useful information from the training examples.
12. Algorithms
Supervised learning
◦ Prediction
◦ Classification (discrete labels), Regression (real values)
Unsupervised learning
◦ Clustering
◦ Probability distribution estimation
◦ Finding associations (in features)
◦ Dimension reduction
Reinforcement learning
◦ Decision making (robot, chess machine)
13. Supervised Learning
• The problem of taking a labeled dataset and gleaning information from it so that you can label new data sets
• Learn to predict output from input
• Function approximation
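A least-squares line fit is one minimal instance of the "function approximation" view above: learn a function from labeled examples, then predict outputs for unseen inputs. The data and function name below are made up purely for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b to labeled examples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

# Labeled training data: input x with its correct answer y (roughly y = 2x)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]

a, b = fit_line(xs, ys)          # "learn" the parameters from experience
print(round(a, 2), round(b, 2))  # learned slope and intercept
print(round(a * 6 + b, 1))       # predict the output for the unseen input x = 6
```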
18. Q?
1. Input: Credit history (# of loans, how much money you make, …)
Output: Lend money or not?
2. Input: Picture; Output: Predict BSc, MSc, or PhD
3. Input: Picture; Output: Predict age
4. Input: Large inventory of identical items; Output: Predict how many
items will sell over the next 3 months
5. Input: Customer accounts; Output: Hacked or not
20. Unsupervised Learning: Examples
Organize computing clusters
◦ Large data centers: which machines work together?
Social network analysis
◦ Given information on which friends you email most often / FB friends / Google+ circles
◦ Can we automatically identify cohesive groups of friends?
21. Market Segmentation
◦ Take a customer data set and group customers into different market segments
Astronomical data analysis
◦ Clustering algorithms give interesting and useful theories, e.g. how galaxies are formed
22. Q?
1. Given emails labeled as spam/not spam, learn a spam filter
2. Given a set of news articles found on the web, group them into
sets of articles about the same story
3. Given a database of customer data, automatically discover
market segments and group customers into different market
segments
4. Given a dataset of patients diagnosed as either having
diabetes or not, learn to classify new patients as having
diabetes or not
26. What is KNN?
A powerful classification algorithm used in pattern recognition.
K nearest neighbors stores all available cases and classifies
new cases based on a similarity measure (e.g. a distance function).
One of the top data mining algorithms used today.
A non-parametric lazy learning algorithm (an instance-based learning method).
27. KNN: Classification Approach
An object (a new instance) is classified by a majority vote of its neighbors' classes.
The object is assigned to the most common class amongst its K nearest
neighbors (measured by a distance function).
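The majority vote itself is a one-liner; a minimal sketch, where the function name and the example labels are illustrative assumptions:

```python
from collections import Counter

def majority_vote(neighbor_classes):
    """Assign the most common class among the K nearest neighbors."""
    return Counter(neighbor_classes).most_common(1)[0][0]

# Hypothetical labels of the K = 3 nearest neighbors
print(majority_vote(["Yes", "No", "Yes"]))  # -> Yes
```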
31. Distance Between Neighbors
Calculate the distance between the new example (E) and all examples in the
training set.
Euclidean distance between two examples:
◦ X = [x1, x2, x3, ..., xn]
◦ Y = [y1, y2, y3, ..., yn]
◦ The Euclidean distance between X and Y is defined as:
D(X, Y) = sqrt( Σ_{i=1}^{n} (x_i − y_i)² )
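The Euclidean distance formula translates directly to code; a minimal sketch (the function name is an illustrative choice):

```python
from math import sqrt

def euclidean_distance(x, y):
    """D(X, Y) = sqrt of the sum of squared coordinate differences."""
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance([0, 0], [3, 4]))  # -> 5.0 (the classic 3-4-5 triangle)
```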
32. K-Nearest Neighbor Algorithm
All the instances correspond to points in an n-dimensional feature space.
Each instance is represented by a set of numerical attributes.
Each training example consists of a feature vector and an associated class label.
Classification is done by comparing the new point's feature vector with those of the K nearest points:
◦ Select the K examples nearest to E in the training set.
◦ Assign E to the most common class among its K nearest neighbors.
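The whole procedure can be sketched in a few lines. The toy 2-D data, function names, and choice of K below are illustrative assumptions, not from the slides:

```python
from collections import Counter
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(training, new_point, k):
    """training: list of (feature_vector, class_label) pairs."""
    # Distance from the new point to every training example
    by_distance = sorted(training, key=lambda ex: euclidean(ex[0], new_point))
    # Take the K nearest and vote for the most common class
    k_labels = [label for _, label in by_distance[:k]]
    return Counter(k_labels).most_common(1)[0][0]

training = [([1, 1], "A"), ([1, 2], "A"), ([5, 5], "B"), ([6, 5], "B")]
print(knn_classify(training, [1.5, 1.5], k=3))  # -> A
```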
33.
Customer | Age | Income | No. credit cards | Class
George   | 35  | 35K    | 3 | No
Rachel   | 22  | 50K    | 2 | Yes
Steve    | 63  | 200K   | 1 | No
Tom      | 59  | 170K   | 1 | No
Anne     | 25  | 40K    | 4 | Yes
John     | 37  | 50K    | 2 | ? → YES

Distance from John:
George: sqrt[(35-37)² + (35-50)² + (3-2)²] = 15.16
Rachel: sqrt[(22-37)² + (50-50)² + (2-2)²] = 15
Steve:  sqrt[(63-37)² + (200-50)² + (1-2)²] = 152.23
Tom:    sqrt[(59-37)² + (170-50)² + (1-2)²] = 122
Anne:   sqrt[(25-37)² + (40-50)² + (4-2)²] = 15.74
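The distances and the final vote can be reproduced in code. A sketch, assuming income is taken in thousands and features are left unscaled, as on the slide:

```python
from math import sqrt

# Customers from the slide: (age, income in $K, no. credit cards, class)
customers = {
    "George": (35, 35, 3, "No"),
    "Rachel": (22, 50, 2, "Yes"),
    "Steve":  (63, 200, 1, "No"),
    "Tom":    (59, 170, 1, "No"),
    "Anne":   (25, 40, 4, "Yes"),
}
john = (37, 50, 2)

# Euclidean distance from John to every training example
distances = {
    name: sqrt(sum((a - b) ** 2 for a, b in zip(feats[:3], john)))
    for name, feats in customers.items()
}
for name, d in sorted(distances.items(), key=lambda kv: kv[1]):
    print(f"{name}: {d:.2f}")

# With K = 3, the nearest neighbors are Rachel, George and Anne
k_nearest = sorted(distances, key=distances.get)[:3]
votes = [customers[name][3] for name in k_nearest]
print(max(set(votes), key=votes.count))  # -> Yes
```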
34. How to choose K?
If K is too small, the method is sensitive to noise points.
A larger K works well, but too large a K may include majority points
from other classes.
A rule of thumb is K < sqrt(n), where n is the number of examples.
36. [Figure: the same record x classified under (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]
The K nearest neighbors of a record x are the data points
that have the K smallest distances to x.
37. Strengths of KNN
Very simple and intuitive.
Can be applied to data from any distribution.
Gives good classification if the number of samples is large enough.
38. Weaknesses of KNN
Takes more time to classify a new example:
◦ need to calculate and compare the distance from the new
example to all stored examples.
Choosing K may be tricky.
Needs a large number of samples for accuracy.
40. What is Clustering?
The grouping of records, observations, or cases into classes of similar objects.
A cluster is a collection of records that are:
◦ similar to one another
◦ dissimilar to records in other clusters
43. Difference between Clustering and Classification
There is no target variable for clustering.
Clustering does not try to classify or predict the values of a target variable.
Instead, clustering algorithms seek to segment the entire data set into relatively
homogeneous subgroups or clusters,
◦ where the similarity of the records within a cluster is maximized, and
◦ the similarity to records outside the cluster is minimized.
44. Goal of Clustering
The between-cluster variation (BCV) should be large compared to the
within-cluster variation (WCV).
◦ Within-cluster variation (intra-cluster distance): the sum of distances
between objects in the same cluster is minimized.
◦ Between-cluster variation (inter-cluster distance): the distances between
different clusters are maximized.
46. k-Means Clustering
Input: n objects (or points) and a number k
Algorithm
1) Randomly pick k records to be the initial cluster center locations
2) Assign each object to the group that has the closest centroid
3) When all objects have been assigned, recalculate the positions of the k
centroids
4) Repeat steps 2 and 3 until convergence or termination
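The four steps can be sketched in plain Python; the toy 2-D points, the seed, and the function name are illustrative assumptions:

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means on 2-D points, following the four steps above."""
    rng = random.Random(seed)
    # 1) randomly pick k records as the initial centroids
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # 2) assign each object to the group with the closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                          + (p[1] - centroids[i][1]) ** 2)
            clusters[i].append(p)
        # 3) recalculate the centroid positions
        new_centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # 4) repeat until the centroids stop moving (convergence)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated blobs, made up for illustration
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```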
48. Termination Conditions
The algorithm terminates when the centroids no longer change, or when
the SSE (sum of squared errors) value is less than some small
threshold value:

SSE = Σ_{i=1}^{k} Σ_{p ∈ Ci} d(p, m_i)²

where p ∈ Ci represents each data point in cluster i and m_i
represents the centroid of cluster i.
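The SSE formula can be written directly; a small sketch with hypothetical clusters and centroids, using squared Euclidean distance as d:

```python
def sse(clusters, centroids):
    """SSE = sum over clusters i, sum over points p in Ci, of d(p, m_i)^2."""
    return sum(
        (p[0] - m[0]) ** 2 + (p[1] - m[1]) ** 2
        for c, m in zip(clusters, centroids)
        for p in c
    )

# Hypothetical 2-D clustering: each point is 1 unit from its centroid
clusters = [[(1, 1), (1, 3)], [(8, 8), (10, 8)]]
centroids = [(1, 2), (9, 8)]
print(sse(clusters, centroids))  # -> 4
```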
49. Example 1:
Let's suppose the following points are the delivery locations for
pizza.
58. How to decide k?
Unless the analyst has prior knowledge of the number of underlying clusters:
◦ the clustering solutions for each value of k are compared, and
◦ the value of k resulting in the smallest SSE is selected
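This comparison can be sketched by running k-means for several values of k and recording the SSE each time. The 1-D data, seed, and helper function below are illustrative assumptions:

```python
import random

def kmeans_sse(points, k, seed=0, iterations=50):
    """Run a tiny 1-D k-means and return the resulting SSE."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: (p - centroids[i]) ** 2)].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sum((p - centroids[i]) ** 2
               for i, c in enumerate(clusters) for p in c)

# Data with three underlying groups, made up for illustration
points = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
for k in range(1, 5):
    print(k, round(kmeans_sse(points, k), 2))
# The SSE typically drops sharply up to the true number of
# clusters (3 here) and flattens afterwards.
```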
61. It is important that the test data is not used in any way to create the classifier.
63. Classification Step 1: Split data into train and test sets
[Figure: past data with known results (+/−) is split into a training set and a testing set]
64. Classification Step 2: Build a model on a training set
[Figure: the model builder learns from the training set only; the testing set is held out]
65. Classification Step 3: Evaluate on test set (re-train?)
[Figure: the model's predictions (Y/N) on the testing set are compared with the known results to evaluate it]
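The three steps above can be sketched end to end. The data, split ratio, and trivial majority-class model below are illustrative assumptions, not from the slides:

```python
import random

random.seed(0)
# Made-up labeled records: (input, known result)
data = [(x, "+" if x > 5 else "-") for x in range(10)]
random.shuffle(data)

# Step 1: split into train and test sets; the test data never
# touches model building in any way.
split = int(0.7 * len(data))
train, test = data[:split], data[split:]

# Step 2: build a model on the training set only.
# (A trivial majority-class predictor, just to illustrate the workflow.)
labels = [y for _, y in train]
majority = max(set(labels), key=labels.count)

def model(x):
    return majority

# Step 3: evaluate on the held-out test set
accuracy = sum(model(x) == y for x, y in test) / len(test)
print(f"accuracy: {accuracy:.2f}")
```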