7. Machine Learning
◦ Grew out of work in AI
◦ New Capability for computers
Machine Learning is the science of getting computers to learn without being
explicitly programmed.
Learning = improving with experience at some task:
◦ Improve over task T
◦ With respect to performance measure P
◦ Based on experience E
8. Database Mining
◦ Large datasets from growth of automation/web
◦ Ex: web click data, medical records, biology, engineering
Applications we can't program by hand
◦ Ex: Autonomous helicopter, handwriting recognition, most of NLP,
Computer vision
Self-customizing programs
◦ Amazon, Netflix product recommendation
Understanding human learning (brain, real AI)
9. Types of Learning
◦ Supervised learning: Learn to predict
◦ the correct answer is given for each example; the answer can be a numeric variable, a categorical variable, etc.
◦ Unsupervised learning: learn to understand and describe the data
◦ correct answers are not given, just examples (e.g. the same figures as above, without the labels)
◦ Reinforcement learning: learn to act
◦ occasional rewards
[Figure: example data points labeled M/F]
11. Algorithms
The success of a machine learning system also depends on the algorithms.
The algorithms control the search to find and build the knowledge structures.
The learning algorithms should extract useful information from the training examples.
12. Algorithms
Supervised learning
◦ Prediction
◦ Classification (discrete labels), Regression (real values)
Unsupervised learning
◦ Clustering
◦ Probability distribution estimation
◦ Finding associations (in features)
◦ Dimension reduction
Reinforcement learning
◦ Decision making (robot, chess machine)
13. Supervised Learning
• The problem of taking a labeled dataset and gleaning information from it so that you can label new data sets
• Learn to predict output from input
• Function approximation
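A least-squares line fit is one minimal instance of the "function approximation" view above: learn a function from labeled examples, then predict outputs for unseen inputs. The data and function name below are made up purely for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b to labeled examples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

# Labeled training data: input x with its correct answer y (roughly y = 2x)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]

a, b = fit_line(xs, ys)          # "learn" the parameters from experience
print(round(a, 2), round(b, 2))  # learned slope and intercept
print(round(a * 6 + b, 1))       # predict the output for the unseen input x = 6
```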
18. Q?
1. Input: Credit history (# of loans, how much money you make, …)
Output: Lend money or not?
2. Input: Picture; Output: Predict BSc, MSc, or PhD
3. Input: Picture; Output: Predict age
4. Input: Large inventory of identical items; Output: Predict how many
items will sell over the next 3 months
5. Input: Customer accounts; Output: Hacked or not
20. Unsupervised Learning: Examples
Organize computing clusters
◦ Large data centers: which machines work together?
Social network analysis
◦ Given information on which friends you email most often / FB friends / Google+ circles
◦ Can we automatically identify cohesive groups of friends?
21. Market Segmentation
◦ Take a customer data set and group customers into different market segments
Astronomical data analysis
◦ Clustering algorithms give interesting and useful theories, e.g. how galaxies are formed
22. Q?
1. Given emails labeled as spam/not spam, learn a spam filter
2. Given a set of news articles found on the web, group them into
sets of articles about the same story
3. Given a database of customer data, automatically discover
market segments and group customers into different market
segments
4. Given a dataset of patients diagnosed as either having
diabetes or not, learn to classify new patients as having
diabetes or not
26. What is KNN?
A powerful classification algorithm used in pattern recognition.
K nearest neighbors stores all available cases and classifies
new cases based on a similarity measure (e.g. a distance function).
One of the top data mining algorithms used today.
A non-parametric lazy learning algorithm (an instance-based learning method).
27. KNN: Classification Approach
An object (a new instance) is classified by a majority vote of its neighbors' classes.
The object is assigned to the most common class amongst its K nearest
neighbors (measured by a distance function).
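The majority vote itself is a one-liner; a minimal sketch, where the function name and the example labels are illustrative assumptions:

```python
from collections import Counter

def majority_vote(neighbor_classes):
    """Assign the most common class among the K nearest neighbors."""
    return Counter(neighbor_classes).most_common(1)[0][0]

# Hypothetical labels of the K = 3 nearest neighbors
print(majority_vote(["Yes", "No", "Yes"]))  # -> Yes
```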
31. Distance Between Neighbors
Calculate the distance between the new example (E) and all examples in the
training set.
Euclidean distance between two examples:
◦ X = [x1, x2, x3, ..., xn]
◦ Y = [y1, y2, y3, ..., yn]
◦ The Euclidean distance between X and Y is defined as:
D(X, Y) = sqrt( Σ_{i=1}^{n} (x_i − y_i)² )
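The Euclidean distance formula translates directly to code; a minimal sketch (the function name is an illustrative choice):

```python
from math import sqrt

def euclidean_distance(x, y):
    """D(X, Y) = sqrt of the sum of squared coordinate differences."""
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance([0, 0], [3, 4]))  # -> 5.0 (the classic 3-4-5 triangle)
```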
32. K-Nearest Neighbor Algorithm
All the instances correspond to points in an n-dimensional feature space.
Each instance is represented by a set of numerical attributes.
Each training example consists of a feature vector and an associated class label.
Classification is done by comparing the new point's feature vector with those of the K nearest points:
◦ Select the K examples nearest to E in the training set.
◦ Assign E to the most common class among its K nearest neighbors.
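The whole procedure can be sketched in a few lines. The toy 2-D data, function names, and choice of K below are illustrative assumptions, not from the slides:

```python
from collections import Counter
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(training, new_point, k):
    """training: list of (feature_vector, class_label) pairs."""
    # Distance from the new point to every training example
    by_distance = sorted(training, key=lambda ex: euclidean(ex[0], new_point))
    # Take the K nearest and vote for the most common class
    k_labels = [label for _, label in by_distance[:k]]
    return Counter(k_labels).most_common(1)[0][0]

training = [([1, 1], "A"), ([1, 2], "A"), ([5, 5], "B"), ([6, 5], "B")]
print(knn_classify(training, [1.5, 1.5], k=3))  # -> A
```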
33.
Customer | Age | Income | No. credit cards | Class
George   | 35  | 35K    | 3 | No
Rachel   | 22  | 50K    | 2 | Yes
Steve    | 63  | 200K   | 1 | No
Tom      | 59  | 170K   | 1 | No
Anne     | 25  | 40K    | 4 | Yes
John     | 37  | 50K    | 2 | ? → YES

Distance from John:
George: sqrt[(35-37)² + (35-50)² + (3-2)²] = 15.16
Rachel: sqrt[(22-37)² + (50-50)² + (2-2)²] = 15
Steve:  sqrt[(63-37)² + (200-50)² + (1-2)²] = 152.23
Tom:    sqrt[(59-37)² + (170-50)² + (1-2)²] = 122
Anne:   sqrt[(25-37)² + (40-50)² + (4-2)²] = 15.74
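The distances and the final vote can be reproduced in code. A sketch, assuming income is taken in thousands and features are left unscaled, as on the slide:

```python
from math import sqrt

# Customers from the slide: (age, income in $K, no. credit cards, class)
customers = {
    "George": (35, 35, 3, "No"),
    "Rachel": (22, 50, 2, "Yes"),
    "Steve":  (63, 200, 1, "No"),
    "Tom":    (59, 170, 1, "No"),
    "Anne":   (25, 40, 4, "Yes"),
}
john = (37, 50, 2)

# Euclidean distance from John to every training example
distances = {
    name: sqrt(sum((a - b) ** 2 for a, b in zip(feats[:3], john)))
    for name, feats in customers.items()
}
for name, d in sorted(distances.items(), key=lambda kv: kv[1]):
    print(f"{name}: {d:.2f}")

# With K = 3, the nearest neighbors are Rachel, George and Anne
k_nearest = sorted(distances, key=distances.get)[:3]
votes = [customers[name][3] for name in k_nearest]
print(max(set(votes), key=votes.count))  # -> Yes
```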
34. How to choose K?
If K is too small, the method is sensitive to noise points.
A larger K works well, but too large a K may include majority points
from other classes.
A rule of thumb is K < sqrt(n), where n is the number of examples.
36. [Figure: the same record x classified under (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]
The K nearest neighbors of a record x are the data points
that have the K smallest distances to x.
37. Strengths of KNN
Very simple and intuitive.
Can be applied to data from any distribution.
Gives good classification if the number of samples is large enough.
38. Weaknesses of KNN
Takes more time to classify a new example:
◦ need to calculate and compare the distance from the new
example to all stored examples.
Choosing K may be tricky.
Needs a large number of samples for accuracy.
40. What is Clustering?
The grouping of records, observations, or cases into classes of similar objects.
A cluster is a collection of records that are:
◦ similar to one another
◦ dissimilar to records in other clusters
43. Difference between Clustering and Classification
There is no target variable for clustering.
Clustering does not try to classify or predict the values of a target variable.
Instead, clustering algorithms seek to segment the entire data set into relatively
homogeneous subgroups or clusters,
◦ where the similarity of the records within a cluster is maximized, and
◦ the similarity to records outside the cluster is minimized.
44. Goal of Clustering
The between-cluster variation (BCV) should be large compared to the
within-cluster variation (WCV).
◦ Within-cluster variation (intra-cluster distance): the sum of distances
between objects in the same cluster is minimized.
◦ Between-cluster variation (inter-cluster distance): the distances between
different clusters are maximized.
46. k-Means Clustering
Input: n objects (or points) and a number k
Algorithm
1) Randomly pick k records to be the initial cluster center locations
2) Assign each object to the group that has the closest centroid
3) When all objects have been assigned, recalculate the positions of the k
centroids
4) Repeat steps 2 and 3 until convergence or termination
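The four steps can be sketched in plain Python; the toy 2-D points, the seed, and the function name are illustrative assumptions:

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means on 2-D points, following the four steps above."""
    rng = random.Random(seed)
    # 1) randomly pick k records as the initial centroids
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # 2) assign each object to the group with the closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                          + (p[1] - centroids[i][1]) ** 2)
            clusters[i].append(p)
        # 3) recalculate the centroid positions
        new_centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # 4) repeat until the centroids stop moving (convergence)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated blobs, made up for illustration
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```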
48. Termination Conditions
The algorithm terminates when the centroids no longer change, or when
the SSE (sum of squared errors) value is less than some small
threshold value:

SSE = Σ_{i=1}^{k} Σ_{p ∈ Ci} d(p, m_i)²

where p ∈ Ci represents each data point in cluster i and m_i
represents the centroid of cluster i.
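The SSE formula can be written directly; a small sketch with hypothetical clusters and centroids, using squared Euclidean distance as d:

```python
def sse(clusters, centroids):
    """SSE = sum over clusters i, sum over points p in Ci, of d(p, m_i)^2."""
    return sum(
        (p[0] - m[0]) ** 2 + (p[1] - m[1]) ** 2
        for c, m in zip(clusters, centroids)
        for p in c
    )

# Hypothetical 2-D clustering: each point is 1 unit from its centroid
clusters = [[(1, 1), (1, 3)], [(8, 8), (10, 8)]]
centroids = [(1, 2), (9, 8)]
print(sse(clusters, centroids))  # -> 4
```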
49. Example 1:
Let's suppose the following points are the delivery locations for
pizza.
58. How to decide k?
Unless the analyst has prior knowledge of the number of underlying clusters:
◦ the clustering solutions for each value of k are compared, and
◦ the value of k resulting in the smallest SSE is selected
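This comparison can be sketched by running k-means for several values of k and recording the SSE each time. The 1-D data, seed, and helper function below are illustrative assumptions:

```python
import random

def kmeans_sse(points, k, seed=0, iterations=50):
    """Run a tiny 1-D k-means and return the resulting SSE."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: (p - centroids[i]) ** 2)].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sum((p - centroids[i]) ** 2
               for i, c in enumerate(clusters) for p in c)

# Data with three underlying groups, made up for illustration
points = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
for k in range(1, 5):
    print(k, round(kmeans_sse(points, k), 2))
# The SSE typically drops sharply up to the true number of
# clusters (3 here) and flattens afterwards.
```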
61. It is important that the test data is not used in any way to create the classifier.
63. Classification Step 1: Split data into train and test sets
[Figure: past data with known results (+/−) is split into a training set and a testing set]
64. Classification Step 2: Build a model on a training set
[Figure: the model builder learns from the training set only; the testing set is held out]
65. Classification Step 3: Evaluate on test set (re-train?)
[Figure: the model's predictions (Y/N) on the testing set are compared with the known results to evaluate it]
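The three steps above can be sketched end to end. The data, split ratio, and trivial majority-class model below are illustrative assumptions, not from the slides:

```python
import random

random.seed(0)
# Made-up labeled records: (input, known result)
data = [(x, "+" if x > 5 else "-") for x in range(10)]
random.shuffle(data)

# Step 1: split into train and test sets; the test data never
# touches model building in any way.
split = int(0.7 * len(data))
train, test = data[:split], data[split:]

# Step 2: build a model on the training set only.
# (A trivial majority-class predictor, just to illustrate the workflow.)
labels = [y for _, y in train]
majority = max(set(labels), key=labels.count)

def model(x):
    return majority

# Step 3: evaluate on the held-out test set
accuracy = sum(model(x) == y for x, y in test) / len(test)
print(f"accuracy: {accuracy:.2f}")
```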