4. Is this Classification?
● Is this bank transfer fraudulent?
● Is this patient healthy?
● Will you vote for me, for X, or for Y?
● Are these two people a good match for each other?
● Is this an apple?
Def.: Given a number of examples, identify the class to which a given
observation belongs.
5. What do we need for apple classification?
Each row is an observation; Mass, Width, and Height are the attributes; Is an Apple is the class.
Training data
Mass  Width  Height  Is an Apple
192   8.4    7.3     1
342   9      9.4     0
186   7.2    9.2     0
152   7.6    7.3     1
Test data
Mass  Width  Height  Is an Apple
80    5.9    4.3     1
194   7.2    10.3    0
https://homepages.inf.ed.ac.uk/imurray2/teaching/oranges_and_lemons/
9. A Nearest Neighbor Classifier
● Find the point in the training set that is nearest to the new point.
● If that nearest point is an apple, classify the new point as apple.
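The two bullets above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the function name `nearest_neighbor_class` is made up here, Euclidean distance on the raw attribute values is an assumption, and the rows are the ones from the apple table on slide 5.

```python
import math

# Training rows from the apple table: (mass, width, height) -> 1 if apple, else 0.
training = [
    ((192, 8.4, 7.3), 1),
    ((342, 9.0, 9.4), 0),
    ((186, 7.2, 9.2), 0),
    ((152, 7.6, 7.3), 1),
]

def nearest_neighbor_class(new_point, training_rows):
    """Return the class of the training point closest to new_point."""
    # Assumption: "nearest" means smallest Euclidean distance on raw values.
    nearest = min(training_rows, key=lambda row: math.dist(row[0], new_point))
    return nearest[1]

# The two test rows from the same table.
print(nearest_neighbor_class((80, 5.9, 4.3), training))
print(nearest_neighbor_class((194, 7.2, 10.3), training))
```

On raw values this sketch labels both test rows as apples, although the table marks the second one as 0. Differences in mass outweigh differences in width and height, which is one reason to rescale the attributes before measuring distances.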
12. Chronic Kidney Disease Classification
Because the attributes have very different value ranges, a difference
in white blood cell count affects the distance far more than a
difference in glucose.
⇒ Convert the attributes to standard units.
Kidney Cross Section by Anmats is licensed
under CC BY 3.0
Glucose  White Blood Cell Count  Class
117      6700                    1
70       12100                   1
114      7200                    0
131      6800                    0
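Converting a column to standard units means subtracting its mean and dividing by its standard deviation. The sketch below applies this to the four rows shown on this slide. The function name `standard_units` and the use of the population standard deviation are assumptions of this example.

```python
import statistics

def standard_units(values):
    """Convert values to standard units: (value - mean) / standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # assumption: population SD
    return [(v - mean) / sd for v in values]

glucose = [117, 70, 114, 131]
white_blood_cells = [6700, 12100, 7200, 6800]

glucose_su = standard_units(glucose)
wbc_su = standard_units(white_blood_cells)
print(glucose_su)
print(wbc_su)
```

After the conversion both columns have mean 0 and standard deviation 1, so neither attribute dominates the distance merely because of its scale.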
14. What if there is no clear decision boundary?
Does this patient have the chronic kidney disease?
15. K-Nearest Neighbors
(Figure legend: ill, healthy)
● Find the k points in the training set that are nearest to the new point.
● If most of these k nearest points are healthy, classify the new point as healthy.
18. Is Alice ill?
● Step 1: Find the distance between Alice and each point in the training
sample.
● Step 2: Sort the data table in increasing order of the distances.
● Step 3: Take the top k=4 rows of the sorted table.
● Step 4: Choose the majority class of these 4 rows.
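The four steps can be combined into one function. The following is a sketch under stated assumptions: the training points and Alice's values are hypothetical standardized numbers, `classify` is a made-up name, and distance is Euclidean.

```python
import math
from collections import Counter

def classify(new_point, training_rows, k):
    """k-nearest-neighbors vote; training_rows holds (attributes, class) pairs."""
    # Step 1: distance between the new point and every training point.
    with_distance = [(math.dist(attrs, new_point), label)
                     for attrs, label in training_rows]
    # Step 2: sort in increasing order of distance.
    with_distance.sort(key=lambda pair: pair[0])
    # Step 3: keep the k closest rows.
    top_k = with_distance[:k]
    # Step 4: majority class among those rows.
    votes = Counter(label for _, label in top_k)
    return votes.most_common(1)[0][0]

# Hypothetical standardized points: (glucose, white blood cell count) -> class.
training = [
    ((-0.9, -1.0), 0), ((-0.8, -0.9), 0), ((-0.7, -0.8), 0), ((-0.6, -0.5), 1),
    ((0.4, 0.8), 1), ((0.6, 0.2), 1), ((3.8, -1.3), 1),
]
alice = (-1.0, -1.0)  # hypothetical new patient
print(classify(alice, training, k=4))
```

With an even k such as 4, a 2-to-2 tie is possible. This sketch does not treat ties specially; an odd k avoids them when there are two classes.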
19. Is Alice ill?
Step 0.1: Load training data.
Step 0.2: Select and standardize the attributes that we use for classification.
21. Applying a function to each row in a table
● We can already apply a function to each element in a column:
TableName.apply(FunctionName, 'ColumnName')
● Now, we want to apply a function to the entire row:
TableName.apply(FunctionName)
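To make the difference between the two forms concrete without depending on the table library, the sketch below mimics both in plain Python. Everything in it is hypothetical: the helper names `apply_to_column` and `apply_to_rows`, the table contents, and the new point are illustrations, not the library's API.

```python
import math

# A tiny table as a list of rows; names and values are made up for illustration.
columns = ("Glucose", "White Blood Cell Count", "Class")
rows = [
    (-0.2, -0.6, 1),
    (-0.9, -1.0, 0),
    (0.4, 0.8, 1),
]

def apply_to_column(table_rows, function, column_name):
    """Call function on each element of one column."""
    index = columns.index(column_name)
    return [function(row[index]) for row in table_rows]

def apply_to_rows(table_rows, function):
    """Call function once per row, passing the entire row."""
    return [function(row) for row in table_rows]

new_point = (-1.0, -1.0)  # hypothetical new observation

def distance_from_new_point(row):
    # Use only the two attribute columns; the last entry is the class.
    return math.dist(row[:2], new_point)

print(apply_to_column(rows, abs, "Glucose"))
print(apply_to_rows(rows, distance_from_new_point))
```

The row-wise form is what a distance computation needs, because the distance depends on several attributes of the same row at once.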
22. Is Alice ill?
● Step 1: Find the distance between Alice and each point in the training
sample.
23. Is Alice ill?
● Step 1: Find the distance between Alice and each point in the training
sample.
Glucose   White Blood Cell Count   Class   Distance from Alice
-0.2215   -0.569768                1       0.88943
-0.9475   1.16268                  1       2.16332
3.8412    -1.27558                 1       4.84907
0.3963    0.809777                 1       2.28585
0.6435    0.232293                 1       2.0542
-0.5614   -0.505603                1       0.660906
24. Is Alice ill?
● Step 2: Sort the data table in increasing order of the distances.
Glucose    White Blood Cell Count   Class   Distance from Alice
-0.94759   -0.98684                 0       0.0540298
-0.82401   -0.98684                 0       0.176477
-0.87035   -0.794345                0       0.243107
-0.71588   -0.85851                 0       0.317401
-0.70043   -0.85851                 0       0.331301
25. Is Alice ill?
● Step 3: Take the top k=4 rows of the sorted table.
Glucose    White Blood Cell Count   Class   Distance from Alice
-0.94759   -0.98684                 0       0.0540298
-0.82401   -0.98684                 0       0.176477
-0.87035   -0.794345                0       0.243107
-0.71588   -0.85851                 0       0.317401
26. Is Alice ill?
● Step 4: Choose the majority class of these 4 rows.
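As a quick check, Steps 2 to 4 can be replayed on the class and distance values visible in the tables on slides 23 and 24. These are only the rows shown in the excerpts, not the full training set, and the variable names are made up for this sketch.

```python
from collections import Counter

# (class, distance from Alice) pairs copied from the tables on slides 23 and 24.
rows = [
    (1, 0.88943), (1, 2.16332), (1, 4.84907),
    (1, 2.28585), (1, 2.0542), (1, 0.660906),
    (0, 0.0540298), (0, 0.176477), (0, 0.243107),
    (0, 0.317401), (0, 0.331301),
]

k = 4
# Steps 2 and 3: sort by distance and keep the k closest rows.
top_k = sorted(rows, key=lambda row: row[1])[:k]
# Step 4: majority class among those rows.
majority = Counter(label for label, _ in top_k).most_common(1)[0][0]
print(top_k)
print(majority)
```

All four of the closest rows carry class 0, so the vote is unanimous for class 0.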
29. Training and Testing
● How good is the classifier?
● How well does the classifier predict data that it has not seen before?
Training by Luca_Episcopo is licensed under Pixabay
License
Game by pixabay.com is licensed under CC0 1.0
30. Generating test data (hold-out set)
● We can gather more data,
● or we randomly split the given data into two parts: training and testing.
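A random split can be sketched as a shuffle followed by a cut. The function name `split_train_test`, the 30% test fraction, and the fixed seed are assumptions chosen for illustration; the integers stand in for labeled observations.

```python
import random

def split_train_test(rows, test_fraction, seed=0):
    """Shuffle a copy of rows and split it into (training, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

observations = list(range(10))  # stand-ins for ten labeled observations
training, test = split_train_test(observations, test_fraction=0.3)
print(len(training), len(test))
```

Every observation ends up in exactly one of the two parts, so the classifier is never evaluated on a row it was built from.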
31. Never test on training data!
Is Felix a good soccer player?
Felix scores 10 goals in the training session with his friends!
So, Felix is a good player!?
Well, in the game, Felix is super nervous and scores an own goal.
32. Never train on test data!
If we train a 1-Nearest Neighbor classifier on the following data, would it make any
mistakes on the same data?
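The question can be tried out directly. In the sketch below the training points are hypothetical and all distinct, and `nearest_neighbor_class` is a made-up helper. Each point is its own nearest neighbor at distance 0, so the classifier simply returns the label it has already seen.

```python
import math

def nearest_neighbor_class(new_point, training_rows):
    """Class of the training point closest to new_point (1-nearest neighbor)."""
    return min(training_rows, key=lambda row: math.dist(row[0], new_point))[1]

# Hypothetical training points: (attributes, class). All points are distinct.
training = [((0.1, 0.2), 0), ((0.3, 0.9), 1), ((0.8, 0.4), 0), ((0.7, 0.7), 1)]

# "Test" on the very same rows the classifier was built from.
mistakes = sum(nearest_neighbor_class(attrs, training) != label
               for attrs, label in training)
print(mistakes)
```

A perfect score here says nothing about how the classifier handles points it has not seen, which is why a separate test set is needed.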
33. The Accuracy of the Classifier
https://www.inferentialthinking.com/chapters/17/5/Accuracy_of_the_Classifier.html
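Accuracy is the proportion of test observations that the classifier labels correctly. A minimal sketch, with a made-up function name and made-up predictions:

```python
def accuracy(predictions, true_classes):
    """Proportion of predictions that match the true classes."""
    correct = sum(p == t for p, t in zip(predictions, true_classes))
    return correct / len(true_classes)

# Hypothetical predictions and ground truth for five test observations.
print(accuracy([1, 0, 0, 1, 1], [1, 0, 1, 1, 0]))
```

Three of the five hypothetical predictions match, so this prints 0.6.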
34. Naming Convention for Prediction Evaluation
Was the prediction correct? True / False
Which class did we predict? Positive / Negative
Example: Prediction = Apple, Ground Truth = Not Apple
⇒ False Positive
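The naming rule combines two answers: whether the prediction was correct, and which class was predicted. A small sketch with a hypothetical helper `outcome_name` makes the rule explicit:

```python
def outcome_name(predicted_positive, actually_positive):
    """Name a prediction outcome, for example 'False Positive'."""
    # First word: was the prediction correct?
    correctness = "True" if predicted_positive == actually_positive else "False"
    # Second word: which class did we predict?
    predicted = "Positive" if predicted_positive else "Negative"
    return f"{correctness} {predicted}"

# Slide example: predicted apple (positive), ground truth is not an apple.
print(outcome_name(predicted_positive=True, actually_positive=False))
```

For the slide's example this prints "False Positive".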
37. Which K?
1-Nearest Neighbors: Test Accuracy 0.70
4-Nearest Neighbors: Test Accuracy 0.91
30-Nearest Neighbors: Test Accuracy 0.89
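One way to compare values of k is to compute the test accuracy for each candidate. The sketch below does this on synthetic two-class data generated with a fixed seed. The data, the helper names, and the 80/40 split are assumptions for illustration, so the numbers it prints are not the ones on the slide.

```python
import math
import random
from collections import Counter

def classify(new_point, training_rows, k):
    """Majority class among the k training points nearest to new_point."""
    nearest = sorted(training_rows,
                     key=lambda row: math.dist(row[0], new_point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy_on_test_set(training_rows, test_rows, k):
    """Proportion of test rows whose predicted class matches the true class."""
    correct = sum(classify(attrs, training_rows, k) == label
                  for attrs, label in test_rows)
    return correct / len(test_rows)

# Synthetic data: two overlapping clouds of points, 60 per class.
rng = random.Random(0)
data = [((rng.gauss(label * 1.5, 1.0), rng.gauss(label * 1.5, 1.0)), label)
        for label in (0, 1) for _ in range(60)]
rng.shuffle(data)
training, test = data[:80], data[80:]

accuracy_by_k = {k: accuracy_on_test_set(training, test, k) for k in (1, 4, 30)}
print(accuracy_by_k)
```

Comparing many values of k on the same test set can make the chosen k look better than it really is, so a further held-out set is a common precaution.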
38. Summary
● Classification: Given a number of examples, identify the class to which a
given observation belongs.
● We can use the nearest neighbors of an observation to classify it.
● To evaluate a classification model, we split the data into a training set and a test set.
● To measure success, we can use metrics such as accuracy.