1. Development of an intelligent
system predictor of delinquency
profiles
2. What we're going to see
● Motivation and goals
● Review of Case-Based Reasoning
● A few learning techniques
● Most relevant error estimators
● Software implementation
● Technologies involved
● Testing the software
● Delinquency detection
● Planning of the project
● Future of the project
● Conclusions
3. Motivation
● Release a valuable project by taking advantage of
recently acquired knowledge.
Goals
● Develop software in Ruby, based on CBR, capable of
predicting customer profiles involving fraud.
● Test the software.
● Attempt to predict with real cases provided by
Maderas Gomez S.A.
4. What is Case-Based Reasoning?
Case-Based Reasoning (CBR) is a name given to a reasoning
method that uses specific past experiences rather than a corpus
of general knowledge.
It is a form of problem solving by analogy in which a new problem
is solved by recognizing its similarity to a specific known problem,
then transferring the solution of the known problem to the new
one.
CBR systems consult their memory of previous episodes to help
address their current task, which could be:
● planning of a meal,
● classifying the disease of a patient,
● designing a circuit, etc.
5. Case-Based Reasoning Features
● Possibly the simplest form of machine learning
● Training cases are simply stored
● Each case is composed of a set of attributes, one of
which is the classification
● Previously solved experiences are used to solve
current cases
● May entail storing newly solved problems in the
case base
6. Case-Based Reasoning Cycle
● At the highest level of generality, a general CBR cycle
may be described by the following four processes:
1. RETRIEVE the most similar case or cases
2. REUSE the information and knowledge in that
case to solve the problem
3. REVISE the proposed solution
4. RETAIN the parts of this experience likely to be
useful for future problem solving
● A new problem is solved by retrieving one or more
previously experienced cases, reusing a case in one
way or another, revising the solution based on the reuse
of a previous case, and retaining the new experience by
incorporating it into the existing knowledge base (case
base).
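The four-step cycle above can be sketched in Ruby. The case format and the similarity measure (a simple count of matching attributes) are illustrative assumptions, not the thesis implementation:

```ruby
# Minimal sketch of the CBR cycle: RETRIEVE, REUSE, REVISE, RETAIN.
# Each stored case is a hash: { attributes: {...}, solution: ... }.
class CaseBase
  def initialize(cases = [])
    @cases = cases
  end

  # RETRIEVE: the stored case sharing the most attribute/value pairs.
  def retrieve(attributes)
    @cases.max_by { |c| (c[:attributes].to_a & attributes.to_a).size }
  end

  # REUSE: transfer the retrieved case's solution to the new problem.
  def reuse(retrieved)
    retrieved[:solution]
  end

  # RETAIN: store the newly solved case for future problems.
  def retain(attributes, solution)
    @cases << { attributes: attributes, solution: solution }
  end

  def solve(attributes)
    solution = reuse(retrieve(attributes)) # REVISE would validate it here
    retain(attributes, solution)
    solution
  end
end
```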
8. Learning Techniques
Decision Tree
● Method for approximating discrete-valued target
functions, capable of representing disjunctive
expressions (classification)
● One of the most widely used methods for inductive
inference
● Can be represented as if-then rules
Nearest Neighbor
● All instances correspond to points in an n-dimensional
Euclidean space
● Classification is done by comparing the feature vectors
of the different points
● The target function may be discrete or real-valued
9. Decision Tree Example
Each internal node corresponds to a test
Each branch corresponds to a result of the test
Each leaf node assigns a classification
12. Error estimators
● There are many ways of estimating error.
The following ones are three of them:
– Hold-out
– K-fold cross-validation
– Leave one out
13. Hold-out Method
● The hold-out method splits the data into training data
and test data (usually 2/3 for training, 1/3 for testing).
Then we build a classifier using the training data and
test it using the test data.
● Used with a large number of instances
● Needs plenty of information from each class
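The split described above can be sketched as a small Ruby helper. The dataset layout and the seeded shuffle (for reproducible runs) are illustrative assumptions:

```ruby
# Hold-out split: shuffle the instances, keep a fraction for training
# (2/3 by default) and the rest for testing.
def hold_out_split(instances, train_ratio: 2.0 / 3, seed: 42)
  shuffled = instances.shuffle(random: Random.new(seed))
  cut = (shuffled.size * train_ratio).round
  [shuffled[0...cut], shuffled[cut..]]
end
```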
14. K-Fold Cross-Validation Method
● k-fold cross-validation avoids overlapping test sets:
– Step 1: data is split into k subsets of equal size
– Step 2: each subset in turn is used for testing
and the remainder for training
● The subsets are stratified before the cross-validation
● The estimates are averaged to yield an overall
estimate
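The two steps above and the final averaging can be sketched in Ruby. The classifier is passed in as a block returning a per-fold accuracy; stratification is omitted and the data layout is an assumption:

```ruby
# k-fold cross-validation: split the data into k folds, use each fold
# once for testing and the remainder for training, then average the
# per-fold scores returned by the block.
def k_fold_cv(instances, k: 10)
  folds = instances.each_slice((instances.size / k.to_f).ceil).to_a
  scores = folds.each_index.map do |i|
    test  = folds[i]
    train = (folds[0...i] + folds[(i + 1)..]).flatten(1)
    yield(train, test) # block returns this fold's accuracy
  end
  scores.sum / scores.size.to_f
end
```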
15. Leave One Out Method
● Leave-One-Out is a particular form of cross-validation:
– the number of folds equals the number of training
instances
– e.g., for n training cases, build a classifier n
times
● Makes the best use of the data
● Very computationally expensive
16. Software Development
● Two different algorithms have been implemented:
• C4.5, which is an extension of Quinlan's ID3 algorithm
and generates a decision tree capable of
classification.
• K-Nearest Neighbor, which classifies instances based
on closest training examples in the feature space.
17. C4.5 implementation
● Entropy:
Entropy(S) = − Σ p(c) · log₂ p(c), summed over the classes c
● Information gain:
Gain(S, A) = Entropy(S) − Σ (|Sv| / |S|) · Entropy(Sv),
summed over the values v of attribute A
● Data structures:
– Training cases → vector of classes (filled iteratively):
each instance is a class
– Decision tree → vector of classes (filled recursively):
each node is a class
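The standard entropy and information-gain formulas can be sketched in Ruby; the case layout (hashes keyed by attribute name) is an illustrative assumption:

```ruby
# Entropy(S) = - sum over classes c of p(c) * log2 p(c).
def entropy(labels)
  labels.tally.values.sum do |count|
    p = count.to_f / labels.size
    -p * Math.log2(p)
  end
end

# Gain(S, A) = Entropy(S) - sum over values v of |S_v|/|S| * Entropy(S_v).
def information_gain(cases, label_key, attr_key)
  total = entropy(cases.map { |c| c[label_key] })
  remainder = cases.group_by { |c| c[attr_key] }.values.sum do |subset|
    (subset.size.to_f / cases.size) * entropy(subset.map { |c| c[label_key] })
  end
  total - remainder
end
```

C4.5 picks, at each node, the attribute with the highest gain (or gain ratio) as the next test.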
18. C4.5 implementation (II)
● Pruning techniques:
– pre-pruning: stop building a branch when the
information is not reliable.
– post-pruning: discard inefficient branches once the
decision tree has been completed.
The expected error of a node if it is pruned (N cases at the
node, n of them in the majority class, k classes):
E(S) = (N − n + k − 1) / (N + k)
The error without pruning is backed up from the children:
BackUpError(S) = Σ Pᵢ · E(Sᵢ), with Pᵢ = Nᵢ / N
So the condition is as follows:
if E(S) < BackUpError(S) then prune the node
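The pruning test "if E(S) &lt; BackUpError(S) then prune" can be sketched in Ruby. Since the slide's formulas were images, the standard Laplace expected-error estimate is assumed here:

```ruby
# Laplace expected error of a node treated as a leaf (assumed formula):
# N cases, n in the majority class, k classes.
def expected_error(n_cases, n_majority, n_classes)
  (n_cases - n_majority + n_classes - 1).to_f / (n_cases + n_classes)
end

# BackUpError(S): children's expected errors weighted by the fraction
# of the node's cases each child holds.
def backup_error(children, n_classes)
  total = children.sum { |c| c[:n] }
  children.sum do |c|
    (c[:n].to_f / total) * expected_error(c[:n], c[:majority], n_classes)
  end
end

# Prune when turning the node into a leaf is expected to err less than
# keeping its children.
def prune?(node_n, node_majority, children, n_classes)
  expected_error(node_n, node_majority, n_classes) <
    backup_error(children, n_classes)
end
```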
19. C4.5 implementation (III)
● Continuous attributes:
– Each continuous value is discretized into a nominal
value by taking into account the maximum and
minimum of its attribute.
– Moreover, three different ranges of discretization are
possible and configurable:
Two levels: [High, Low]
Three levels: [High, Middle, Low]
Four levels: [Very High, High, Low, Very Low]
– Thus the range of distinct tests is wider.
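The discretization step above can be sketched as follows; equal-width bins between the attribute's minimum and maximum are an assumption:

```ruby
# Map a continuous value to one of the configured nominal levels,
# using equal-width bins over the attribute's observed [min, max].
def discretize(value, min, max, levels)
  return levels.first if max == min
  width = (max - min).to_f / levels.size
  index = [((value - min) / width).floor, levels.size - 1].min
  levels[index]
end

TWO_LEVELS  = ['Low', 'High']
FOUR_LEVELS = ['Very Low', 'Low', 'High', 'Very High']
```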
20. K-NN implementation
● Norm:
– Each continuous attribute of each instance is
standardized so that every attribute weighs equally
in the distance.
● The distance between two instances is the sum of the
per-attribute distances.
● Distance functions:
– Minkowski
– Sokal-Michener
– Overlap
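The three distance functions named above can be sketched in Ruby. Minkowski handles continuous attributes; overlap counts mismatches of nominal attributes; Sokal-Michener is the simple matching coefficient. Vector formats are illustrative assumptions:

```ruby
# Minkowski distance of order p (p = 2 gives Euclidean distance).
def minkowski(a, b, p)
  a.zip(b).sum { |x, y| (x - y).abs**p }**(1.0 / p)
end

# Overlap distance: number of nominal attributes that differ.
def overlap(a, b)
  a.zip(b).count { |x, y| x != y }
end

# Sokal-Michener similarity: fraction of positions where both vectors
# agree; 1 - similarity can serve as a distance.
def sokal_michener(a, b)
  a.zip(b).count { |x, y| x == y }.to_f / a.size
end
```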
21. K-NN implementation (II)
● Number of neighbors (k):
– This parameter is configurable
– The most common values of k are 5, 7, 11 and 21;
nonetheless, it depends on the problem domain
– It must be odd to avoid possible ties between
classifications
● Data structures:
– Training cases → vector of classes (filled iteratively):
each instance is a class
– Distances → vector of floats (filled iteratively)
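Putting the pieces together, the k-NN vote can be sketched as below. Euclidean distance and the case layout are illustrative assumptions:

```ruby
# Classify a query by majority vote among its k nearest training cases.
def knn_classify(training, query, k: 5)
  neighbors = training.min_by(k) do |kase|
    Math.sqrt(kase[:features].zip(query).sum { |x, y| (x - y)**2 })
  end
  # Majority vote among the k nearest neighbors (odd k avoids ties).
  neighbors.map { |kase| kase[:label] }.tally.max_by { |_, n| n }.first
end
```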
26. Technologies
● Ruby
– Dynamic
– Reflective
– Imperative
– General-purpose
– Object-oriented
– Inspired by Perl and Smalltalk
● Redcar
– Full features for Ruby
– Still in development
● Ubuntu 12.04
– Best O.S. to deploy Ruby's virtual machine
– Fast
– Easy-to-use
27. Experiments
● The rating of predictions is done by calculating the
accuracy as follows:
accuracy = correct predictions / total predictions × 100
● The software is tested with:
– case bases extracted from the UCI Machine Learning
Repository.
– the error estimator K-Fold Cross-Validation (of which
Leave One Out is a particular case). The case bases
are partitioned into 10 portions: K = 10.
– 1,000 executions.
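The accuracy rating above can be written as a one-line helper:

```ruby
# Percentage of predictions that match the real classifications.
def accuracy(predictions, actual)
  correct = predictions.zip(actual).count { |p, a| p == a }
  100.0 * correct / predictions.size
end
```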
28. Hepatitis Detection Experiment
● Features of the case base:
– Source: Doctor Bojan Cestnik, Jozef Stefan Institute
– Motive: classify whether a patient suffers from hepatitis
– Number of attributes: 19
– Type of attributes: categorical, integer and real
– Number of instances: 155
– Missing values? Yes
– Number of classes: 2
– Algorithm: C4.5
– Levels of discretization: 4
– Official accuracy: ≈ 80%
29. Hepatitis Detection Experiment (II)
● Accuracy over the 1,000 executions:
➔ Average accuracy ≈ 78%, close to the official ≈ 80%
➔ Pretty good precision
30. Hepatitis Detection Experiment (III)
● Some important rules pulled out of the decision trees:
1. (ALBUMIN = Very High or Low) and (PROTIME = Very Low)
and (HISTOLOGY = No) → LIVE
2. (HISTOLOGY = No) and (PROTIME = Very High) → LIVE
3. (HISTOLOGY = Yes) and (PROTIME = High) and
(ALBUMIN = Low) → LIVE
4. (ALBUMIN = High) and (SGOT = Low) and (PROTIME = Very
Low) and (HISTOLOGY = No) → DIE
5. (ALBUMIN = Very Low) and (SGOT = Low) and
(HISTOLOGY = Yes) → DIE
31. Vehicle Shape Experiment
● Features of the case base:
– Source: Pete Mowforth and Barry Shepherd, Turing Institute
– Motive: classify a vehicle silhouette into one of four
kinds according to several characteristics
– Number of attributes: 18
– Type of attributes: integer
– Number of instances: 946
– Missing values? No
– Number of classes: 4
– Algorithm: K-Nearest Neighbor
– K: 7
– Official accuracy: none
32. Vehicle Shape Experiment (II)
● Accuracy over the 1,000 executions using Euclidean distance:
➔ Average accuracy ≈ 69-70% → not bad
➔ Results with k = 21 give higher maximums, but the
average accuracy remains the same
33. Delinquency Detection
● The rating is done similarly to the previous experiments.
● Dataset provided by a Catalan SME called Maderas Gomez
S.A.
● Error estimator: Hold-out
– 70% of dataset → training
– 30% of dataset → test
● Variable number of executions
34. Delinquency Detection (II)
● Features of the case base:
– Source: Maderas Gomez, S.A.
– Motive: label customer profiles as payment delinquents
or non-delinquents
– Number of attributes: 5
– Type of attributes: integer and float
– Number of instances: 770
– Missing values? Yes
– Number of classes: 2
– Algorithm: C4.5 and K-Nearest Neighbor
– K: 5, 11
– Levels of discretization: 2, 4
– Official accuracy: none
➔ Unfortunately, all attributes are continuous
35. Delinquency Detection (III)
● Accuracy of:
– 50 executions
– C4.5 algorithm
– 2 levels of discretization
➔ Average accuracy ≈ 95-96% → pretty high
36. Delinquency Detection (IV)
● Accuracy of:
– 100 executions
– C4.5 algorithm
– 4 levels of discretization
➔ Average accuracy ≈ 94-96%
37. Delinquency Detection (V)
● Accuracy of:
– 50 executions
– 5-Nearest Neighbor
– Euclidean distance function
➔ Average accuracy ≈ 94-95% → pretty high
38. Delinquency Detection (VI)
● Accuracy of:
– 50 executions
– 11-Nearest Neighbor
– Euclidean distance function
➔ Average accuracy ≈ 94% → a little worse than with k = 5
39. Delinquency Detection (VII)
● As for the rules pulled out of the decision tree:
1. (DIFERENCIA = Very High) and (FORMA DE PLAZO = Very
Low) and (F.P. REAL = Very Low) → DELINQUENT
2. (CONSUMIDO = Very Low) and (CONCEDIDO = Very High)
and (DIFERENCIA = Very Low) and (FORMA DE PLAZO =
Very High) and (F.P. REAL = Very Low) → DELINQUENT
40. Planning of the project
● Research - 100 h
● Designing - 80 h
● Implementation - 300 h
● Experiments - 75 h
● Report - 75 h
41. Future Of The Project
● Implement a functionality capable of drawing a Voronoi
diagram for the k-Nearest Neighbor algorithm.
● Embed the system core (KNN and Decision Tree subsystems)
into a Web environment.
● Obtain new and better information related to customers of the
same business and see whether we get more reliable results.
● Apply the software to other sorts of fields.
42. Conclusions
● Although the software works well, accuracies as high as
those shown in the delinquency prediction slides may be
suspicious. I suspect the attributes don't provide the
most suitable information.
● If Maderas Gomez S.A. wants to predict possible
delinquency more accurately, it must start to gather as
much information related to its clients as possible.
● Ruby is a very powerful programming language that can be
applied to many fields of computation, and in the coming
years it will be one of the most important.
● On a personal note, Machine Learning has drawn my
attention to the point of considering devoting my
professional career to this field.