Development of an intelligent system predictor of delinquency profiles
What we're going to see
●   Motivation and goals
●   Review of Case-Based Reasoning
●   A few learning techniques
●   Most relevant error estimators
●   Software implementation
●   Technologies involved
●   Testing the software
●   Delinquency detection
●   Planning of the project
●   Future of the project
●   Conclusions
Motivation
●   Deliver a valuable project by taking advantage of
    recently acquired knowledge.



                           Goals
●   Develop software in Ruby, based on CBR, capable of
    predicting customer profiles involved in fraud.
●   Test the software.
●   Attempt to predict delinquency using real cases provided
    by Maderas Gomez S.A.
What is Case-Based Reasoning?
Case-Based Reasoning (CBR) is a name given to a reasoning
method that uses specific past experiences rather than a corpus
of general knowledge.
It is a form of problem solving by analogy in which a new problem
is solved by recognizing its similarity to a specific known problem,
then transferring the solution of the known problem to the new
one.
CBR systems consult their memory of previous episodes to help
address their current task, which could be:
●   planning of a meal,
●   classifying the disease of a patient,
●   designing a circuit, etc.
Case-Based Reasoning Features
   Possibly the simplest form of machine
    learning
          Training cases are simply stored
          Each case is composed of a set of
            attributes, one of which is the classification
          Previously solved experiences are used to
            resolve current cases
   May entail storing newly solved problems
    into the case base
Case-Based Reasoning Cycle
●   At the highest level of generality, a general CBR cycle
    may be described by the following four processes:
         1. RETRIEVE the most similar case or cases
         2. REUSE the information and knowledge in that
              case to solve the problem
         3. REVISE the proposed solution
         4. RETAIN the parts of this experience likely to be
              useful for future problem solving
•   A new problem is solved by retrieving one or more
    previously experienced cases, reusing the case in one
    way or another, revising the solution based on reusing a
    previous case, and retaining the new experience by
    incorporating it into the existing knowledge base (case base).
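To make the cycle concrete, the following is a minimal Ruby sketch of the four processes. It is illustrative only, not the project's actual code; CBRSystem and all method names are hypothetical.

    # Minimal sketch of the CBR cycle; all names are hypothetical.
    class CBRSystem
      def initialize(case_base)
        @case_base = case_base # array of {features: [...], solution: ...} hashes
      end

      def solve(problem)
        similar  = retrieve(problem)   # 1. RETRIEVE the most similar case
        proposed = reuse(similar)      # 2. REUSE its solution
        revised  = revise(proposed)    # 3. REVISE the proposed solution
        retain(problem, revised)       # 4. RETAIN the new experience
        revised
      end

      private

      # Nearest neighbour over the case base (squared Euclidean distance).
      def retrieve(problem)
        @case_base.min_by do |c|
          c[:features].zip(problem[:features]).sum { |x, y| (x - y)**2 }
        end
      end

      def reuse(similar)
        similar[:solution]
      end

      def revise(solution)
        solution # a domain expert or tests would adjust the solution here
      end

      def retain(problem, solution)
        @case_base << { features: problem[:features], solution: solution }
      end
    end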
CBR Common Applications
   Help-desk
   Diagnosis
Learning Techniques
   Decision Tree
            Method for approximating discrete-valued target
              functions, able to represent disjunctions (classification)
           One of the most widely used methods for inductive
             inference
           Can be represented as if-then rules
   Nearest Neighbor
           All instances correspond to points in an n-dimensional
               Euclidean space
           Classification done by comparing feature vectors of
              the different points
           Target function may be discrete or real-valued
Decision Tree Example




   Each internal node corresponds to a test
   Each branch corresponds to a result of the test
   Each leaf node assigns a classification
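As an illustration of these three points (a hedged sketch, not the project's data structures), such a tree can be modelled in Ruby as nested nodes and read off as if-then rules; the attributes and values below are invented:

    # Illustrative decision tree node; attributes and values are made up.
    Node = Struct.new(:attribute, :branches, :label) do
      def classify(instance)
        return label if label                            # leaf: assign classification
        branches[instance[attribute]].classify(instance) # internal node: apply the test
      end
    end

    # Encodes rules such as: (OUTLOOK = Sunny) and (HUMIDITY = High) => No
    tree = Node.new(:outlook, {
      "Sunny"    => Node.new(:humidity, { "High"   => Node.new(nil, nil, "No"),
                                          "Normal" => Node.new(nil, nil, "Yes") }),
      "Overcast" => Node.new(nil, nil, "Yes"),
      "Rain"     => Node.new(nil, nil, "Yes")
    })

    puts tree.classify({ outlook: "Sunny", humidity: "High" }) # => No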
1-Nearest Neighbor Example
3-Nearest Neighbor Example
Error estimators
●   There are many ways of estimating error;
    three of the most common are:
       –   Hold-out
       –   K-fold cross-validation
       –   Leave one out
Hold-out Method
●   The hold-out method splits the data into training data
    and test data (usually 2/3 for train, 1/3 for test). Then
    we build a classifier using the train data and test it
    using the test data.




●   Suitable when a large number of instances is available
●   Requires plenty of information from each class
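A minimal Ruby sketch of the method, assuming hypothetical build_classifier and accuracy helpers that stand in for the real subsystems:

    # Hold-out: 2/3 of the cases train the classifier, 1/3 tests it.
    def hold_out(cases, train_ratio = 2.0 / 3.0)
      shuffled = cases.shuffle
      cut      = (shuffled.size * train_ratio).round
      train    = shuffled[0...cut]
      test     = shuffled[cut..-1]
      classifier = build_classifier(train) # hypothetical: C4.5 or k-NN subsystem
      accuracy(classifier, test)           # fraction of test cases classified correctly
    end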
K-Fold Cross-Validation Method
●   k-fold cross-validation avoids overlapping test sets:
          – Step 1: data is split into k subsets of equal size
          – Step 2: each subset in turn is used for testing
              and the remainder for training
●   The subsets are stratified before the cross-validation
●   The estimates are averaged to yield an overall
    estimate
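A hedged sketch of the procedure, with stratification omitted for brevity and the same hypothetical helpers as before:

    # k-fold cross-validation: each fold is the test set exactly once.
    def k_fold_cv(cases, k = 10)
      folds = cases.shuffle.each_slice((cases.size / k.to_f).ceil).to_a
      scores = folds.each_index.map do |i|
        test  = folds[i]
        train = (folds[0...i] + folds[i + 1..-1]).flatten
        accuracy(build_classifier(train), test)
      end
      scores.sum / scores.size # average the k estimates into an overall one
    end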
Leave One Out Method
●   Leave-One-Out is a particular form of cross-validation:
          – set the number of folds to the number of training instances
        – e.g., for n training cases, build a classifier n
            times
●   Makes best use of the data
●   Very computationally expensive
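In terms of the previous sketch, Leave-One-Out is simply k-fold cross-validation with as many folds as there are cases:

    # Leave-One-Out: n folds of size 1, so n classifiers are built.
    loo_accuracy = k_fold_cv(cases, cases.size)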
Software Development
●   Two different algorithms have been implemented:
        •   C4.5, which is an extension of Quinlan's ID3 algorithm
              and generates a decision tree capable of
              classification.
        •   K-Nearest Neighbor, which classifies instances based
              on closest training examples in the feature space.
C4.5 implementation
●   Entropy:

         Entropy(S) = - Σ p_i · log2(p_i)

●   Information gain:

         Gain(S, A) = Entropy(S) - Σ (|S_v| / |S|) · Entropy(S_v),
         summed over v ∈ Values(A)

    (the standard ID3/C4.5 definitions; see the Ruby sketch below)
●   Data structures:
          –   Training cases → vector of objects (filled iteratively):
                each case is an instance of a class
          –   Decision tree → vector of objects (filled recursively):
                each node is an instance of a class
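A hedged Ruby sketch of the two formulas above (not the project's code); cases are assumed to be hashes whose :class key holds the classification:

    # Entropy of a set of cases, and information gain of splitting on an attribute.
    def entropy(cases)
      n = cases.size.to_f
      cases.group_by { |c| c[:class] }.values.sum do |group|
        p = group.size / n
        -p * Math.log2(p) # Entropy(S) = -sum(p_i * log2(p_i))
      end
    end

    def information_gain(cases, attribute)
      n = cases.size.to_f
      entropy(cases) - cases.group_by { |c| c[attribute] }.values.sum do |subset|
        (subset.size / n) * entropy(subset) # weighted entropy of each subset S_v
      end
    end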
C4.5 implementation (II)
●   Pruning techniques:
              pre-pruning: stop growing a branch when the available
                 information is not reliable.
              post-pruning: discard inefficient branches once the
                 decision tree has been completed.

    An error estimate is computed for each node as if it were pruned,
     E(S), and compared against the error backed up from its children
     without pruning, BackUpError(S) (see the sketch below).

    So the condition is as follows:
               if E(S) < BackUpError(S) then prune the node
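The slide's two error formulas are not reproduced above. One common choice, assumed in this hedged sketch, is the Laplace estimate E(S) = (N - n + k - 1) / (N + k) for a node covering N cases with n in the majority class and k classes, with BackUpError(S) taken as the case-weighted average of the children's estimates:

    # Laplace error estimate of a node if it were pruned to a leaf (assumed formula).
    def laplace_error(n_cases, n_majority, k_classes)
      (n_cases - n_majority + k_classes - 1).to_f / (n_cases + k_classes)
    end

    # Backed-up error: case-weighted average of the children's error estimates.
    def backed_up_error(children) # children: array of [n_cases, error] pairs
      total = children.sum { |n, _| n }.to_f
      children.sum { |n, error| (n / total) * error }
    end

    # if E(S) < BackUpError(S) then prune the node:
    prune = laplace_error(20, 15, 2) < backed_up_error([[12, 0.15], [8, 0.45]])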
C4.5 implementation (III)
●   Continuous attributes:
          •   Each continuous value is discretized into a nominal value,
                taking into account the maximum and minimum of its
                attribute.
          •   Three different discretization granularities are possible
               and configurable (see the sketch below):
                     Two levels: [High, Low]
                     Three levels: [High, Middle, Low]
                     Four levels: [Very High, High, Low, Very Low]
          •   Thus the range of distinct tests is wider.
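A hedged sketch of the discretization just described, assuming equal-width intervals between an attribute's minimum and maximum:

    # Discretize a continuous value into equal-width nominal levels.
    LEVELS_4 = ["Very Low", "Low", "High", "Very High"].freeze

    def discretize(value, min, max, levels = LEVELS_4)
      return levels.first if max == min
      width = (max - min).to_f / levels.size
      index = [((value - min) / width).floor, levels.size - 1].min
      levels[index]
    end

    discretize(4.1, 2.0, 6.0) # => "High"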
K-NN implementation
●   Normalization:
          –   Each continuous attribute of each instance is rescaled to
                a common range (see the sketch below).
●   The overall distance is the sum of the per-attribute distances.
●   Distance functions:
          –   Minkowski
          –   Sokal-Michener
          –   Overlap
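Since the slide's rescaling formula is not reproduced above, the sketch below assumes min-max rescaling (one plausible reading) and shows two of the named distance functions:

    # Min-max rescaling of a continuous attribute into [0, 1] (assumed).
    def normalize(value, min, max)
      max == min ? 0.0 : (value - min).to_f / (max - min)
    end

    # Minkowski distance of order p over numeric vectors (p = 2 is Euclidean).
    def minkowski(a, b, p = 2)
      a.zip(b).sum { |x, y| (x - y).abs**p }**(1.0 / p)
    end

    # Overlap distance over nominal vectors: number of mismatching attributes.
    def overlap(a, b)
      a.zip(b).count { |x, y| x != y }
    end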
K-NN implementation (II)
●   Number of neighbors (k):
          –   This parameter is configurable.
          –   The most common values of k are 5, 7, 11 and 21, although
                the best choice depends on the problem domain.
          –   It must be odd to avoid ties between class votes.
●   Data structures:
          –   Training cases → vector of objects (filled iteratively):
                each case is an instance of a class
          –   Distances → vector of floats (filled iteratively)
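Putting the pieces together, a hedged sketch of the classification step, reusing the minkowski function from the previous sketch: take the k nearest training cases and return the majority class (an odd k avoids ties in a two-class problem).

    # Classify a query by majority vote among its k nearest training cases.
    def knn_classify(training_cases, query_features, k = 5)
      nearest = training_cases.min_by(k) { |c| minkowski(c[:features], query_features) }
      nearest.group_by { |c| c[:class] }    # tally the k neighbours' classes
             .max_by { |_, group| group.size }
             .first                         # return the most frequent class
    end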
System Schema

    [System diagram; its stats file stores one "index;value" pair per line:]

                  Stats File
                0;value0
                1;value1
                2;value2
                3;value3
                ...
                N;valueN
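As a hedged example of consuming that format (the file name is hypothetical), each line splits on the semicolon:

    # Read an "index;value" stats file into a hash keyed by index.
    stats = File.readlines("stats.txt").map do |line|
      index, value = line.strip.split(";")
      [index.to_i, value]
    end.to_h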
Project File Structure

    [Diagram: the project description, the input files, the K-Nearest
    Neighbor subsystem, the Decision Tree subsystem, and the file from
    which the system starts up.]
Subsystem KNN Class Diagram
Subsystem DT Class Diagram
Technologies
●   Ruby
       –   Dynamic
       –   Reflective
       –   Imperative
       –   General-purpose
       –   Object-oriented
       –   Inspired by Perl and Smalltalk
●   Redcar
       –   Full feature set for Ruby
       –   Still under development
●   Ubuntu 12.04
       –   Best O.S. to deploy Ruby's virtual machine
       –   Fast
       –   Easy to use
Experiments
●   Predictions are rated by calculating the accuracy as follows:

         accuracy = (correctly classified cases / total test cases) × 100
●   The software is tested with:
          –   case bases extracted from the UCI Machine Learning
               Repository.
          –   the error estimator Leave One Out, a particular case of
                K-Fold Cross-Validation; here the case bases are
                partitioned into 10 portions (K = 10).
          –   1,000 executions.
Hepatitis Detection Experiment
●   Features of the case base:

Source                     Doctor Bojan Cestnik of the Jozef Stefan Institute

Motive                     Classify whether a patient suffers from hepatitis
Number of attributes       19
Type of attributes         Categorical, integer and real
Number of instances        155
Missing values?            Yes
Number of classes          2
Algorithm                  C4.5
Levels of discretization   4
Official accuracy          ≈ 80%
Hepatitis Detection Experiment (II)
●   Accuracy over the 1,000 executions:

    [Plot of per-execution accuracy, with maxima and minima marked]

➔   Average accuracy ≈ 78%, close to the official ≈ 80%
         ➔   Pretty good accuracy
Hepatitis Detection Experiment (III)
●   Some important rules pulled out of the decision trees:

      1. (ALBUMIN = Very High or Low) and (PROTIME = Very Low)
         and (HISTOLOGY = No) → LIVE
      2. (HISTOLOGY = No) and (PROTIME = Very High) → LIVE
      3. (HISTOLOGY = Yes) and (PROTIME = High)
         and (ALBUMIN = Low) → LIVE
      4. (ALBUMIN = High) and (SGOT = Low) and (PROTIME = Very Low)
         and (HISTOLOGY = No) → DIE
      5. (ALBUMIN = Very Low) and (SGOT = Low)
         and (HISTOLOGY = Yes) → DIE
Vehicle Shape Experiment
●   Features of the case base:

Source                  Pete Mowforth and Barry Shepherd of the Turing
                        Institute
Motive                  Classify a vehicle silhouette into four different
                        kinds according to several characteristics
Number of attributes    18
Type of attributes      Integer
Number of instances     946
Missing values?         No
Number of classes       4
Algorithm               K-Nearest Neighbor
K                       7
Official accuracy       None
Vehicle Shape Experiment (II)
●   Accuracy over the 1,000 executions using the Euclidean distance:

    [Plot of per-execution accuracy: the values are pretty high, with
    maxima and minima marked]

➔   Average accuracy ≈ 69-70% → not bad
         ➔   Runs with k = 21 reach higher maxima, but the average
               accuracy stays the same
Delinquency Detection
●   Predictions are rated in the same way as in the previous
    experiments (accuracy).
●   Dataset provided by a Catalan SME called Maderas Gomez S.A.
●   Error estimator: Hold-out
         –   70% of dataset → Training
         –   30% of dataset → Test
●   Variable amount of executions
Delinquency Detection (II)
●   Features of the case base:

Source                     Maderas Gomez, S.A.
Motive                     Label customer profiles as payment delinquents
                           or non-delinquents
Number of attributes       5
Type of attributes         Integer and float
Number of instances        770
Missing values?            Yes
Number of classes          2
Algorithm                  C4.5 and K-Nearest Neighbor
K                          5, 11
Levels of discretization   2, 4
Official accuracy          None
➔   Unfortunately all attributes are continuous
Delinquency Detection (III)
●   Accuracy of:
         •   50 executions
         •   C4.5 algorithm
         •   2 levels of discretization

    [Plot of per-execution accuracy: the values are pretty high, with
    maxima and minima marked]

➔   Average accuracy ≈ 95-96%
Delinquency Detection (IV)
●   Accuracy of:
         •   100 executions
         •   C4.5 algorithm
         •   4 levels of discretization

    [Plot of per-execution accuracy, with maxima and minima marked]

➔   Average accuracy ≈ 94-96%
Delinquency Detection (V)
●   Accuracy of:
         •   50 executions
         •   5-Nearest Neighbor
         •   Euclidean distance function

    [Plot of per-execution accuracy: the values are pretty high, with
    maxima and minima marked]

➔   Average accuracy ≈ 94-95%
Delinquency Detection (VI)
●   Accuracy of:
         •   50 executions
         •   11-Nearest Neighbor
         •   Euclidean distance function

    [Plot of per-execution accuracy, with maxima and minima marked]

➔   Average accuracy ≈ 94% → a little worse than with k = 5
Delinquency Detection (VII)
●   As for the rules extracted from the decision tree:

      1. (DIFERENCIA = Very High) and (FORMA DE PLAZO = Very Low)
         and (F.P. REAL = Very Low) → DELINQUENT
      2. (CONSUMIDO = Very Low) and (CONCEDIDO = Very High)
         and (DIFERENCIA = Very Low) and (FORMA DE PLAZO = Very High)
         and (F.P. REAL = Very Low) → DELINQUENT
Planning of the project

    [Chart of hours per phase:]

                     Research - 100 h
                     Designing - 80 h
                     Implementation - 300 h
                     Experiments - 75 h
                     Report - 75 h
Future Of The Project
●   Implement a functionality capable of drawing a Voronoi diagram
    for the k-Nearest Neighbor algorithm.
●   Embed the system core (the KNN and Decision Tree subsystems)
    into a Web environment.
●   Obtain new and better information about customers of the same
    business and see whether the results become more reliable.
●   Apply the software to other sorts of fields.
Conclusions
●   Although the software works well, accuracies as high as those
    shown in the delinquency prediction slides are suspicious. I
    suspect the attributes don't provide the most suitable information.
●   If Maderas Gomez S.A. wants to predict possible delinquency more
    accurately, it must start gathering as much information about its
    clients as possible.
●   Ruby is a very powerful programming language that can be applied
    to many of the fields that computation touches, and in the coming
    years it will be one of the most important.
●   On a personal note, Machine Learning has caught my attention to
    the point that I am considering devoting my professional career
    to the field.

  • 42. Conclusions ● Despite knowing that the software works good, it may be suspicious of getting accuracies as high as the last ones shown along delinquency prediction slides. I suspect the attributes don't provide the most suitable information. ● If Maderas Gomez S.A. wants to try to predict possible delinquency more accurately then must start to gather as much information related to the clients as possible. ● Ruby is a very powerful programming language which can be extrapolated to many fields that computation touches and in the next years it will be one of the most important. ● As a personal point, Machine Learning has drawn my attention to even devoting my professional career in such field.