Data Mining Methodology
          Kevin Swingler
       University of Stirling
   Lecturer, Computing Science
        kms@cs.stir.ac.uk
What is Data Mining?
• Generally, methods of using large quantities of data
  and appropriate algorithms to allow a computer to
  ‘learn’ to perform a task
• Task oriented:
   – Predict outcomes or forecast the future
   – Classify objects as belonging to one of several categories
   – Separate data into clusters of similar objects
• Most methods produce a model of the data that
  performs the task

Some Examples
• Predicting patterns of drug side-effects
• Spotting credit card or insurance fraud
• Controlling complex machinery
• Predicting the outcome of medical
  interventions
• Predicting the price of stocks and shares or
  exchange rates
• Knowing when a cow is most fertile (really!)
Examples in LIS
• Text Mining
  – Automatically determine what an article is ‘about’
  – Classify attitudes in social media
• Demand Prediction
  – Predicting demand for resources such as new books or
    journals or buildings
• Search and Recommend
  – Analysis of borrowing history to make recommendations
  – Links analysis for citation clustering


Data Sources
• In House – Data you own
  – Borrow records
  – Search histories
  – Catalogue data
• Bought in
  – Demographic data about customers
  – Demographic data about the locality around a
    library

Methods
• Techniques for data mining are based on
  mathematics and statistics, but are
  implemented in easy-to-use software
  packages
• Methodology matters most in pre-processing
  the data, choosing the techniques, and
  interpreting the results


CRISP-DM Standard
• CRoss-Industry Standard Process for Data
  Mining




Data Preparation
• Clean the data
  – Remove rows with missing values
  – Remove rows with obvious data entry errors – e.g.
    Age = 200
  – Recode obvious data entry inconsistencies – e.g.
    Gender is usually M or F, but occasionally Male
  – Remove rows with minority values
  – Select which variables to use in the model
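
The cleaning steps above can be sketched in plain Python. This is only an illustration: the field names (`age`, `gender`), the age limit of 120 and the recoding table are assumptions made for the example, not part of the original slides.

```python
# Hypothetical rows; None marks a missing value.
rows = [
    {"age": 34, "gender": "M"},
    {"age": 200, "gender": "F"},      # obvious data entry error
    {"age": None, "gender": "F"},     # missing value
    {"age": 51, "gender": "Male"},    # inconsistent coding
    {"age": 27, "gender": "Female"},
]

# Assumed recoding table that maps inconsistent spellings onto one code.
GENDER_CODES = {"M": "M", "Male": "M", "F": "F", "Female": "F"}
MAX_PLAUSIBLE_AGE = 120  # assumed limit for this example


def clean(rows):
    """Drop rows with missing or implausible values and recode gender."""
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # remove rows with missing values
        if not 0 <= row["age"] <= MAX_PLAUSIBLE_AGE:
            continue  # remove rows with obvious entry errors
        if row["gender"] not in GENDER_CODES:
            continue  # unknown code: treated here as an entry error
        cleaned.append({**row, "gender": GENDER_CODES[row["gender"]]})
    return cleaned


cleaned_rows = clean(rows)
print(cleaned_rows)
```

Dropping rows is the simplest treatment; whether it is acceptable depends on how much data is lost in the process.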


Data Quantity
• Choose the variables to be used for the model
• Look at the distributions of the chosen values
• Look at the level of noise in the data
• Look at the degree of linearity in the data
• Decide whether or not there are sufficient
  examples in the data
• Treat unbalanced data


Consider Error Costs
• Imagine a system that classifies input patterns
  into one of several possible categories
• Sometimes it will get things wrong; how often
  depends on the problem:
  – Direct mail targeting – very often
  – Credit risk assessment – quite often
  – Medical reasoning – very infrequently



Error Costs
• An error in one direction can cost more than
  an error in the opposite direction
  – Recommending a blood test based on a false
    positive is better than missing an infection due to
    a false negative
  – Missing a case of insurance fraud is more costly
    than flagging a claim to be double checked
• The balance of examples in each case can be
  manipulated to reflect the cost
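
One way to make unequal error costs concrete is to attach a cost to each kind of error and compare models by total cost rather than by error count. The costs and error counts below are invented for illustration only.

```python
# Assumed costs per error (arbitrary units): a missed fraud case is taken
# to be 50 times as costly as an unnecessary double check.
COST_FALSE_NEGATIVE = 50.0
COST_FALSE_POSITIVE = 1.0


def total_cost(false_positives, false_negatives):
    """Total cost of a model's errors under the assumed cost table."""
    return (false_positives * COST_FALSE_POSITIVE
            + false_negatives * COST_FALSE_NEGATIVE)


# Hypothetical models: A makes fewer errors overall, B misses fewer frauds.
cost_a = total_cost(false_positives=10, false_negatives=8)   # 18 errors
cost_b = total_cost(false_positives=60, false_negatives=2)   # 62 errors
print(cost_a, cost_b)
```

Under these assumed costs, model B is cheaper even though it makes more than three times as many errors.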

Check Points
• Data quantity and quality: do you have
  sufficient good data for the task?
  – How many variables are there?
  – How complex is the task?
  – Is the data’s distribution appropriate?
     • Outliers
     • Balance
     • Value set size


Distributions
• A frequency distribution is a count of how
  often each value of a variable occurs in a
  data set
• For discrete numbers and categorical values,
  this is simply a count of each value
• For continuous numbers, the count is of how
  many values fall into each of a set of
  sub-ranges
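
Both kinds of count can be sketched with the standard library. The function names and the choice of equal-width sub-ranges are assumptions for this example.

```python
from collections import Counter


def categorical_counts(values):
    """Count how often each discrete or categorical value occurs."""
    return Counter(values)


def binned_counts(values, low, high, n_bins):
    """Count how many continuous values fall into each equal-width sub-range."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if low <= v <= high:
            # A value exactly on the top edge is placed in the last bin.
            index = min(int((v - low) / width), n_bins - 1)
            counts[index] += 1
    return counts


print(categorical_counts(["M", "F", "F", "M", "F"]))
print(binned_counts([1.0, 2.5, 3.0, 7.2, 9.9, 10.0], low=0, high=10, n_bins=5))
```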

Plotting Distributions
• The easiest way to visualise a distribution is to
  plot it in a histogram:
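
When no plotting package is to hand, a rough text histogram is often enough to spot problems. The sketch below assumes the counts have already been computed; the function name and the data are illustrative.

```python
def text_histogram(counts, width=40):
    """Return one line per value, with a bar scaled to the largest count."""
    largest = max(counts.values())
    lines = []
    for label, count in counts.items():
        bar = "#" * round(width * count / largest)
        lines.append(f"{label:>8} | {bar} {count}")
    return lines


# Hypothetical counts for a categorical variable.
for line in text_histogram({"A": 20, "B": 10, "C": 5}, width=20):
    print(line)
```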




Features of a Distribution to Look For
• Outliers
• Minority values
• Data Balance
• Data entry errors




Outliers
• A small number of values that are much larger
  or much smaller than all the others
• Can disrupt the data mining process and give
  misleading results
• You should either remove them or, if they are
  important, collect more data to reflect this
  aspect of the world you are modelling
• Could be data entry errors
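
The slides do not prescribe a detection rule. As one common option, the sketch below flags values lying more than 1.5 interquartile ranges outside the quartiles; the rule, the multiplier and the sample ages are assumptions for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


ages = [23, 25, 31, 35, 38, 41, 44, 52, 60, 200]
print(iqr_outliers(ages))
```

A flagged value still needs a human decision: it may be an entry error to remove, or a rare but real case that calls for more data.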

Minority Values
• Values that only appear infrequently in the data
• Do they appear often enough to contribute to the
  model?
• Might be worth removing them from the data or
  collecting more data where they are represented
• Are they needed in the finished system?
• Could they be the result of data entry errors?
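
A simple way to list candidate minority values is to flag anything whose share of the rows falls below a threshold. The 5% threshold and the sample column are assumptions chosen for this example.

```python
from collections import Counter


def minority_values(values, min_share=0.05):
    """Return values whose share of the data falls below min_share."""
    counts = Counter(values)
    total = len(values)
    return {v: c for v, c in counts.items() if c / total < min_share}


# Hypothetical gender column with inconsistent coding.
gender = ["M"] * 480 + ["F"] * 500 + ["Male"] * 12 + ["Female"] * 8
print(minority_values(gender))
```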



Minority Values
[Bar chart: number of records (vertical axis, 0 to 600) for each recorded value of the gender variable: Male, Female, M, F]

What does this chart tell you about the gender variable in a data set?
What should you do before modelling or mining the data?

Flat and Wide Variables
• Variables where all the values are minority values
  have a flat, wide distribution – one or two of each
  possible value
• Such variables are of little use in data mining because
  the goal of DM is to find general patterns from
  specific data
• No such patterns can exist if each data point is
  completely different
• Such variables should be excluded from a model
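
A rough check for such variables is to see whether any value repeats more than once or twice. The function name, the limit of two repeats and the sample columns are assumptions for illustration.

```python
def is_flat_and_wide(values, max_repeats=2):
    """True if no value occurs more than max_repeats times."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return max(counts.values()) <= max_repeats


# Hypothetical columns: a near-unique identifier and a repeating category.
borrower_ids = ["B001", "B002", "B003", "B004", "B005", "B006"]
branch = ["North", "South", "North", "North", "South", "North"]
print(is_flat_and_wide(borrower_ids), is_flat_and_wide(branch))
```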

Data Balance
• Imagine I want to predict whether or not a
  prospective customer will respond to a mailing
  campaign
• I collect the data, put it into a data mining
  algorithm, which learns and reports a success
  rate of 98%
• Sounds good, but when I put a new set of
  prospects through to see who to mail, what
  happens?

A Problem
• … the system predicts ‘No’ for every single
  prospect.
• With a campaign response rate of 2%, the
  system is right 98% of the time if it
  always says ‘No’.
• So it never chooses anybody to target in the
  campaign
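
The arithmetic is easy to check with a made-up set of 1,000 prospects and a 2% response rate (both figures are assumptions matching the example in the slide).

```python
# 1,000 hypothetical prospects with a 2% response rate.
actual = ["Yes"] * 20 + ["No"] * 980

# A "model" that has learned nothing except the majority class.
predicted = ["No"] * len(actual)

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
responders_found = sum(
    a == "Yes" and p == "Yes" for a, p in zip(actual, predicted)
)
print(accuracy, responders_found)
```

High accuracy and zero responders found: the headline figure says nothing about the class that matters.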


A Solution
• One data pre-processing solution is to balance the number of
  examples of each target class in the output variable
• In our previous example: 50% customers and 50%
  non-customers
• That way, any gain in accuracy over 50% would certainly be
  due to patterns in the data, not the prior distribution
• This is not always easy to achieve – you might need to throw
  away a lot of data to balance the examples, or build several
  models on balanced subsets
• Not always necessary – if an event is rare because its cause is
  rare, then the problem won’t arise
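
Balancing by throwing data away (undersampling) might be sketched as follows. The row layout, the `responded` field and the fixed random seed are assumptions for the example.

```python
import random


def undersample(rows, target_key, seed=0):
    """Randomly discard rows so every target class is the size of the smallest."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[target_key], []).append(row)
    smallest = min(len(group) for group in by_class.values())
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Hypothetical mailing data: 2% responders.
rows = [{"id": i, "responded": i < 20} for i in range(1000)]
balanced = undersample(rows, "responded")
print(len(balanced), sum(r["responded"] for r in balanced))
```

Calling the function with different seeds gives different balanced subsets, which is one way to build several models as the slide suggests.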


Data Quantity
• How much data do you need?
• How long is a piece of string?
• Data must be sufficient to:
  – Represent the dynamics of the system to be
    modelled
  – Cover all situations likely to be encountered when
    predictions are needed
  – Compensate for any noise in the data

Model Building
• Choose a number of techniques suitable to
  the task:
  – Neural network for prediction or classification
  – Decision tree for classification
  – Rule induction for classification
  – Bayesian network for classification
  – K-Means for clustering



Train Models
• For each technique:
  – Run a series of experiments with different
    parameters
  – Each experiment should use around 70% of the
    data for training and the rest for testing
  – When a good solution is found, use cross
    validation (10 fold is a good choice) to verify the
    result
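
A 70/30 split can be sketched with the standard library alone. The function name and the fixed seed are assumptions; the rows are shuffled first so that any ordering in the source data does not leak into the split.

```python
import random


def train_test_split(rows, train_share=0.7, seed=0):
    """Shuffle a copy of the rows and split it into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


train, test = train_test_split(range(100))
print(len(train), len(test))
```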


Cross Validation
• Split the data into ten subsets, then train 10
  models – each one using 9 of the 10 subsets
  as training data and the 10th as test. The score
  is the average of all 10.
• This is a more accurate representation of how
  well the data may be modelled, as it reduces
  the risk of getting a lucky test set
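
The procedure can be sketched as below. The `train_and_score` callback stands in for whatever technique is being evaluated and is hypothetical, as is the toy scoring function; the sketch also assumes the rows have already been shuffled.

```python
def k_fold_indices(n_rows, k=10):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_rows))
    folds = [indices[i::k] for i in range(k)]  # k roughly equal subsets
    for i, test_fold in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test_fold


def cross_validate(rows, train_and_score, k=10):
    """Average the test score over k folds.

    train_and_score(train_rows, test_rows) is a caller-supplied function
    (hypothetical here) that fits a model and returns its test score.
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(rows), k):
        train_rows = [rows[j] for j in train_idx]
        test_rows = [rows[j] for j in test_idx]
        scores.append(train_and_score(train_rows, test_rows))
    return sum(scores) / len(scores)


# Toy scorer: "predict" the training mean, score by mean absolute error.
def mean_baseline(train_rows, test_rows):
    mean = sum(train_rows) / len(train_rows)
    return sum(abs(x - mean) for x in test_rows) / len(test_rows)


print(cross_validate(list(range(20)), mean_baseline, k=10))
```

Every row is used for testing exactly once, which is why a single lucky or unlucky test set has less influence on the final score.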


Assess Models
• You can measure the success of your model in a
  number of ways
   – Mean Squared error – not always meaningful
   – Percentage correct for classification
   – Confusion matrix for classification

                          Output = True   Output = False
      Actual = True            80               30
      Actual = False           20               90
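
Several summary figures can be read off the matrix above. The sketch assumes that rows are the actual class and columns are the model's output, which the slide does not state explicitly; precision and recall are added here as common companions to percentage correct.

```python
# Counts from the matrix above, ASSUMING rows are the actual class and
# columns are the model's output.
true_pos, false_neg = 80, 30   # actual True
false_pos, true_neg = 20, 90   # actual False

total = true_pos + false_neg + false_pos + true_neg
accuracy = (true_pos + true_neg) / total          # percentage correct
precision = true_pos / (true_pos + false_pos)     # share of "True" outputs that are right
recall = true_pos / (true_pos + false_neg)        # share of actual True cases found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```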

Probability Outputs
• Most classification techniques provide a score
  with the classification – either a probability or
  some other measure
• This can be used to:
  – Allow an answer of “unsure” for cases where no
    single class has a high enough probability
  – Weight outputs to allow for unequal costs of
    outcomes
  – Plot lift charts and ROC curves
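
The "unsure" answer might be implemented as a simple threshold on the winning score. The function name, the class labels and the 0.7 threshold are assumptions for this sketch.

```python
def classify(probabilities, min_confidence=0.7):
    """Return the most probable class, or "unsure" if none is confident enough.

    probabilities maps class name -> model score in [0, 1]; the 0.7
    threshold is an assumed value to be tuned per application.
    """
    best_class = max(probabilities, key=probabilities.get)
    if probabilities[best_class] < min_confidence:
        return "unsure"
    return best_class


print(classify({"fraud": 0.92, "genuine": 0.08}))
print(classify({"fraud": 0.55, "genuine": 0.45}))
```

Cases returned as "unsure" can be passed to a person for review, which is often cheaper than acting on a weak prediction.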

Generalisation and Over Fitting
• Most data mining models have a degree of
  complexity that can be controlled by the
  designer
• The goal is to find the degree of complexity
  that is best suited to the data
• A model that is too simple over generalises
• A model that is too complex over fits
• Both have an adverse effect on performance
Gen-Spec Trade Off
• Adding to the complexity of the model fits the
  training data better but, beyond a certain point,
  at the expense of higher test error
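
The trade-off can be explored with a small experiment. The sketch below uses a k-nearest-neighbour classifier, which is not one of the techniques named in these slides, because its complexity is controlled by a single number: the smaller k is, the more complex the model. The synthetic data, the 20% label noise and the seed are all assumptions.

```python
import random


def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def error_rate(train, data, k):
    """Share of points in data that the k-NN model gets wrong."""
    wrong = sum(knn_predict(train, x, k) != label for x, label in data)
    return wrong / len(data)


def make_data(n, rng, noise=0.2):
    """Label is 1 when x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label
        data.append((x, label))
    return data


rng = random.Random(0)
train, test = make_data(200, rng), make_data(200, rng)
for k in (1, 5, 25):  # smaller k means a more complex model
    print(k, error_rate(train, train, k), error_rate(train, test, k))
```

With k = 1 the model memorises the training set, noise included, so its training error is zero; comparing the two printed columns for each k shows how far training error can be trusted as a guide to test error.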




Repeat or Finish
• The result of the data mining will leave you
  with either a model that works or the need to
  improve
• More data may need to be collected
• Different variables might be tried
• The process can loop several times before a
  satisfactory answer is found


Understanding and Using the Results
• The resulting model has the ability to perform
  the task it was set, so it can be embedded in an
  automated system
• Some techniques produce models that are
  human readable and allow insights into the
  structure of the data
• Some are almost impossible to extract
  knowledge from
