Data Mining Methodology
          Kevin Swingler
       University of Stirling
   Lecturer, Computing Science
        kms@cs.stir.ac.uk
What is Data Mining?
• Generally, methods of using large quantities of data
  and appropriate algorithms to allow a computer to
  ‘learn’ to perform a task
• Task oriented:
   – Predict outcomes or forecast the future
   – Classify objects as belonging to one of several categories
   – Separate data into clusters of similar objects
• Most methods produce a model of the data that
  performs the task

Some Examples
• Predicting patterns of drug side-effects
• Spotting credit card or insurance fraud
• Controlling complex machinery
• Predicting the outcome of medical
  interventions
• Predicting the price of stocks and shares or
  exchange rates
• Knowing when a cow is most fertile (really!)
Examples in LIS
• Text Mining
  – Automatically determine what an article is ‘about’
  – Classify attitudes in social media
• Demand Prediction
  – Predicting demand for resources such as new books or
    journals or buildings
• Search and Recommend
  – Analysis of borrowing history to make recommendations
  – Links analysis for citation clustering


Data Sources
• In House – Data you own
  – Borrowing records
  – Search histories
  – Catalogue data
• Bought in
  – Demographic data about customers
  – Demographic data about the locality around a
    library

Methods
• Techniques for data mining are based on
  mathematics and statistics, but are
  implemented in easy-to-use software
  packages
• Where methodology matters most is in pre-
  processing the data, choosing the techniques,
  and interpreting the results


CRISP-DM Standard
• CRoss Industry Standard Process for Data
  Mining
• A six-phase cycle: business understanding,
  data understanding, data preparation,
  modelling, evaluation and deployment




Data Preparation
• Clean the data
  – Remove rows with missing values
  – Remove rows with obvious data entry errors – e.g.
    Age = 200
  – Recode obvious data entry inconsistencies – e.g. if
    Gender should be M or F but occasionally appears as Male
  – Remove rows with minority values
  – Select which variables to use in the model
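
A minimal sketch of these cleaning steps in Python with pandas; the file name and the column names (age, gender) are hypothetical stand-ins for your own data, not part of the original slides.

```python
import pandas as pd

df = pd.read_csv("records.csv")        # hypothetical file name

# Remove rows with missing values
df = df.dropna()

# Remove rows with obvious data entry errors, e.g. Age = 200
df = df[(df["age"] >= 0) & (df["age"] <= 120)]

# Recode obvious inconsistencies, e.g. 'Male' entered where 'M' is expected
df["gender"] = df["gender"].replace({"Male": "M", "Female": "F"})
```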


Data Quantity
• Choose the variables to be used for the model
• Look at the distributions of the chosen values
• Look at the level of noise in the data
• Look at the degree of linearity in the data
• Decide whether or not there are sufficient
  examples in the data
• Treat unbalanced data


Consider Error Costs
• Imagine a system that classifies input patterns
  into one of several possible categories
• Sometimes it will get things wrong; how often
  depends on the problem:
  – Direct mail targeting – very often
  – Credit risk assessment – quite often
  – Medical reasoning – very infrequently



Error Costs
• An error in one direction can cost more than
  an error in the opposite direction
  – Recommending a blood test based on a false
    positive is better than missing an infection due to
    a false negative
  – Missing a case of insurance fraud is more costly
    than flagging a claim to be double checked
• The balance of examples in each case can be
  manipulated to reflect the cost
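
Rebalancing the examples is one option; another, supported by many tools, is to weight the expensive-to-miss class more heavily during training. A sketch with scikit-learn follows; the class labels and the 10:1 weighting are illustrative assumptions, not a recommendation, and X_train, y_train stand for your prepared data.

```python
from sklearn.tree import DecisionTreeClassifier

# Penalise a missed fraud case ten times as heavily as a wrongly flagged claim
model = DecisionTreeClassifier(class_weight={"fraud": 10, "genuine": 1})
model.fit(X_train, y_train)   # X_train, y_train: your prepared features and target
```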

Check Points
• Data quantity and quality: do you have
  sufficient good data for the task?
  – How many variables are there?
  – How complex is the task?
  – Is the data’s distribution appropriate?
     • Outliers
     • Balance
     • Value set size


Distributions
• A frequency distribution is a count of how
  often each value occurs for each variable in a
  data set
• For discrete numbers and categorical values,
  this is simply a count of each value
• For continuous numbers, the count is of how
  many values fall into each of a set of sub-
  ranges

Plotting Distributions
• The easiest way to visualise a distribution is to
  plot it in a histogram:
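
A sketch of how such plots might be produced with pandas and matplotlib, assuming the cleaned DataFrame df from the earlier sketch and the same hypothetical column names.

```python
import matplotlib.pyplot as plt

# Continuous variable: histogram over a set of sub-ranges (bins)
df["age"].plot(kind="hist", bins=20)
plt.xlabel("age")
plt.ylabel("frequency")
plt.show()

# Categorical variable: a bar chart of value counts plays the same role
df["gender"].value_counts().plot(kind="bar")
plt.show()
```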




Features of a Distribution to Look For
•   Outliers
•   Minority values
•   Data Balance
•   Data entry errors




Outliers
• A small number of values that are much larger
  or much smaller than all the others
• Can disrupt the data mining process and give
  misleading results
• You should either remove them or, if they are
  important, collect more data to reflect this
  aspect of the world you are modelling
• Could be data entry errors
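
One simple (and by no means the only) way to flag outliers, sketched for a hypothetical numeric column income: treat values more than three standard deviations from the mean as candidates for inspection or removal.

```python
mean, std = df["income"].mean(), df["income"].std()
is_outlier = (df["income"] - mean).abs() > 3 * std

print(df[is_outlier])        # inspect before deciding what to do
df = df[~is_outlier]         # remove them only if they are errors or irrelevant
```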

Minority Values
• Values that only appear infrequently in the data
• Do they appear often enough to contribute to the
  model?
• Might be worth removing them from the data or
  collecting more data where they are represented
• Are they needed in the finished system?
• Could they be the result of data entry errors?



Minority Values
[Bar chart: counts of each value of a gender variable, with the categories Male, Female, M and F on the x-axis and a count scale from 0 to 600 on the y-axis.]
What does this chart tell you about the gender variable in a data set?
What should you do before modelling or mining the data?
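
A quick way to answer these questions is to count the values directly. The sketch below assumes the same hypothetical gender column and that the minority spellings are simply data entry variants of the standard ones; which spelling is the standard depends on your data.

```python
print(df["gender"].value_counts())   # how often does each spelling occur?

# If two spellings refer to the same category, map the minority spelling
# onto the standard one before modelling or mining
df["gender"] = df["gender"].replace({"Male": "M", "Female": "F"})
```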

Flat and Wide Variables
• Variables where all the values are minority values
  have a flat, wide distribution – one or two of each
  possible value
• Such variables are of little use in data mining because
  the goal of DM is to find general patterns from
  specific data
• No such patterns can exist if each data point is
  completely different
• Such variables should be excluded from a model

Data Balance
• Imagine I want to predict whether or not a
  prospective customer will respond to a mailing
  campaign
• I collect the data and feed it to a data mining
  algorithm, which learns and reports a success
  rate of 98%
• Sounds good, but when I put a new set of
  prospects through to see who to mail, what
  happens?

A Problem
• … the system predicts ‘No’ for every single
  prospect.
• With a response rate of 2% on the campaign,
  the system is right 98% of the time if it
  always says ‘No’.
• So it never chooses anybody to target in the
  campaign


A Solution
• One data pre-processing solution is to balance the number of
  examples of each target class in the output variable
• In our previous example: 50% customers and 50% non-
  customers
• That way, any gain in accuracy over 50% would certainly be
  due to patterns in the data, not the prior distribution
• This is not always easy to achieve – you might need to throw
  away a lot of data to balance the examples, or build several
  models on balanced subsets
• Not always necessary – if an event is rare because its cause is
  rare, then the problem won’t arise
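
A minimal under-sampling sketch with pandas, assuming a binary target column responded with values 'Yes' and 'No' (hypothetical names): keep every responder and draw an equally sized random sample of non-responders.

```python
yes = df[df["responded"] == "Yes"]
no = df[df["responded"] == "No"].sample(n=len(yes), random_state=0)

# Combine and shuffle to get a 50/50 training set
balanced = pd.concat([yes, no]).sample(frac=1, random_state=0)
```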


Data Quantity
• How much data do you need?
• How long is a piece of string?
• Data must be sufficient to:
  – Represent the dynamics of the system to be
    modelled
  – Cover all situations likely to be encountered when
    predictions are needed
  – Compensate for any noise in the data

Model Building
• Choose a number of techniques suitable to
  the task:
  – Neural network for prediction or classification
  – Decision tree for classification
  – Rule induction for classification
  – Bayesian network for classification
  – K-Means for clustering
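
If you are working in Python, scikit-learn provides implementations of most of these techniques; the sketch below is illustrative only. Rule induction has no direct scikit-learn equivalent, and naive Bayes is used here as a simple stand-in for a full Bayesian network classifier.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.cluster import KMeans

candidates = {
    "neural network": MLPClassifier(max_iter=1000),
    "decision tree": DecisionTreeClassifier(),
    "naive Bayes": GaussianNB(),       # stand-in for a Bayesian network classifier
}
clustering = KMeans(n_clusters=5)      # 5 clusters is an arbitrary example choice
```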



Train Models
• For each technique:
  – Run a series of experiments with different
    parameters
  – Each experiment should use around 70% of the
    data for training and the rest for testing
  – When a good solution is found, use cross
    validation (10 fold is a good choice) to verify the
    result
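
A sketch of one such experiment with scikit-learn, assuming a feature matrix X and target y have already been prepared; the 70/30 split follows the guideline above, and max_depth is just one example of a parameter to vary between experiments.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hold back roughly 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(max_depth=5)   # one parameter setting to try
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```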


Cross Validation
• Split the data into ten subsets, then train 10
  models – each one using 9 of the 10 subsets
  as training data and the remaining subset as the
  test set. The score is the average over all 10.
• This is a more accurate representation of how
  well the data may be modelled, as it reduces
  the risk of getting a lucky test set
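
With scikit-learn, 10-fold cross validation of the same kind of model is a one-liner; this sketch again assumes X and y exist.

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

scores = cross_val_score(DecisionTreeClassifier(max_depth=5), X, y, cv=10)
print("10-fold accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```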


Assess Models
• You can measure the success of your model in a
  number of ways
   – Mean Squared error – not always meaningful
   – Percentage correct for classification
   – Confusion matrix for classification

                          Output = True    Output = False
           Actual True         80                30
           Actual False        20                90
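
A sketch of how these measures might be computed with scikit-learn, using the trained model and test set from the earlier sketches.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = model.predict(X_test)
print("percentage correct:", 100 * accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))   # rows: actual class, columns: predicted class
```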

Probability Outputs
• Most classification techniques provide a score
  with the classification – either a probability or
  some other measure
• This score can be used to:
  – Allow an answer of “unsure” for cases where no
    single class has a high enough probability
  – Weight outputs to allow for the unequal cost of
    outcomes
  – Produce lift charts and ROC curves
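
A sketch of the “unsure” idea using scikit-learn’s predict_proba, assuming string class labels; the 0.8 confidence threshold is an arbitrary illustrative choice.

```python
import numpy as np

proba = model.predict_proba(X_test)               # one row of class probabilities per case
best = proba.max(axis=1)                          # confidence of the most likely class
labels = model.classes_[proba.argmax(axis=1)]     # the most likely class itself

answers = np.where(best >= 0.8, labels, "unsure") # fall back to 'unsure' below the threshold
```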

Generalisation and Over Fitting
• Most data mining models have a degree of
  complexity that can be controlled by the
  designer
• The goal is to find the degree of complexity
  that is best suited to the data
• A model that is too simple over-generalises
• A model that is too complex over-fits
• Both have an adverse effect on performance
Gen-Spec Trade Off
• Adding to the complexity of the model fits the
  training data better at the expense of higher
  test error
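
One practical way to see this trade-off, sketched with the decision tree from the earlier examples: vary a single complexity parameter and compare training accuracy with test accuracy.

```python
from sklearn.tree import DecisionTreeClassifier

for depth in range(1, 16):                        # increasing model complexity
    tree = DecisionTreeClassifier(max_depth=depth).fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))
# Training accuracy keeps rising with depth; test accuracy eventually falls off (over-fitting)
```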




Repeat or Finish
• The data mining process leaves you either
  with a model that works or with the need to
  improve it
• More data may need to be collected
• Different variables might be tried
• The process can loop several times before a
  satisfactory answer is found


Understanding and Using the Results
• The resulting model can perform the task it
  was set, so it can be embedded in an
  automated system
• Some techniques produce models that are
  human readable and allow insights into the
  structure of the data
• Some are almost impossible to extract
  knowledge from
