Presentation to the third LIS DREaM workshop, held at Edinburgh Napier University on Wednesday 25th April 2012.
More information about the event can be found at http://lisresearch.org/dream-project/dream-event-4-workshop-wednesday-25-april-2012/
1. Data Mining Methodology
Kevin Swingler
University of Stirling
Lecturer, Computing Science
kms@cs.stir.ac.uk
2. What is Data Mining?
• Generally, methods of using large quantities of data
and appropriate algorithms to allow a computer to
‘learn’ to perform a task
• Task oriented:
– Predict outcomes or forecast the future
– Classify objects as belonging to one of several categories
– Separate data into clusters of similar objects
• Most methods produce a model of the data that
performs the task
3. Some Examples
• Predicting patterns of drug side-effects
• Spotting credit card or insurance fraud
• Controlling complex machinery
• Predicting the outcome of medical
interventions
• Predicting the price of stocks and shares or
exchange rates
• Knowing when a cow is most fertile (really!)
4. Examples in LIS
• Text Mining
– Automatically determine what an article is ‘about’
– Classify attitudes in social media
• Demand Prediction
– Predicting demand for resources such as new books or
journals or buildings
• Search and Recommend
– Analysis of borrowing history to make recommendations
– Links analysis for citation clustering
5. Data Sources
• In House – Data you own
– Borrow records
– Search histories
– Catalogue data
• Bought in
– Demographic data about customers
– Demographic data about the locality around a
library
6. Methods
• Techniques for data mining are based on
mathematics and statistics, but are
implemented in easy-to-use software
packages
• Methodology matters most in pre-processing
the data, choosing the techniques, and
interpreting the results
8. Data Preparation
• Clean the data
– Remove rows with missing values
– Remove rows with obvious data entry errors – e.g.
Age = 200
– Recode obvious data entry inconsistencies – e.g.
Gender is usually M or F, but occasionally Male
– Remove rows with minority values
– Select which variables to use in the model
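The cleaning steps above can be sketched in a few lines of Python. This is only an illustration: the field names (`age`, `gender`), the plausible age range and the recoding table are assumptions made for the example, not part of any particular data set.

```python
# Hypothetical records; the field names "age" and "gender" are illustrative only.
rows = [
    {"age": 34, "gender": "M"},
    {"age": 200, "gender": "F"},     # obvious data entry error
    {"age": None, "gender": "F"},    # missing value
    {"age": 51, "gender": "Male"},   # inconsistent coding
    {"age": 27, "gender": "F"},
]

# Assumed recoding table and plausible age range.
GENDER_CODES = {"M": "M", "F": "F", "Male": "M", "Female": "F"}
MIN_AGE, MAX_AGE = 0, 120


def clean(rows):
    """Drop rows with missing or implausible values and recode gender."""
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # remove rows with missing values
        if not MIN_AGE <= row["age"] <= MAX_AGE:
            continue  # remove rows with obvious data entry errors
        if row["gender"] not in GENDER_CODES:
            continue  # unknown code that cannot be recoded
        cleaned.append({**row, "gender": GENDER_CODES[row["gender"]]})
    return cleaned


cleaned_rows = clean(rows)
print(cleaned_rows)
```

Three of the five rows survive, and the "Male" entry is recoded to "M" so that the variable has only two values.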
9. Data Quantity
• Choose the variables to be used for the model
• Look at the distributions of the chosen values
• Look at the level of noise in the data
• Look at the degree of linearity in the data
• Decide whether or not there are sufficient
examples in the data
• Treat unbalanced data
10. Consider Error Costs
• Imagine a system that classifies input patterns
into one of several possible categories
• Sometimes it will get things wrong; how often
depends on the problem:
– Direct mail targeting – very often
– Credit risk assessment – quite often
– Medical reasoning – very infrequently
11. Error Costs
• An error in one direction can cost more than
an error in the opposite direction
– Recommending a blood test based on a false
positive is better than missing an infection due to
a false negative
– Missing a case of insurance fraud is more costly
than flagging a claim to be double checked
• The balance of examples in each case can be
manipulated to reflect the cost
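As a rough illustration of unequal error costs, the sketch below totals the cost of a classifier's mistakes. The cost figures (1 per false positive, 20 per false negative) and the error counts are invented for the example.

```python
def total_error_cost(false_positives, false_negatives, cost_fp, cost_fn):
    """Total cost of a model's mistakes when the two kinds of error differ in cost."""
    return false_positives * cost_fp + false_negatives * cost_fn


# Two hypothetical models that each make 50 errors in total.
# Assumed costs: a false positive costs 1 unit, a false negative costs 20.
cost_a = total_error_cost(false_positives=40, false_negatives=10, cost_fp=1, cost_fn=20)
cost_b = total_error_cost(false_positives=10, false_negatives=40, cost_fp=1, cost_fn=20)

print(cost_a)  # 240
print(cost_b)  # 810
```

Both models are wrong equally often, but the one that makes fewer of the expensive errors is much cheaper to use.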
12. Check Points
• Data quantity and quality: do you have
sufficient good data for the task?
– How many variables are there?
– How complex is the task?
– Is the data’s distribution appropriate?
• Outliers
• Balance
• Value set size
13. Distributions
• A frequency distribution is a count of how
often each value of a variable occurs in a
data set
• For discrete numbers and categorical values,
this is simply a count of each value
• For continuous numbers, the count is of how
many values fall into each of a set of sub-
ranges
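Both kinds of count can be sketched with Python's standard library. The sample values and the bin width of 10 are assumptions chosen for illustration.

```python
from collections import Counter


def categorical_distribution(values):
    """Count how often each discrete or categorical value occurs."""
    return Counter(values)


def binned_distribution(values, bin_width):
    """Count how many values fall into each sub-range [lower, lower + bin_width)."""
    counts = Counter()
    for value in values:
        lower = (value // bin_width) * bin_width
        counts[lower] += 1
    return dict(sorted(counts.items()))


# Hypothetical sample data.
genders = ["M", "F", "F", "M", "Male", "F"]
ages = [23, 27, 31, 35, 38, 42, 47, 200]

gender_counts = categorical_distribution(genders)
age_counts = binned_distribution(ages, 10)

print(gender_counts)  # Counter({'F': 3, 'M': 2, 'Male': 1})
print(age_counts)     # {20: 2, 30: 3, 40: 2, 200: 1}
```

Even in this tiny sample the counts expose a minority value ("Male") and an isolated sub-range (200) that deserve a closer look.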
15. Features of a Distribution
to Look For
• Outliers
• Minority values
• Data Balance
• Data entry errors
16. Outliers
• A small number of values that are much larger
or much smaller than all the others
• Can disrupt the data mining process and give
misleading results
• You should either remove them or, if they are
important, collect more data to reflect this
aspect of the world you are modelling
• Could be data entry errors
17. Minority Values
• Values that only appear infrequently in the data
• Do they appear often enough to contribute to the
model?
• Might be worth removing them from the data or
collecting more data where they are represented
• Are they needed in the finished system?
• Could they be the result of data entry errors?
18. Minority Values
[Bar chart: number of records for each value of the gender variable (Male, Female, M, F); vertical axis from 0 to 600]
What does this chart tell you about the gender variable in a data set?
What should you do before modelling or mining the data?
19. Flat and Wide Variables
• Variables where all the values are minority values
have a flat, wide distribution – one or two of each
possible value
• Such variables are of little use in data mining because
the goal of DM is to find general patterns from
specific data
• No such patterns can exist if each data point is
completely different
• Such variables should be excluded from a model
20. Data Balance
• Imagine I want to predict whether or not a
prospective customer will respond to a mailing
campaign
• I collect the data, put it into a data mining
algorithm, which learns and reports a success
rate of 98%
• Sounds good, but when I put a new set of
prospects through to see who to mail, what
happens?
21. A Problem
• … the system predicts ‘No’ for every single
prospect.
• With a response rate on a campaign of 2%,
then the system is right 98% of the time if it
always says ‘No’.
• So it never chooses anybody to target in the
campaign
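The arithmetic behind this can be checked directly. The sketch assumes a hypothetical list of 1,000 prospects with a 2% response rate.

```python
# Hypothetical campaign: 1,000 prospects, 2% of whom responded.
labels = ["Yes"] * 20 + ["No"] * 980

# A "model" that has learned to say 'No' to everybody.
predictions = ["No"] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
responders_found = sum(p == "Yes" and y == "Yes" for p, y in zip(predictions, labels))

print(accuracy)          # 0.98
print(responders_found)  # 0
```

The success rate is 98%, yet the model does not identify a single responder.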
22. A Solution
• One data pre-processing solution is to balance the number of
examples of each target class in the output variable
• In our previous example: 50% customers and 50% non-
customers
• That way, any gain in accuracy over 50% would certainly be
due to patterns in the data, not the prior distribution
• This is not always easy to achieve – you might need to throw
away a lot of data to balance the examples, or build several
models on balanced subsets
• Not always necessary – if an event is rare because its cause is
rare, then the problem won’t arise
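One simple way to balance the classes is to discard examples of the larger class at random (undersampling). The sketch below assumes hypothetical rows with a `responded` field; the function name `undersample` and the fixed random seed are choices made for the example.

```python
import random


def undersample(rows, label_key, seed=0):
    """Randomly discard rows so every class has as many rows as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)  # avoid leaving the rows grouped by class
    return balanced


# Hypothetical data: 20 responders out of 1,000 prospects.
rows = [{"id": i, "responded": "Yes" if i < 20 else "No"} for i in range(1000)]
balanced = undersample(rows, "responded")

print(len(balanced))  # 40
```

Here 960 of the 1,000 rows are thrown away, which is the cost the slide warns about. Repeating the call with different seeds gives several balanced subsets on which separate models could be built.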
23. Data Quantity
• How much data do you need?
• How long is a piece of string?
• Data must be sufficient to:
– Represent the dynamics of the system to be
modelled
– Cover all situations likely to be encountered when
predictions are needed
– Compensate for any noise in the data
24. Model Building
• Choose a number of techniques suitable to
the task:
– Neural network for prediction or classification
– Decision tree for classification
– Rule induction for classification
– Bayesian network for classification
– K-Means for clustering
25. Train Models
• For each technique:
– Run a series of experiments with different
parameters
– Each experiment should use around 70% of the
data for training and the rest for testing
– When a good solution is found, use cross
validation (10 fold is a good choice) to verify the
result
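A 70/30 split for a single experiment might look like the following sketch. The function name `train_test_split` is defined here for illustration (it is not a library call), and the data is a stand-in list of 100 row numbers.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and split them into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = list(rows)  # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


data = list(range(100))  # stand-in for 100 rows of real data
train, test = train_test_split(data)

print(len(train), len(test))  # 70 30
```

Shuffling before splitting matters when the rows are stored in some meaningful order, such as by date or by class.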
26. Cross Validation
• Split the data into ten subsets, then train 10
models – each one using 9 of the 10 subsets
as training data and the remaining one as test
data, so every subset is used for testing once.
The score is the average of all 10.
• This is a more accurate representation of how
well the data may be modelled, as it reduces
the risk of getting a lucky test set
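The procedure can be sketched as follows. The helper names (`k_fold_indices`, `cross_validate`, `majority_baseline`) and the toy labels are assumptions made for the example; in practice `train_and_score` would train a real model on the training rows and score it on the test rows.

```python
import random


def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train, test) lists of row indices, one pair per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_rows, train_and_score, k=10, seed=0):
    """Average the score returned by train_and_score over all k folds."""
    scores = [train_and_score(train, test)
              for train, test in k_fold_indices(n_rows, k, seed)]
    return sum(scores) / len(scores)


# Toy stand-in for a real model: always predict the majority training label.
labels = [1 if i % 3 == 0 else 0 for i in range(100)]


def majority_baseline(train, test):
    train_labels = [labels[i] for i in train]
    guess = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == guess for i in test) / len(test)


mean_score = cross_validate(len(labels), majority_baseline, k=10)
print(mean_score)
```

Every row appears in exactly one test set, so no single lucky or unlucky split decides the reported score.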
27. Assess Models
• You can measure the success of your model in a
number of ways
– Mean squared error – not always meaningful
– Percentage correct for classification
– Confusion matrix for classification
Output =   True   False
True        80     30
False       20     90
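Percentage correct can be read straight off the confusion matrix: the diagonal cells are the correct classifications. A minimal sketch using the four counts above (the function name `percentage_correct` is chosen for the example):

```python
# The four counts from the confusion matrix, row by row.
confusion = [[80, 30],
             [20, 90]]


def percentage_correct(matrix):
    """Diagonal cells are correct classifications; everything else is an error."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


score = percentage_correct(confusion)
print(round(score, 1))  # 77.3
```

The single figure of 77.3% hides the fact that the two kinds of error (30 and 20) are not equally frequent, which is why the full matrix is worth reporting.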
28. Probability Outputs
• Most classification techniques provide a score
with the classification – either a probability or
some other measure
• This can be used to:
– Allow an answer of “unsure” for cases where no
single class has a high enough probability
– Weight outputs to allow for unequal cost of
outcomes
– Draw lift charts and ROC curves
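The "unsure" option can be sketched as a simple threshold on the winning class probability. The threshold of 0.7 and the function name `classify_with_unsure` are assumptions for illustration; a suitable threshold depends on the problem and its error costs.

```python
def classify_with_unsure(probabilities, threshold=0.7):
    """Return the most probable class, or 'unsure' if it is not probable enough."""
    label, probability = max(probabilities.items(), key=lambda item: item[1])
    return label if probability >= threshold else "unsure"


# Hypothetical model outputs for two cases.
confident = classify_with_unsure({"True": 0.9, "False": 0.1})
borderline = classify_with_unsure({"True": 0.55, "False": 0.45})

print(confident)   # True
print(borderline)  # unsure
```

Cases answered "unsure" can then be passed to a person for checking instead of being acted on automatically.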
29. Generalisation and Overfitting
• Most data mining models have a degree of
complexity that can be controlled by the
designer
• The goal is to find the degree of complexity
that is best suited to the data
• A model that is too simple over-generalises
• A model that is too complex overfits
• Both have an adverse effect on performance
30. Gen-Spec Trade Off
• Beyond a certain point, adding to the
complexity of the model fits the training data
better at the expense of higher test error
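The trade-off can be explored with a small experiment. The sketch below uses a k-nearest-neighbour classifier, which is not one of the techniques named in these slides, purely because its complexity is controlled by a single number: the smaller k is, the more complex the model. The data is synthetic, with 20% of labels deliberately flipped as noise.

```python
import random


def make_data(n, rng, noise=0.2):
    """Hypothetical task: label is 1 when x > 0.5, with 20% of labels flipped."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = 1 if x > 0.5 else 0
        if rng.random() < noise:
            label = 1 - label  # noise: flip the label
        data.append((x, label))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes_for_one = sum(label for _, label in nearest)
    return 1 if 2 * votes_for_one > k else 0


def error_rate(train, data, k):
    wrong = sum(knn_predict(train, x, k) != label for x, label in data)
    return wrong / len(data)


rng = random.Random(0)
train, test = make_data(200, rng), make_data(200, rng)

# Smaller k means a more complex (more flexible) model.
for k in (1, 15, 199):
    print(k, error_rate(train, train, k), error_rate(train, test, k))
```

With k = 1 each training point is its own nearest neighbour, so the training error is zero even though a fifth of the labels are noise. Comparing the training and test columns for each k shows how well each level of complexity actually generalises.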
31. Repeat or Finish
• The result of the data mining will leave you
with either a model that works or the need to
improve
• More data may need to be collected
• Different variables might be tried
• The process can loop several times before a
satisfactory answer is found
32. Understanding and Using the Results
• The resulting model has the ability to perform
the task it was set, so can be embedded in an
automated system
• Some techniques produce models that are
human readable and allow insights into the
structure of the data
• Some are almost impossible to extract
knowledge from