1. Advanced Analytics
THE NINE LAWS OF DATA MINING
Duncan Ross
@duncan3ross
duncan.ross@teradata.com
Based on the 9 Laws of Data Mining by Tom Khabaza
2. What you won’t get from this presentation
• The last two algorithms you need to know!
• An explanation of Bayes’ theorem
• The name of the software that will make you $ millions
> Not even a comparison of different software!
The grave of Thomas Bayes (probably) – near “silicon roundabout”
Image via Wikimedia
2/28/2013 @duncan3ross
3. THE 0TH LAW
Data Mining laws also work as Data Science laws
4. What is data mining?
• This question generates more arguments than answers
• Common features
> Predicting or classifying things
> Based on historical cases (with or without outcomes)
> Machine learning techniques
> No predefined underlying model assumed
5. What, where, why and how of data mining
Who?
Why? 9 Laws
How? CRISP-DM
What?
Where? Unified data architecture
7. THE 7TH LAW
Prediction increases information locally by generalisation
8. This may seem obvious
• Data mining learns from generalisations
> Historical cases build a model of reality
• These general models then predict an outcome that is local to a case and a time
> How likely is it that someone will purchase product ‘x’?
> Will person A influence person B?
> What number will the ball land on in roulette?
• The knowledge gained may have been implied in the data, but it is new and valuable
9. Why the 7th Law is important
• Results need to be thought of at a group level for assessment
> Individual results may be poor even when generated from a great model
• Two levels of value
> Prediction (what, when etc…)
> Model (how…)
• The gap between the general and the local is the difference between model building and scoring
> Hadoop?
> R?
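The group-level point above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the model is hypothetical and perfectly calibrated, every case has a purchase probability between 0% and 40%, and the function names are invented for this example. No individual is ever more likely than not to buy, so any yes/no call on one person is often wrong, yet the group totals and the lift in the top-scored decile are reliable.

```python
import random


def simulate_scored_cases(n_cases=10_000, seed=0):
    """Return (scores, outcomes) for a hypothetical, perfectly calibrated model."""
    rng = random.Random(seed)
    # Assumption: each case has a purchase probability between 0% and 40%.
    scores = [rng.uniform(0.0, 0.4) for _ in range(n_cases)]
    # Each outcome is a coin flip weighted by that case's own score.
    outcomes = [rng.random() < p for p in scores]
    return scores, outcomes


def response_rate(outcomes):
    """Fraction of cases with a positive outcome."""
    return sum(outcomes) / len(outcomes)


def top_decile_lift(scores, outcomes):
    """Response rate in the top-scored 10% divided by the overall rate."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    top = [outcome for _, outcome in ranked[: len(ranked) // 10]]
    return response_rate(top) / response_rate(outcomes)


if __name__ == "__main__":
    scores, outcomes = simulate_scored_cases()
    print("expected buyers (sum of scores):", round(sum(scores)))
    print("actual buyers:", sum(outcomes))
    print("top-decile lift:", round(top_decile_lift(scores, outcomes), 2))
```

The scoring step here is trivial; in practice it is the part that runs per case, while the model behind the scores is built once from the general population.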
10. THE 5TH LAW
There are always patterns
11. The heart of data science…
… is taking the 5th Law to heart
• A major difference between the approach of data mining and data science is in the “Field of Dreams”
> Data mining (usually) requires measurable ROI prior to projects
> Data science is trading on probable ROI prior to projects
• Fortunately there is still a lot of gold in those hills
> And as technologies and data increase the number of hills is also increasing
13. But…
• Just because there are always patterns doesn’t mean that they are useful
> Algorithms can (and will) cluster a cloud
> Without Laws 1 and 2 patterns may not be a good thing
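"Clustering a cloud" is easy to demonstrate. The sketch below is a minimal k-means written for this illustration (not any particular library's implementation), run on points drawn uniformly at random from the unit square. The data has no structure by construction, yet the algorithm still returns the requested number of clusters with tidy-looking centroids. The function names, the choice of k = 3 and the 500-point cloud are all assumptions made for the example.

```python
import random


def make_cloud(n_points=500, seed=1):
    """Uniform random points in the unit square: no real clusters exist."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n_points)]


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd-style k-means on 2-D points; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
            for x, y in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centroids


if __name__ == "__main__":
    cloud = make_cloud()
    labels, centroids = kmeans(cloud, k=3)
    for c, (cx, cy) in enumerate(centroids):
        print(f"cluster {c}: {labels.count(c)} points, centroid ({cx:.2f}, {cy:.2f})")
```

Nothing in the output says whether the segments mean anything; that judgement has to come from the business objective and business knowledge.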
14. THE 1ST LAW
Business objectives are the origin of every data mining solution
THE 2ND LAW
Business knowledge is central to every step of the data mining process
15. The sad tale of churn
• This story begins with a gains curve…
16. What was the business objective?
• To predict churn
• What was the definition of churn?
• What did the business actually want to do?
> Predict “churn”?
> Predict people who became inactive?
> Predict people who became inactive who might not if contacted?
17. Why the 1st and 2nd Laws are important
• Because we aren’t doing this for the fun of it
> Or at least not just for the fun of it
• At every stage ask:
> Does this relate to the business question?
> Is the original business question still valid?
> Is there a better question that could be asked of this data?
> Can this be acted on?
> What does this actually mean?
• Document the answers, and refer back to them
18. THE 4TH LAW
There is no free lunch for the data miner
19. The last algorithm you will need to learn
• Is….
• I spent a lot of time on this in the 1990s
> Neural nets
> Regression
> Decision trees
• If you know in advance what technique you need to use, the problem has already been solved
20. The case that worked... then didn’t
Campaign Topic: Identify fingerprint of churners
Description: SNA (social network analysis) offers an opportunity to detect potential churners earlier (possibly before they have completely ceased all on-net activity) and also identifies the individuals who are likely to have the best chance of persuading them to return. The aim of this campaign format is to use SNA to detect potential churners during the process of leaving and motivate them to stay.
[Diagram: Current Approach vs. New Approach, marking where churn is detected relative to the Active and Inactive states]
21. Why the 4th Law is important
• Solutions are not generally reproducible
> It may work here, but not there
• Methodologies are reproducible
• Learnings may have value
• Time will invalidate even the best models
22. THE 3RD LAW
Data preparation is more than half of every data mining process
26. What events lead up to a reboot?
Note number of paths with a reboot, following another reboot!
CREATE dimension table wrk.npath_reboot_5events
AS SELECT path, COUNT(*) AS path_count
FROM nPath
  (ON wrk.w_event_f
   PARTITION BY srv_id
   ORDER BY evt_ts desc
   MODE (NONOVERLAPPING)
   PATTERN ('X{0,5}.reboot')
   SYMBOLS
     (true as X,
      evt_name = 'REBOOT' AS reboot)
   RESULT
     (FIRST(srv_id OF X) AS srv_id,
      ACCUMULATE (evt_name OF ANY (X,reboot))
        AS path)
  ) GROUP BY 1 ;

SELECT *
FROM GraphGen (ON
  (SELECT * from wrk.npath_reboot_5events
   ORDER BY path_count
   LIMIT 30 )
  PARTITION BY 1
  ORDER BY path_count desc
  item_format('npath')
  item1_col('path')
  score_col('path_count')
  output_format('sankey')
  justify('right'));
27. More data issues
Looks like an issue with the data on the 30th September and beyond: the reboot data for October seems to have been aggregated and added to the 30th September.
28. Data preparation is tough
• Duncan’s theorem
> The usefulness of a variable in a model is inversely related to the amount of time you spend creating it
• Edouard’s corollary
> If it turns out to be useful you could have created it in the time indicated by Duncan’s theorem
29. Welcome to the world of big data
• Data just got noisier and less consistent
• Maintaining an analytical data dictionary just moved from vital to really, really vital
30. Why the 3rd Law is important
• Because data prep is such a huge task you need to plan for it well
> Assume that you will need to do it at least twice
– Experimentation
– Model building
– Deployment
• Look for software that makes it easy
> And repeatable
> And documentable
– Scripts ≠ documentation
• Documentation of your data is even more important than documentation of your models
> Models can be very sensitive to data inputs
31. THE 6TH LAW
Data mining amplifies perception in the business domain
32. Look for patterns in Network Infrastructure
• Too many end customers to visualise as a graph but network has a hierarchy
> Internet Gateway Area Hub Customer Router
• Create a table using standard SQL to join the reference data plus the Customer Hub error data into a single view
srv_id    dslam    err_cnt  srvid_cnt  nra_id  dslam_cnt  errorspersrvid
20785675  lgp44-2        2        248  MZL             2              15
22254516  ltc56-1        4        314  BOT            10              15
21059184  bch66-1        2        184  RIV            15              15
21149846  tsm83-1        2        308  LCR             3              13
20833837  did75-4       10        216  DID            23              13
22295785  gbw68-1       36        170  HRS             1              12
21807750  gmo34-1        2        117  BER            17              12
21374927  bgl93-1        2        246  G5Y             8              12
20291116  ien11-1        2        211  ALZ             2              12
21459244  pai34-1        4        210  M7C             3              11
21027647  bel60-1        4        223  TRO            10              11
20551629  pla13-1       10        332  BED             4              11
20633112  crj95-2        2        332  G5Y             8              11
20585199  bau06-1       46        349  BLA            21              10
21477790  cvl92-1        4        180  IMS            35              10
21292874  che78-1        2        163  PIT             2              10
33. Visualise as a Graph using Aster GraphGen
Size of node = number of customers
Width of edge = number of errors
SELECT *
FROM graphgen
(ON
(SELECT DISTINCT dmt_act_dslam,
nra_id,
nbr_of_srvid,
errorspersrv,
nbr_of_dslam
FROM wrk.srvid_dslam_err)
PARTITION BY 1
ORDER BY errorspersrv
item_format('cfilter')
item1_col('dmt_act_dslam')
item2_col('nra_id')
score_col('errorspersrv')
cnt1_col('nbr_of_srvid')
cnt2_col('nbr_of_dslam')
output_format('sigma')
directed('false')
width_max(10)
width_min(1)
nodesize_max (3)
nodesize_min (1));
34. Zoom in on area where the edge width/colour indicates a problem
35. Add churn information
• Add churn information to find customers connected to this Hub that have cancelled their accounts
38. Why the 6th Law is important
• We don’t exist in a vacuum
> We need to sell the results of analysis
• This is a virtuous feedback loop
39. THE 8TH LAW
The value of data mining results is not determined by the accuracy or stability of predictive models
40. If your model is 98% accurate – so what?
• Or if it’s right 1 time in 35?
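Both questions can be made concrete with a few lines of arithmetic. The sketch below uses invented numbers, not figures from the talk: a population of 1,000 customers with a 2% churn rate, a value of 500 per successful contact and a cost of 5 per contact. Under those assumptions a model that never predicts churn is 98% accurate and finds nobody, while a model that is right only 1 time in 35 still pays for itself.

```python
def accuracy(actual, predicted):
    """Share of cases where the prediction matches the outcome."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def recall(actual, predicted):
    """Share of real positives that the model flagged."""
    hits = sum(a and p for a, p in zip(actual, predicted))
    return hits / sum(actual)


def profit_per_contact(hit_rate, value_per_hit, cost_per_contact):
    """Expected profit of contacting one flagged case."""
    return hit_rate * value_per_hit - cost_per_contact


if __name__ == "__main__":
    # Hypothetical population: 20 churners out of 1,000 customers.
    actual = [True] * 20 + [False] * 980
    never_churn = [False] * 1000  # a "model" that always says no

    print("accuracy:", accuracy(actual, never_churn))
    print("recall:", recall(actual, never_churn))
    # Hypothetical economics for a model that is right 1 time in 35.
    print("profit per contact:", round(profit_per_contact(1 / 35, 500, 5), 2))
```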
41. How can you evaluate models?
• Type I and Type II errors
> What is the cost (opportunity and actual) of a false positive?
> What is the cost of a false negative?
• Gains curves
> But beware the over-accurate curve
• Don’t forget the user
> Decision trees fight back
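The first two evaluation approaches above can be sketched in a few lines. Everything here is illustrative: the ten scored cases, the cost of 5 for a false positive (a wasted contact) and 100 for a false negative (a lost customer), and the function names are assumptions, not data from the talk. With these numbers the lower threshold is less accurate (60% against 70%) but much cheaper, which is the kind of trade-off that accuracy alone hides.

```python
# Ten hypothetical scored cases: model score and whether the event happened.
SCORES = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
OUTCOMES = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]


def confusion_counts(scores, outcomes, threshold):
    """Return (tp, fp, fn, tn) when cases scoring >= threshold are flagged."""
    tp = fp = fn = tn = 0
    for score, outcome in zip(scores, outcomes):
        flagged = score >= threshold
        if flagged and outcome:
            tp += 1
        elif flagged:
            fp += 1  # Type I error: flagged, but nothing happened
        elif outcome:
            fn += 1  # Type II error: missed a real event
        else:
            tn += 1
    return tp, fp, fn, tn


def total_cost(scores, outcomes, threshold, fp_cost, fn_cost):
    """Cost of the errors made at a given threshold."""
    _, fp, fn, _ = confusion_counts(scores, outcomes, threshold)
    return fp * fp_cost + fn * fn_cost


def cumulative_gains(scores, outcomes):
    """Share of all positives captured after each case, best score first."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    total = sum(outcomes)
    captured = 0
    gains = []
    for _, outcome in ranked:
        captured += outcome
        gains.append(captured / total)
    return gains


if __name__ == "__main__":
    for threshold in (0.5, 0.15):
        tp, fp, fn, tn = confusion_counts(SCORES, OUTCOMES, threshold)
        cost = total_cost(SCORES, OUTCOMES, threshold, fp_cost=5, fn_cost=100)
        print(f"threshold {threshold}: accuracy {(tp + tn) / 10}, cost {cost}")
    print("gains curve:", cumulative_gains(SCORES, OUTCOMES))
```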
42. THE 9TH LAW
All patterns are subject to change
43. SUMMARY
0 Listen to data miners…
7 Data mining brings new knowledge
5 And there will always be new knowledge
1 Start with the business
2 Keep going back to the business
4 It won’t get easier with time
3 Especially given the state your data is in
6 But you will improve business results
8 As long as you look for the right outputs
9 Goto 0
44. RESOURCES
• http://khabaza.codimension.net/index_files/9laws.htm
• The Society of Data Miners (coming soon)
> Available on LinkedIn
• CRISP-DM