Advanced Analytics



   THE NINE LAWS OF DATA MINING
   Duncan Ross
   @duncan3ross
   duncan.ross@teradata.com

   Based on the 9 Laws of Data Mining by Tom Khabaza
What you won't get from this presentation
• The last two algorithms you need to know!
• An explanation of Bayes' theorem
• The name of the software that will make you $ millions
    > Not even a comparison of different software!

[Image: the grave of Thomas Bayes (probably), near "silicon roundabout". Image via Wikimedia]

2/28/2013                             @duncan3ross
THE 0TH LAW

Data Mining laws also work as Data Science laws

What is data mining?
• This question generates more arguments than answers

• Common features
    > Predicting or classifying things
    > Based on historical cases (with or without outcomes)
    > Machine learning techniques
    > No predefined underlying model assumed




What, where, why and how of data mining
• Who?
• Why? 9 Laws
• How? CRISP-DM
• What?
• Where? Unified data architecture

CRISP-DM created to help




THE 7TH LAW

Prediction increases information locally by generalisation

This may seem obvious
• Data mining learns from generalisations
    > Historical cases build a model of reality


• These general models then predict an outcome that is local
  to a case and a time
    > How likely is it that someone will purchase product 'x'?
    > Will person A influence person B?
    > What number will the ball land on in roulette?


• The knowledge gained may have been implied in the data,
  but it is new and valuable




Why the 7th Law is important
• Results need to be thought of at a group level for
  assessment
    > Individual results may be poor even when generated from a
        great model


• Two levels of value
    > Prediction (what, when etc…)
    > Model (how…)


• The gap between the general and the local is the difference
  between model building and scoring
    > Hadoop?
    > R?
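
The point about assessing results at group level can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not anything taken from the talk: the function name `simulate_scoring`, the 10% purchase probability and the 0.5 decision cut-off are all hypothetical. It assumes a perfectly calibrated model, then compares how the same scores look for the group and for individuals.

```python
import random

def simulate_scoring(n_customers=10_000, p_buy=0.10, seed=1):
    """Compare group-level and individual-level results for a calibrated model.

    Hypothetical set-up: every customer truly buys with probability p_buy,
    and the model reports exactly that probability for each of them.
    """
    rng = random.Random(seed)
    bought = [rng.random() < p_buy for _ in range(n_customers)]

    # Group level: the model's expected number of buyers.
    expected_buyers = n_customers * p_buy
    actual_buyers = sum(bought)

    # Individual level: turn each score into a yes/no call at 0.5.
    # With p_buy < 0.5 the call is "no" for everyone, so every buyer is missed.
    predicted_yes = p_buy >= 0.5
    buyers_called_correctly = sum(1 for b in bought if b and predicted_yes)

    return expected_buyers, actual_buyers, buyers_called_correctly

expected, actual, hits = simulate_scoring()
print(f"expected buyers: {expected:.0f}, actual buyers: {actual}")
print(f"individual buyers called correctly: {hits}")
```

The group forecast lands close to the real total, while no individual buyer is ever "called", which is one way to read "individual results may be poor even when generated from a great model".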




THE 5TH LAW

There are always patterns

The heart of data science…
… is taking the 5th Law to heart

• A major difference between the approach of data mining and
  data science is in the “Field of Dreams”
    > Data mining (usually) requires measurable ROI prior to projects
    > Data science is trading on probable ROI prior to projects


• Fortunately there is still a lot of gold in those hills
    > And as technologies and data increase, the number of hills is also increasing




Graph of hills vs gold extracted




But…
• Just because there are always patterns doesn't mean that
  they are useful
    > Algorithms can (and will) cluster a cloud
    > Without Laws 1 and 2 patterns may not be a good thing
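
"Algorithms can (and will) cluster a cloud" is easy to demonstrate. The sketch below is an illustration only: it uses a plain k-means (Lloyd's algorithm) written from scratch, and the data set, the choice of k = 4 and the name `kmeans` are assumptions made for the example. The points are uniform random noise with no structure at all, yet every point still receives a cluster label.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm on 2-D points; returns a cluster label per point."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centre by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: (x - centres[c][0]) ** 2 + (y - centres[c][1]) ** 2)
            for x, y in points
        ]
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster empties
                centres[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels

# A structureless "cloud": 500 points drawn uniformly from the unit square.
rng = random.Random(123)
cloud = [(rng.random(), rng.random()) for _ in range(500)]

labels = kmeans(cloud, k=4)
sizes = [labels.count(c) for c in range(4)]
print(sizes)  # cluster sizes; every point has been assigned somewhere
```

The algorithm cannot report "there is nothing here". Deciding whether the segments mean anything is a business question, which is where Laws 1 and 2 come in.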




THE 1ST LAW

Business objectives are the origin of every data mining solution

THE 2ND LAW

Business knowledge is central to every step of the data mining process

The sad tale of churn
• This story begins with a gains curve…




What was the business objective?
• To predict churn

• What was the definition of churn?

• What did the business actually want to do?
    > Predict “churn”?
    > Predict people who became inactive?
    > Predict people who became inactive who might not if contacted?




Why the 1st and 2nd Laws are important
• Because we aren't doing this for the fun of it
    > Or at least not just for the fun of it


• At every stage ask:
    > Does this relate to the business question?
    > Is the original business question still valid?
    > Is there a better question that could be asked of this data?
    > Can this be acted on?
    > What does this actually mean?


• Document the answers, and refer back to them




THE 4TH LAW

There is no free lunch for the data miner

The last algorithm you will need to learn
• Is…

• I spent a lot of time on this in the 1990s
    > Neural nets
    > Regression
    > Decision trees


• If you know in advance what technique you need to use, the
  problem has already been solved




The case that worked... then didn't

Campaign Topic: Identify fingerprint of churners

Description: SNA offers an opportunity to detect potential churners earlier (possibly before
they have completely ceased all on-net activity) and also identifies the
individuals who are likely to have the best chance of persuading them to return.
The aim of this campaign format is to use SNA to detect potential churners
during the process of leaving and motivate them to stay.

[Diagram: Current Approach vs New Approach. Labels: Active, Inactive, Churn detected]


Why the 4th Law is important
• Solutions are not generally reproducible
    > It may work here, but not there


• Methodologies are reproducible

• Learnings may have value

• Time will invalidate even the best models




THE 3RD LAW

Data preparation is more than half of every data mining process

Data preparation through a case…




The problems of text data




Data quality raises its head…




What events lead up to a reboot?



Note the number of paths with a reboot following another reboot!

   -- Query 1: count the event paths that end in a reboot (nPath)
   CREATE dimension table wrk.npath_reboot_5events
   AS SELECT path, COUNT(*) AS path_count
   FROM nPath
        (ON wrk.w_event_f
         PARTITION BY srv_id
         ORDER BY evt_ts desc
         MODE (NONOVERLAPPING)
         PATTERN ('X{0,5}.reboot')
         SYMBOLS
             (true AS X,
              evt_name = 'REBOOT' AS reboot)
         RESULT
             (FIRST(srv_id OF X) AS srv_id,
              ACCUMULATE(evt_name OF ANY (X, reboot)) AS path)
        ) GROUP BY 1;

   -- Query 2: draw the selected paths as a Sankey diagram (GraphGen)
   SELECT *
   FROM GraphGen
        (ON (SELECT * FROM wrk.npath_reboot_5events
             ORDER BY path_count
             LIMIT 30)
         PARTITION BY 1
         ORDER BY path_count desc
         item_format('npath')
         item1_col('path')
         score_col('path_count')
         output_format('sankey')
         justify('right'));



More data issues




Looks like an issue with the data on 30th September and beyond: the reboot data
for October seems to have been aggregated and added to 30th September.




Data preparation is tough
• Duncan's theorem
    > The usefulness of a variable in a model is inversely related to
        the amount of time you spend creating it


• Edouard's corollary
    > If it turns out to be useful you could have created it in the time
        indicated by Duncan's theorem




Welcome to the world of big data
• Data just got noisier and less consistent

• Maintaining an analytical data dictionary just moved from
  vital to really, really vital




Why the 3rd Law is important
• Because data prep is such a huge task you need to plan for it
  well
    > Assume that you will need to do it at least twice
      – Experimentation
      – Model building
      – Deployment


• Look for software that makes it easy
    > And repeatable
    > And documentable
      – Scripts ≠ documentation


• Documentation of your data is even more important than
  documentation of your models
    > Models can be very sensitive to data inputs



THE 6TH LAW

Data mining amplifies perception in the business domain

Look for patterns in Network Infrastructure
• Too many end customers to visualise as a graph but network
  has a hierarchy
    > Internet Gateway Area Hub Customer Router


• Create a table using standard SQL to join the reference data
  plus the Customer Hub error data into a single view
            srv_id     dslam     err_cnt   srvid_cnt   nra_id   dslam_cnt   errorspersrvid
            20785675   lgp44-2   2         248         MZL      2           15
            22254516   ltc56-1   4         314         BOT      10          15
            21059184   bch66-1   2         184         RIV      15          15
            21149846   tsm83-1   2         308         LCR      3           13
            20833837   did75-4   10        216         DID      23          13
            22295785   gbw68-1   36        170         HRS      1           12
            21807750   gmo34-1   2         117         BER      17          12
            21374927   bgl93-1   2         246         G5Y      8           12
            20291116   ien11-1   2         211         ALZ      2           12
            21459244   pai34-1   4         210         M7C      3           11
            21027647   bel60-1   4         223         TRO      10          11
            20551629   pla13-1   10        332         BED      4           11
            20633112   crj95-2   2         332         G5Y      8           11
            20585199   bau06-1   46        349         BLA      21          10
            21477790   cvl92-1   4         180         IMS      35          10
            21292874   che78-1   2         163         PIT      2           10




Visualise as a Graph using Aster GraphGen

Size of node = number of customers
Width of edge = number of errors

   SELECT *
   FROM graphgen
        (ON (SELECT DISTINCT dmt_act_dslam,
                    nra_id,
                    nbr_of_srvid,
                    errorspersrv,
                    nbr_of_dslam
             FROM wrk.srvid_dslam_err)
         PARTITION BY 1
         ORDER BY errorspersrv
         item_format('cfilter')
         item1_col('dmt_act_dslam')
         item2_col('nra_id')
         score_col('errorspersrv')
         cnt1_col('nbr_of_srvid')
         cnt2_col('nbr_of_dslam')
         output_format('sigma')
         directed('false')
         width_max(10)
         width_min(1)
         nodesize_max(3)
         nodesize_min(1));




Zoom in on area where the edge width/colour indicates a problem




Add churn information
• Add churn information to find customers connected to this
  Hub that have cancelled their accounts




Synch Issues by Hub Type




Error and Complaint rates by equipment type




Why the 6th Law is important
• We don't exist in a vacuum
    > We need to sell the results of analysis


• This is a virtuous feedback loop




THE 8TH LAW

The value of data mining results is not determined by the accuracy or stability of predictive models

If your model is 98% accurate – so what?
• Or if it's right 1 time in 35?
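
Both questions can be made concrete with a little arithmetic. The sketch below is one possible reading, not the speaker's own worked example. It assumes a hypothetical 2% churn rate for the accuracy case, and it assumes the "1 time in 35" remark alludes to the roulette example earlier in the deck, using a single-number bet that pays 35 to 1 on a 37-pocket (single-zero) wheel. The function names are made up for the illustration.

```python
from fractions import Fraction

def accuracy_of_always_no(churn_rate):
    """Accuracy of a model that predicts 'no churn' for every customer."""
    return 1 - churn_rate

def expected_profit_per_unit(p_win, payout_to_one=35):
    """Expected profit on a one-unit single-number bet paying payout_to_one : 1."""
    return p_win * payout_to_one - (1 - p_win)

# Assumed 2% churn rate: a model that finds no churners at all is 98% accurate.
print(accuracy_of_always_no(Fraction(2, 100)))    # 49/50

# Unaided guessing on an assumed 37-pocket wheel loses money on average.
print(expected_profit_per_unit(Fraction(1, 37)))  # -1/37

# Being right only 1 time in 35 makes money on average.
print(expected_profit_per_unit(Fraction(1, 35)))  # 1/35
```

Under these assumptions the 98% model is worthless and the model that is wrong 34 times out of 35 is profitable, so accuracy alone says little about value.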




How can you evaluate models?
• Type I and Type II errors
    > What is the cost (opportunity and actual) of a false positive?
    > What is the cost of a false negative?


• Gains curves
    > But beware the over-accurate curve


• Don't forget the user
    > Decision trees fight back
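
The first two evaluation ideas above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the scores, outcomes and costs are invented for the example, and `gains_curve` and `total_cost` are hypothetical helper names rather than functions from any particular tool.

```python
def gains_curve(scores, outcomes, steps=10):
    """Cumulative share of positives captured when acting on the top-scored cases.

    Returns a list of (fraction_contacted, fraction_of_positives_found).
    """
    ranked = [o for _, o in sorted(zip(scores, outcomes),
                                   key=lambda t: t[0], reverse=True)]
    total_pos = sum(ranked)
    curve = []
    for i in range(1, steps + 1):
        cutoff = round(len(ranked) * i / steps)
        curve.append((i / steps, sum(ranked[:cutoff]) / total_pos))
    return curve

def total_cost(scores, outcomes, threshold, cost_fp, cost_fn):
    """Cost of acting on every case scored at or above threshold."""
    fp = sum(1 for s, o in zip(scores, outcomes) if s >= threshold and not o)
    fn = sum(1 for s, o in zip(scores, outcomes) if s < threshold and o)
    return fp * cost_fp + fn * cost_fn

# Hypothetical scores and outcomes (1 = churned), for illustration only.
scores   = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1,   1,   0,   1,   0,   0,   0,   0,   0,   0]

print(gains_curve(scores, outcomes, steps=5))

# Assumed costs: a wasted contact costs 5, a missed churner costs 50.
print(total_cost(scores, outcomes, threshold=0.55, cost_fp=5, cost_fn=50))  # 5
print(total_cost(scores, outcomes, threshold=0.85, cost_fp=5, cost_fn=50))  # 100
```

With these made-up costs the lower threshold is far cheaper despite producing a false positive, which is why the cost of each error type matters more than the raw error count.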




THE 9TH LAW

All patterns are subject to change

SUMMARY



0   Listen to data miners…
7   Data mining brings new knowledge
5   And there will always be new knowledge
1   Start with the business
2   Keep going back to the business
4   It won’t get easier with time
3   Especially given the state your data is in
6   But you will improve business results
8   As long as you look for the right outputs
9   Goto 0

RESOURCES


• http://khabaza.codimension.net/index_files/9laws.htm

• The Society of Data Miners (coming soon)
  > Available on LinkedIn


• CRISP-DM

Strata: 9 laws of Data Mining

  • 1. Advanced Analytics THE NINE LAWS OF DATA MINING Duncan Ross @duncan3ross duncan.ross@teradata.com Based on the 9 Laws of Data Mining by Tom Khabaza
  • 2. What you won't get from this presentation • The last two algorithms you need to know! • An explanation of Bayes' theorem • The name of the software that will make you $ millions > Not even a comparison of different software! The grave of Thomas Bayes (probably) – near “silicon roundabout” Image via Wikimedia 2/28/2013 @duncan3ross
  • 3. THE 0TH LAW: Data Mining laws also work as Data Science laws
  • 4. What is data mining? • This question generates more arguments than answers • Common features > Predicting or classifying things > Based on historical cases (with or without outcomes) > Machine learning techniques > No predefined underlying model assumed
  • 5. What, where, why and how of data mining: Who? | Why? 9 Laws | How? CRISP-DM | What? | Where? Unified data architecture
  • 6. CRISP-DM created to help
  • 7. THE 7TH LAW: Prediction increases information locally by generalisation
  • 8. This may seem obvious • Data mining learns from generalisations > Historical cases build a model of reality • These general models then predict an outcome that is local to a case and a time > How likely is it that someone will purchase product 'x'? > Will person a influence person b? > What number will the ball land on in roulette? • The knowledge gained may have been implied in the data, but it is new and valuable
  • 9. Why the 7th Law is important • Results need to be thought of at a group level for assessment > Individual results may be poor even when generated from a great model • Two levels of value > Prediction (what, when etc…) > Model (how…) • The gap between the general and the local is the difference between model building and scoring > Hadoop? > R?
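Slide 9's point that individual results may be poor even when the model is great can be sketched with a small simulation. Everything here is hypothetical: the customers, the three score bands, and a model that is perfectly calibrated by construction. It is an illustration under those assumptions, not the presenter's method.

```python
import random


def simulate(n_customers=10_000, seed=7):
    """Score hypothetical customers with a model that is calibrated by construction."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_customers):
        p = rng.choice([0.1, 0.3, 0.6])   # the model's predicted purchase probability
        bought = rng.random() < p          # outcome drawn from that same probability
        rows.append((p, bought))
    return rows


def group_summary(rows):
    """Return {score band: (customer count, observed purchase rate)}."""
    counts = {}
    for p, bought in rows:
        n, hits = counts.get(p, (0, 0))
        counts[p] = (n + 1, hits + bought)
    return {p: (n, hits / n) for p, (n, hits) in sorted(counts.items())}


def individual_hit_rate(rows, threshold=0.5):
    """Share of customers whose outcome matches the thresholded prediction."""
    correct = sum((p >= threshold) == bought for p, bought in rows)
    return correct / len(rows)


rows = simulate()
summary = group_summary(rows)
for p, (n, observed) in summary.items():
    print(f"predicted {p:.1f}  observed {observed:.3f}  (n={n})")
print(f"individual hit rate: {individual_hit_rate(rows):.3f}")
```

At the group level the observed purchase rate in each band should sit close to the predicted probability, while a noticeable share of individual yes/no calls is still wrong. That gap is why assessment belongs at the group level.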
  • 10. THE 5TH LAW: There are always patterns
  • 11. The heart of data science… is taking the 5th Law to heart • A major difference between the approach of data mining and data science is in the “Field of Dreams” > Data mining (usually) requires measurable ROI prior to projects > Data science is trading on probable ROI prior to projects • Fortunately there is still a lot of gold in those hills > And as technologies and data increase the number of hills is also increasing
  • 12. Graph of hills vs gold extracted
  • 13. But… • Just because there are always patterns doesn't mean that they are useful > Algorithms can (and will) cluster a cloud > Without Laws 1 and 2 patterns may not be a good thing
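The remark on slide 13 that algorithms can (and will) cluster a cloud is easy to reproduce. The sketch below is a minimal pure-Python k-means (Lloyd's algorithm) run on uniformly random points, i.e. data with no cluster structure at all. The function name, parameters and data are illustrative assumptions.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Minimal k-means on 2-D points; returns (centroids, clusters).

    The clusters are the assignments made in the final iteration.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x, y in points:
            nearest = min(
                range(k),
                key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        # Move each centroid to the mean of its cluster; keep it in place if empty.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# A structureless "cloud": 1000 points drawn uniformly from the unit square.
rng = random.Random(42)
cloud = [(rng.random(), rng.random()) for _ in range(1000)]

centroids, clusters = kmeans(cloud, k=4)
for centre, members in zip(centroids, clusters):
    print(f"centre=({centre[0]:.2f}, {centre[1]:.2f})  size={len(members)}")
```

The algorithm returns four tidy-looking segments even though the data has none. Whether a segmentation means anything is a question for Laws 1 and 2, not for the algorithm.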
  • 14. THE 1ST LAW: Business objectives are the origin of every data mining solution • THE 2ND LAW: Business knowledge is central to every step of the data mining process
  • 15. The sad tale of churn • This story begins with a gains curve…
  • 16. What was the business objective? • To predict churn • What was the definition of churn? • What did the business actually want to do? > Predict “churn”? > Predict people who became inactive? > Predict people who became inactive who might not if contacted?
  • 17. Why the 1st and 2nd Laws are important • Because we aren't doing this for the fun of it > Or at least not just for the fun of it • At every stage ask: > Does this relate to the business question? > Is the original business question still valid? > Is there a better question that could be asked of this data? > Can this be acted on? > What does this actually mean? • Document the answers, and refer back to them
  • 18. THE 4TH LAW: There is no free lunch for the data miner
  • 19. The last algorithm you will need to learn • Is…. • I spent a lot of time on this in the 1990s > Neural nets > Regression > Decision trees • If you know in advance what technique you need to use the problem has already been solved
  • 20. The case that worked... then didn't • Campaign Topic: Identify fingerprint of churners • Description: SNA offers an opportunity to detect potential churners earlier (possibly before they have completely ceased all on-net activity) and also identifies the individuals who are likely to have the best chance of persuading them to return. The aim of this campaign format is to use SNA to detect potential churners during the process of leaving and motivate them to stay. • Diagram labels: Current Approach | New Approach | Active | Inactive | Churn detected (shown for each approach)
  • 21. Why the 4th Law is important • Solutions are not generally reproducible > It may work here, but not there • Methodologies are reproducible • Learnings may have value • Time will invalidate even the best models
  • 22. THE 3RD LAW: Data preparation is more than half of every data mining process
  • 23. Data preparation through a case…
  • 24. The problems of text data
  • 25. Data quality raises its head…
  • 26. What events lead up to a reboot? Note number of paths with a reboot, following another reboot!
      CREATE dimension table wrk.npath_reboot_5events AS
      SELECT path, COUNT(*) AS path_count
      FROM nPath (ON wrk.w_event_f
        PARTITION BY srv_id
        ORDER BY evt_ts desc
        MODE (NONOVERLAPPING)
        PATTERN ('X{0,5}.reboot')
        SYMBOLS (true as X, evt_name = 'REBOOT' AS reboot)
        RESULT (FIRST(srv_id OF X) AS srv_id,
                ACCUMULATE (evt_name OF ANY (X,reboot)) AS path)
      )
      GROUP BY 1;
      SELECT * FROM GraphGen (ON
        (SELECT * from wrk.npath_reboot_5events ORDER BY path_count LIMIT 30)
        PARTITION BY 1
        ORDER BY path_count desc
        item_format('npath')
        item1_col('path')
        score_col('path_count')
        output_format('sankey')
        justify('right'));
  • 27. More data issues • There appears to be an issue with the data from 30th September onwards: the reboot data for October seems to have been aggregated and added to 30th September
  • 28. Data preparation is tough • Duncan's theorem > The usefulness of a variable in a model is inversely related to the amount of time you spend creating it • Edouard's corollary > If it turns out to be useful you could have created it in the time indicated by Duncan's theorem
  • 29. Welcome to the world of big data • Data just got noisier and less consistent • Maintaining an analytical data dictionary just moved from vital to really really vital
  • 30. Why the 3rd Law is important • Because data prep is such a huge task you need to plan for it well > Assume that you will need to do it at least twice – Experimentation – Model building – Deployment • Look for software that makes it easy > And repeatable > And documentable – Scripts ≠ documentation • Documentation of your data is even more important than documentation of your models > Models can be very sensitive to data inputs
  • 31. THE 6TH LAW: Data mining amplifies perception in the business domain
  • 32. Look for patterns in Network Infrastructure • Too many end customers to visualise as a graph but network has a hierarchy > Internet Gateway Area Hub Customer Router • Create a table using standard SQL to join the reference data plus the Customer Hub error data into a single view
      srv_id    dslam    err_cnt  srvid_cnt  nra_id  dslam_cnt  errorspersrvid
      20785675  lgp44-2  2        248        MZL     2          15
      22254516  ltc56-1  4        314        BOT     10         15
      21059184  bch66-1  2        184        RIV     15         15
      21149846  tsm83-1  2        308        LCR     3          13
      20833837  did75-4  10       216        DID     23         13
      22295785  gbw68-1  36       170        HRS     1          12
      21807750  gmo34-1  2        117        BER     17         12
      21374927  bgl93-1  2        246        G5Y     8          12
      20291116  ien11-1  2        211        ALZ     2          12
      21459244  pai34-1  4        210        M7C     3          11
      21027647  bel60-1  4        223        TRO     10         11
      20551629  pla13-1  10       332        BED     4          11
      20633112  crj95-2  2        332        G5Y     8          11
      20585199  bau06-1  46       349        BLA     21         10
      21477790  cvl92-1  4        180        IMS     35         10
      21292874  che78-1  2        163        PIT     2          10
  • 33. Visualise as a Graph using Aster GraphGen • Size of Node = number of customers • Width of Edge = number of errors
      SELECT * FROM graphgen (ON
        (SELECT DISTINCT dmt_act_dslam, nra_id, nbr_of_srvid, errorspersrv, nbr_of_dslam
         FROM wrk.srvid_dslam_err)
        PARTITION BY 1
        ORDER BY errorspersrv
        item_format('cfilter')
        item1_col('dmt_act_dslam')
        item2_col('nra_id')
        score_col('errorspersrv')
        cnt1_col('nbr_of_srvid')
        cnt2_col('nbr_of_dslam')
        output_format('sigma')
        directed('false')
        width_max(10) width_min(1)
        nodesize_max (3) nodesize_min (1));
  • 34. Zoom in on area where the edge width/colour indicates a problem
  • 35. Add churn information • Add churn information to find customers connected to this Hub that have cancelled their accounts
  • 36. Synch Issues by Hub Type
  • 37. Error and Complaint rates by equipment type
  • 38. Why the 6th Law is important • We don't exist in a vacuum > We need to sell the results of analysis • This is a virtuous feedback loop
  • 39. THE 8TH LAW: The value of data mining results is not determined by the accuracy or stability of predictive models
  • 40. If your model is 98% accurate – so what? • Or if it's right 1 time in 35?
  • 41. How can you evaluate models? • Type I and Type II errors > What is the cost (opportunity and actual) of a false positive? > What is the cost of a false negative? • Gains curves > But beware the over-accurate curve • Don't forget the user > Decision trees fight back
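The questions on slides 40 and 41 (98% accurate, so what? What does a false positive or a false negative cost?) can be made concrete with a small worked example. All figures are invented for illustration: 1,000 customers of whom 20 churn, a contact cost of 5 per flagged customer, and a loss of 200 per missed churner. The sketch also assumes, optimistically, that every contacted churner is retained.

```python
def confusion(actual, predicted):
    """Return (true positives, false positives, false negatives, true negatives)."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
    return tp, fp, fn, tn


def accuracy(actual, predicted):
    tp, fp, fn, tn = confusion(actual, predicted)
    return (tp + tn) / len(actual)


def total_cost(actual, predicted, contact_cost=5, missed_churner_cost=200):
    """Hypothetical cost model: pay to contact everyone flagged, lose every missed churner."""
    tp, fp, fn, tn = confusion(actual, predicted)
    return (tp + fp) * contact_cost + fn * missed_churner_cost


# 20 churners followed by 980 loyal customers (hypothetical data).
actual = [True] * 20 + [False] * 980

# Model A never predicts churn: highly "accurate", but it finds nobody.
model_a = [False] * 1000

# Model B flags 100 customers and catches 15 of the 20 churners.
model_b = [True] * 15 + [False] * 5 + [True] * 85 + [False] * 895

for name, model in [("A", model_a), ("B", model_b)]:
    print(name, f"accuracy={accuracy(actual, model):.2f}", f"cost={total_cost(actual, model)}")
```

Under these assumed costs the less accurate model is the cheaper one to act on, which is the point of the 8th Law: value depends on what the errors cost the business, not on the accuracy figure alone.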
  • 42. THE 9TH LAW: All patterns are subject to change
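One common way to watch for the change the 9th Law describes is to compare the distribution of a model input (or score) at build time with its distribution today. The sketch below uses the Population Stability Index, a technique chosen here for illustration and not mentioned in the slides. The data is simulated and the function name, bin count and floor value are assumptions.

```python
import math
import random


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline sample and a newer sample.

    Bins are equal-width over the baseline's range; newer values outside that
    range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid a zero width if the baseline is constant

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


rng = random.Random(1)
baseline = [rng.gauss(0, 1) for _ in range(5000)]  # data the model was built on
stable = [rng.gauss(0, 1) for _ in range(5000)]    # later data, same behaviour
shifted = [rng.gauss(1, 1) for _ in range(5000)]   # later data, behaviour has moved

print(f"PSI stable:  {psi(baseline, stable):.3f}")
print(f"PSI shifted: {psi(baseline, shifted):.3f}")
```

A widely quoted rule of thumb treats values below roughly 0.1 as stable and values above roughly 0.25 as a signal to revisit the model. Those thresholds are conventions, not guarantees.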
  • 43. SUMMARY • 0 Listen to data miners… • 7 Data mining brings new knowledge • 5 And there will always be new knowledge • 1 Start with the business • 2 Keep going back to the business • 4 It won’t get easier with time • 3 Especially given the state your data is in • 6 But you will improve business results • 8 As long as you look for the right outputs • 9 Goto 0
  • 44. RESOURCES • http://khabaza.codimension.net/index_files/9laws.htm • The Society of Data Miners (coming soon) > Available on LinkedIn • CRISP-DM