Data Mining Query Languages
Kristen LeFevre
April 19, 2004
With Thanks to Zheng Huang and Lei Chen
Outline

- Introduce the problem of querying data mining models
- Overview of three different solutions and their contributions
- Topic for discussion: What would an ideal solution support?
Problem Description
- You guys are armed with two powerful tools
    - Database management systems
    - Efficient and effective data mining algorithms and frameworks
- Generally, this work asks:
    - “How can we merge the two?”
    - “How can we integrate data mining more closely with traditional database systems, particularly querying?”
Three Different Answers
- DMQL: A Data Mining Query Language for Relational Databases (Han et al., Simon Fraser University)
- Integrating Data Mining with SQL Databases: OLE DB for Data Mining (Netz et al., Microsoft)
- MSQL: A Query Language for Database Mining (Imielinski & Virmani, Rutgers University)
Some Common Ground
- Create and manipulate data mining models through a SQL-based interface (“command-driven” data mining)
- Abstract away the data mining particulars
- Data mining should be performed on data in the database (should not need to export to a special-purpose environment)
- Approaches differ on what kinds of models should be created, and what operations we should be able to perform
DMQL
- Commands specify the following:
    - The set of data relevant to the data mining task (the training set)
    - The kinds of knowledge to be discovered
        • Generalized relation
        • Characteristic rules
        • Discriminant rules
        • Classification rules
        • Association rules
DMQL

- Commands specify the following:
    - Background knowledge
        • Concept hierarchies based on attribute relationships, etc.
    - Various thresholds
        • Minimum support, confidence, etc.
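
To make the idea of background knowledge concrete, here is a minimal Python sketch of how a concept hierarchy can generalize raw attribute values before mining. The age ranges, concept names, and sample rows are assumptions made up for illustration; they are not taken from DMQL or from the slides.

```python
from collections import Counter

# Hypothetical concept hierarchy for Age: (low, high, concept), bounds inclusive.
AGE_HIERARCHY = [(0, 17, "minor"), (18, 49, "adult"), (50, 120, "senior")]


def generalize_age(age):
    """Map a raw age to its higher-level concept in the hierarchy."""
    for low, high, concept in AGE_HIERARCHY:
        if low <= age <= high:
            return concept
    return "unknown"


# Illustrative (age, smoker) tuples.
rows = [(23, "yes"), (31, "yes"), (58, "no"), (64, "no"), (45, "no")]

# Replace each age by its concept and merge identical tuples, keeping a count.
generalized = Counter((generalize_age(age), smoker) for age, smoker in rows)
```

After generalization, the five raw tuples collapse into three generalized tuples with counts, which is roughly the shape of a generalized relation.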
DMQL
Syntax
                                        use database <database_name>
Specify background knowledge            {use hierarchy <hierarchy_name> for <attribute>}
Specify rules to be discovered          <rule_spec>
Relevant attributes or aggregations     related to <attr_or_agg_list>
Collect the set of relevant data        from <relation(s)>
to mine                                 [where <conditions>]
                                        [order by <order list>]
Specify threshold parameters            {with [<kinds of>] threshold = <threshold_value> [for <attribute(s)>]}
DMQL
- Syntax <rule_spec>
find classification rules [as <rule_name>]
  [according to <attributes>]

find association rules [as <rule_name>]

generalize data [into <relation_name>]

others
DMQL
 use database Hospital
 find association rules as Heart_Health
 related to Salary, Age, Smoker,
   Heart_Disease
 from Patient_Financial f, Patient_Medical m
 where f.ID = m.ID and m.age >= 18
 with support threshold = .05
 with confidence threshold = .7
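
The support and confidence thresholds in the command above can be illustrated with a small Python sketch. It uses the usual definitions (support: fraction of rows matching body and consequent; confidence: that count divided by the rows matching the body). The toy records and helper names are assumptions for illustration only, and the thresholds are read as minimums.

```python
# Toy patient records; attribute names and values are illustrative only.
patients = [
    {"Smoker": "yes", "Age": "50+", "Heart_Disease": "yes"},
    {"Smoker": "yes", "Age": "50+", "Heart_Disease": "yes"},
    {"Smoker": "yes", "Age": "18-49", "Heart_Disease": "no"},
    {"Smoker": "no", "Age": "50+", "Heart_Disease": "no"},
    {"Smoker": "no", "Age": "18-49", "Heart_Disease": "no"},
]


def matches(row, descriptors):
    """True if the row satisfies every attribute = value descriptor."""
    return all(row.get(attr) == value for attr, value in descriptors.items())


def support(rows, body, consequent):
    """Fraction of all rows that satisfy both the body and the consequent."""
    both = {**body, **consequent}
    return sum(matches(r, both) for r in rows) / len(rows)


def confidence(rows, body, consequent):
    """Among rows satisfying the body, the fraction also satisfying the consequent."""
    body_count = sum(matches(r, body) for r in rows)
    if body_count == 0:
        return 0.0
    both = {**body, **consequent}
    return sum(matches(r, both) for r in rows) / body_count


body = {"Smoker": "yes"}
consequent = {"Heart_Disease": "yes"}
s = support(patients, body, consequent)     # 2 of 5 rows
c = confidence(patients, body, consequent)  # 2 of the 3 smokers
keep = s >= 0.05 and c >= 0.7               # thresholds from the DMQL command
```

On this toy data the rule clears the support threshold but not the confidence threshold, so it would not be reported.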
DMQL

- DMQL provides a “display in” command to view the resulting rules, but no advanced way to query them
- Suggests that a GUI might aid in presenting these results in different forms (charts, graphs, etc.)
MSQL

- Focus on association rules
- Seeks to provide a language both to selectively generate rules, and separately to query the rule base
- Expressive rule generation language, and techniques for optimizing some commands
MSQL
- Get-Rules and Select-Rules queries
    - The Get-Rules operator generates rules over the elements of the argument class C that satisfy the conditions in the “where” clause
    [Project Body, Consequent,
      confidence, support]
    GetRules(C) [as R1]
    [into <rulebase_name>]
    [where <conds>]
    [sql-group-by clause]
    [using-clause]
MSQL
- <conds> may contain a number of conditions, including:
    - Restrictions on the attributes in the body or consequent
      (in, has, and is are rule subset, superset, and equality, respectively)
        • “rule.body HAS {(Job = ‘Doctor’)}”
        • “rule1.consequent IN rule2.body”
        • “rule.consequent IS {Age = *}”
    - Pruning conditions (restrict by support, confidence, or size)
    - Stratified or correlated subqueries
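
The three set operators can be sketched in Python by treating a rule body or consequent as a set of (attribute, value) descriptors. This is only an illustration of the subset, superset, and equality reading given above. The function names and the treatment of the `*` wildcard (it matches any value of the named attribute) are assumptions, not MSQL's actual implementation.

```python
WILDCARD = "*"


def descriptor_matches(pattern, descriptor):
    """A pattern matches a descriptor on the same attribute with the same or wildcard value."""
    p_attr, p_val = pattern
    attr, val = descriptor
    return p_attr == attr and (p_val == WILDCARD or p_val == val)


def has(itemset, patterns):
    """itemset HAS patterns: every pattern is matched by some descriptor (superset)."""
    return all(any(descriptor_matches(p, d) for d in itemset) for p in patterns)


def is_in(itemset, patterns):
    """itemset IN patterns: every descriptor is matched by some pattern (subset)."""
    return all(any(descriptor_matches(p, d) for p in patterns) for d in itemset)


def is_exactly(itemset, patterns):
    """itemset IS patterns: both directions hold (equality up to wildcards)."""
    return has(itemset, patterns) and is_in(itemset, patterns)


body = {("Job", "Doctor"), ("Age", "30-39")}
has_doctor = has(body, {("Job", "Doctor")})                   # body HAS {(Job = 'Doctor')}
has_any_age = has(body, {("Age", WILDCARD)})                  # body HAS {Age = *}
only_doctor = is_in(body, {("Job", "Doctor")})                # body IN {(Job = 'Doctor')}
age_only = is_exactly({("Age", "30-39")}, {("Age", WILDCARD)})  # IS {Age = *}
```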
MSQL
       GetRules(Patients) R1
       where Body has {Age = *}
       and Support > .05 and Confidence > .7
       and not exists ( GetRules(Patients) R2
                       where Support > .05 and
                       Confidence > .7
                       and R2.Body HAS R1.Body)

Retrieve all rules with a descriptor of the form “Age = x” in the body,
except when another rule that also meets the support and confidence
thresholds has a body containing a superset of those descriptors
MSQL
Correlated:
              GetRules(C) R1
              where <pruning-conds>
              and not exists ( GetRules(C) R2
                               where <same pruning-conds>
                               and R2.Body HAS R1.Body)

Stratified:
              GetRules(C) R1
              where <pruning-conds>
              and consequent is {(X=*)}
              and consequent in (SelectRules(R2)
                                 where consequent is {(X=*)})
MSQL
- Nested Get-Rules queries and their optimization
    - Stratified (non-correlated) queries are evaluated “bottom-up.” The subquery is evaluated first and replaced with its results in the outer query.
    - Correlated queries are evaluated either top-down or bottom-up (like “loop-unfolding”), and there are rules for choosing between the two options
MSQL
GetRules(Patients) R1
where Body has {Age = *}
and Support > .05 and Confidence > .7
and not exists ( GetRules(Patients) R2
                where Support > .05 and
                Confidence > .7
                and R2.Body HAS R1.Body)
MSQL

Top-Down Evaluation
GetRules(Patients) R1
where Body has {Age = *}
and Support > .05 and Confidence > .7

For each rule produced by the outer query, evaluate the
inner query
 not exists ( GetRules(Patients) R2
   where Support > .05 and Confidence > .7
   and R2.Body HAS R1.Body)
MSQL

Bottom-Up Evaluation
not exists ( GetRules(Patients) R2
  where Support > .05 and Confidence > .7
  and R2.Body HAS R1.Body)


For each rule produced by the inner query, evaluate the
outer query
 GetRules(Patients) R1
 where Body has {Age = *}
 and Support > .05 and Confidence > .7
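
The two evaluation orders can be compared with a small Python sketch over a precomputed list of rules. It is only an illustration: in MSQL the cost lies in generating the rules, which this sketch skips. The `Rule` class, the sample rules, and the reading of HAS as a proper superset (so that a rule does not exclude itself) are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    body: frozenset        # set of (attribute, value) descriptors
    consequent: frozenset
    support: float
    confidence: float


def passes_pruning(rule):
    """Pruning conditions shared by the outer and inner queries."""
    return rule.support > 0.05 and rule.confidence > 0.7


def has_age(rule):
    """Body has {Age = *}: some descriptor in the body is on Age."""
    return any(attr == "Age" for attr, _ in rule.body)


def top_down(rules):
    """Evaluate the outer query first, then probe the inner query once per outer rule."""
    outer = [r1 for r1 in rules if has_age(r1) and passes_pruning(r1)]
    result = []
    for r1 in outer:
        # Assumption: HAS is taken as a proper superset (frozenset ">").
        dominated = any(passes_pruning(r2) and r2.body > r1.body for r2 in rules)
        if not dominated:
            result.append(r1)
    return result


def bottom_up(rules):
    """Evaluate the inner query first, then use each inner rule to exclude outer rules."""
    inner = [r2 for r2 in rules if passes_pruning(r2)]
    excluded = set()
    for r2 in inner:
        for r1 in rules:
            if has_age(r1) and passes_pruning(r1) and r2.body > r1.body:
                excluded.add(r1)
    return [r1 for r1 in rules
            if has_age(r1) and passes_pruning(r1) and r1 not in excluded]


hd_yes = frozenset({("Heart_Disease", "yes")})
hd_no = frozenset({("Heart_Disease", "no")})
r_a = Rule(frozenset({("Age", "50+")}), hd_yes, 0.20, 0.75)
r_b = Rule(frozenset({("Age", "50+"), ("Smoker", "yes")}), hd_yes, 0.10, 0.90)
r_c = Rule(frozenset({("Smoker", "yes")}), hd_yes, 0.30, 0.80)
r_d = Rule(frozenset({("Age", "18-49")}), hd_no, 0.25, 0.85)
r_e = Rule(frozenset({("Age", "18-49"), ("Smoker", "no")}), hd_no, 0.02, 0.95)
rules = [r_a, r_b, r_c, r_d, r_e]

td = top_down(rules)
bu = bottom_up(rules)
```

Both orders return the same rules; they differ only in which side is enumerated first, which is why the choice can be made on cost grounds.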
MSQL
- Choosing between the two
    - In general, evaluate the expression with the more restrictive conditions first
    - Heuristic rules
        • Evaluate the query with the higher support threshold first
        • Next consider the confidence threshold
        • A (length = x) expression is in general more restrictive than (length > x), which is more restrictive than (length < x)
        • “Body IS (constant expression)” is more restrictive than “Body HAS”, which is more restrictive than “Body IN”
        • Next consider “Consequent IN” expressions
        • Descriptors of the form (A = a) are more restrictive than wildcards such as (A = *)
    - Note: meant to prevent unconstrained queries from being evaluated first
OLE DB for DM
- An extension to the OLE DB interface for Microsoft SQL Server
- Seeks to support the following ideas:
    - Define a model by specifying the set of attributes to be predicted, the attributes used for the prediction, and the algorithm
    - Populate the model using the training data
    - Predict attributes for new data using the populated model (none of the others seemed to support this)
    - Browse the mining model (not fully addressed because it varies a lot by model type)
OLE DB for DM
- Defining a Mining Model
    - Identify the set of data attributes to be predicted, the set of attributes to be used for prediction, and the algorithm to be used for building the model
- Populating the Model
    - Pull the information into a single rowset using views, and train the model using the data and algorithm specified
    - Supports complex objects, so the rowset may be hierarchical (see paper for more complex examples)
OLE DB for DM

- Using the mining model to predict
    - Defines a new operator, the prediction join. A model may be used to make predictions on datasets by taking the prediction join of the mining model and the data set.
OLE DB for DM
CREATE MINING MODEL [Heart_Health Prediction]
(
  [ID] Int Key,
  [Age] Int,
  [Smoker] Int,
  [Salary] Double discretized,
  [HeartAttack] Int PREDICT  -- prediction column
)
USING [Decision_Trees_101]


Identifies the source columns for the training
data, the column to be predicted, and the data
mining algorithm.
OLE DB for DM
INSERT INTO [Heart_Health Prediction]
([ID], [Age], [Smoker], [Salary], [HeartAttack])
SELECT M.[ID], [Age], [Smoker], [Salary], [HeartAttack]
FROM Patient_Medical M, Patient_Financial F
WHERE M.ID = F.ID



The INSERT feeds each selected tuple to the model as
training data; the tuples are not actually stored in
the rowset.
OLE DB for DM
SELECT t.[ID],
  [Heart_Health Prediction].[HeartAttack]
FROM [Heart_Health Prediction]
PREDICTION JOIN (
  SELECT M.[ID], [Age], [Smoker], [Salary]
  FROM Patient_Medical M, Patient_Financial F
  WHERE M.ID = F.ID) AS t
ON [Heart_Health Prediction].Age = t.Age AND
  [Heart_Health Prediction].Smoker = t.Smoker
  AND [Heart_Health Prediction].Salary = t.Salary

Prediction join connects the model and an actual data
table to make predictions
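
As a rough analogy for the define, populate, and predict cycle, here is a Python sketch in which the “model” is just a lookup table of the most common outcome per combination of input values. It is a deliberately simple stand-in for a decision tree, not how OLE DB for DM works internally. The function names, column values, and sample rows are assumptions for illustration.

```python
from collections import Counter, defaultdict


def train(rows, inputs, target):
    """Populate the model: most common target value for each combination of inputs."""
    votes = defaultdict(Counter)
    for row in rows:
        key = tuple(row[col] for col in inputs)
        votes[key][row[target]] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


def prediction_join(model, rows, inputs, id_col="ID"):
    """Pair each new row with the model's prediction by matching on the input columns."""
    return [
        (row[id_col], model.get(tuple(row[col] for col in inputs)))
        for row in rows
    ]


inputs = ["Age", "Smoker"]
training_rows = [
    {"ID": 1, "Age": "50+", "Smoker": 1, "HeartAttack": 1},
    {"ID": 2, "Age": "50+", "Smoker": 1, "HeartAttack": 1},
    {"ID": 3, "Age": "50+", "Smoker": 1, "HeartAttack": 0},
    {"ID": 4, "Age": "18-49", "Smoker": 0, "HeartAttack": 0},
]
new_rows = [
    {"ID": 10, "Age": "50+", "Smoker": 1},
    {"ID": 11, "Age": "18-49", "Smoker": 0},
    {"ID": 12, "Age": "18-49", "Smoker": 1},  # combination never seen in training
]

model = train(training_rows, inputs, "HeartAttack")
predictions = prediction_join(model, new_rows, inputs)
```

The join condition plays the same role as the ON clause: it says which columns of the new data line up with which inputs of the model. A real mining algorithm would generalize to unseen combinations, whereas this lookup simply returns `None` for them.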
Key Ideas

- Important to have an API for creating and manipulating data mining models
- The data is already in the DBMS, so it makes sense to do the data mining where the data is
- Applications already use SQL, so a SQL extension seems logical
Key Ideas
- Need a method for defining data mining models, including algorithm specification, specification of various parameters, and training set specification (DMQL, MSQL, OLE DB for DM)
- Need a method of querying the models (MSQL)
- Need a way of using the data mining model to interact with other data in the database, for purposes such as prediction (OLE DB for DM)
Discussion Topic:
What Functionality Would
an Ideal Solution
Support?
