Data Representation
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
Lecture outline

• Components of the input
    - Concepts, instances, and attributes
• Attribute types
    - Nominal
    - Ordinal
    - Interval
    - Ratio
• Missing values and inaccurate values




                   Prof. Pier Luca Lanzi
Components of the input

• Concepts
    - Kinds of things that can be learned
• Instances
    - The individual, independent examples of a concept
• Attributes
    - Features that measure aspects of an instance




What’s in an example?

• Instance: specific type of example
    - Thing to be classified, associated, or clustered
    - Individual, independent example of target concept
    - Characterized by a predetermined set of attributes
• Input to learning scheme: set of instances (a dataset)
    - Represented as a single relation (flat file)
• Rather restricted form of input
    - No relationships between objects
• Most common form in practical data mining




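A minimal sketch of the "single relation / flat file" idea, assuming Python and using the first rows of the weather data from this lecture. The class name `WeatherInstance` and the variable names are illustrative assumptions, not part of the lecture. Every instance carries the same predetermined attributes, and the dataset is just a list of such instances with no links between them.

```python
from typing import NamedTuple


class WeatherInstance(NamedTuple):
    """One instance: a fixed, predetermined set of attributes."""
    outlook: str
    temperature: str
    humidity: str
    windy: bool
    play: str


# The dataset is a flat list of independent instances (one relation).
dataset = [
    WeatherInstance("sunny", "hot", "high", False, "no"),
    WeatherInstance("sunny", "hot", "high", True, "no"),
    WeatherInstance("overcast", "hot", "high", False, "yes"),
    WeatherInstance("rainy", "mild", "normal", False, "yes"),
]

# Every instance has exactly the same attribute names, in the same order.
attribute_names = WeatherInstance._fields
for instance in dataset:
    print(dict(zip(attribute_names, instance)))
```

The restriction mentioned on the slide shows up directly: there is no way for one row to refer to another row.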
What’s in an attribute?

• Each instance is described by a fixed, predefined set of features, its “attributes”
• But the number of attributes may vary in practice
    - Possible solution: “irrelevant value” flag
• Related problem: the existence of an attribute may depend on the value of another one
• Possible attribute types (“levels of measurement”):
    - Nominal, ordinal, interval, and ratio




The contact lenses data

Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   No           Reduced               None
Young           Myope                   No           Normal                Soft
Young           Myope                   Yes          Reduced               None
Young           Myope                   Yes          Normal                Hard
Young           Hypermetrope            No           Reduced               None
Young           Hypermetrope            No           Normal                Soft
Young           Hypermetrope            Yes          Reduced               None
Young           Hypermetrope            Yes          Normal                Hard
Pre-presbyopic  Myope                   No           Reduced               None
Pre-presbyopic  Myope                   No           Normal                Soft
Pre-presbyopic  Myope                   Yes          Reduced               None
Pre-presbyopic  Myope                   Yes          Normal                Hard
Pre-presbyopic  Hypermetrope            No           Reduced               None
Pre-presbyopic  Hypermetrope            No           Normal                Soft
Pre-presbyopic  Hypermetrope            Yes          Reduced               None
Pre-presbyopic  Hypermetrope            Yes          Normal                None
Presbyopic      Myope                   No           Reduced               None
Presbyopic      Myope                   No           Normal                None
Presbyopic      Myope                   Yes          Reduced               None
Presbyopic      Myope                   Yes          Normal                Hard
Presbyopic      Hypermetrope            No           Reduced               None
Presbyopic      Hypermetrope            No           Normal                Soft
Presbyopic      Hypermetrope            Yes          Reduced               None
Presbyopic      Hypermetrope            Yes          Normal                None

The weather problem

All attributes nominal:

Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Hot          High      False  Yes
Rainy     Mild         Normal    False  Yes
…         …            …         …      …

Temperature and humidity numeric:

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
…         …            …         …      …


Nominal quantities

• Values are distinct symbols
    - Values themselves serve only as labels or names
• Example
    - Attribute “outlook” from weather data
    - Values: “sunny”, “overcast”, and “rainy”
• No relation is implied among nominal values
    - No ordering
    - No distance measure
• Only equality tests can be performed




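One way to see "only equality tests can be performed" in code is a plain Python `Enum`, which supports `==` but deliberately does not define `<`. This is a sketch under that assumption; the names `Outlook` and `supports_ordering` are made up for illustration.

```python
from enum import Enum


class Outlook(Enum):
    """Nominal attribute: the values are labels and nothing more."""
    SUNNY = "sunny"
    OVERCAST = "overcast"
    RAINY = "rainy"


def supports_ordering(x, y) -> bool:
    """Return True if `x < y` is defined for these two values."""
    try:
        x < y
        return True
    except TypeError:
        return False


# Equality is meaningful for nominal values.
print(Outlook.SUNNY == Outlook.SUNNY)   # same label
print(Outlook.SUNNY == Outlook.RAINY)   # different labels

# Ordering is not: a plain Enum raises TypeError on `<`.
print(supports_ordering(Outlook.SUNNY, Outlook.RAINY))
```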
Ordinal quantities

• Impose order on values
• No distance between values defined
• Example:
    - The attribute “temperature” in weather data
    - Values: “hot” > “mild” > “cool”
• Addition and subtraction don’t make sense
• Example rule:
    - temperature < hot ⇒ play = yes
• Distinction between nominal and ordinal not always clear (e.g. attribute “outlook”)




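The ordinal case can be sketched by mapping each label to its rank and comparing ranks only. The helper names (`RANK`, `less_than`, `predict_play`) are assumptions for illustration; the rule is the example rule from the slide. Note that the ranks are never subtracted, because the scale defines no distance.

```python
# Declared order of the ordinal attribute, from lowest to highest.
TEMPERATURE_ORDER = ["cool", "mild", "hot"]
RANK = {value: i for i, value in enumerate(TEMPERATURE_ORDER)}


def less_than(a: str, b: str) -> bool:
    """Ordinal comparison: compare positions in the declared order."""
    return RANK[a] < RANK[b]


def predict_play(temperature: str):
    """Example rule: temperature < hot => play = yes.

    Returns None when the rule does not fire.
    """
    return "yes" if less_than(temperature, "hot") else None


for t in TEMPERATURE_ORDER:
    print(t, "->", predict_play(t))
```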
Interval quantities

• Interval quantities are not only ordered but also measured in fixed and equal units
• Examples
    - Attribute “temperature” expressed in degrees
    - Attribute “year”
• Difference of two values makes sense
• Sum or product doesn’t make sense
• Zero point is arbitrary (not inherently defined)




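A small numeric illustration of why ratios are not meaningful on an interval scale, assuming Python and two arbitrary Fahrenheit readings chosen for the example. The ratio of the two readings changes when the same temperatures are expressed in Celsius, whereas equal differences remain equal.

```python
import math


def fahrenheit_to_celsius(f: float) -> float:
    """Standard Fahrenheit-to-Celsius conversion."""
    return (f - 32.0) * 5.0 / 9.0


a_f, b_f = 80.0, 40.0                       # illustrative readings
a_c, b_c = fahrenheit_to_celsius(a_f), fahrenheit_to_celsius(b_f)

# The "ratio" depends on where the unit places its zero.
ratio_f = a_f / b_f
ratio_c = a_c / b_c
print(ratio_f, ratio_c)

# Differences behave well: equal gaps in one unit are equal in the other.
gap_1 = fahrenheit_to_celsius(85.0) - fahrenheit_to_celsius(80.0)
gap_2 = fahrenheit_to_celsius(80.0) - fahrenheit_to_celsius(75.0)
print(math.isclose(gap_1, gap_2))
```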
Ratio quantities

• Ratio quantities are ones for which the measurement scheme defines a zero point
• Example
    - Attribute “distance”
    - Distance between an object and itself is zero
• Ratio quantities are treated as real numbers
• All mathematical operations are allowed
• Is there an “inherently” defined zero point?
    - It depends on scientific knowledge
    - E.g. Fahrenheit knew no lower limit to temperature




Attribute types used in practice

• Most schemes accommodate just two levels of measurement: nominal and ordinal
• Nominal attributes are also called “categorical”, “enumerated”, or “discrete”
    - But: “enumerated” and “discrete” imply order
• Special case: dichotomy (“boolean” attribute)
• Ordinal attributes are called “numeric” or “continuous”
    - But: “continuous” implies mathematical continuity




Why specify attribute types?

• Why do machine learning algorithms need to know the attribute types?
• To be able to make the right comparisons and learn correct concepts
• Example
    - Outlook > “sunny” does not make sense, while
    - Temperature > “cool” or
    - Humidity > 70 does
• Additional uses of attribute type
    - Check for valid values
    - Deal with missing values, etc.




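The "check for valid values" use of attribute types can be sketched as a lookup against a declared schema. The `SCHEMA` dictionary and the `is_valid` helper below are hypothetical; they only mirror the weather attributes used in this lecture.

```python
# Declared attribute types: a set of labels for nominal attributes,
# the string "numeric" for numeric ones.
SCHEMA = {
    "outlook": {"sunny", "overcast", "rainy"},
    "humidity": "numeric",
}


def is_valid(attribute: str, value) -> bool:
    """Check a value against the declared type of its attribute."""
    declared = SCHEMA[attribute]
    if declared == "numeric":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return value in declared


print(is_valid("outlook", "sunny"))    # declared label
print(is_valid("outlook", "snowy"))    # not a declared label
print(is_valid("humidity", 70))        # a number
print(is_valid("humidity", "high"))    # a label where a number is expected
```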
Nominal vs. ordinal

• Attribute “age” nominal

     If age = young and astigmatic = no
     and tear production rate = normal
     then recommendation = soft

     If age = pre-presbyopic and astigmatic = no
     and tear production rate = normal
     then recommendation = soft

• Attribute “age” ordinal (e.g. “young” < “pre-presbyopic” < “presbyopic”)

     If age ≤ pre-presbyopic and astigmatic = no
     and tear production rate = normal
     then recommendation = soft


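The equivalence between the two nominal rules and the single ordinal rule can be checked mechanically. The sketch below assumes Python and uses the six contact lenses rows with astigmatism = no and tear production rate = normal; function and variable names are illustrative.

```python
AGE_ORDER = ["young", "pre-presbyopic", "presbyopic"]
AGE_RANK = {age: i for i, age in enumerate(AGE_ORDER)}

# (age, prescription, astigmatic, tear production rate, recommended lenses)
ROWS = [
    ("young", "myope", "no", "normal", "soft"),
    ("young", "hypermetrope", "no", "normal", "soft"),
    ("pre-presbyopic", "myope", "no", "normal", "soft"),
    ("pre-presbyopic", "hypermetrope", "no", "normal", "soft"),
    ("presbyopic", "myope", "no", "normal", "none"),
    ("presbyopic", "hypermetrope", "no", "normal", "soft"),
]


def nominal_rules(age, astigmatic, tear):
    """Two rules are needed when age can only be tested for equality."""
    if age == "young" and astigmatic == "no" and tear == "normal":
        return "soft"
    if age == "pre-presbyopic" and astigmatic == "no" and tear == "normal":
        return "soft"
    return None


def ordinal_rule(age, astigmatic, tear):
    """One rule suffices when age values can be compared with <=."""
    if (AGE_RANK[age] <= AGE_RANK["pre-presbyopic"]
            and astigmatic == "no" and tear == "normal"):
        return "soft"
    return None


# Both formulations make the same prediction on every row.
agree = all(
    nominal_rules(age, ast, tear) == ordinal_rule(age, ast, tear)
    for age, _, ast, tear, _ in ROWS
)
print(agree)
```

Treating age as ordinal does not change what is predicted here; it only lets the learner state the same concept more compactly.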
Missing values

• Frequently indicated by out-of-range entries
• Types: unknown, unrecorded, irrelevant
• Reasons:
    - Malfunctioning equipment
    - Changes in experimental design
    - Collation of different datasets
    - Measurement not possible
• Missing value may have significance in itself
    - E.g. missing test in a medical examination
• Most schemes assume that is not the case
    - “missing” may need to be coded as an additional value
• Does absence of value have some significance?
    - If it does, “missing” is a separate value
    - If it does not, “missing” must be treated in a special way



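The two treatments of a missing value can be sketched as follows, assuming Python. The sentinel `-1`, the sample readings, and the `test_result` attribute are hypothetical. When absence carries no meaning, the out-of-range code is turned into an explicit "no value" marker and skipped; when absence is itself informative, it is coded as one more nominal value.

```python
from statistics import mean

MISSING_SENTINEL = -1   # hypothetical out-of-range code used by the data source

# Case 1: absence is not significant. Make it explicit, then handle it specially.
raw_humidity = [85, 90, MISSING_SENTINEL, 80]
humidity = [None if v == MISSING_SENTINEL else v for v in raw_humidity]
observed = [v for v in humidity if v is not None]
mean_humidity = mean(observed)          # computed over observed values only
print(humidity, mean_humidity)

# Case 2: absence is significant (e.g. a medical test that was not ordered).
# Code "missing" as an additional value of the nominal attribute.
raw_test = ["positive", None, "negative"]
test_result = ["missing" if v is None else v for v in raw_test]
print(test_result)
```

Leaving the sentinel in place would silently pull the mean down, which is one reason out-of-range codes are worth converting early.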
Inaccurate values

• What is the reason?
    - Data was not collected with mining in mind
• What is the result?
    - Errors and omissions that don’t affect the original purpose of the data (e.g. age of customer)
• Typographical errors in nominal attributes, thus values need to be checked for consistency
• Typographical and measurement errors in numeric attributes, thus outliers need to be identified
• Errors may be deliberate (e.g. wrong zip codes)
• Other problems: duplicates, stale data




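Both checks named above can be sketched in a few lines of Python. The lecture does not prescribe a method, so the choices here are assumptions: nominal values are compared against a declared label set, and numeric outliers are flagged with a median-absolute-deviation rule with an arbitrary threshold `k`. The misspelled labels and the reading `860` are invented for the example.

```python
from statistics import median

ALLOWED_OUTLOOK = {"sunny", "overcast", "rainy"}


def inconsistent_values(values, allowed):
    """Return the distinct values that are not declared labels."""
    return sorted({v for v in values if v not in allowed})


def mad_outliers(values, k=5.0):
    """Flag values farther than k median absolute deviations from the median.

    If the MAD is zero, every value different from the median is flagged.
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return [v for v in values if abs(v - center) > k * mad]


# Typographical errors in a nominal attribute.
print(inconsistent_values(["sunny", "Sunny", "overcast", "rainny"], ALLOWED_OUTLOOK))

# A typing slip (860 instead of 86) in a numeric attribute.
print(mad_outliers([85, 90, 86, 80, 860]))
```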
Getting to know the data

• Simple visualization tools are very useful
    - Nominal attributes: histograms
    - Numeric attributes: graphs
• 2-D and 3-D plots show dependencies
• Need to consult domain experts
• When there is too much data to inspect, take a sample!




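A first look at the data does not need a plotting library. The sketch below, assuming Python's standard library only, prints a text histogram for a nominal attribute and draws a random sample from a larger collection. The population of 1000 integers stands in for "too much data" and is purely illustrative, as is the fixed seed.

```python
import random
from collections import Counter

# Histogram of a nominal attribute (outlook values of the four weather rows).
outlook = ["sunny", "sunny", "overcast", "rainy"]
counts = Counter(outlook)
for value, count in counts.most_common():
    print(f"{value:<10}{'#' * count}")

# Too much data to inspect: take a sample without replacement.
rng = random.Random(0)                  # fixed seed so the sample is repeatable
population = list(range(1000))          # stand-in for a large dataset
sample = rng.sample(population, k=10)
print(sample)
```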
Example: the ARFF format




  %
  % ARFF file for weather data with some numeric features
  %
  @relation weather

  @attribute   outlook {sunny, overcast, rainy}
  @attribute   temperature numeric
  @attribute   humidity numeric
  @attribute   windy {true, false}
  @attribute   play? {yes, no}

  @data
  sunny, 85, 85, false, no
  sunny, 80, 90, true, no
  overcast, 83, 86, false, yes
  ...




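To make the file layout concrete, here is a deliberately simplified reader for the example above, written in Python. It is a sketch, not a full ARFF parser: it assumes unquoted names and values, one instance per line, and only nominal and numeric attributes; the function name `parse_arff` is made up for illustration.

```python
ARFF_TEXT = """\
% ARFF file for weather data with some numeric features
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}

@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
"""


def parse_arff(text):
    """Return (attributes, rows) for a simple nominal/numeric ARFF file."""
    attributes = []   # (name, kind); kind is "numeric" or a list of labels
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):      # skip blanks and comments
            continue
        lower = line.lower()
        if in_data:
            cells = [c.strip() for c in line.split(",")]
            row = {}
            for (name, kind), cell in zip(attributes, cells):
                row[name] = float(cell) if kind == "numeric" else cell
            rows.append(row)
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return attributes, rows


attributes, rows = parse_arff(ARFF_TEXT)
print(attributes[0])
print(rows[0])
```

The header carries exactly the type information discussed earlier: a brace-enclosed label list declares a nominal attribute, and `numeric` declares a numeric one.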
ARFF: Additional attribute types

• ARFF supports string attributes:

      @attribute description string

    - Similar to nominal attributes, but the list of values is not pre-specified

• ARFF also supports date attributes:

      @attribute today date

    - Uses the ISO-8601 combined date and time format yyyy-MM-dd'T'HH:mm:ss




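As a rough illustration of the date format, the Python snippet below parses one value written in the ISO-8601 combined date and time layout (a literal `T` between date and time). The sample value is invented, and `%Y-%m-%dT%H:%M:%S` is the Python `strptime` equivalent of that pattern, not ARFF syntax.

```python
from datetime import datetime

ISO_COMBINED = "%Y-%m-%dT%H:%M:%S"      # Python spelling of the ISO-8601 layout

value = "2004-03-15T09:30:00"           # hypothetical date attribute value
parsed = datetime.strptime(value, ISO_COMBINED)
print(parsed.year, parsed.month, parsed.day, parsed.hour, parsed.minute)

# Once parsed, dates behave like an interval quantity: differences make sense.
later = datetime.strptime("2004-03-16T09:30:00", ISO_COMBINED)
print((later - parsed).days)
```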
ARFF: Attribute Types & Interpretation

• Interpretation of attribute types in ARFF depends on the mining scheme
• Numeric attributes are interpreted as
    - Ordinal scales if less-than and greater-than are used
    - Ratio scales if distance calculations are performed (normalization/standardization may be required)
• Instance-based schemes define distance between nominal values (0 if values are equal, 1 otherwise)
• Integers in some given data file: nominal, ordinal, or ratio scale?




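A sketch of how an instance-based scheme might combine the two interpretations, assuming Python, min-max normalization, and a Euclidean combination of per-attribute differences. These choices and the names `min_max_normalize` and `mixed_distance` are assumptions; the 0/1 distance for nominal values is the one stated above. The attribute ranges are taken from the four weather rows shown earlier in the lecture.

```python
import math


def min_max_normalize(value, lo, hi):
    """Rescale a numeric value to [0, 1] given the attribute's range."""
    return (value - lo) / (hi - lo) if hi > lo else 0.0


def mixed_distance(a, b, numeric_ranges):
    """Euclidean-style distance over numeric and nominal attributes."""
    total = 0.0
    for name in a:
        if name in numeric_ranges:
            lo, hi = numeric_ranges[name]
            d = (min_max_normalize(a[name], lo, hi)
                 - min_max_normalize(b[name], lo, hi))
        else:
            d = 0.0 if a[name] == b[name] else 1.0   # nominal: equal or not
        total += d * d
    return math.sqrt(total)


RANGES = {"temperature": (75, 85), "humidity": (80, 90)}
x = {"outlook": "sunny", "temperature": 85, "humidity": 85}
y = {"outlook": "overcast", "temperature": 83, "humidity": 86}

print(mixed_distance(x, y, RANGES))
print(mixed_distance(x, x, RANGES))
```

Without the normalization step, an attribute measured in large units would dominate the sum, which is the reason the slide notes that normalization or standardization may be required.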
