SlideShare une entreprise Scribd logo
1  sur  39
Data for Data Mining
What is Data? Data is a Collection of data objects and their attributes An attribute is a property or characteristic of an object Ex:        eye color of a person,temperature etc. A collection of attributes describe an object Object is also known as record, point, case, sample, entity, or instance
Types of Attributes  There are different types of attributes Nominal Examples: ID numbers, eye color, zip codes Ordinal Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short} Interval Examples: calendar dates, temperatures in Celsius or Fahrenheit. Ratio Examples: temperature in Kelvin, length, time, counts
Attribute Type Description Examples Operations Nominal nominal attributes provide only enough information to distinguish one object from another. (=, ) zip codes, employee ID numbers, eye color, sex: {male, female} mode, entropy, contingency correlation, 2 test Ordinal The values of an ordinal attribute provide enough information to order objects. (<, >) hardness of minerals, {good, better, best}, grades, street numbers median, percentiles, rankcorrelation run & sign tests Interval For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists.  (+, - ) calendar dates, temperature in Celsius or Fahrenheit mean, standard deviation, Pearson's correlation, t and F tests Ratio For ratio variables, both differences and ratios are meaningful. (*, /) temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current geometric mean, harmonic mean, percent variation
Discrete  Attribute Has only a finite or countable infinite set of values Examples: zip codes, counts, or the set of words in a collection of documents  Often represented as integer variables.    Note: binary attributes are a special case of discrete attributes
Continuous Attribute 	Has real numbers as attribute values Examples: temperature, height, or weight.   Practically, real values can only be measured and represented using a finite number of digits. Continuous attributes are typically represented as floating-point variables.
Types of data sets  Record Data Matrix Document Data Transaction Data Graph World Wide Web Molecular Structures
Contd… Ordered Spatial Data Temporal Data Sequential Data Genetic Sequence Data
Important Characteristics of Structured Data Dimensionality  Curse of Dimensionality Sparsity  Only presence counts Resolution  Patterns depend on the scale
Data Quality  What kinds of data quality problems? How can we detect problems with the data?  What can we do about these problems?
Examples Examples of data quality problems:  Noise and outliers  missing values  duplicate data
Data Preprocessing Aggregation Sampling Dimensionality Reduction Feature subset selection Feature creation Discretization and Binarization Attribute Transformation
Aggregation Combining two or more attributes (or objects) into a single attribute (or object) Purpose Data reduction  Reduce the number of attributes or objects Change of scale  Cities aggregated into regions, states, countries, etc More “stable” data  Aggregated data tends to have less variability
Sampling Sampling is the main technique employed for data selection. It is often used for both the preliminary investigation of the data and the final data analysis. Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming. Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
Types of Sampling Simple Random Sampling There is an equal probability of selecting any particular item Sampling without replacement As each item is selected, it is removed from the population Sampling with replacement Objects are not removed from the population as they are selected for the sample.      In sampling with replacement, the same object can be picked up more than once Stratified sampling Split the data into several partitions; then draw random samples from each partition
Dimensionality Reduction Purpose: Avoid curse of dimensionality Reduce amount of time and memory required by data mining algorithms Allow data to be more easily visualized May help to eliminate irrelevant features or reduce noise
Techniques Techniques Principle Component Analysis Singular Value Decomposition Others: supervised and non-linear techniques
Feature Subset Selection Another way to reduce dimensionality of data Redundant features  duplicate much or all of the information contained in one or more other attributes Example: purchase price of a product and the amount of sales tax paid Irrelevant features contain no information that is useful for the data mining task at hand Example: students' ID is often irrelevant to the task of predicting students' GPA
Techniques Brute-force approach: Try all possible feature subsets as input to data mining algorithm Embedded approaches:  Feature selection occurs naturally as part of the data mining algorithm Filter approaches:  Features are selected before data mining algorithm is run Wrapper approaches:  Use the data mining algorithm as a black box to find best subset of attributes
Feature Creation Create new attributes that can capture the important information in a data set much more efficiently than the original attributes Three general methodologies: Feature Extraction  domain-specific Mapping Data to New Space Feature Construction  combining features
Discretization and Binarization Transforming  a  continuous  attribute into a categorical attribute  is  know as  Discretization. Transforming  both continuous and  discrete attributes into one  o r more  binary  attribute  is  known  as  Binarization.
Techniques original data Equal width discretization Equal frequency discretization K-Means Discretization
Attribute Transformation A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values Simple functions: xk, log(x), ex, |x| Standardization and Normalization
Similarity and Dissimilarity Similarity Numerical measure of how alike two data objects are. Is higher when objects are more alike. Often falls in the range [0,1] Dissimilarity Numerical measure of how different are two data objects Lower when objects are more alike Minimum dissimilarity is often 0 Upper limit varies
Similarity/Dissimilarity for Simple Attributes p and q are the attribute values for two data objects.
Dissimilarities between Data Objects DISTANCES Euclidean Distance Minkowski  Distance
Eulidean Distance Where n is the number of dimensions (attributes) and pk and qk are, respectively, the kth attributes (components) or data objects p and q. Standardization is necessary, if scales differ.
Minkowski Distance Where r is a parameter, n is the number of dimensions (attributes) and pk and qk are, respectively, the kth attributes (components) or data objects p and q.
Similarities between Data Objects Similarities, also have some well known properties. s(p, q) = 1 (or maximum similarity) only if p= q.  s(p, q) = s(q, p)   for all p and q. (Symmetry) 	where s(p, q) is the similarity between points (data objects), p and q.
Similarity Between Binary Vectors Common situation is that objects, p and q, have only binary attributes Compute similarities using the following quantities 	M01= the number of attributes where p was 0 and q was 1 	M10 = the number of attributes where p was 1 and q was 0 	M00= the number of attributes where p was 0 and q was 0 	M11= the number of attributes where p was 1 and q was 1
Contd…. Simple Matching and Jaccard Coefficients  	SMC =  number of matches / number of attributes           		 =  (M11 + M00) / (M01 + M10 + M11 + M00) 	J = number of 11 matches / number of not-both-zero attributes values    	   = (M11) / (M01 + M10 + M11)
Cosine Similarity If d1 and d2 are two document vectors, then cos( d1, d2 ) =  (d1d2) / ||d1|| ||d2|| ,     where  indicates vector dot product and || d || is  the   length of vector d.
Example d1=  3 2 0 5 0 0 0 2 0 0 	    	d2 =  1 0 0 0 0 0 0 1 0 2     d1d2=  3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5    ||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)0.5 =  (42) 0.5 = 6.481     ||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)0.5= (6) 0.5 = 2.245 cos( d1, d2 ) = .3150
Extended Jaccard Coefficient (Tanimoto) Variation of Jaccard for continuous or count attributes Reduces to Jaccard for binary attributes
Correlation Correlation measures the linear relationship between objects To compute correlation, we standardize data objects, p and q, and then take their dot product
General Approach for Combining Similarities Sometimes attributes are of many different types, but an overall similarity is needed.
Contd….
Using Weights to Combine Similarities         May not want to treat all attributes the same. Use weights wk which are between 0 and 1 and sum to 1.
Visit more self help tutorials Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free, self-guiding and will not involve any additional support. Visit us at www.dataminingtools.net

Contenu connexe

Tendances

Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecture
pcherukumalla
 

Tendances (20)

Data preprocessing PPT
Data preprocessing PPTData preprocessing PPT
Data preprocessing PPT
 
Data Preprocessing || Data Mining
Data Preprocessing || Data MiningData Preprocessing || Data Mining
Data Preprocessing || Data Mining
 
Data mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, dataData mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, data
 
Data Mining: Data Preprocessing
Data Mining: Data PreprocessingData Mining: Data Preprocessing
Data Mining: Data Preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data Reduction
Data ReductionData Reduction
Data Reduction
 
basic structure of SQL FINAL.pptx
basic structure of SQL FINAL.pptxbasic structure of SQL FINAL.pptx
basic structure of SQL FINAL.pptx
 
Data Mining: Concepts and Techniques — Chapter 2 —
Data Mining:  Concepts and Techniques — Chapter 2 —Data Mining:  Concepts and Techniques — Chapter 2 —
Data Mining: Concepts and Techniques — Chapter 2 —
 
Data cubes
Data cubesData cubes
Data cubes
 
Data mining primitives
Data mining primitivesData mining primitives
Data mining primitives
 
Data PreProcessing
Data PreProcessingData PreProcessing
Data PreProcessing
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
 
Data Integration and Transformation in Data mining
Data Integration and Transformation in Data miningData Integration and Transformation in Data mining
Data Integration and Transformation in Data mining
 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecture
 
Data pre processing
Data pre processingData pre processing
Data pre processing
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data Mining
 
2.4 rule based classification
2.4 rule based classification2.4 rule based classification
2.4 rule based classification
 
2.1 Data Mining-classification Basic concepts
2.1 Data Mining-classification Basic concepts2.1 Data Mining-classification Basic concepts
2.1 Data Mining-classification Basic concepts
 
3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
 
Linear discriminant analysis
Linear discriminant analysisLinear discriminant analysis
Linear discriminant analysis
 

En vedette

Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniques
Saif Ullah
 
Top10 algorithms data mining
Top10 algorithms data miningTop10 algorithms data mining
Top10 algorithms data mining
Asad Ahamad
 
Public Transportation
Public TransportationPublic Transportation
Public Transportation
dpapageorge
 
Ireland Apo University Fy 10 Tibbs Slideshare
Ireland Apo University Fy 10 Tibbs SlideshareIreland Apo University Fy 10 Tibbs Slideshare
Ireland Apo University Fy 10 Tibbs Slideshare
Tibbs Pereira
 

En vedette (20)

Data mining
Data miningData mining
Data mining
 
WEKA: Data Mining Input Concepts Instances And Attributes
WEKA: Data Mining Input Concepts Instances And AttributesWEKA: Data Mining Input Concepts Instances And Attributes
WEKA: Data Mining Input Concepts Instances And Attributes
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Concepts
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniques
 
Dbm630 lecture09
Dbm630 lecture09Dbm630 lecture09
Dbm630 lecture09
 
Data discretization
Data discretizationData discretization
Data discretization
 
Top10 algorithms data mining
Top10 algorithms data miningTop10 algorithms data mining
Top10 algorithms data mining
 
Feature selection
Feature selectionFeature selection
Feature selection
 
Ontwikkeling In Eigen Handen Nl Web
Ontwikkeling In Eigen Handen Nl WebOntwikkeling In Eigen Handen Nl Web
Ontwikkeling In Eigen Handen Nl Web
 
Simulation
SimulationSimulation
Simulation
 
Public Transportation
Public TransportationPublic Transportation
Public Transportation
 
LISP:Loops In Lisp
LISP:Loops In LispLISP:Loops In Lisp
LISP:Loops In Lisp
 
Apresentação Red Advisers
Apresentação Red AdvisersApresentação Red Advisers
Apresentação Red Advisers
 
DataKraft - Powerful No-Coding Platform for Business Applications
DataKraft - Powerful No-Coding Platform for Business ApplicationsDataKraft - Powerful No-Coding Platform for Business Applications
DataKraft - Powerful No-Coding Platform for Business Applications
 
Welcome
WelcomeWelcome
Welcome
 
Norihicodanch
NorihicodanchNorihicodanch
Norihicodanch
 
Mphone
MphoneMphone
Mphone
 
R Datatypes
R DatatypesR Datatypes
R Datatypes
 
Ireland Apo University Fy 10 Tibbs Slideshare
Ireland Apo University Fy 10 Tibbs SlideshareIreland Apo University Fy 10 Tibbs Slideshare
Ireland Apo University Fy 10 Tibbs Slideshare
 
Data Applied: Developer Quicklook
Data Applied: Developer QuicklookData Applied: Developer Quicklook
Data Applied: Developer Quicklook
 

Similaire à Data For Datamining

Data Mining DataLecture Notes for Chapter 2Introduc
Data Mining DataLecture Notes for Chapter 2IntroducData Mining DataLecture Notes for Chapter 2Introduc
Data Mining DataLecture Notes for Chapter 2Introduc
OllieShoresna
 
Data Mining DataLecture Notes for Chapter 2Introduc.docx
Data Mining DataLecture Notes for Chapter 2Introduc.docxData Mining DataLecture Notes for Chapter 2Introduc.docx
Data Mining DataLecture Notes for Chapter 2Introduc.docx
whittemorelucilla
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
Tony Nguyen
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
James Wong
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
Fraboni Ec
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
saranya12345
 

Similaire à Data For Datamining (20)

Data Mining DataLecture Notes for Chapter 2Introduc
Data Mining DataLecture Notes for Chapter 2IntroducData Mining DataLecture Notes for Chapter 2Introduc
Data Mining DataLecture Notes for Chapter 2Introduc
 
Data Mining DataLecture Notes for Chapter 2Introduc.docx
Data Mining DataLecture Notes for Chapter 2Introduc.docxData Mining DataLecture Notes for Chapter 2Introduc.docx
Data Mining DataLecture Notes for Chapter 2Introduc.docx
 
Data mining Measuring similarity and desimilarity
Data mining Measuring similarity and desimilarityData mining Measuring similarity and desimilarity
Data mining Measuring similarity and desimilarity
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
1.6.data preprocessing
1.6.data preprocessing1.6.data preprocessing
1.6.data preprocessing
 
Datapreprocessing
DatapreprocessingDatapreprocessing
Datapreprocessing
 
Lect 2 getting to know your data
Lect 2 getting to know your dataLect 2 getting to know your data
Lect 2 getting to know your data
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Wk. 3. Data [12-05-2021] (2).ppt
Wk. 3.  Data [12-05-2021] (2).pptWk. 3.  Data [12-05-2021] (2).ppt
Wk. 3. Data [12-05-2021] (2).ppt
 
Lect4
Lect4Lect4
Lect4
 
Chapter 3. Data Preprocessing.ppt
Chapter 3. Data Preprocessing.pptChapter 3. Data Preprocessing.ppt
Chapter 3. Data Preprocessing.ppt
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
 

Plus de DataminingTools Inc

Plus de DataminingTools Inc (20)

Terminology Machine Learning
Terminology Machine LearningTerminology Machine Learning
Terminology Machine Learning
 
Techniques Machine Learning
Techniques Machine LearningTechniques Machine Learning
Techniques Machine Learning
 
Machine learning Introduction
Machine learning IntroductionMachine learning Introduction
Machine learning Introduction
 
Areas of machine leanring
Areas of machine leanringAreas of machine leanring
Areas of machine leanring
 
AI: Planning and AI
AI: Planning and AIAI: Planning and AI
AI: Planning and AI
 
AI: Logic in AI 2
AI: Logic in AI 2AI: Logic in AI 2
AI: Logic in AI 2
 
AI: Logic in AI
AI: Logic in AIAI: Logic in AI
AI: Logic in AI
 
AI: Learning in AI 2
AI: Learning in AI 2AI: Learning in AI 2
AI: Learning in AI 2
 
AI: Learning in AI
AI: Learning in AI AI: Learning in AI
AI: Learning in AI
 
AI: Introduction to artificial intelligence
AI: Introduction to artificial intelligenceAI: Introduction to artificial intelligence
AI: Introduction to artificial intelligence
 
AI: Belief Networks
AI: Belief NetworksAI: Belief Networks
AI: Belief Networks
 
AI: AI & Searching
AI: AI & SearchingAI: AI & Searching
AI: AI & Searching
 
AI: AI & Problem Solving
AI: AI & Problem SolvingAI: AI & Problem Solving
AI: AI & Problem Solving
 
Data Mining: Text and web mining
Data Mining: Text and web miningData Mining: Text and web mining
Data Mining: Text and web mining
 
Data Mining: Outlier analysis
Data Mining: Outlier analysisData Mining: Outlier analysis
Data Mining: Outlier analysis
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence data
 
Data Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlationsData Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlations
 
Data Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisData Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysis
 
Data warehouse and olap technology
Data warehouse and olap technologyData warehouse and olap technology
Data warehouse and olap technology
 
Data Mining: Data processing
Data Mining: Data processingData Mining: Data processing
Data Mining: Data processing
 

Dernier

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Dernier (20)

Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

Data For Datamining

  • 1. Data for Data Mining
  • 2. What is Data? Data is a Collection of data objects and their attributes An attribute is a property or characteristic of an object Ex: eye color of a person,temperature etc. A collection of attributes describe an object Object is also known as record, point, case, sample, entity, or instance
  • 3. Types of Attributes There are different types of attributes Nominal Examples: ID numbers, eye color, zip codes Ordinal Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short} Interval Examples: calendar dates, temperatures in Celsius or Fahrenheit. Ratio Examples: temperature in Kelvin, length, time, counts
  • 4. Attribute Type Description Examples Operations Nominal nominal attributes provide only enough information to distinguish one object from another. (=, ) zip codes, employee ID numbers, eye color, sex: {male, female} mode, entropy, contingency correlation, 2 test Ordinal The values of an ordinal attribute provide enough information to order objects. (<, >) hardness of minerals, {good, better, best}, grades, street numbers median, percentiles, rankcorrelation run & sign tests Interval For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, - ) calendar dates, temperature in Celsius or Fahrenheit mean, standard deviation, Pearson's correlation, t and F tests Ratio For ratio variables, both differences and ratios are meaningful. (*, /) temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current geometric mean, harmonic mean, percent variation
  • 5. Discrete Attribute Has only a finite or countable infinite set of values Examples: zip codes, counts, or the set of words in a collection of documents Often represented as integer variables. Note: binary attributes are a special case of discrete attributes
  • 6. Continuous Attribute Has real numbers as attribute values Examples: temperature, height, or weight. Practically, real values can only be measured and represented using a finite number of digits. Continuous attributes are typically represented as floating-point variables.
  • 7. Types of data sets Record Data Matrix Document Data Transaction Data Graph World Wide Web Molecular Structures
  • 8. Contd… Ordered Spatial Data Temporal Data Sequential Data Genetic Sequence Data
  • 9. Important Characteristics of Structured Data Dimensionality Curse of Dimensionality Sparsity Only presence counts Resolution Patterns depend on the scale
  • 10. Data Quality What kinds of data quality problems? How can we detect problems with the data? What can we do about these problems?
  • 11. Examples Examples of data quality problems: Noise and outliers missing values duplicate data
  • 12. Data Preprocessing Aggregation Sampling Dimensionality Reduction Feature subset selection Feature creation Discretization and Binarization Attribute Transformation
  • 13. Aggregation Combining two or more attributes (or objects) into a single attribute (or object) Purpose Data reduction Reduce the number of attributes or objects Change of scale Cities aggregated into regions, states, countries, etc More “stable” data Aggregated data tends to have less variability
  • 14. Sampling Sampling is the main technique employed for data selection. It is often used for both the preliminary investigation of the data and the final data analysis. Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming. Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
  • 15. Types of Sampling Simple Random Sampling There is an equal probability of selecting any particular item Sampling without replacement As each item is selected, it is removed from the population Sampling with replacement Objects are not removed from the population as they are selected for the sample. In sampling with replacement, the same object can be picked up more than once Stratified sampling Split the data into several partitions; then draw random samples from each partition
  • 16. Dimensionality Reduction Purpose: Avoid curse of dimensionality Reduce amount of time and memory required by data mining algorithms Allow data to be more easily visualized May help to eliminate irrelevant features or reduce noise
  • 17. Techniques Techniques Principle Component Analysis Singular Value Decomposition Others: supervised and non-linear techniques
  • 18. Feature Subset Selection Another way to reduce dimensionality of data Redundant features duplicate much or all of the information contained in one or more other attributes Example: purchase price of a product and the amount of sales tax paid Irrelevant features contain no information that is useful for the data mining task at hand Example: students' ID is often irrelevant to the task of predicting students' GPA
  • 19. Techniques Brute-force approach: Try all possible feature subsets as input to data mining algorithm Embedded approaches: Feature selection occurs naturally as part of the data mining algorithm Filter approaches: Features are selected before data mining algorithm is run Wrapper approaches: Use the data mining algorithm as a black box to find best subset of attributes
  • 20. Feature Creation Create new attributes that can capture the important information in a data set much more efficiently than the original attributes Three general methodologies: Feature Extraction domain-specific Mapping Data to New Space Feature Construction combining features
  • 21. Discretization and Binarization Transforming a continuous attribute into a categorical attribute is know as Discretization. Transforming both continuous and discrete attributes into one o r more binary attribute is known as Binarization.
  • 22. Techniques original data Equal width discretization Equal frequency discretization K-Means Discretization
  • 23. Attribute Transformation A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values Simple functions: xk, log(x), ex, |x| Standardization and Normalization
  • 24. Similarity and Dissimilarity Similarity Numerical measure of how alike two data objects are. Is higher when objects are more alike. Often falls in the range [0,1] Dissimilarity Numerical measure of how different are two data objects Lower when objects are more alike Minimum dissimilarity is often 0 Upper limit varies
  • 25. Similarity/Dissimilarity for Simple Attributes p and q are the attribute values for two data objects.
  • 26. Dissimilarities between Data Objects DISTANCES Euclidean Distance Minkowski Distance
  • 27. Eulidean Distance Where n is the number of dimensions (attributes) and pk and qk are, respectively, the kth attributes (components) or data objects p and q. Standardization is necessary, if scales differ.
  • 28. Minkowski Distance Where r is a parameter, n is the number of dimensions (attributes) and pk and qk are, respectively, the kth attributes (components) or data objects p and q.
  • 29. Similarities between Data Objects Similarities, also have some well known properties. s(p, q) = 1 (or maximum similarity) only if p= q. s(p, q) = s(q, p) for all p and q. (Symmetry) where s(p, q) is the similarity between points (data objects), p and q.
  • 30. Similarity Between Binary Vectors Common situation is that objects, p and q, have only binary attributes Compute similarities using the following quantities M01= the number of attributes where p was 0 and q was 1 M10 = the number of attributes where p was 1 and q was 0 M00= the number of attributes where p was 0 and q was 0 M11= the number of attributes where p was 1 and q was 1
  • 31. Contd…. Simple Matching and Jaccard Coefficients SMC = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00) J = number of 11 matches / number of not-both-zero attributes values = (M11) / (M01 + M10 + M11)
  • 32. Cosine Similarity If d1 and d2 are two document vectors, then cos( d1, d2 ) = (d1d2) / ||d1|| ||d2|| , where  indicates vector dot product and || d || is the length of vector d.
  • 33. Example d1= 3 2 0 5 0 0 0 2 0 0 d2 = 1 0 0 0 0 0 0 1 0 2 d1d2= 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5 ||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)0.5 = (42) 0.5 = 6.481 ||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)0.5= (6) 0.5 = 2.245 cos( d1, d2 ) = .3150
  • 34. Extended Jaccard Coefficient (Tanimoto) Variation of Jaccard for continuous or count attributes Reduces to Jaccard for binary attributes
  • 35. Correlation Correlation measures the linear relationship between objects To compute correlation, we standardize data objects, p and q, and then take their dot product
  • 36. General Approach for Combining Similarities Sometimes attributes are of many different types, but an overall similarity is needed.
  • 38. Using Weights to Combine Similarities May not want to treat all attributes the same. Use weights wk which are between 0 and 1 and sum to 1.
  • 39. Visit more self help tutorials Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free, self-guiding and will not involve any additional support. Visit us at www.dataminingtools.net