Chapter 3 Similarity Measures Data Mining Technology
Chapter 3 Similarity Measures. Written by Kevin E. Heinrich. Presented by Zhao Xinyou, 2007.6.7. Some materials (examples) are taken from websites.
Searching Process (diagram labels: Input Text, Index, Query, Sorting, Show Text Result; Input Key Words, Search, Results)
Example
Similarity Measures PP27-28
Classic Similarity Measures PP28
Conversion PP28 Types: linear, non-linear. Linear example: [1, 10] -> [0, 1] with s' = (s - 1) / 9. Generally, we may use: s' = (s - min_s) / (max_s - min_s).
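The linear conversion above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code; the function name `rescale` and the guard against a zero-width range are assumptions added for the example.

```python
def rescale(s, min_s, max_s):
    """Linearly map a score s from [min_s, max_s] onto [0, 1]."""
    if max_s == min_s:
        # Assumption: a zero-width range is treated as an error.
        raise ValueError("min_s and max_s must differ")
    return (s - min_s) / (max_s - min_s)


# The slide's case: scores on [1, 10], so s' = (s - 1) / 9.
print(rescale(1, 1, 10))    # 0.0
print(rescale(5.5, 1, 10))  # 0.5
print(rescale(10, 1, 10))   # 1.0
```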
Vector-Space Model-VSM PP28-29
Example PP28-29 If similarity(q, d_i) > similarity(q, d_j), we may conclude that d_i is more relevant than d_j.
Simple Measure Technology PP29 Within the documents set, A is the retrieved set, B is the relevant set, and A∩B is the retrieved and relevant set. Precision = Returned Relevant Documents / Total Returned Documents: P(A,B) = |A∩B| / |A|. Recall = Returned Relevant Documents / Total Relevant Documents: R(A,B) = |A∩B| / |B|.
Example--Simple Measure Technology PP29 Documents set: relevant only: A, C, E, G, H, I, J; retrieved and relevant: B, D, F; retrieved only: W, Y. B = {relevant} = {A,B,C,D,E,F,G,H,I,J}, |B| = 10. A = {retrieved} = {B,D,F,W,Y}, |A| = 5. A∩B = {relevant} ∩ {retrieved} = {B,D,F}, |A∩B| = 3. P = precision = 3/5 = 60%. R = recall = 3/10 = 30%.
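The precision and recall example can be reproduced with Python sets. This is a small sketch under the slide's definitions; the function name `precision_recall` and the choice to return 0.0 for an empty set are assumptions made for the illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of document ids."""
    hits = retrieved & relevant  # retrieved and relevant
    # Assumption: an empty denominator yields 0.0 rather than an error.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


relevant = set("ABCDEFGHIJ")            # 10 relevant documents
retrieved = {"B", "D", "F", "W", "Y"}   # 5 retrieved documents
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.0%} recall={r:.0%}")  # precision=60% recall=30%
```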
Precision-Recall Graph-Curves PP29-30 (plots: One Query; Two Queries) Difficult to determine which of these two hypothetical results is better.
Similarity measures based on VSM PP30
Dice Coefficient-Cont’ PP30
Dice Coefficient-Cont’ PP30 α > 0.5: precision is more important. α < 0.5: recall is more important. Usually α = 0.5.
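The formula that goes with α is not legible on this slide, so the sketch below assumes the common weighted form D = |A∩B| / (α|A| + (1 - α)|B|), with A the retrieved set and B the relevant set. Under that assumption α = 1 reduces to precision, α = 0 reduces to recall, and α = 0.5 gives the usual Dice coefficient 2|A∩B| / (|A| + |B|), which agrees with the remarks about α. The function name `weighted_dice` is hypothetical.

```python
def weighted_dice(a, b, alpha=0.5):
    """Assumed weighted Dice: |A∩B| / (alpha*|A| + (1 - alpha)*|B|)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    denom = alpha * len(a) + (1 - alpha) * len(b)
    return len(a & b) / denom if denom else 0.0


retrieved = {"B", "D", "F", "W", "Y"}   # A, |A| = 5
relevant = set("ABCDEFGHIJ")            # B, |B| = 10
for alpha in (0.0, 0.5, 1.0):
    print(alpha, round(weighted_dice(retrieved, relevant, alpha), 3))
```

With these sets the three values are the recall (0.3), the plain Dice coefficient (0.4), and the precision (0.6).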
Overlap Coefficient PP30-31 (diagram: Documents Set; A = queries, B = documents)
Jaccard Coefficient-Cont’ PP31 (diagram: Documents Set; A = queries, B = documents)
Example- Jaccard Coefficient PP31
Cosine Coefficient-Cont’ PP31-32 Vectors: q = (q_1, q_2, …, q_n), d_1 = (d_11, d_12, …, d_1n), d_2 = (d_21, d_22, …, d_2n)
Example-Cosine Coefficient PP31-32 D1 = (2,3,5), D2 = (3,7,1), Q = (0,0,2). C(D1, Q) = 10 / √(38*4) = 0.81. C(D2, Q) = 2 / √(59*4) = 0.13.
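The two cosine values can be checked with a short Python sketch. It uses the standard definition (dot product divided by the product of the vector lengths); the function name `cosine` and the 0.0 result for a zero vector are assumptions for the example.

```python
import math


def cosine(u, v):
    """Cosine coefficient: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Assumption: similarity with a zero vector is reported as 0.0.
    return dot / norm if norm else 0.0


d1, d2, q = (2, 3, 5), (3, 7, 1), (0, 0, 2)
print(round(cosine(d1, q), 2))  # 0.81
print(round(cosine(d2, q), 2))  # 0.13
```

Since C(D1, Q) > C(D2, Q), D1 would be ranked as more relevant to Q than D2.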
Asymmetric PP31 (diagram: d_i, d_j; w_ki --> w_kj)
Euclidean distance PP32
Manhattan block distance PP32
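The formulas for these two distance slides are not legible, so the sketch below assumes the standard definitions: Euclidean distance is the square root of the summed squared differences, and Manhattan (city block) distance is the sum of absolute differences. The vectors are the ones used in the cosine example; the function names are hypothetical. Unlike the coefficients above, these are distances, so a smaller value means the vectors are closer.

```python
import math


def euclidean(u, v):
    """Square root of the sum of squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Sum of absolute component differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


d1, q = (2, 3, 5), (0, 0, 2)
print(manhattan(d1, q))            # 8
print(round(euclidean(d1, q), 3))  # 4.69
```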
Other Measures PP32
Comparison PP34
Comparison PP34 Measures: Simple matching, Dice's Coefficient, Cosine Coefficient, Overlap Coefficient, Jaccard's Coefficient. All four coefficients share the numerator |A∩B|, so they can be ordered by their denominators. Since |A| ≥ |A∩B| and |B| ≥ |A∩B|: |A|+|B|-|A∩B| ≥ (|A|+|B|)/2 ≥ √(|A|*|B|) ≥ min(|A|, |B|). Hence O ≥ C ≥ D ≥ J.
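The ordering can be checked numerically. The sketch below reads the set forms of the four coefficients off the denominators in the comparison: Jaccard divides |A∩B| by |A|+|B|-|A∩B|, Dice by (|A|+|B|)/2, Cosine by √(|A|*|B|), and Overlap by min(|A|, |B|). The function name `coefficients` and the sample sets are assumptions chosen for illustration, and both sets are assumed non-empty.

```python
import math


def coefficients(a, b):
    """Set-based Overlap, Cosine, Dice and Jaccard coefficients."""
    common = len(a & b)
    return {
        "overlap": common / min(len(a), len(b)),
        "cosine": common / math.sqrt(len(a) * len(b)),
        "dice": 2 * common / (len(a) + len(b)),
        "jaccard": common / len(a | b),  # |A|+|B|-|A∩B| = |A∪B|
    }


a = {"B", "D", "F", "W", "Y"}   # |A| = 5
b = set("ABCDEFGHIJ")           # |B| = 10, |A∩B| = 3
c = coefficients(a, b)
print({name: round(value, 3) for name, value in c.items()})
print(c["overlap"] >= c["cosine"] >= c["dice"] >= c["jaccard"])  # True
```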
Example- Documents-Term-Query -Cont’ PP33 Terms: T1: search(ing), T2: Engine(s), T3: Models, T4: database, T5: query, T6: language, T7: documents, T8: measur(es,ing), T9: conceptual, T10: dependence, T11: domain, T12: structure, T13: similarity, T14: semantic, T15: ontologies, T16: information, T17: retrieval. Query: Semantic similarity measures used by search engines and other information searching mechanisms
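As a rough sketch of how the query turns into a term vector over T1 to T17, the Python below counts how often each term's surface forms occur in the query. The exact word forms matched for each term (for example "search" and "searching" for T1) are assumptions read from the bracketed endings in the term list, and the simple lowercase tokenizer is an illustrative choice rather than the method used in the chapter.

```python
import re

# Assumed surface forms for each term, taken from the term list.
TERMS = [
    ("T1", {"search", "searching"}), ("T2", {"engine", "engines"}),
    ("T3", {"models"}), ("T4", {"database"}), ("T5", {"query"}),
    ("T6", {"language"}), ("T7", {"documents"}),
    ("T8", {"measures", "measuring"}), ("T9", {"conceptual"}),
    ("T10", {"dependence"}), ("T11", {"domain"}), ("T12", {"structure"}),
    ("T13", {"similarity"}), ("T14", {"semantic"}), ("T15", {"ontologies"}),
    ("T16", {"information"}), ("T17", {"retrieval"}),
]


def term_vector(text):
    """Count occurrences of each term's surface forms in text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [sum(tok in forms for tok in tokens) for _, forms in TERMS]


query = ("Semantic similarity measures used by search engines "
         "and other information searching mechanisms")
q = term_vector(query)
print(q)  # [2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0]
```

T1 gets a count of 2 because both "search" and "searching" appear in the query.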
Example- Term-Document Matrix -Cont’ Matrix[q][A] PP34 Columns: T1 to T17. Rows: D1 to D7 and Q. Nonzero entries per row, listed from T17 down to T1 (column positions not available): D1: 1 1 1; D2: 1 1 1; D3: 1 1 1 1; D4: 1 1 1; D5: 1 1 1 1; D6: 1 1 1; D7: 1 1 1 1 1; Q: 1 1 1 1 1 2.
Dice coefficient PP30, PP34
Final Results PP34 O≥C≥D≥J
Current Applications PP35-38