
MATCHING CONCEPTUAL MODELS
(PART OF THE ‘IBIOSEARCH’ PROJECT)

JUNE 9 2008
Quantitative Methods

Ritu Khare
Order of the Presentation
Problem and Background
Research Questions
Initial Dataset
Overall Methodology
    Representation of Dataset A
    Criteria to compare two entities
    Generation of Dataset B
    Multivariate Analysis of Dataset B
Results
    Case 1
    Case 2
    Case 3
    Case 4
Inferences
Future Work
References
1. Problem and Background
A search interface is represented as a conceptual model.
[Figure: search interfaces Search X (fields A, B) and Search Y (field C), shown with conceptual-model nodes X, Y, A, B, and C]

The aim is to combine all search interfaces, i.e. to combine several conceptual models.
Hence, matching of models is required.
In this project, the focus is on matching entities.
2. Research Questions
Find an entity matching technique (or techniques) to match the entities of two models.
Does this technique (or combination of techniques) provide a good way to compare two entities?
What other bases of comparison can be used?
3. Initial Dataset A
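20 Conceptual Models
[Figure: two example conceptual models. Diagram labels are listed in extraction order; their grouping by example is approximate.]
Example 1: Expect, Matrix, Domain, DB
Example 2: BLASTP, Alignments, Accession No., Gene ID, Title, Sequence, Gene Patent, Patent Sequence Number, Gene Name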
4. Overall Methodology
Input: Conceptual Models
    Representation of Dataset A into structured tables
    Criteria to compare entities from different models (Entity Name, Attribute Set, Relationship Set)
    Generation of Dataset B
    Multivariate Analysis of Dataset B
Output: Analysis Results
4.1 Representation of dataset A
Every model is represented as a list of entities.
Every entity in a model is represented as:
    Entity name
    List of attributes
    List of relationships

Dataset A has the following columns:
(Model_ID, Entity_name, Attribute_set, Relationship_set)
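
The four columns above could be held in memory roughly as follows. This is a minimal sketch, not the project's actual storage format: the class name `EntityRow`, the helper `entities_of`, the sample rows, and the choice to store a relationship as the name of the related entity are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EntityRow:
    """One row of Dataset A: a single entity of a single conceptual model."""
    model_id: int
    entity_name: str
    attribute_set: frozenset = field(default_factory=frozenset)
    # Assumption: a relationship is recorded as the name of the related entity.
    relationship_set: frozenset = field(default_factory=frozenset)


# Hypothetical rows; the values are illustrative, not copied from the 20 models.
dataset_a = [
    EntityRow(1, "Gene", frozenset({"Gene ID", "Gene Name"}), frozenset({"Sequence"})),
    EntityRow(1, "Sequence", frozenset({"Accession No."}), frozenset({"Gene"})),
    EntityRow(2, "Gene Patent", frozenset({"Gene Name", "Title"}), frozenset()),
]


def entities_of(model_id, rows):
    """Return the list of entities that make up one model."""
    return [row for row in rows if row.model_id == model_id]
```

Using sets for the attribute and relationship columns keeps the later set-based comparisons straightforward.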
4.2 Criteria to compare two entities
Every entity of one model is compared with every entity of another model.
Criteria to compare two entities:
    Entity Name Similarity
        Exact String Matching, Substring Matching
        Output: Boolean Variable (True, False)
    Attribute Set Similarity
        Jaccard Coefficient
        Output: Decimal Number (between 0 and 1)
    Relationship Set Similarity
        Jaccard Coefficient
        Output: Decimal Number (between 0 and 1)
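
The three criteria can be sketched with two small functions. The function names, the case-insensitive comparison, and the treatment of two empty sets as having similarity 0 are assumptions; the slides only name the techniques and their output types.

```python
def name_similarity(name_a: str, name_b: str) -> bool:
    """True on an exact match or when one name is a substring of the other."""
    a, b = name_a.strip().lower(), name_b.strip().lower()  # assumption: ignore case
    if not a or not b:
        return False
    return a == b or a in b or b in a


def jaccard(set_a, set_b) -> float:
    """Jaccard coefficient |A intersect B| / |A union B|, between 0 and 1."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as not similar
    return len(a & b) / len(union)
```

For example, the attribute sets {gene id, gene name, title} and {gene name, title, sequence} share two of four distinct attributes, so their Jaccard coefficient is 0.5. The same `jaccard` function serves for both attribute sets and relationship sets.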

4.3 Generation of Dataset B
Input: 20 Conceptual Models
Algorithm:
    Stem Entity Names and Attribute Names (Porter Stemmer)
    Compare each pair of Entities from different models based on the three criteria (Section 4.2)
Output: Table (598 records)

Pair#   Name Similarity   Attribute Similarity   Relationship Similarity
XYZ     Yes               0.657                  0.004
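
The generation step can be sketched as a loop over every pair of models and every pair of entities across them. This is an illustration under stated assumptions, not the project's code: `generate_dataset_b`, the dictionary layout of `models`, and the sample data are hypothetical, and the `stem` parameter defaults to plain lower-casing as a stand-in for the Porter stemmer, which would be passed in from a stemming library.

```python
from itertools import combinations, product


def jaccard(a, b):
    """Jaccard coefficient; two empty sets give 0.0 (assumption)."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


def names_match(a, b):
    """Exact or substring match between two non-empty names."""
    return bool(a) and bool(b) and (a == b or a in b or b in a)


def generate_dataset_b(models, stem=str.lower):
    """Build one record per pair of entities drawn from two different models.

    models: {model_id: [(entity_name, attributes, relationships), ...]}
    stem:   placeholder for a real stemmer such as Porter's (assumption).
    """
    rows = []
    for (id_a, ents_a), (id_b, ents_b) in combinations(sorted(models.items()), 2):
        for (name_a, attrs_a, rels_a), (name_b, attrs_b, rels_b) in product(ents_a, ents_b):
            rows.append({
                "pair": (id_a, name_a, id_b, name_b),
                "name_sim": names_match(stem(name_a), stem(name_b)),
                "attr_sim": jaccard({stem(x) for x in attrs_a}, {stem(x) for x in attrs_b}),
                "rel_sim": jaccard(rels_a, rels_b),
            })
    return rows


# Hypothetical input with two tiny models.
models = {
    1: [("Gene", {"Gene ID", "Gene Name"}, {"Sequence"}),
        ("Sequence", {"Accession No."}, {"Gene"})],
    2: [("Gene Patent", {"Gene Name", "Title"}, set())],
}
rows = generate_dataset_b(models)
```

Entities within the same model are never paired, which matches the "from different models" condition in the algorithm.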
4.4 Multivariate Analysis of dataset B
Manually annotate whether each pair represents similar entities or not (“Match” column).
60 matches and 538 mismatches were found.

Pair#   Name Sim.   Attribute Sim.   Relationship Sim.   Match
XYZ     Yes         0.657            0.004               Yes

Is this a good Classification Model?
    Can it correctly identify matching and non-matching pairs?
    Which technique is suitable to answer these questions?

Binary Logistic Regression
    Predictive variables are a combination of continuous and categorical variables:
    Name_Sim (Categorical), Attr_Sim (Continuous), Rel_Sim (Continuous)
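
For reference, the standard binary logistic regression model with these three predictors can be written as below. The coefficient symbols are notation introduced here, and Name_Sim is assumed to be coded as 1 for Yes and 0 for No.

```latex
\[
P(\mathrm{Match} = 1) =
\frac{1}{1 + \exp\bigl(-(\beta_0 + \beta_1\,\mathrm{Name\_Sim}
      + \beta_2\,\mathrm{Attr\_Sim} + \beta_3\,\mathrm{Rel\_Sim})\bigr)}
\]
```

A pair is then classified as a match when this probability exceeds a chosen cutoff; 0.5 is a common default, though the slides do not state which cutoff was used.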
5. Results
Binary Logistic Regression
    IV (independent variables): Name_Sim, Attr_Sim, Rel_Sim
    DV (dependent variable): Match

Case 1: IV = Name_Sim
Case 2: IV = Name_Sim, Attr_Sim
Case 3: IV = Name_Sim, Rel_Sim
Case 4: IV = Name_Sim, Attr_Sim, Rel_Sim
5.1 Results: Case 1 and Case 2

Case 1: DV = Match, IV = Name_Sim
+ Accuracy increased from 85.6% to 92.6%, sensitivity increased from 0% to 59.3%, FN rate dropped from 100% to 40.7%.
+ In the Variables in the Equation table, the constant and Name_Sim are both significant.
+ Nagelkerke R square = .469
- Specificity decreased from 100% to 98.24%, FP rate increased from 0% to 1.75%.
- -2 Log Likelihood is very high (309.673).
- Cox and Snell R square = .263

Case 2: DV = Match, IV = Name_Sim, Attr_Sim
+ Accuracy increased from 85.6% to 92.6%, sensitivity increased from 0% to 59.3%, FN rate dropped from 100% to 40.7%.
+ In the Variables in the Equation table, the constant and Name_Sim are both significant.
+ Nagelkerke R square = .470
- Specificity decreased from 100% to 98.24%, FP rate increased from 0% to 1.75%.
- -2 Log Likelihood is very high (309.622).
- Cox and Snell R square = .264
- In the Variables in the Equation table, Attr_Sim is not significant.
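
The rates quoted in these results are all derived from the four cells of a classification table. The sketch below shows the standard definitions; the function name and the counts are hypothetical and are not taken from the slides.

```python
def classification_rates(tp, fn, tn, fp):
    """Standard rates from a 2x2 classification table.

    tp/fn: actual matches predicted as match / mismatch
    tn/fp: actual mismatches predicted as mismatch / match
    """
    total = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "fn_rate": 1.0 - sensitivity,   # FN rate is the complement of sensitivity
        "specificity": specificity,
        "fp_rate": 1.0 - specificity,   # FP rate is the complement of specificity
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
rates = classification_rates(tp=30, fn=20, tn=190, fp=10)
```

Because the FN rate is the complement of sensitivity, the reported 59.3% and 40.7% sum to 100%. The starting values (sensitivity 0%, specificity 100%) are consistent with a baseline that labels every pair a mismatch.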
5.2 Results: Case 3 and Case 4

Case 3: DV = Match, IV = Name_Sim, Rel_Sim
+ Accuracy increased from 85.6% to 92.6%, sensitivity increased from 0% to 59.3%, FN rate dropped from 100% to 40.7%.
+ In the Variables in the Equation table, the constant and Name_Sim are both significant.
+ Nagelkerke R square = .470
- Specificity decreased from 100% to 98.24%, FP rate increased from 0% to 1.75%.
- -2 Log Likelihood is very high (309.622).
- Cox and Snell R square = .264
- In the Variables in the Equation table, Rel_Sim is not significant.

Case 4: DV = Match, IV = Name_Sim, Attr_Sim, Rel_Sim
+ Accuracy increased from 85.6% to 92.6%, sensitivity increased from 0% to 59.3%, FN rate dropped from 100% to 40.7%.
+ In the Variables in the Equation table, the constant and Name_Sim are both significant.
+ Nagelkerke R square = .471
- Specificity decreased from 100% to 98.24%, FP rate increased from 0% to 1.75%.
- -2 Log Likelihood is very high (308.818).
- Cox and Snell R square = .265
- In the Variables in the Equation table, Attr_Sim and Rel_Sim are not significant.
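
The two pseudo R square statistics reported for each case are both computed from the -2 Log Likelihood of the fitted model and of the constant-only (null) model. The sketch below uses the standard formulas; the function names are illustrative, and the input values are hypothetical because the slides do not report the null model's -2 Log Likelihood.

```python
import math


def cox_snell_r2(neg2ll_null, neg2ll_model, n):
    """Cox and Snell R square: 1 - (L_null / L_model) ** (2 / n)."""
    return 1.0 - math.exp((neg2ll_model - neg2ll_null) / n)


def nagelkerke_r2(neg2ll_null, neg2ll_model, n):
    """Cox and Snell R square rescaled so that its maximum is 1."""
    max_r2 = 1.0 - math.exp(-neg2ll_null / n)
    return cox_snell_r2(neg2ll_null, neg2ll_model, n) / max_r2


# Hypothetical values, not taken from the slides.
cs = cox_snell_r2(neg2ll_null=400.0, neg2ll_model=300.0, n=500)
nk = nagelkerke_r2(neg2ll_null=400.0, neg2ll_model=300.0, n=500)
```

Since Nagelkerke's statistic divides Cox and Snell's by a quantity below 1, it is always the larger of the two, as in every case reported here.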
6. Inferences
Of the three predictor variables (Name_Sim, Rel_Sim, and Attr_Sim), only Name_Sim is a significant predictor of the actual class of an observation.
The misclassified cases are mainly pairs that require domain knowledge to match, e.g. BLASTP is the same as Protein Sequence, and TBLASTX is the same as Nucleotide Sequence.
7. Future Work
Improve the similarity function
Use domain dictionaries
Include more models
Generate a new classification function
Cluster entities that are found to be similar
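
One way the domain dictionary idea might look in practice is sketched below. It is a hypothetical illustration rather than a planned design: the dictionary holds only the two equivalences mentioned under Inferences, and the names `DOMAIN_SYNONYMS`, `canonical_name`, and `name_similarity_with_dictionary` are invented for this example.

```python
# Hypothetical domain dictionary; both entries come from the Inferences slide.
DOMAIN_SYNONYMS = {
    "blastp": "protein sequence",
    "tblastx": "nucleotide sequence",
}


def canonical_name(name, synonyms=DOMAIN_SYNONYMS):
    """Map a name to its dictionary form, or to its lower-cased self."""
    key = name.strip().lower()
    return synonyms.get(key, key)


def name_similarity_with_dictionary(name_a, name_b, synonyms=DOMAIN_SYNONYMS):
    """Exact or substring match after dictionary lookup."""
    a = canonical_name(name_a, synonyms)
    b = canonical_name(name_b, synonyms)
    return bool(a) and bool(b) and (a == b or a in b or b in a)
```

With this lookup in place, a pair such as BLASTP and Protein Sequence matches on name even though the two strings share no substring.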
References
NAR Journal dataset
Porter’s Stemming Algorithm: http://tartarus.org/~martin/PorterStemmer/
Sharma, S. (1995). Applied Multivariate Techniques. John Wiley & Sons, Inc., New York, NY, USA.
INFO 692 Lecture Handouts

Thank You
Questions, Comments, Ideas…?

 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 

Matching Conceptual Models Using Multivariate Analysis

  • 1. MATCHING CONCEPTUAL MODELS (PART OF THE ‘IBIOSEARCH’ PROJECT)
      June 9, 2008
      Quantitative Methods
      Ritu Khare
  • 2. Order of the Presentation
      - Problem and Background
      - Research Questions
      - Initial Dataset
      - Overall Methodology
          - Representation of Dataset A
          - Criteria to compare two entities
          - Generation of Dataset B
          - Multivariate Analysis of Dataset B
      - Results
          - Case I
          - Case II
          - Case III
          - Case IV
      - Inferences
      - Future Work
      - References
  • 3. 1. Problem and Background
      - A search interface is represented as a conceptual model.
        [Figure: two search interfaces, Search X (fields A, B) and Search Y (field C), drawn as a conceptual model with nodes X, Y, A, B, and C]
      - The aim is to combine all search interfaces, i.e. to combine several conceptual models.
      - Hence, matching of models is required.
      - In this project, the focus is on matching of entities.
  • 4. 2. Research Questions
      - Find one or more entity matching techniques to match the entities of two models.
      - Does this technique (or combination of techniques) provide a good way to compare two entities?
      - What other bases of comparison can be used?
  • 5. 3. Initial Dataset A
      - 20 conceptual models
      - Example 1 [Figure: model diagram with labels Expect, Matrix Domain, DB]
      - Example 2 [Figure: model diagram with labels BLASTP, Alignments, Accession No., Gene ID, Title, Sequence, Gene Patent, Patent Sequence Number, Gene Name]
  • 6. 4. Overall Methodology
      Input: Conceptual Models
      - Representation of Dataset A into structured tables
      - Criteria to compare entities from different models (Entity Name, Attribute Set, Relationship Set)
      - Generation of Dataset B
      - Multivariate Analysis of Dataset B
      Output: Analysis Results
  • 7. 4.1 Representation of Dataset A
      - Every model is represented as a list of entities.
      - Every entity in a model is represented as:
          - Entity name
          - List of attributes
          - List of relationships
      - Dataset A has the following columns: (Model_ID, Entity_name, Attribute_set, Relationship_set)
  • 8. 4.2 Criteria to compare two entities
      - All entities from two different models are compared.
      - Criteria to compare two entities:
          - Entity Name Similarity: exact string matching, substring matching. Output: Boolean variable (True, False)
          - Attribute Set Similarity: Jaccard coefficient. Output: decimal number (between 0 and 1)
          - Relationship Set Similarity: Jaccard coefficient. Output: decimal number (between 0 and 1)
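
The three criteria on this slide could be sketched in Python roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the case-insensitive comparison, and the choice to return 0.0 when both sets are empty are assumptions made here.

```python
def name_similarity(name_a: str, name_b: str) -> bool:
    """Exact or substring match between two entity names.

    Assumption: names are compared case-insensitively after trimming,
    and an empty name never matches anything.
    """
    a, b = name_a.strip().lower(), name_b.strip().lower()
    if not a or not b:
        return False
    return a == b or a in b or b in a


def jaccard(set_a, set_b) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B|, a value between 0 and 1.

    Assumption: two empty sets are given a similarity of 0.0.
    """
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)
```

For instance, two attribute sets that share one attribute out of three distinct ones get a Jaccard coefficient of 1/3, and "Gene" counts as a name match for "Gene Patent" under the substring rule.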
  • 9. 4.3 Generation of Dataset B
      - Input: 20 conceptual models
      - Algorithm:
          - Stem entity names and attribute names (Porter Stemmer).
          - Compare each pair of entities from different models based on the three criteria (Slide 8).
      - Output: table (598 records)

        Pair#   Name Similarity   Attribute Similarity   Relationship Similarity
        XYZ     Yes               0.657                  0.004
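
The pair-generation step above might look like the following sketch. The record layout (dictionaries with `model_id`, `name`, `attributes`, `relationships`), the function names, and the treatment of relationships as a set of related entity names are all assumptions for illustration. The `stem` parameter defaults to plain lowercasing as a stand-in; a real Porter stemmer (for example `nltk.stem.PorterStemmer().stem`) could be passed instead.

```python
from itertools import combinations


def jaccard(set_a, set_b) -> float:
    """Jaccard coefficient; two empty sets score 0.0 (assumption)."""
    a, b = set(set_a), set(set_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def generate_pairs(entities, stem=str.lower):
    """Build one row per pair of entities that come from different models.

    `entities` is a list of dicts with the hypothetical keys
    model_id, name, attributes, relationships.
    `stem` is a placeholder for a Porter stemmer.
    """
    rows = []
    for e1, e2 in combinations(entities, 2):
        if e1["model_id"] == e2["model_id"]:
            continue  # only entities from different models are compared
        n1, n2 = stem(e1["name"]), stem(e2["name"])
        rows.append({
            "pair": (e1["model_id"], e1["name"], e2["model_id"], e2["name"]),
            # exact or substring match on the stemmed names
            "name_sim": bool(n1) and bool(n2) and (n1 == n2 or n1 in n2 or n2 in n1),
            "attr_sim": jaccard({stem(x) for x in e1["attributes"]},
                                {stem(x) for x in e2["attributes"]}),
            "rel_sim": jaccard(e1["relationships"], e2["relationships"]),
        })
    return rows
```

Each returned row corresponds to one record of Dataset B before the manual "Match" annotation is added.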
  • 10. 4.4 Multivariate Analysis of Dataset B
      - Manually annotate whether each pair represents similar entities (the “Match” column).
      - 60 matches and 538 mismatches were found.

        Pair#   Name Sim.   Attribute Sim.   Relationship Sim.   Match
        XYZ     Yes         0.657            0.004               Yes

      - Is this a good classification model? Can it correctly identify matching and non-matching pairs?
      - Which technique is suitable to answer these questions? Binary Logistic Regression, since the predictor variables are a combination of continuous and categorical variables: Name_Sim (categorical), Attr_Sim (continuous), Rel_Sim (continuous).
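
To make the modelling step concrete, here is a small binary logistic regression written in plain Python with batch gradient descent. It is only a sketch of the technique: the slides do not say which software produced the results, and the function names, learning rate, and epoch count are assumptions chosen here. Each input row is assumed to be `[Name_Sim as 0/1, Attr_Sim, Rel_Sim]` and each label is 1 for a match and 0 otherwise.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights w and intercept b by minimising the mean log loss.

    lr and epochs are illustrative defaults, not tuned values.
    """
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            # prediction error for this observation
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x, threshold=0.5) -> bool:
    """Classify one pair as a match when the predicted probability reaches the threshold."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) >= threshold
```

Dropping a column from the input rows corresponds to fitting the model with fewer independent variables, which is how the four cases on the next slide differ.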
  • 11. 5. Results
      - Binary Logistic Regression
          - IV: Name_Sim, Attr_Sim, Rel_Sim
          - DV: Match
      - Case 1: IV = Name_Sim
      - Case 2: IV = Name_Sim, Attr_Sim
      - Case 3: IV = Name_Sim, Rel_Sim
      - Case 4: IV = Name_Sim, Attr_Sim, Rel_Sim
  • 12. 5.1 Results: Case 1 and Case 2
      Changes are relative to a baseline that classifies every pair as a non-match (sensitivity 0, specificity 100%).
      Case 1: DV = Match, IV = Name_Sim
          + Accuracy increased from 85.6% to 92.6%, sensitivity increased from 0 to 59.3%, FN rate dropped from 100% to 40.7%.
          + Variables in the equation: the constant and Sim_name are both significant.
          + Nagelkerke R square = .469
          - Specificity decreased from 100% to 98.24%, FP rate increased from 0 to 1.75%.
          - -2 Log Likelihood is very high (309.673).
          - Cox and Snell R square = .263
      Case 2: DV = Match, IV = Name_Sim, Attr_Sim
          + Accuracy increased from 85.6% to 92.6%, sensitivity increased from 0 to 59.3%, FN rate dropped from 100% to 40.7%.
          + Variables in the equation: the constant and Sim_name are both significant.
          + Nagelkerke R square = .470
          - Specificity decreased from 100% to 98.24%, FP rate increased from 0 to 1.75%.
          - -2 Log Likelihood is very high (309.622).
          - Cox and Snell R square = .264
          - Variables in the equation: Sim_Attr is not significant.
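
The classification measures quoted on the results slides (accuracy, sensitivity, specificity, FN rate, FP rate) all come from the confusion matrix of actual versus predicted labels. The sketch below shows the standard definitions; the function name and the choice to return NaN when a denominator is zero are assumptions made here, and no numbers from the slides are reproduced.

```python
def classification_metrics(actual, predicted):
    """Compute confusion-matrix measures for binary labels (truthy = match)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)

    def ratio(num, den):
        # Assumption: undefined ratios are reported as NaN
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "fn_rate": ratio(fn, tp + fn),       # 1 - sensitivity
        "fp_rate": ratio(fp, tn + fp),       # 1 - specificity
    }
```

A classifier that labels every pair as a non-match has sensitivity 0 and specificity 1 under these definitions, which is the pattern of the starting values reported on the slides.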
  • 13. 5.2 Results: Case 3 and Case 4
      Case 3: DV = Match, IV = Name_Sim, Rel_Sim
          + Accuracy increased from 85.6% to 92.6%, sensitivity increased from 0 to 59.3%, FN rate dropped from 100% to 40.7%.
          + Variables in the equation: the constant and Sim_name are both significant.
          + Nagelkerke R square = .470
          - Specificity decreased from 100% to 98.24%, FP rate increased from 0 to 1.75%.
          - -2 Log Likelihood is very high (309.622).
          - Cox and Snell R square = .264
          - Variables in the equation: Sim_Rel is not significant.
      Case 4: DV = Match, IV = Name_Sim, Attr_Sim, Rel_Sim
          + Accuracy increased from 85.6% to 92.6%, sensitivity increased from 0 to 59.3%, FN rate dropped from 100% to 40.7%.
          + Variables in the equation: the constant and Sim_name are both significant.
          + Nagelkerke R square = .471
          - Specificity decreased from 100% to 98.24%, FP rate increased from 0 to 1.75%.
          - -2 Log Likelihood is very high (308.818).
          - Cox and Snell R square = .265
          - Variables in the equation: Sim_Attr and Sim_rel are not significant.
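
For reference, the two pseudo R-square statistics quoted on the results slides are commonly defined from the likelihood of the intercept-only model and the likelihood of the fitted model. The symbols $L_0$, $L_M$, and $n$ (number of observations) are introduced here for illustration; the slides report only the resulting values.

```latex
% Cox and Snell R-square
R^2_{CS} = 1 - \left( \frac{L_0}{L_M} \right)^{2/n}

% Nagelkerke R-square: Cox and Snell rescaled so that its maximum is 1
R^2_{N} = \frac{R^2_{CS}}{1 - L_0^{\,2/n}}
```

Because the Cox and Snell value cannot reach 1, the Nagelkerke value is always at least as large, which is consistent with the pairs of values reported for each case.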
  • 14. 6. Inferences
      - Out of the three predictor variables (Name_Sim, Rel_Sim, and Attr_Sim), only Name_Sim is a good predictor of the actual classes of the observations.
      - The misclassified cases are mainly observations that require some domain knowledge, e.g. BLASTP is the same as Protein Sequence, and TBLASTX is the same as Nucleotide Sequence.
  • 15. 7. Future Work
      - Improve the similarity function
      - Use domain dictionaries
      - Include more models
      - Generate a new classification function
      - Cluster entities that are found to be similar
  • 16. References
      - NAR Journal dataset
      - Porter’s Stemming Algorithm: http://tartarus.org/~martin/PorterStemmer/
      - Sharma, S. (1995). Applied Multivariate Techniques. John Wiley & Sons, Inc., New York, NY, USA.
      - INFO 692 Lecture Handouts