Modelling and computing the quality of information in e-science. Paolo Missier, Suzanne Embury, Mark Greenwood (School of Computer Science, University of Manchester, UK); Alun Preece, Binling Jin (Department of Computing Science, University of Aberdeen, UK). http://www.qurator.org Rome, 3/4/07
Quality of data: data quality control in the data management practice
Common quality issues
Taxonomy for data quality dimensions
Our motivation: quality in public e-science data. Problem: using third-party data of unknown quality may result in misleading scientific conclusions. Databases shown: GenBank, UniProt, EnsEMBL, Entrez, dbSNP.
Some quality issues in biology. Each of these issues calls for a separate testing procedure, which makes them difficult to generalize.
Correctness in biology - examples
  Data type: UniProt protein annotation. Creation process: manual curation. Correctness: a functional annotation f for p is correct if function f can reliably be attributed to p.
  Data type: transcriptomics, gene expression report (up/down-regulation). Creation process: microarray data analysis. Correctness: no false positives, no false negatives.
  Data type: qualitative proteomics, protein identification. Creation process: generate peptide peak lists, match peak lists (e.g. Imprint). Correctness: no false positives, i.e. every protein in the output is actually present in the cell sample.
Defining quality in e-science is challenging. “Quality”: personal criteria for data acceptability.
Research goals: elicit “nuggets” of latent quality knowledge from the experts.
Example: protein identification. Diagram labels: “wet lab” experiment; protein identification algorithm; data output: protein hitlist; protein function prediction. Correct entry: true positive.
  Evidence:
  - Mass coverage (MC) measures the amount of protein sequence matched.
  - Hit ratio (HR) gives an indication of the signal-to-noise ratio in a mass spectrum.
  - ELDP reflects the completeness of the digestion that precedes the peptide mass fingerprinting.
  This evidence is independent of the algorithm / software package, and it is readily available and inexpensive to obtain.
Correctness of protein identification. Estimator function (computes a score rather than a probability): PMF score = (HR x 100) + MC + (ELDP x 10). Prediction performance, comparing 3 models: ROC curve (true positives vs false positives).
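The estimator on this slide can be written out as a small function. This is a minimal sketch, assuming the three evidence values are already available as plain numbers; the `HitEvidence` class, its field names, and the sample values are illustrative and not part of the Qurator code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HitEvidence:
    """Evidence for one protein hit (hypothetical container, names assumed)."""
    hit_ratio: float      # HR
    mass_coverage: float  # MC
    eldp: float           # ELDP


def pmf_score(e: HitEvidence) -> float:
    """PMF score = (HR x 100) + MC + (ELDP x 10), the formula from the slide."""
    return e.hit_ratio * 100 + e.mass_coverage + e.eldp * 10


# Made-up evidence values, used only to show how hits can be ranked by score.
hits = {
    "P1": HitEvidence(hit_ratio=0.5, mass_coverage=30.0, eldp=2.0),
    "P2": HitEvidence(hit_ratio=0.25, mass_coverage=12.0, eldp=1.0),
}
ranked = sorted(hits, key=lambda name: pmf_score(hits[name]), reverse=True)
```

Because the result is a score rather than a probability, it is only meaningful for ranking hits or for comparison against a threshold chosen by the user.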
Quality process components. Diagram labels: “wet lab” experiment; protein identification algorithm; data output: protein hitlist; quality filtering; protein function prediction. Goal: to automatically add the additional filtering step in a principled way. Quality assertion: PMF score = (HR x 100) + MC + (ELDP x 10).
Quality Assertions. Quality-equivalent regions (diagram): A, B, C, D. Actions associated with regions: e.g. accept/reject, but possibly more.
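One way to picture quality-equivalent regions and their actions is a classifier that places each score in a region, plus a table mapping regions to actions. The sketch below is an assumption-laden illustration: the cut-off values, the region names, and the "review" action are invented here, and the slide names only accept/reject as examples.

```python
from typing import Dict, List

# Hypothetical mapping from region to action; "review" stands in for
# the "possibly more" actions mentioned on the slide.
ACTIONS: Dict[str, str] = {"high": "accept", "mid": "review", "low": "reject"}


def classify(score: float, mid_cut: float = 40.0, high_cut: float = 80.0) -> str:
    """Assign a score to a region; the cut-offs are illustrative assumptions."""
    if score >= high_cut:
        return "high"
    if score >= mid_cut:
        return "mid"
    return "low"


def apply_actions(scores: Dict[str, float]) -> Dict[str, List[str]]:
    """Route every data item to the channel of the action for its region."""
    channels: Dict[str, List[str]] = {action: [] for action in ACTIONS.values()}
    for item, score in scores.items():
        channels[ACTIONS[classify(score)]].append(item)
    return channels
```

Keeping the region boundaries and the action table separate means a user can change what happens to a region without touching how items are classified.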
Layered definition of Quality. Diagram labels: data sources (DB, DB, Env); annotation functions; quality evidence annotations; custom quality knowledge; Quality Assertions functions (QA); Quality Views (QV): definition of acceptability regions. Layer attributes shown: long-lived, reusable; commodities; expert-defined; dynamic; user-controlled.
Abstract Quality Views
Computable quality views as commodities. Abstract quality views; binding and compilation; executable quality process. Qurator architectural framework.
Quality hypotheses discovery and testing. Diagram labels: quality model definition; abstract quality view; quality model; compilation; targeted compilation; execution on test data; performance assessment; target-specific quality component; deployment; quality-enhanced user environment.
Experimental quality. Discovery and validation: “nuggets of quality knowledge”. Diagram labels: quality view; model testing; test datasets; embedding quality views and flow-through testing.
Execution model for Quality views. Diagram labels: host workflow (D → D’); abstract quality view; QV compiler; embedded quality workflow; quality view on D’; Qurator quality framework; services registry; services implementation.
Example: original proteomics workflow (Taverna workflow; quality flow embedding point)
Example: embedded quality workflow
Interactive conditions / actions
Generic quality process pattern
  Collect evidence (fetch persistent annotations from persistent evidence; compute on-the-fly annotations):
    <variables>
      <var variableName="Coverage" evidence="q:Coverage"/>
      <var variableName="PeptidesCount" evidence="q:PeptidesCount"/>
    </variables>
  Compute assertions (classifiers):
    <QualityAssertion serviceName="PIScoreClassifier" serviceType="q:PIScoreClassifier" tagSemType="q:PIScoreClassification" tagName="ScoreClass"/>
  Evaluate conditions and execute actions:
    <action>
      <filter>
        <condition>ScoreClass in {"q:high", "q:mid"} and Coverage > 12</condition>
      </filter>
    </action>
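The three steps of the pattern (collect evidence, compute assertions, evaluate conditions and act) can be sketched in a few lines. This is an illustration under stated assumptions, not the Qurator implementation: the repository is a plain dictionary, the annotation functions and `toy_classifier` are hypothetical stand-ins for the services named in the view, and all data values are made up. Only the variable names, the `ScoreClass` tag, and the filter condition come from the slide.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Env = Dict[str, Any]


def collect_evidence(item: str, names: Iterable[str],
                     repository: Dict[Tuple[str, str], float],
                     annotators: Dict[str, Callable[[str], float]]) -> Env:
    """Bind each variable: use a persistent annotation if one exists,
    otherwise compute the value on the fly with an annotation function."""
    env: Env = {}
    for name in names:
        key = (item, name)
        env[name] = repository[key] if key in repository else annotators[name](item)
    return env


def run_quality_view(items: Iterable[str], names: List[str],
                     repository: Dict[Tuple[str, str], float],
                     annotators: Dict[str, Callable[[str], float]],
                     classify: Callable[[Env], str],
                     condition: Callable[[Env], bool]) -> List[str]:
    """Evaluate the filter condition per data item; keep items that pass."""
    accepted: List[str] = []
    for item in items:
        env = collect_evidence(item, names, repository, annotators)
        env["ScoreClass"] = classify(env)  # the assertion's single output tag
        if condition(env):
            accepted.append(item)
    return accepted


# Made-up data: P3 has no stored Coverage, so it is computed on demand.
PEPTIDES = {"P1": 4, "P2": 6, "P3": 1}
repository = {("P1", "Coverage"): 30.0, ("P2", "Coverage"): 10.0}
annotators = {
    "Coverage": lambda item: 25.0,
    "PeptidesCount": lambda item: float(PEPTIDES[item]),
}


def toy_classifier(env: Env) -> str:
    """Hypothetical stand-in for the PIScoreClassifier service."""
    n = env["PeptidesCount"]
    return "q:high" if n >= 5 else "q:mid" if n >= 3 else "q:low"


def condition(env: Env) -> bool:
    """The filter condition from the slide."""
    return env["ScoreClass"] in {"q:high", "q:mid"} and env["Coverage"] > 12


accepted = run_quality_view(["P1", "P2", "P3"], ["Coverage", "PeptidesCount"],
                            repository, annotators, toy_classifier, condition)
```

The caller never needs to know whether a value was stored or computed, which is the sense in which the pattern hides annotation lifetime.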
Reference (semantic) model. Diagram labels: data sources (DB, DB, Env); annotation functions; quality evidence annotations; custom quality knowledge; Quality Assertions functions (QA); Quality Views (QV): definition of acceptability regions; Common Semantic Model (IQ Ontology).
A semantic model for quality concepts. Quality “upper ontology” (OWL); quality evidence types; evidence annotations are class instances; evidence meta-data model (RDF).
Main taxonomies and properties
  assertion-based-on-evidence: QualityAssertion → QualityEvidence
  is-evidence-for: QualityEvidence → DataEntity
  Class restriction: MassCoverage, on property is-evidence-for, restricted to ImprintHitEntry
  Class restriction: PIScoreClassifier, on property assertion-based-on-evidence, restricted to HitScore
  Class restriction: PIScoreClassifier, on property assertion-based-on-evidence, restricted to MassCoverage
The ontology-driven user interface. Detecting inconsistencies: no annotators for this Evidence type. Detecting inconsistencies: unsatisfied input requirements for Quality Assertion.
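The two inconsistency checks named on this slide can be approximated with simple set lookups. The sketch below is a stand-in under stated assumptions: in the talk the checks are driven by the ontology, whereas here the evidence requirements and the annotator registry are hard-coded dictionaries, and `ImprintAnnotator` is a hypothetical name.

```python
from typing import Dict, Iterable, List, Set

# Evidence each assertion is based on (from the class restrictions in the talk).
REQUIREMENTS: Dict[str, Set[str]] = {
    "PIScoreClassifier": {"HitScore", "MassCoverage"},
}

# Hypothetical registry of annotation functions per evidence type.
ANNOTATORS: Dict[str, List[str]] = {
    "MassCoverage": ["ImprintAnnotator"],
}


def check_view(declared_evidence: Set[str], assertions: Iterable[str],
               requirements: Dict[str, Set[str]],
               annotators: Dict[str, List[str]]) -> List[str]:
    """Return human-readable inconsistencies found in a quality view."""
    problems: List[str] = []
    # Check 1: every declared evidence type needs at least one annotator.
    for evidence in sorted(declared_evidence):
        if not annotators.get(evidence):
            problems.append(f"no annotators for evidence type {evidence}")
    # Check 2: every assertion needs all of its input evidence declared.
    for assertion in assertions:
        missing = sorted(requirements.get(assertion, set()) - declared_evidence)
        if missing:
            problems.append(
                f"unsatisfied input requirements for {assertion}: {', '.join(missing)}")
    return problems
```

A user interface could run such a check whenever the view is edited and flag the offending evidence type or assertion before the view is compiled.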
Qurator architecture
Quality-aware query processing
Research issues
Summary. Publications: http://www.qurator.org Qurator is registered with OMII-UK.

Invited talk @Roma La Sapienza, April '07


Editor's notes

  1. From traditional DQ to the biologist’s problem of defining quality based on data semantics
  2. Data produced for the first time. Mention evolution of experimental techniques. Its production is not streamlined. No agreement on how to define its quality.
  3. Searching for “nuggets of quality knowledge”
  4. Here is the compilation model for mapping bound views to a sub-workflow
  5. Embedding the sub-flow requires a deployment descriptor: adapters between host flow and quality sub-flow; data and control links between host flow tasks and quality flow tasks.
  6. Activated during execution of the quality sub-flow – blocks the workflow for the duration of the interaction
  7. Our quality view specification language allows users to define abstract quality processes. Evidence types are ontology classes. Evidence values are class individuals, which are represented by variables. These variables are bound to values at runtime; the values themselves are either fetched from a repository of persistent annotations, or they are computed on demand by annotation functions. In our use cases, we have found examples of both. This process step abstracts away from the issue of annotation lifetime. Assertions are computed by services, which are represented by ontology classes, too. The tagName is the single output of the service (one for each input data item). Finally, the action step contains the condition/action pairs; here conditions are expressed on the variables introduced earlier, which define the scope. The semantics of the action step is that the expression is evaluated for each data item, and the corresponding action is taken, e.g. the item is sent to a specific channel.
  8. Benefits of this model: ability to share definitions within a community; consistency checking through reasoning -- cite previous papers?; flexibility.
  9. From right to left: data / knowledge layer; framework services; quality views management; targeted compiler(s).