Founding Director, Artificial Intelligence Institute at University of South Carolina
1 May 2010
Semantic Web for Health Care and Biomedical Informatics
Amit Sheth, "Semantic Web for Health Care and Biomedical Informatics," Keynote at NSF Biomed Web Workshop, Corbett, Oregon, December 4-5, 2007.
http://www.biomedweb.info/2007/
1. Semantic Web for Health Care and Biomedical Informatics. Keynote at NSF Biomed Web Workshop, December 4-5, 2007. Amit P. Sheth. Thanks to Pablo Mendes, Satya Sahoo and the Kno.e.sis team; collaborators at Athens Heart Center (Dr. Agrawal), NLM (Olivier Bodenreider), CCRC, UGA (Will York), CCHMC (Bruce Aronow)
3. Biomedical Informatics... needs a connection. Medical Informatics: etiology, pathogenesis, clinical findings, diagnosis, prognosis, treatment. Bioinformatics: genome, transcriptome, proteome, metabolome, physiome, ...ome. Sources: GenBank, UniProt, PubMed, ClinicalTrials.gov. Goals: hypothesis validation, experiment design, predictions, personalized medicine. Semantic Web research aims at providing this connection! More advanced capabilities for search, integration, analysis, linking to new insights and discoveries!
4. Evolution of the Web, 1997 to 2007
- Web of pages: text, manually created links; extensive navigation
- Web of databases: dynamically generated pages; web query interfaces
- Web of services: data = service = data, mashups; ubiquitous computing
- Web of people: social networks, user-created content; GeneRIF, Connotea
- Web as an oracle / assistant / partner: "ask the Web"; using semantics to leverage text + data + services + people
7. Metadata and Ontology: primary Semantic Web enablers. Shallow semantics; deep semantics; expressiveness, reasoning.
8. Characteristics of Semantic Web: self-describing; machine- and human-readable; issued by a trusted authority; easy to understand; convertible; can be secured. The Semantic Web: XML, RDF & Ontology. Adapted from William Ruh (CISCO)
11. N-Glycosylation metabolic pathway
GNT-I attaches GlcNAc at position 2: UDP-N-acetyl-D-glucosamine + alpha-D-Mannosyl-1,3-(R1)-beta-D-mannosyl-R2 <=> UDP + N-Acetyl-beta-D-glucosaminyl-1,2-alpha-D-mannosyl-1,3-(R1)-beta-D-mannosyl-R2
GNT-V attaches GlcNAc at position 6: UDP-N-acetyl-D-glucosamine + G00020 <=> UDP + G00021
Ontology terms: N-acetyl-glucosaminyl_transferase_V, N-glycan_beta_GlcNAc_9, N-glycan_alpha_man_4
12. Opportunity: exploiting clinical and biomedical data (formats range from binary to text)
- Health Information Services: Elsevier iConsult
- Scientific Literature: PubMed, 300 documents published online each day
- User-contributed Content (informal): GeneRIFs
- NCBI Public Datasets: genome and protein DBs, new sequences daily
- Laboratory Data: lab tests, RT-PCR, mass spec
- Clinical Data: personal health history
Uses: search, browsing, complex query, integration, workflow, analysis, hypothesis validation, decision support.
26. Deductive Reasoning: Protein-Protein Interaction
RULE: given that two genes interact with each other, and given that a certain number of parameters are met, we can assert that the gene products also interact with each other.
IF (x have_common_pathway y) AND (x rdf:type gene) AND (y rdf:type gene) AND (x has_product m) AND (y has_product n) AND (m rdf:type gene_product) AND (n rdf:type gene_product) THEN (m interacts_with n)
Diagram labels: gene1, gene2, gene_product, has_product, have_common_pathway, associated_with, database_identifier 1, database_identifier 2, interacts_with
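The rule on this slide can be sketched as a single forward-chaining pass over a toy set of triples. This is only an illustration: the identifiers `product1` and `product2` are made up, and the conclusion predicate `interacts_with` is an assumption taken from the label in the slide's diagram.

```python
# Toy triple store: (subject, predicate, object).
# product1/product2 are hypothetical identifiers used only for illustration.
triples = {
    ("gene1", "rdf:type", "gene"),
    ("gene2", "rdf:type", "gene"),
    ("gene1", "have_common_pathway", "gene2"),
    ("gene1", "has_product", "product1"),
    ("gene2", "has_product", "product2"),
    ("product1", "rdf:type", "gene_product"),
    ("product2", "rdf:type", "gene_product"),
}


def infer_product_interactions(triples):
    """Apply the slide's IF/THEN rule once and return the inferred triples."""

    def has(s, p, o):
        return (s, p, o) in triples

    def products_of(gene):
        return [o for s, p, o in triples if s == gene and p == "has_product"]

    inferred = set()
    for x, p, y in triples:
        if p != "have_common_pathway":
            continue
        # Both ends of the pathway link must be typed as genes.
        if not (has(x, "rdf:type", "gene") and has(y, "rdf:type", "gene")):
            continue
        for m in products_of(x):
            for n in products_of(y):
                if has(m, "rdf:type", "gene_product") and has(n, "rdf:type", "gene_product"):
                    # Assumed conclusion predicate, from the diagram label.
                    inferred.add((m, "interacts_with", n))
    return inferred


new_facts = infer_product_interactions(triples)
```

A rule engine or an OWL/RDF reasoner would normally do this work; the loop only makes the pattern matching explicit.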
28. Use data to test hypothesis: is there a link between glycosyltransferase activity and congenital muscular dystrophy? Sources: gene, GO, PubMed, gene name, OMIM, sequence, interactions. Adapted from: Olivier Bodenreider, presentation at HCLS Workshop, WWW07
29. In a Web pages world… (GeneID: 9215) has_associated_disease: Congenital muscular dystrophy, type 1D. (GeneID: 9215) has_molecular_function: Acetylglucosaminyl-transferase activity. Adapted from: Olivier Bodenreider, presentation at HCLS Workshop, WWW07
30. With the semantically enhanced data (from medinfo paper). Adapted from: Olivier Bodenreider, presentation at HCLS Workshop, WWW07
Query: SELECT DISTINCT ?t ?g ?d { ?t is_a GO:0016757 . ?g has_molecular_function ?t . ?g has_associated_phenotype ?b2 . ?b2 has_textual_description ?d . FILTER regex(?d, "muscular dystrophy", "i") . FILTER regex(?d, "congenital", "i") }
Graph: EG:9215 (LARGE) has_molecular_function GO:0008375 (acetylglucosaminyltransferase); EG:9215 has_associated_phenotype MIM:608840 (Muscular dystrophy, congenital, type 1D); GO:0008375 is linked by isa through GO:0008194 and GO:0016758 to GO:0016757 (glycosyltransferase)
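What the query on this slide computes can be sketched in plain Python over the identifiers shown in the graph. This is a sketch under stated assumptions: the `isa` edges among the GO terms are read off the flattened diagram, `is_a` is treated as transitive, and the text filters are applied as case-insensitive regular expressions. A real deployment would run SPARQL against an RDF store instead.

```python
import re

# Edges assumed from the slide's diagram (child -> parents).
IS_A = {
    "GO:0008375": ["GO:0008194", "GO:0016758"],
    "GO:0008194": ["GO:0016757"],
    "GO:0016758": ["GO:0016757"],
}
HAS_MOLECULAR_FUNCTION = {"EG:9215": ["GO:0008375"]}
HAS_ASSOCIATED_PHENOTYPE = {"EG:9215": ["MIM:608840"]}
HAS_TEXTUAL_DESCRIPTION = {"MIM:608840": "Muscular dystrophy, congenital, type 1D"}


def ancestors(term):
    """Return every term reachable from `term` by following is_a edges."""
    seen = set()
    stack = [term]
    while stack:
        for parent in IS_A.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def genes_linking(go_root, *patterns):
    """Find (function, gene, description) rows, mirroring ?t ?g ?d in the query."""
    rows = set()
    for gene, functions in HAS_MOLECULAR_FUNCTION.items():
        for t in functions:
            if go_root not in ancestors(t):
                continue
            for phenotype in HAS_ASSOCIATED_PHENOTYPE.get(gene, []):
                d = HAS_TEXTUAL_DESCRIPTION.get(phenotype, "")
                # Every pattern must match, like the chained FILTERs.
                if all(re.search(p, d, re.IGNORECASE) for p in patterns):
                    rows.add((t, gene, d))
    return sorted(rows)


results = genes_linking("GO:0016757", "muscular dystrophy", "congenital")
```

With the data above, the only row links EG:9215 (LARGE) to the phenotype description through GO:0008375.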
33. T. cruzi PSE Query Interface. Figure 4: Semantic annotation of MS scientific data
34. N-Glycosylation Process (NGP): Cell culture -> (extract) glycoprotein fraction -> (proteolysis) glycopeptides fraction -> (separation technique I, 1 to n) glycopeptides fractions -> (PNGase) peptide fractions -> (separation technique II, n to n*m) peptide fractions -> (mass spectrometry) ms data and ms/ms data -> (data reduction) ms peaklist and ms/ms peaklist -> (binning; peptide identification) N-dimensional array and peptide list -> (signal integration, data correlation) glycopeptide identification and quantification
35. Semantic Annotation Applications: Semantic Web Process to incorporate provenance. Each step is run by an agent with an input (I) and an output (O): Biological Sample Analysis by MS/MS -> Raw Data; Raw Data to Standard Format -> Standard Format Data; Pre-process -> Filtered Data; DB Search (Mascot/Sequest) -> Search Results; Results Post-process (ProValt) -> Final Output. Also shown: Storage, Biological Information
36. ProPreO: Ontology-mediated provenance. Mass Spectrometry (MS) data: ms/ms peaklist
parent ion m/z, parent ion abundance, parent ion charge:
830.9570 194.9604 2
fragment ion m/z, fragment ion abundance:
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043
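A minimal sketch of reading such a peaklist into labeled fields follows. The layout is an assumption based on the slide's labels: the first row holds parent ion m/z, abundance and charge, and every later row holds a fragment ion m/z and abundance. The names `Spectrum` and `parse_peaklist` are hypothetical and not part of ProPreO.

```python
from dataclasses import dataclass, field

# Values copied from the slide; the column meaning is assumed from its labels.
PEAKLIST = """\
830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043
"""


@dataclass
class Spectrum:
    parent_mz: float
    parent_abundance: float
    parent_charge: int
    fragments: list = field(default_factory=list)  # (m/z, abundance) pairs

    def base_peak(self):
        """Return the fragment with the highest abundance."""
        return max(self.fragments, key=lambda peak: peak[1])


def parse_peaklist(text):
    """Parse one ms/ms peaklist: a parent ion row, then fragment ion rows."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    mz, abundance, charge = rows[0]
    spectrum = Spectrum(float(mz), float(abundance), int(charge))
    for frag_mz, frag_abundance in rows[1:]:
        spectrum.fragments.append((float(frag_mz), float(frag_abundance)))
    return spectrum


spectrum = parse_peaklist(PEAKLIST)
```

Once the numbers carry names like these, each field can be annotated with the matching ontology concept instead of being left as an unlabeled column.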
42. Extracting the Relationship. Diabetes mellitus adversely affects the outcomes in patients with myocardial infarction (MI), due in part to the exacerbation of left ventricular (LV) remodeling. Although angiotensin II type 1 receptor blocker (ARB) has been demonstrated to be effective in the treatment of heart failure, information about the potential benefits of ARB on advanced LV failure associated with diabetes is lacking. To induce diabetes, male mice were injected intraperitoneally with streptozotocin (200 mg/kg). At 2 weeks, anterior MI was created by ligating the left coronary artery. These animals received treatment with olmesartan (0.1 mg/kg/day; n = 50) or vehicle (n = 51) for 4 weeks. Diabetes worsened the survival and exaggerated echocardiographic LV dilatation and dysfunction in MI. Treatment of diabetic MI mice with olmesartan significantly improved the survival rate (42% versus 27%, P < 0.05) without affecting blood glucose, arterial blood pressure, or infarct size. It also attenuated LV dysfunction in diabetic MI. Likewise, olmesartan attenuated myocyte hypertrophy, interstitial fibrosis, and the number of apoptotic cells in the noninfarcted LV from diabetic MI. Post-MI LV remodeling and failure in diabetes were ameliorated by ARB, providing further evidence that angiotensin II plays a pivotal role in the exacerbated heart failure after diabetic MI. Source: "Angiotensin II type 1 receptor blocker attenuates exacerbated left ventricular remodeling and failure in diabetes-associated myocardial infarction," Matsusaka H, et al. ARB causes heart failure
43. Problem: extracting relationships between MeSH terms from PubMed. UMLS Semantic Network: Biologically active substance, Lipid, Disease or Syndrome; relationships: affects, causes, complicates. MeSH: Fish Oils (instance_of), Raynaud's Disease (instance_of); relationship between them: ???????. PubMed: 9284 documents, 4733 documents, 5 documents
46. Method: identify entities and relationships in the parse tree. Sentence: "the excessive endogenous or exogenous stimulation by estrogen induces adenomatous hyperplasia of the endometrium". Tree nodes: TOP, S, NP, VP, PP, ADJP; part-of-speech tags: DT, JJ, CC, NN, IN, VBZ. Annotations: MeSHID D004967, MeSHID D006965, MeSHID D004717, UMLS ID T147. Legend: modifiers, modified entities, composite entities
50. Workflow Adaptation: Why? New knowledge about treatment found during the execution of the pathway; new knowledge about drugs and drug-drug interactions
Editor's notes
Biomedical informatics needs the connection between the macro (medical informatics) and the micro (bioinformatics). Information is found in several sources, from text to structured data. The Semantic Web aims to bridge this gap. It will provide more advanced capabilities for search, integration, analysis, and links to new insights and discoveries. "Does this gene have a causal relationship with this disease?" "Which gene would be the best one for me to knock out in experiments, based on the information we have?" "What is the probable course for a patient who has these symptoms and this genetic background?"
We see a change of paradigm on the Web. Researchers once had to navigate extensively through pages to obtain the answer to a question. We are getting closer to the time when one can pose a question to the Web and have the solution computed from integrated sources. Some key areas of work include: how to integrate pages, databases, services and human contributions on the Web; how to detect and propagate changes, and control authorship and trust; how to ask questions and visualize the results; how to automatically perform knowledge discovery over this global knowledge base.
1. The whole pathway is shown from the Dolichol compound over the first sugar, N-Acetyl-D-glucosaminyldiphosphodolichol (or GlcNAc-PP-dol), to the N-Glycan G00022 (KEGG accession number) or (GlcNAc)7 (Man)3 (Asn)1 (just numbers of residues; the glycan does not have a common name, but belongs to a class of "Pentaantennary complex-type sugar chains"). 2. GNT-I (UDP-N-acetyl-D-glucosamine:3-(alpha-D-mannosyl)-beta-D-mannosyl-glycoprotein 2-beta-N-acetyl-D-glucosaminyltransferase) catalyzes the reaction from 3-(alpha-D-mannosyl)-beta-D-mannosyl-R to 3-(2-[N-acetyl-beta-D-glucosaminyl]-alpha-D-mannosyl)-beta-D-mannosyl-R. 3. GNT-V (UDP-N-acetyl-D-glucosamine:6-[2-(N-acetyl-beta-D-glucosaminyl)-alpha-D-mannosyl]-glycoprotein 6-beta-N-acetyl-D-glucosaminyltransferase) catalyzes the reaction from 6-(2-[N-acetyl-beta-D-glucosaminyl]-alpha-D-mannosyl)-beta-D-mannosyl-R to 6-(2,6-bis[N-acetyl-beta-D-glucosaminyl]-alpha-D-mannosyl)-beta-D-mannosyl-R, which is part of the Glycan G00021. 4. The part of the ontology tree just shows where GNT-V is. 5. The GNT-V entry in the ontology shows that N-Glycan_beta_GlcNAc_9 is added with the help of enzyme GNT-V to a sugar containing the residue N-glycan_alpha_man_4. Why this is important for glycomics: G00021 is a so-called tetraantennary complex N-Glycan. When the red GlcNAc beta 1-6 is present due to GNT-V, this chain can be extended with polylactosamine. Polylactosamine is found in some metastatic cells. A challenge now is to find out whether this glycan structure is always made by GNT-V. Then we might be able to tell something about GNT-V and cancer. That is where probabilistic reasoning comes into play. Mention that man_4 and glcnac_9 are contextual residues. Mention GlycoTree.
NIDA undertook a project to study the genes implicated in nicotine dependency. The result of this study was a list of genes with their gene symbols, chromosomal locations and a brief comment about each gene. These genes were all from humans. The next step in the study is to correlate these genes with biological pathway information to answer a variety of queries, such as listing all interactions between genes, finding "hub" genes (genes that are highly active in terms of participation in pathways), or categorizing genes by their anatomical or tissue location. Clearly, this required integrating genome and pathway information.
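One of the queries mentioned here, finding "hub" genes, reduces to counting the pathways each gene participates in once genome and pathway data are integrated. The sketch below assumes that reading of "hub"; the gene and pathway names and the threshold are hypothetical placeholders, not data from the study.

```python
from collections import Counter

# Hypothetical gene-to-pathway memberships, standing in for integrated data.
participation = [
    ("geneA", "pathway1"),
    ("geneA", "pathway2"),
    ("geneA", "pathway3"),
    ("geneB", "pathway1"),
    ("geneC", "pathway2"),
    ("geneC", "pathway3"),
]


def hub_genes(pairs, min_pathways):
    """Return genes that take part in at least `min_pathways` distinct pathways."""
    # set() drops duplicate (gene, pathway) records before counting.
    counts = Counter(gene for gene, _pathway in set(pairs))
    return sorted(gene for gene, count in counts.items() if count >= min_pathways)


hubs = hub_genes(participation, min_pathways=3)
```

The threshold is a free parameter; the notes do not say how "highly active" was defined.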
We identified the primary biological pathway information sources, namely HumanCyc, KEGG and Reactome. The primary genome information sources were Entrez Gene and, for homology information, HomoloGene. We note that though we started with human genes only, we later added homologous gene records for four model organisms, namely zebrafish, fruit fly, mouse and C. elegans. The Gene Ontology is mainly a resource for GO annotation information. We needed to integrate these data sources effectively to answer the queries we discussed in the last slide.
Schema integration: as we discussed earlier, we integrate the two knowledge models at the schema level, i.e. in terms of classes and relationships. Hence, instead of creating new classes for "pathway" and "protein", we reused these concepts, which were already defined in the BioPAX ontology. Thus these two classes serve as anchors between the two schemas, and we will show a query that uses protein as the common class to traverse from genome information to pathway information.
One of the primary advantages of an ontology is the ability to create and execute inference rules that lead to information gain, i.e. they make explicit information that could otherwise be obtained only through human interpretation of the actual data. For example, if we revisit the first query, then given that two genes interact with each other, and given that a certain number of parameters are met, we can assert that the gene products also interact with each other. We can formally state the rule as shown.
Here we lay out a scenario in which a user would have to browse through multiple data sources to answer a query: "How are glycosyltransferase activity and congenital muscular dystrophy related?"
Here we show a user MANUALLY spotting from a web page the important concepts to answer his or her query.
Once the information is enhanced with ontologies, finding the connections is a matter of querying. There is no need for extensive navigation in an integrated environment. We show that three datasets (LARGE, MIM and GO) can be integrated to answer the user's needs.
A demonstration of how a user interface can benefit from ontologies to guide the user in formulating a query. The ontology schema is shown in the bottom-right corner as a reference to where the program is reading the possible connections between concepts.
Here the query builder is shown in the context of a bigger application (T. cruzi PSE), along with different perspectives for exploring results. Graphs are good for finding connections, while charts are good for an overview.
By N-glycosylation Process, we mean the identification and quantification of glycopeptides, and the separation and identification of N-glycans. Proteolysis: treat with trypsin. Separation technique I: chromatography, such as lectin affinity chromatography. From PNGase F we get fractions that contain peptides and glycans; we focus only on peptides. Separation technique II: chromatography, such as reverse-phase chromatography.
Core clinical/biomedical problems that we can address today or in the future. What are the Semantic Web technologies that can help?