Amit Sheth and Susie Stephens, "Semantic Web: Technolgies and Applications for Real-World," Tutorial at 2007 World Wide Web Conference, Banff, Canada.
Tutorial discusses technologies and deployed real-world applications through 2007.
Tutorial description at: http://www2007.org/tutorial-T11.php
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Semantic Web: Technolgies and Applications for Real-World
1. Semantic Web: Technologies and Applications for the Real-World Amit Sheth LexisNexis Ohio Eminent Scholar Kno.e.sis Center Wright State University http:// knoesis.wright.edu Tutorial at WWW2007 by Amit Sheth and Susie Stephens, Banff, Alberta, Canada. May 2007
2. Semantic Web: Technologies and Applications for the Real-World Susie Stephens Principal Research Scientist Tutorial at WWW2007 by Amit Sheth and Susie Stephens, Banff, Alberta, Canada. May 2007
3. Tutorial Outline 4.30-5.00pm Semantic Web in Practice 3.30-4.30pm Real-World Applications (2): Details and War Stories, Case Studies 3.00-3.30pm Break 2.00-3.00pm Real-World Applications (1): Enabling Technologies and Experiences 1.30-2.00pm Introduction to the Semantic Web
42. Ontology Language/ Representation Spectrum Is Disjoint Subclass of with transitivity property Modal Logic Logical Theory Thesaurus Has Narrower Meaning Than Taxonomy Is Sub-Classification of Conceptual Model Is Subclass of DB Schemas, XML Schema First Order Logic Relational Model XML ER Extended ER Description Logic DAML+OIL, OWL, UML RDFS, XTM Syntactic Interoperability Structural Interoperability Semantic Interoperability Expressiveness
56. N. Takahashi and K. Kato , Trends in Glycosciences and Glycotechnology , 15: 235-251 GlycoTree - D -Glc p NAc - D -Glc p NAc - D -Man p -(1-4)- -(1-4)- - D -Man p -(1-6)+ - D -Glc p NAc -(1-2)- - D -Man p -(1-3)+ - D -Glc p NAc -(1-4)- - D -Glc p NAc -(1-2)+
57. Pathway representation in Ontology Evidence for this reaction from three experiments Pathway visualization tool by M. Eavenson and M. Janik, LSDIS Lab, Univ. of Georgia
58. N-Glycosylation metabolic pathway GNT-I attaches GlcNAc at position 2 UDP-N-acetyl-D-glucosamine + alpha-D-Mannosyl-1,3-(R1)-beta-D-mannosyl-R2 <=> UDP + N-Acetyl-$beta-D-glucosaminyl-1,2-alpha-D-mannosyl-1,3-(R1)-beta-D-mannosyl-$R2 GNT-V attaches GlcNAc at position 6 UDP-N-acetyl-D-glucosamine + G00020 <=> UDP + G00021 N-acetyl-glucosaminyl_transferase_V N-glycan_beta_GlcNAc_9 N-glycan_alpha_man_4
59. Pathway Steps - Reaction Evidence for this reaction from three experiments Pathway visualization tool by M. Eavenson and M. Janik, LSDIS Lab, Univ. of Georgia
60. Pathway Steps - Glycan Pathway visualization tool by M. Eavenson and M. Janik, LSDIS Lab, Univ. of Georgia Abundance of this glycan in three experiments
61.
62.
63.
64. Extracting a Text Document: Syntactic approach INCIDENT MANAGEMENT SITUATION REPORT Friday August 1, 1997 - 0530 MDT NATIONAL PREPAREDNESS LEVEL II CURRENT SITUATION: Alaska continues to experience large fire activity. Additional fires have been staffed for structure protection. SIMELS, Galena District, BLM . This fire is on the east side of the Innoko Flats, between Galena and McGr The fore is active on the southern perimeter, which is burning into a continuous stand of black spruce. The fire has increased in size, but was not mapped due to thick smoke. The slopover on the eastern perimeter is 35% contained, while protection of the historic cabit continues. CHINIKLIK MOUNTAIN, Galena District, BLM . A Type II Incident Management Team (Wehking) is assigned to the Chiniklik fire. The fire is contained. Major areas of heat have been mopped up. The fire is contained. Major areas of heat have been mopped-up. All crews and overhead will mop-up where the fire burned beyond the meadows. No flare-ups occurred today. Demobilization is planned for this weekend, depending on the results of infrared scanning. LAYOUT Date => day month int ‘,’ int
65. Extraction Agent Web Page Enhanced Metadata Asset Taalee Extraction and Knowledgebase Enhancement Taalee Semantic Engine , also IBM Semantics Tools
66. Metadata Extraction and Semantic Enhancement Semantic Enhancement Engine [Hammond, Sheth, Kochut 2002] Also, WebFountatin, KIM from OntoText
71. ProPreO: Ontology-mediated provenance 830.9570 194.9604 2 580.2985 0.3592 688.3214 0.2526 779.4759 38.4939 784.3607 21.7736 1543.7476 1.3822 1544.7595 2.9977 1562.8113 37.4790 1660.7776 476.5043 parent ion m/z fragment ion m/z ms/ms peaklist data fragment ion abundance parent ion abundance parent ion charge M ass S pectrometry (MS) Data http://knoesis.org/research/bioinformatics
76. Taalee’s Semantic Search Highly customizable, precise and freshest A/V search Taalee A/V Semantic Search Engine Context and Domain Specific Attributes Uniform Metadata for Content from Multiple Sources, Can be sorted by any field Delightful, relevant information, exceptional targeting opportunity
77. Creating a Web of related information What can a context do?
78. Ontology driven Semantic Directory Search for company ‘Commerce One’ Links to news on companies that compete against Commerce One Links to news on companies Commerce One competes against (To view news on Ariba, click on the link for Ariba) Crucial news on Commerce One’s competitors (Ariba) can be accessed easily and automatically
79. What else can a context do? (a commercial perspective) Semantic Enrichment Semantic Targeting
80. Semantic/Interactive Targeting Precisely targeted through the use of Structured Metadata and integration from multiple sources Buy Al Pacino Videos Buy Russell Crowe Videos Buy Christopher Plummer Videos Buy Diane Venora Videos Buy Philip Baker Hall Videos Buy The Insider Video
81. Semantic Application – Equity Dashboard Managing Semantic Content on the Web , Internet Computing 2002 Focused relevant content organized by topic ( semantic categorization ) Automatic Content Aggregation from multiple content providers and feeds Related news not specifically asked for (Semantic Associations) Competitive research inferred automatically Automatic 3 rd party content integration
82.
83.
84.
85.
86.
87. N-Glycosylation Process (NGP) Cell Culture Glycoprotein Fraction Glycopeptides Fraction extract Separation technique I Glycopeptides Fraction n*m n Signal integration Data correlation Peptide Fraction Peptide Fraction ms data ms/ms data ms peaklist ms/ms peaklist Peptide list N-dimensional array Glycopeptide identification and quantification proteolysis Separation technique II PNGase Mass spectrometry Data reduction Data reduction Peptide identification binning n 1
89. ISiS – Integrated Semantic Information and Knowledge System Semantic Annotation Applications Semantic Web Process to incorporate provenance Storage Standard Format Data Raw Data Filtered Data Search Results Final Output Agent Agent Agent Agent Biological Sample Analysis by MS/MS Raw Data to Standard Format Data Pre- process DB Search (Mascot/Sequest) Results Post-process (ProValt) O I O I O I O I O Biological Information
Tasks often require data to be combined on the web (hotel and travel info) Humans can do that easily, even if different terminology is used. However, machines are ignorant. Partial information is unusable, difficult to make sense from an image, drawing analogies is difficult, difficult to combine information, e.g. is <foo:creator> sameas <bar:author>? How do you combine different XML schemas. - The semantic web is an interoperability technology. The traditional approach to interoperability is standardization. But standardization in this sense isn’t attractive, because this means that one has to anticipate everything about about the future, or one has to limit the world somehow. With the semantic web there’s more of a focus on standardizing how to say things, not what to say. There’s less of an emphasis on a priori standardization. The ultimate goal with the sw is serendipitous interoperability. Allowing information from independent sources be combined. Different domains have languages of their own, and their conceptual models of the world differ. SW was designed to accommodate different points of view. Using SW formalisms lowers the threshold for serendipitous reuse. The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. So the semantic web is all about breaking down silos of data, so that people can make better decisions, because they now have a 360 degree view of their operations. The semantic web is about forming a web of data. With the data itself residing in databases, spreadsheets, XML repositories wiki pages or traditional web pages. The semantic web is the glue that can bring data from these different sources together. And it’s an approach that is based on open standards. 
Labeling data on Web so that both humans and machines can more effectively use them Associating meaning to data that machines can understand so as to achieve lot more automation and off-load more work to machines Exploiting common vocabulary and richer modeling of subject area for much better integration of data
Map data onto an abstract data representation Make the data independent of its internal representation Merge the resulting representations Start making queries on the whole Queries that could not have been done on the individual data sets Use labels on the edges (relations) Use unique identifiers Data export does not always mean physical conversion of the data Relations can be generated on the fly at query time – via SQL bridges, scraping HTML, extracting data from Excel You can export part of the data Can use sameas E.g. integrating information about authors. (ISBN identifiers) Author isA person, so could find more in dbpedia for example. Include full classification of books, geographical location, information on inventories and prices – would then need ontologies and rules The graph representation is independent of the exact structure of for example the relational schema. A change in the local database schema or the XHTML does not affect the whole export step New data and new connections can be added seamlessly, regardless of the structure of other data sources The SemWeb makes this possible
Combine hotel and travel information – hard for machines to do. CoCoDat, cell centered database, NeuronDB
Can I trust (meta)data on the web? Is the author who s/he says s/he is? Can I check his/her credentials? Can I trust the inference engine How to name a full graph Protocols and policies to encode/sign full or partial graphs (blank nodes may be a problem for achieving uniqueness) How to express trust (trust in context) So I want to ground my earlier comments by giving you an overview of the semantic web standards. I’m going to start by giving you a high level overview of the different standards and technologies that make up the semantic web. I then have more slides that cover the key standards in some detail. So the semantic web takes advantage of many mature standards as it builds upon unicode, URIs, and XML. The use of URIs grounds RDF in the web. The semantic web uses XML in a couple of different ways. The semantic web takes advantage of XML namespaces for representing URIs. XML schema data types are used to characterize literals in RDF, and the standard way to serialize RDF is to use XML. RDF has been a W3C standard since 2004. RDF is a standard format for data interchange with the semantic web. RDF Schema is a language for describing RDF vocabularies, and is meant for inferencing. OWL has been a W3C standard since 2004. OWL is a more complex languages for describing RDF vocabularies. Again it is intended for inferencing. W3C formed a working group in the area of rules last Nov. It has been recognized that adding rules to the SW framework would provide additional expressivity. The goal of W3C’s work in this area is to provide a rule interchange format, so that the different rule engines can talk to each other. Gary Hallmark is doing an excellent job representing Oracle in this working group. SPARQL is the standard that W3C are working on for querying graphs. So it’s a bit like SQL for the SW. And Fred Zemke is doing a great job representing Oracle in this group. The upper layers of the stack are areas that still need to be standardized in the future. 
A proof language is simply a language that let's us prove whether or not a statement is true. The spectrum of the topic of trust is large - ranging from digital signatures and security to social network analysis. A nd cryto refers to the need to have support for security throughout the stack.
- RDF or resource description framework is the main syntactic standard for the Semantic Web. - In the Semantic Web things are referred to as ‘resources’; a resource can be anything that someone might want to talk about, e.g. Shakespeare, Stratford, “the value of п ”. A Subject, Predicate and and Object together form the basic building block for RDF – the triple. Using the triple it is possible to make statements about facts, statements about associations, and statements of statements. Triples become more interesting when more than one triple refers to the same entity. When more than one triple refers to the same thing, sometimes it is convenient to view the triples as a directed graph , in which each triple is an edge from its subject to its object, with the predicate as the label on the edge. One value of representing data in a triple is the ease with which information from multiple sources can be combined. Since information is represented simply as triples, merged information from graphs A and B should be as simple as forming the graph of all of the triples from A and B together. Merging graphs to create a combined graph is a straightforward process, but only when it is known which nodes in each of the source graphs match. That is, the essence of the merge comes down to answering the question, “when is a node in one graph the same node as a node in another graph?” In RDF, this issue is resolved through the use of URIs. A URI provides a global identification for a resource that is common across the web. If two agents on the web want to refer to the same resource, recommended practice on the web is for them to agree to a common URI for that resource. This is not a stipulation that is particular to the Semantic Web, but to the web in general; global naming leads to global network effects. RDF borrows the notion of URI to resolve the identity problem in graph merging. 
The application is quite simple; a node from one graph is merged with a node from another graph, exactly if they have the same URI. The RDF standard itself takes advantage of the namespace infrastructure to define a small number of standard identifiers in a namespace defined in the standard, a namespace called rdf . rdf:type is a property that provides an elementary typing system in RDF. For example, we can express the relationship between several entities using type information. The subject of rdf:type in these triples can be any identifier, and the object is understood to be a type. There is no restriction on the usage of rdf:type with types; types can have types etc.: rdf:Property is an identifier that is used as a type in RDF to indicate when another identifier is to be used as a predicate, rather than as a subject or an object. We can declare all the identifiers we have used as predicates So using RDF, information can also be assembled into useful blocks of knowledge. Any information expressed in RDF can be connected to any other information expressed in RDF, in much the same manner that any document expressed in HTML can link to any other document in HTML. This makes data very re-useable, and doesn’t trap people into an inflexible data model. Databases actually also use triples, but the components of the triple are a row, a column and the value in the field. RDF triples are very simplistic and flexible and so they can represent data in a table structure, in a tree structure, or in any structure.
RDF is a labeled connection between resources spo Use URIs for naming URIs make the merge possible Machine readable formats Form DLG
This is the simplest form of extra knowledge. Define the terms we can use What restrictions apply What extra relations are there RDF vocabulary description language e.g. use the term novel. Every novel is a fiction The glass palace is a novel Everything in RDF is a resource Classes are a collection of resources Relationships are defined between classes and resources Typing an individual belongs to a particular class – e.g. the glass palace is a novel (or to be more precise ISBN xxx is a novel) Subclassing – the instance of one is also the instance of another – e.g. every novel is a fiction RDFS formalizes these notions Resources and classes are nodes; while types and subClassOf are properties A resource may belong to several classes. The Glass Palace is a novel ad an inventory item. The type information can be very important for applications. It can be used to categorize nodes. Inferred data is not in the original data But can be inferred from the RDFS rules Some RDF environments can return that triple too. RDF semantics has 44 entailment rules If such ad such triples are in the graph, and these triples Do that recursively until the graph doesn’t change This can be done in polynomial time Whether triples are added to the graph, or deduced at run time is an implementation issue Property is a special class. Properties are also resources identified by URIs Properties are constrained by their range and domain Ie what individuals can serve as object and subject There is also a possibility for sub-property All resources bound by the sub are also bound by each other Properties are also resources (named with URIs) Literals may have a data type. Natural language can also be specified RDFS has some predefined classes and properties They are not new concepts in the RDF Model, just resources with agreed semantics e.g. 
collections (aka lists), containers (sequence, bag, alternatives), reification Terminological assertions and axioms describes the separation of data from schema So working up the stack. RDF provides support for vocabularies through RDFS. RDFS supports hierarchies of classes and properties. This makes it easier to interpret data when people are talking at different levels of abstraction. It also allows the domain and range of properties to be constrained. rdfs:domain always says what class the subject of a triple using that property belongs to. rdfs:range always says what class the object of a triple using that property belongs to. RDFS represent the simplest ontology level. RDF Schema is meant for inferencing. So if flipper is a dolphin. And dolphins are mammals. Then you can infer that flipper is a mammal. RDFS Example: A rdf:type B, B rdfs:subClassOf C => A rdf:type C Ex: Matt rdf:type Father, Father rdfs:subClassOf Parent => Matt rdf:type Parent
So moving up the stack further, you get to Web Ontology Language or OWL. Ontologies are a very general term that is meant to define the concepts and relationships used to describe and represent an area of knowledge. Ontologies are used to classify the terms used in a particular application, characterize possible relationships, and define possible constraints using those relationships. OWL is intended to be used when the information contained in documents needs to be processed by applications, as opposed to situations where the content only needs to be presented to humans. Ontologies can be very complex, or very simple. OWL can be used to explicitly represent the meaning of terms in vocabularies and the relationships between those terms. OWL adds more vocabulary for describing properties and classes: among others, relations between classes (e.g. disjointness), cardinality (e.g. exactly one), equality, richer typing of properties, characteristics of properties (e.g. symmetry), and enumerated classes. Owl lite, owl dl and owl full have increasing expressivity. OWL lite supports those users primarily needing a classification hierarchy and simple constraints. E.g. while it supports cardinality constraints, it only permits cardinality values of 1 or 0. It provides a quick migration path for thesauri and other taxonomy. It has a lower formal complexity that OWL DL. OWL DL supports those users who want the maximum expressiveness while retaining computational completeness (so all conclusions are guaranteed to be computable) and decidability (all computations will finish in a finite time). OWL DL contains all owl language constructs, but they can be used only under certain restrictions (e.g. while a class can be a subclass of many classes, a class cannot be an instance of another class). OWL DL is so named due to its correspondence with description logics, a field of research that has studied the logics that form the formal foundations of OWL. 
OWL Full is meant for users who want maximum expressiveness and the syntactic freedom of RDF with no computational guarantees. E.g. in OWL Full a class can be treated simultaneously as a collection of individuals and as an individual in its own right. It is unlikely that any reasoning software will be able to support complete reasoning for every feature of OWL Full. An ontology is an engineering artefact consisting of: A vocabulary used to describe (a particular view of) some domain An explicit specification of the intended meaning of the vocabulary. Often includes classification based information Constraints capturing background knowledge about the domain Ideally, an ontology should: Capture a shared understanding of a domain of interest Provide a formal and machine manipulable model -- An ontology describes the concepts and relationships that are important in a particular domain, providing a vocabulary for that domain as well as a computerized specification of the meaning of terms used in the vocabulary. Ontologies range from taxonomies and classifications, database schemas, to fully axiomatized theories. In recent years, ontologies have been adopted in many business and scientific communities as a way to share, reuse and process domain knowledge. Ontologies are now central to many applications such as scientific knowledge portals, information management and integration systems, electronic commerce, and semantic web services.
One last point that I’d like to make about ontologies, is that it is possible to merge them. This means that it’s possible to take several relatively small and modular ontologies, and join them together. The figure shows a number of different ontologies that could be merged together. And by doing that, it would become possible to query across a number of data sources that are typically in different silos. Merging ontologies is still a fairly manual process, but there are research groups that are focused on at least partially automating the task.
Queries become very important for distributed rdf data The fundamental idea: generalize the approach to graph patterns The pattern contains unbound symbols By binding the symbols (if possible) subgraphs of the RDF graph are selected If there is such a selection, the query returns the bound resources SPARQL is based on similar systems that already existed in some environments SPARQL is a programming language-independent query language Optional patterns; if they match variables,, fine, if not, don’t care Limit the number of returned results; remove duplicated, sort them.. Specify several data sources (via URIs) within the query (essentially a merge) Construct a graph combining a separate pattern and query results Use datatypes and/or language tags when matching a pattern Already many implementations Used locally, ie bound to a programming environment Used remotely, ie over the network on into a database Separate documents define the protocol and the result format SPARQL Protocol for RDF with HTTP and SOAP bindings SPARQL results in XML or JSON formats
GRDDL is a mechanism for G leaning R esource D escriptions from D ialects of L anguages. This GRDDL specification introduces markup based on existing standards for declaring that an XML document includes data compatible with the Resource Description Framework (RDF) and for linking to algorithms (typically represented in XSLT), for extracting this data from the document. The markup includes a namespace-qualified attribute for use in general-purpose XML documents and a profile-qualified link relationship for use in valid XHTML documents. The GRDDL mechanism also allows an XML namespace document (or XHTML profile document) to declare that every document associated with that namespace (or profile) includes gleanable data and for linking to an algorithm for gleaning the data. A corresponding GRDDL Use Case Working Draft provides motivating examples. A GRDDL Primer demonstrates the mechanism on XHTML documents which include widely-deployed dialects known as microformats. A GRDDL Test Cases document illustrates specific issues in this design and provides materials to aid in test-driven development of GRDDL-aware agents. GRDDL formalizes the scrapper approach. The user has to provide dc-extract.xsl and use its conventions (making use of the corresponding meta=s, class id-s, etc). But by using the profile attribute, a client is instructed to find and run the transformation processor automatically. There is a mechanism for XML in general, a transformation can also be defined on a XML schema level. A bridge to microformats.
The Web Services Description Language (WSDL) specifies a way to describe the abstract functionalities of a service and concretely how and where to invoke it. The WSDL 2.0 specification does not include semantics in the description of Web services. Therefore, two services can have similar descriptions while meaning totally different things. Resolving this ambiguity in Web services descriptions is an important step toward automating the discovery and composition of Web services — a key productivity enabler in many domains including business application integration. Semantic Annotations for WSDL and XML Schema ( SAWSDL ) specification that is being developed by this Working Group defines mechanisms using which semantic annotations can be added to WSDL components. SAWSDL does not specify a language for representing the semantic models, e.g. ontologies. Instead, it provides mechanisms by which concepts from the semantic models that are defined either within or outside the WSDL document can be referenced from within WSDL components as annotations. These semantics when expressed in formal languages can help disambiguate the description of Web services during automatic discovery and composition of the Web services. Based on member submission WSDL-S , the key design principles for SAWSDL are: The specification enables semantic annotations for Web services using and building on the existing extensibility framework of WSDL. It is agnostic to semantic representation languages. It enables semantic annotations for Web services not only for discovering Web services but also for invoking them. Based on these design principles, SAWSDL defines the following three new extensibility attributes to WSDL 2.0 elements to enable semantic annotation of WSDL components: an extension attribute, named modelReference , to specify the association between a WSDL component and a concept in some semantic model. 
This modelReference attribute is used to annotate XML Schema complex type definitions, simple type definitions, element declarations, and attribute declarations as well as WSDL interfaces, operations, and faults. two extension attributes, named liftingSchemaMapping and loweringSchemaMapping , that are added to XML Schema element declarations, complex type definitions and simple type definitions for specifying mappings between semantic data and XML. These mappings can be used during service invocation. Publications
- Another area where semantic web can be used is in enterprise search. - And this is in fact exactly what we’re doing on OTN. We’re currently just using the capability for the press room, but beta testing will be starting soon for more of the site. I believe that semantic search is of most value where you have a steady stream of highly structured information coming in, where the semantic metadata is applied automatically in the background, e.g filling in forms, where the form is backed by an ontology. With OTN we are using RDF to type the content, so that you can dynamically aggregate the site according to your area of interest. So using this approach, you aren’t restricted to say drilling down into the database silo, or the middleware silo. Now you can choose to say view all forums, or all sites about Java or life sciences. Oracle Technology Network (OTN): aggregates many sources of content through a single portal Various feeds are pulled, Oracle’s taxonomy is used for additional annotation Ontologies make it possible to re-group related items; leads to more comprehensible search results and to narrow search Advantages include enhanced search and navigation, better user interface, better queries through SPARQL
Improving the reliability of search results Define content labels: metadata with more information on search results (eg, suitability for mobile devices, disabled users or children) With a proper extension (Segala’s Search Thresher): (1) search results are processed, trying to locate a content label file (in RDF) and (2) pages that contain a content label file are highlighted, additional information is displayed More trust and warning on search results; with proper settings, results can be filtered
University of Texas SAPPHIRE: an integrated biosurveillance system; integrates data from multiple disparate sources Data can be viewed from many different perspectives (disease surveillance, environmental protection, biosecurity and bioterrorism, veterinary surveillance, etc.) SAPPHIRE new data feeds absorbed easily (eg, it had a successful use during the Katrina disaster) Advantages include dynamic adaptability, usage of disparate and heterogeneous data
Ordnance Survey Geographic Referencing Framework Maintains a definitive mapping data of Great Britain (used by the public and private sector) Integrates different, semantically diverse sources of data Develop general ontologies (eg, Hydrology, Administrative Geography, Buildings and Places, etc) to bridge differences Efficient queries via the ontology or directly in RDF Advantages include efficient data integration, data repurposing, better quality control and classification
The complexity of supply chains has increased, they involve many players of differing size and function Support for “Operational Support Systems (OSS)” integration using semantic descriptions of system interfaces and messages Internet Service Providers integrate their OSS with those of BT (via a gateway) Integration of heterogeneous OSS systems of partners The approach reduces costs and time-to-market; ontologies allow for a reuse of services
Tata Consultancy Services Natural Language Interface to Business Applications A domain OWL ontology aids in the retrieval of relevant data and concepts Advantages include distinct semantics for the various concepts in the domain, link in external concepts; the system uses a natural language system to interact “intelligently” with the user
Reduction of Errors in Radiological Procedure Orders Clinical (textual) guidelines exist to assist clinicians for the appropriate radiological procedures Clinical guidelines and protocols (from different sources) can be integrated, bound to (OWL) ontologies SW engines generate recommendations for radiology orders Prevents, e.g., medical errors, spreads the development/maintenance of knowledge bases, traces the provenance of facts and rules in decision making
Databases: focus on local semantics that capture only aspects of the real world; typically keep that semantics implicit; use logic structurally; their schemas are not generally reusable.
Ontologies: focus on global semantics of the real world; make that semantics explicit; enable machine interpretability by using a logic-based modeling language; are reusable as true models of a portion of the world.
1. Intro. On the previous slide, I highlighted some of the differences between OWL Lite, OWL DL, and OWL Full. I want to show this slide to help position OWL compared to other technologies in terms of expressivity.
2. Ontologies. Ontologies are about vocabularies and their meanings, with explicit, expressive, and well-defined semantics, possibly machine-interpretable. Ontologies try to limit the possible formal models of interpretation (semantics) of those vocabularies to the set of meanings a modeler intends, i.e., close to the human conceptualization.
3. DB. None of the other "vocabularies," such as database schemas or object models, with less expressive semantics, do that. The approaches with less expressive semantics typically assume that humans will look at the "vocabularies" and supply the semantics via the human semantic interpreter (your mental model).
5. Modeling. A given ontology cannot completely model any given domain. However, in capturing real-world semantics, you are thereby enabled to reuse, extend, refine, and generalize that semantic model. You cannot reuse database schemas. You might be able to take a database conceptual schema and use it as the basis of an ontology, but that would still be a leap from an Entity-Relationship model to a Conceptual Model (say, UML, i.e., a weak ontology) to a Logical Theory (a strong ontology). Logical and physical schemas are typically pretty useless, since they incorporate non-real-world knowledge (and in non-machine-interpretable form). By the time you have the physical schema, you have only relations and key information: you've thrown away the little semantics you had at the conceptual schema level.
concerning classes and class-hierarchies
Websites are converted into Freedom-internal descriptions. Part of the instance data is the CarbBank ID, which is not really considered in this slide. Another part is the IUPAC glycan representation, which is two-dimensional and ambiguous. This is converted into the linear LINUCS format, which again is a purely syntactic translation of the IUPAC structure; for that reason, ambiguity may still be present in the individual residue descriptions, whereas the overall structure is unambiguous. LINUCS is then given to the LINUCS-to-GLYDE converter, which analyzes different spellings of monosaccharide residues to find an unambiguous representation, which is then converted to the representation used in GlycO and compared to the instances already present in the KB.
Use of canonical representation: model components that may be used in a combinatorial manner. The rules of combination are also modeled as part of the ontology.
A metabolic pathway is (ideally) a chain of reactions leading from one chemical compound to another ("ideally" because pathways can branch at some points, leading to different intermediate structures with the same final structure). This slide shows the description of a reaction between two glycans in GlycO. The reaction adds the glycosyl residue N-glycan a-D-Glcp 71 to the acceptor residue N-glycan a-D-Glcp 70, forming the new lipid-linked N-glycan precursor molecule G00008 from the lipid-linked N-glycan precursor molecule G10599 (identifiers are from KEGG). The box with the bar graphs shows evidence for this reaction in different experiments: ES (embryonic stem) cells, EB (embryonic body) cells, and RA ("retinoic acid-treated") cells. ES cells treated this way differentiate differently than cells grown in the absence of LIF, so there are two different kinds of differentiated cells (EB and RA).
1. The whole pathway is shown, from the Dolichol compound, over the first sugar, N-Acetyl-D-glucosaminyldiphosphodolichol (or GlcNAc-PP-dol), to the N-Glycan G00022 (KEGG accession no.), or (GlcNAc)7 (Man)3 (Asn)1 (just numbers of residues; the glycan doesn't have a common name, but belongs to a class of "pentaantennary complex-type sugar chains").
2. GNT-I (UDP-N-acetyl-D-glucosamine:3-(alpha-D-mannosyl)-beta-D-mannosyl-glycoprotein 2-beta-N-acetyl-D-glucosaminyltransferase) catalyzes the reaction from 3-(alpha-D-mannosyl)-beta-D-mannosyl-R to 3-(2-[N-acetyl-beta-D-glucosaminyl]-alpha-D-mannosyl)-beta-D-mannosyl-R.
3. GNT-V (UDP-N-acetyl-D-glucosamine:6-[2-(N-acetyl-beta-D-glucosaminyl)-alpha-D-mannosyl]-glycoprotein 6-beta-N-acetyl-D-glucosaminyltransferase) catalyzes the reaction from 6-(2-[N-acetyl-beta-D-glucosaminyl]-alpha-D-mannosyl)-beta-D-mannosyl-R to 6-(2,6-bis[N-acetyl-beta-D-glucosaminyl]-alpha-D-mannosyl)-beta-D-mannosyl-R, which is part of the glycan G00021.
4. The part of the ontology tree just shows where GNT-V is.
5. The GNT-V entry in the ontology shows that N-Glycan_beta_GlcNAc_9 is added, with the help of the enzyme GNT-V, to a sugar containing the residue N-glycan_alpha_man_4.
Why this is important for glycomics: G00021 is a so-called tetraantennary complex N-Glycan. When the red GlcNAc beta 1-6 is present due to GNT-V, this chain can be extended with polylactosamine. Polylactosamine is found in some metastatic cells. A challenge now is to find out whether this glycan structure is always made by GNT-V; then we might be able to tell something about GNT-V and cancer. That is where probabilistic reasoning comes into play. Mention that man_4 and glcnac_9 are contextual residues. Mention GlycoTree.
This slide shows a glycan in a pathway. Interesting here is the partonomy: you can see that the glycopeptide G10695 is composed of the glycan moiety N-Glycan G10695 and a representative peptide moiety. "Representative" means that we don't specify exactly which peptide moiety it is. Interesting for glycomics researchers is the fact that they are looking at a glycopeptide. The glycan moiety itself is composed of parts (the carbohydrate residues), which in turn are composed of atoms.
Ontology-mediated provenance: use of ontology to construct and process provenance metadata.
Zoom into examples
Additional details can be found at http://www.semagix.com/documents/SemagixCIRASFinal_004.pdf
By N-glycosylation process, we mean the identification and quantification of glycopeptides, and the separation and identification of N-Glycans. Proteolysis: treat with trypsin. Separation technique I: chromatography, e.g., lectin affinity chromatography. From PNGase F we get fractions that contain peptides and glycans; we focus only on peptides. Separation technique II: chromatography, e.g., reverse-phase chromatography.
Over the last decade there have been tremendous advances in biology. The human genome has been sequenced. It is possible to do large-scale experiments to look at the level at which all genes are active within a single cell, and technologies are available that allow the gene profile of individuals to be determined. Another area of tremendous advances is imaging, which is two-fold: there are high-throughput screening approaches for assays in the lab, but there have also been huge advances in medical imaging; for example, MRI approaches are much better for brain scans. New approaches have been discovered for switching off individual genes, so that we can better understand their functionality. All these advances are allowing companies to approach tailored therapeutics.
Tailored therapeutics means the right dose to the right patient at the right time. Benefits include a better risk profile and lower cost, since trials can be more effective. Some drugs will now make it to market that would otherwise have failed. But it is the end of the blockbuster era.
In order to embrace tailored therapeutics it is necessary to integrate data all of the way from basic research, through prototype design, to pre-clinical and clinical development. This is because it's necessary to combine genetic information with information about how the different subsets will respond to disease. This is proving to be a challenge for the pharma industry, as much data has historically been held in silos. The need for better integration in drug discovery and development is so acute that the US FDA has written a report, called "Innovation or Stagnation," that specifically encourages companies to make better use of their data.
Described the three main data integration approaches. But the right approach depends on your data environment: How stable is the data? How much do you need to share it with collaborators? How important is performance? How much can you control the data (e.g., format, identifiers)? Does data provenance matter to you? Are you looking to perform simple queries or to look for complex patterns? How critical is security? Are you working with many data types?
It's a hybrid world: lots of relational, XML, and increasingly RDF/OWL. Keep data in its existing representation and map it to RDF; there are many tools for mapping relational and XML data to RDF (D2RQ, SquirrelRDF, DartGrid, GRDDL). Use ontologies as the semantic integration layer.
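The "map to RDF" idea above can be sketched in a few lines (the table name, namespace, and columns here are hypothetical illustrations, not part of any of the tools listed): each relational row becomes a set of triples whose subject is minted from the primary key and whose predicates come from column names.

```python
# Minimal sketch (hypothetical schema) of exposing a relational row as
# RDF triples, in the spirit of relational-to-RDF mappers such as D2RQ.

BASE = "http://example.org/"  # hypothetical namespace

def row_to_triples(table, pk_column, row):
    """Turn one relational row (a dict) into (subject, predicate, object) triples."""
    subject = f"{BASE}{table}/{row[pk_column]}"  # subject URI from primary key
    return [(subject, f"{BASE}{table}#{col}", value)  # predicate URI from column name
            for col, value in row.items() if col != pk_column]

for t in row_to_triples("employee", "id", {"id": 7, "name": "Ada", "dept": "R&D"}):
    print(t)
```

Real mappers add datatype handling, foreign keys as object properties, and on-the-fly query translation, but the core row-to-graph correspondence is this simple.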
- To date we have used SemWeb for TAT
For companies to fully benefit from the Semantic Web approach, new software tools need to be introduced at various layers of an application stack. Databases, middleware, and applications must be enhanced in order to be able to work with RDF, OWL, and SPARQL. As application stacks become Semantic Web enabled, they form a flexible information bus to which new components can be added. As the Semantic Web continues to mature, there is an increasing range of data repositories available that are able to store RDF and OWL. The earliest implementations of RDF repositories, or triple stores as they are also known, were primarily open source. The technology providers adopted varying architectural approaches. For example, 3store (http://threestore.sourceforge.net/) employs an in-memory representation, while Sesame (http://sourceforge.net/projects/sesame/) and Kowari (http://kowari.org) use a disk-based solution. More recently a number of commercial triple stores have become available. These include RDF Gateway from Intellidimension (http://www.intellidimension.com/), Virtuoso from OpenLink (http://virtuoso.openlinksw.com), Profium Metadata Server from Profium (http://www.profium.com/index.php?id=485), and the Oracle Database from Oracle (http://www.oracle.com/technology/tech/semantic_technologies). An interesting trend is that solutions from both Oracle and OpenLink enable RDF/OWL to be stored in a triple store alongside relational and XML data. This type of hybrid solution enables a single query to span multiple different data representations.
Underlying technologies:
- Relational database (DB2, Oracle, MySQL): RDF triples are stored in a table (subject, predicate, object, named graph). Space is saved by normalizing URIs and strings to integer ids. Extra tables support history, ACLs, and replication.
- J2EE (Jetty, Tomcat, WebSphere). Jetty: standalone server; check out from CVS and run for testing. WAS: enterprise-ready web-application server for real deployment.
- JMS server (ActiveMQ, WebSphere MQ): pub-sub messaging used for real-time notifications of triple updates.
Named graphs: A named graph is the logical unit of RDF storage in Boca and the first-order unit of data access in the Boca programming model. Each triple exists in exactly one named graph; the same S,P,O in two different graphs implies two separate statements. Adding and removing triples is done in the context of a named graph. Each named graph has a metadata graph, containing information such as ACLs. Named graphs can be exposed via URLs, Web Services, or LSIDs.
Replication: Boca clients have a persistent local RDF store that mirrors a subset of the triples on the Boca server. The replicated subset is specified by triple patterns (e.g., (<http://tdwg.org/meetings/GUID-2#>, <http://tdwg.org/preds/hasParticipant>, *)), by named graph URIs, or by triple patterns within named graphs. When a replication is initiated, the service computes what has changed in the subset based on pattern and graph subscriptions. Replication can work as a background process on the client, or be explicitly initiated. Applications can query/write against graphs in the local and server models.
Notification: Updates to named graphs on the server are published in near real-time to clients, so local replicas can be kept up-to-date between replications. Notification is central to distributed RDF applications, e.g., workflow and collaboration.
Access controls: Boca users can have the following system-wide permission: canInsertNamedGraphs -- a user must have this permission in order to create a new named graph (i.e., insert statements into a graph that does not yet exist in the system). Boca users can have the following per-named-graph permissions (these apply also to the system graph): canRead -- view the triples in the named graph and in its metadata graph; canAdd -- insert new triples into the named graph; canRemove -- remove triples from the named graph; canChangeNamedGraphACL -- change the ACL triples in the metadata graph; canRemoveNamedGraph -- entirely remove the named graph from the system.
Versioning: An SVN-like approach. When a triple is added to or removed from a named graph, a new revision of that named graph is created, and a simple API reads old revisions. This provides a straightforward mechanism for concurrent distributed computing: when a client submits an update to a named graph, it may specify the version number that it currently has, and the update fails if the graph has been more recently modified.
Query: Users may query Boca in a variety of ways: the complete database, a subset of named graphs, a particular named graph, or the local store.
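The named-graph and versioning model described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Boca's actual API: each triple lives in exactly one named graph, and each graph carries a revision number that supports the optimistic-concurrency check.

```python
# Toy named-graph store: per-graph triple sets plus a revision counter,
# so a stale client update can be rejected (optimistic concurrency).

class NamedGraphStore:
    def __init__(self):
        self.graphs = {}     # graph URI -> set of (s, p, o)
        self.revisions = {}  # graph URI -> revision number

    def add(self, graph, triple, expected_rev=None):
        # Reject the update if the graph moved past the client's revision.
        if expected_rev is not None and self.revisions.get(graph, 0) != expected_rev:
            raise ValueError("stale revision: graph was modified concurrently")
        self.graphs.setdefault(graph, set()).add(triple)
        self.revisions[graph] = self.revisions.get(graph, 0) + 1

    def query(self, s=None, p=None, o=None, graph=None):
        # Query one graph or all graphs; None acts as a wildcard.
        targets = [graph] if graph else list(self.graphs)
        return [(g, t) for g in targets for t in self.graphs.get(g, set())
                if all(x is None or x == y for x, y in zip((s, p, o), t))]

store = NamedGraphStore()
store.add("urn:g1", ("ex:a", "ex:knows", "ex:b"))
store.add("urn:g2", ("ex:a", "ex:knows", "ex:b"))  # same S,P,O, separate statement
```

Note how the same S,P,O in two graphs yields two results on a store-wide query, exactly the "two separate statements" semantics described above.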
Maintains fidelity (user-specified lexical form). Supports long literal values. In Oracle 10g Release 2, there is support for both RDF and RDFS. The implementation takes advantage of the database's object-relational capabilities. All RDF triples are parsed out of documents and stored in the system as entries in tables under the MDSYS schema. An RDF triple is treated as one database object; a single RDF document that contains multiple triples will, therefore, result in many database objects. The user interacts with the triples at an object level, and the underlying implementation is transparent. A key feature of Oracle's RDF storage is that subject and object nodes are stored only once, regardless of the number of times they participate in triples. Subject and object nodes are reused if they already exist in the database. A new link, however, is always created whenever a new triple is inserted. When a triple is deleted from the database, the corresponding link is directly removed; the nodes attached to this link are not removed, however, if at least one other link is connected to them. A key component of the RDF Data Model is also its ability to store RDF graphs in separate models, thereby allowing the graphs to be partitioned. Bulk load and incremental DML load are supported. In order to be able to query the RDF data in the database, we have extended SQL using the database's table function capability. The query capability has been designed to meet most of the requirements identified by W3C in SPARQL for graph querying. The table function allows a graph query to be embedded in a SQL query. It has the ability to search for an arbitrary pattern against the RDF data, including inferencing based on RDF, RDFS, and user-defined rules.
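The node-reuse scheme just described can be illustrated with a small sketch (hypothetical code, not Oracle's implementation): nodes are interned once under integer ids, every inserted triple creates a new link, and deleting a triple removes a node only when no other link still references it.

```python
import itertools

# Toy interned triple store: each distinct node is stored once and
# referenced by an integer id; links are (sid, pid, oid) tuples.

class InternedStore:
    def __init__(self):
        self.node_ids = {}            # lexical value -> integer id
        self.nodes = {}               # integer id -> lexical value
        self.links = set()            # set of (sid, pid, oid)
        self._next = itertools.count()

    def _intern(self, value):
        """Store a node once; reuse its id on later insertions."""
        if value not in self.node_ids:
            nid = next(self._next)
            self.node_ids[value] = nid
            self.nodes[nid] = value
        return self.node_ids[value]

    def insert(self, s, p, o):
        self.links.add((self._intern(s), self._intern(p), self._intern(o)))

    def delete(self, s, p, o):
        link = (self.node_ids[s], self.node_ids[p], self.node_ids[o])
        self.links.discard(link)
        for nid in set(link):  # drop only nodes no longer used by any link
            if not any(nid in l for l in self.links):
                del self.node_ids[self.nodes.pop(nid)]
```

Deleting one of two triples that share a subject removes the link but leaves the shared node in place, which is the behavior described for Oracle's storage above.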
D2R MAP is a declarative language for describing mappings between relational database schemata and OWL/RDFS ontologies. The mappings can be used by a D2R processor to export data from a relational database into RDF. The mapping process executed by the D2R processor has four logical steps: 1. Selection of a record set from the database using SQL. 2. Grouping of the record set by the d2r:groupBy columns. 3. Creation of class instances and identifier construction. 4. Mapping of the grouped record set data to instance properties. As Semantic Web technologies mature, there is a growing need for RDF applications to access the content of non-RDF, legacy databases without having to replicate the whole database into RDF. D2RQ is a declarative language for describing such mappings; it allows RDF applications to access the content of huge, non-RDF databases using Semantic Web query languages like SPARQL.
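The four logical steps can be sketched as follows. The table, columns, and URIs are hypothetical, and this is plain Python rather than the D2R processor, but the select-group-mint-map flow is the same.

```python
from collections import defaultdict

# Step 1: record set, as if selected from the database via SQL.
records = [
    {"paper_id": 1, "title": "SW in Practice", "author": "Sheth"},
    {"paper_id": 1, "title": "SW in Practice", "author": "Stephens"},
]

# Step 2: group the record set by the d2r:groupBy column.
groups = defaultdict(list)
for rec in records:
    groups[rec["paper_id"]].append(rec)

# Steps 3 and 4: mint one instance URI per group (identifier construction),
# then map the grouped data to instance properties.
triples = []
for pid, recs in groups.items():
    uri = f"http://example.org/paper/{pid}"
    triples.append((uri, "rdf:type", "ex:Paper"))
    triples.append((uri, "ex:title", recs[0]["title"]))
    for rec in recs:  # multi-valued property: one triple per row
        triples.append((uri, "ex:author", rec["author"]))
```

The grouping step is what turns two relational rows into a single instance with a multi-valued `ex:author` property.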
Although the ability to store data in RDF and OWL is very important, it is expected that much data will remain in relational and XML data formats. Consequently, tools have been developed that allow data in these formats to be made available as RDF. The most commonly used approach for mapping between relational representations and RDF is D2RQ (http://sites.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/), which treats non-RDF relational databases as virtual RDF graphs. Other toolkits for mapping relational data to RDF include SquirrelRDF (http://jena.sourceforge.net/SquirrelRDF/) and DartGrid (http://ccnt.zju.edu.cn/projects/dartgrid/dgv3.html). Alternative approaches for including relational data within the Semantic Web include FeDeRate (http://www.w3.org/2003/01/21-RDF-RDB-access/), which provides mappings between RDF queries and SQL queries, and SPASQL (http://www.w3.org/2005/05/22-SPARQL-MySQL/XTech.html), which eliminates the query-rewriting phase by modifying MySQL to parse SPARQL directly within the relational database server. It is also important to be able to map XML documents and XHTML pages into RDF. W3C will soon have a candidate recommendation, Gleaning Resource Descriptions from Dialects of Languages (GRDDL) (http://www.w3.org/2004/01/rdxh/spec), that enables XML documents to designate how they can be translated into RDF, typically by indicating the location of an XSL Transformation (XSLT). Work is also underway to develop mapping tools from other data representations to RDF (http://esw.w3.org/topic/ConverterToRdf). The ability to map data into RDF from relational, XML, and other data formats, combined with the data integration capabilities of RDF, makes it exceptionally well positioned as a common language for data representation and query.
Protégé is a free, open-source platform that provides a growing user community with a suite of tools to construct domain models and knowledge-based applications with ontologies. At its core, Protégé implements a rich set of knowledge-modeling structures and actions that support the creation, visualization, and manipulation of ontologies in various representation formats. Protégé can be customized to provide domain-friendly support for creating knowledge models and entering data. Further, Protégé can be extended by way of a plug-in architecture and a Java-based Application Programming Interface (API) for building knowledge-based tools and applications. SWOOP is meant for rapid and easy browsing and development of OWL ontologies. The GrOWL project aims to reconcile modern ontology languages with the philosophy of semantic networks and concept maps by providing a set of graphical idioms that cover almost every knowledge-modeling construct, presenting a "knowledge space" to users in an intuitive graphical translation. GrOWL provides a graphical browser and an editor of OWL ontologies that can be used stand-alone or embedded in a web browser. While it is in alpha stage, it is already being used in a number of projects, including the Ecosystem Services Database and SEEK. GrOWL is open source, written in Java, and can be downloaded from its site. TopBraid Composer™ is an enterprise-class platform for developing Semantic Web ontologies and building semantic applications. Fully compliant with W3C standards, Composer offers comprehensive support for developing, managing, and testing configurations of knowledge models and their instance knowledge bases. Composer provides a flexible and extensible framework with a published API for developing semantic client/server or browser-based solutions that can integrate disparate applications and data sources.
IODT is a toolkit for ontology-driven development. It includes the EMF Ontology Definition Metamodel (EODM), the EODM workbench, and an OWL ontology repository (named Minerva). OntoTrack is a new browsing-and-editing "in one view" ontology authoring tool for OWL (currently a subset of OWL Lite called OWL Lite-). It combines a sophisticated graphical layout with mouse-enabled editing features optimized for efficient navigation and manipulation of large ontologies. SemanticWorks™ 2007 (from Altova) provides powerful, easy-to-use functionality for: visual creation and editing of RDF, RDF Schema (RDFS), OWL Lite, OWL DL, and OWL Full documents using an intuitive, visual interface and drag-and-drop functionality; syntax checking to ensure conformance with the RDF/XML specifications; auto-generation and editing of RDF/XML or N-Triples formats based on visual RDF/OWL design; printing the graphical RDF and OWL representations to create documentation for Semantic Web implementations. Typical tasks in ontology engineering: author concept descriptions, refine the ontology, manage errors, integrate different ontologies, (partially) reuse ontologies. These tasks are highly challenging; hence the need for tool and infrastructure support and for design methodologies.
Bossam is a rule-based OWL reasoner (free, well-documented, closed-source). FaCT++ is an OWL DL reasoner implemented in C++. KAON2 is an infrastructure for managing OWL-DL, SWRL, and F-Logic ontologies; it is capable of manipulating OWL-DL ontologies, and queries can be formulated using SPARQL. Pellet is an open-source, Java-based OWL DL reasoner. It can be used in conjunction with both the Jena and OWL API libraries; it can also be downloaded and included in other applications. RacerPro is an OWL reasoner and inference server for the Semantic Web.
Zitgist (Zitgist LLC, USA) SWSE (DERI, Ireland) Swoogle (UMBC)
If a company has been able to consistently assign URIs across all of its data sources, then integrating data is straightforward with RDF. However, if different departments initially assigned different URIs to the same entity, then OWL constructs can be used to relate equivalent URIs. Several vendors provide middleware designed for such ontology-based data integration. These include OntoBroker (http://www.ontoprise.de/content/e1171/e1231/), TopBraid Suite (http://www.topquadrant.com/tq_topbraid_suite.htm), Protégé (http://protege.stanford.edu/), and Cogito Data Integration Broker (http://www.cogitoinc.com/databroker.html). A number of organizations have developed toolkits for building Semantic Web applications. The most commonly used is the Jena framework for Java-based Semantic Web applications (http://jena.sourceforge.net/). It provides a programmatic environment for reading, writing, and querying RDF and OWL. A rule-based inference engine is provided, as well as standard mechanisms for using other reasoning engines such as Pellet (http://pellet.owldl.com/) or FaCT++ (http://owl.man.ac.uk/factplusplus/). Sesame is another frequently used Java-based Semantic Web application toolkit that provides support for both query and inference (http://www.openrdf.org/about.jsp). A further commonly used application is the Haystack Semantic Web Browser, which is based on the Eclipse platform (http://haystack.csail.mit.edu/staging/eclipse-download.html). There are also many programming environments available for developing Semantic Web applications in a range of languages (http://esw.w3.org/topic/SemanticWebTools). Ontology-based integration: Protégé is a free, open-source platform that provides a growing user community with a suite of tools to construct domain models and knowledge-based applications with ontologies.
RDF browsers: Longwell is a web-based, RDF-powered, highly configurable faceted browser. SW browsers: Haystack. webMethods: using Semantic technologies (RDF/OWL), this metadata allows developers to find assets by context (rather than just name) and to perform complex searches to ascertain relationships and dependencies between assets. In a SOA environment, where services might have complicated interdependencies with other services, this capability is vital for allowing a developer to understand the impact if a given service is changed (for example, knowing which business process depends on the VerifyCredit service). This ability to provide context helps organizations manage the complexity of their environment, which is especially important as more moving parts are introduced. Exhibit is a lightweight structured-data publishing framework that lets you create web pages with support for sorting, filtering, and rich visualizations by writing only HTML and optionally some CSS and JavaScript code.
It's like Google Maps and Timeline, but for structured data normally published through database-backed web sites. Exhibit essentially removes the need for a database or a server-side web application. Its JavaScript-based engine makes it easy for everyone who has a little knowledge of HTML and small data sets to share them with the world and let people easily interact with them. Haystack is a tool designed to let individuals manage all their information in ways that make the most sense to them. By removing arbitrary barriers created by applications that handle only certain information "types" and that record only a fixed set of relationships defined by the developer, we aim to let users define whichever arrangements of, connections between, and views of information they find most effective. Such personalization of information management will dramatically improve everyone's ability to find what they need when they need it.
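Returning to the integration point above, where different departments assign different URIs to the same entity: the effect of owl:sameAs assertions can be sketched with a union-find structure that maps every URI to one canonical representative. The URIs below are hypothetical, and real reasoners do considerably more, but equivalence merging at its core is exactly this.

```python
# Toy owl:sameAs resolver: a union-find over URIs, so that queries can
# canonicalize any URI to one representative per real-world entity.

class SameAs:
    def __init__(self):
        self.parent = {}  # URI -> parent URI (self if canonical)

    def find(self, uri):
        self.parent.setdefault(uri, uri)
        while self.parent[uri] != uri:
            self.parent[uri] = self.parent[self.parent[uri]]  # path halving
            uri = self.parent[uri]
        return uri

    def same_as(self, a, b):
        """Record an owl:sameAs assertion between two URIs."""
        self.parent[self.find(a)] = self.find(b)

eq = SameAs()
eq.same_as("deptA:aspirin", "deptB:acetylsalicylic-acid")
eq.same_as("deptB:acetylsalicylic-acid", "deptC:ASA")
```

After the two assertions, all three URIs resolve to the same canonical representative, so a query over merged data treats them as one entity.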
Options 1 and 2 don't scale! Option 3: use conventions, e.g., class names or meta elements. This is similar to what microformats do (without referring to RDF, though). They may not extract RDF but instead use the data directly in Web 2.0 applications, though the application is not that different; other applications may extract it to yield RDF (e.g., RSS 1.0). Option 4: GRDDL formalizes the scraper approach. The user has to provide dc-extract.xsl and use its conventions (making use of the corresponding meta elements, class ids, etc.), but by using the profile attribute, a client is instructed to find and run the transformation processor automatically. There is a mechanism for XML in general; a transformation can also be defined at the XML schema level. A bridge to microformats. Option 5: RDFa extends (X)HTML a bit by defining general attributes to add metadata to any element (a bit like the class attribute in microformats, but via dedicated properties). It provides an almost complete serialization of RDF in XHTML, and the same mechanism can also be used for general XML content. It's like microformats but with more rigor and more generality; it makes it easy to mix vocabularies, which is not that easy with microformats, and it can easily be combined with GRDDL. Both GRDDL and RDFa aim at "binding" existing structured data with RDF: GRDDL brings structured data to RDF; RDFa brings RDF to structured data. The same URI may be interpreted as a web page to be displayed by a browser, or as RDF data to be integrated. Most data is in relational databases, and RDFying them wholesale is impossible. Bridges are being defined: a layer between RDF and the database maps rdb tables to RDF graphs on the fly. In some cases the mapping is generic (columns represent properties; cells are, e.g., literals or references to other tables via blank nodes); in other cases separate mapping files define the details. This is a very important source of RDF data. Part of the metadata information is present in tools but thrown away at output. E.g., Photoshop CS stores metadata in RDF. RSS 1.0 is a huge amount of RDF.
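The scraper step that GRDDL formalizes can be sketched with Python's standard html.parser. This is a toy extractor, not a GRDDL processor: the `<this page>` subject is a placeholder for the page's URI, and the sample page is hypothetical. Dublin Core meta elements become RDF-style triples.

```python
from html.parser import HTMLParser

# Toy Dublin Core scraper: turn <meta name="DC.x" content="..."> elements
# into (subject, predicate, object) triples, GRDDL-scraper style.

class DCExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.triples = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").startswith("DC."):
            # Map the DC.* convention onto the Dublin Core namespace.
            prop = "http://purl.org/dc/elements/1.1/" + a["name"][3:].lower()
            self.triples.append(("<this page>", prop, a.get("content", "")))

page = """<html><head>
<meta name="DC.title" content="Semantic Web Tutorial">
<meta name="DC.creator" content="Sheth and Stephens">
</head><body></body></html>"""

ex = DCExtractor()
ex.feed(page)
```

GRDDL replaces this hand-written convention with a declared XSLT, so the client discovers the transformation via the profile attribute instead of hard-coding it.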
DBpedia.org is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia and to link other datasets on the Web to Wikipedia data.