Triplestore and SPARQL

  1. TRIPLESTORE AND SPARQL
     Lino Valdivia Jr, 04.06.2013
  2. OUTLINE
     • The Semantic Web
     • RDF
     • SPARQL
     • Triplestores
     • Apache Jena
     • DBPedia
     • Conclusions
     • Demo 1: Apache Jena API
     • Demo 2: DBPedia
  3. THE SEMANTIC WEB
     Most of the data on the web is unstructured or semi-structured:
     • HTML documents
     • Multimedia: images, video streams, audio files
     • Meant to be read and processed by humans
     What if we could structure this “Web of Documents” and add metadata to it, making it understandable by machines?
     • Metadata → meaning, or semantics
     • Machines can perform new tasks that used to require human intervention
     This is the motivation behind the Semantic Web!
     • The term “Semantic Web” was coined by Tim Berners-Lee: “a web of data that can be processed directly and indirectly by machines.”
  4. THE SEMANTIC WEB
     “The Semantic Web is a web of data…[it] provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries.” [w3.org]
     For the Semantic Web to happen, we would need:
       1. A way to structure and link data in a standardized way
       2. A way to describe the relationships between these data in a common way
       3. A way to query that linked data
       4. A way to infer something from that linked data (by applying a set of rules)
     We will focus only on #1 and #3.
  5. RDF: A WAY TO STRUCTURE AND LINK DATA
     RDF = Resource Description Framework, a standard way for applications to represent information that can then be shared and processed.
     A resource can be anything that is identifiable: a user, a coffee cup, a picture of your cat, a bank statement.
     RDF models data by breaking it down into three components:
     • The subject
     • The object
     • The predicate (aka the property)
  6. RDF AS A GRAPH
     Consider the following statement: Jordi lives in Barcelona
     • Subject: Jordi
     • Object: Barcelona
     • Predicate: lives-in (or, to be more precise, address-city)
     RDF statements are typically represented as a labeled directed graph:
     • The arrow points from the subject to the object
     [Figure: node “Jordi” → node “Barcelona”, with the edge labeled “address-city”]
  7. RDF AND URIS
     Resources must be identifiable, and RDF uses Uniform Resource Identifier (URI) references to identify them.
     E.g. Jordi = http://example.org/Jordi
     URIs are not the same thing as URLs!
     RDF graphs are typically shown with the URIs for the subject, object, and predicate.
     [Figure: node http://.../Jordi → node http://.../Barcelona, with the edge labeled http://.../address-city]
     The RDF graph can also be rewritten as text:
         <http://example.org/Jordi> <http://example.org/address-city> <http://example.org/Barcelona> .
     As you may have guessed, RDF is more machine-friendly than human-friendly!
  8. RDF: RESOURCES AND LITERALS
     The object of a triple in RDF can be either a resource (identified by a URI) or a literal (a value such as a string or a number).
     [Figure: node http://.../Jordi with three outgoing edges: http://.../address-city → http://.../Barcelona, http://.../firstname → “Jordi”, http://.../age → 37]
     We can represent the RDF graph above as text:
         <http://example.org/Jordi> <http://example.org/address-city> <http://example.org/Barcelona> .
         <http://example.org/Jordi> <http://example.org/firstname> "Jordi" .
         <http://example.org/Jordi> <http://example.org/age> 37 .
     This textual representation is known as Terse RDF Triple Language, or Turtle for short.
  9. RDF: PREFIXES
     Prefixes can be used to simplify representations, either in graphs:
         prefix ex: http://example.org/
     [Figure: node ex:Jordi with three outgoing edges: ex:address-city → ex:Barcelona, ex:firstname → “Jordi”, ex:age → 37]
     or in Turtle:
         @prefix ex: <http://example.org/> .
         ex:Jordi ex:address-city ex:Barcelona .
         ex:Jordi ex:firstname "Jordi" .
         ex:Jordi ex:age 37 .
     Now that we have a way to structure and link our data, we want to be able to query it for information.
  10. SPARQL: A WAY TO QUERY LINKED DATA
      SPARQL = SPARQL Protocol and RDF Query Language
      SPARQL 1.1 became a W3C Recommendation in March 2013!
      Example: given our RDF graph, show the first names of all users who live in Barcelona:
          PREFIX ex: <http://example.org/>
          SELECT ?fname
          FROM <users.rdf>
          WHERE {
            ?user ex:address-city ex:Barcelona .
            ?user ex:firstname ?fname .
          }
  11. SPARQL AND GRAPH PATTERNS
      The statements in the WHERE clause form a graph pattern, which is matched against subgraphs of the RDF graph to form the solution.
          PREFIX ex: <http://example.org/>
          SELECT ?fname
          FROM <users.rdf>
          WHERE {
            ?user ex:address-city ex:Barcelona .
            ?user ex:firstname ?fname .
          }
      [Figure: the example graph. ex:Jordi has ex:address-city ex:Barcelona, ex:firstname “Jordi”, and ex:age 37; ex:Josep has ex:address-city ex:Badalona. Only the ex:Jordi subgraph matches the pattern.]
  12. SPARQL: THE SELECT OPERATION
      SPARQL SELECT operations also support FILTER, ORDER BY, LIMIT, and OFFSET.
      Show the names of users who live in Barcelona and are less than 40 years old, from the 11th through the 40th user:
          PREFIX ex: <http://example.org/>
          SELECT ?lname ?fname
          FROM <users.rdf>
          WHERE {
            ?user ex:address-city ex:Barcelona .
            ?user ex:firstname ?fname .
            ?user ex:lastname ?lname .
            ?user ex:age ?age .
            FILTER (?age < 40)
          }
          ORDER BY ?lname
          LIMIT 30
          OFFSET 10
  13. SPARQL: THE SELECT OPERATION
      SPARQL SELECT operations also support alternative matches using UNION, for cases where resources in the expected result set may match any of several patterns.
      Show the first names of users who live in Barcelona or in Badalona:
          PREFIX ex: <http://example.org/>
          SELECT ?fname
          FROM <users.rdf>
          WHERE {
            ?user ex:firstname ?fname .
            {
              { ?user ex:address-city ex:Barcelona . }
              UNION
              { ?user ex:address-city ex:Badalona . }
            }
          }
  14. SPARQL: THE SELECT OPERATION
      SPARQL SELECT operations also support OPTIONAL matches, for cases where a resource can appear in the result set even if it does not match part of the pattern.
      Show the first names of users who live in Barcelona, along with their profile picture if they have one:
          PREFIX ex: <http://example.org/>
          SELECT ?fname ?ppic
          FROM <users.rdf>
          WHERE {
            ?user ex:address-city ex:Barcelona .
            ?user ex:firstname ?fname .
            OPTIONAL { ?user ex:ppic ?ppic . }
          }
  15. SPARQL: THE SELECT OPERATION
      SPARQL SELECT operations also support:
      • Set inclusion (IN / NOT IN)
      • GROUP BY, HAVING, and aggregate functions such as COUNT and AVG (new in SPARQL 1.1)
      • Subqueries (new in SPARQL 1.1)
  16. SPARQL: OTHER OPERATIONS
      Aside from SELECT for querying, SPARQL also has:
      • CONSTRUCT: creates a single RDF graph from the result of a query by combining (i.e. applying set union on) the resulting triples
      • ASK: returns a Boolean that indicates whether the query pattern has at least one solution
      • DESCRIBE: returns an RDF graph that describes the result (as determined by the query service)
      • INSERT/DELETE: adds or removes triples from the graph (new in SPARQL 1.1)
      • Graph management operations (CREATE, DROP, COPY, MOVE, ADD) (new in SPARQL 1.1)
  17. TRIPLESTORES
      The statements in an RDF graph (subject-predicate-object) are also known as triples, and the specialized databases used for storing them are called triplestores.
      Triplestores vs. graph databases: what's the difference?
      • Triplestores are designed specifically to store RDF graphs, which are labeled directed graphs
      • Graph databases, on the other hand, can store any kind of graph (unlabeled, undirected, weighted, etc.)
      • Graph databases don't have a standard query language (Cypher?)
      • Triplestores must support SPARQL
      • Triplestores are optimized for graph pattern matching, and may lack the full capabilities of graph DBs
      • But graph databases can be used to implement a triplestore
      (see Sequeda, J. (2013, January 31). Introduction to Triplestores)
  18. SPARQL AND CYPHER
      SPARQL:
          PREFIX ex: <http://example.org/>
          SELECT ?fname
          FROM <users.rdf>
          WHERE {
            ?user ex:address-city ex:Barcelona .
            ?user ex:firstname ?fname .
          }
      Cypher:
          MATCH (user)-[:ex_firstname]->(fname),
                (user)-[:`ex_address-city`]->(city)
          WHERE city.uri = "ex:Barcelona"
          RETURN fname
      [Figure: node ex:Jordi with three outgoing edges: ex:address-city → ex:Barcelona, ex:firstname → “Jordi”, ex:age → 37]
  19. TRIPLESTORE IMPLEMENTATIONS
      Native triplestores:
      • Sesame
      • BigData
      • Meronymy
      • Apache Jena TDB
      Graph DB-based:
      • AllegroGraph
      • Oracle Spatial and Graph (formerly Oracle Semantic Technologies)
      Relational DB-based:
      • Apache Jena SDB
      • IBM DB2
  20. APACHE JENA
      Born in HP Labs in 2000, Jena became a top-level Apache project in April 2012.
      The Jena Framework includes:
      • A Java API for working with RDF models
      • A SPARQL query processor
      • An efficient disk-based native triplestore
      • A rule-based inference engine that can be used with RDF-based ontologies
      • A server for accepting SPARQL queries over HTTP (a SPARQL endpoint)
  21. APACHE JENA: RDF API
      The Statement interface represents triples, while the Model interface represents the whole RDF graph.
      Given a Statement, one can invoke:
      • getSubject(), which returns a Resource
      • getPredicate(), which returns a Property
      • getObject(), which returns an RDFNode (which can be a Resource or a Literal)
      To create our basic example RDF graph:
          // Imports: org.apache.jena.rdf.model.* (com.hp.hpl.jena.rdf.model.* in Jena 2.x)
          Model model = ModelFactory.createDefaultModel();
          Resource j = model.createResource("http://example.org/Jordi");
          Resource bcn = model.createResource("http://example.org/Barcelona");
          // createProperty takes the namespace URI (not the prefix) and the local name.
          Property addrCity = model.createProperty("http://example.org/", "address-city");
          // This automatically creates a Statement in the associated model.
          j.addProperty(addrCity, bcn);
  22. APACHE JENA: ARQ API
      Jena also provides an API called ARQ for programmatically executing SPARQL queries.
      To execute a given query on our example graph:
          String queryString = "...";
          Query query = QueryFactory.create(queryString);
          // Bind the query to our model to get an execution context.
          QueryExecution qe = QueryExecutionFactory.create(query, model);
          ResultSet rs = qe.execSelect();
          // ResultSet acts like an Iterator.
          while (rs.hasNext()) {
              QuerySolution qs = rs.nextSolution();
              RDFNode r = qs.get("fname"); // You can get a variable by name.
              // Do what you want with it.
          }
          // Always good to close resources when done.
          qe.close();
  23. APACHE JENA: TDB
      Jena's native triplestore implementation is called TDB. It consists of:
      • The node table
          • stores resources, predicates (relationships), and literals
          • maps nodes to internal node ids, and vice versa
          • node ids are 8 bytes (64 bits) long
      • The triple indexes
          • three indexes into the node table
      • The prefixes table
          • maps prefixes to URIs
      TDB also supports ACID transactions using write-ahead logging.
      • But no transaction is needed if there is only a single writer (even with multiple concurrent readers)
  24. RDF/SPARQL IN ACTION: DBPEDIA.ORG
      DBPedia describes itself as a “crowdsourced community effort to extract structured information from Wikipedia”.
      • 1.89 billion triples localized in 111 languages
      • English dataset contains 3.77 million topics
      Imagine if you could ask Wikipedia…
      • Which towns in Cataluña have a population between 10,000 and 50,000 people?
      • What are the birthdays of all blues guitarists who were born in Chicago?
      • (Sample query from the DBPedia.org wiki) Show me all soccer players who played as goalkeeper for a club that has a stadium with more than 40,000 seats and who were born in a country with more than 10 million inhabitants
      DBPedia also provides a SPARQL endpoint, so other websites can query its data and get results that are continuously updated.
      DBPedia also contains geo-coordinates obtained from other sources (e.g. Geonames, Eurostat, CIA World Fact Book); this opens up the possibility of location-based applications on mobile devices.
  25. CONCLUSIONS
      • The Semantic Web: Web 3.0?
      • RDF and SPARQL are key technologies in the W3C's vision of the web of tomorrow
      • Companies like Google, Tesco, and Best Buy already produce RDF content! (Source: w3.org)
      • Add some SPARQL to your projects!
  26. BIBLIOGRAPHY
      • Berners-Lee, T., Hendler, J., & Lassila, O. (2001, May). The Semantic Web. http://www.scientificamerican.com/article.cfm?id=the-semantic-web
      • W3 Consortium. (2004, February 10). RDF Primer. http://www.w3.org/TR/2004/REC-rdf-primer-20040210/
      • W3 Consortium. (2013, March 21). SPARQL 1.1 Query Language. http://www.w3.org/TR/sparql11-query/
      • Sequeda, J. (2013, January 31). Introduction to Triplestores. http://semanticweb.com/introduction-to-triplestores_b34996
      • Apache Jena. http://jena.apache.org/
      • DBPedia. http://dbpedia.org/
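
The graph pattern matching described in slide 11 can be illustrated with a small Python sketch. This is a toy model for intuition only, not how Jena or any real SPARQL engine is implemented: the names `GRAPH`, `match_pattern`, and `match_bgp` are hypothetical, terms are plain strings with the `ex:` prefix left unexpanded, and there is no indexing or optimization.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

# Toy copy of the example graph from the slides (terms kept as plain strings).
GRAPH: List[Triple] = [
    ("ex:Jordi", "ex:address-city", "ex:Barcelona"),
    ("ex:Jordi", "ex:firstname", "Jordi"),
    ("ex:Jordi", "ex:age", "37"),
    ("ex:Josep", "ex:address-city", "ex:Badalona"),
]


def is_var(term: str) -> bool:
    """Variables are written SPARQL-style, with a leading question mark."""
    return term.startswith("?")


def match_pattern(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    extended = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # A variable must keep the same value everywhere it appears.
            if extended.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return extended


def match_bgp(patterns: List[Triple], graph: List[Triple]) -> List[Binding]:
    """Match all patterns conjunctively (AND), as in a WHERE clause."""
    solutions: List[Binding] = [{}]
    for pattern in patterns:
        next_solutions: List[Binding] = []
        for binding in solutions:
            for triple in graph:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


if __name__ == "__main__":
    where = [
        ("?user", "ex:address-city", "ex:Barcelona"),
        ("?user", "ex:firstname", "?fname"),
    ]
    for solution in match_bgp(where, GRAPH):
        print(solution["?fname"])
```

Each pattern narrows or extends the set of candidate bindings, which is why the two statements in the WHERE clause behave as a conjunction: `?user` must be bound to the same resource in both.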

Editor's notes

  • A URI can be used to identify a resource by name or location (or both). If it specifies a location, it's referred to as a URL. When used as a name, it's referred to as a URN.
  • Officially the W3C proposes RDF/XML as the syntax to use when serializing RDF graphs, but Turtle was found to be easier to read and edit. RDF/XML's “lack of transparency and readability might have been a factor inhibiting rapid adoption of RDF” [Shadbolt, N; Hall, W; Berners-Lee, T, The Semantic Web Revisited, 2006]
    Turtle is related to two other notations for triples, N-Triples and N3, by the following subset relation:
    N-Triples ⊂ Turtle ⊂ N3
    N-Triples is more minimalistic, while N3 can be used to express more than just RDF (http://en.wikipedia.org/wiki/Turtle_%28syntax%29)
  • By default the statements in the WHERE clause are conjunctions (AND)
  • Graph CREATE: creates an empty named graph
    Graph DROP: removes a named graph
    Graph COPY(g1, g2): overwrites the contents of g2 with the contents of g1 (similar to DROP g2 followed by INSERT ALL (g1, g2) – g1 is not modified by this operation)
    Graph MOVE(g1, g2): overwrites the contents of g2 with the contents of g1, then g1 is DROPped
    Graph ADD(g1, g2): inserts the triples from g1 into g2 (g1 is not modified by this operation)
  • Cypher is a de facto standard, but is still mostly associated with Neo4j
  • There are similarities, but Cypher has a lot of other features suitable for graph databases (e.g. find shortest path, find nodes that are n hops away from the start node, etc.)
    Note that most SPARQL queries expect to scan the graph for the result, while most Cypher queries typically specify a start node. This is not really an issue since specifying a start node is optional in Cypher anyway (http://docs.neo4j.org/chunked/milestone/query-start.html)
  • Apache Jena provides its own native triplestore implementation as well as an API for leveraging relational stores (PostgreSQL, MySQL, Oracle, Microsoft SQL Server, etc)
    Sesame is a Java-based implementation maintained by openrdf.org
  • Jena entered Apache incubation in November 2010
  • The RDF API is part of Jena’s Core API jar
  • The QueryExecutionFactory interface also has methods for binding a Query to a SPARQL (HTTP) endpoint, allowing applications to query remote triplestores
  • The node table is stored as a sequential access file (for NodeId -> Node mappings) and a B+Tree (for Node -> NodeId)
    In write-ahead logging, changes to be made to the database are first recorded in logs (in the form of redo and undo logs). In Jena, modifications made in a transaction are written to a journal (a redo log), which is later committed to disk.
    Jena TDB has been tested to hold up to 1.7B triples (http://www.w3.org/wiki/LargeTripleStores#Jena_TDB_.281.7B.29)
  • DBPedia uses OpenLink Virtuoso as its triplestore
  • Other SW technologies:
    RDF Schema and Web Ontology Language (OWL) provide a richer set of semantics (vocabularies) for describing a group of related concepts: genealogies (e.g. isMother, hasChildren), application-specific class hierarchies (through rdf:type, rdfs:subClassOf, etc.)
    Rule Interchange Format (RIF) to facilitate the exchange of rules across different systems (rule engines)
    Trust and Provenance (how can we establish that an RDF source is trustworthy? Can you prove how derived (inferred) semantics were obtained?)
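
The graph management operations summarized in the notes above (CREATE, DROP, COPY, MOVE, ADD) can be modeled as set operations on a dictionary of named graphs. The following Python sketch is only an illustration of those semantics as the notes describe them: the function names and the `Dataset` type are hypothetical, error handling is omitted, and treating a COPY or MOVE of a graph onto itself as a no-op is an assumption of this sketch.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]
Dataset = Dict[str, Set[Triple]]  # graph name -> set of triples


def create(ds: Dataset, name: str) -> None:
    """CREATE: make an empty named graph (kept as-is if it already exists)."""
    ds.setdefault(name, set())


def drop(ds: Dataset, name: str) -> None:
    """DROP: remove a named graph."""
    ds.pop(name, None)


def add(ds: Dataset, src: str, dst: str) -> None:
    """ADD: insert the triples of src into dst; src is not modified."""
    ds.setdefault(dst, set()).update(ds[src])


def copy(ds: Dataset, src: str, dst: str) -> None:
    """COPY: overwrite dst with the contents of src; src is not modified."""
    if src != dst:  # assumption: copying a graph onto itself does nothing
        ds[dst] = set(ds[src])


def move(ds: Dataset, src: str, dst: str) -> None:
    """MOVE: overwrite dst with the contents of src, then drop src."""
    if src != dst:  # assumption: moving a graph onto itself does nothing
        copy(ds, src, dst)
        drop(ds, src)


if __name__ == "__main__":
    t1 = ("ex:Jordi", "ex:address-city", "ex:Barcelona")
    t2 = ("ex:Josep", "ex:address-city", "ex:Badalona")
    dataset: Dataset = {"g1": {t1}, "g2": {t2}}
    add(dataset, "g1", "g2")
    print(sorted(dataset["g2"]))  # g2 now holds both triples
    move(dataset, "g1", "g2")
    print(sorted(dataset))        # only g2 remains
```

The difference between ADD and COPY is the one the notes point out: ADD is a set union into the destination, while COPY discards whatever the destination held before.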