The aim of the EU FP7 Large-Scale Integrating Project LarKC is to develop the Large Knowledge Collider (LarKC, for short, pronounced “lark”), a platform for massive distributed incomplete reasoning that will remove the scalability barriers of currently existing reasoning systems for the Semantic Web. The LarKC platform is available at larkc.sourceforge.net. This talk is part of a tutorial for early users of the LarKC platform, and describes the platform architecture.
5. Plug-in WSDL description. Plug-in: + URI getIdentifier(), + QoSInformation getQoSInformation(). Plug-ins are assembled into workflows to realise a LarKC experiment or application. Plug-ins are identified by a URI (Uniform Resource Identifier). Plug-ins provide metadata about what they do (functional properties), e.g. type = Selecter. Plug-ins provide information about their behaviour and needs, including Quality of Service information (non-functional properties), e.g. Throughput, MinMemory, Cost, … Plug-ins can be provided with a Contract that tells them how to behave (e.g. Contract: “give me the next 10 results”) and Context information used to store state between invocations.
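As a rough sketch of the identification contract described above, the snippet below models a plug-in that exposes its URI and QoS properties. All class names, method shapes and property values here are illustrative assumptions, not the real LarKC API.

```java
import java.net.URI;
import java.util.Map;

// Illustrative plug-in contract: every plug-in is named by a URI and
// advertises non-functional (QoS) properties. Names are assumptions.
interface Plugin {
    URI getIdentifier();                      // unique plug-in identity
    Map<String, String> getQoSInformation();  // e.g. Throughput, MinMemory, Cost
}

class ExamplePlugin implements Plugin {
    public URI getIdentifier() {
        return URI.create("http://example.org/plugins/MySelecter"); // hypothetical URI
    }
    public Map<String, String> getQoSInformation() {
        return Map.of("Throughput", "1000 triples/s", "MinMemory", "512MB");
    }
}
```

A decider could inspect such metadata to choose between candidate plug-ins when assembling a workflow.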
16. LarKC Plug-in API: SELECT. Selecter: + SetOfStatements select(SetOfStatements theSetOfStatements, Contract contract, Context context). SELECT: given a set of statements (e.g. a number of RDF graphs), will choose a selection/sample from this set. Examples: collection of RDF graphs => triple set (merged); collection of RDF graphs => triple set (10% of each); collection of RDF graphs => triple set (N triples).
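A minimal sketch of the SELECT behaviour, with the Contract reduced to a plain limit (“give me the next N results”) and statements represented as strings. The types and names are stand-ins for the real LarKC classes, chosen for illustration only.

```java
import java.util.List;

// Toy Selecter: return at most `limit` statements from the input set.
// In LarKC terms, the limit plays the role of a Contract parameter.
class SimpleSelecter {
    static List<String> select(List<String> statements, int limit) {
        return statements.subList(0, Math.min(limit, statements.size()));
    }
}
```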
17. LarKC Plug-in API: REASON. Reasoner:
+ VariableBinding sparqlSelect(SPARQLQuery theQuery, SetOfStatements theSetOfStatements, Contract contract, Context context)
+ SetOfStatements sparqlConstruct(SPARQLQuery theQuery, SetOfStatements theSetOfStatements, Contract contract, Context context)
+ SetOfStatements sparqlDescribe(SPARQLQuery theQuery, SetOfStatements theSetOfStatements, Contract contract, Context context)
+ BooleanInformationSet sparqlAsk(SPARQLQuery theQuery, SetOfStatements theSetOfStatements, Contract contract, Context context)
REASON: executes a query against the supplied set of statements. SPARQL query => variable binding (SELECT); SPARQL query => set of statements (CONSTRUCT); SPARQL query => set of statements (DESCRIBE); SPARQL query => boolean (ASK).
18. LarKC Plug-in API: DECIDE. Decider:
+ VariableBinding sparqlSelect(SPARQLQuery theQuery, QoSParameters theQoSParameters)
+ SetOfStatements sparqlConstruct(SPARQLQuery theQuery, QoSParameters theQoSParameters)
+ SetOfStatements sparqlDescribe(SPARQLQuery theQuery, QoSParameters theQoSParameters)
+ BooleanInformationSet sparqlAsk(SPARQLQuery theQuery, QoSParameters theQoSParameters)
DECIDE: builds the workflow and manages the control flow. Scripted Decider: a predefined workflow is built and executed. Self-configuring Decider: uses plug-in descriptions (functional and non-functional properties) to build the workflow.
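To make the scripted-Decider idea concrete, the sketch below hard-codes an IDENTIFY → SELECT → REASON pipeline over plain strings standing in for RDF statements. The class, the keyword-based “identification” and the trivial ASK “reasoning” are all assumptions for illustration, not the LarKC implementation.

```java
import java.util.List;
import java.util.stream.Collectors;

// Toy scripted Decider: a fixed three-stage workflow answering an ASK query.
class ScriptedDecider {
    static boolean sparqlAsk(String keyword, List<String> corpus) {
        List<String> identified = corpus.stream()             // IDENTIFY: find candidates
                .filter(s -> s.contains(keyword))
                .collect(Collectors.toList());
        List<String> selected = identified.stream()           // SELECT: sample at most 10
                .limit(10)
                .collect(Collectors.toList());
        return !selected.isEmpty();                           // REASON: trivial ASK answer
    }
}
```

A self-configuring Decider would instead pick each stage's plug-in at run time from the registry, using the metadata and QoS properties described earlier.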
22. Participants: modified plug-ins, modified workflows. Standard open environment: Subversion connection, command-line build, or Eclipse (NetBeans soon?). [Architecture diagram: the Decider and five Plug-in Managers expose the Plug-in API to the Selecter, Query Transformer, Identifier, Reasoner and Info. Set Transformer plug-ins, backed by the Plug-in Registry and the Pipeline Support System.]
24. Plug-in API enables interoperability (between plug-in and platform and between plug-ins)
25. Plug-in I/O uses abstract data structures of RDF triples => flexibility for assembling plug-ins and for plug-in writers
27. LarKC Architecture. [Architecture diagram: an Application calls the Decider through the Plug-in API; the platform utility functionality (Pipeline Support System, Plug-in Registry, Plug-in Managers) mediates between the Plug-in APIs and the plug-ins (Query Transformer, Identifier, Selecter, Reasoner, Info. Set Transformer); the Data Layer API connects to the Data Layer (RDF stores and RDF documents), external systems and external data sources.]
28. What does a workflow look like? [Diagram: the Decider assembles registered plug-ins (Info Set Transformer, Identifier, Selecter, Query Transformer, Reasoner) into a workflow via the Plug-in Managers, Plug-in Registry and Workflow Support System, backed by an RDF store.]
29. What Does a Workflow Look Like? [Diagram: the same workflow, with RDF graphs (and a default graph) flowing between the Decider, the plug-ins and the Data Layer.]
30. LarKC Data Model: Transport by Reference. Labeled set: pointers to data. Dataset: collection of named graphs. [Diagram: a default graph and many RDF graphs grouped into labeled sets and datasets.] Current scale: O(10^10) triples.
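The point of transport by reference can be sketched as follows: rather than copying O(10^10) triples between plug-ins, a plug-in passes a label (e.g. a named-graph URI) that the data layer resolves only when the triples are actually needed. The store contents, URIs and method names below are hypothetical.

```java
import java.net.URI;
import java.util.List;
import java.util.Map;

// Toy data layer: named graphs are keyed by URI; plug-ins exchange the
// URI (the reference) and dereference it lazily via resolve().
class DataLayerSketch {
    static final Map<URI, List<String>> store = Map.of(
            URI.create("http://example.org/graphs/g1"),
            List.of("ex:s ex:p ex:o .", "ex:s ex:p ex:o2 ."));

    static List<String> resolve(URI reference) {  // by-value transfer happens here
        return store.get(reference);
    }
}
```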
31. What Does a Workflow Look Like? [Diagram: the workflow again, highlighting the RDF graphs exchanged between the plug-ins and the Data Layer.]
32. What Does a Pipeline Look Like? [Diagram: the Decider wires an Identifier and an Info Set Transformer into a pipeline via the Plug-in Managers, Plug-in Registry and Workflow Support System, over the Data Layer and an RDF store.]
33. Remote and Heterogeneous Plug-ins. [Diagram: a Remote Plug-in Manager wraps external or non-Java code behind TRANSFORM and IDENTIFY plug-ins — Research Cyc via a SPARQL-CycL adaptor, GATE via a SPARQL-GATE API; the Data Layer connects to SINDICE and medical data.]
34. What Does a Workflow Look Like? [Diagram: the same workflow, now with the Reasoner added after the Identifier and Info Set Transformer.]
47. LarKC Data Layer. [Architecture diagram: the platform picture again, now highlighting the Data Layer API and the Data Layer (RDF stores and RDF documents) together with external systems and external data sources.]
48. LarKC Data Layer. Main goal: the LarKC Data Layer supports all LarKC plug-ins with respect to storage, retrieval and light-weight inference on top of large volumes of data; it automates the exchange of RDF data by reference and by value, and offers other utility tools to manage data (e.g. a merger). [Diagram: labeled sets, a default graph and datasets of RDF graphs.]
49. LarKC Data Layer Performance. The implementation of the data layer was evaluated against well-known benchmarks — LUBM (Lehigh University Benchmark) and BSBM (Berlin SPARQL Benchmark) — and two views of the web of linked data used in LarKC: PIKB (Pathway and Interaction Knowledge Base) and LDSR (Linked Data Semantic Repository). Loading: 15B statements at 18 KSt/sec on a $10,000 server; 1B statements at 66 KSt/sec on a $2,000 desktop. Reasoning & materialization: LUBM at 21 KSt/sec for 1B and 10 KSt/sec for 7B explicit statements; LDSR at 14 KSt/sec for 357M explicit statements; PIKB at 10 KSt/sec for 1.5B explicit statements. Competitive with the state of the art.
51. LarKC Data Layer Evaluation: Linked Data. Inference with both LDSR and PIKB proves to be much more complex than LUBM, because the datasets are much better interconnected, there are plenty of owl:sameAs links, and the OWL vocabulary is used disregarding its formal semantics (e.g. in DBpedia there are skos:broader cycles of categories with length 180). Optimisation of the handling of owl:sameAs is crucial. PIKB: 1.47B explicit statements + 842M inferred. LDSR loaded in 7 hours on a desktop: number of imported statements (NIS) 357M; number of new inferred statements 512M; number of stored statements (NSS) 869M; number of retrievable statements (NRS) 1.14B. The owl:sameAs optimisation allowed reducing the indices by 280M statements.
58. Active and Ready for the Public. 2170 check-outs, 1380 commits, 23 users of the code repository (LarKC + Alpha Plus Early Adopters Workshop branch); 20 downloads of the alpha; 1 public release since 30th May 2009.
59. Lessons Learned (1/2). API design: types of plug-ins: 5 (+1 => two types of TRANSFORM); more abstract I/O data structures => more flexibility for assembling plug-ins and for plug-in writers. Test API implementation: validation and refinement of the API (introduction of the ‘Contract’ and ‘Context’ parameters). Transforming Cyc into the LarKC platform: minimization and reorganization of Cyc code as a basis for the LarKC platform. Plug-in and use-case implementation: feedback collected, as our first early adopters, on different topics (how-to guidelines, the context parameter, plug-in types, data caching, …).
60. Lessons Learned (2/2). Licensing: licensing policies aligned with partners’ and the project’s interests => maximize openness and external contributions without preventing exploitation; monitoring of components’ licenses to avoid conflicts. MaRVIN and IBIS: the strategy is applicable to large-scale deployment; autonomous and symmetric nodes; asynchronous communication between nodes; a well-balanced load; needed an abstraction layer hiding resource heterogeneity (IBIS).
61. Project Timeline. [Timeline, months 0–42: surveys (plug-ins, platform) & requirements (use cases); Use Cases V1, V2, V3; plug-ins; prototype, internal release, public release, final release; data caching; monitoring & instrumentation; anytime behaviour; offer of computing resources.]
We’ve implemented a platform that realizes the goal of the proposal: supporting the experimentation that allows massive, and necessarily incomplete, reasoning over web-scale data. Most of the work will be in the plug-ins, and we already have interesting ones that we’ll demonstrate; but to support them, we’ve added services that support the plug-ins in as lightweight a fashion as possible, but no more. This includes the workflow support system, which allows the plug-ins to execute in the right order; a plug-in management system that integrates the plug-ins with the platform; a plug-in registry, which supports meta-reasoning and quality of service; and a data layer, designed to make handling massive data practical. We also provide a default RDF store and default meta-reasoning support, and are currently working on the first versions of parallelisation support.
Give example of metadata (include in slides) and QoS info => is this included in the WP1 ppt?? Give example of Contract and Context parameters. “Are they web services?” At the moment they are not, and much of the WSDL parts are empty. The reason to use WSDL at all is that plug-ins are described with SAWSDL plus a WSMO-Lite ontology, and SAWSDL is an extension of WSDL. And anyway, maybe they will be full-fledged WSDL web services one day.
"What is a triple pattern?"
A better example for the last bullet would be mapping the FOAF vocabulary to the Facebook vocabulary.
During the implementation of the first prototype it was realised that there are essentially two types of transform components in a workflow. The first prototype workflow used the Sindice [10] Web service to ‘identify’ RDF resources on the Web that could be used to answer the input SPARQL query. However, the Sindice service comes in two forms – triple-pattern search and keyword search – neither of which can use the input SPARQL query directly. Indeed, similar services such as SWOOGLE and Watson also use a variety of input data forms. Hence it became clear that a transformation of the input SPARQL query is required, and to facilitate this a new plug-in interface was created, ‘QueryTransformer’, as a special case of the TRANSFORM plug-in. Originally, it was planned for plug-ins to accept and return certain data structures that were identified from the proposed LarKC data model. For example, it made sense that a SELECT plug-in would accept a collection of RDF graphs (data-set) and return a subset of these triples (triple-set). However, this approach meant that it became impossible to wire together two SELECT components in series in a workflow without significant extra programming. So, after several revisions of the API, it was realised that from a plug-in’s point of view, the type of the data structures used as input was not relevant: the plug-in just needs to be able to access and process the triples. Therefore, the plug-in interfaces were modified to accept and return only the most abstract data structures containing RDF triples, thus imposing fewer restrictions on how plug-ins are assembled in a workflow and giving plug-in writers greater freedom to return RDF triples in data structures appropriate for the algorithm encapsulated within their plug-in. Ensuring compatibility between plug-ins will be done by the DECIDER plug-ins and/or workflow configurators, based on plug-in metadata (plug-in descriptions expressed in a plug-in annotation language).
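The benefit of a single abstract statement-set type can be shown in miniature: when every plug-in consumes and produces the same abstract type, two SELECT stages compose in series with no adapter code. The selecters below (take the first half, keep even-indexed items) are invented purely to demonstrate the composition.

```java
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Two toy selecters over the same abstract type (List<String> here),
// composed directly because input and output types match.
class ComposeSelecters {
    static UnaryOperator<List<String>> firstHalf =
            s -> s.subList(0, s.size() / 2);
    static UnaryOperator<List<String>> evens =
            s -> IntStream.range(0, s.size()).filter(i -> i % 2 == 0)
                    .mapToObj(s::get).collect(Collectors.toList());

    static List<String> pipeline(List<String> in) {
        return evens.apply(firstHalf.apply(in)); // SELECT after SELECT, no adapters
    }
}
```

With the original concrete types (data-set in, triple-set out), this wiring would have required conversion code between the two stages.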
Remember here the overall architecture picture from Michael’s presentation, in order to introduce the details in the next slides (APIs, …). Mention briefly what we will explain later and where in the platform it is located: APIs, parallelization/distribution (where they are “hidden” within the architecture picture). Purple = platform utility functionality; green = APIs; blue = plug-ins (not sure if the Data Layer should be viewed as a plug-in and thus blue); orange = external systems; red = external data sources.
LarKC workflows are more like work flows.
The LarKC data model allows triple sets to be physically moved between plug-ins, but this can be expensive, especially during identification and selection, so the data layer also supports the transfer of references to named sets of triples in an RDF store or out on the web.
LarKC workflows are more like work flows; data transfer can be virtualised.
- heterogeneous data: TRANSFORM; combine text and triples (WP7 – GATE and medical data); combine different vocabularies
- heterogeneous code: wrappers; combine new & legacy, Java & non-Java, local code and calls to a web service, etc.
Meta-reasoning can dynamically construct or reorder workflows. A decider, reasoning about the contents of the plug-in registry, here constructs two different workflows to answer the same query when provided with two different sets of plug-ins, A and B. Logical representation of a plug-in’s meta-data: plug-in roles; description of inputs and outputs; the logical representation is automatically extracted using only the functions from the API and the Java classes. API v0.2 plug-ins can be automatically assembled into a working workflow, using predefined rules for composing plug-ins; this is fast and can be done on the fly. Ongoing: adding QoS parameters to the meta-data; using QoS parameters when assembling and modifying pipelines.
The data layer API gives you powerful (maybe too powerful) tools to manipulate differently structured RDF, such as: merge arbitrary sets of RDF types (e.g. a dataset with RDF published at a remote URI) and treat them as a single RDF data unit to be consumed by the plug-ins; execute SPARQL queries over any type of RDF structure. Be warned that some of these methods are too powerful, because they try to guarantee complete results (no SPARQL distribution is used; the data is just replicated temporarily locally to execute the query, which may take a lot of IO and CPU).
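The merge utility described above can be sketched as a plain union of statement sets: several sources are combined into one unit before a query runs over the result. The class and method are hypothetical; the real data layer works on RDF structures and, as noted, may replicate remote data locally at some IO/CPU cost.

```java
import java.util.ArrayList;
import java.util.List;

// Toy merger: combine several statement sets into one queryable unit.
class MergeSketch {
    @SafeVarargs
    static List<String> merge(List<String>... sources) {
        List<String> merged = new ArrayList<>();
        for (List<String> s : sources) {
            merged.addAll(s);  // for remote sources this copy is the expensive step
        }
        return merged;
    }
}
```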
Berlin SPARQL Benchmark (BSBM); Lehigh University Benchmark (LUBM); Linked Data Semantic Repository (LDSR); Pathway and Interaction Knowledge Base (PIKB): UniProt (only curated entries; schema), Entrez Gene (complete dataset; custom schema), BioPAX – Cancer Cell Map (BioPAX distribution), BioPAX – NCI Pathway Interaction Database (BioPAX distribution), BioPAX – Reactome (BioPAX distribution), BioPAX – BioGRID (complete dataset; schema aligned to BioPAX), BioPAX – iProClass (complete dataset; custom schema), Gene Ontology (complete dataset; original schema), NCBI Taxonomy (complete dataset; custom schema).
D5.5.2 presents an update on the state of the art in scalable RDF engines, as a basis for evaluating the results of OWLIM. The map presents the loading speed of a few of the most scalable repositories in relation to the size of the dataset and the complexity of the loading. The best published evaluation results have been used for each system. For OWLIM, Oracle and DAML DB, loading includes forward-chaining and materialization. This diagram shows results up to 1.5 billion explicit statements. One can see that the results for loading are comparable, taking into account that the engines differ in features. Taking BigOWLIM’s results, one can observe how the difference in the semantics supported can alter the loading time by almost a factor of three. Overall, the evaluation demonstrated that the LarKC data layer is very well positioned with respect to the other outstanding engines in the highly competitive niche of the so-called semantic repositories.
The results of loading LDSR and PIKB are presented on the first “bubble chart” – the bubbles are bigger than those for LUBM, to indicate higher complexity. Generally, the notion of “reason-able views” makes reasoning with linked data feasible. The Linked Data Semantic Repository (LDSR) is discussed in WP2. The Pathway Interaction KB (PIKB) is presented in WP7A. There will be demos based on LDSR and PIKB at the “demo market”.
LarKC API definition: V0.1, non-streaming execution only; V0.2, from non-streaming execution to streaming anytime behaviour; V0.3, integration of the Data Layer API <= current stable version. Implementation of two test-rigs in order to validate the API and the general LarKC ideas: a scripted-DECIDE platform and a self-configuring-DECIDE platform. They only differ in their code for the Decider plug-in, and in some minimal wrapping code for each plug-in to register the required information about itself in the meta-knowledge-base. All the other code is exactly the same between the two test-rigs, giving us confidence that plug-ins will indeed be re-usable under different Decider plug-ins. What we want to achieve in the next year with respect to parallelization and distribution: mention that we have achieved coarse-grain distribution and explain how, with concrete technologies (IBIS, …); layered architecture (implementation-oriented), according to the updated slide presented at the EAW by Alexey; details of the kinds of parallelization and concrete parallelization techniques to speed up performance; give concrete details on how to apply concrete technologies. Distributed data layer; data streaming between remote components; caching, data warming/cooling; monitoring/instrumentation; further investigation and application of parallelization and distribution techniques to different types of distributed environments (high-performance grid, desktop grid, etc.); further investigation and application of “parallelization within plug-ins” techniques; architecture refinement; requirements traceability (and possible update).