Getting the best of Linked Data and Property Graphs: rdf2neo and the KnetMiner Use Case

Marco Brandizi
marco.brandizi@rothamsted.ac.uk
Find these slides at:
https://www.slideshare.net/mbrandizi

A short story about Gene Knowledge
www.knetminer.org

<concept>
  <id>1</id>
  <pid>Q75WV3</pid>
  <description/>
  <elementOf>
    <idRef>UNIPROTKB-SwissProt</idRef>
  </elementOf>
  <ofType>
    <idRef>Protein</idRef>
  </ofType>
  <evidences>
    <evidence>
      <idRef>IMPD</idRef>
    </evidence>
  </evidences>
  <conames>
    <concept_name>
      <name>Probable trehalose-phosphate phosphatase 1</name>
      <isPreferred>true</isPreferred>
    </concept_name>
    …

<cc>
  <id>Protein</id>
  <fullname>Protein</fullname>
  <description>
    A protein is comprised of one or more Polypeptides
    and potentially other molecules.
  </description>
  <specialisationOf>
    <idRef>MolCmplx</idRef>
  </specialisationOf>
</cc>

<relation>
  <fromConcept>1</fromConcept>
  <toConcept>3</toConcept>
  <ofType>
    <idRef>participates_in</idRef>
  </ofType>
  <evidences>
    <evidence>
      <idRef>ECO:0000316</idRef>
    </evidence>
  </evidences>
  <relgds/>
</relation>

<concept>
  <id>3</id>
  <pid>GO:0009651</pid>
  <description>response to salt stress</description>
  <ofType><idRef>BioProc</idRef></ofType>
  <coaccessions>
    <concept_accession>
      <accession>GO:0009651</accession>
      <elementOf><idRef>GO</idRef></elementOf>
      <ambiguous>false</ambiguous>
    </concept_accession>
  </coaccessions>
</concept>

A short story about Gene Knowledge
www.ondex.org

A short story about Gene Knowledge

Can we improve? Graph DBs? Query languages? Open Data? FAIR?
Sure! RDF! OWL! Triple store! SPARQL!
Uhm, we’ve tried that, but…
I can see what you mean, but it’s not so difficult, let me…
Look! I’ve seen this Neo4j! It has relations with properties!
Uhm… well… yeah, but no data format, bad with ontologies, no URIs/merging…
And look how cool the browser is!
Oh, yes, that’s cool, but maybe not the most important thing…
And Cypher is a breeze!
Uhm… let me try. Oh, cool, but UNION sucks, and…
And it has graph algorithms! And devs got the APIs in minutes!
Uhm… Are Jena/RDF4J that much harder?
… …
Source: https://digiday.com/uk/weve-created-monster-publishers-vent-ad-tech-frustration

Why Not Take the Best of Both Worlds?

Comparing Functionality
• Data ELT and integration
  • See our example: https://github.com/Rothamsted/bioknet-onto/tree/master/examples/bmp_reg_human
  • The Semantic Web is focused on standardised data sharing
  • Neo4j doesn’t have a data format; it is focused on backing applications
  • URI-based merging in RDF
  • CONSTRUCT-based data transformations in the Semantic Web (including tools like TARQL)
  • MATCH/CREATE in Cypher can play a similar role, but it is not the same
• Query languages
  • Cypher is considered compact and simple to learn
  • SPARQL is better at complex graph patterns with branches
  • Cypher is very good at chain patterns
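
The "URI-based merging in RDF" point can be illustrated with a minimal Python sketch. It assumes triples are modelled as plain tuples and uses a made-up `http://example.org/` namespace; the identifiers Q75WV3 and GO:0009651 come from the OXL snippet earlier in the deck, everything else is hypothetical. It is not how rdf2neo or any triple store is implemented; it only shows why shared URIs make merging trivial.

```python
# Hypothetical namespace and property names, for illustration only.
EX = "http://example.org/"

# Two independent sources describing overlapping entities as (s, p, o) triples.
source_a = {
    (EX + "prot/Q75WV3", EX + "name", "Probable trehalose-phosphate phosphatase 1"),
    (EX + "prot/Q75WV3", EX + "participates_in", EX + "go/GO:0009651"),
}
source_b = {
    # Same statement as in source_a: it collapses on merge.
    (EX + "prot/Q75WV3", EX + "participates_in", EX + "go/GO:0009651"),
    (EX + "go/GO:0009651", EX + "name", "response to salt stress"),
}


def merge(*graphs):
    """Merge graphs by set union: the same URI always denotes the same node."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def describe(graph, subject):
    """Return every (predicate, object) pair known about one URI."""
    return {(p, o) for (s, p, o) in graph if s == subject}


merged = merge(source_a, source_b)
print(len(merged))  # 3: the duplicated statement is stored once
print(len(describe(merged, EX + "prot/Q75WV3")))  # 2
```

In a property graph without global identifiers, the same merge needs an explicit matching key (for example an accession property) chosen by the developer.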

Query Performance
Details at: https://github.com/Rothamsted/graphdb-benchmarks

Query Performance: Graph Traversal

Query Performance: Branch Union

Conclusions
• Hybrid architectures might be good at getting the best of both
• They’re feasible; performance is acceptable with both technologies
• rdf2neo can help you keep everything aligned to a conceptual data model
• Helps with Linked Data and FAIR Principles
• Please check out GitHub and get in touch (especially if you work in agriculture/plant biology)
• It comes with some overhead. You might need just one of the two halves
• Whatever you do, follow LOD/FAIR

Acknowledgements
Ajit Singh
Software Engineer
Monika Mistry
Master Student, Data Curator
Keywan Hassani-Pak
KnetMiner Team Leader
Chris Rawlings
Head of Computational & Analytical Sciences
William Brown
IT Admin

And You All!

Cypher vs SPARQL
Proteins->Reactions->Pathways:
// chain of paths, node selection via property (exploits indices)
MATCH (prot:Protein) - [csby:consumed_by] -> (:Reaction) - [:part_of] -> (pway:Path { title: 'apoptosis' })
// further conditions, not always so performant
WHERE prot.name =~ '(?i)^DNA.+'
// usual projection and post-selection operators
RETURN prot.name, pway
// relations can have properties
ORDER BY csby.pvalue
LIMIT 1000

Proteins->Reactions->Pathways:
// single-path (or same-direction branching) patterns are easy to write
MATCH (prot:Protein) - [:produced_by|consumed_by] -> (:Reaction) - [:part_of*1..3] -> (pway:Path)
RETURN ID(prot), ID(pway) LIMIT 1000

// very compact forms are available, depending on the data
MATCH (prot:Protein) -- (pway:Path) RETURN pway

# assumes the kb: prefix is declared elsewhere
select distinct ?prot ?pway
where {
  { # Branch 1
    ?prot kb:pd_by|kb:cs_by ?react.
    ?prot a kb:Protein.
    ?react a kb:Reaction.
    ?react kb:part_of ?pway.
    ?pway a kb:Path.
  }
  union { # Branch 2
    ?prot ^kb:ac_by|kb:is_a ?enz.
    ?prot a kb:Protein.
    ?enz a kb:Enzyme.
    { # Branch 2.1
      ?enz kb:ac_by|kb:in_by ?comp.
      ?comp a kb:Compound.
      ?comp kb:cs_by|kb:pd_by ?trns.
      ?trns a kb:Transport
    } union {
      # Branch 2.2
      ?enz ^kb:ca_by ?trns.
      ?comp a kb:Compound.
      ?trns a kb:Transport
    }
    ?trns kb:part_of ?pway.
    ?pway a kb:Path.
  }
} LIMIT 1000
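
To make the chain pattern concrete, here is a small Python sketch of what a query like `(prot:Protein) - [:produced_by|consumed_by] -> (:Reaction) - [:part_of*1..3] -> (pway:Path)` asks for. The toy graph, node names and helper functions are all invented for illustration. It returns distinct pairs and ignores Cypher's per-path relationship uniqueness rule, so treat it as an approximation of the semantics, not a description of how Neo4j evaluates the query.

```python
# Toy property graph; node ids, labels and edges are hypothetical.
LABELS = {
    "p1": "Protein", "p2": "Protein",
    "r1": "Reaction", "r2": "Reaction",
    "w1": "Path", "w2": "Path",
}
EDGES = [
    ("p1", "consumed_by", "r1"),
    ("p2", "produced_by", "r2"),
    ("r1", "part_of", "w1"),
    ("w1", "part_of", "w2"),  # nested pathway, reached from r1 in two hops
    ("r2", "part_of", "w2"),
]


def out(node, rel_types):
    """Targets of outgoing edges of the given types."""
    return [dst for (src, rel, dst) in EDGES if src == node and rel in rel_types]


def reach(start, rel_type, min_hops, max_hops):
    """Nodes reachable from start via rel_type in min_hops..max_hops steps."""
    found = set()
    frontier = {start}
    for hop in range(1, max_hops + 1):
        frontier = {dst for node in frontier for dst in out(node, {rel_type})}
        if hop >= min_hops:
            found |= frontier
    return found


def protein_pathways():
    """Distinct (protein, pathway) pairs matching the chain pattern."""
    rows = set()
    for prot, label in LABELS.items():
        if label != "Protein":
            continue
        for react in out(prot, {"produced_by", "consumed_by"}):
            if LABELS[react] != "Reaction":
                continue
            for pway in reach(react, "part_of", 1, 3):
                if LABELS[pway] == "Path":
                    rows.add((prot, pway))
    return rows


print(sorted(protein_pathways()))
# [('p1', 'w1'), ('p1', 'w2'), ('p2', 'w2')]
```

The Cypher version states the whole chain in one line, which is the compactness the slide refers to; the SPARQL version needs one triple pattern per hop plus explicit type checks.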

Loading Performance
Details at: https://github.com/Rothamsted/graphdb-benchmarks

Conclusions
Neo4j, Cypher DBs, graph DBs vs Semantic Web/triple stores

Data exchange format
• Neo4j, Cypher DBs, graph DBs:
  - No official one, just Cypher; support for GraphML, RDF
  +/- Focus on backing applications
• Semantic Web/triple stores:
  + Focus on data sharing standards

Data model
• Neo4j, Cypher DBs, graph DBs:
  + Relations with properties
  - Metadata/schemas/ontologies management
• Semantic Web/triple stores:
  - Relations cannot have properties (reification required)
  + Metadata/schemas/ontologies as first-class citizens, with standardised OWL

Performance
• Neo4j, Cypher DBs, graph DBs:
  + Complex graph traversals
• Semantic Web/triple stores:
  + Comparable in most cases

Query language
• Neo4j, Cypher DBs, graph DBs:
  + Cypher is easier (e.g., compact, implicit elements)?
  - Expressivity issues (unions)
  - No standard QL (but efforts in progress, e.g., OpenCypher)
• Semantic Web/triple stores:
  - SPARQL is harder? (URIs, namespaces, verbosity)
  + SPARQL is more expressive

Standardisation, openness
• Neo4j, Cypher DBs, graph DBs:
  +/- (TinkerPop is open, Neo4j isn’t)
  + Commercial support
  + More alive and up to date (e.g., support for Hadoop, nice Neo4j browser, easy installation)
• Semantic Web/triple stores:
  + Natively open, many open implementations
  - Instability and many short-lived prototypes
  - Advancements seem to be slowing down
  + Some nice open and commercial browsers (LODEStar, …

Scalability, big data
• Neo4j, Cypher DBs, graph DBs:
  +/- Commercial support for clustering/clouds for Neo4j
  + Open support in TinkerPop
• Semantic Web/triple stores:
  + Load balancing/cluster solutions, commercial cloud support (e.g., GraphDB)
  + SPARQL over TinkerPop (via SAIL interface)
