Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.
© Insight 2014. All Rights Reserved
Processing Life Science Data at Scale –
using Semantic Web Technologies
Ali Hasnain, N...
© Insight 2014. All Rights Reserved© Insight 2014. All Rights Reserved
Presenters and Contributors
Ali Hasnain
Dietrich Re...
•Motivation
•Query Federation
•Infrastructure
•Graph Aggregation
•Visualization
•Hands On Session
Agenda
© Insight 2014. All Rights Reserved
Session 0: Motivation
The Web is evolving...
WWW (Tim Berners-
Lee)
“There was a second
part of the dream […]
we could then use
computers to hel...
© Insight 2014. All Rights Reserved© Insight 2014. All Rights Reserved 6 of 46
Two Key Ingredients
1. RDF – Resource Descr...
© Insight 2014. All Rights Reserved© Insight 2014. All Rights Reserved
Why Graphs?
TGFβ-3
transforming growth
factor, beta...
© Insight 2014. All Rights Reserved© Insight 2014. All Rights Reserved
Why Graphs?
TGFβ-3
transforming growth
factor, beta...
© Insight 2014. All Rights Reserved© Insight 2014. All Rights Reserved
Linked Open Data Cloud
© Insight 2014. All Rights Reserved© Insight 2014. All Rights Reserved
Life Sciences….
© Insight 2014. All Rights Reserved© Insight 2014. All Rights Reserved
Relevant trends ongoing in the public
• “Health wea...
Distributed Biomedical Repositories (Demo)
http://safe.sels.insight- 12
Insight Centre for Data Analytics
© Insight 2014. All Rights Reserved
Session 1: Accessing Data- Query
Federation
Ali Hasnain, Naoise Dunne, Dietrich Rebhol...
© Insight 2014. All Rights Reserved
Distributed Biomedical Repositories
© Insight 2014. All Rights Reserved
SPARQL Query Federation Approaches
•SPARQL Endpoint Federation (SEF)
•Linked Data Fede...
© Insight 2014. All Rights Reserved
SPARQL Query Federation Approaches
© Insight 2014. All Rights Reserved
SPARQL Endpoint Federation Approaches
• Most commonly used approaches
• Make use of SP...
© Insight 2014. All Rights Reserved
Linked Data Federation Approaches
• Data needs not be exposed via SPARQL endpoints
• U...
© Insight 2014. All Rights Reserved
Query federation on top of Distributed Hash
Tables
• Uses DHT indexing to federate SPA...
© Insight 2014. All Rights Reserved
Hybrid of SEF+LDF
Federation over SPARQL endpoints and Linked Data
Can potentially dea...
© Insight 2014. All Rights Reserved© Insight 2014. All Rights Reserved
SPARQL Endpoint Federation
S1 S2 S3 S4
RDF RDF RDF ...
© Insight 2014. All Rights Reserved© Insight 2014. All Rights Reserved
FedBench (LD3): Return for all US presidents their ...
© Insight 2014. All Rights Reserved© Insight 2014. All Rights Reserved
FedBench (LD3): Return for all US presidents their ...
© Insight 2014. All Rights Reserved© Insight 2014. All Rights Reserved
FedBench (LD3): Return for all US presidents their ...
© Insight 2014. All Rights Reserved© Insight 2014. All Rights Reserved
FedBench (LD3): Return for all US presidents their ...
© Insight 2014. All Rights Reserved
SPARQL Query Federation Engine
•FedX
•SPLENDID
•HiBISCuS+FedX
•HiBISCuS+SPLENDID
•ANAP...
© Insight 2014. All Rights Reserved
Overview of Implementation details of Federated
Sparql Query Engines
© Insight 2014. All Rights Reserved
System Features of Federated Sparql Query
Engines
© Insight 2014. All Rights Reserved
System’s Support for SPARQL Query Construct
© Insight 2014. All Rights Reserved
Session 2: Infrastructure
trends in big linked data infrastructure
© Insight 2014. All Rights Reserved
Mixed data sources and query tools
Cancer Atlas
Curated
Federated
cluster
(Indexed)
HB...
© Insight 2014. All Rights Reserved
Introduction
Cloud computing frameworks tailored for managing and
analyzing big data-s...
© Insight 2014. All Rights Reserved
What is Big Data?
“Big Data is characterized by High Volume, Velocity and
Variety requ...
© Insight 2014. All Rights Reserved Infographic summerizing 4 Vs
© Insight 2014. All Rights Reserved
The nature of applications
What is the nature of application this kind of infrastructu...
© Insight 2014. All Rights Reserved
Using spark
© Insight 2014. All Rights Reserved
How can I get me one?
How to enable your research with cutting edge
distributed comput...
© Insight 2014. All Rights Reserved
Distributed
High Utilisation & Scalability
The Rise of distributed Datacenter Schedule...
© Insight 2014. All Rights Reserved
The qualities of a big data Infrastructure
• Distributed
• Data and its processing is ...
© Insight 2014. All Rights Reserved
Why Datacenter schedulers?
Schedulers run your Distributed Apps
• are an operating sys...
© Insight 2014. All Rights Reserved
Benefits of using a Scheduler
• Efficiency - best use of computing resources
• Agility...
© Insight 2014. All Rights Reserved
Datacenter schedulers
Schedulers help you focus on your own work and not the
infrastru...
© Insight 2014. All Rights Reserved Quick history of distributed schedulers
2004 mapreduce
paper
2004 Google
Borg
2011 Had...
© Insight 2014. All Rights Reserved
Hadoop
Monolithic scheduler: Original
opensource datacenter scheduler
• jobs are batch...
© Insight 2014. All Rights Reserved
Mesos
2 level scheduler : More flexible
• Can Schedule many kinds of applications
• Fr...
© Insight 2014. All Rights Reserved
Schedulers Recap
Scheduler allow you to use your cluster as one machine
• ease operati...
© Insight 2014. All Rights Reserved
Managing Failure in
Infrastructure
“Everything fails all the time”
Werner Vogels (CTO ...
© Insight 2014. All Rights Reserved
Handling Failure
The harsh reality:
all distributed infrastructures must deal with fai...
© Insight 2014. All Rights Reserved
Fallacies of distributed computing
• The most common misconceptions that lead to failu...
© Insight 2014. All Rights Reserved
How successful infrastructure avoids failure
No Single point of failure
multiple maste...
© Insight 2014. All Rights Reserved
Operationally efficient
Using Containers for better Quality of Service
© Insight 2014. All Rights Reserved
Containers
Containers run your application in isolation in a portable
and repeatable f...
© Insight 2014. All Rights Reserved
Why Containers
Containers Are:
• Small (footprint)
• Portable
• Fast
Containers Allow:...
© Insight 2014. All Rights Reserved
Containers vs. VMs
Virtual Machines
emulate a virtual hardware,
require considerable
o...
© Insight 2014. All Rights Reserved
Containers vs. VMs
© Insight 2014. All Rights Reserved
Containers
But which container to use.
Many choices exist…
LXC, Docker, Rocket
© Insight 2014. All Rights Reserved
Docker - our container of choice
Why Docker?
• Best of breed at the moment
• integrate...
© Insight 2014. All Rights Reserved
Container Standardisation
But choosing a container is not a commitment...
Looks like s...
© Insight 2014. All Rights Reserved
Recap: Containers
Containers Are:
• Small (footprint), Portable, Fast
• Allow you to r...
© Insight 2014. All Rights Reserved
Putting it all together
Insights Linked Data infrastructure
© Insight 2014. All Rights Reserved
Linked Data Infrastructure
Mesos
Mesos - scheduler short jobs Mesos - scheduler long r...
© Insight 2014. All Rights Reserved
Linked Data Infrastructure
Mesos
Linux Server Linux Server Linux Server
Mesos - resour...
© Insight 2014. All Rights Reserved
Recap
To provide linked data at scale and with the right service
mix, infrastructures ...
© Insight 2014. All Rights Reserved
Thank you
Q & A
© Insight 2014. All Rights Reserved
Graph Aggregation at scale
Naoise Dunne, Ali Hasnain, Dietrich Rebholz-Schuhmann
© Insight 2014. All Rights Reserved
Linked Data
A method of publishing structured
data so that it can be interlinked.
• No...
© Insight 2014. All Rights Reserved
Linked data as a graph
Linked data can be considered a Heterogeneous Graph
What do we ...
© Insight 2014. All Rights Reserved graphs within a Heterogeneous Graph
© Insight 2014. All Rights Reserved
Comparing Aggregation
Approaches
© Insight 2014. All Rights Reserved
Linked data as a graph
As Linked data graphs share graph structure and can be
connecte...
© Insight 2014. All Rights Reserved
Comparing Graph Query Approaches
Gremlin - graph traversal
A “pure” graph traversal la...
© Insight 2014. All Rights Reserved
Comparing Graph Query Approaches
Gremlin - graph traversal
A “pure” graph traversal la...
© Insight 2014. All Rights Reserved
Other Graph Query Languages
Gremlin -
graph traversal
A “pure” graph traversal languag...
© Insight 2014. All Rights Reserved
Working with linked data with Gremlin
© Insight 2014. All Rights Reserved
How to Aggregate Graphs
with Sparql?
© Insight 2014. All Rights Reserved
What is Graph Aggregation
Condenses a large graph into a structurally similar but
smal...
© Insight 2014. All Rights Reserved
What is Graph Aggregation
PREFIX : <http://example.org/>
PREFIX rdf:<http://www.w3.org...
The Famous LOD Cloud
http://lod-cloud.net/
The Famous LOD Cloud
(from a different Angel)
Gagg: Two-steps Aggregation
● Relation & measure
● Subject dimension(s)
& measure
● Object dimension(s) &
measure
Uses agg...
© Insight 2014. All Rights Reserved
How do we Aggregate
Linked data at scale
© Insight 2014. All Rights Reserved
Scaling SPARQL
So you still want to use sparql at scale
How do we query RDF data with ...
© Insight 2014. All Rights Reserved
Complementing Sparql with Graph algorithm
When not SPARQL
• For distributed computing,...
© Insight 2014. All Rights Reserved
Thank you
Q & A
© Insight 2014. All Rights Reserved
Sparql Federation example
Q: How are the protein targets of the
gleevec drug different...
© Insight 2014. All Rights Reserved
Why use Graph algorithms
Polynomial time
A lot of queries on linked data can be expres...
© Insight 2014. All Rights Reserved
More pure Dijkstra using Cypher
MATCH (from: Location {LocationName:"x"}), (to: Locati...
© Insight 2014. All Rights Reserved
Dijkstra Psudocode
dist[s]←0 (distance to
source vertex is zero)
forall v ∈ V–{s}
do d...
© Insight 2014. All Rights Reserved
Thank you
Q & A
© Insight 2014. All Rights Reserved
Visualization
Ali Hasnain, Naoise Dunne, Dietrich Rebholz-Schuhmann
© Insight 2014. All Rights Reserved
Visualization
• Visualize your Data!
• Visualize to easily Query your Data
• Example T...
© Insight 2014. All Rights Reserved
ReVeaLD Search Platform
ReVeaLD :- Real-Time Visual Explorer and Aggregator of Linked ...
© Insight 2014. All Rights Reserved
ReVeaLD Search Platform
Availability:http://140.203.155.7:31005/explorer
Demo: https:/...
© Insight 2014. All Rights Reserved
DSL Visual Representation
Concept Map Visualization is used.
© Insight 2014. All Rights Reserved
Visual Query Builder
© Insight 2014. All Rights Reserved
Visual Query Model
© Insight 2014. All Rights Reserved
FedViz: A Visual Interface for SPARQL Queries
Formulation and Execution
FedViz is an o...
© Insight 2014. All Rights Reserved
FedViz: A Visual Interface for SPARQL Queries Formulation
and Execution
© Insight 2014. All Rights Reserved
Using FedViz: Step by Step
© Insight 2014. All Rights Reserved
Hands On Session
Naoise Dunne, Ali Hasnain
© Insight 2014. All Rights Reserved
Hands on session
This tutorial is available from
https://goo.gl/JjfJ0f
Prochain SlideShare
Chargement dans…5
×

Processing Life Science Data at Scale - using Semantic Web Technologies

The life sciences domain has been one of the early adopters
of linked data and, a considerable portion of the Linked Open Data cloud is comprised of datasets from Life Sciences Linked Open Data (LSLOD). The deluge of biomedical data in the last few years, partially caused by the advent of high-throughput gene sequencing technologies, has been a primary motivation for these e fforts. This success has lead to the growth in size of data sets and to the need for integrating multiples of these data-sets. This growth requires large scale distributed infrastructure and specifi c techniques for managing large linked data graphs. Especially in combination with Semantic Web and Linked Data technologies these promises to enable the processing of large as well as semantically heterogeneous data sources and the capturing of new knowledge from those. In this tutorial we present the state of the art in large data processing, as well as the amalgamation with Linked Data and Semantic Web technologies for better knowledge discovery and targeted applications. We aim to provide useful information for the Knowledge Acquisition research community as well as the working Data Scientist.

  • Identifiez-vous pour voir les commentaires

Processing Life Science Data at Scale - using Semantic Web Technologies

  1. 1. © Insight 2014. All Rights Reserved Processing Life Science Data at Scale – using Semantic Web Technologies Ali Hasnain, Naoise Dunne, Dietrich Rebholz-Schuhmann
  2. 2. © Insight 2014. All Rights Reserved© Insight 2014. All Rights Reserved Presenters and Contributors Ali Hasnain Dietrich Rebholz-Schuhmann Naoise Dunne
  3. 3. •Motivation •Query Federation •Infrastructure •Graph Aggregation •Visualization •Hands On Session Agenda
  4. 4. © Insight 2014. All Rights Reserved Session 0: Motivation
  5. 5. The Web is evolving... WWW (Tim Berners- Lee) “There was a second part of the dream […] we could then use computers to help us analyse it, make sense of what we re doing, where we individually fit in, and how we can better work together.”
  6. 6. © Insight 2014. All Rights Reserved© Insight 2014. All Rights Reserved 6 of 46 Two Key Ingredients 1. RDF – Resource Description Framework Graph based Data – nodes and arcs • Identifies objects (URIs) • Interlink information (Relationships) 2. Vocabularies (Ontologies) • provide shared understanding of a domain • organise knowledge in a machine-comprehensible way • give an exploitable meaning to the data
  7. 7. © Insight 2014. All Rights Reserved© Insight 2014. All Rights Reserved Why Graphs? TGFβ-3 transforming growth factor, beta 3 Homo sapiens CCCCGGCGCAGCGCGGCCGCA GCAGCCTCCGCCCCCCGCACGG TGTGAGCGCCCGACGCGGCCG AGGCGG … 14q24 nci:has_description nih:sequence nih:organism nih:location nih:organism TGFβ-3 Platelet activation, signalling,aggregation Response to elevated platelet cytosol Ca2+ Platelet degranulation rea:process rea:process rea:process Gene Database Pathway Database
  8. 8. © Insight 2014. All Rights Reserved© Insight 2014. All Rights Reserved Why Graphs? TGFβ-3 transforming growth factor, beta 3 Homo sapiens CCCCGGCGCAGCGCGGCCGCA GCAGCCTCCGCCCCCCGCACGG TGTGAGCGCCCGACGCGGCCG AGGCGG … 14q24 nci:has_description nih:sequence nih:organism nih:location nih:organism TGFβ-3 Platelet activation, signalling,aggregation Response to elevated platelet cytosol Ca2+ Platelet degranulation rea:process rea:process rea:process Gene Database Pathway Database
  9. 9. © Insight 2014. All Rights Reserved© Insight 2014. All Rights Reserved Linked Open Data Cloud
  10. 10. © Insight 2014. All Rights Reserved© Insight 2014. All Rights Reserved Life Sciences….
  11. 11. © Insight 2014. All Rights Reserved© Insight 2014. All Rights Reserved Relevant trends ongoing in the public • “Health wearables” • Generating data, private data, data streams • Personalized medicine • Genomics information available, but also biomarker analysis • Health-related, sports, nutrition, aging • Treatment according to specific genomic parameters • Systems medicine, translational medicine
  12. 12. Distributed Biomedical Repositories (Demo) http://safe.sels.insight- 12 Insight Centre for Data Analytics
  13. 13. © Insight 2014. All Rights Reserved Session 1: Accessing Data- Query Federation Ali Hasnain, Naoise Dunne, Dietrich Rebholz-Schuhmann
  14. 14. © Insight 2014. All Rights Reserved Distributed Biomedical Repositories
  15. 15. © Insight 2014. All Rights Reserved SPARQL Query Federation Approaches •SPARQL Endpoint Federation (SEF) •Linked Data Federation (LDF) •Distributed Hash Tables (DHTs) •Hybrid of SEF+LDF
  16. 16. © Insight 2014. All Rights Reserved SPARQL Query Federation Approaches
  17. 17. © Insight 2014. All Rights Reserved SPARQL Endpoint Federation Approaches • Most commonly used approaches • Make use of SPARQL endpoints URLs • Fast query execution • RDF data needs to be exposed via SPARQL endpoints • E.g., HiBISCus, FedX, SPLENDID, ANAPSID, LHD etc.
  18. 18. © Insight 2014. All Rights Reserved Linked Data Federation Approaches • Data needs not be exposed via SPARQL endpoints • Uses URI lookups at runtime • Data should follow Linked Data principles • Slower as compared to previous approaches • E.g., LDQPS, SIHJoin, WoDQA etc.
  19. 19. © Insight 2014. All Rights Reserved Query federation on top of Distributed Hash Tables • Uses DHT indexing to federate SPARQL queries • Space efficient • Cannot deal with whole LOD • E.g., ATLAS
  20. 20. © Insight 2014. All Rights Reserved Hybrid of SEF+LDF Federation over SPARQL endpoints and Linked Data Can potentially deal with whole LOD E.g., ADERIS-Hybrid
  21. 21. © Insight 2014. All Rights Reserved© Insight 2014. All Rights Reserved SPARQL Endpoint Federation S1 S2 S3 S4 RDF RDF RDF RDF Parsing/Rewriting Source Selection Federator Optimzer Integrator Rewrite query and get Individual Triple Patterns Identify capable source against Individual Triple Patterns Generate optimized sub-query Exe. Plan Integrate sub- queries results Execute sub- queries Curtsey Muhammad Saleem (AKSW)
  22. 22. © Insight 2014. All Rights Reserved© Insight 2014. All Rights Reserved FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF Source Selection Algorithm Triple pattern-wise source selection S1TP1 = KEGG RDF ChEBI RDF NYT RDF SWDF RDF LMDB RDF Jamendo RDF Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9 //TP1 //TP3 //TP4 //TP5 //TP2 TP2 = S1 Source Selection Curtsey Muhammad Saleem (AKSW)
  23. 23. © Insight 2014. All Rights Reserved© Insight 2014. All Rights Reserved FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF Source Selection Algorithm Triple pattern-wise source selection S1TP1 = KEGG RDF ChEBI RDF NYT RDF SWDF RDF LMDB RDF Jamendo RDF Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9 //TP1 //TP3 //TP4 //TP5 //TP2 TP2 = S1 TP3 = S1 Source Selection Curtsey Muhammad Saleem (AKSW)
  24. 24. © Insight 2014. All Rights Reserved© Insight 2014. All Rights Reserved FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF Source Selection Algorithm Triple pattern-wise source selection S1TP1 = KEGG RDF ChEBI RDF NYT RDF SWDF RDF LMDB RDF Jamendo RDF Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9 //TP1 //TP3 //TP4 //TP5 //TP2 TP2 = S1 TP3 = S1 TP4 = S4 Source Selection Curtsey Muhammad Saleem (AKSW)
  25. 25. © Insight 2014. All Rights Reserved© Insight 2014. All Rights Reserved FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } dbpedia RDF Source Selection Algorithm Triple pattern-wise source selection S1TP1 = KEGG RDF ChEBI RDF NYT RDF SWDF RDF LMDB RDF Jamendo RDF Geo Names RDF DrugBank RDF S1 S2 S3 S4 S5 S6 S7 S8 S9 //TP1 //TP3 //TP4 //TP5 //TP2 TP2 = S1 TP3 = S1 TP4 = S4 TP5 = S1 S2 S4-S9 Source Selection Total triple pattern-wise sources selected = 1+1+1+1+8 => 12 Curtsey Muhammad Saleem (AKSW)
  26. 26. © Insight 2014. All Rights Reserved SPARQL Query Federation Engine •FedX •SPLENDID •HiBISCuS+FedX •HiBISCuS+SPLENDID •ANAPSID •BioFed •LHD •DARQ
  27. 27. © Insight 2014. All Rights Reserved Overview of Implementation details of Federated Sparql Query Engines
  28. 28. © Insight 2014. All Rights Reserved System Features of Federated Sparql Query Engines
  29. 29. © Insight 2014. All Rights Reserved System’s Support for SPARQL Query Construct
  30. 30. © Insight 2014. All Rights Reserved Session 2: Infrastructure trends in big linked data infrastructure
  31. 31. © Insight 2014. All Rights Reserved Mixed data sources and query tools Cancer Atlas Curated Federated cluster (Indexed) HBASE Distributed Database DrugBank Sparql 1.1 Ad hoc Federated Federation RDD and MapReduce Mongo Distributed Database Gremlin Titian Distributed Graph Database
  32. 32. © Insight 2014. All Rights Reserved Introduction Cloud computing frameworks tailored for managing and analyzing big data-sets are powering ever larger clusters of computers. This presentation will describe the infrastructure that is required to serve a linked data flavour of big data
  33. 33. © Insight 2014. All Rights Reserved What is Big Data? “Big Data is characterized by High Volume, Velocity and Variety requiring specific Technology and Analytical Methods for its transformation into Value” - Gartner • Volume - Data is too big to fit on even largest server • Velocity - Data need handling based on speed • Variety - heterogeneous Data - comes in many forms • Veracity - (sometimes) Quality of the data
  34. 34. © Insight 2014. All Rights Reserved Infographic summerizing 4 Vs
  35. 35. © Insight 2014. All Rights Reserved The nature of applications What is the nature of application this kind of infrastructure enables Zebra fish (Jeremy Freeman) • needed to map and manipulate the brain at scale to understand Zebrafish neural activity • Scales too great for one processor
  36. 36. © Insight 2014. All Rights Reserved Using spark
  37. 37. © Insight 2014. All Rights Reserved How can I get me one? How to enable your research with cutting edge distributed computing with little effort
  38. 38. © Insight 2014. All Rights Reserved Distributed High Utilisation & Scalability The Rise of distributed Datacenter Schedulers
  39. 39. © Insight 2014. All Rights Reserved The qualities of a big data Infrastructure • Distributed • Data and its processing is shared on large cluster multiple cheap commodity servers • loose coupling, isolation, location transparency, data locality & app-level composition • High Utilisation • The infrastructure give the best use of computing resources • Resilience (handling failure) • The infrastructure stays responsive in the face of failure and can “heal” • Scalability • grows to meet demand (Elastic), responsive under varying workload (Load balanced) • Operationally efficient • Needs to be highly automated, be very easy to maintain
  40. 40. © Insight 2014. All Rights Reserved Why Datacenter schedulers? Schedulers run your Distributed Apps • are an operating system kernel for the cloud • Schedulers coordinate execution of work on cluster • help you to get as many compute resources as you want whenever you want it • Abstract some scalability and load balancing issues
  41. 41. © Insight 2014. All Rights Reserved Benefits of using a Scheduler • Efficiency - best use of computing resources • Agility - change your application mix with no turnaround • Scalability - grow to the current demand of your app • Modularity - 2 level schedulers have plugin frameworks that allow quick repurposing of core and no reliance on one vendor (more later)
  42. 42. © Insight 2014. All Rights Reserved Datacenter schedulers Schedulers help you focus on your own work and not the infrastructure. “its great to be able to focus on what it is you want to be doing rather than worrying about how do you get what it is you need in order to be able to get stuff done” - John Wilkes (Google)
  43. 43. © Insight 2014. All Rights Reserved Quick history of distributed schedulers 2004 mapreduce paper 2004 Google Borg 2011 Hadoop1.0 2003 Google filesystem 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2008 Hadoop released 2013 Yarn 2010 Spark Paper 2010 Nexus (Mesos) 2005 Hadoop started 2013 Mesos Released 2011 Mesos Paper 2014 Kubernetes 2014 Google Omega paper History of Datacenter Schedulers 2003 Slurm
  44. 44. © Insight 2014. All Rights Reserved Hadoop Monolithic scheduler: Original opensource datacenter scheduler • jobs are batched and executed • Designed only to run Mapreduce jobs • No concurrency between apps • Evolving into yarn Hadoop Linux Server Linux Server hadoop- resource management mesos slavemesos slave Linked datat m/r job Linked datat m/r job Linked datat app
  45. 45. © Insight 2014. All Rights Reserved Mesos 2 level scheduler : More flexible • Can Schedule many kinds of applications • Frameworks (such as spark) are delegated the per application scheduling • Mesos responsible for resource distribution between applications and enforcing overall fairness • Very modular, due to 2 level scheduling. frameworks manage apps as they like Mesos Linux Server Linux Server Mesos - resource management Mesos - scheduler jobs framework chronos mesos slave framework spark framework marathon mesos slave Hadoop M/R job Linked data job Linked datat app
  46. 46. © Insight 2014. All Rights Reserved Schedulers Recap Scheduler allow you to use your cluster as one machine • ease operations • provide elasticity and load balancing • Can run both batch and longer running jobs • Are Efficient, Agile, scalable and modular
  47. 47. © Insight 2014. All Rights Reserved Managing Failure in Infrastructure “Everything fails all the time” Werner Vogels (CTO Amazon)
  48. 48. © Insight 2014. All Rights Reserved Handling Failure The harsh reality: all distributed infrastructures must deal with failure for the designers of applications running on distributed infrastructure (even Mesos), a great number of design mistakes need to be avoided...
  49. 49. © Insight 2014. All Rights Reserved Fallacies of distributed computing • The most common misconceptions that lead to failure of Network infrastructure: • The network is reliable. • Latency is zero. • Bandwidth is infinite. • The network is secure. • Topology doesn’t change. • There is one administrator. • Transport cost is zero. • The network is homogeneous All prove to be false in the long run and all cause big trouble and painful learning experiences.
  50. 50. © Insight 2014. All Rights Reserved How successful infrastructure avoids failure No Single point of failure multiple masters, multiple copies of data, and redundancy on all services. Use elections between masters, use distributed locks and thresholds. Design Applications to Expect Failure Applications should continue to function even if the underlying physical hardware fails. Evolving infrastructure eases this by the use schedulers and container managers. Special Channel for failures Provides the means to delegate errors as messages on their own channel or service. Techniques include log aggregation and shared monitoring services.
  51. 51. © Insight 2014. All Rights Reserved Operationally efficient Using Containers for better Quality of Service
  52. 52. © Insight 2014. All Rights Reserved Containers Containers run your application in isolation in a portable and repeatable fashion “Because of the way that ... containers separate the application constraints from infrastructure concerns, we help solve that dependency hell.” - Docker
  53. 53. © Insight 2014. All Rights Reserved Why Containers Containers Are: • Small (footprint) • Portable • Fast Containers Allow: • Resiliency • can be redeployed in seconds • Operationally efficiency • The infrastructure can “heal” • Scalability • on a single server allows resources (such as CPU) to be dialed up or down
  54. 54. © Insight 2014. All Rights Reserved Containers vs. VMs Virtual Machines emulate a virtual hardware, require considerable overhead in CPU, Disk and Memory Containers use shared operating systems much more efficient than hypervisors in system resource terms
  55. 55. © Insight 2014. All Rights Reserved Containers vs. VMs
  56. 56. © Insight 2014. All Rights Reserved Containers But which container to use. Many choices exist… LXC, Docker, Rocket
  57. 57. © Insight 2014. All Rights Reserved Docker - our container of choice Why Docker? • Best of breed at the moment • integrates natively with Mesos and Kubernetes • Has great infrastructure including • Docker registry for looking up containers • Docker compose for combining containers Alternatives • Rockit - More secure as uses init.d but Linux only
  58. 58. © Insight 2014. All Rights Reserved Container Standardisation But choosing a container is not a commitment... Looks like standardisation around the corner via open container: https://www.opencontainers.org/ Initiative Sponsors: Apcera, AT&T, AWS, Cisco, ClusterHQ, CoreOS, Datera, Docker, EMC, Fujitsu, Google, Goldman Sachs, HP, Huawei, IBM, Intel, Joyent, Kismatic, Kyup, the Linux Foundation, Mesosphere, Microsoft, Midokura, Nutanix, Oracle, Pivotal, Polyverse, Rancher, Red Hat, Resin.io, Suse, Sysdig, Twitter, Verizon, VMWare
  59. 59. © Insight 2014. All Rights Reserved Recap: Containers Containers Are: • Small (footprint), Portable, Fast • Allow you to repeatedly deploy applications • Work well with schedulers such as Mesos • Help with Resiliency and scalability
  60. 60. © Insight 2014. All Rights Reserved Putting it all together Insights Linked Data infrastructure
  61. 61. © Insight 2014. All Rights Reserved Linked Data Infrastructure Mesos Mesos - scheduler short jobs Mesos - scheduler long run jobs Spark Fwk Chronos Fwk Marathon Framework OS Monitor Mesos Monitor Linux Server Linux Server Linux Server Mesos - resource management mesos client Docker mesos client Docker mesos client Docker Resources cpu mem disk Managed by Mesos Applications work with frameworks to get resources they need Frameworks Negotiate with mesos to run their jobs Datastores HDT, Neo4JgraphX Granatum RevealedGraph Jobs Docker manages isolation on Linux servers
  62. 62. © Insight 2014. All Rights Reserved Linked Data Infrastructure Mesos Linux Server Linux Server Linux Server Mesos - resource management Mesos - scheduler short jobs Mesos - scheduler long run jobs Spark Fwk Chronos Fwk Datastores HDT, Neo4J Marathon graphX mesos client Docker OS Monitor Mesos Monitor mesos client Docker mesos client Docker We use graph X for large graph batch jobs We use both HDT(RDF Store) Neo4J (Graph) Granatum Revealed We deploy specialised linked data applications to cluster
  63. 63. © Insight 2014. All Rights Reserved Recap To provide linked data at scale and with the right service mix, infrastructures need to consider: • Services to help application to be Distributed Scalable • High Utilisation of computing resources • Know how you will handle failure • Operationally efficient We suggest, using schedulers such as mesos with containers (Docker), use suitable frameworks (GraphX/Spark) and datastores (Neo4J)
  64. 64. © Insight 2014. All Rights Reserved Thank you Q & A
  65. 65. © Insight 2014. All Rights Reserved Graph Aggregation at scale Naoise Dunne, Ali Hasnain, Dietrich Rebholz-Schuhmann
  66. 66. © Insight 2014. All Rights Reserved Linked Data A method of publishing structured data so that it can be interlinked. • Normally represented as RDF • Normally queried using SPARQL 4 principles of linked data 1. Use URIs to name (identify) things. 2. Use HTTP URIs so can be looked up 3. Provide metadata about what thing is. - use open standards RDF, SPARQL, etc. 4. Link to other things using their HTTP URI- based names
  67. 67. © Insight 2014. All Rights Reserved Linked data as a graph Linked data can be considered a Heterogeneous Graph What do we mean by Heterogeneous Graph? • Linked data graphs share graph structure but have mixed characteristics • Nodes (Vertices) contain different data • Links (Edges) Mix of directed and undirected • Linked Data Graph can be weighted or not • Can have a mix of classes and types from differing Ontologies What do these graphs look like?
  68. 68. © Insight 2014. All Rights Reserved graphs within a Heterogeneous Graph
  69. 69. © Insight 2014. All Rights Reserved Comparing Aggregation Approaches
  70. 70. © Insight 2014. All Rights Reserved Linked data as a graph As Linked data graphs share graph structure and can be connected we can reason over them using graph algorithms. Popular approaches to querying graphs are: • Declarative pattern matching - SPARQL • graph traversal languages - Cypher, Gremlin • distributed graph data structures - GraphX
  71. 71. © Insight 2014. All Rights Reserved Comparing Graph Query Approaches Gremlin - graph traversal A “pure” graph traversal language based on xpath, allows powerful expression of graph algorithms, but at cost of conciseness Characteristics • Great for expressing most graph algorithms • can be scaled • uses adjacency tables Pros great at localized searches (shortest path for instance) interoperable with most database Cons Can be difficult to understand Graph X - message passing DSL rather than language. For “programmers” Scala, Java and Python APIs. Characteristics • Resilient Distributed Property Graph • index-free adjacency • Vertex/Edge table Pros Best (in list) distributed execution Powerful mix of table and traversal has a set of optimised operators Cons Not a database, data needs to be loaded and stored separately SPARQL -declarative Declarative language, allowing better expression and abstracting the writer from optimisation problems Characteristics Stores Triples (tuples) Vendor normally optimise popular queries. Often built on RDBMS Pros Is Expressive language Easy to express search patterns Cons Difficult to scale Poor at traversal
  72. 72. © Insight 2014. All Rights Reserved Comparing Graph Query Approaches Gremlin - graph traversal A “pure” graph traversal language based on xpath, allows powerful expression of graph algorithms, but at cost of conciseness Graph X - message passing DSL rather than language. Scala, Java and Python APIs. SPARQL -declarative is a declarative language, allowing better expression and abstracting the writer from optimisation problems Abstract Concrete Considerable work (for vendors) to scale Little work to scale Optimised for aggregation Optimised for connections Global Local
  73. 73. © Insight 2014. All Rights Reserved Other Graph Query Languages Gremlin - graph traversal A “pure” graph traversal language based on xpath, allows powerful expression of graph algorithms, but at cost of conciseness Alternatives Network X A python DSL for graphs Graph X - Message passing DSL rather than language. For “programmers” Scala, Java and Python APIs. Very scaleable. Alternatives GraphLab very powerful commercial product Giraffe A hadoop API for graphs SPARQL - Declarative Declarative language, allowing better expression and abstracting the writer from optimisation problems Alternatives Cypher Has pattern matching constructs, can do much the same as sparql on Neo4J database
  74. 74. © Insight 2014. All Rights Reserved Working with linked data with Gremlin
  75. 75. © Insight 2014. All Rights Reserved How to Aggregate Graphs with Sparql?
  76. 76. © Insight 2014. All Rights Reserved What is Graph Aggregation Condenses a large graph into a structurally similar but smaller graph by collapsing vertices and edges Imagine the LOD cloud, and how the aggregate may look
  77. 77. © Insight 2014. All Rights Reserved What is Graph Aggregation PREFIX : <http://example.org/> PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#> CONSTRUCT { _:b0 a ?t1; :count COUNT(?sub.s) . _:b1 a ?t2; :count COUNT(?obj.o). _:b2 a rdf:Statement; rdf:predicate ?p; rdf:subject _:b0; rdf:object _:b1; :count ?prop_count } WHERE { GRAPH_AGGREGATION { ?s ?p ?o {?s a ?t1} GROUP BY ?t1 AS ?sub {?o a ?t2} GROUP BY ?t2 AS ?obj } }
  78. 78. The Famous LOD Cloud http://lod-cloud.net/
  79. 79. The Famous LOD Cloud (from a different Angel)
  80. 80. Gagg: Two-steps Aggregation ● Relation & measure ● Subject dimension(s) & measure ● Object dimension(s) & measure Uses aggregation functions and a template similar to CONSTRUCT queries
  81. 81. © Insight 2014. All Rights Reserved How do we Aggregate Linked data at scale
  82. 82. © Insight 2014. All Rights Reserved Scaling SPARQL So you still want to use sparql at scale How do we query RDF data with SPARQL at Big data scale? Clustering • Some vendors/platforms provide clustered triple-stores Federation Became available in Sparql 1.1 with SERVICE keyword • Federation emerged with great promises. • For distributed computing, sparql federation has limitations (for now). • Query planning for SPARQL is NP Complete
  83. 83. © Insight 2014. All Rights Reserved Complementing Sparql with Graph algorithm When not SPARQL • For distributed computing, declarative languages such as SPARQL are (for now) problematic • you cannot know if query is NP or EXP time. • Difficult to create query plans especially with fast or changing graphs • Have to rely on federation which is still being researched Why Graph Traversal • Proven to scale • Graphs traversal lower level but easier to tune so that it is in P time. • Most popular algorithm are local (as in) only query neighbouring nodes at any time -thus easier to break up across compute nodes
  84. 84. © Insight 2014. All Rights Reserved Thank you Q & A
  85. 85. © Insight 2014. All Rights Reserved Sparql Federation example Q: How are the protein targets of the gleevec drug differentially expressed, which pathways are they involved in? PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#> PREFIX chembl_molecule: <http://rdf.ebi.ac.uk/resource/chembl/molecule/> PREFIX biopax3:<http://www.biopax.org/release/biopax-level3.owl#> PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/> PREFIX sio: <http://semanticscience.org/resource/> PREFIX dcterms: <http://purl.org/dc/terms/> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> SELECT distinct ?dbXref (str(?pathwayname) as ?pathname) ?factorLabel WHERE { # query chembl for gleevec (CHEMBL941) protein targets ?act a cco:Activity; cco:hasMolecule chembl_molecule:CHEMBL941 ; cco:hasAssay ?assay . ?assay cco:hasTarget ?target . ?target cco:hasTargetComponent ?targetcmpt . ?targetcmpt cco:targetCmptXref ?dbXref . ?targetcmpt cco:taxonomy . ?dbXref a cco:UniprotRef # query for pathways by those protein targets SERVICE <http://www.ebi.ac.uk/rdf/services/reactome/sparql> { ?protein rdf:type biopax3:Protein . ?protein biopax3:memberPhysicalEntity [biopax3:entityReference ?dbXref] . ?pathway biopax3:displayName ?pathwayname . ?pathway biopax3:pathwayComponent ?reaction . ?reaction ?rel ?protein . } # get Atlas experiment plus experimental factor where protein is expressed SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> { ?probe atlasterms:dbXref ?dbXref . ?value atlasterms:isMeasurementOf ?probe . ?value atlasterms:hasFactorValue ?factor . ?value rdfs:label ?factorLabel . } }
  86. 86. © Insight 2014. All Rights Reserved Why use Graph algorithms Polynomial time A lot of queries on linked data can be expressed as well known graph traversal algo. that work in P time such as Dijkstra for positive weighted directed graph. Because these alog are localised suit distributed computed.
  87. 87. © Insight 2014. All Rights Reserved More pure Dijkstra using Cypher MATCH (from: Location {LocationName:"x"}), (to: Location {LocationName:"y"}) , paths = allShortestPaths((from)-[:CONNECTED_TO*]->(to)) WITH REDUCE(dist = 0, rel in rels(paths) | dist + rel.distance) AS distance, paths RETURN paths, distance ORDER BY distance LIMIT 1 The other approach used an inbuilt function - this shows a closer approximation of the actual algorithm
  88. 88. © Insight 2014. All Rights Reserved Dijkstra Psudocode dist[s]←0 (distance to source vertex is zero) forall v ∈ V–{s} do dist[v]←∞ (set all other distances to infinity) S←∅ (S,the set of visited vert is initially empty) Q←V (Q,the queue initially contains all vertices) whileQ≠∅ (while the queue is not empty) dou← mindistance(Q,dist) (select the element of Q with the min.distance) S←S∪{u} (add u to list of visited vertices) forall v ∈ neighbors[u] do if dist[v]>dist[u]+w(u,v) (if new shortest path found) then d[v]←d[u]+w(u,v) (set new value of shortest path) (if desired,add trace back code) returndist
  89. 89. © Insight 2014. All Rights Reserved Thank you Q & A
  90. 90. © Insight 2014. All Rights Reserved Visualization Ali Hasnain, Naoise Dunne, Dietrich Rebholz-Schuhmann
  91. 91. © Insight 2014. All Rights Reserved Visualization • Visualize your Data! • Visualize to easily Query your Data • Example Tools • ReVeaLD • FedViz
  92. 92. © Insight 2014. All Rights Reserved ReVeaLD Search Platform ReVeaLD :- Real-Time Visual Explorer and Aggregator of Linked Data, is a user- driven domain-specific search platform. Intuitively formulate advanced search queries using a click-input-select mechanism Visualize the results in a domain–suitable format. Assembly of the query is governed by a Domain Specific Language (DSL), which in this case is the Granatum Biomedical Semantic Model (CanCO)
  93. 93. © Insight 2014. All Rights Reserved ReVeaLD Search Platform Availability:http://140.203.155.7:31005/explorer Demo: https://www.youtube.com/watch?v=6HHK4ASIkJM&hd=1
  94. 94. © Insight 2014. All Rights Reserved DSL Visual Representation Concept Map Visualization is used.
  95. 95. © Insight 2014. All Rights Reserved Visual Query Builder
  96. 96. © Insight 2014. All Rights Reserved Visual Query Model
  97. 97. © Insight 2014. All Rights Reserved FedViz: A Visual Interface for SPARQL Queries Formulation and Execution FedViz is an online application that provides Biologist a flexible visual interface to formulate and execute both federated and non-federated SPARQL queries. It translates the visually assembled queries into SPARQL equivalent and execute using query engine (FedX). Availability: http://srvgal86.deri.ie/FedViz/index.html
  98. 98. © Insight 2014. All Rights Reserved FedViz: A Visual Interface for SPARQL Queries Formulation and Execution
  99. 99. © Insight 2014. All Rights Reserved Using FedViz: Step by Step
  100. 100. © Insight 2014. All Rights Reserved Hands On Session Naoise Dunne, Ali Hasnain
  101. 101. © Insight 2014. All Rights Reserved Hands on session This tutorial is available from https://goo.gl/JjfJ0f

×